Apache Nutch®
Apache Nutch® is an open source web-search software project.
It can be configured to crawl in local mode and post its results
to Apache Solr, or it can run entirely on Hadoop.
Getting started with Apache Nutch®
To start, obtain the latest stable release, 1.3, which is available as a binary or a source package. Once the desired release is
downloaded, unpack it at the desired hosting destination.
Nutch is developed in Java, and each command-line option of the /bin/nutch script is implemented by a Java class.
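The relationship between script commands and Java classes can be sketched as a simple lookup. The mapping below is partial and recalled from the Nutch 1.x bin/nutch script, so treat it as an assumption and verify it against the script shipped with your release. The names NUTCH_COMMANDS and resolve_class are hypothetical helpers, not part of Nutch.

```python
# Partial, illustrative mapping from bin/nutch command names to the Java
# classes they launch. Recalled from the Nutch 1.x script; verify locally.
NUTCH_COMMANDS = {
    "crawl": "org.apache.nutch.crawl.Crawl",
    "inject": "org.apache.nutch.crawl.Injector",
    "generate": "org.apache.nutch.crawl.Generator",
    "fetch": "org.apache.nutch.fetcher.Fetcher",
    "parse": "org.apache.nutch.parse.ParseSegment",
    "updatedb": "org.apache.nutch.crawl.CrawlDb",
    "invertlinks": "org.apache.nutch.crawl.LinkDb",
    "solrindex": "org.apache.nutch.indexer.solr.SolrIndexer",
}


def resolve_class(command: str) -> str:
    """Return the Java class for a known command.

    An unknown argument is returned unchanged, mirroring the script's
    fallback of treating it as a fully qualified class name.
    """
    return NUTCH_COMMANDS.get(command, command)
```

For example, resolve_class("fetch") yields the fetcher class, while an argument that is not in the table is passed through as a class name.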
Each configuration, local-mode crawler or Hadoop-based, is explained in detail at Nutch with Solr or Hadoop based.
In both configurations Nutch can be regarded as a subproject, since its results are posted to Solr or Hadoop for final processing.
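As a rough illustration of the local-mode setup, the sketch below assembles a one-shot crawl command that posts its results to Solr. It assumes the 1.x crawl usage of the form crawl <urlDir> -solr <solrURL> [-depth i] [-topN N]. The install path, seed directory, and Solr URL are placeholders, and build_crawl_command is a hypothetical helper. Check the usage message printed by your own release before relying on these options.

```python
from typing import List


def build_crawl_command(
    nutch_home: str,
    url_dir: str,
    solr_url: str,
    depth: int = 3,
    top_n: int = 50,
) -> List[str]:
    """Build the argument list for a local-mode crawl that indexes into Solr.

    nutch_home is assumed to be the directory that contains bin/nutch.
    """
    return [
        f"{nutch_home}/bin/nutch",
        "crawl",
        url_dir,              # directory holding the seed URL list
        "-solr", solr_url,    # Solr instance that receives the results
        "-depth", str(depth),  # number of generate/fetch rounds
        "-topN", str(top_n),   # maximum pages fetched per round
    ]


if __name__ == "__main__":
    # Placeholder values; nothing is executed, the command is only printed.
    cmd = build_crawl_command(
        "/opt/nutch", "urls", "http://localhost:8983/solr/"
    )
    print(" ".join(cmd))
```

Building the arguments as a list rather than a single string keeps them safe to hand to subprocess.run later without shell quoting issues.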