Environmental Scan: Spiders/Crawlers
A spider is a program that crawls the Internet in a specific way for a specific purpose. The purpose could be to gather information or to understand the structure and validity of a Web site. Spiders are the basis for modern search engines, such as Google and AltaVista. These spiders automatically retrieve data from the Web and pass it on to other applications that index the contents of the Web site for the best set of search terms.{{ http://www.ibm.com/developerworks/linux/library/l-spider/ }}
Scan List
The following projects have been identified as part of an environmental scan of potential open source web spiders for possible use in the MashingText proof of concept. Where possible, comments about applicability are noted:
Available for non-commercial use on a case by case basis. The open-source version of the crawler used by the Internet Archive. Stable and active. Easy Eclipse .project and .classpath included in source download.
Heritrix is being used by Library and Archives Canada to crawl the Government of Canada site, as per their legislated mandate.
I suspect that this project is stalled, unfortunately. User documentation is available, but developer docs do not seem to have ever emerged.
Last update was 2003. Specs sounded useful.
"Web-Harvest is Open Source Web Data Extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. In order to do that, it leverages well established techniques and technologies for text/xml manipulation such as XSLT, XQuery and Regular Expressions. Web-Harvest mainly focuses on HTML/XML based web sites which still make vast majority of the Web content. On the other hand, it could be easily supplemented by custom Java libraries in order to augment its extraction capabilities." If more thorough testing of potential candidates is undertaken in round two, this would be a good contender.
Another inactive Java framework that looks promising, but has gone dark.
Mature product that offers some useful functions, such as automated form filling when authentication is required.
Another dark project - but flagged as stable and active. C++. Interesting.
Very good web crawler documentation from Sun.
A mature Java class library.
From Apache, for indexing and searching ARC files.
This is a Perl-based script used by Stanford's WebBase Project.
A Chilean project in C++. There is an additional library available for Ruby.
C#-based.
A Python-based crawler.
Another Python-based web crawler.
A final Python-scripted web crawler.
Further Exploration
The two spiders/crawlers that I feel are worth exploring further are:
Heritrix - Was looking most promising, and when LAC suggested they were pleased with it, my estimation of its potential rose even further. It is active, apparently in wide use and, like JoBo, Java-based.
Note: a new version 2 of Heritrix has been released and was available for testing purposes. This proved fortuitous, as it helped avoid a number of lingering issues that other users had encountered when attempting to use Heritrix within a Java workflow, such as those working with DSpace.
Heritrix 2 differs from the initial release in offering:
JoBo - It can do form-filling, which is potentially advantageous, and its Java source code is freely accessible.
For a concise description of Heritrix, refer to its Wikipedia article.
Testing Plan
Shawn will set up Heritrix to run against the Globalisation Compendium and see what sort of results we get.
I am also running it against a simple but substantial static site to measure results, and against a smaller dynamically generated site to see how well it is able to crawl.
Types of Crawls
Test Results
A new alpha version of Heritrix, 2.0.0, has just been released, so I decided it would be fun to run with this one.
Heritrix offers a console and a GUI interface. The API is via JMX. For this initial test, I decided to use the GUI.
Test Discussion
Heritrix was installed on dev-tapor for testing purposes. Installation is simple and Heritrix consumes few resources when in operation. For testing purposes I initially ran it as a single thread and then with multiple threads (3). CPU load remained minimal. Operation of the spider is slow. Heritrix is richly configurable, and one of the parameters that you can adjust is how polite it is to the site being spidered. This delay between requests is the biggest determinant of how fast Heritrix is able to return results. When spidering, Heritrix writes HTML files into a compressed archive of a specified size. It creates sequential archives as it is spidering, so you can begin further operations on completed archives as it moves from one to the next.
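The effect of the politeness delay on crawl time can be estimated with a rough back-of-the-envelope sketch. The function below is not part of Heritrix; its name, its parameters, and the sample numbers are assumptions made for illustration. It assumes that requests to a single host are serialised, which would explain why adding threads makes little difference when only one site is being spidered.

```python
import math

def estimate_crawl_seconds(pages, delay_seconds, fetch_seconds=0.5, hosts=1, threads=1):
    """Rough estimate of crawl time when requests to one host are serialised.

    Assumption (not Heritrix's actual scheduler): each host is served by at
    most one thread at a time, and every request costs fetch time plus delay.
    """
    # Only as many hosts can be crawled in parallel as there are threads.
    parallel = max(1, min(hosts, threads))
    per_request = fetch_seconds + delay_seconds
    # Pages are assumed to be spread evenly across the parallel queues.
    pages_per_queue = math.ceil(pages / parallel)
    return pages_per_queue * per_request

# Hypothetical numbers: 1000 pages on one host with a 5 second delay.
single = estimate_crawl_seconds(1000, delay_seconds=5.0, threads=1)
triple = estimate_crawl_seconds(1000, delay_seconds=5.0, threads=3)
print(single, triple)
```

Under these assumptions the one-thread and three-thread estimates are identical for a single host, because the delay, not the CPU, is the limiting factor.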
In summary, Heritrix is the web crawling program used by the Internet Archive. It can be run from any user account on a Linux or Unix system with Java, with or without a web-based interface. The web interface can be used to start, pause, and end Heritrix crawls, generate configuration files for later use, and view logs related to Heritrix crawls. The crawler can also be run without the web interface if a configuration file is available. It is highly configurable and performs consistently and stably. The effectiveness of a spidering operation depends on supplying appropriate seed information and on configuring the depth of spidering and the duration that the spider is set to run. When it was tested on a largely static site (www.niche.uwo) running on Drupal, it completed the spider within two hours in polite mode and returned a complete archive of the site. When run against a more dynamically generated site (www.globalautonomy.ca), it was less successful: after 24 hours of operation it had largely collected the static sidebars and some of the dynamically generated facets of the site. Further tweaking of the spider may allow for more effective content acquisition.
Heritrix generates a number of files in addition to the actual crawl content. The crawl results are contained in the archive (.arc) files. Heritrix generates additional files about the crawl process. There is a log of the URIs for the pages crawled that indicates when the pages were crawled, the number and names of domains crawled, the errors encountered in a crawl, etc. These files can be used to help determine the authenticity of the documents in the arc files and can also be used as informational tools for persons interested in the web site. They might, for example, indicate where web page links no longer work.
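As a sketch of how such a log might be used to find links that no longer work, the snippet below scans log lines for 404 responses. It assumes whitespace-separated lines with the fetch status code in the second field and the URI in the fourth; that layout should be checked against the log actually produced. The sample lines and the function name `broken_links` are invented for illustration.

```python
def broken_links(log_lines, status_field=1, uri_field=3):
    """Return URIs whose fetch status was 404.

    The field positions are assumptions about the log layout, so they are
    parameters rather than hard-coded values.
    """
    missing = []
    for line in log_lines:
        fields = line.split()
        if len(fields) <= max(status_field, uri_field):
            continue  # skip blank or truncated lines
        if fields[status_field] == "404":
            missing.append(fields[uri_field])
    return missing

# Invented sample lines, shaped like the assumed layout.
sample = [
    "20080101120000 200 5120 http://www.example.org/ - - text/html",
    "20080101120005 404 312 http://www.example.org/old.html L http://www.example.org/ text/html",
    "",
]
print(broken_links(sample))
```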
Crawls with Heritrix are focused via the “scope” options. “Domain Scope” is the default; this choice crawls all servers in a given domain. The initial scope for testing was narrower, limited to the domain www.globalautonomy.ca. Even a crawl such as this went beyond what the documentation suggests, as Heritrix crawled all sites on www.globalautonomy.ca plus the pages and media linked from those pages, and so included several domains outside www.globalautonomy.ca. After eight hours Heritrix had gathered approximately 1.5 gigabytes and 70,000 items.
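The way a scope restriction and a depth limit cut down what a spider visits can be sketched generically. The code below is not Heritrix's configuration or API; `in_scope`, `crawl`, and the tiny in-memory link graph are hypothetical. Matching on a single host name is a simplification chosen for the example.

```python
from collections import deque
from urllib.parse import urlparse

def in_scope(url, allowed_host, hops, max_hops):
    """A URL is in scope if it is on the allowed host and within the hop limit."""
    return urlparse(url).hostname == allowed_host and hops <= max_hops

def crawl(seed, links, allowed_host, max_hops):
    """Breadth-first walk over `links`, a dict standing in for real page fetches."""
    seen = {seed}
    queue = deque([(seed, 0)])
    visited = []
    while queue:
        url, hops = queue.popleft()
        visited.append(url)
        for target in links.get(url, []):
            if target not in seen and in_scope(target, allowed_host, hops + 1, max_hops):
                seen.add(target)
                queue.append((target, hops + 1))
    return visited

# Hypothetical link graph: one off-site link and one page beyond the hop limit.
links = {
    "http://www.example.org/": ["http://www.example.org/a", "http://other.example.com/"],
    "http://www.example.org/a": ["http://www.example.org/a/b"],
    "http://www.example.org/a/b": ["http://www.example.org/a/b/c"],
}
result = crawl("http://www.example.org/", links, "www.example.org", max_hops=2)
print(result)
```

In this sketch the off-site link and the page three hops from the seed are both skipped. A real crawler would also need rules for embedded media and redirects, which is where off-site content can creep in.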
A crawl with the scope limited to www.globalautonomy.ca and the depth reduced from twenty-five to three ran for 1 hour and 53 minutes and gathered 31.7 megabytes of material in 1,376 documents.
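The two crawls can be compared with some quick arithmetic. The figures are the ones reported above. The only assumptions are that "approximately 1.5 gigabytes" is taken as 1,500 megabytes and that the first crawl ran for exactly eight hours.

```python
def crawl_stats(megabytes, items, minutes):
    """Return (average KB per item, items per minute) for a crawl."""
    avg_kb = megabytes * 1000 / items
    per_minute = items / minutes
    return avg_kb, per_minute

broad = crawl_stats(1500, 70_000, 8 * 60)    # first crawl, eight hours
narrow = crawl_stats(31.7, 1_376, 60 + 53)   # limited scope, depth three

for name, (avg_kb, rate) in (("broad", broad), ("narrow", narrow)):
    print(f"{name}: {avg_kb:.1f} KB per item, {rate:.1f} items per minute")
```

On these figures the average item size is similar in both crawls (roughly 21 to 23 KB), while the narrower crawl gathered items at about a twelfth of the rate.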
Nutchwax is recommended for viewing and indexing ARC files such as those generated by Heritrix.
Inputs
The web spider in its most basic form takes the following arguments: