
Environmental Scan: Spiders/Crawlers

A spider is a program that crawls the Internet in a specific way for a specific purpose. The purpose could be to gather information or to understand the structure and validity of a Web site. Spiders are the basis for modern search engines, such as Google and AltaVista. These spiders automatically retrieve data from the Web and pass it on to other applications that index the contents of the Web site for the best set of search terms. (Source: http://www.ibm.com/developerworks/linux/library/l-spider/)

Scan List

The following projects were identified in an environmental scan of open source web spiders for possible use in the MashingText proof of concept. Comments about applicability are noted where possible:

  • UbiCrawler
  • Available for non-commercial use on a case by case basis.

  • Heritrix
  • The open-source version of the crawler used by the Internet Archive. Stable and active. Easy Eclipse .project and .classpath included in source download. Heritrix is being used by Library and Archives Canada to crawl the Government of Canada site, as per their legislated mandate.
    A user manual is available.

  • JSpider
  • I suspect that this project is stalled, unfortunately. User documentation is available, but developer docs do not seem to have ever emerged. The last update was in 2003. The specs sounded useful.

  • WebHarvest
  • "Web-Harvest is Open Source Web Data Extraction tool written in Java. It offers a way to collect desired Web pages and extract useful data from them. In order to do that, it leverages well established techniques and technologies for text/xml manipulation such as XSLT, XQuery and Regular Expressions. Web-Harvest mainly focuses on HTML/XML based web sites which still make vast majority of the Web content. On the other hand, it could be easily supplemented by custom Java libraries in order to augment its extraction capabilities." If a more thorough testing of potential candidates is undertaken in round two, this would be a good contender for testing.

  • Arachnid
  • Another inactive java framework that looks promising, but has gone dark.

  • JoBo
  • Mature product that offers some useful functions such as ability for automated form filling when authentication is required.

  • Spider
  • Another dark project - but flagged as stable and active.

  • Open Web Spider
  • C++. Interesting.

  • Java Web Crawler
  • Very good web crawler documentation from SUN.

  • WebSPHINX
  • A mature java class library.

  • Nutch
  • From Apache for indexing and searching ARC files

  • LARM
  • Aperture
  • larbin
  • WebVac
  • This is a Perl-based script used by Stanford's WebBase Project.

  • WIRE - Web Information Retrieval Environment
  • A Chilean project in C++. There is an additional library available for Ruby.

  • Noviway Web Crawler
  • C#-based

  • YaCy
  • A Java-based P2P web crawler.

  • Ruya
  • A Python-based crawler.

  • Universal Information Crawler
  • Another Python-based web crawler.

  • Squzer Distributed Crawler
  • A final Python-scripted web crawler.

Further Exploration

The two spiders/crawlers that I feel are worth exploring further are:

Heritrix - This looked the most promising, and when LAC (Library and Archives Canada) indicated they were pleased with it, my estimation of its potential rose even further. It is active, apparently in wide use and, like JoBo, Java-based. Note: a new version 2 of Heritrix has been released and was available for testing purposes. This proved fortuitous, as it helped avoid a number of lingering issues encountered by other users attempting to use Heritrix within a Java workflow, such as those working with DSpace. Heritrix 2 differs from the initial release in offering:

  1. A more rigorous separation of the Web UI from the 'crawl engine', giving greater flexibility to control crawlers remotely.
  2. A new settings system, easing module development and offering new opportunities for dynamic configuration construction.
  3. A new mechanism for custom override settings for sets of related URIs, extending beyond Heritrix 1.x's domain-centric overrides.
  4. A new system for ordering URIs within a single URI-queue, and for allocating frontier effort among different URI-queues, based on assigned integer 'precedence' values.
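
To make the idea in point 4 concrete, here is a minimal sketch of a frontier that serves per-host queues according to an integer precedence. It is not Heritrix's implementation: the class name, the method names, and the convention that a lower number is served first are all assumptions made for illustration.

```python
import heapq
import itertools
from collections import defaultdict, deque


class PrecedenceFrontier:
    """Toy frontier: one FIFO queue per host, hosts served by integer precedence.

    Assumption for this sketch: a lower precedence value is served first.
    """

    def __init__(self):
        self._queues = defaultdict(deque)   # host -> pending URIs, in arrival order
        self._heap = []                     # (precedence, tie_breaker, host)
        self._counter = itertools.count()   # rotates hosts that share a precedence
        self._precedence = {}               # hosts that currently have a heap entry

    def add(self, host, uri, precedence=1):
        """Queue a URI; the precedence applies when the host queue is first created."""
        if host not in self._precedence:
            self._precedence[host] = precedence
            heapq.heappush(self._heap, (precedence, next(self._counter), host))
        self._queues[host].append(uri)

    def next_uri(self):
        """Return the next URI from the best-ranked host queue, or None if empty."""
        if not self._heap:
            return None
        precedence, _, host = heapq.heappop(self._heap)
        queue = self._queues[host]
        uri = queue.popleft()
        if queue:
            # Host still has work: re-enter the heap behind equal-precedence hosts.
            heapq.heappush(self._heap, (precedence, next(self._counter), host))
        else:
            del self._precedence[host]
        return uri
```

In this sketch a host with a better precedence is drained completely before a worse-ranked host is touched, while hosts with equal precedence take turns. A real frontier would also need to enforce politeness delays per host.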


JoBo - It can do form-filling, which is potentially advantageous, and its Java source code is freely available.

For a concise description of Heritrix refer to its Wikipedia article.

Testing Plan

Shawn will set up Heritrix to run against the Globalisation Compendium and see what sort of results we get. I am also running it against a simple but substantial static site to measure results, and against a smaller dynamically generated site to see how well it is able to crawl.

Types of Crawls

  1. Broad: Broad crawls are large, high-bandwidth crawls in which the number of sites and the number of valuable individual pages collected is as important as the completeness with which any one site is covered. At the extreme, a broad crawl tries to sample as much of the web as possible given the time, bandwidth, and storage resources available.
  2. Focused: Focused crawls are small- to medium-sized crawls (usually less than 10 million unique documents) in which the quality criterion is complete coverage of some selected sites or topics.
  3. Continuous: Traditionally, crawlers pursue a snapshot of resources of interest, downloading each unique URI one time only. Continuous crawling, by contrast, revisits previously fetched pages – looking for changes – as well as discovering and fetching new pages, even adapting its rate of visitation based on operator parameters and estimated change frequencies.
  4. Experimental: The Internet Archive and other groups want to experiment with crawling techniques in areas such as choice of what to crawl, order in which resources are crawled, crawling using diverse protocols, and analysis and archiving of crawl results.

  • Note that Heritrix in particular carries out what are termed snapshot scans (periodic scans) rather than continuous scans. Snapshot scans are suited to broad, large-scope scans.
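
The rate adaptation mentioned under continuous crawling can be illustrated with a very simple rule: revisit sooner when a page was found changed, and later when it was not. This is only a sketch under assumed parameters; the function name, the halve/double rule, and the one-hour and thirty-day bounds are illustrative choices, not something taken from Heritrix or any other crawler listed here.

```python
def next_revisit_interval(current_hours, changed, min_hours=1.0, max_hours=24.0 * 30):
    """Return the next revisit interval in hours.

    Halve the interval when the last visit found a change, double it when it
    did not, and clamp the result to the operator-supplied bounds.
    """
    proposed = current_hours / 2 if changed else current_hours * 2
    return max(min_hours, min(max_hours, proposed))


# Example: a page first revisited daily, unchanged twice, then changed once.
interval = 24.0
for changed in (False, False, True):
    interval = next_revisit_interval(interval, changed)
```

The bounds stand in for the "operator parameters" in the description above, and the observed change history stands in for the estimated change frequency.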

Test Results

A new alpha version of Heritrix, 2.0.0, has just been released, so I decided it would be fun to run with this one. Heritrix offers a console and a GUI interface. The API is via JMX. For this initial test, I decided to use the GUI:
  1. I seeded Heritrix with http://www.globalization.ca as the starting point for the spider and left all default settings.
  2. The interface is smooth and provides access to the rather rich configuration options for Heritrix.
  3. Spider is currently running...
  4. Output from Heritrix is a series of logs and reports as well as a tarball of the crawl.
  5. The resulting crawl is 31.7 MB in size and took 1h 54m to generate.
  6. Next step is to determine what the crawl actually provided. This is to be accomplished by installing a working Nutchwax environment.

Test Discussion

Heritrix was installed on dev-tapor for testing purposes. Installation is simple and Heritrix consumes few resources when in operation. For testing purposes I initially ran it with a single thread and then with multiple threads (3). CPU load remained minimal. Operation of the spider is slow. Heritrix is richly configurable, and one of the parameters that you can adjust is how polite it is to the site being spidered. This delay between requests is the biggest determinant of how fast Heritrix is able to return results. When spidering, Heritrix writes the HTML files it retrieves into a compressed archive of a specified size. It creates sequential archives as it is spidering, so you can begin further operations on one archive while it moves on to the next.
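
Why the politeness delay dominates can be seen with some rough arithmetic. The sketch below assumes that the delay is enforced per host and that requests to one host are issued one at a time; both function names are made up for illustration. The figures in the example (1,376 documents in 1 hour and 53 minutes) are those reported for the limited www.globalautonomy.ca crawl further down this page.

```python
def estimate_crawl_seconds(num_documents, delay_seconds, fetch_seconds=0.0):
    """Rough lower bound on crawl time for a single host.

    Assumes one request at a time, each followed by a fixed politeness delay.
    """
    return num_documents * (delay_seconds + fetch_seconds)


def effective_interval(total_seconds, num_documents):
    """Average seconds spent per document, backed out of an observed crawl."""
    return total_seconds / num_documents


# Observed: 1,376 documents in 1 hour 53 minutes (113 minutes).
observed = effective_interval(113 * 60, 1376)   # a little under 5 seconds per document
```

Under these assumptions, adding threads does little for a crawl confined to one host, since each request still has to wait its turn; only a shorter delay or more hosts would raise throughput.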

In summary, Heritrix is the web crawling program used by the Internet Archive. The program can be run from any user account on a Linux or Unix system with Java, with or without a web-based interface. The web interface can be used to start, pause, and end Heritrix crawls, generate configuration files for later use, and view logs related to Heritrix crawls. The crawler can also be run without the web interface if a configuration file is available. It is highly configurable and performs consistently and stably. The effectiveness of a spidering operation depends on appropriate seed information and on how the depth of spidering and the duration of the run are configured. When it was tested on a largely static site (www.niche.uwo) running on Drupal, it completed the spider within two hours in polite mode and returned a complete archive of the site. When run against a more dynamically generated site (www.globalautonomy.ca) it was less successful: after 24 hours of operation it had largely collected the static sidebars and some of the dynamically generated facets of the site. Further tweaking of the spider may allow for more effective content acquisition.

Heritrix generates a number of files in addition to the actual crawl content. The crawl results are contained in the archive (.arc) files; the additional files describe the crawl process. There is a log of the URIs for the pages crawled that indicates when the pages were crawled, the number and names of domains crawled, the errors encountered in a crawl, etc. These files can be used to help determine the authenticity of the documents in the .arc files and can also be used as informational tools for persons interested in the web site. They might, for example, indicate where web page links no longer work.

Crawls with Heritrix are focused via the “scope” options. “Domain Scope” is the default. This choice will crawl all servers in a given domain. The initial scope for testing was the narrower one, limited to the domain www.globalautonomy.ca. Even a crawl such as this goes beyond what the documentation describes: Heritrix crawled all sites on www.globalautonomy.ca as well as pages and media linked from those pages, and so included several domains outside of www.globalautonomy.ca. After eight hours Heritrix had gathered approximately 1.5 gigabytes and 70,000 items. A crawl with the scope limited to www.globalautonomy.ca and the depth reduced from twenty-five to three ran for 1 hour and 53 minutes and gathered 31.7 megabytes of material in 1,376 documents.
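
The difference between crawling a whole domain and crawling a single host can be sketched as two URL tests. These functions are illustrative only and are not how Heritrix implements its scopes; the function names are invented, the rule of deriving the domain by stripping a leading "www." is a simplification, and the subdomain in the example is hypothetical.

```python
from urllib.parse import urlsplit


def in_host_scope(url, seed_url):
    """True only when url is on exactly the same host as the seed."""
    host = urlsplit(url).hostname
    return host is not None and host == urlsplit(seed_url).hostname


def in_domain_scope(url, seed_url):
    """True when url's host is the seed's domain or any server beneath it.

    Simplification: the domain is the seed host minus a leading 'www.'.
    """
    host = urlsplit(url).hostname or ""
    seed_host = urlsplit(seed_url).hostname or ""
    domain = seed_host[4:] if seed_host.startswith("www.") else seed_host
    return bool(domain) and (host == domain or host.endswith("." + domain))


seed = "http://www.globalautonomy.ca/"
# 'other.globalautonomy.ca' is a hypothetical subdomain used only for illustration.
same_host = in_host_scope("http://other.globalautonomy.ca/page", seed)      # False
same_domain = in_domain_scope("http://other.globalautonomy.ca/page", seed)  # True
```

Neither test on its own stops a crawler from fetching off-site media that in-scope pages embed, which would match the behaviour observed above; that needs a separate rule applied to embedded resources.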

Nutchwax is recommended for viewing and indexing ARC files such as those generated by Heritrix.

Inputs

In its most basic form, the web spider takes the following arguments:

  1. A seed URL, or series of URLs
  2. Your site info to provide as part of the spidering process
  3. Depth of links to follow
  4. Minimum time between requests to site being crawled - level of politeness
  5. Type of meta information to be attached to the crawl log


-- ShawnDay - 19 May 2008

