Archives of the Humanist Discussion Group
Introduction
From the Humanist website:
Humanist is an international electronic seminar on humanities computing and the digital humanities. Its primary aim is to provide a forum for discussion of intellectual, scholarly, pedagogical, and social issues and for exchange of information among members.
The listserv archives of Humanist offer a unique perspective on more than 20 years of humanities computing scholarship. In the context of T-REX 2008, I cobbled together a set of scripts that scrape the web-based Humanist archives (in a variety of formats) and convert them into a homogeneous form (as XML or plain text).
Please note that as of this writing (May 28, 2008), the web-based archives stop in February 2008 – presumably missing content will be made available soon.
Available Archives
The following formats are available (each file is between 19 and 23 MB)
Please note that quoted content has been stripped from the email messages. I used this option because I'm interested in what contributors to Humanist are writing (and it avoids misleading duplication of content when a previous Humanist message is being quoted). However, in some cases the stripped text might be relevant content quoted from elsewhere; see the section below for options available when generating the archives.
Available Code
Please note that the code in these files no longer works as the location of the Humanist Archives has changed. I'll fix it and post revised code eventually.
The code used to generate the archives above can be downloaded; this allows you to update the archives (if new content becomes available) as well as tweak the options used when cleaning the archives. Once downloaded, you can expand the directory using a command like:
tar -xvzf humanist_archiver.tgz
The distribution contains the following files:
- humanist/LICENSE.txt: the license for the scraper and cleaner code
- humanist/README.txt: this description of the code
- humanist/archives/: an empty directory for storing cleaned archives
- humanist/cleaner.rb: the script used to clean the raw archives
- humanist/lib: a directory containing various supporting resources
- humanist/raw_archives: an empty directory for storing raw archives
- humanist/scraper.rb: the script used to acquire the raw, web-based archives
Usage of the Scraper
The easiest way to use the scraper from the command line is something like this:
ruby -e 'require "scraper"; Humanist::Scraper.new.scrape'
There are several options available for the scraper, including:
- quiet (boolean): whether or not to report progress (default is false)
- base_url (String): the base URL in which to look for volumes (default is 'http://lists.village.virginia.edu/lists_archive/Humanist/')
- volume_numbers (Enumerable): volume numbers to grab (default is (1..99))
- volume_directory_format (String): format of the volume directory, appended to base_url (default is 'v%02d')
- archive_directory (String): the directory in which to store raw archives (default is 'raw_archives')
As an example, this command would scrape only the first ten years of Humanist:
ruby -e 'require "scraper"; scraper = Humanist::Scraper.new; scraper.volume_numbers = (1..10); scraper.scrape'
The current defaults make significant assumptions about the location and directory structure of the online archives – tweaking will obviously be required if things change.
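To make those assumptions concrete, here is a minimal sketch of how a volume URL could be derived from the options above. It does not use scraper.rb; the helper volume_urls is hypothetical, and the assumption is simply that each volume directory is base_url followed by the volume number formatted with volume_directory_format.

```ruby
# Hypothetical helper, not part of scraper.rb: builds one URL per volume
# by appending the formatted volume number to the base URL.
def volume_urls(base_url, volume_numbers, volume_directory_format = "v%02d")
  volume_numbers.map { |n| base_url + format(volume_directory_format, n) }
end

base_url = "http://lists.village.virginia.edu/lists_archive/Humanist/"

# Volumes 1 to 3 map to the directories v01, v02 and v03.
volume_urls(base_url, 1..3).each { |url| puts url }
```

If the online archives move or adopt a different directory naming scheme, base_url and volume_directory_format are the two settings that would most likely need to change.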
Usage of the Cleaner
The easiest way to use the cleaner from the command line is something like this:
- for XML:
ruby -e 'require "cleaner"; Humanist::Cleaner.new.clean_as_xml'
- for text:
ruby -e 'require "cleaner"; Humanist::Cleaner.new.clean_as_text'
Both clean_as_xml and clean_as_text take an optional parameter (as a Symbol) which determines the structure of the cleaned archive:
- :single: create only one archive file
- :volume: create one archive file per volume
- :message: create one archive file per message
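The effect of the three structure symbols can be illustrated with a small sketch. This is not the code in cleaner.rb: the Message struct, the group_for_output helper and the output names are all hypothetical, and only show how messages might be partitioned into files under each setting.

```ruby
# Hypothetical illustration, not taken from cleaner.rb.
Message = Struct.new(:volume, :number, :body)

# Returns a hash mapping an (assumed) output name to the messages it holds.
def group_for_output(messages, structure = :single)
  case structure
  when :single  then { "humanist" => messages }
  when :volume  then messages.group_by { |m| format("v%02d", m.volume) }
  when :message then messages.group_by { |m| format("v%02d_%04d", m.volume, m.number) }
  else raise ArgumentError, "unknown structure: #{structure.inspect}"
  end
end

messages = [
  Message.new(1, 1, "first"),
  Message.new(1, 2, "second"),
  Message.new(2, 1, "third")
]

%i[single volume message].each do |structure|
  puts "#{structure}: #{group_for_output(messages, structure).keys.join(', ')}"
end
```

With the real cleaner, the equivalent choice is made by passing the symbol directly, as in clean_as_xml(:volume).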
There are several options available for the cleaner, including:
- quiet (boolean): whether or not to report progress (default is false)
- base_url (String): the base URL used for links pointing back to the raw archive files (for XML only; default is 'http://lists.village.virginia.edu/lists_archive/Humanist/')
- raw_archives_directory (String): directory in which to look for raw archives (default is 'raw_archives')
- raw_archives_file_prefix (String): limit cleaning to raw archive files with this prefix (default is 'v')
- cleaned_archives_directory (String): directory in which to store cleaned archives (default is 'archives')
- keep_quotes (boolean): whether or not to keep quoted text (default is true)
- css_href (String): location of CSS file to include (for XML output only)
- xsl_href (String): location of XSL file to include (for XML output only)
As an example, this command would clean the archives as a single XML file while suppressing quoted content:
ruby -e 'require "cleaner"; cleaner = Humanist::Cleaner.new; cleaner.keep_quotes = false; cleaner.clean_as_xml(:single)'
The cleaning code is fairly crude in how it filters content, strips unwanted characters, and escapes others for XML. I haven't verified the output much, so anomalies should be expected.
If the content is cleaned as XML, the XmlValidate tool can be used to check for problems (you will need to have Xalan-Java installed and configured for use):
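For readers who want a feel for what this kind of cleaning involves, here is a rough sketch of the three steps mentioned above. It is not the code in cleaner.rb: the helper names are hypothetical, quoted text is assumed to be any line starting with ">", and "unwanted characters" is assumed to mean control characters that are not allowed in XML 1.0.

```ruby
# Hypothetical sketch of the cleaning steps; cleaner.rb may work differently.
XML_ESCAPES = { "&" => "&amp;", "<" => "&lt;", ">" => "&gt;" }.freeze

# Drops lines that look like quoted email text (assumed to start with ">").
def strip_quotes(body)
  body.lines.reject { |line| line.lstrip.start_with?(">") }.join
end

# Removes control characters other than tab, newline and carriage return.
def strip_control_chars(text)
  text.gsub(/[\x00-\x08\x0B\x0C\x0E-\x1F]/, "")
end

# Escapes the characters that have special meaning in XML text content.
def escape_xml(text)
  text.gsub(/[&<>]/, XML_ESCAPES)
end

def clean_body(body, keep_quotes: true)
  body = strip_quotes(body) unless keep_quotes
  escape_xml(strip_control_chars(body))
end

raw = "> earlier message\nTom & Jerry <tj@example.org>\n"
puts clean_body(raw, keep_quotes: false)
```

A line-based test for ">" is exactly the kind of blunt filter that can remove material quoted from outside Humanist, which is why the keep_quotes option matters.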
java -jar lib/XmlValidate.jar archives
The code is copyrighted under the Educational Community License 1.0; please see the LICENSE.txt file for the full text.
Any questions or suggestions? Please contact me through http://stefansinclair.name/
--
StefanSinclair - 28 May 2008