To clarify: similar to a spider, but focused on simply compiling the web content it encounters rather than indexing it.
The following seem worthy of investigation:
Heritrix
Heritrix is being used by LAC (Library and Archives Canada) to crawl the Government of Canada site, as per its legislated mandate.
A user manual is available.