Aggregate text from different sources
See
http://taporware.mcmaster.ca/~taporware/otherTools/aggregator.shtml
Description
This tool aggregates texts or subtexts into a single text. The original texts can come from different locations, such as the internet and/or your local machine. Aggregating subtexts requires all documents to share a common subtext tag; for example, limiting the subtext to body requires every text to have a <body> tag. The aggregator tool extracts the contents of the texts and combines them into a single new text.
Note: If you use this tool through the TAPoR portal, you have to put something in the two source text boxes. You can choose "Text" and just enter a word or two in each box.
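The subtext idea can be illustrated with a minimal Python sketch. It is not the tool's own code: the names `SubtextExtractor`, `extract_subtext`, and `aggregate_subtexts` are hypothetical, and it assumes the documents are HTML that the standard library parser can read.

```python
from html.parser import HTMLParser


class SubtextExtractor(HTMLParser):
    """Collect the text found inside every occurrence of one element."""

    def __init__(self, element):
        super().__init__(convert_charrefs=True)
        self.element = element.lower()  # HTMLParser reports tags in lower case
        self.depth = 0                  # > 0 while inside the target element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == self.element:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.element and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_subtext(document, element="body"):
    """Return the text inside `element`, with whitespace collapsed."""
    parser = SubtextExtractor(element)
    parser.feed(document)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def aggregate_subtexts(documents, element="body"):
    """Join the chosen subtext of each document into one new text."""
    return "\n".join(extract_subtext(doc, element) for doc in documents)
```

Text outside the chosen element, such as a page title in the head, is left out of the aggregated result.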
Pseudocode
- Get the URLs from user input
- Get the local text if the user specifies a local file to be aggregated with the text from the URLs
- Fetch the text at each URL
- Strip tags based on the user's selection
- Generate output in the user-selected format
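The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the tool's implementation: all function names are hypothetical, documents are assumed to be UTF-8, the local text is appended after the fetched texts, and tag stripping uses a crude regular expression where a real tool would use a parser.

```python
import re
from urllib.request import urlopen


def fetch_url(url):
    """Download one document and decode it as UTF-8 (an assumption)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")


def parse_url_list(text):
    """One URL per line; blank lines are ignored."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def strip_tags(markup):
    """Crude tag removal; a real tool would use an HTML parser."""
    return " ".join(re.sub(r"<[^>]+>", " ", markup).split())


def aggregate(url_text, local_text=None, strip=True, fetch=fetch_url):
    """Follow the pseudocode steps and return one aggregated text."""
    urls = parse_url_list(url_text)              # get the URLs
    texts = [fetch(url) for url in urls]         # fetch the text at each URL
    if local_text is not None:                   # optional local text
        texts.append(local_text)
    if strip:                                    # strip tags if selected
        texts = [strip_tags(text) for text in texts]
    return "\n\n".join(texts)                    # generate the output
```

The `fetch` argument exists so the pipeline can be tried without network access, for example by passing a dictionary lookup in place of `fetch_url`.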
Ways of Using
- Enter the URLs of the texts you want aggregated into the "URLs" text area (one URL per line), if that option is selected
- Or enter (or browse to) the path of a local file in the "Local file" field if your URLs are in a file. The file should also contain one URL per line
- Browse to or enter the path of a local text in the "Plus ..." file field if you want that text to be aggregated with the texts indicated above
- Select how to treat any tags in the retrieved texts
- Select the format of the aggregated text
- Click the submit button
CGI Interface
If you want to use this tool from your web site, here is the CGI Interface:
(Note: If you want to upload local HTML text to the tool, you need to set the attribute enctype="multipart/form-data" in the form tag.)
Here are the parameters:
| Parameter Name | Parameter Value | Control Type | Default | Description |
| source | url/local | radio button | url | Lets the user select the input text (either a URL or an uploaded local HTML text) |
| urls | | text area | | List of URLs whose text is to be aggregated, one URL per line |
| urlfile | | file | | Path to your local URL list file |
| localfile | | file | | Path to the local text file you want aggregated with the retrieved text above |
| howtoaggre | striptag/formxml | radio button | striptag | Indicates how to treat the tags in the retrieved text. If "Strip tag" is selected, all tags are stripped. Otherwise, an XML corpus is generated with non-XML text commented out |
| strip | 1/2/3/4 | selection | 1 | Output format, corresponding to HTML/XML text in HTML/XML tree/Plain text |
| taporface | | checkbox | checked | Display the result in a new window without the graphical interface (default) or with the Taporware interface in the same window |
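As a rough illustration of the table, the following Python sketch assembles and validates the fields for a URL-based request. The helper names are hypothetical, the address of the CGI script is not given on this page and is therefore not included, and taporface is left out because the value the checkbox submits is not documented here. File uploads (urlfile, localfile) would need multipart/form-data and are not covered.

```python
from urllib.parse import urlencode

# Allowed values and defaults, copied from the parameter table.
ALLOWED = {
    "source": {"url", "local"},
    "howtoaggre": {"striptag", "formxml"},
    "strip": {"1", "2", "3", "4"},
}
DEFAULTS = {"source": "url", "howtoaggre": "striptag", "strip": "1"}


def build_form_fields(urls, **overrides):
    """Return the form fields for a URL-based request (hypothetical helper)."""
    unknown = set(overrides) - set(ALLOWED)
    if unknown:
        raise ValueError(f"unknown parameter(s): {sorted(unknown)}")
    fields = dict(DEFAULTS)
    fields.update({name: str(value) for name, value in overrides.items()})
    for name, allowed in ALLOWED.items():
        if fields[name] not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}")
    fields["urls"] = "\n".join(urls)  # one URL per line
    return fields


def encode_form(fields):
    """Encode the fields as an application/x-www-form-urlencoded body."""
    return urlencode(fields)
```

Validating against the table before sending keeps mistakes such as an out-of-range strip value from reaching the server.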
Web Service Interface
Taporware provides web services to any non-profit organization. Here is the Taporware web services information:
- Endpoint URL: http://taporware.mcmaster.ca:9982
- Service URI: http://taporware.mcmaster.ca/~taporware/webservice
- Service Method: aggregator_HTML
- Parameters:
  - source1 -- any source text
  - source2 -- any source text to be aggregated with source1
  - urls -- valid URLs, delimited by "\n", whose text is to be aggregated with the two sources above
  - element -- the element name applied to all aggregated texts. Only text contained in this element is extracted and aggregated. The default is body
  - outFormat -- output format: 1 -- plain text, 2 -- HTML text
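The page does not say which protocol the web service uses. The Python sketch below assumes a SOAP 1.1 RPC-style call, with the Service URI used as the method namespace and the parameters passed as plain string elements in the order listed above. Those are assumptions, as is the helper name `build_request`. It only builds the request message and does not send anything to the endpoint.

```python
import xml.etree.ElementTree as ET

# Values taken from the list above.
ENDPOINT_URL = "http://taporware.mcmaster.ca:9982"
SERVICE_URI = "http://taporware.mcmaster.ca/~taporware/webservice"
METHOD = "aggregator_HTML"
PARAM_ORDER = ("source1", "source2", "urls", "element", "outFormat")

# Assumption: the service accepts SOAP 1.1 envelopes.
SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"


def build_request(source1, source2, urls, element="body", out_format=1):
    """Build an RPC-style request message as a string (hypothetical helper)."""
    values = {
        "source1": source1,
        "source2": source2,
        "urls": "\n".join(urls),       # URLs are delimited by "\n"
        "element": element,            # default is body
        "outFormat": str(out_format),  # 1 = plain text, 2 = HTML text
    }
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_URI}}}{METHOD}")
    for name in PARAM_ORDER:
        ET.SubElement(call, name).text = values[name]
    return ET.tostring(envelope, encoding="unicode")
```

If the service does turn out to speak SOAP, the resulting string would be posted to the endpoint URL; a dedicated SOAP client library may be a better fit than hand-built XML.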
Known Bugs
To Do
--
MattPatey - 15 Oct 2005