HTML Text Extractor
See
http://taporware.mcmaster.ca/~taporware/betaTools/textextractor.shtml
Description
Web pages increasingly use CSS to control their layout. A popular approach is to place functionally distinct parts of a page in separate <div> tags, each with a different "id" attribute.
In text analysis, HTML pages are usually processed according to the text contained in certain tags. That text may include not only the main content but also navigation and similar material, which distorts the results, especially for short texts.
This tool enables the user to extract the text contained in a tag with a specific attribute name and value. (Note: if the page contains several units with the same tag, attribute name, and attribute value, for example ones used for font or color styling, the tool extracts only the first one.)
Pseudocode
- Obtain the HTML string from a URL or from the user's local disk
- Extract text based on the user-specified element, attribute name, and attribute value (if no attribute name and value are specified, the tool behaves the same as the HTML Extract text tool)
- Generate output in the user-specified display format
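The extraction step above can be sketched in Python with the standard-library html.parser module. This is only an illustration of the idea, not the tool's actual implementation: the names FirstMatchExtractor and extract_text are hypothetical, the attribute value is assumed to be compared as an exact string, and when no attribute is given the sketch simply takes the first element with that tag, which may differ from what the tool does.

```python
from html.parser import HTMLParser


class FirstMatchExtractor(HTMLParser):
    """Collect the text inside the first element matching tag/attribute/value.

    Hypothetical helper for illustration; not part of TAPoRware.
    """

    def __init__(self, tag, attr_name=None, attr_value=None):
        super().__init__(convert_charrefs=True)
        self.tag = tag.lower()
        self.attr_name = attr_name
        self.attr_value = attr_value
        self.depth = 0      # nesting depth of same-named tags inside the match
        self.done = False   # True once the first matching element has closed
        self.chunks = []

    def _matches(self, tag, attrs):
        if tag != self.tag:
            return False
        if self.attr_name is None:
            return True
        # Assumption: the attribute value must match exactly.
        return dict(attrs).get(self.attr_name) == self.attr_value

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth > 0:
            # Only same-named tags change the depth, so unclosed <p> or <li>
            # elements inside the match do not confuse the counter.
            if tag == self.tag:
                self.depth += 1
        elif self._matches(tag, attrs):
            self.depth = 1

    def handle_endtag(self, tag):
        if self.done or self.depth == 0 or tag != self.tag:
            return
        self.depth -= 1
        if self.depth == 0:
            self.done = True  # later identical elements are ignored

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_text(html, tag, attr_name=None, attr_value=None):
    """Return the whitespace-normalized text of the first matching element."""
    parser = FirstMatchExtractor(tag, attr_name, attr_value)
    parser.feed(html)
    parser.close()
    # Join the text chunks with spaces and collapse runs of whitespace.
    return " ".join(" ".join(parser.chunks).split())


if __name__ == "__main__":
    page = ('<div id="nav">Home | About</div>'
            '<div id="main"><p>Hello <b>world</b></p></div>')
    print(extract_text(page, "div", "id", "main"))
```

Counting only tags with the same name as the target keeps the sketch tolerant of markup that omits optional closing tags, at the cost of assuming the target element itself is properly closed.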
Ways of Using
- Enter a valid URL that points to an HTML document in the URL field, or enter a local path to upload an HTML file
- Enter a valid HTML tag, an attribute name and an attribute value in the corresponding text fields
- Select an output format
- Click the "submit" button
CGI Interface
If you want to use this tool from your web site, here is the CGI Interface:
(Note: to upload a local HTML file to the tool, the form tag must include the attribute name/value pair enctype="multipart/form-data".)
Here are the parameters:
| Parameter Name | Parameter Value | Control Type | Default | Description |
| source | url/local | radio button | url | Lets the user select the input text (either a URL or an uploaded local HTML file) |
| htmlurl | | text | | A valid URL pointing to an HTML document |
| localFile | | file | | The path to your local HTML file |
| tagword | | text | div | A valid HTML element (tag) name, or multiple HTML element names |
| attribute | | text | class | A valid attribute name within the element above |
| attrvalue | | text | item | A valid attribute value corresponding to the attribute name above |
| listdisp | 2/1 | select | 2 | Output display format: 2 for HTML, 1 for HTML tags in HTML |
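As a rough illustration of how these parameters fit together, the following Python sketch assembles a multipart/form-data POST request for the URL case without sending it. The helper names build_fields, encode_multipart, and build_request are hypothetical, the CGI address is the one used in the form on this page, and only the parameters listed in the table are included. Whether the script accepts a request built this way, or is still reachable at that address, is an assumption.

```python
import urllib.request
import uuid

# Address taken from the form example on this page.
CGI_URL = "http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/textextractor.cgi"


def build_fields(page_url, tag="div", attribute="class", attrvalue="item",
                 listdisp="2"):
    """Return the form fields for a URL-based request (defaults from the table)."""
    return {
        "source": "url",       # "url" or "local"
        "htmlurl": page_url,
        "tagword": tag,
        "attribute": attribute,
        "attrvalue": attrvalue,
        "listdisp": listdisp,  # "2" = HTML, "1" = HTML tags in HTML
    }


def encode_multipart(fields, boundary=None):
    """Encode plain text fields as a multipart/form-data body."""
    boundary = boundary or uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append(f"--{boundary}")
        lines.append(f'Content-Disposition: form-data; name="{name}"')
        lines.append("")
        lines.append(value)
    lines.append(f"--{boundary}--")
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def build_request(page_url, **options):
    """Build, but do not send, a POST request for the CGI script."""
    body, content_type = encode_multipart(build_fields(page_url, **options))
    return urllib.request.Request(
        CGI_URL,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


if __name__ == "__main__":
    req = build_request("http://example.org/page.html",
                        attrvalue="ds-contentcontainer")
    print(req.get_method(), req.full_url)
    # To actually submit: urllib.request.urlopen(req).read()
```

The request is only constructed here so the sketch can be inspected offline; passing it to urllib.request.urlopen would submit it.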
Use Extract HTML TAPoRware Tool in Your Web Page
You can add a button to your web page that extracts the text contained in specified HTML tags on that page by calling the TAPoRware CGI script.
Here is the code for the tool interface:
(Note: you need to change the values of HTML Tags, Attribute, and Attribute Value to suit your page.)
<form method="post" name="htmlForm" enctype="multipart/form-data" target="_blank" action="http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/textextractor.cgi" onsubmit="document.htmlForm.htmlurl.value=document.location.href">
<input type="hidden" name="source" value="url" />
<input type="hidden" name="htmlurl" />
HTML Tags: <input type="text" name="tagword" value="div" /><br>
Attribute: <input type="text" name="attribute" value="class" />
Attribute Value: <input type="text" name="attrvalue" value="ds-contentcontainer"/><br>
<input type="hidden" name="listdisp" value="2" />
<input type="hidden" name="taporface" value="same" />
<input type="submit" name="doit" value="Extract Text" />
</form>
--
LianYan - 01 Jun 2007