HTML Text Extractor

See http://taporware.mcmaster.ca/~taporware/betaTools/textextractor.shtml

Description

Web pages that use CSS to control layout are becoming increasingly common. One popular approach is to put the functionally distinct parts of a page into <div> tags, each with a different "id" attribute value.

In text analysis, HTML pages are usually processed according to the text contained in certain tags. That text may include not only the main content but also navigation and similar page elements, so the results can be distorted, especially for short texts.
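To make the distortion concrete, here is a small sketch comparing relative word frequencies for a whole page against its main content only. The two strings are made-up stand-ins for a navigation block and a short main text, and `relative_frequencies` is a hypothetical helper, not part of TAPoRware.

```python
from collections import Counter

def relative_frequencies(text):
    """Return each word's share of the total word count."""
    words = text.lower().split()
    counts = Counter(words)
    return {word: count / len(words) for word, count in counts.items()}

# Made-up text standing in for a page's navigation block and its main content.
navigation = "home tools help search tools help"
main_text = "whales sing long songs"

whole_page = relative_frequencies(navigation + " " + main_text)
main_only = relative_frequencies(main_text)

print(main_only["whales"])   # 0.25
print(whole_page["whales"])  # 0.1
print(whole_page["tools"])   # 0.2
```

With only four words of real content, the navigation words outnumber the main text, and a navigation word ends up twice as frequent as any content word.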

This tool enables the user to extract the text contained in a tag with a specific attribute name and value. (Note: if the page has many elements with the same tag, attribute name, and attribute value, for example elements used only to set font or color, the tool extracts the first one.)
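The behaviour described above can be approximated with Python's standard `html.parser` module. This is only a sketch, not the tool's actual implementation: the names `FirstMatchExtractor` and `extract_text` are hypothetical, attribute values are compared by exact string equality, and when no attribute is given the sketch simply takes the first element with the requested tag, all of which are assumptions.

```python
from html.parser import HTMLParser

class FirstMatchExtractor(HTMLParser):
    """Collect the text of the first element matching tag/attribute/value."""

    def __init__(self, tag, attr_name=None, attr_value=None):
        super().__init__(convert_charrefs=True)
        self.tag = tag.lower()
        self.attr_name = attr_name.lower() if attr_name else None
        self.attr_value = attr_value
        self.depth = 0      # > 0 while inside the matched element
        self.done = False   # True once the first match has been closed
        self.chunks = []

    def _matches(self, attrs):
        if self.attr_name is None:
            return True  # assumption: no attribute means "first such tag"
        return dict(attrs).get(self.attr_name) == self.attr_value

    def handle_starttag(self, tag, attrs):
        if self.done or tag != self.tag:
            return
        if self.depth:
            self.depth += 1  # nested element with the same tag name
        elif self._matches(attrs):
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:
                self.done = True  # ignore any later matches

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)

def extract_text(html, tag, attr_name=None, attr_value=None):
    parser = FirstMatchExtractor(tag, attr_name, attr_value)
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left over from the markup.
    return " ".join("".join(parser.chunks).split())

page = (
    '<div id="nav">Home | Help</div>'
    '<div id="main"><p>First <b>match</b></p> <div>nested</div></div>'
    '<div id="main">second</div>'
)
print(extract_text(page, "div", "id", "main"))  # First match nested
```

Only nesting of the target tag is tracked, so unclosed inner tags such as `<p>` do not end the match early.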

Pseudocode

  • Obtain the HTML string from a URL or from the user's local disk
  • Extract text based on the user-specified element, attribute name, and attribute value (if no attribute name and value are specified, the tool behaves the same as the HTML Extract text tool)
  • Generate output based on the user-specified display format
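As an illustration of the first step only, the following sketch loads HTML either from a URL or from a local file. The function name `load_html` is hypothetical; the `"url"` and `"local"` labels mirror the tool's `source` parameter, and the UTF-8 fallback is an assumption. Extraction and output formatting are left to the caller.

```python
from pathlib import Path
from urllib.request import urlopen

def load_html(source, location):
    """Return the HTML text found at a URL or a local path."""
    if source == "url":
        with urlopen(location) as response:
            # Use the charset declared by the server, else assume UTF-8.
            charset = response.headers.get_content_charset() or "utf-8"
            return response.read().decode(charset, errors="replace")
    if source == "local":
        return Path(location).read_text(encoding="utf-8", errors="replace")
    raise ValueError(f"source must be 'url' or 'local', got {source!r}")
```

A call such as `load_html("local", "page.html")` returns the file's contents as a string, ready to pass to an extraction step.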

Ways of Using

  • Enter a valid URL that points to an HTML document in the URL field, or enter a local path to upload an HTML file
  • Enter a valid HTML tag, an attribute name and an attribute value in the corresponding text fields
  • Select the output format
  • Click the "submit" button

CGI Interface

If you want to use this tool from your own web site, use the CGI interface described here. (Note: to upload a local HTML file to the tool, the form tag must include the attribute enctype="multipart/form-data".)

Here are the parameters:

| Parameter Name | Parameter Value | Control Type | Default | Description |
|---|---|---|---|---|
| source | url/local | radio button | url | Lets the user select the input text (either a URL or an uploaded local HTML file) |
| htmlurl | | text | | A valid URL pointing to an HTML document |
| localFile | | file | | The path to your local HTML file |
| tagword | | text | div | A valid HTML element (tag) name, or multiple HTML element names |
| attribute | | text | class | A valid attribute name within the element above |
| attrvalue | | text | item | A valid attribute value corresponding to the attribute name above |
| listdisp | 2/1 | select | 2 | Output display format: 2 = HTML, 1 = HTML tags in HTML |
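For callers that are scripts rather than web forms, the parameters above can be packed into a multipart/form-data POST by hand. The sketch below only builds the request and does not send it. The endpoint is the one used in the form example on this page, and whether the server is still reachable is not verified here. `encode_multipart` and `build_request` are hypothetical helper names, and the `text/html` content type for the uploaded file is an assumption.

```python
import uuid
import urllib.request

# Endpoint taken from the form example on this page.
ENDPOINT = "http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/textextractor.cgi"

def encode_multipart(fields, files=None):
    """Encode text fields and optional files as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts += [
            f"--{boundary}".encode("ascii"),
            f'Content-Disposition: form-data; name="{name}"'.encode("utf-8"),
            b"",
            str(value).encode("utf-8"),
        ]
    for name, (filename, content) in (files or {}).items():
        parts += [
            f"--{boundary}".encode("ascii"),
            f'Content-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"'.encode("utf-8"),
            b"Content-Type: text/html",  # assumption: uploads are HTML
            b"",
            content,
        ]
    parts += [f"--{boundary}--".encode("ascii"), b""]
    return b"\r\n".join(parts), f"multipart/form-data; boundary={boundary}"

def build_request(tagword="div", attribute="class", attrvalue="item",
                  htmlurl=None, local_file=None, listdisp="2"):
    """Build (but do not send) a POST request for the extractor script.

    local_file, if given, is a (filename, bytes) pair and switches
    the 'source' parameter from "url" to "local".
    """
    fields = {"tagword": tagword, "attribute": attribute,
              "attrvalue": attrvalue, "listdisp": listdisp}
    files = None
    if local_file is not None:
        fields["source"] = "local"
        files = {"localFile": local_file}
    else:
        fields["source"] = "url"
        fields["htmlurl"] = htmlurl
    body, content_type = encode_multipart(fields, files)
    return urllib.request.Request(
        ENDPOINT, data=body, headers={"Content-Type": content_type})

req = build_request(htmlurl="http://example.org/page.html",
                    attribute="id", attrvalue="main")
print(req.get_method())  # POST
```

Passing the resulting object to `urllib.request.urlopen` would submit it. The defaults for `tagword`, `attribute`, `attrvalue`, and `listdisp` follow the table above.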

Use Extract HTML TAPoRware Tool in Your Web Page

You can add a button to your web page that extracts the text contained in specified HTML tags on that page by calling the TAPoRware CGI script.

Here is the code for the tool interface:

(Note: change the values of HTML Tags, Attribute, and Attribute Value to suit your own page.)

<form method="post" name="htmlForm" enctype="multipart/form-data" target="_blank" action="http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/textextractor.cgi" onsubmit="document.htmlForm.htmlurl.value=document.location.href">

<input type="hidden" name="source" value="url" />

<input type="hidden" name="htmlurl" />

HTML Tags: <input type="text" name="tagword" value="div" /><br>

Attribute: <input type="text" name="attribute" value="class" />

Attribute Value: <input type="text" name="attrvalue" value="ds-contentcontainer"/><br>

<input type="hidden" name="listdisp" value="2" />

<input type="hidden" name="taporface" value="same" />

<input type="submit" name="doit" value="Extract Text" />

</form>

-- LianYan - 01 Jun 2007

