Tokenize HTML Document
See
http://taporware.mcmaster.ca/~taporware/htmlTools/tokenize.shtml
Description
This tool splits an HTML document at specified points, or tokens. These tokens can be words, lines, sentences, or paragraphs, as well as certain characters, patterns, or tags. In the results, each separator can be removed, listed as a token of its own, or kept with the token before or after the split.
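The separator handling described above can be sketched in a few lines of Python. This is only an illustration, not the tool's actual code: the function name `split_with_separators` is hypothetical, and the numbering 1 to 4 is assumed to match the `dispop` values listed in the CGI parameter table.

```python
import re

def split_with_separators(text, separator_pattern, display_option=1):
    """Split text on a regex and arrange the separators per display_option.

    Assumes separator_pattern contains no capturing groups of its own.
    """
    # Wrapping the pattern in a capturing group makes re.split keep the
    # separators: tokens land at even indices, separators at odd indices.
    parts = re.split(f"({separator_pattern})", text)
    tokens, seps = parts[0::2], parts[1::2]
    if display_option == 1:      # strip separator
        result = tokens
    elif display_option == 2:    # keep separator as a token of its own
        result = parts
    elif display_option == 3:    # keep separator with the previous token
        result = [t + s for t, s in zip(tokens, seps + [""])]
    elif display_option == 4:    # keep separator with the following token
        result = [s + t for s, t in zip([""] + seps, tokens)]
    else:
        raise ValueError("display_option must be 1, 2, 3, or 4")
    # Drop empty strings produced by adjacent or leading separators.
    return [item for item in result if item]
```

For example, splitting `"a,b;c"` on `[,;]` gives `['a', 'b', 'c']` with option 1, `['a,', 'b;', 'c']` with option 3, and `['a', ',b', ';c']` with option 4.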
Term Definition
- token: The basic syntactic unit of a text. A token consists of one or more characters, excluding the blank character and excluding characters within a string constant or delimited identifier.
- tokenizer: A parsing program that scans text and determines when and if a series of characters can be recognized as a token.
Predefined Parameter Values in Tool Bar
- Source: the page the user is currently in.
- Element: body, or as set by the site owner
- Token type: words
- Display option: strip separator
- Display format: HTML
Pseudocode
- Obtain HTML string by URL or from user's local disk
- Obtain the text contained in the user-specified tags
- Generate the token or separator pattern based on the user-selected token type
- Perform tokenizer function
- Arrange token and separator based on user's preference
- Generate output format
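The middle steps of the pseudocode (extracting text from the requested tags, then tokenizing it) might look roughly like the sketch below. It uses only the Python standard library. The class `TagTextExtractor`, the function `tokenize_html`, and the separator patterns in `SEPARATORS` are assumptions made for illustration; the patterns the real tool uses are not documented here.

```python
import re
from html.parser import HTMLParser

# Hypothetical separator patterns for the predefined token types.
SEPARATORS = {
    "Word": r"\s+",
    "Line": r"\n",
    "Sentence": r"(?<=[.!?])\s+",
    "Paragraph": r"\n\s*\n",
}

class TagTextExtractor(HTMLParser):
    """Collect the text found inside any of the requested elements."""

    def __init__(self, tags):
        super().__init__()
        # Accept a single tag name or a comma-separated list of names.
        self.tags = {t.strip().lower() for t in tags.split(",")}
        self.depth = 0      # number of requested elements currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while inside at least one requested element.
        if self.depth > 0:
            self.chunks.append(data)

def tokenize_html(html, tags="body", token_type="Word"):
    """Return the tokens of the text inside the given tags (separators stripped)."""
    extractor = TagTextExtractor(tags)
    extractor.feed(html)
    extractor.close()
    text = "".join(extractor.chunks)
    return [t for t in re.split(SEPARATORS[token_type], text) if t.strip()]
```

This simplified version always strips the separators and does not filter out script or style content; a fuller version would also arrange separators according to the display option and render the chosen output format.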
Ways of Using
- Enter a valid URL in the URL field, or upload a local HTML file
- Enter a valid HTML tag, or a list of tags separated by commas; the default is "body"
- Select the token type. If you select "Separate on characters" or "Separate on pattern", you must also provide the corresponding characters or pattern (in Unix style or as a regular expression)
- Select how you want the separators to be treated
- Select output format
- If you want the results displayed in the same window with the TAPoRware interface, uncheck the "Open results in new window" check box
- Finally, click the "Submit" button
CGI Interface
If you want to use this tool from your web site, here is the CGI Interface:
(Note: If you want to upload local HTML text to the tool, the form tag must include the attribute name/value pair enctype="multipart/form-data".)
Here are the parameters:
| Parameter Name | Parameter Value | Control Type | Default | Description |
| source | url/local | radio button | url | Selects the input text: either a URL or an uploaded local HTML file |
| htmlurl | | text | | A valid URL that points to an HTML document |
| localFile | | file | | The path to your local HTML file |
| tagtext | | text | body | A valid HTML element (tag) name, or multiple element names separated by commas |
| token | Word/Line/Sentence/Paragraph/char/pat | radio button | Word | Token type, as named by the value; char separates on characters and pat separates on a pattern |
| character | | text | | Characters, separated by spaces, to use as the token type |
| pFormat | unix/regexp | radio button | | Style of the pattern used as the token type |
| unix | | text | | Unix-style pattern |
| regexp | | text | | Regular expression used as the token |
| dispop | 1/2/3/4 | radio button | 1 | Separator handling: 1 = strip separator, 2 = keep separator as token, 3 = keep with previous token, 4 = keep with following token |
| HowToList | 1/2/3 | selection | 1 | Display format: 1 = HTML, 2 = XML text in HTML, 3 = XML tree |
| taporface | | checkbox | checked | Display the result in a new window without the graphical interface (default), or in the same window with the TAPoRware interface |
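As a rough illustration of how these parameters fit together, the Python sketch below assembles a POST request for the URL case without sending it. The helper name `build_tokenize_request` is hypothetical. It is also an assumption that the script accepts URL-encoded form data when no file is uploaded, and that the server is still reachable at this address.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Script address taken from the form example on this page.
CGI_URL = "http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/htokenize.cgi"

def build_tokenize_request(page_url, tagtext="body", token="Word",
                           dispop=1, how_to_list=1):
    """Build (but do not send) a POST request that tokenizes page_url."""
    params = {
        "source": "url",            # read the document from a URL
        "htmlurl": page_url,
        "tagtext": tagtext,
        "token": token,
        "dispop": str(dispop),      # 1 = strip separator
        "HowToList": str(how_to_list),  # 1 = HTML output
    }
    # Assumption: URL-encoded bodies are accepted when no file is uploaded.
    body = urlencode(params).encode("ascii")
    return Request(CGI_URL, data=body, method="POST")
```

Passing the returned object to `urllib.request.urlopen` would submit it. Uploading a local file instead would require a multipart/form-data body, which this sketch does not build.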
Use Tokenizer TAPoRware Tool in Your Web Page
You can add a button to your web page that tokenizes the page (into words) by calling the TAPoRware CGI script.
Here is the code for the tool interface:
<form method="post" name="htmlForm" enctype="multipart/form-data" target="_blank" action="http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/htokenize.cgi" onsubmit="document.htmlForm.htmlurl.value=document.location.href">
<input type="hidden" name="source" value="url" />
<input type="hidden" name="htmlurl" />
<input type="hidden" name="freetext" value="yes"/>
<input type="hidden" name="tagtext" value="body" />
<input type="hidden" name="token" value="Word" />
<input type="hidden" name="dispop" value="1" />
<input type="hidden" name="HowToList" value="1" />
<input type="hidden" name="taporface" value="same" />
<input type="submit" name="doit" value="Tokenize This Page" />
</form>
Web Service Interface
TAPoRware provides web services to any non-profit organization. Here is the TAPoRware web service information:
- Endpoint URL: http://taporware.mcmaster.ca:9982
- Service URI: http://taporware.mcmaster.ca/~taporware/webservice
- Service Method: tokenizer_HTML
- Parameters:
  - htmlInput -- any HTML string
  - htmlTag -- any HTML element (tag) name, or multiple element names separated by commas
  - tokenType -- one of Word/Line/Sentence/Paragraph/char/pat
  - option1 -- the value of this parameter depends on the tokenType used. If "char" is used, enter space-separated characters. If "pat" is used, enter "unix" or "regexp"
  - option2 -- if option1 is "unix", enter a Unix-style pattern; if option1 is "regexp", enter a valid regular expression; otherwise, leave it empty
  - displayOption -- one of 1/2/3/4, corresponding to strip separator/keep separator as token/keep with previous token/keep with following token
  - outFormat -- one of 1/2/3, corresponding to HTML/XML text in HTML/XML tree respectively
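The dependencies between tokenType, option1, and option2 are easy to get wrong, so a client may want to check them before calling the service. The Python sketch below only assembles and validates the argument set; the function name `build_tokenizer_args` is hypothetical, and how the arguments are then passed to tokenizer_HTML depends on the web service client in use, which this page does not specify.

```python
TOKEN_TYPES = {"Word", "Line", "Sentence", "Paragraph", "char", "pat"}

def build_tokenizer_args(html_input, html_tag="body", token_type="Word",
                         option1="", option2="",
                         display_option=1, out_format=1):
    """Validate and return the arguments for tokenizer_HTML as a dict."""
    if token_type not in TOKEN_TYPES:
        raise ValueError(f"unknown tokenType: {token_type!r}")
    if token_type == "char":
        # option1 carries the space-separated characters; option2 is unused.
        if not option1.strip():
            raise ValueError("option1 must list space-separated characters")
        option2 = ""
    elif token_type == "pat":
        # option1 names the pattern style; option2 carries the pattern itself.
        if option1 not in ("unix", "regexp"):
            raise ValueError('option1 must be "unix" or "regexp"')
        if not option2:
            raise ValueError("option2 must contain the pattern")
    else:
        # Predefined token types: this sketch sends both options empty.
        option1 = option2 = ""
    if display_option not in (1, 2, 3, 4):
        raise ValueError("displayOption must be 1, 2, 3, or 4")
    if out_format not in (1, 2, 3):
        raise ValueError("outFormat must be 1, 2, or 3")
    return {
        "htmlInput": html_input,
        "htmlTag": html_tag,
        "tokenType": token_type,
        "option1": option1,
        "option2": option2,
        "displayOption": display_option,
        "outFormat": out_format,
    }
```

For instance, `build_tokenizer_args("<body>a1b</body>", token_type="pat", option1="regexp", option2=r"\d+")` returns a dict ready to hand to whichever client library makes the call.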
Known Bugs
To Do
--
MattPatey - 13 Oct 2005