Tokenize HTML Document
See http://taporware.mcmaster.ca/~taporware/htmlTools/tokenize.shtml
Ways of Using
- Enter a valid URL in the URL field, or upload a local HTML file
- Enter a valid HTML tag, or a list of tags separated by commas; the default is "body"
- Select the token type. If you select "Separate on characters" or "Separate on pattern", you must also provide the corresponding characters or pattern (either a Unix-style pattern or a regular expression)
- Select how you want the separators to be treated
- Select output format
- To display the results in the same window with the Taporware interface, uncheck the "Open results in new window" check box
- Finally, click the "Submit" button
Here are the parameters:
| Parameter Name | Parameter Value | Control Type | Default | Description |
| source | url/local | radio button | url | Lets the user choose the input: a URL or an uploaded local HTML file |
| htmlurl | | text | | A valid URL that points to an HTML document |
| localFile | | file | | The path to your local HTML file |
| tagtext | | text | body | A valid HTML element (tag) name, or multiple element names separated by commas |
| token | /Word/Line/Sentence/Paragraph/char/pat | radio button | Word | The token type; each value names the unit to split on (char = characters, pat = pattern) |
| character | | text | | Space-separated characters to use as tokens |
| pFormat | unix/regexp | radio button | | The pattern syntax to use |
| unix | | text | | A Unix-style pattern |
| regexp | | text | | A regular expression to use as the token |
| dispop | 1/2/3/4 | radio button | 1 | Separator handling: 1 = strip separator, 2 = keep separator as token, 3 = keep with previous token, 4 = keep with following token |
| HowToList | 1/2/3 | selection | 1 | Display format: 1 = HTML, 2 = XML text in HTML, 3 = XML tree |
| taporface | | checkbox | checked | When checked (default), results open in a new window without the graphical interface; when unchecked, results display in the same window with the Taporware interface |
Web Service Interface
Taporware provides web services to any non-profit organization. Here is the Taporware web services information:
- htmlTag -- any HTML element (tag) name, or multiple element names separated by commas
- tokenType -- values can be Word/Line/Sentence/Paragraph/char/pat
- option1 -- the value of this parameter depends on the "tokenType" you used. If "char" is used, enter space-separated characters. If "pat" is used, enter "unix" or "regexp"
- option2 -- if you entered "unix" in the option1 field, enter a Unix-style pattern; if you entered "regexp" in the option1 field, enter a valid regular expression; otherwise, leave it empty
Description
This tool splits an HTML document at specified points, or tokens. These tokens can be words, lines, sentences, and paragraphs, as well as certain characters, patterns, or tags. The results can be listed with the token removed, before the split, or after the split.
Term Definition
- token: The basic syntactic unit of a text. A token consists of one or more characters, excluding the blank character and excluding characters within a string constant or delimited identifier.
- tokenizer: A parsing program that scans text and determines when and if a series of characters can be recognized as a token.
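As a rough illustration of the splitting behaviour described above, the following Python sketch splits plain text on a regular expression and applies one of four separator treatments: strip, keep as a token, keep with the previous token, or keep with the following token. The function name `tokenize` and its `mode` argument are assumptions made for this example. It does not parse HTML or select tags, and it is not Taporware's implementation.

```python
import re


def tokenize(text, separator_pattern, mode=1):
    """Split text on a regex and treat the separators according to mode.

    mode 1: strip separators        mode 2: keep each separator as a token
    mode 3: keep with previous      mode 4: keep with following token
    """
    if mode not in (1, 2, 3, 4):
        raise ValueError("mode must be 1, 2, 3 or 4")

    # Collect the text between separators and the separators themselves.
    chunks, separators, position = [], [], 0
    for match in re.finditer(separator_pattern, text):
        if match.start() == match.end():
            continue  # ignore empty matches, they separate nothing
        chunks.append(text[position:match.start()])
        separators.append(match.group(0))
        position = match.end()
    chunks.append(text[position:])

    tokens = []
    for index, chunk in enumerate(chunks):
        before = separators[index - 1] if index > 0 else ""
        after = separators[index] if index < len(separators) else ""
        if mode == 1:
            tokens.append(chunk)
        elif mode == 2:
            tokens.extend([chunk, after])
        elif mode == 3:
            tokens.append(chunk + after)
        else:
            tokens.append(before + chunk)
    # Drop empty strings left by adjacent or trailing separators.
    return [token for token in tokens if token]
```

For example, `tokenize("a,b,c", ",", 3)` attaches each comma to the token before it, while mode 4 attaches it to the token after it.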
History
Pseudocode
CGI Interface
If you want to use this tool from your web site, here is the CGI Interface:
(Note: To upload a local HTML file to the tool, set the attribute enctype="multipart/form-data" on the form tag)
| unix | | text | | A Unix-style pattern |
| regexp | | text | | A regular expression to use as the token |
| dispop | 1/2/3/4 | radio button | 1 | Separator handling: 1 = strip separator, 2 = keep separator as token, 3 = keep with previous token, 4 = keep with following token |
| HowToList | 1/2/3 | selection | 1 | Display format: 1 = HTML, 2 = XML text in HTML, 3 = XML tree |
| taporface | | checkbox | checked | When checked (default), results open in a new window without the graphical interface; when unchecked, results display in the same window with the Taporware interface |
Web Service Interface
Ways of Using
- Enter a valid URL in the URL field, or upload a local HTML file
- Enter a valid HTML tag, or a list of tags separated by commas; the default is "body"
- Select the token type. If you select "Separate on characters" or "Separate on pattern", you must also provide the corresponding characters or pattern (either a Unix-style pattern or a regular expression)
- Select how you want the separators to be treated
- Select output format
- To display the results in the same window with the Taporware interface, uncheck the "Open results in new window" check box
- Finally, click the "Submit" button
CGI Interface
If you want to use this tool from your web site, here is the CGI Interface:
(Note: To upload a local HTML file to the tool, set the attribute enctype="multipart/form-data" on the form tag)
Here are the parameters:
| Parameter Name | Parameter Value | Control Type | Default | Description |
| source | url/local | radio button | url | Lets the user choose the input: a URL or an uploaded local HTML file |
| htmlurl | | text | | A valid URL that points to an HTML document |
| localFile | | file | | The path to your local HTML file |
| tagtext | | text | body | A valid HTML element (tag) name, or multiple element names separated by commas |
| token | /Word/Line/Sentence/Paragraph/char/pat | radio button | Word | The token type; each value names the unit to split on (char = characters, pat = pattern) |
| character | | text | | Space-separated characters to use as tokens |
| pFormat | unix/regexp | radio button | | The pattern syntax to use |
| unix | | text | | A Unix-style pattern |
| regexp | | text | | A regular expression to use as the token |
| dispop | 1/2/3/4 | radio button | 1 | Separator handling: 1 = strip separator, 2 = keep separator as token, 3 = keep with previous token, 4 = keep with following token |
| HowToList | 1/2/3 | selection | 1 | Display format: 1 = HTML, 2 = XML text in HTML, 3 = XML tree |
| taporface | | checkbox | checked | When checked (default), results open in a new window without the graphical interface; when unchecked, results display in the same window with the Taporware interface |
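The parameters in the table could be assembled into form data along the following lines. This is only a sketch: the page does not state the address of the CGI script, so `CGI_URL` is a placeholder, and the helper name `build_form_data` and its arguments are invented for the example. It covers the URL input case only; uploading a local file would need a multipart/form-data request, which is not shown.

```python
from urllib.parse import urlencode

# Hypothetical address: the page does not state the CGI script's URL.
CGI_URL = "http://example.org/cgi-bin/tokenize"


def build_form_data(html_url, tag="body", token="Word", pattern=None,
                    dispop=1, how_to_list=1):
    """Return URL-encoded form data for tokenizing a remote HTML page."""
    fields = {
        "source": "url",
        "htmlurl": html_url,
        "tagtext": tag,
        "token": token,
        "dispop": str(dispop),
        # Assumption: the field is named "HowToList" (no trailing "?").
        "HowToList": str(how_to_list),
    }
    if pattern is not None:
        # Split on a regular expression instead of a built-in token type.
        fields.update({"token": "pat", "pFormat": "regexp", "regexp": pattern})
    return urlencode(fields)
```

The returned string can be encoded to bytes and passed as the `data` argument of `urllib.request.urlopen(CGI_URL, data=...)`, which sends it as a POST request.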
Web Service Interface
Taporware provides web services to any non-profit organization. Here is the Taporware web services information:
- Endpoint URL: http://strange.mcmaster.ca:9982
- Service URI: http://strange.mcmaster.ca/~taporware/webservice
- Service Method: tokenizer_HTML
- parameters:
- htmlInput -- any HTML string
- htmlTag -- any HTML element (tag) name, or multiple element names separated by commas
- tokenType -- values can be Word/Line/Sentence/Paragraph/char/pat
- option1 -- the value of this parameter depends on the "tokenType" you used. If "char" is used, enter space-separated characters. If "pat" is used, enter "unix" or "regexp"
- option2 -- if you entered "unix" in the option1 field, enter a Unix-style pattern; if you entered "regexp" in the option1 field, enter a valid regular expression; otherwise, leave it empty
- displayOption -- value can be 1/2/3/4, corresponding to strip separator/keep separator as token/keep with previous token/keep with following token
- outFormat -- value can be 1/2/3, corresponding to HTML/XML text in HTML/XML tree respectively
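The rules that tie option1 and option2 to tokenType are easy to get wrong, so a small client-side check may help before calling tokenizer_HTML. The sketch below only validates and orders the arguments. The helper name `build_tokenizer_args` is hypothetical, and the transport used to reach the endpoint is not shown because the page does not describe it.

```python
TOKEN_TYPES = ("Word", "Line", "Sentence", "Paragraph", "char", "pat")


def build_tokenizer_args(html_input, html_tag, token_type="Word",
                         option1="", option2="", display_option=1,
                         out_format=1):
    """Validate the tokenizer_HTML parameters and return them in listed order."""
    if token_type not in TOKEN_TYPES:
        raise ValueError("unknown tokenType: %r" % token_type)
    if token_type == "char" and not option1.split():
        raise ValueError("option1 must hold space-separated characters")
    if token_type == "pat":
        if option1 not in ("unix", "regexp"):
            raise ValueError('option1 must be "unix" or "regexp"')
        if not option2:
            raise ValueError("option2 must hold the pattern")
    elif option2:
        raise ValueError('option2 must stay empty unless tokenType is "pat"')
    if display_option not in (1, 2, 3, 4):
        raise ValueError("displayOption must be 1, 2, 3 or 4")
    if out_format not in (1, 2, 3):
        raise ValueError("outFormat must be 1, 2 or 3")
    # Dicts keep insertion order, so this matches the documented order.
    return {
        "htmlInput": html_input,
        "htmlTag": html_tag,
        "tokenType": token_type,
        "option1": option1,
        "option2": option2,
        "displayOption": display_option,
        "outFormat": out_format,
    }
```

For instance, `build_tokenizer_args("<body>a b</body>", "body", "pat", "regexp", r"\s+")` passes the checks, while passing "perl" as option1 raises a `ValueError`.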
Known Bugs
To Do
List HTML Tags
See http://taporware/~taporware/htmlTools/tokenize.shtml
Description
This tool splits an HTML document at specified points, or tokens. These tokens can be words, lines, sentences, and paragraphs, as well as certain characters, patterns, or tags. The results can be listed with the token removed, before the split, or after the split.
History
Pseudocode
Ways of Using
CGI Interface
Web Service Interface
Known Bugs
To Do
-- MattPatey - 13 Oct 2005