Tokenize XML document or specific elements
See
http://taporware.mcmaster.ca/~taporware/xmlTools/tokenize.shtml
Description
This tool splits an XML document at specified points, called tokens. These tokens can be words, lines, sentences, paragraphs, characters, patterns, or tags. The separator at each split can be stripped from the results, kept as a token of its own, or kept with the previous or the following token.
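The separator handling described above can be illustrated with a minimal Python sketch. This is not the tool's own code: the function name `split_with_separators` and the mode labels are hypothetical, and a regular expression stands in for whatever separator the tool uses.

```python
import re

# Hypothetical labels for the four ways of treating a separator.
MODES = ("strip", "token", "previous", "following")

def split_with_separators(text, pattern, mode="strip"):
    """Split text on a regex pattern and treat each separator according to mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode}")
    tokens = []
    pos = 0
    pending = ""  # separator waiting to be attached to the following token
    for match in re.finditer(pattern, text):
        if match.start() == match.end():
            continue  # ignore empty matches, they separate nothing
        chunk = text[pos:match.start()]
        separator = match.group(0)
        pos = match.end()
        if mode == "strip":
            tokens.append(chunk)
        elif mode == "token":
            tokens.extend([chunk, separator])
        elif mode == "previous":
            tokens.append(chunk + separator)
        else:  # "following"
            tokens.append(pending + chunk)
            pending = separator
    tokens.append(pending + text[pos:])  # whatever follows the last separator
    return [t for t in tokens if t]      # drop empty tokens

sample = "one, two; three"
for mode in MODES:
    print(mode, split_with_separators(sample, r"[,;]\s*", mode))
```

Running the loop prints the same input split four ways, which makes the difference between the options easy to compare side by side.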
Pseudocode
- Get the user-submitted XML text from a specified URL or from the user's local disk
- Get the part of the text the user wants to tokenize, specified by element name and/or attribute
- Run tokenizer program
- Format output
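The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the tool's implementation: the helper names (`select_text`, `tokenize_words`, `format_tab_delimited`) are hypothetical, a literal string stands in for the document fetched from a URL or uploaded from disk, only the word token type is shown, and "word" is taken to mean a run of word characters.

```python
import re
import xml.etree.ElementTree as ET

# Stand-in for the XML text obtained from a URL or a local file.
SAMPLE = (
    '<doc><p type="intro">Hello brave world.</p>'
    '<p type="body">Second paragraph here.</p></doc>'
)

def select_text(xml_text, element=None, attr_name=None, attr_value=None):
    """Return the text of each element matching the name and attribute filters."""
    root = ET.fromstring(xml_text)
    nodes = [root] if element is None else list(root.iter(element))
    selected = []
    for node in nodes:
        if attr_name is not None:
            if attr_name not in node.attrib:
                continue
            if attr_value is not None and node.get(attr_name) != attr_value:
                continue
        selected.append("".join(node.itertext()))
    return selected

def tokenize_words(text):
    """Word tokenizer: every run of word characters is one token."""
    return re.findall(r"\w+", text)

def format_tab_delimited(tokens):
    """Format tokens as numbered, tab-delimited lines."""
    return "\n".join(f"{i}\t{tok}" for i, tok in enumerate(tokens, 1))

texts = select_text(SAMPLE, element="p", attr_name="type", attr_value="intro")
tokens = [tok for text in texts for tok in tokenize_words(text)]
print(format_tab_delimited(tokens))
```

Keeping selection, tokenizing, and formatting in separate functions mirrors the pseudocode, so another token type or output format could be swapped in without touching the other steps.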
Ways of Using
- Enter a valid URL in the URL field or choose a local XML file to upload
- Enter a valid XML element name or XPath that exists in the XML text; the default is "//"
- Select token type
- Select the way to treat separators
- Select output format
- Click the submit button
CGI Interface
If you want to use this tool from your web site, here is the CGI Interface:
(Note: To upload a local XML file to the tool, set the attribute name/value pair enctype="multipart/form-data" within the form tag.)
Here are the parameters:
| Parameter Name | Parameter Value | Control Type | Default | Description |
| source | url/local | radio button | url | Selects the input source: a URL or an uploaded local XML file |
| xmlurl | | text | | A valid URL pointing to an XML text |
| localFile | | file | | The path to your local XML text file |
| xmlpath | | text | // | A valid XML element (tag) name, or multiple XML element names separated by commas |
| attr_name | | text | | Valid xml attribute name |
| attr_value | | text | | Valid xml attribute value |
| token | Word/Line/Sentence/Paragraph/char/elem/pat | radio button | Word | A radio button group for selecting the token type |
| character | | text | | Required when "characters" is selected as the token type; enter the character token here |
| element | | text | | Required when "Separate on tags" is selected; enter a valid XML tag from the submitted XML document |
| keeptag | | checkbox | unchecked | When "Separate on tags" is selected, check this to keep the element with the tokens |
| pFormat | unix/regexp | radio button | unix | When "Pattern" is selected as the token type, indicates the pattern type: Unix-style pattern or regular expression |
| unix | | text | | When "Pattern" and "unix" are selected, enter a Unix-style pattern here |
| regexp | | text | | When "Pattern" and "regexp" are selected, enter a regular expression here |
| dispop | 1/2/3/4 | selection | 1 | How to treat separators: 1 = strip separator, 2 = keep separator as token, 3 = keep with previous token, 4 = keep with following token |
| HowToList | 1/2/3/4 | selection | 1 | How to display the results in the browser: 1 = HTML, 2 = XML text in HTML, 3 = XML tree, 4 = tab-delimited text |
| taporface | | checkbox | checked | Indicates whether the results are displayed with the Taporware interface |
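As a rough illustration of how these parameters fit together, the following Python sketch encodes a set of form fields for the CGI interface. Several things here are assumptions: the page does not give the address of the CGI script, so `CGI_URL` is a hypothetical placeholder; the function name `build_query` and the sample document URL are invented for the example; and whether the script expects GET or POST is not stated. Nothing is sent over the network.

```python
from urllib.parse import urlencode, parse_qs

# Hypothetical placeholder: the real CGI script address is not given on this page.
CGI_URL = "http://example.org/cgi-bin/tokenize.cgi"

def build_query(xmlurl, token="Word", xmlpath="//", dispop="1"):
    """Encode form fields using parameter names from the table above."""
    fields = {
        "source": "url",     # read the XML text from a URL, not an upload
        "xmlurl": xmlurl,    # address of the XML text to tokenize
        "xmlpath": xmlpath,  # element(s) to tokenize; "//" is the default
        "token": token,      # token type
        "dispop": dispop,    # separator treatment, 1 to 4
    }
    return urlencode(fields)

query = build_query("http://example.org/sample.xml", token="Sentence", dispop="3")
print(CGI_URL + "?" + query)
```

Parameters left out of the request, such as the output format, would presumably fall back to the defaults listed in the table.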
Web Service Interface
Taporware provides web services to any non-profit organization. Here is the Taporware web services information:
- Endpoint URL: http://taporware.mcmaster.ca:9982
- Service URI: http://taporware.mcmaster.ca/~taporware/webservice
- Service Method: tokenizer_XML
- Parameters:
- xmlInput -- any well-formed XML text
- element -- any valid XML element name or XPath in the submitted XML document
- attributeName -- an attribute name that exists in the XML document
- attributeValue -- an attribute value that exists in the XML document. You must specify attributeName if this one is given
- tokenType -- the token type: Word/Sentence/Paragraph/char/pat/elem
- option1 -- the value depends on "tokenType". If you select "character" as the token type, enter the character token. If you select "pattern" as the token type, enter unix or regexp (without quotation marks) to indicate the pattern type. If "element as separator" is selected, enter a valid XML element name.
- option2 -- used only when "pattern" or "element" is selected as the token type. If "pattern as separator" is selected, its value depends on "option1"; enter a pattern of the matching type. If "element as separator" is selected, enter "Yes" to keep the element with the token or "No" to ignore the element
- displayOption -- indicates how to treat separators. The values are the same as for the parameter "dispop" in the CGI interface above
- outFormat -- html for "HTML", xml for "XML tree", and any other value for "XML in HTML text"
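The page does not name the wire protocol. The combination of endpoint URL, service URI, and method name suggests a SOAP RPC service, so the sketch below assumes SOAP 1.1 and only builds a request envelope without sending it. The function name `build_envelope`, the parameter order (taken from the list above), and the use of the service URI as the method namespace are all assumptions to be checked against the actual service.

```python
import xml.etree.ElementTree as ET

ENDPOINT = "http://taporware.mcmaster.ca:9982"                      # from the list above
SERVICE_URI = "http://taporware.mcmaster.ca/~taporware/webservice"  # from the list above
SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"              # SOAP 1.1 (assumed)

# Parameter names in the order they are listed above (the order is an assumption).
PARAM_ORDER = [
    "xmlInput", "element", "attributeName", "attributeValue",
    "tokenType", "option1", "option2", "displayOption", "outFormat",
]

def build_envelope(**params):
    """Build a SOAP 1.1 envelope calling tokenizer_XML; unset parameters stay empty."""
    unknown = set(params) - set(PARAM_ORDER)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_URI}}}tokenizer_XML")
    for name in PARAM_ORDER:
        child = ET.SubElement(call, name)
        child.text = params.get(name, "")  # markup in xmlInput is escaped on output
    return ET.tostring(envelope, encoding="unicode")

envelope = build_envelope(
    xmlInput="<doc><p>Hi there</p></doc>",
    element="p",
    tokenType="Word",
    displayOption="1",
    outFormat="xml",
)
print(envelope)
```

A real client would POST this envelope to the endpoint URL with the headers the service expects; those details are not documented here, so they are left out.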
Known Bugs
To Do
--
LianYan - 26 Mar 2007