Tokenize XML document or specific elements

See http://taporware.mcmaster.ca/~taporware/xmlTools/tokenize.shtml

Description

This tool splits an XML document at specified points, or tokens. These tokens can be words, lines, sentences, paragraphs, characters, patterns, or tags. The results can be listed with the token removed or preserved before or after the split.
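The four ways of treating a separator (strip it, keep it as its own token, keep it with the previous token, keep it with the following token) can be illustrated with a small sketch. This is not the tool's own code; the function name and the mode names "strip", "token", "previous" and "following" are made up for the illustration.

```python
import re


def split_with_separators(text, pattern, mode="strip"):
    """Split text on a regex pattern and treat the separators per mode.

    The mode names are hypothetical labels for the four separator
    treatments; they are not the tool's parameter values.
    """
    pieces, seps, pos = [], [], 0
    # Walk over every separator match, collecting the text between matches.
    for match in re.finditer(pattern, text):
        pieces.append(text[pos:match.start()])
        seps.append(match.group(0))
        pos = match.end()
    pieces.append(text[pos:])  # text after the last separator

    if mode == "strip":
        out = pieces
    elif mode == "token":
        out = []
        for piece, sep in zip(pieces, seps):
            out += [piece, sep]
        out.append(pieces[-1])
    elif mode == "previous":
        out = [p + s for p, s in zip(pieces, seps)] + [pieces[-1]]
    elif mode == "following":
        out = [pieces[0]] + [s + p for s, p in zip(seps, pieces[1:])]
    else:
        raise ValueError(f"unknown mode: {mode}")
    # Drop empty strings produced by adjacent or leading/trailing separators.
    return [item for item in out if item]


print(split_with_separators("a,b;c", "[,;]", "strip"))      # ['a', 'b', 'c']
print(split_with_separators("a,b;c", "[,;]", "token"))      # ['a', ',', 'b', ';', 'c']
print(split_with_separators("a,b;c", "[,;]", "previous"))   # ['a,', 'b;', 'c']
print(split_with_separators("a,b;c", "[,;]", "following"))  # ['a', ',b', ';c']
```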

Pseudocode

  • Get the user-submitted XML text from the specified URL on the internet or from the user's local disk
  • Get the sub-text the user wants to tokenize, specified by element name and/or attribute
  • Run the tokenizer program
  • Format the output
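The steps above can be sketched in Python as follows. This is only an illustration under assumptions, not the tool's implementation: it takes the XML text as a string (fetching from a URL or disk is left out), selects elements by tag name rather than full XPath, and only tokenizes into words. The function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET


def tokenize_xml(xml_text, element=None, attr_name=None, attr_value=None):
    """Return the word tokens found in the selected part of an XML text."""
    root = ET.fromstring(xml_text)

    # Select the sub-text: the whole document, or every element with the tag.
    # Note: nested elements with the same tag would contribute their text twice.
    nodes = [root] if element is None else list(root.iter(element))

    # Optionally narrow the selection by attribute name and value.
    if attr_name is not None:
        nodes = [
            n for n in nodes
            if attr_name in n.attrib
            and (attr_value is None or n.get(attr_name) == attr_value)
        ]

    # Run a very simple word tokenizer over the text of each selected node.
    tokens = []
    for node in nodes:
        text = "".join(node.itertext())
        tokens.extend(re.findall(r"\w+", text))
    return tokens


def format_tab_delimited(tokens):
    """Format tokens as numbered, tab-delimited lines."""
    return "\n".join(f"{i}\t{tok}" for i, tok in enumerate(tokens, 1))


sample = '<doc><p type="a">Hello world.</p><p type="b">Bye now</p></doc>'
print(format_tab_delimited(tokenize_xml(sample, "p", "type", "b")))
```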

Ways of Using

  • Enter a valid URL in the URL field or upload a local XML text
  • Enter a valid XML element name or XPath that exists in the XML text; the default is "//"
  • Select the token type
  • Select how to treat separators
  • Select the output format
  • Click the submit button

CGI Interface

If you want to use this tool from your own web site, use the CGI interface described here. (Note: to upload a local XML text to the tool, the form tag must include the attribute enctype="multipart/form-data".)

Here are the parameters:

| Parameter Name | Parameter Value | Control Type | Default | Description |
| source | url/local | radio button | url | Lets the user select the input text (either a URL or an uploaded local XML text) |
| xmlurl | | text | | A valid URL pointing to an XML text |
| localFile | | file | | The path to your local XML text file |
| xmlpath | | text | // | A valid XML element (tag) name, or multiple XML element names separated by commas |
| attr_name | | text | | A valid XML attribute name |
| attr_value | | text | | A valid XML attribute value |
| token | Word/Line/Sentence/Paragraph/char/elem/pat | radio button | Word | A radio button group that lets you select the token type |
| character | | text | | If you select "characters" as the token type, you need to fill in this field |
| element | | text | | If "Separate on tags" is selected, this field needs to be filled with a valid XML tag of the submitted XML document |
| keeptag | | checkbox | unchecked | Check it to keep the element with the tokens when "Separate on tags" is selected |
| pFormat | unix/regexp | radio button | unix | When "Pattern" is selected as the token type, this control indicates the pattern type: a Unix-style pattern or a regular expression |
| unix | | text | | When "Pattern" and "Unix" are selected, enter a Unix-style pattern here |
| regexp | | text | | When "Pattern" and "regexp" are selected, enter a regular expression here |
| dispop | 1/2/3/4 | selection | 1 | Indicates how to treat separators. In order of parameter value: strip separator, keep separator as token, keep with previous token, keep with following token |
| HowToList | 1/2/3/4 | selection | 1 | Indicates how to display the results in the browser. In order of parameter value: HTML, XML text in HTML, XML tree, tab-delimited text |
| taporface | | checkbox | checked | Indicates whether the result is displayed with the taporware interface |
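As a rough illustration of how these parameters fit together, the sketch below assembles a URL-encoded parameter string for tokenizing a remote XML text. The address of the CGI script itself is not given on this page, so the sketch stops at the encoded string. The function name is hypothetical, and the display parameter is written as HowToList on the assumption that a trailing "?" next to that name is wiki markup rather than part of the name.

```python
from urllib.parse import urlencode


def build_tokenize_query(xml_url, xmlpath="//", token="Word",
                         dispop=1, how_to_list=1):
    """Build a URL-encoded parameter string for the CGI interface.

    Defaults mirror the defaults listed in the parameter table.
    """
    params = {
        "source": "url",        # read the input from a URL, not an upload
        "xmlurl": xml_url,
        "xmlpath": xmlpath,
        "token": token,
        "dispop": str(dispop),  # 1 = strip separator
        # Assumed parameter name, written without any trailing "?".
        "HowToList": str(how_to_list),  # 1 = HTML
    }
    return urlencode(params)


# The input URL is a placeholder for illustration only.
print(build_tokenize_query("http://example.org/sample.xml", token="Sentence"))
```

Uploading a local file instead would need a multipart/form-data request, which this sketch does not cover.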

Web Service Interface

Taporware provides web services to any non-profit organization. Here is the taporware web services information:

  • Endpoint URL: http://taporware.mcmaster.ca:9982
  • Service URI: http://taporware.mcmaster.ca/~taporware/webservice
  • Service Method: tokenizer_XML
  • parameters:
    • xmlInput -- any well-formed XML text
    • element -- any valid XML element names or XPath in the submitted XML document
    • attributeName -- an attribute name that exists in the XML document
    • attributeValue -- an attribute value that exists in the XML document. You must specify the attribute name if this one is given
    • tokenType -- a selection for the token type: Word/Sentence/Paragraph/char/pat/elem
    • option1 -- its value depends on "tokenType". If you select "character" as the token type, fill this field with the character token. If you select "pattern" as the token type, enter unix or regexp (without quotation marks) in this field to indicate the pattern type. If "element as separator" is selected, enter a valid XML element name in this field.
    • option2 -- used only when "pattern" or "element" is selected as the token type. If "pattern as separator" is selected, its value depends on "option1"; enter the matching kind of pattern accordingly. If "element as separator" is selected, enter "Yes" to keep the element with the token, or "No" to ignore the element
    • displayOption -- indicates how to treat separators. The values are the same as for the parameter "dispop" in the CGI interface above
    • outFormat -- values are html for "HTML", xml for "XML tree", and anything else for "XML in HTML text"
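The dependencies between these parameters can be checked before calling tokenizer_XML. The sketch below only assembles and validates the argument set; the wire protocol of the web service is not stated here, so no request is sent. The function name and the exact checks are assumptions drawn from the parameter descriptions above.

```python
TOKEN_TYPES = {"Word", "Sentence", "Paragraph", "char", "pat", "elem"}


def build_tokenizer_args(xml_input, element, attribute_name="",
                         attribute_value="", token_type="Word",
                         option1="", option2="", display_option=1,
                         out_format="html"):
    """Validate and collect the arguments for the tokenizer_XML method."""
    if token_type not in TOKEN_TYPES:
        raise ValueError(f"unknown tokenType: {token_type}")
    if attribute_value and not attribute_name:
        raise ValueError("attributeValue requires attributeName")
    if token_type == "char" and not option1:
        raise ValueError("option1 must hold the character token")
    if token_type == "pat":
        if option1 not in ("unix", "regexp"):
            raise ValueError("option1 must be unix or regexp")
        if not option2:
            raise ValueError("option2 must hold the pattern")
    if token_type == "elem":
        if not option1:
            raise ValueError("option1 must hold an element name")
        if option2 not in ("Yes", "No"):
            raise ValueError("option2 must be Yes or No")
    if display_option not in (1, 2, 3, 4):
        raise ValueError("displayOption must be 1, 2, 3 or 4")
    # Keys follow the parameter names listed for the web service.
    return {
        "xmlInput": xml_input,
        "element": element,
        "attributeName": attribute_name,
        "attributeValue": attribute_value,
        "tokenType": token_type,
        "option1": option1,
        "option2": option2,
        "displayOption": display_option,
        "outFormat": out_format,
    }


args = build_tokenizer_args("<doc><p>One two</p></doc>", "p",
                            token_type="pat", option1="regexp",
                            option2=r"\s+")
print(args["tokenType"], args["option1"])
```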

Known Bugs

To Do

-- LianYan - 26 Mar 2007

