Tokenize Text with Specified Separators
See
http://taporware.mcmaster.ca/~taporware/textTools/tokenize.shtml
Description
The tool splits a text document into tokens at specified separators. The text can be split into words, lines, sentences, or paragraphs, or at user-specified characters or patterns. In the results, each separator can be removed, kept as a token of its own, kept with the previous token, or kept with the following token.
Pseudocode
- Obtain submitted text string by URL or from user's local disk
- If the text format is html or xml, strip off all the tags
- Generate token or separator based on user selected token type
- Perform tokenizer function
- Arrange token and separator based on user's preference
- Generate output format
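The steps above can be sketched in Python. This is only an illustrative sketch, not the tool's actual implementation: the names `strip_tags`, `tokenize`, and `SEPARATORS` are hypothetical, the separator patterns are assumptions (sentence splitting is omitted), and the `dispop` numbering simply follows the four separator options listed in the CGI parameters.

```python
import re
from html.parser import HTMLParser


class _TagStripper(HTMLParser):
    """Collects only the text content of an HTML/XML document."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    """Return the markup with all tags removed."""
    parser = _TagStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


# Assumed separator patterns for some of the built-in token types.
SEPARATORS = {
    "Word": r"\s+",
    "Line": r"\n",
    "Paragraph": r"\n\s*\n",
}


def tokenize(text, separator, dispop=1):
    """Split text on a regular-expression separator.

    dispop: 1 strip separator, 2 keep separator as token,
            3 keep with previous token, 4 keep with following token.
    """
    tokens, seps = [], []
    pos = 0
    for match in re.finditer(separator, text):
        if match.start() == match.end():
            continue  # ignore empty matches
        tokens.append(text[pos:match.start()])
        seps.append(match.group(0))
        pos = match.end()
    tokens.append(text[pos:])  # tokens always has len(seps) + 1 items

    if dispop == 1:
        out = tokens
    elif dispop == 2:
        out = []
        for i, tok in enumerate(tokens):
            out.append(tok)
            if i < len(seps):
                out.append(seps[i])
    elif dispop == 3:
        out = [tok + (seps[i] if i < len(seps) else "")
               for i, tok in enumerate(tokens)]
    elif dispop == 4:
        out = [(seps[i - 1] if i > 0 else "") + tok
               for i, tok in enumerate(tokens)]
    else:
        raise ValueError("dispop must be 1, 2, 3, or 4")
    return [item for item in out if item]  # drop empty strings
```

For example, `tokenize(strip_tags("<p>Hello <b>world</b></p>"), SEPARATORS["Word"])` yields the two words, and `tokenize("a,b,c", ",", dispop=3)` attaches each comma to the token before it.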
Ways of Using
- Enter a valid URL in the URL field or enter (browse) a local path to upload source text
- Select token type. If you select "Separate on characters" or "Separate on pattern", you must also provide the corresponding characters or pattern (a pattern can be Unix-style or a regular expression)
- Select how you want the separators to be treated
- Select output format
- If you want the results displayed in the same window with the TAPoRware interface, uncheck the "Open results in new window" check box
- Finally, click the "Submit" button
CGI Interface
If you want to use this tool from your web site, here is the CGI Interface:
(Note: To upload local source text to the tool, the form tag must include the attribute name/value pair enctype="multipart/form-data".)
Here are the parameters:
| Parameter Name | Parameter Value | Control Type | Default | Description |
| source | url/local | radio button | url | Lets the user select the input text source (either a URL or an uploaded local file) |
| texturl | | text | | A valid URL that points to the source text |
| localFile | | file | | The path to your local text file |
| token | Word/Line/Sentence/Paragraph/char/pat | radio button | Word | Token type. char means "Separate on characters" and pat means "Separate on pattern"; the other values split on the unit they name |
| character | | text | | Characters to separate on, separated by spaces. This field must be filled in if you select "Separate on characters" |
| pFormat | unix/regexp | radio button | | Pattern style to use when separating on a pattern |
| unix | | text | | Unix-style pattern |
| regexp | | text | | Regular expression to use as the separator |
| dispop | 1/2/3/4 | radio button | 1 | Separator handling: 1 = strip separator, 2 = keep separator as token, 3 = keep with previous token, 4 = keep with following token |
| HowToList | 1/2/3/4 | selection | 1 | Output format: 1 = HTML, 2 = XML text in HTML, 3 = XML tree, 4 = tab-delimited text |
| taporface | | checkbox | checked | When checked (default), display the result in a new window without the graphical interface; when unchecked, display it in the same window with the TAPoRware interface |
Use Tokenizer TAPoRware Tool in Your Web Page
You can add a button to your web page that tokenizes the text of that page by calling the TAPoRware CGI script.
Here is the code that you can cut and paste to your web pages:
<form method="post" name="textForm" enctype="multipart/form-data" target="_blank" action="http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/ttokenize.cgi" onsubmit="document.textForm.texturl.value=document.location.href">
<input type="hidden" name="source" value="url" />
<input type="hidden" name="texturl" />
<input type="hidden" name="token" value="Word" />
<input type="hidden" name="dispop" value="1" />
<input type="hidden" name="HowToList" value="1" />
<input type="submit" name="doIt" value="Tokenizing" />
</form>
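The fields submitted by the form can also be assembled from a script. The sketch below builds (but does not send) a POST request with Python's standard library. The helper name `build_tokenize_request` is hypothetical, and the sketch assumes the CGI script accepts URL-encoded form data when no local file is uploaded; the note above only states that multipart/form-data is needed for uploads, so this may not hold in practice.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Action URL taken from the form above.
CGI_URL = "http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/ttokenize.cgi"


def build_tokenize_request(text_url, token="Word", dispop=1, how_to_list=1):
    """Build a POST request carrying the same fields as the form.

    Assumption: URL-encoded form data is accepted for source=url.
    """
    fields = {
        "source": "url",            # read the text from a URL
        "texturl": text_url,        # URL of the source text
        "token": token,             # Word/Line/Sentence/Paragraph/char/pat
        "dispop": str(dispop),      # separator handling, 1 to 4
        "HowToList": str(how_to_list),  # output format, 1 to 4
    }
    body = urlencode(fields).encode("ascii")
    return Request(CGI_URL, data=body, method="POST")
```

Passing the returned object to `urllib.request.urlopen` would submit it, provided the service is reachable.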
Web Service Interface
TAPoRware provides web services to any non-profit organization. Here is the TAPoRware web service information:
- Endpoint URL: http://taporware.mcmaster.ca:9982
- Service URI: http://taporware.mcmaster.ca/~taporware/webservice
- Service Method: tokenizer_Plain
- parameters:
- textInput -- any text string
- tokenType -- values can be Word/Line/Sentence/Paragraph/char/pat
- option1 -- the value of this parameter depends on the "tokenType" you selected. If "Separate on characters" (char) is selected, enter the characters separated by spaces. If "Separate on pattern" (pat) is selected, enter "unix" or "regexp"
- option2 -- if you entered "unix" in option1, enter a Unix-style pattern; if you entered "regexp" in option1, enter a valid regular expression; otherwise, leave it empty
- sorting -- value can be 1/2/3/4, corresponding to strip separator/keep separator as token/keep with previous token/keep with following token respectively
- outFormat -- value can be 1/2/3, corresponding to HTML/XML text in HTML/XML tree respectively
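The dependencies between these parameters can be checked before calling the service. The following Python sketch only validates and assembles the arguments for tokenizer_Plain; it does not call the web service, because the wire protocol is not described here. The helper name `build_tokenizer_params` and its defaults are hypothetical.

```python
TOKEN_TYPES = {"Word", "Line", "Sentence", "Paragraph", "char", "pat"}


def build_tokenizer_params(text_input, token_type="Word", option1="",
                           option2="", sorting=1, out_format=1):
    """Validate and assemble the arguments for tokenizer_Plain."""
    if token_type not in TOKEN_TYPES:
        raise ValueError("unknown tokenType: %r" % (token_type,))
    if token_type == "char":
        # option1 holds the space-separated separator characters.
        if not option1.strip():
            raise ValueError("option1 must list the characters")
        option2 = ""
    elif token_type == "pat":
        # option1 names the pattern style, option2 holds the pattern.
        if option1 not in ("unix", "regexp"):
            raise ValueError('option1 must be "unix" or "regexp"')
        if not option2:
            raise ValueError("option2 must hold the pattern")
    else:
        # Neither option applies to the built-in token types.
        option1 = option2 = ""
    if sorting not in (1, 2, 3, 4):
        raise ValueError("sorting must be 1, 2, 3, or 4")
    if out_format not in (1, 2, 3):
        raise ValueError("outFormat must be 1, 2, or 3")
    return {
        "textInput": text_input,
        "tokenType": token_type,
        "option1": option1,
        "option2": option2,
        "sorting": sorting,
        "outFormat": out_format,
    }
```

For instance, `build_tokenizer_params("a-b", "pat", "regexp", "-")` returns a complete parameter set, while passing "pat" with an empty option2 raises a `ValueError`.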
Known Bugs
To Do
--
LianYan - 29 Mar 2007