Main.TAPoRwarePlainTokenize (r1.1 vs. r1.4)
Diffs

Difference Topic TAPoRwarePlainTokenize (r1.4 - 12 Jun 2008 - LianYan)

META TOPICPARENT TAPoRware

Tokenize Text with Specified Separators

See http://taporware.mcmaster.ca/~taporware/textTools/tokenize.shtml
Line: 48 to 48

| HowToList | 1/2/3/4 | selection | 1 | Display format; the values correspond, in order, to HTML / XML text in HTML / XML tree / Tab delimited text |
| taporface | | checkbox | checked | Display the result in a new window without the graphics interface (default) or with the taporware interface in the same window |
Added:
>
>

Use Tokenizer TAPoRware Tool in Your Web Page

You can add a button to your web page that tokenizes the text of that page by calling the TAPoRware CGI script.

Here is the code that you can cut and paste to your web pages:

<form method="post" name="textForm" enctype="multipart/form-data" target="_blank" action="http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/ttokenize.cgi" onsubmit="document.textForm.texturl.value=document.location.href">
<input type="hidden" name="source" value="url" />
<input type="hidden" name="texturl" />
<input type="hidden" name="token" value="Word" />
<input type="hidden" name="dispop" value="1" />
<input type="hidden" name="HowToList" value="1" />
<input type="submit" name="doIt" value="Tokenizing" />
</form>


Web Service Interface

Taporware provides web services to any non-profit organization. Here is the taporware web services information:


Difference Topic TAPoRwarePlainTokenize (r1.3 - 29 Mar 2007 - LianYan)

META TOPICPARENT TAPoRware

Tokenize Text with Specified Separators

See http://taporware.mcmaster.ca/~taporware/textTools/tokenize.shtml
Line: 8 to 8

Description

Tool splits text document at specified points, or tokens. These tokens can be words, lines, sentences, and paragraphs, as well as certain characters or patterns. The results can be listed with the token removed, before the split, or after the split.
Deleted:
<
<

History


Pseudocode

Added:
>
>
  • Obtain submitted text string by URL or from user's local disk
  • If the text format is html or xml, strip off all the tags
  • Generate token or separator based on user selected token type
  • Perform tokenizer function
  • Arrange token and separator based on user's preference
  • Generate output format
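The steps above can be sketched in Python. This is only an illustration of the pseudocode, not the tool's actual implementation: the separator patterns in `SEPARATORS`, the function names, and the choice of tab-delimited output are all assumptions made for the example. The `dispop` values follow the CGI parameter of the same name (1 strip separator, 2 keep separator as token, 3 keep with previous token, 4 keep with following token).

```python
import re
from html.parser import HTMLParser


class _TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML/XML string."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    """Step 2: drop all tags and keep the text between them."""
    parser = _TagStripper()
    parser.feed(markup)
    parser.close()
    return "".join(parser.parts)


# Step 3: hypothetical separator patterns, one per token type.
SEPARATORS = {
    "Word": r"\W+",
    "Line": r"\n",
    "Sentence": r"(?<=[.!?])\s+",
    "Paragraph": r"\n\s*\n",
}


def tokenize(text, token_type="Word", dispop=1):
    """Steps 4 and 5: split the text, then arrange tokens and separators."""
    # The capturing group makes re.split return token, separator, token, ...
    pieces = re.split("(" + SEPARATORS[token_type] + ")", text)
    tokens, seps = pieces[0::2], pieces[1::2]
    if dispop == 1:      # strip separator
        out = tokens
    elif dispop == 2:    # keep separator as its own token
        out = pieces
    elif dispop == 3:    # keep separator with the previous token
        out = [t + s for t, s in zip(tokens, seps + [""])]
    elif dispop == 4:    # keep separator with the following token
        out = [s + t for s, t in zip([""] + seps, tokens)]
    else:
        raise ValueError("dispop must be 1, 2, 3 or 4")
    return [item for item in out if item]  # drop empty leftovers


def to_tab_delimited(tokens):
    """Step 6: one possible output format."""
    return "\t".join(tokens)
```

For example, `tokenize(strip_tags("<p>a, b</p>"), "Word", 3)` keeps each separator attached to the token before it.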

Ways of Using

Added:
>
>
  • Enter a valid URL in the URL field or enter (browse) a local path to upload source text
  • Select token type. If you select "Separate on characters" or "Separate on pattern", you need to provide the corresponding characters or pattern; a pattern can be given either in unix style or as a regular expression
  • Select how you want the separators to be treated
  • Select output format
  • If you want the results displayed in the same window with taporware interface, uncheck the check box - "Open results in new window"
  • Finally, click the "Submit" button
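To make the "Separate on characters" and "Separate on pattern" choices more concrete, here is a small Python sketch. It assumes that "unix style" means shell-style wildcards (`*` and `?`), which this page does not state, and the helper names are hypothetical; the tool itself may interpret patterns differently.

```python
import re


def glob_to_regex(glob):
    """Translate a shell-style wildcard pattern into a regular expression.

    Assumption: '*' matches any run of characters (as few as possible)
    and '?' matches exactly one character; everything else is literal.
    """
    out = []
    for ch in glob:
        if ch == "*":
            out.append(".*?")
        elif ch == "?":
            out.append(".")
        else:
            out.append(re.escape(ch))
    return "".join(out)


def split_on_characters(text, spec):
    """Split on any of the space-separated characters listed in spec."""
    chars = spec.split()
    if not chars:
        raise ValueError("at least one separator character is required")
    return re.split("|".join(re.escape(c) for c in chars), text)


def split_on_pattern(text, pattern, p_format="regexp"):
    """Split on a pattern given in 'unix' or 'regexp' form."""
    if p_format == "unix":
        pattern = glob_to_regex(pattern)
    elif p_format != "regexp":
        raise ValueError("p_format must be 'unix' or 'regexp'")
    return re.split(pattern, text)
```

With these helpers, `split_on_characters("a;b,c", "; ,")` splits on both `;` and `,`, and `split_on_pattern("a<b>c<d>e", "<*>", "unix")` treats each angle-bracketed run as a separator.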

CGI Interface

Added:
>
>
If you want to use this tool from your web site, here is the CGI Interface. (Note: if you want to upload local source text to the tool, you need to set the attribute enctype="multipart/form-data" on the form tag.)

Here are the parameters:

| Parameter Name | Parameter Value | Control Type | Default | Description |
| source | url/local | radio button | url | Lets the user select the input text (either a URL or an uploaded local HTML text) |
| texturl | | text | | A valid URL that points to the source text |
| localFile | | file | | The path to your local text file |
| token | Word/Line/Sentence/Paragraph/char/pat | radio button | Word | Token type; each value selects the token type it names |
| character | | text | | Characters, separated by spaces, to use as separators. This field must be filled if you select "Separate on characters" |
| pFormat | unix/regexp | radio button | | Kind of pattern to use as the token |
| unix | | text | | Unix styled pattern |
| regexp | | text | | Regular expression as token |
| dispop | 1/2/3/4 | radio button | 1 | The values correspond, in order, to strip separator / keep separator as token / keep with previous token / keep with following token |
| HowToList | 1/2/3/4 | selection | 1 | Display format; the values correspond, in order, to HTML / XML text in HTML / XML tree / Tab delimited text |
| taporface | | checkbox | checked | Display the result in a new window without the graphics interface (default) or with the taporware interface in the same window |
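As an illustration of how these parameters fit together, the following Python sketch assembles a POST request for the CGI script without sending it. Several things here are assumptions: the endpoint is the one used in the r1.4 form on this page, the script is assumed to accept a URL-encoded body when no file is uploaded, and `build_request` is a hypothetical helper name.

```python
from urllib.parse import urlencode, parse_qs
from urllib.request import Request

# Endpoint taken from the form example on this page (assumed still valid).
CGI_URL = "http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/ttokenize.cgi"

# Defaults as listed in the parameter table.
DEFAULTS = {"source": "url", "token": "Word", "dispop": "1", "HowToList": "1"}

# Allowed values as listed in the parameter table.
ALLOWED = {
    "source": {"url", "local"},
    "token": {"Word", "Line", "Sentence", "Paragraph", "char", "pat"},
    "dispop": {"1", "2", "3", "4"},
    "HowToList": {"1", "2", "3", "4"},
}


def build_request(texturl, **overrides):
    """Return an unsent POST request carrying the tokenizer parameters."""
    params = dict(DEFAULTS, texturl=texturl, **overrides)
    for name, allowed in ALLOWED.items():
        if params[name] not in allowed:
            raise ValueError("bad value for %s: %r" % (name, params[name]))
    # "Separate on characters" requires the character field.
    if params["token"] == "char" and not params.get("character"):
        raise ValueError("token=char requires the 'character' parameter")
    # Assumption: a URL-encoded body is accepted when nothing is uploaded.
    body = urlencode(params).encode("ascii")
    return Request(CGI_URL, data=body, method="POST")


if __name__ == "__main__":
    req = build_request("http://example.org/", token="Line", dispop="3")
    print(req.get_method(), req.full_url)
    print(parse_qs(req.data.decode("ascii")))
```

Passing the returned object to `urllib.request.urlopen` would submit it; uploading a local file would instead need a multipart/form-data body, as the note above says.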

Web Service Interface

Added:
>
>
Taporware provides web services to any non-profit organization. Here is the taporware web services information:

  • Endpoint URL: http://taporware.mcmaster.ca:9982
  • Service URI: http://taporware.mcmaster.ca/~taporware/webservice
  • Service Method: tokenizer_Plain
  • parameters:
    • textInput -- any text string
    • tokenType -- values can be Word/Line/Sentence/Paragraph/char/pat
    • option1 -- the value of this parameter depends on the "tokenType" you selected. If "Separate on characters" is selected, enter space-separated characters. If "Separate on pattern" is selected, enter "unix" or "regexp"
    • option2 -- if you entered "unix" in the option1 field, enter a unix styled pattern; if you entered "regexp" in the option1 field, enter a valid regular expression; otherwise, keep it empty
    • sorting -- value can be 1/2/3/4, corresponding to strip separator/keep separator as token/keep with previous token/keep with following token respectively
    • outFormat -- value can be 1/2/3, corresponding to HTML/XML text in HTML/XML tree respectively
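The way option1 and option2 depend on tokenType can be easier to follow in code. The Python sketch below only validates and assembles the arguments for tokenizer_Plain; it does not call the service, because this page does not say which protocol the endpoint speaks. The function name `tokenizer_plain_args` and its keyword names are hypothetical.

```python
TOKEN_TYPES = {"Word", "Line", "Sentence", "Paragraph", "char", "pat"}


def tokenizer_plain_args(text_input, token_type="Word", option1="",
                         option2="", sorting=1, out_format=1):
    """Check the documented rules and return the parameters as a dict."""
    if token_type not in TOKEN_TYPES:
        raise ValueError("unknown tokenType: %r" % token_type)
    if token_type == "char":
        # option1 holds the space-separated separator characters.
        if not option1.split():
            raise ValueError("tokenType=char needs characters in option1")
    elif token_type == "pat":
        # option1 names the pattern style, option2 holds the pattern.
        if option1 not in ("unix", "regexp"):
            raise ValueError("tokenType=pat needs option1 'unix' or 'regexp'")
        if not option2:
            raise ValueError("tokenType=pat needs a pattern in option2")
    if token_type != "pat" and option2:
        raise ValueError("option2 must stay empty unless tokenType is 'pat'")
    if sorting not in (1, 2, 3, 4):
        raise ValueError("sorting must be 1, 2, 3 or 4")
    if out_format not in (1, 2, 3):
        raise ValueError("outFormat must be 1, 2 or 3")
    return {
        "textInput": text_input,
        "tokenType": token_type,
        "option1": option1,
        "option2": option2,
        "sorting": sorting,
        "outFormat": out_format,
    }
```

For instance, `tokenizer_plain_args("a-b", "pat", "regexp", "-")` passes the checks, while the same call without a pattern in option2 raises a `ValueError`.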

Known Bugs

To Do

Changed:
<
<
-- MattPatey - 13 Oct 2005
>
>
-- LianYan - 29 Mar 2007


Difference Topic TAPoRwarePlainTokenize (r1.2 - 15 Oct 2005 - MattPatey)

META TOPICPARENT TAPoRware
Added:
>
>

Tokenize Text with Specified Separators

See http://taporware.mcmaster.ca/~taporware/textTools/tokenize.shtml

Description


Tool splits text document at specified points, or tokens. These tokens can be words, lines, sentences, and paragraphs, as well as certain characters or patterns. The results can be listed with the token removed, before the split, or after the split.
Added:
>
>

History

Pseudocode

Ways of Using

CGI Interface

Web Service Interface

Known Bugs

To Do


-- MattPatey - 13 Oct 2005

Difference Topic TAPoRwarePlainTokenize (r1.1 - 13 Oct 2005 - MattPatey)
Line: 1 to 1
Added:
>
>
META TOPICPARENT TAPoRware
Tool splits text document at specified points, or tokens. These tokens can be words, lines, sentences, and paragraphs, as well as certain characters or patterns. The results can be listed with the token removed, before the split, or after the split.

-- MattPatey - 13 Oct 2005



Revision r1.1 - 13 Oct 2005 - 22:42 - MattPatey
Revision r1.4 - 12 Jun 2008 - 20:29 - LianYan