Main.TAPoRwareHTMLTokenize (r1.1 vs. r1.11)
Diffs

 <<O>>  Difference Topic TAPoRwareHTMLTokenize (r1.11 - 06 Jun 2008 - LianYan)

META TOPICPARENT TAPoRware

Tokenize HTML Document

See http://taporware.mcmaster.ca/~taporware/htmlTools/tokenize.shtml
Line: 62 to 62

HowToList 1/2/3 selection 1 Display format: HTML, XML text in HTML, or XML tree, in the order of the parameter values
taporface   checkbox checked display result in a new window without graphics interface (default) or with taporware interface in the same window
Added:
>
>

Use Tokenizer TAPoRware Tool in Your Web Page

You can add a button to your web page that tokenizes the page (into words) by calling the TAPoRware CGI script.

Here is the code for the tool interface:

<form method="post" name="htmlForm" enctype="multipart/form-data" target="_blank" action="http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/htokenize.cgi" onsubmit="document.htmlForm.htmlurl.value=document.location.href">

<input type="hidden" name="source" value="url" />

<input type="hidden" name="htmlurl" />

<input type="hidden" name="freetext" value="yes"/>

<input type="hidden" name="tagtext" value="body" />

<input type="hidden" name="token" value="Word" />

<input type="hidden" name="dispop" value="1" />

<input type="hidden" name="HowToList" value="1" />

<input type="hidden" name="taporface" value="same" />

<input type="submit" name="doit" value="Tokenize This Page" />

</form>


Web Service Interface

TAPoRware provides web services to any non-profit organization. Here is the TAPoRware web services information:


 <<O>>  Difference Topic TAPoRwareHTMLTokenize (r1.10 - 29 Mar 2007 - LianYan)

META TOPICPARENT TAPoRware

Tokenize HTML Document

See http://taporware.mcmaster.ca/~taporware/htmlTools/tokenize.shtml
Line: 32 to 32

Ways of Using

  • Enter a valid URL in the URL field, or upload a local HTML file
Changed:
<
<
  • Enter a valid html tag or tag list seperated by comma, default is "body"
>
>
  • Enter a valid html tag or tag list separated by comma, default is "body"

  • Select token type. If you select "Separate on characters" or "Separate on pattern", you need to provide the corresponding characters or pattern -- in unix style or regular expression
Changed:
<
<
  • Select how you want the separators tb be treated
>
>
  • Select how you want the separators to be treated

  • Select output format
  • If you want the results displayed in the same window with taporware interface, uncheck the check box - "Open results in new window"
  • Finally, click the "Submit" button
Line: 48 to 48

Here are the parameters:

Changed:
<
<
Parameter Name Parameter Value Control Type Default Discription
>
>
Parameter Name Parameter Value Control Type Default Description

source url/local radio button url Let user select input text (either a url or upload local html text)
htmlurl   text   A valid URL pointing to an HTML document
localFile   file   The path to your local html text file
Changed:
<
<
tagtext   text body Valid html element (tag) name or multple html element name separated by comma
>
>
tagtext   text body Valid html element (tag) name or multiple html element name separated by comma

token /Word/Line/Sentence/Paragraph/char/pat radio button Word pattern types as the values mean
character   text   characters separated by space as token type
pFormat unix/regexp radio button   type of token as pattern to select
unix   text   unix styled pattern
regexp   text   regular expression as token
Changed:
<
<
dispop 1/2/3/4 radio button 1 the options corresonding to the parameter values are: strip separator/keep separator as token/keep with previous token/keep with following token
>
>
dispop 1/2/3/4 radio button 1 the options corresponding to the parameter values are: strip separator/keep separator as token/keep with previous token/keep with following token

HowToList? 1/2/3 selection 1 Display format which are HTML/XML text in HTML/XML tree in the order of parameter values
taporface   checkbox checked display result in a new window without graphics interface (default) or with taporware interface in the same window

Web Service Interface

Changed:
<
<
Taporware provides web services to any non-benefit organizations. here is the taporware web services infomation:
>
>
Taporware provides web services to any non-benefit organizations. here is the taporware web services information:

Changed:
<
<
    • htmlTag -- any html element (tag) name or multple html element name separated by comma
>
>
    • htmlTag -- any html element (tag) name or multiple html element name separated by comma

    • tokenType -- values can be Word/Line/Sentence/Paragraph/char/pat
    • option1 -- the value of this parameter depends on the "tokenType" you used. If "char" is used, enter space-separated characters. If "pat" is used, enter "unix" or "regexp"
    • option2 -- if you entered "unix" in the option1 field, enter a unix-style pattern; if you entered "regexp" in the option1 field, enter a valid regular expression; otherwise, leave it empty.

 <<O>>  Difference Topic TAPoRwareHTMLTokenize (r1.9 - 18 Jul 2006 - LianYan)

META TOPICPARENT TAPoRware

Tokenize HTML Document

See http://taporware.mcmaster.ca/~taporware/htmlTools/tokenize.shtml
Line: 66 to 66

Taporware provides web services to any non-benefit organizations. here is the taporware web services infomation:

Changed:
<
<
>
>

  • Service Method: tokenizer_HTML
  • parameters:
    • htmlInput -- any html string

 <<O>>  Difference Topic TAPoRwareHTMLTokenize (r1.8 - 11 Jul 2006 - LianYan)

META TOPICPARENT TAPoRware

Tokenize HTML Document

See http://taporware.mcmaster.ca/~taporware/htmlTools/tokenize.shtml
Line: 13 to 13

  • token: The basic syntactic unit of a text. A token consists of one or more characters, excluding the blank character and excluding characters within a string constant or delimited identifier.
  • tokenizer: A parsing program that scans text and determines when and if a series of characters can be recognized as a token.
Changed:
<
<

History

>
>

Predefined Parameter Values in Tool Bar


Added:
>
>
  • Source: the page the user is currently in.
  • Element: body or set by site owner
  • Token type: words
  • Display option: strip separator
  • Display format: HTML

Pseudocode

  • Obtain HTML string by URL or from user's local disk

 <<O>>  Difference Topic TAPoRwareHTMLTokenize (r1.7 - 21 Jun 2006 - LianYan)

META TOPICPARENT TAPoRware

Tokenize HTML Document

See http://taporware.mcmaster.ca/~taporware/htmlTools/tokenize.shtml
Line: 8 to 8

Description

This tool splits an HTML document at specified points, or tokens. These tokens can be words, lines, sentences, and paragraphs, as well as certain characters, patterns, or tags. The results can be listed with the token removed, before the split, or after the split.
Added:
>
>

Term Definition

  • token: The basic syntactic unit of a text. A token consists of one or more characters, excluding the blank character and excluding characters within a string constant or delimited identifier.
  • tokenizer: A parsing program that scans text and determines when and if a series of characters can be recognized as a token.

History

Pseudocode


 <<O>>  Difference Topic TAPoRwareHTMLTokenize (r1.6 - 06 Feb 2006 - LianYan)

META TOPICPARENT TAPoRware

Tokenize HTML Document

See http://taporware.mcmaster.ca/~taporware/htmlTools/tokenize.shtml
Line: 31 to 31

CGI Interface

Changed:
<
<
If you want to use this tool from you web site, here is the CGI Interface:
>
>
If you want to use this tool from your web site, here is the CGI Interface:

(Note: If you want to upload local html text to the tool, you need to use attribute name/value pair: enctype="multipart/form-data" within the form tag)

Line: 49 to 49

unix   text   unix styled pattern
regexp   text   regular expression as token
dispop 1/2/3/4 radio button 1 the options corresonding to the parameter values are: strip separator/keep separator as token/keep with previous token/keep with following token
Changed:
<
<
HowToList? 1/2/3 selection 1 Display foemat which are HTML/XML text in HTML/XML tree in the order of parameter values
>
>
HowToList? 1/2/3 selection 1 Display format which are HTML/XML text in HTML/XML tree in the order of parameter values

taporface   checkbox checked display result in a new window without graphics interface (default) or with taporware interface in the same window

Web Service Interface


 <<O>>  Difference Topic TAPoRwareHTMLTokenize (r1.5 - 22 Dec 2005 - LianYan)

META TOPICPARENT TAPoRware

Tokenize HTML Document

See http://taporware.mcmaster.ca/~taporware/htmlTools/tokenize.shtml
Line: 21 to 21

Ways of Using

Added:
>
>
  • Enter a valid URL in the URL field or enter a local upload html text
  • Enter a valid html tag or tag list seperated by comma, default is "body"
  • Select token type. If you select "Separate on characters" or "Separate on pattern", you need to provide the corresponding characters or pattern -- in unix style or regular expression
  • Select how you want the separators tb be treated
  • Select output format
  • If you want the results displayed in the same window with taporware interface, uncheck the check box - "Open results in new window"
  • Finally, click the "Submit" button

CGI Interface

Added:
>
>
If you want to use this tool from you web site, here is the CGI Interface: (Note: If you want to upload local html text to the tool, you need to use attribute name/value pair: enctype="multipart/form-data" within the form tag)

Here are the parameters:

| Parameter Name | Parameter Value | Control Type | Default | Discription |
| source | url/local | radio button | url | Let user select input text (either a url or upload local html text) |
| htmlurl | | text | | A Valid URL that the pointed document should be an html text |
| localFile | | file | | The path to your local html text file |
| tagtext | | text | body | Valid html element (tag) name or multple html element name separated by comma |
| token | /Word/Line/Sentence/Paragraph/char/pat | radio button | Word | pattern types as the values mean |
| character | | text | | characters separated by space as token type |
| pFormat | unix/regexp | radio button | | type of token as pattern to select |
| unix | | text | | unix styled pattern |
| regexp | | text | | regular expression as token |
| dispop | 1/2/3/4 | radio button | 1 | the options corresonding to the parameter values are: strip separator/keep separator as token/keep with previous token/keep with following token |
| HowToList | 1/2/3 | selection | 1 | Display foemat which are HTML/XML text in HTML/XML tree in the order of parameter values |
| taporface | | checkbox | checked | display result in a new window without graphics interface (default) or with taporware interface in the same window |
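
The parameters listed above can also be assembled by a script rather than an HTML form. The following is a minimal sketch in Python, not part of TAPoRware itself: the function name build_form_fields is hypothetical, and the URL-encoded body is an assumption, since the form shown on this page posts multipart/form-data and it is not confirmed that the CGI script accepts other encodings. The sketch only builds and checks the field values; it does not contact the server.

```python
from urllib.parse import urlencode

# Values taken from the parameter table; the helper itself is illustrative.
TOKEN_TYPES = ("Word", "Line", "Sentence", "Paragraph", "char", "pat")


def build_form_fields(htmlurl, tagtext="body", token="Word",
                      dispop=1, how_to_list=1):
    """Return the CGI form fields for tokenizing a document given by URL."""
    if token not in TOKEN_TYPES:
        raise ValueError("token must be one of " + "/".join(TOKEN_TYPES))
    if dispop not in (1, 2, 3, 4):
        raise ValueError("dispop must be 1, 2, 3 or 4")
    if how_to_list not in (1, 2, 3):
        raise ValueError("HowToList must be 1, 2 or 3")
    return {
        "source": "url",          # read the input from a URL, not an upload
        "htmlurl": htmlurl,
        "tagtext": tagtext,       # one tag name or a comma separated list
        "token": token,
        "dispop": str(dispop),
        "HowToList": str(how_to_list),
    }


fields = build_form_fields("http://example.org/page.html", tagtext="p,li")
body = urlencode(fields)  # assumption: a URL-encoded body, see note above
print(body)
```

Uploading a local file (source=local with localFile) would need a multipart/form-data body instead, as the note above the table says.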

Web Service Interface

Added:
>
>
Taporware provides web services to any non-benefit organizations. here is the taporware web services infomation:

  • Endpoint URL: http://strange.mcmaster.ca:9982
  • Service URI: http://strange.mcmaster.ca/~taporware/webservice
  • Service Method: tokenizer_HTML
  • parameters:
    • htmlInput -- any html string
    • htmlTag -- any html element (tag) name or multple html element name separated by comma
    • tokenType -- values can be Word/Line/Sentence/Paragraph/char/pat
    • option1 -- the value of this parameter depend on the "tokenType" you used. If "char" is used, you should enter space separated characters. if "pat" is used, you should enter "unix" or "regexp"
    • option2 -- if you enter "unix" in the option1 field, enter a unix styled pattern, if you enter "regexp" in the option1 field, enter a valid regular expression, otherwise, keep it empty.
    • displayOption -- value can be 1/2/3/4 which is corresponding to strip separator/keep separator as token/keep with previous token/keep with following token
    • outFormat -- value can be 1/2/3 which is corresponding to HTML/XML text in HTML/XML tree respectively
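
The rules linking tokenType, option1 and option2 are easy to get wrong, so a caller may want to check them before invoking tokenizer_HTML. Below is a minimal sketch in Python under stated assumptions: the function tokenizer_html_args is hypothetical, it performs no web service call, and the defaults ("body", "Word", 1, 1) are borrowed from the CGI parameter table rather than documented for the web service. It also assumes option2 must be empty whenever tokenType is not "pat".

```python
TOKEN_TYPES = ("Word", "Line", "Sentence", "Paragraph", "char", "pat")


def tokenizer_html_args(html_input, html_tag="body", token_type="Word",
                        option1="", option2="",
                        display_option=1, out_format=1):
    """Validate the documented parameters and return them as a dict."""
    if token_type not in TOKEN_TYPES:
        raise ValueError("tokenType must be one of " + "/".join(TOKEN_TYPES))
    if token_type == "char" and not option1.split():
        raise ValueError("'char' needs space separated characters in option1")
    if token_type == "pat":
        if option1 not in ("unix", "regexp"):
            raise ValueError("'pat' needs option1 set to 'unix' or 'regexp'")
        if not option2:
            raise ValueError("'pat' needs the pattern itself in option2")
    elif option2:
        # Assumption: option2 is only meaningful for pattern tokens.
        raise ValueError("option2 must stay empty unless tokenType is 'pat'")
    if display_option not in (1, 2, 3, 4):
        raise ValueError("displayOption must be 1, 2, 3 or 4")
    if out_format not in (1, 2, 3):
        raise ValueError("outFormat must be 1, 2 or 3")
    return {
        "htmlInput": html_input,
        "htmlTag": html_tag,
        "tokenType": token_type,
        "option1": option1,
        "option2": option2,
        "displayOption": display_option,
        "outFormat": out_format,
    }


# Split the text of every <p> element on runs of whitespace.
args = tokenizer_html_args("<p>one two</p>", "p", "pat", "regexp", r"\s+")
print(args)
```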

Known Bugs

To Do


 <<O>>  Difference Topic TAPoRwareHTMLTokenize (r1.4 - 21 Dec 2005 - LianYan)

META TOPICPARENT TAPoRware

Tokenize HTML Document

See http://taporware.mcmaster.ca/~taporware/htmlTools/tokenize.shtml
Line: 12 to 12

Pseudocode

Added:
>
>
  • Obtain HTML string by URL or from user's local disk
  • Obtain text contained by user specified tags
  • Generate token or separator based on user selected token type
  • Perform tokenizer function
  • Arrange token and separator based on user's preference
  • Generate output format
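
The steps above can be sketched in a few lines of Python. This is an illustration only, not the TAPoRware implementation: the names TagTextExtractor, extract_text and tokenize are hypothetical, the separator is assumed to be a regular expression without capturing groups, and fetching the document and generating HTML/XML output are left out. The display_option values follow the dispop parameter: 1 strip separator, 2 keep separator as token, 3 keep with previous token, 4 keep with following token.

```python
import re
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collect the text found inside any of the requested tags."""

    def __init__(self, tags):
        super().__init__()
        self.tags = {t.strip().lower() for t in tags.split(",")}
        self.depth = 0      # number of requested tags currently open
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.tags:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.tags and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def extract_text(html, tags="body"):
    """Return the text contained by the user specified tags."""
    parser = TagTextExtractor(tags)
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


def tokenize(text, separator=r"\s+", display_option=1):
    """Split text on a separator pattern and arrange the separators."""
    # The capturing group makes re.split keep the separators:
    # [token, separator, token, separator, ..., token]
    parts = re.split("(" + separator + ")", text)
    tokens, seps = parts[0::2], parts[1::2]
    if display_option == 1:      # strip separator
        out = tokens
    elif display_option == 2:    # keep separator as its own token
        out = parts
    elif display_option == 3:    # keep separator with previous token
        out = [t + s for t, s in zip(tokens, seps + [""])]
    elif display_option == 4:    # keep separator with following token
        out = [s + t for s, t in zip([""] + seps, tokens)]
    else:
        raise ValueError("display_option must be 1, 2, 3 or 4")
    return [p for p in out if p]  # drop empty strings at the edges


page = "<html><head><title>T</title></head><body><p>one two</p></body></html>"
text = extract_text(page, "body")
for option in (1, 2, 3, 4):
    print(option, tokenize(text, display_option=option))
```

Splitting with a capturing group keeps tokens and separators in one alternating list, so all four display options reduce to different ways of pairing neighbours.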

Ways of Using

CGI Interface


 <<O>>  Difference Topic TAPoRwareHTMLTokenize (r1.3 - 15 Oct 2005 - MattPatey)

META TOPICPARENT TAPoRware
Changed:
<
<

List HTML Tags

See http://taporware/~taporware/htmlTools/tokenize.shtml
>
>

Tokenize HTML Document

See http://taporware.mcmaster.ca/~taporware/htmlTools/tokenize.shtml



 <<O>>  Difference Topic TAPoRwareHTMLTokenize (r1.2 - 15 Oct 2005 - MattPatey)

META TOPICPARENT TAPoRware
Added:
>
>

List HTML Tags

See http://taporware/~taporware/htmlTools/tokenize.shtml

Description


This tool splits an HTML document at specified points, or tokens. These tokens can be words, lines, sentences, and paragraphs, as well as certain characters, patterns, or tags. The results can be listed with the token removed, before the split, or after the split.
Added:
>
>

History

Pseudocode

Ways of Using

CGI Interface

Web Service Interface

Known Bugs

To Do


-- MattPatey - 13 Oct 2005

 <<O>>  Difference Topic TAPoRwareHTMLTokenize (r1.1 - 13 Oct 2005 - MattPatey)
Line: 1 to 1
Added:
>
>
META TOPICPARENT TAPoRware
This tool splits an HTML document at specified points, or tokens. These tokens can be words, lines, sentences, and paragraphs, as well as certain characters, patterns, or tags. The results can be listed with the token removed, before the split, or after the split.

-- MattPatey - 13 Oct 2005



Revision r1.1 - 13 Oct 2005 - 20:52 - MattPatey
Revision r1.11 - 06 Jun 2008 - 18:56 - LianYan