List Words
See
http://taporware.mcmaster.ca/~taporware/textTools/listword.shtml
Description
This tool lists the words found in a given text document in several ways. It can list all words, words matching a pattern, all words except stop words, and so on. It can also apply an inflectional stemmer to the words before listing them. The results can be sorted alphabetically, by frequency, by order of appearance, or in reverse alphabetical order, and displayed in different formats.
History
- List all words, words matching pattern, user selected words and all words except user entered stop words
- Add Glasgow stop words as the default stop words
- Add inflectional stemmer
- Add sparkline images for the top frequency words distribution
Pseudocode
- Obtain text string by URL or from user's local disk
- If the text format is XML or HTML, strip off all the tags
- Tokenize text into words using taporware tokenizer
- Apply stemmer if user selects it
- Count and sort words, ignoring capitalization
- Extract words based on user specified criteria if necessary
- Generate sparkline if user selects a number in the "Display top ..." selection control
- Generate output format based on user's selection
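The steps above can be sketched roughly in Python. This is only an illustration, not the tool's actual code: `strip_tags`, `tokenize`, and `list_words` are hypothetical names, the regular-expression tokenizer is a stand-in for the taporware tokenizer, the "unix styled pattern" is assumed to be a shell-style glob, and the stemming, sparkline, and output-format steps are left out. The `word_range` and `sorting` values follow the "range" and "sorting" parameters of the CGI interface.

```python
import re
from collections import Counter
from fnmatch import fnmatchcase


def strip_tags(text):
    """Crudely remove anything that looks like an XML/HTML tag (assumption)."""
    return re.sub(r"<[^>]+>", " ", text)


def tokenize(text):
    """Stand-in tokenizer (not the taporware one): letters with an optional apostrophe part."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)


def list_words(text, word_range="all", pattern="*", words=(), sorting=2):
    """Return (word, count) pairs; word_range and sorting mirror the CGI values."""
    tokens = [t.lower() for t in tokenize(strip_tags(text))]  # ignore capitalization
    counts = Counter(tokens)
    first_seen = {}
    for position, token in enumerate(tokens):
        first_seen.setdefault(token, position)

    chosen = {w.lower() for w in words}
    if word_range == "patt":      # words matching a pattern
        keep = [w for w in counts if fnmatchcase(w, pattern)]
    elif word_range == "find":    # only the user-selected words
        keep = [w for w in counts if w in chosen]
    elif word_range == "stop":    # everything except the stop words
        keep = [w for w in counts if w not in chosen]
    else:                         # "all"
        keep = list(counts)

    if sorting == 1:              # alphabetical
        keep.sort()
    elif sorting == 2:            # by frequency, ties broken alphabetically
        keep.sort(key=lambda w: (-counts[w], w))
    elif sorting == 3:            # by order of first appearance
        keep.sort(key=lambda w: first_seen[w])
    else:                         # reverse alphabetical
        keep.sort(reverse=True)
    return [(w, counts[w]) for w in keep]
```

For example, `list_words(html_text, word_range="stop", words=["the", "and"])` would list every word except "the" and "and", most frequent first.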
Ways of Using
- Enter a valid URL in the URL field or enter a local path to upload text
- Select which list you want to get and enter the corresponding text if necessary
- Select sorting criterion
- Check the "Apply inflectional stemmer" box if you want to apply the stemmer
- Select output format
- Select a number in the "Display top ..." selection control if you want to see the top frequency words distributions
- If you want the results displayed in the same window with the taporware interface, uncheck the "Open results in new window" check box
- Finally, click the "Submit" button
- On the results page, you can click any word to see its concordance with 5 words of context.
CGI Interface
If you want to use this tool from your web site, here is the CGI Interface:
(Note: The form tag must include the attribute name/value pair enctype="multipart/form-data", even if you do not upload a local file, because the tool was designed to allow local file uploading.)
Here are the parameters:
| Parameter Name | Parameter Value | Control Type | Default | Description |
| source | url/local | radio button | url | Selects the input text source (either a URL or an uploaded local file) |
| texturl | | text | | A valid URL pointing to a plain text, HTML, or XML document |
| localFile | | file | | The path to your local text file |
| range | all/patt/find/stop | radio button | all | Selects which word list the user wants to see |
| wpat | | text | | A Unix-style pattern. This field corresponds to the value "patt" in the radio button group named "range" |
| findstop | typedin/textfile/glasgow | radio button | glasgow | These options apply to the values "find" and "stop" in the radio button group named "range" |
| typedinword | | text | | This text field corresponds to the value "typedin" in the radio button group named "findstop" |
| wordfile | | file | | This field corresponds to the value "textfile" in the radio button group named "findstop" |
| sorting | 1/2/3/4 | selection | 2 | Sorting criterion: 1 = alphabetical, 2 = by frequency, 3 = by order of first appearance, 4 = reverse alphabetical |
| stem | | checkbox | unchecked | Indicates whether the inflectional stemmer is applied |
| display | 1/2/3/4 | selection | 2 | Display format: 1 = XML tags in HTML, 2 = HTML, 3 = XML tree, 4 = tab-delimited text |
| sparkline | none/5/10/20/50 | selection | none | Selects the number of high-frequency words for which to generate sparklines showing word distribution over each 5% chunk of the text |
| taporface | | checkbox | checked | Displays the result in a new window without the graphics interface (default) or in the same window with the taporware interface |
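The "sparkline" parameter describes a word's distribution over each 5% chunk of the text. A minimal Python sketch of that idea follows. It assumes the text is split into 20 equal slices by token position (the tool may chunk differently, for example by characters), and `chunk_distribution` is a hypothetical name, not part of the tool.

```python
def chunk_distribution(tokens, word, chunks=20):
    """Count occurrences of `word` in each of `chunks` equal slices of `tokens`.

    With the default of 20 slices, each slice covers 5% of the text.
    Splitting by token position is an assumption made for this sketch.
    """
    word = word.lower()
    counts = [0] * chunks
    total = len(tokens)
    for position, token in enumerate(tokens):
        if token.lower() == word:
            # Map the token position to its slice index (0 .. chunks - 1).
            counts[position * chunks // total] += 1
    return counts
```

The resulting list of 20 counts is the kind of series a sparkline image would plot for one word.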
Use List Words TAPoRware Tool in Your Web Page
You can add a button to your web page that lists all the words on that page by calling the TAPoRware CGI script.
Here is the code for this function:
<form method="post" name="textForm" enctype="multipart/form-data" target="_blank"
action="http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/tlistwordstem.cgi"
onsubmit="document.textForm.texturl.value=document.location.href">
<input type="hidden" name="source" value="url" />
<input type="hidden" name="texturl" />
<input type="hidden" name="freetext" value="yes"/>
<input type="hidden" name="range" value="all" />
<input type="hidden" name="sorting" value="2" />
<input type="hidden" name="sparkline" value="10" />
<input type="hidden" name="display" value="1" />
<input type="hidden" name="taporface" value="same" />
<input type="submit" name="doIt" value="List All Words of the Page" />
</form>
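The same request can also be prepared from a script instead of a browser form. The Python sketch below builds, but does not send, a multipart/form-data POST carrying the hidden fields from the form above. `encode_multipart` and `build_request` are hypothetical helper names, the submit button field is omitted, and whether the CGI script accepts such a request (or is still reachable at this address) is an assumption.

```python
import urllib.request
import uuid

# URL copied from the form above; the service may no longer be reachable.
CGI_URL = "http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/tlistwordstem.cgi"


def encode_multipart(fields):
    """Encode a dict of simple text fields as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append("--" + boundary)
        lines.append('Content-Disposition: form-data; name="%s"' % name)
        lines.append("")
        lines.append(str(value))
    lines.append("--" + boundary + "--")
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    return body, "multipart/form-data; boundary=" + boundary


def build_request(page_url):
    """Build a POST request mirroring the hidden fields of the form above."""
    fields = {
        "source": "url",
        "texturl": page_url,   # the form fills this in from document.location.href
        "freetext": "yes",
        "range": "all",
        "sorting": "2",
        "sparkline": "10",
        "display": "1",
        "taporface": "same",
    }
    body, content_type = encode_multipart(fields)
    return urllib.request.Request(
        CGI_URL, data=body, headers={"Content-Type": content_type}, method="POST"
    )
```

Passing the result of `build_request("http://example.org/page.html")` to `urllib.request.urlopen` would send the request.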
Web Service Interface
Taporware provides web services to any non-profit organization. Here is the taporware web services information:
- Endpoint URL: http://taporware.mcmaster.ca:9982
- Service URI: http://taporware.mcmaster.ca/~taporware/webservice
- Service Method: list_Words_Plain
- Parameters:
  - textInput -- any text string. If the text format is HTML or XML, the tags will be stripped
  - listOption -- same values as the parameter "range" in the CGI interface above
  - optionSeletion -- values correspond to the chosen "list option"
  - sorting -- same values as the parameter "sorting" in the CGI interface above
  - outFormat -- same values as the parameter "display" in the CGI interface above
- Note: The service automatically generates sparklines for the distribution of the 10 most frequent words
Known Bugs
To Do
--
LianYan - 28 Mar 2007