List Words

See http://taporware.mcmaster.ca/~taporware/textTools/listword.shtml

Description

This tool lists the words found in a given text document in several ways. It can list all words, words matching a pattern, all words except stop words, and so on. It can also list words after applying an inflectional stemmer. The results can be sorted alphabetically, by frequency, by order of appearance, or in reversed alphabetical order, and displayed in different formats.

History

  1. List all words, words matching pattern, user selected words and all words except user entered stop words
  2. Add Glasgow stop words as the default stop words
  3. Add inflectional stemmer
  4. Add sparkline images for the top frequency words distribution

Pseudocode

  • Obtain text string by URL or from user's local disk
  • If the text format is XML or HTML, strip off all the tags
  • Tokenize text into words using taporware tokenizer
  • Apply stemmer if user selects it
  • Sort and count words with the capitalization ignored
  • Extract words based on user specified criteria if necessary
  • Generate sparkline if user selects a number in the "Display top ..." selection control
  • Generate output format based on user's selection
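The steps above can be sketched in Python as follows. This is a minimal illustration and not the TAPoRware implementation: the tokenizer regex is an assumption standing in for the taporware tokenizer, the stemmer and sparkline steps are left out, and "reversed alphabetical order" is assumed to mean Z to A. The function names are hypothetical.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TagStripper(HTMLParser):
    """Collects only the text nodes of an HTML/XML string."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(markup):
    """Return the text content of markup with all tags removed."""
    stripper = _TagStripper()
    stripper.feed(markup)
    stripper.close()
    return " ".join(stripper.parts)


def tokenize(text):
    # Assumption: a word is a run of letters, optionally with internal
    # apostrophes. The real taporware tokenizer may differ.
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)


def list_words(markup, stop_words=(), sorting=2):
    """Count words case-insensitively and order them by the given criterion.

    sorting: 1 = alphabetical, 2 = by frequency, 3 = order of first
    appearance, 4 = reversed alphabetical (assumed here to mean Z to A).
    """
    stop = {s.lower() for s in stop_words}
    words = [w.lower() for w in tokenize(strip_tags(markup))]
    counts = Counter(w for w in words if w not in stop)
    if sorting == 1:
        ordered = sorted(counts)
    elif sorting == 2:
        # Most frequent first; ties broken alphabetically.
        ordered = sorted(counts, key=lambda w: (-counts[w], w))
    elif sorting == 3:
        # Counter keeps keys in first-insertion order (Python 3.7+).
        ordered = list(counts)
    elif sorting == 4:
        ordered = sorted(counts, reverse=True)
    else:
        raise ValueError("sorting must be 1, 2, 3 or 4")
    return [(w, counts[w]) for w in ordered]
```

For example, `list_words("<p>The cat and the dog</p>", stop_words=["the", "and"])` returns only the entries for "cat" and "dog".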

Ways of Using

  • Enter a valid URL in the URL field or enter a local path to upload text
  • Select which list you want to get and enter the corresponding text if necessary
  • Select sorting criterion
  • Check the "Apply inflectional stemmer" box if you want to apply the stemmer
  • Select output format
  • Select a number in the "Display top ..." selection control if you want to see the top frequency words distributions
  • If you want the results displayed in the same window with taporware interface, uncheck the check box - "Open results in new window"
  • Finally, click the "Submit" button
    • In the result page, you can click any word to get its concordance with 5 words of context.
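The concordance mentioned in the last step can be illustrated with a small keyword-in-context sketch. It assumes that "5 words of context" means up to five words on each side of the keyword, and it splits on whitespace rather than using the taporware tokenizer; the function name is hypothetical.

```python
import re


def concordance(text, keyword, context=5):
    """Return (left, match, right) tuples for each occurrence of keyword."""
    words = re.findall(r"\S+", text)
    key = keyword.lower()
    hits = []
    for i, word in enumerate(words):
        # Ignore surrounding punctuation and case when comparing.
        if word.strip(".,;:!?\"'()").lower() == key:
            left = " ".join(words[max(0, i - context):i])
            right = " ".join(words[i + 1:i + 1 + context])
            hits.append((left, word, right))
    return hits
```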

CGI Interface

If you want to use this tool from your own web site, use the CGI interface described here. (Note: the form tag must carry the attribute enctype="multipart/form-data", because the tool was designed to allow local file uploading. This applies even if you do not use that feature.)

Here are the parameters:

Parameter Name | Parameter Value | Control Type | Default | Description
source | url/local | radio button | url | Lets the user select the input text (either a URL or an uploaded local HTML text)
texturl | - | text | - | A valid URL pointing to a plain text, HTML or XML document
localFile | - | file | - | The path to your local text file
range | all/patt/find/stop | radio button | all | Lets the user select which word list to see
wpat | - | text | - | A Unix-style pattern. This field corresponds to the value "patt" in the radio button group named "range"
findstop | typedin/textfile/glasgow | radio button | glasgow | These options apply to the values "find" and "stop" in the radio button group named "range"
typedinword | - | text | - | This text field corresponds to the value "typedin" in the radio button group named "findstop"
wordfile | - | file | - | This field corresponds to the value "textfile" in the radio button group named "findstop"
sorting | 1/2/3/4 | selection | 2 | Sorting criterion: alphabetical / by frequency / by order of first appearance / reversed alphabetical, in the order of the parameter values
stem | - | checkbox | unchecked | Indicates whether the inflectional stemmer is applied
display | 1/2/3/4 | selection | 2 | Display format: XML tags in HTML / HTML / XML tree / tab-delimited text, in the order of the parameter values
sparkline | none/5/10/20/50 | selection | none | Selects the number of high-frequency words for which to generate sparkline word distributions over each 5% chunk of text
taporface | - | checkbox | checked | Displays the result in a new window without the graphics interface (default), or with the taporware interface in the same window
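As a sketch of how a script could prepare such a request, the following Python code encodes a set of these parameters as multipart/form-data and wraps them in a POST request object without sending it. The script URL is the one used in the form example on this page; the document URL in the usage note is a placeholder, and the helper names are hypothetical. The checkbox parameters (stem, taporface) are omitted because the values they submit are not stated here.

```python
import urllib.request
import uuid

# Script URL as given in the form example on this page.
CGI_URL = ("http://taporware.mcmaster.ca/~taporware/cgi-bin/"
           "prototype/tlistwordstem.cgi")


def encode_multipart(fields, boundary=None):
    """Encode simple text fields as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = boundary or uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append(f"--{boundary}")
        lines.append(f'Content-Disposition: form-data; name="{name}"')
        lines.append("")
        lines.append(str(value))
    lines.append(f"--{boundary}--")
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def build_list_words_request(text_url, sorting=2, display=4):
    """Build (but do not send) a POST request listing all words of text_url."""
    fields = {
        "source": "url",
        "texturl": text_url,
        "range": "all",
        "sorting": sorting,    # 2 = by frequency
        "display": display,    # 4 = tab-delimited text
        "sparkline": "none",
    }
    body, content_type = encode_multipart(fields)
    return urllib.request.Request(
        CGI_URL,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
```

A caller would pass the result to `urllib.request.urlopen`, for instance `urllib.request.urlopen(build_list_words_request("http://example.org/sample.txt"))`, assuming the service is reachable.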

Use List Words TAPoRware Tool in Your Web Page

You can add a button to your web page that lists all the words in that page by calling the TAPoRware CGI script.

Here is the code for this function:

<form method="post" name="textForm" enctype="multipart/form-data" target="_blank" action="http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/tlistwordstem.cgi" onsubmit="document.textForm.texturl.value=document.location.href">

<input type="hidden" name="source" value="url" />

<input type="hidden" name="texturl" />

<input type="hidden" name="freetext" value="yes"/>

<input type="hidden" name="range" value="all" />

<input type="hidden" name="sorting" value="2" />

<input type="hidden" name="sparkline" value="10" />

<input type="hidden" name="display" value="1" />

<input type="hidden" name="taporface" value="same" />

<input type="submit" name="doIt" value="List All Words of the Page" />

</form>

Web Service Interface

Taporware provides web services to any non-profit organization. Here is the taporware web service information:

  • Endpoint URL: http://taporware.mcmaster.ca:9982
  • Service URI: http://taporware.mcmaster.ca/~taporware/webservice
  • Service Method: list_Words_Plain
  • parameters:
    • textInput -- any text string. If the text format is HTML or XML, the tags will be stripped
    • listOption -- same values as the parameter "range" in the CGI interface above
    • optionSeletion -- the value that corresponds to the chosen listOption
    • sorting -- same values as the parameter "sorting" in the CGI interface above
    • outFormat -- same values as the parameter "display" in the CGI interface above
  • Note: The service automatically generates sparklines for the distributions of the 10 most frequent words
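The sparkline distribution mentioned in the note, which the CGI parameter table describes as counts over each 5% chunk of text, can be sketched as follows. This is an illustration under assumptions, not the service's code: the tokenizer regex, the function names, and the text-based rendering are all hypothetical, and the actual tool produces sparkline images.

```python
import re

BARS = " ▁▂▃▄▅▆▇█"  # index 0 = no occurrences, index 8 = peak


def chunk_distribution(text, word, chunks=20):
    """Count occurrences of word in each of `chunks` equal slices of the text.

    With chunks=20, each slice covers 5% of the tokens.
    """
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)*", text)]
    counts = [0] * chunks
    total = len(tokens)
    target = word.lower()
    for i, token in enumerate(tokens):
        if token == target:
            # i < total, so the index always stays below `chunks`.
            counts[i * chunks // total] += 1
    return counts


def sparkline(counts):
    """Render counts as a one-line text sparkline scaled to the peak value."""
    peak = max(counts, default=0)
    if peak == 0:
        return " " * len(counts)
    top = len(BARS) - 1
    # Ceiling division so that any non-zero count gets a visible bar.
    return "".join(BARS[-(-c * top // peak)] for c in counts)
```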

Known Bugs

To Do

-- LianYan - 28 Mar 2007

