
Summarizer

See http://taporware.mcmaster.ca/~taporware/xmlTools/summarizer.shtml

Description

This tool creates a summary of statistical information about a given document. The user selects which types of information the summary displays: high-frequency words, sentences containing high-frequency words, high-frequency word context, collocation, and element/text distribution.

Pseudocode

  • Obtain the XML string from a URL or from the user's local disk. If the text is not XML, return an error message.
  • Get the XML metadata from the XML string where possible, including title, author, date of publication, and language.
  • Get the text the user wants to summarize, based on the specified element name or XPath.
  • Obtain and list the top N high-frequency words (N is an integer), together with summary figures (unique words, total words, highest word frequency, average word frequency, etc.), based on user-specified criteria.
  • Get statistics on the user-specified text, including the number of specified elements and the maximum, average, and minimum number of words contained in each element.
  • Split the text into sentences and obtain the sentences that contain at least n high-frequency words, where n is specified by the user.
  • For each high-frequency word, get its concordance with the user-specified context length in words, then list the first m concordances for each word, where m is specified by the user.
  • Get the collocations of each high-frequency word within the specified context length in words, then list them.
  • Get the number of words per element and generate the distribution.
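The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tool's actual implementation: the function name `summarize`, its parameters, the simple regular-expression tokenizer, and the choice to count distinct high-frequency words per sentence are all hypothetical. Python's `xml.etree.ElementTree` supports only a subset of XPath, so the default path here is `.//p` rather than `//`, and metadata extraction is omitted.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

WORD_RE = re.compile(r"[A-Za-z']+")


def words_of(text):
    """Lowercased word tokens of a text (a deliberately simple tokenizer)."""
    return [w.lower() for w in WORD_RE.findall(text)]


def summarize(xml_string, path=".//p", top_n=10, min_hits=2, context=5):
    """Hypothetical sketch of the summarizer steps; all names are assumptions."""
    # Step 1: reject input that is not well-formed XML.
    try:
        root = ET.fromstring(xml_string)
    except ET.ParseError as exc:
        return {"error": f"not well-formed XML: {exc}"}

    # Step 3: collect the text of every element matched by the path.
    texts = ["".join(el.itertext()) for el in root.findall(path)]
    if not texts:
        return {"error": f"no element matches {path!r}"}

    # Steps 4-5: word frequencies and per-element word statistics.
    per_element = [len(words_of(t)) for t in texts]
    tokens = [w for t in texts for w in words_of(t)]
    freq = Counter(tokens)
    top_words = freq.most_common(top_n)
    top = {w for w, _ in top_words}

    # Step 6: sentences containing at least min_hits distinct top words.
    sentences = [s.strip() for t in texts
                 for s in re.split(r"(?<=[.!?])\s+", t) if s.strip()]
    hits = [s for s in sentences if len(set(words_of(s)) & top) >= min_hits]

    # Steps 7-8: concordance lines and collocate counts within the context
    # window. Tokens are flattened, so a window may cross element boundaries.
    concordance = {w: [] for w in top}
    collocates = {w: Counter() for w in top}
    for i, tok in enumerate(tokens):
        if tok in top:
            left = tokens[max(0, i - context):i]
            right = tokens[i + 1:i + 1 + context]
            concordance[tok].append(" ".join(left + [tok.upper()] + right))
            collocates[tok].update(left + right)

    return {
        "top_words": top_words,
        "unique_words": len(freq),
        "total_words": len(tokens),
        "element_stats": {
            "elements": len(texts),
            "max": max(per_element),
            "min": min(per_element),
            "average": sum(per_element) / len(per_element),
        },
        "sentences": hits,
        "concordance": concordance,
        "collocates": collocates,
    }


if __name__ == "__main__":
    sample = "<doc><p>The cat sat. The cat ran. The end.</p><p>A dog barked.</p></doc>"
    print(summarize(sample, top_n=2, min_hits=2, context=1))
```

Limiting each concordance list to the first m entries, as the pseudocode describes, would be a slice such as `concordance[w][:m]`.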

Ways of Using

  • Enter a valid URL pointing to an XML document in the URL field, or enter a local path to upload an XML text.
  • Enter a valid XML element name or XPath that exists in the XML text; the default is "//".
  • Enter an integer in the "List top" words field.
  • Select how the words are listed, for example excluding stop words or listing only the words in a word list. Remember to enter the word list if you do not select the Glasgow stop words.
  • Enter an integer in the text field on the line "List sentences that have n or more high frequency words".
  • Use the form's only selection control to choose the number of concordances (contexts) shown for each high-frequency word, and fill in the text field that follows it with the context length in words.
  • Enter an integer in the text field on the line "List collocation within n words of the high frequency words".
  • In the results panel, select how the listed words are sorted.
  • Select the display format. Currently only HTML is supported.
  • Click the submit button.
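To make the word-listing and sorting choices concrete, here is a minimal Python sketch. It assumes the modes behave as their names suggest ("find" keeps only the words in the list, "stop" drops them) and uses the four sort codes given for the `sorting` parameter in the CGI table. The function `list_words` and its arguments are hypothetical, and the pattern mode is left out because the page does not say which pattern syntax the tool accepts.

```python
from collections import Counter


def list_words(tokens, mode="stop", word_list=(), sorting=2, top_n=10):
    """Hypothetical sketch of the word-listing options.

    mode:    "all", "find" (keep only word_list) or "stop" (drop word_list)
    sorting: 1 alphabetical, 2 frequency, 3 first appearance,
             4 reverse alphabetical
    """
    chosen = set(word_list)
    if mode == "find":
        kept = [t for t in tokens if t in chosen]
    elif mode == "stop":
        kept = [t for t in tokens if t not in chosen]
    else:
        kept = list(tokens)

    # Pick the top N by frequency first, then apply the requested order.
    top = Counter(kept).most_common(top_n)
    if sorting == 1:
        top.sort(key=lambda pair: pair[0])
    elif sorting == 3:
        first_seen = {}
        for position, token in enumerate(kept):
            first_seen.setdefault(token, position)
        top.sort(key=lambda pair: first_seen[pair[0]])
    elif sorting == 4:
        top.sort(key=lambda pair: pair[0], reverse=True)
    return top


if __name__ == "__main__":
    words = "the cat and the dog and the bird".split()
    print(list_words(words, mode="stop", word_list=["the", "and"], sorting=1))
    print(list_words(words, mode="find", word_list=["the", "and"], sorting=2))
```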

CGI Interface

If you want to use this tool from your web site, here is the CGI interface. (Note: the form tag needs the attribute name/value pair enctype="multipart/form-data", because the tool was designed with local file uploading in mind.)

Here are the parameters:

| Parameter Name | Parameter Value | Control Type | Default | Description |
| source | url/local | radio button | url | Lets the user select the input text (either a URL or an uploaded local XML text) |
| xmlurl | | text | | A valid URL pointing to an XML document |
| localFile | | file | | The path to your local XML text file |
| xmlelem | | text | // | Valid XML element names or XPaths, separated by commas |
| numoftop | | text | 10 | The number of top high-frequency words to display and to use in the functions that follow |
| range | all/pat/find/stop | radio button | stop | How the words are listed |
| wordpat | | text | | A pattern, if you want to list only the words that match it |
| fromwhere | thislist/userfile/glasgow | radio button | glasgow | Used when find/stop is selected in the "how to list" selection |
| thislist | | text | | If you select "Type in", fill in this field with the comma-delimited words you want to list or to exclude |
| userfile | | file | | If you select "local file", browse to the file that contains the word list |
| numofword | | text | 2 | The number of high-frequency words a sentence must contain in order to be listed |
| concornum | first/three/all | selection | first | The number of concordances to display for each high-frequency word |
| contLength | | text | 5 | The context length in words for the concordance (context) |
| collength | | text | 1 | The context length for the collocation of each high-frequency word |
| sorting | 1/2/3/4 | selection | 2 | How the listed words are sorted: 1 = alphabetical, 2 = frequency, 3 = first appearance, 4 = reverse alphabetical |
| display | 1 | selection | 1 | The output format. Currently only HTML is implemented |
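As an illustration of calling the interface from a script, the following Python sketch builds a multipart/form-data POST using the parameter names and defaults listed above. The endpoint `CGI_URL` is a placeholder, since this page does not give the address of the CGI script, and the helpers `encode_multipart` and `build_request` are hypothetical names. Only text fields are encoded; uploading `localFile` or `userfile` would additionally need a filename and content type on the part.

```python
import urllib.request
import uuid

# Placeholder only: the page does not state the CGI script's URL.
CGI_URL = "http://example.org/cgi-bin/summarizer.cgi"

# Defaults taken from the parameter table.
DEFAULTS = {
    "source": "url", "xmlelem": "//", "numoftop": "10", "range": "stop",
    "fromwhere": "glasgow", "numofword": "2", "concornum": "first",
    "contLength": "5", "collength": "1", "sorting": "2", "display": "1",
}


def encode_multipart(fields):
    """Encode plain text fields as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines += [
            f"--{boundary}",
            f'Content-Disposition: form-data; name="{name}"',
            "",
            str(value),
        ]
    lines += [f"--{boundary}--", ""]
    body = "\r\n".join(lines).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def build_request(xml_url, **overrides):
    """Build (but do not send) a POST request for the given XML document URL."""
    fields = {**DEFAULTS, "xmlurl": xml_url, **overrides}
    body, content_type = encode_multipart(fields)
    return urllib.request.Request(
        CGI_URL, data=body,
        headers={"Content-Type": content_type}, method="POST",
    )


if __name__ == "__main__":
    request = build_request("http://example.org/sample.xml", numoftop=20)
    print(request.get_method(), request.full_url)
    # urllib.request.urlopen(request) would send it; it is left out here
    # because CGI_URL is only a placeholder.
```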

Web Service Interface

Not implemented yet.

Known Bugs

To Do

-- MattPatey - 13 Oct 2005

