Summarizer

See http://taporware.mcmaster.ca/~taporware/htmlTools/summarizer.shtml

Description

This tool creates a summary of statistical information on a given document. It enables the user to select what types of information to display in the summary. Options include high frequency words, sentences with high frequency words, high frequency word context, collocation and element/text distribution.

Predefined Parameter Values in Tool Bar

  • Source: the page the user is currently viewing
  • Element: body, or as set by the site owner
  • Number of high frequency words: 20
  • Use stop list: yes, using the Glasgow stop-word list
  • List sentences that contain at least 3 high frequency words
  • For each high frequency word, list 3 of its concordance lines with a context length of 5 words
  • List collocates of all the high frequency words with a context length of 3 words
  • Word list sorting: by frequency
  • Display format: HTML

Pseudocode

  • Obtain the HTML string from a URL or from the user's local disk. If the text is not HTML, return an error message.
  • Get the HTML metadata from the HTML string, including the page title, first heading, opening words, etc.
  • Get the text the user wants to summarize, based on the specified HTML tags.
  • Obtain and list the top N high frequency words, together with summary figures (unique words, total words, highest word frequency, average word frequency, etc.), based on user-specified criteria. N is an integer.
  • Get statistics on the user-specified text, including the number of specified elements and the maximum, average and minimum number of words per element.
  • Split the text into sentences and obtain the sentences that contain at least n high frequency words. n is specified by the user.
  • For each high frequency word, get its concordance with the user-specified context length in words, then list the first m concordance lines for each word. m is specified by the user.
  • Get the collocates of each high frequency word within the specified context length in words, and list them.
  • Get the number of words for each element and generate the distribution.
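The word-level steps above (top N words, sentence filtering, concordance and collocation) can be sketched in Python as follows. This is only an illustrative sketch, not the tool's actual implementation: the function names, the tokenizing regular expressions and the tiny stop list are all assumptions made for the example, and the real tool uses the full Glasgow stop-word list.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption); the tool uses the Glasgow list.
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "it", "on"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def top_words(tokens, n, stop_words=STOP_WORDS):
    """Return the n most frequent (word, count) pairs outside the stop list."""
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(n)


def sentences_with(text, words, minimum):
    """Return sentences containing at least `minimum` distinct words from `words`."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    wanted = set(words)
    return [s for s in sentences if len(wanted & set(tokenize(s))) >= minimum]


def concordance(tokens, word, context, limit):
    """Return the first `limit` occurrences of `word`, each with up to
    `context` tokens on either side; the keyword is shown in upper case."""
    lines = []
    for i, token in enumerate(tokens):
        if token != word:
            continue
        if len(lines) >= limit:
            break
        left = tokens[max(0, i - context):i]
        right = tokens[i + 1:i + 1 + context]
        lines.append(" ".join(left + [word.upper()] + right))
    return lines


def collocates(tokens, word, context, stop_words=STOP_WORDS):
    """Count non-stop words found within `context` tokens of `word`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == word:
            window = tokens[max(0, i - context):i] + tokens[i + 1:i + 1 + context]
            counts.update(t for t in window if t not in stop_words)
    return counts
```

A typical call sequence would tokenize the extracted element text once, pass the tokens to top_words, and then feed each resulting word to concordance and collocates. The sketch counts distinct high frequency words per sentence; whether the tool counts distinct words or every occurrence is not stated on this page.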

Ways of Using

  • Enter a valid URL pointing to an HTML document in the URL field, or enter a local path to upload an HTML file
  • Enter a valid HTML element name; the default is "body"
  • Enter an integer in the "List top" words field
  • Select how the words are listed: exclude stop words or list only specified words. If you do not select the Glasgow stop words, remember to enter your own word list.
  • Enter an integer in the text field in the line "List sentences that have n or more high frequency words"
  • From the form's only selection control, choose the number of concordance (context) lines for each high frequency word, then enter the context length in words in the text field that follows
  • Enter an integer in the text field in the line "List collocation within n words of the high frequency words"
  • In the results panel, select how the listed words are sorted
  • Select the display format. Currently only HTML is supported.
  • Click the submit button

CGI Interface

If you want to use this tool from your web site, here is the CGI interface. (Note: you need to use the attribute name/value pair enctype="multipart/form-data" within the form tag, because the tool was designed with local file uploading in mind.)

Here are the parameters:

| Parameter Name | Parameter Value | Control Type | Default | Description |
| source | url/local | radio button | url | Lets the user select the input text (either a URL or an uploaded local HTML file) |
| htmlurl | | text | | A valid URL pointing to an HTML document |
| localFile | | file | | The path to your local HTML file |
| tagtext | | text | body | A valid HTML element tag, or several HTML tags separated by commas |
| numoftop | | text | 10 | The number of top high frequency words to display and to use in the functions that follow |
| range | all/pat/find/stop | radio button | stop | How the words are listed |
| wordpat | | text | | A pattern; only words that match it are listed. Used with the value "pat" of the radio button "range" |
| fromwhere | thislist/userfile/glasgow | radio button | glasgow | Where the word list comes from. Used when "find" or "stop" is selected for "range" |
| thislist | | text | | If you select "Type in", enter here the words you want to list or exclude, separated by commas |
| userfile | | file | | If you select "local file", browse to the file that contains the word list |
| numofword | | text | 2 | The minimum number of high frequency words a sentence must contain to be listed |
| concornum | first/three/all | selection | first | The number of concordance lines to display for each high frequency word |
| contLength | | text | 5 | The context length, in words, for the concordance (context) |
| collength | | text | 1 | The context length, in words, for the collocation of each high frequency word |
| sorting | 1/2/3/4 | selection | 2 | How the listed words are sorted: 1 = alphabetical, 2 = by frequency, 3 = by first appearance, 4 = reverse alphabetical |
| display | 1 | selection | 1 | The output format. Currently only HTML is implemented |
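As a rough illustration of how a client might assemble these parameters, the Python sketch below starts from the documented defaults, applies overrides and checks the enumerated values. The names build_params, DEFAULTS, CHOICES and INTEGER_FIELDS are hypothetical helpers for this example and not part of TAPoRware, and the validation rules are only inferred from the table above.

```python
# Defaults copied from the parameter table; parameters with no listed
# default (htmlurl, localFile, wordpat, thislist, userfile) are left out.
DEFAULTS = {
    "source": "url",
    "tagtext": "body",
    "numoftop": "10",
    "range": "stop",
    "fromwhere": "glasgow",
    "numofword": "2",
    "concornum": "first",
    "contLength": "5",
    "collength": "1",
    "sorting": "2",
    "display": "1",
}

# Allowed values for the radio-button and selection parameters.
CHOICES = {
    "source": {"url", "local"},
    "range": {"all", "pat", "find", "stop"},
    "fromwhere": {"thislist", "userfile", "glasgow"},
    "concornum": {"first", "three", "all"},
    "sorting": {"1", "2", "3", "4"},
    "display": {"1"},
}

# Text fields that the table describes as counts or lengths.
INTEGER_FIELDS = ("numoftop", "numofword", "contLength", "collength")


def build_params(htmlurl, **overrides):
    """Return a dict of CGI parameters for summarizing the page at `htmlurl`."""
    params = dict(DEFAULTS, htmlurl=htmlurl)
    params.update({name: str(value) for name, value in overrides.items()})
    for name, allowed in CHOICES.items():
        if params[name] not in allowed:
            raise ValueError(f"{name} must be one of {sorted(allowed)}")
    for name in INTEGER_FIELDS:
        if not params[name].isdigit():
            raise ValueError(f"{name} must be a non-negative integer")
    if params["range"] == "pat" and not params.get("wordpat"):
        raise ValueError("range='pat' requires a wordpat value")
    return params
```

For example, build_params("http://example.org/page.html", numoftop=20, concornum="three") keeps every other default and rejects values such as sorting=9 before anything is sent.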

Use Summarizer TAPoRware Tool in Your Web Page

You can add a button to your web page that summarizes the current page by calling the TAPoRware CGI script.

Here is the code for this function:

<form method="post" name="htmlForm" enctype="multipart/form-data" target="_blank" action="http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/hsummarizer.cgi" onsubmit="document.htmlForm.htmlurl.value=document.location.href">
  <input type="hidden" name="source" value="url" />
  <input type="hidden" name="htmlurl" />
  <input type="hidden" name="freetext" value="yes" />
  <input type="hidden" name="tagtext" value="body" />
  <input type="hidden" name="highfre" value="true" />
  <input type="hidden" name="numoftop" value="10" />
  <input type="hidden" name="range" value="stop" />
  <input type="hidden" name="fromwhere" value="glasgow" />
  <input type="hidden" name="sentence" value="true" />
  <input type="hidden" name="numofword" value="2" />
  <input type="hidden" name="concor" value="true" />
  <input type="hidden" name="concornum" value="three" />
  <input type="hidden" name="contLength" value="5" />
  <input type="hidden" name="colloc" value="true" />
  <input type="hidden" name="collength" value="1" />
  <input type="hidden" name="distrib" value="true" />
  <input type="hidden" name="sorting" value="2" />
  <input type="hidden" name="display" value="1" />
  <input type="hidden" name="taporface" value="same" />
  <input type="submit" name="doIt" value="Summarize This Page" />
</form>
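For readers who want to call the script from a program instead of a browser, the following Python sketch builds the equivalent multipart/form-data POST request by hand, using only the standard library. The field names and values are copied from the form above; the helper names encode_multipart and summarizer_request are assumptions for this example. The code only constructs the request and does not send it, and it makes no claim about whether the script is still reachable at that address.

```python
import uuid
from urllib.request import Request

# Action URL taken from the form above.
ENDPOINT = "http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/hsummarizer.cgi"


def encode_multipart(fields, boundary=None):
    """Encode simple text fields as a multipart/form-data body.

    Returns the body bytes and the matching Content-Type header value.
    """
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        )
    parts.append(f"--{boundary}--\r\n")
    body = "".join(parts).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def summarizer_request(page_url):
    """Build (but do not send) a POST request mirroring the form's fields."""
    fields = {
        "source": "url", "htmlurl": page_url, "freetext": "yes",
        "tagtext": "body", "highfre": "true", "numoftop": "10",
        "range": "stop", "fromwhere": "glasgow", "sentence": "true",
        "numofword": "2", "concor": "true", "concornum": "three",
        "contLength": "5", "colloc": "true", "collength": "1",
        "distrib": "true", "sorting": "2", "display": "1",
        "taporface": "same", "doIt": "Summarize This Page",
    }
    body, content_type = encode_multipart(fields)
    return Request(ENDPOINT, data=body,
                   headers={"Content-Type": content_type}, method="POST")
```

Passing the returned object to urllib.request.urlopen would submit it, in the same way that clicking the button submits the form.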

Web Service Interface

Not implemented yet.

Known Bugs

To Do

-- MattPatey - 13 Oct 2005

