Summarizer
See
http://taporware.mcmaster.ca/~taporware/htmlTools/summarizer.shtml
Description
This tool creates a summary of statistical information on a given document. It enables the user to select what types of information to display in the summary. Options include high frequency words, sentences with high frequency words, high frequency word context, collocation and element/text distribution.
Predefined Parameter Values in Tool Bar
- Source: the page the user is currently on.
- Element: body, or set by site owner
- Number of high frequency words: 20
- Use stop list: Yes, using the Glasgow stop-words list
- List sentences that contain at least 3 high frequency words
- For each high frequency word, list 3 of its concordances with a context length of 5 words
- List collocates of all the high frequency words with a context length of 3 words
- Word list sorting: by frequency
- Display format: HTML
Pseudocode
- Obtain the HTML string by URL or from the user's local disk. If the text is not HTML, return an error message.
- Get HTML metadata from the HTML string, including page title, first heading, opening words, etc.
- Get the text the user wants to summarize, based on the specified HTML tags.
- Obtain and list the top N high frequency words, with summary figures (unique words, total words, highest word frequency, average word frequency, etc.), based on user-specified criteria. N is an integer.
- Get statistics on the user-specified text, including the number of specified elements and the maximum, average and minimum number of words contained in each element, etc.
- Split the text into sentences and obtain the sentences that contain at least n high frequency words. n is specified by the user.
- For each high frequency word, get its concordances with the user-specified context length in words, then list the first m concordances for each word. m is specified by the user.
- Get the collocates of each high frequency word within the specified context length in words, and then list them.
- Get the number of words for each element and generate the distribution.
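The steps above can be sketched in Python as follows. This is a minimal illustration, not the TAPoRware implementation: every function and class name, the tokenization rule, the naive sentence splitter and the tiny stand-in stop list are assumptions made for the example. It also assumes well-formed HTML (every counted element has a closing tag) and counts distinct high frequency words per sentence.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny stand-in stop list (assumption); the tool itself uses the Glasgow list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on", "it"}


class ElementText(HTMLParser):
    """Collect the text of each outermost occurrence of one element."""

    def __init__(self, tag):
        super().__init__()
        self.tag, self.depth, self.chunks = tag, 0, []

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            if self.depth == 0:
                self.chunks.append([])  # start a new outermost element
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks[-1].append(data)


def element_texts(html, tag="body"):
    """Return one whitespace-normalized string per outermost `tag` element."""
    parser = ElementText(tag)
    parser.feed(html)
    parser.close()
    return [" ".join(" ".join(chunk).split()) for chunk in parser.chunks]


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def element_stats(texts):
    """Summarize the number of words per element."""
    sizes = [len(tokenize(t)) for t in texts]
    if not sizes:
        return {"elements": 0, "max": 0, "min": 0, "average": 0.0}
    return {"elements": len(sizes), "max": max(sizes), "min": min(sizes),
            "average": sum(sizes) / len(sizes)}


def top_words(tokens, n, stop_words=STOP_WORDS):
    """Return the n most frequent non-stop words as (word, count) pairs."""
    return Counter(t for t in tokens if t not in stop_words).most_common(n)


def sentences_with(text, words, minimum):
    """Return sentences holding at least `minimum` distinct words from `words`."""
    wanted = set(words)
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if len(wanted & set(tokenize(s))) >= minimum]


def concordance(tokens, word, context, limit):
    """Return up to `limit` windows of `context` tokens either side of `word`."""
    hits = []
    for i, token in enumerate(tokens):
        if token == word:
            hits.append(" ".join(tokens[max(0, i - context): i + context + 1]))
            if len(hits) == limit:
                break
    return hits


def collocates(tokens, word, context, stop_words=STOP_WORDS):
    """Count non-stop words found within `context` tokens of `word`."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == word:
            window = tokens[max(0, i - context): i] + tokens[i + 1: i + context + 1]
            counts.update(t for t in window if t not in stop_words)
    return counts
```

A typical call chain would be `element_texts(html, "body")` to get the text, `tokenize` on that text, `top_words` to pick the N words, and then the remaining functions once per high frequency word.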
Ways of Using
- Enter a valid URL pointing to an html document in the URL field or enter a local path to upload html text
- Enter a valid html element name, the default is "body"
- Enter an integer in the "List top" words field
- Select how the words are listed. If you choose to apply stop words or to list only certain words, do not forget to enter the word list, unless you select the Glasgow stop words.
- Enter an integer in the text field in the line "List sentences that have n or more high frequency words"
- Use the only selection control in the form to choose the number of concordances (contexts) for each high frequency word, and fill in the text field that follows it with the context length in words
- Enter an integer in the text field in the line "List collocation within n words of the high frequency words"
- In the results panel, select how the listed words are sorted
- Select the display format. Currently only HTML is supported.
- Click the submit button
CGI Interface
If you want to use this tool from your web site, here is the CGI Interface:
(Note: You need to set the attribute enctype="multipart/form-data" in the form tag, because the tool was designed with local file uploading in mind.)
Here are the parameters:
| Parameter Name | Parameter Value | Control Type | Default | Description |
| source | url/local | radio button | url | Lets the user select the input text (either a URL or an uploaded local HTML text) |
| htmlurl | | text | | A valid URL pointing to an HTML document |
| localFile | | file | | The path to your local HTML text file |
| tagtext | | text | body | A valid HTML element tag, or several HTML tags separated by commas |
| numoftop | | text | 10 | The number of top high frequency words to display and to use in the functions that follow |
| range | all/pat/find/stop | radio button | stop | Indicate how the words will be listed |
| wordpat | | text | | Enter a pattern here if you want to list words that match this pattern. This parameter is associated with the value "pat" of the radio button named "range" |
| fromwhere | thislist/userfile/glasgow | radio button | glasgow | This variable is used when you select find/stop in the "range" selection |
| thislist | | text | | If you select "Type in", fill in this field with the words you want, or do not want, to list, delimited by commas |
| userfile | | file | | If you select "local file", browse to the file that contains the word list to enter its path here |
| numofword | | text | 2 | The minimum number of high frequency words a sentence must contain to be listed |
| concornum | first/three/all | selection | first | The number of concordances to display for each high frequency word |
| contLength | | text | 5 | The length of context in words for concordance (context) |
| collength | | text | 1 | The context length in words for the collocation of each high frequency word |
| sorting | 1/2/3/4 | selection | 2 | How the listed words are sorted. The values 1/2/3/4 mean alphabetical/frequency/first appearance/reverse alphabetical, respectively |
| display | 1 | selection | 1 | The format of output. Currently only HTML is implemented |
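As a rough illustration of how a client other than a browser might prepare a request with these parameters, here is a minimal Python sketch that fills in the table defaults and encodes the fields as multipart/form-data. The helper names `build_fields` and `encode_multipart` are hypothetical, and the endpoint is the one used in the form example on this page; whether it still answers is not guaranteed. The hidden flags that the form example also sends (highfre, sentence, concor, colloc, distrib, freetext, taporface) are not documented in the table, so they are left out here; the script may well require them.

```python
import uuid

# Endpoint copied from the form example on this page (may no longer respond).
CGI_URL = "http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/hsummarizer.cgi"

# Default values copied from the parameter table.
DEFAULTS = {
    "source": "url",
    "tagtext": "body",
    "numoftop": "10",
    "range": "stop",
    "fromwhere": "glasgow",
    "numofword": "2",
    "concornum": "first",
    "contLength": "5",
    "collength": "1",
    "sorting": "2",
    "display": "1",
}


def build_fields(htmlurl, **overrides):
    """Merge the table defaults with the page URL and any overrides."""
    unknown = set(overrides) - set(DEFAULTS)
    if unknown:
        raise ValueError(f"unknown parameters: {sorted(unknown)}")
    fields = dict(DEFAULTS, htmlurl=htmlurl)
    fields.update(overrides)
    return fields


def encode_multipart(fields, boundary=None):
    """Encode simple text fields as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = boundary or uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append(f"--{boundary}")
        lines.append(f'Content-Disposition: form-data; name="{name}"')
        lines.append("")
        lines.append(str(value))
    lines.append(f"--{boundary}--")
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"
```

The resulting body and content type could then be passed to `urllib.request.Request(CGI_URL, data=body, headers={"Content-Type": content_type})`. The sketch handles text fields only; uploading a local file through localFile or userfile would need a file part with a filename, which is not shown.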
Use Summarizer TAPoRware Tool in Your Web Page
You can add a button to your web page that summarizes the current page by calling the TAPoRware CGI script.
Here is the code for this function:
<form method="post" name="htmlForm" enctype="multipart/form-data" target="_blank" action="http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/hsummarizer.cgi" onsubmit="document.htmlForm.htmlurl.value=document.location.href">
<input type="hidden" name="source" value="url" />
<input type="hidden" name="htmlurl" />
<input type="hidden" name="freetext" value="yes" />
<input type="hidden" name="tagtext" value="body" />
<input type="hidden" name="highfre" value="true" />
<input type="hidden" name="numoftop" value="10" />
<input type="hidden" name="range" value="stop" />
<input type="hidden" name="fromwhere" value="glasgow" />
<input type="hidden" name="sentence" value="true" />
<input type="hidden" name="numofword" value="2" />
<input type="hidden" name="concor" value="true" />
<input type="hidden" name="concornum" value="three" />
<input type="hidden" name="contLength" value="5" />
<input type="hidden" name="colloc" value="true" />
<input type="hidden" name="collength" value="1" />
<input type="hidden" name="distrib" value="true" />
<input type="hidden" name="sorting" value="2" />
<input type="hidden" name="display" value="1" />
<input type="hidden" name="taporface" value="same" />
<input type="submit" name="doIt" value="Summarize This Page" />
</form>
Web Service Interface
Not implemented yet.
Known Bugs
To Do
--
MattPatey - 13 Oct 2005