Text Summarizer
See
http://taporware.mcmaster.ca/~taporware/textTools/summarizer.shtml
Description
This tool creates a summary of statistical information on a given document. It enables the user to select what types of information to display in the summary. Options include high frequency words, sentences with high frequency words, high frequency word context, collocation and element/text distribution.
Note: This tool treats XML and HTML as plain text with all tags stripped off, so some data, such as title and author, will be missed. If your text is XML or HTML, it is recommended to use the XML or HTML version of the summarizer.
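To make the note concrete, here is a minimal sketch of tag stripping using Python's standard `html.parser` module. How the tool itself strips tags is not documented here, so the class and function names (`TagStripper`, `strip_tags`) are illustrative assumptions only. It shows why structural information is lost: the title text survives, but nothing marks it as a title any more.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects only the character data of a document and drops every tag."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; the tags themselves are ignored.
        self.chunks.append(data)


def strip_tags(markup):
    """Return the text content of an HTML or XML string (hypothetical helper)."""
    parser = TagStripper()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return "".join(parser.chunks)


sample = (
    "<html><head><title>Moby Dick</title></head>"
    "<body><p>Call me <b>Ishmael</b>.</p></body></html>"
)
plain = strip_tags(sample)
print(plain)
```

After stripping, the words "Moby Dick" are ordinary text, which is why the XML or HTML version of the summarizer is the better choice when element information matters.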
Pseudocode
- Obtain text string by URL or from user's local disk.
- If the text format is XML or HTML, strip off all the tags.
- Get the first two non-empty paragraphs.
- Get the word statistics of the text.
- Obtain and list the top N high-frequency words based on user-specified criteria, where N is an integer.
- Split the text into sentences and obtain the sentences that contain at least n high-frequency words, where n is specified by the user.
- For each high-frequency word, get its concordance with a user-specified context length in words, then list the first m concordance entries for each word, where m is specified by the user.
- Get the collocations of each high-frequency word within the specified context length in words, and then list them.
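The steps above can be sketched in Python roughly as follows. This is an illustration under stated assumptions, not the tool's actual implementation: the tokenizer, the naive sentence splitter, the tiny stop list (standing in for the Glasgow stop words) and every function name are hypothetical, and "at least n high-frequency words" is read here as n distinct words.

```python
import re
from collections import Counter

# Tiny illustrative stop list; the real tool offers the Glasgow stop words.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "on"}


def tokenize(text):
    """Lower-case the text and return its word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def first_paragraphs(text, count=2):
    """Return the first `count` non-empty paragraphs (blank-line separated)."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return paragraphs[:count]


def top_words(text, n, stop_words=STOP_WORDS):
    """Return the n most frequent words that are not stop words."""
    counts = Counter(w for w in tokenize(text) if w not in stop_words)
    return [word for word, _ in counts.most_common(n)]


def sentences_with(text, high, min_hits):
    """Return sentences containing at least min_hits distinct words from high."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    wanted = set(high)
    return [s for s in sentences if len(set(tokenize(s)) & wanted) >= min_hits]


def concordance(text, word, context, limit=None):
    """Return keyword-in-context lines with `context` words on each side."""
    tokens = tokenize(text)
    hits = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = tokens[max(0, i - context):i]
            right = tokens[i + 1:i + 1 + context]
            hits.append(" ".join(left + [tok.upper()] + right))
            if limit is not None and len(hits) == limit:
                break
    return hits


def collocates(text, word, context):
    """Count the words found within `context` words of each occurrence of word."""
    tokens = tokenize(text)
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == word:
            counts.update(tokens[max(0, i - context):i])
            counts.update(tokens[i + 1:i + 1 + context])
    return counts


text = "The cat sat on the mat. The cat saw a dog. A dog barked at the cat!"
high = top_words(text, 2)
print(high)
print(sentences_with(text, high, 2))
print(concordance(text, "dog", context=1, limit=1))
print(collocates(text, "cat", context=1).most_common())
```

Here `limit` plays the role of m and `min_hits` the role of n in the pseudocode; passing `limit=None` lists every concordance entry.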
Ways of Using
- Enter a valid URL pointing to any text document in the URL field or enter a local path to upload the source text
- Enter an integer in the "List top... words" field
- Select how the words are listed, for example by excluding stop words or by listing only specified words. Do not forget to enter the word list if you do not select the Glasgow stop words.
- Enter an integer in the text field in the line "List sentences that have n or more high frequency words"
- Use the only selection control in the form to choose the number of concordance (context) entries for each high frequency word, and fill in the text field that follows it with the context length in words
- Enter an integer in the text field in the line "List collocation within n words of the high frequency words"
- In the results panel, select how the listed words are sorted
- Select the display format. Currently only HTML is supported.
- Click the submit button
CGI Interface
If you want to use this tool from your web site, here is the CGI Interface:
(Note: You need to use the attribute name/value pair enctype="multipart/form-data" within the form tag because the tool was designed with local file uploading in mind.)
Here are the parameters:
| Parameter Name | Parameter Value | Control Type | Default | Description |
| source | url/local | radio button | url | Lets the user select the input text source (either a URL or an uploaded local file) |
| texturl | | text | | A valid URL that points to any text document |
| localFile | | file | | The path to your local source text file |
| numoftop | | text | 10 | Number of top high frequency words to display and to use in the following functions |
| range | all/pat/find/stop | radio button | stop | Indicate how the words will be listed |
| wordpat | | text | | Enter a pattern here if you want to list words that match this pattern |
| fromwhere | thislist/userfile/glasgow | radio button | glasgow | This variable is used when you select find/stop in the "how to list" selection |
| thislist | | text | | If you select "Type in", fill in this field with the words you want, or do not want, to list, delimited by commas |
| userfile | | file | | If you select "local file", browse to the file that contains the word list to enter its path here |
| numofword | | text | 2 | Minimum number of high frequency words a sentence must contain to be listed |
| concornum | first/three/all | selection | first | Number of concordance entries to display for each high frequency word |
| contLength | | text | 5 | The length of context in words for concordance (context) |
| collength | | text | 1 | The context length in words for the collocation of each high frequency word |
| sorting | 1/2/3/4 | selection | 2 | How the listed words are sorted: 1 = alphabetically, 2 = by frequency, 3 = in order of first appearance, 4 = in reverse alphabetical order |
| display | 1 | selection | 1 | The format of output. Currently only HTML is implemented |
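The four `sorting` values in the table can be illustrated with a short Python sketch. The function name `sort_words` and the tie-breaking rule for equal frequencies (earlier first appearance wins) are assumptions made for the example; the table does not say how the tool breaks ties.

```python
from collections import Counter


def sort_words(words_in_text_order, sorting=2):
    """Return (word, count) pairs ordered by the given sorting code.

    1 = alphabetical, 2 = by frequency (default), 3 = first appearance,
    4 = reverse alphabetical.
    """
    counts = Counter(words_in_text_order)
    first_seen = {}
    for position, word in enumerate(words_in_text_order):
        first_seen.setdefault(word, position)

    if sorting == 1:
        ordered = sorted(counts)
    elif sorting == 2:
        # Assumed tie-break: words with equal counts keep text order.
        ordered = sorted(counts, key=lambda w: (-counts[w], first_seen[w]))
    elif sorting == 3:
        ordered = sorted(counts, key=first_seen.get)
    elif sorting == 4:
        ordered = sorted(counts, reverse=True)
    else:
        raise ValueError("sorting must be 1, 2, 3 or 4")
    return [(word, counts[word]) for word in ordered]


words = ["ant", "cow", "bee", "bee", "ant", "bee"]
for code in (1, 2, 3, 4):
    print(code, sort_words(words, code))
```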
Use Summarizer TAPoRware Tool in Your Web Page
You can add a button to your web page that summarizes the page by calling the TAPoRware CGI script.
Here is the code for this button:
<form method="post" name="textForm" enctype="multipart/form-data" target="_blank" action="http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/tsummarizer.cgi" onsubmit="document.textForm.texturl.value=document.location.href">
<input type="hidden" name="source" value="url" />
<input type="hidden" name="texturl" />
<input type="hidden" name="freetext" value="yes"/>
<input type="hidden" name="highfre" value="true">
<input type="hidden" name="numoftop" value="10">
<input type="hidden" name="range" value="stop" />
<input type="hidden" name="fromwhere" value="glasgow" />
<input type="hidden" name="sentence" value="true">
<input type="hidden" name="numofword" value="2" />
<input type="hidden" name="concor" value="true">
<input type="hidden" name="concornum" value="three" />
<input type="hidden" name="contLength" value="5" />
<input type="hidden" name="colloc" value="true">
<input type="hidden" name="collength" value="1" />
<input type="hidden" name="distrib" value="true"/>
<input type="hidden" name="sorting" value="2" />
<input type="hidden" name="display" value="1" />
<input type="hidden" name="taporface" value="same" />
<input type="submit" name="doIt" value="Summarize This Page" />
</form>
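For callers that are not a browser, the same submission can be prepared from a script. The following Python sketch builds a multipart/form-data request carrying the same field names and values as the form above. The helper names (`encode_multipart`, `build_request`) and the example page URL are hypothetical, and the sketch stops short of sending the request, since how the server responds is not covered here.

```python
import urllib.request
import uuid

CGI_URL = (
    "http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/tsummarizer.cgi"
)

# Field names and values copied from the HTML form above.
FORM_FIELDS = {
    "source": "url",
    "freetext": "yes",
    "highfre": "true",
    "numoftop": "10",
    "range": "stop",
    "fromwhere": "glasgow",
    "sentence": "true",
    "numofword": "2",
    "concor": "true",
    "concornum": "three",
    "contLength": "5",
    "colloc": "true",
    "collength": "1",
    "distrib": "true",
    "sorting": "2",
    "display": "1",
    "taporface": "same",
    "doIt": "Summarize This Page",
}


def encode_multipart(fields):
    """Encode simple text fields as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    lines = []
    for name, value in fields.items():
        lines.append(f"--{boundary}")
        lines.append(f'Content-Disposition: form-data; name="{name}"')
        lines.append("")
        lines.append(str(value))
    lines.append(f"--{boundary}--")
    lines.append("")
    body = "\r\n".join(lines).encode("utf-8")
    return body, f"multipart/form-data; boundary={boundary}"


def build_request(page_url):
    """Prepare (but do not send) a POST asking for a summary of page_url."""
    fields = dict(FORM_FIELDS, texturl=page_url)
    body, content_type = encode_multipart(fields)
    return urllib.request.Request(
        CGI_URL,
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )


request = build_request("http://example.org/essay.html")
print(request.get_method(), request.full_url)
# Passing `request` to urllib.request.urlopen() would submit it.
```

The `texturl` field is filled in by the caller here, which mirrors what the form's `onsubmit` handler does with `document.location.href`.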
Web Service Interface
TAPoRware provides web services to any non-profit organization. Here is the TAPoRware web services information:
(Note: the form layout is customized)
- Endpoint URL: http://taporware.mcmaster.ca:9982
- Service URI: http://taporware.mcmaster.ca/~taporware/webservice
- Service Method: summarizer_plain
- parameters:
- textInput -- any text string including xml, html and plain text. However, all tags will be stripped
- numoftop -- the number of top frequency words to be used in the following functions
- listOption -- indicate how the words are listed. The values are all/patt/find/stop(default). They stand for all words/words matching pattern/list user specified words only/not list user specified words
- optionSelection -- value depends on the selection of "listOption".
- numofhigh -- the number of top frequency words a sentence must contain to be listed
- numofstart -- how many concordance entries are displayed for each top frequency word. The values are first/three/all, meaning the first, the first three, or all concordance entries of each top word
- sorting -- how the listed words are sorted: 1 - alphabetically, 2 - by frequency, 3 - in the order of appearance, or 4 - in reversed alphabetic order
- outFormat -- result format; set it to "html" because it is the only format currently implemented
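As a rough aid for client code, the parameters above can be checked and assembled before a call to `summarizer_plain` is made. The Python sketch below is illustrative only: the function name, the argument order (taken from the list above), and the default values (borrowed from the CGI parameter table, except for `listOption`, whose default is stated above) are assumptions. The wire protocol is not described here, so the sketch deliberately leaves the transport out.

```python
LIST_OPTIONS = ("all", "patt", "find", "stop")
CONCORDANCE_CHOICES = ("first", "three", "all")
SORTING_CHOICES = (1, 2, 3, 4)


def summarizer_plain_args(text_input, option_selection, numoftop=10,
                          list_option="stop", numofhigh=2,
                          numofstart="first", sorting=2, out_format="html"):
    """Validate the documented parameters and return them as ordered pairs.

    Hypothetical helper: defaults other than list_option are borrowed from
    the CGI parameter table, not from the web service description.
    """
    if list_option not in LIST_OPTIONS:
        raise ValueError(f"listOption must be one of {LIST_OPTIONS}")
    if numofstart not in CONCORDANCE_CHOICES:
        raise ValueError(f"numofstart must be one of {CONCORDANCE_CHOICES}")
    if sorting not in SORTING_CHOICES:
        raise ValueError(f"sorting must be one of {SORTING_CHOICES}")
    if out_format != "html":
        raise ValueError('outFormat must be "html"')
    if int(numoftop) < 1 or int(numofhigh) < 1:
        raise ValueError("numoftop and numofhigh must be positive integers")

    return [
        ("textInput", text_input),
        ("numoftop", int(numoftop)),
        ("listOption", list_option),
        ("optionSelection", option_selection),
        ("numofhigh", int(numofhigh)),
        ("numofstart", numofstart),
        ("sorting", sorting),
        ("outFormat", out_format),
    ]


args = summarizer_plain_args("Call me Ishmael.", option_selection="glasgow")
for name, value in args:
    print(f"{name} = {value!r}")
```

The value passed as `option_selection` in the example ("glasgow") is itself an assumption carried over from the CGI form, since the accepted values are not listed above.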
Known Bugs
To Do
Update help
--
LianYan - 29 Mar 2007