Find Text — Collocation
See
http://taporware/~taporware/textTools/collocation.shtml
Description
This tool takes a word or pattern from the user and returns its collocates: the words that occur before and after it within a context of the chosen type (words, lines, sentences or paragraphs) and length. Results can be sorted alphabetically, by frequency, or by z-score (an indication of how far and in what direction an item deviates from its distribution's mean, expressed in units of that distribution's standard deviation).
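As a rough illustration of the z-score described above, the Python sketch below uses a binomial-style formulation that is common in collocation studies: the observed number of co-occurrences is compared with the number expected by chance, and the difference is divided by the standard deviation. This page does not say which exact formula TAPoRware applies, so the formula, the function name and the parameter names are all assumptions made for illustration.

```python
import math

def collocate_z_score(observed, collocate_freq, node_freq, total_words, span_size):
    """Hypothetical helper: z = (observed - expected) / standard deviation.

    observed       -- times the collocate appears inside the spans
    collocate_freq -- times the collocate appears in the whole text
    node_freq      -- times the target (node) word appears in the text
    total_words    -- total number of words in the text
    span_size      -- words of context taken around each occurrence
    """
    # Chance that any single non-node word is the collocate (assumed model)
    p = collocate_freq / (total_words - node_freq)
    # Total number of word positions covered by all spans
    trials = node_freq * span_size
    expected = trials * p
    std_dev = math.sqrt(trials * p * (1 - p))
    return (observed - expected) / std_dev
```

For example, with a target word occurring 10 times in a 1,010-word text, a span of 10 words per occurrence, and a collocate that occurs 100 times overall, 10 co-occurrences would be expected by chance. Observing 16 gives a z-score of 2, and observing 4 gives a z-score of -2.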
Pseudocode
- Obtain the text string from a URL or from the user's local disk. If the text format is html or xml, strip all the tags
- Find each occurrence of the user-specified word/pattern together with the user-specified context (the concordance)
- If the user selects "sorting by z-score", count the words in the spans and the total words in the text, and then calculate the z-score values
- Otherwise, sort and count the words of the concordance text
- Generate output of the collocates of the concordance text
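The steps above can be sketched in Python roughly as follows. This is a minimal sketch and not the TAPoRware implementation: it assumes a word-based context only, a very crude tag stripper, a literal target word rather than a pattern, and hypothetical function names (`find_collocates`, `sort_collocates`).

```python
import re
from collections import Counter

TAG_RE = re.compile(r"<[^>]+>")      # crude html/xml tag matcher (assumption)
WORD_RE = re.compile(r"[A-Za-z']+")  # simple word tokenizer (assumption)

def find_collocates(text, target, context_length=5, stop_words=frozenset()):
    """Count words within context_length words of each occurrence of target."""
    # Strip tags, then split the remaining text into lowercase words
    words = [w.lower() for w in WORD_RE.findall(TAG_RE.sub(" ", text))]
    target = target.lower()
    counts = Counter()
    for i, word in enumerate(words):
        if word != target:
            continue
        # Words before and after this occurrence (the concordance span)
        before = words[max(0, i - context_length):i]
        after = words[i + 1:i + 1 + context_length]
        # Skip stop words and further occurrences of the target itself
        counts.update(w for w in before + after
                      if w not in stop_words and w != target)
    return counts

def sort_collocates(counts, by="frequency"):
    """Return (word, count) pairs sorted alphabetically or by frequency."""
    if by == "alphabetical":
        return sorted(counts.items())
    # Highest count first; ties are broken alphabetically
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))
```

Usage: `sort_collocates(find_collocates(page_text, "cat", 2, {"the", "a"}))` returns the collocates of "cat" within two words on either side, most frequent first.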
Ways of Using
- Enter a valid URL in the URL field or enter a local path to upload the source text (can be xml, html or plain text format)
- Enter a word or pattern in Word/pattern field
- If you don't want to exclude the Glasgow stop words, uncheck the "Exclude Glasgow Stop Words" box
- Select the context of concordance and the length of context
- Select the collocates sorting criteria
- Select output format
- If you want the results displayed in the same window with the taporware interface, uncheck the "Open results in new window" check box
- Finally, click the "Submit" button
Developing a Theme
This tool can be used to develop a theme. The collocates of a word (or words) that you think are typical of a theme might suggest other words associated with the theme.
Semantic Field
The collocates for a target word or pattern might help you identify which words are associated with the target. Be careful: some collocates may not be in the same sentence as the target, and some may be collocates only because they are high-frequency words that turn up near the target by chance.
CGI Interface
If you want to use this tool from your web site, here is the CGI Interface:
(Note: If you want to upload local html text to the tool, you need to use the attribute name/value pair enctype="multipart/form-data" within the form tag.)
Here are the parameters:
| Parameter Name | Parameter Value | Control Type | Default | Description |
| source | url/local | radio button | url | Let user select input text (either a url or a local path to upload text source) |
| texturl | | text | | A valid URL pointing to plain text, html or xml text |
| localFile | | file | | The path to your local text file |
| find_pattern | | text | | key word/pattern of the concordance |
| glasgow | | checkbox | checked | Indicate if Glasgow stop words should be excluded |
| context | Word/Line/Sentence/Paragraph | selection | Word | context type |
| contLength | | text | 5 | context length corresponding to the selected context |
| sorting | 1/2/3 | selection | 1 | Sorting of co-occurring words: by frequency, alphabetically, or by z-score, corresponding to values 1, 2 and 3 in that order |
| HowToList | 1/2/3/4 | selection | 1 | Display format: HTML, XML text in HTML, XML tree, or tab-delimited text, corresponding to values 1 to 4 in that order |
| taporface | | checkbox | checked | Display results in a new window without the graphical interface (default), or in the same window with the taporware interface |
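For callers who prefer a script to a web form, the parameters in the table can be assembled into a POST request. The Python sketch below only builds the request and does not send it. It is a sketch under assumptions: the script address and the "true" value for the glasgow checkbox are taken from the sample form on this page, it is assumed that the script accepts a URL-encoded body when the source is a URL, and the function name is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Script address as it appears in the sample form on this page
CGI_URL = "http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/tcollocation.cgi"

VALID_CONTEXTS = ("Word", "Line", "Sentence", "Paragraph")

def build_collocation_request(text_url, pattern, context="Word",
                              cont_length=5, sorting=1, how_to_list=1):
    """Hypothetical helper: build (but do not send) a POST request."""
    if context not in VALID_CONTEXTS:
        raise ValueError("context must be one of %s" % (VALID_CONTEXTS,))
    if sorting not in (1, 2, 3):
        raise ValueError("sorting must be 1, 2 or 3")
    if how_to_list not in (1, 2, 3, 4):
        raise ValueError("HowToList must be 1, 2, 3 or 4")
    fields = {
        "source": "url",           # read the text from a URL, not an upload
        "texturl": text_url,
        "find_pattern": pattern,
        "glasgow": "true",         # assumed checkbox value, as in the sample form
        "context": context,
        "contLength": str(cont_length),
        "sorting": str(sorting),   # 1 frequency, 2 alphabetical, 3 z-score
        "HowToList": str(how_to_list),
    }
    body = urlencode(fields).encode("ascii")
    return Request(CGI_URL, data=body, method="POST")
```

The returned object can be passed to `urllib.request.urlopen` to submit it. Uploading a local file would need a multipart/form-data body instead, which this sketch does not cover.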
Use Find Text -- Collocation TAPoRware Tool in Your Web Page
You can add a text field and a button to your web page to get the collocates of a pattern in that page by calling the TAPoRware cgi script.
Here is the code that you can cut and paste into your web pages:
<form method="post" name="textForm" enctype="multipart/form-data" target="_blank" action="http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/tcollocation.cgi" onsubmit="document.textForm.texturl.value=document.location.href">
<input type="hidden" name="source" value="url" />
<input type="hidden" name="texturl" />
<input type="hidden" name="freetext" value="yes"/>
Pattern: <input type="text" name="find_pattern" />
<input type="hidden" name="glasgow" value="true" />
<input type="hidden" name="context" value="Word" />
<input type="hidden" name="contLength" value="5" />
<input type="hidden" name="sorting" value="3" />
<input type="hidden" name="HowToList" value="1" />
<input type="hidden" name="taporface" value="same" />
<input type="submit" name="doIt" value="Get Collocates of the Page" />
</form>
Web Service Interface
Taporware provides web services to any non-profit organization. Here is the taporware web services information:
- Endpoint URL: http://taporware.mcmaster.ca:9982
- Service URI: http://taporware.mcmaster.ca/~taporware/webservice
- Service Method: find_Collocation_Plain
- Parameters:
  - textInput -- any text string. If the text format is html or xml, all the tags will be stripped
  - pattern -- Unix-style pattern or regular expression
  - context -- value can be 1/2/3/4, corresponding to Words/Lines/Sentences/Paragraphs respectively
  - contextLength -- length of context
  - sorting -- values can be 1/2/3, corresponding to sorting co-occurrence words by frequency/alphabetically/z-score
  - outFormat -- values are html/xml/others, where "others" will generate XML text in HTML
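Because context and sorting are passed as numeric codes, a small helper that maps readable names to those codes can make calls less error-prone. The Python sketch below only assembles and checks the argument values in the order listed above. It does not contact the service, since this page does not state the wire protocol, and the function and dictionary names are hypothetical.

```python
# Numeric codes as listed in the parameter descriptions above
CONTEXT_CODES = {"words": 1, "lines": 2, "sentences": 3, "paragraphs": 4}
SORTING_CODES = {"frequency": 1, "alphabetical": 2, "z-score": 3}

def collocation_service_args(text_input, pattern, context="words",
                             context_length=5, sorting="frequency",
                             out_format="xml"):
    """Hypothetical helper: arguments for find_Collocation_Plain, in order."""
    if context not in CONTEXT_CODES:
        raise ValueError("unknown context: %r" % context)
    if sorting not in SORTING_CODES:
        raise ValueError("unknown sorting: %r" % sorting)
    if context_length < 1:
        raise ValueError("contextLength must be at least 1")
    return {
        "textInput": text_input,
        "pattern": pattern,
        "context": CONTEXT_CODES[context],
        "contextLength": context_length,
        "sorting": SORTING_CODES[sorting],
        "outFormat": out_format,
    }
```

The resulting dictionary keeps the parameters in the documented order, so its values can be handed to whichever client library is used to call the service method.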
Known Bugs
To Do
- We need to include the word or pattern for which the collocates are found.
- It would be nice to offer the same stop list options as in the list words tools.
- It would be nice to allow people to specify how many words before and after they want.
--
GeoffreyRockwell - 19 May 2005