Principal Components Analysis
See
http://taporware.mcmaster.ca/~taporware/betaTools/pca.shtml
Description
This tool applies Principal Components Analysis (PCA) to a text, generating relationships between word distributions and text units.
Generally, a corpus can be divided into many text units based on subtitles, paragraphs, number of words, etc. The word distribution of each text unit can be generated easily, but the usage (or similarity) of a group (cluster) of words across all the text units cannot be easily measured.
The Principal Components Analysis tool applies the corresponding mathematics to reduce the dimensions of the text units to 2 or 3, so that the results can be displayed graphically. Words with similar usage are displayed close to each other, forming a cluster; words with dissimilar usage are displayed far apart.
Note 1: This tool works better with large text units (i.e. at least 500 words).
Note 2: The total number of units should be less than 150. You can use the List Words tool to determine the total number of words.
Note 3: Paragraphs are determined by multiple line breaks, so some paragraphs may be very short (e.g. titles). In this case, it is better to use a subtext unit other than paragraphs.
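To illustrate Note 3, here is a minimal Python sketch of two ways of cutting a text into units. It is not the tool's own code: the function names `split_paragraphs` and `split_word_chunks`, the regular expression, and the sample text are all assumptions made for illustration.

```python
import re

def split_paragraphs(text):
    """Split on blank lines (two or more consecutive line breaks)."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

def split_word_chunks(text, chunk_size=100):
    """Split into consecutive chunks of at most chunk_size words."""
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]

# Hypothetical sample: the title becomes a two-word "paragraph".
sample = "A Title\n\nFirst paragraph with several words in it.\n\n\nSecond paragraph."
paragraphs = split_paragraphs(sample)
chunks = split_word_chunks(sample, chunk_size=5)

print([len(p.split()) for p in paragraphs])  # word count per paragraph unit
print([len(c.split()) for c in chunks])      # word count per chunk unit
```

Paragraph units vary widely in length, while word chunks are all the same size except possibly the last one, which is why chunks or percentages tend to be the safer choice for texts with many short paragraphs.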
Pseudocode
- Obtain the source text by URL or from the user's local disk. If the text format is XML or HTML, strip off all the tags
- Chop the text into subtext units based on the user's choice, and generate the word list for each subtext unit
- Get the top-frequency words in the entire text (the number is specified by the user, and stop words may be excluded), and generate the count matrix of these top words for each subtext unit
- Calculate the mean matrix of the word count matrix
- Generate the covariance and correlation coefficient matrices based on the mean matrix
- Generate the eigenvectors and eigenvalues of the covariance matrix
- Generate m x 2 and m x 3 matrices from the eigenvectors above, where m is the number of top-frequency words
- Obtain the final matrices by transposing each of the two matrices above and multiplying it by the mean matrix
- Generate the result based on the user's selection -- a Java applet or table format
- Note: for the Java applet, only the 3-dimensional graph is generated
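The numerical core of the steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the tool's implementation: the word counts are invented, rows are taken to be words and columns to be text units (so that each word becomes one point, as in the Description), the correlation matrix step is skipped, and power iteration with deflation stands in for whatever eigen-solver the tool actually uses. All function names are hypothetical.

```python
import math

def center_columns(rows):
    """Subtract each column's mean so every text unit has mean zero."""
    n_rows, n_cols = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n_rows for j in range(n_cols)]
    return [[row[j] - means[j] for j in range(n_cols)] for row in rows]

def covariance(centered):
    """Sample covariance between columns (text units)."""
    n_rows, n_cols = len(centered), len(centered[0])
    return [[sum(row[i] * row[j] for row in centered) / (n_rows - 1)
             for j in range(n_cols)] for i in range(n_cols)]

def mat_vec(matrix, vec):
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec

def top_eigenvectors(matrix, k, iterations=500):
    """Approximate the k leading eigenvectors of a symmetric matrix
    by power iteration, deflating the matrix after each one."""
    size = len(matrix)
    work = [row[:] for row in matrix]
    vectors = []
    for index in range(k):
        # Arbitrary non-zero starting vector.
        vec = normalize([1.0 / (i + 1 + index) for i in range(size)])
        for _ in range(iterations):
            vec = normalize(mat_vec(work, vec))
        # Rayleigh quotient gives the matching eigenvalue estimate.
        value = sum(a * b for a, b in zip(vec, mat_vec(work, vec)))
        vectors.append(vec)
        # Remove the component just found before looking for the next.
        work = [[work[i][j] - value * vec[i] * vec[j] for j in range(size)]
                for i in range(size)]
    return vectors

# Invented counts: one row per word, one column per text unit.
words = ["sea", "ship", "love", "heart"]
counts = [[4, 5, 0, 1],
          [3, 5, 1, 0],
          [0, 1, 5, 4],
          [1, 0, 4, 5]]

centered = center_columns(counts)
axes = top_eigenvectors(covariance(centered), k=2)

# Project every word onto the two leading axes to get 2-D coordinates.
scores = {word: [sum(a * b for a, b in zip(row, axis)) for axis in axes]
          for word, row in zip(words, centered)}

for word, point in scores.items():
    print(word, [round(x, 2) for x in point])
```

In this toy data "sea" and "ship" occur in the same units, as do "love" and "heart", so the two pairs land on opposite sides of the first axis. The sign of each axis is arbitrary, so only relative positions are meaningful.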
Ways of Using
- Enter a URL pointing to the source text or browse (enter) the local text file path in the corresponding field
- Select the subtext unit, then enter or select a value in the related field, or use the default value
- Select the number of top-frequency words you want to investigate
- Check to include or exclude Glasgow stop words as you wish. You can also enter your own stop word list in the specified field
- Select output format: HTML or Java applet graphics
- Click the submit button
CGI Interface
If you want to use this tool from your web site, here is the CGI Interface:
(Note: If you want to upload local HTML text to the tool, you need to use the attribute name/value pair enctype="multipart/form-data" within the form tag.)
Here are the parameters:
| Parameter Name | Parameter Value | Control Type | Default | Description |
| source | url/local | radio button | url | Let user select input text (either a url or a local path to upload text source) |
| texturl | | text | | A Valid URL pointing to any plain text, html or xml text |
| localFile | | file | | The path to your local text file |
| pcatext | 1/2/3 | radio button | 2 | subtext unit corresponding to paragraph/percent of characters/chunk of words respectively |
| percent | 5/10/20/50 | selection | 10 | the percentage to use when "percent of characters" is selected in "pcatext" |
| chunk | | text | 100 | the chunk size to use when "chunk of words" is selected in "pcatext" |
| numword | 0/5/10/20/50/100 | selection | 10 | number of top frequency words. If you select "0", the following field must be entered with a list of words |
| glasgow | | checkbox | checked | check it to exclude glasgow stop words |
| userlist | | text | | enter your own stop list here, separated by comma. If you check to exclude glasgow stop words and enter your own stop words here, they will be combined to form a new stop word list |
| disp | 1/2 | selection | 2 | the output format. 1 -- HTML, 2 -- java applet graphics |
| taporface | | checkbox | checked | display result in a new window without graphics interface (default) or with taporware interface in the same window |
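As a rough illustration of how the parameters in the table fit together, the following Python sketch builds (but does not send) a POST request for the CGI script. It rests on several assumptions: the helper name `build_pca_request` and the example.org URL are hypothetical, the value "Y" for `glasgow` is copied from the form shown on this page, and the script is assumed to accept URL-encoded bodies when no local file is uploaded.

```python
from urllib.parse import urlencode
from urllib.request import Request

# CGI script address as given in the form on this page.
PCA_CGI_URL = "http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/tpcacluster.cgi"

def build_pca_request(text_url, num_words=10, percent=10, output_format=1):
    """Build a POST request that analyses the text found at text_url."""
    params = {
        "source": "url",             # read the text from a URL, not an upload
        "texturl": text_url,
        "pcatext": "2",              # 2 = split by percent of characters
        "percent": str(percent),
        "numword": str(num_words),   # number of top-frequency words
        "glasgow": "Y",              # assumed value: exclude Glasgow stop words
        "disp": str(output_format),  # 1 = HTML, 2 = Java applet graphics
    }
    body = urlencode(params).encode("ascii")
    return Request(PCA_CGI_URL, data=body, method="POST")

# Hypothetical source text address.
request = build_pca_request("http://example.org/sample.txt")
print(request.get_method(), request.full_url)
print(request.data.decode("ascii"))
```

Passing the request to `urllib.request.urlopen` would submit it. Uploading a local file through `localFile` would need a multipart/form-data body instead, as noted above.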
Use PCA Tool in Your Web Page
You can add a button to your web page that gets the PCA results for the current page by calling the TAPoRware CGI script.
Here is the code for such a PCA button:
<form method="post" name="textForm" enctype="multipart/form-data" target="_blank" action="http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/tpcacluster.cgi" onsubmit="document.textForm.texturl.value=document.location.href">
<input type="hidden" name="source" value="url" />
<input type="hidden" name="texturl" />
<input type="hidden" name="freetext" value="Y"/>
<input type="hidden" name="pcatext" value="2" />
<input type="hidden" name="percent" value="10" />
<input type="hidden" name="numword" value="30" />
<input type="hidden" name="glasgow" value="Y" />
<input type="hidden" name="disp" value="2" />
<input type="submit" value="Get PCA" />
</form>
Web Service Interface
TAPoRware provides web services to any non-profit organization. Here is the TAPoRware web services information:
- Endpoint URL: http://taporware.mcmaster.ca:9982
- Service URI: http://taporware.mcmaster.ca/~taporware/webservice
- Service Method: plainPCA
- Parameters:
  - textSource -- any text string. If the text format is HTML or XML, all the tags will be stripped
  - suboption -- subtext unit selection; the values 1/2/3 correspond to paragraph/percent of characters/chunk of text in words
  - percent -- this selection is related to the choice of "percent of characters" in the suboption parameter
  - chunk -- this text field is related to the choice of "chunk of text in words" in the suboption parameter
  - topWords -- number of top-frequency words (may or may not exclude stop words, see below) to be investigated
  - glasgow -- a boolean value (NO) indicating whether to exclude the Glasgow stop words from the top-frequency word list
  - userwords -- a text field where the user can enter his or her own stop words (separated by commas). This list will be combined with the Glasgow stop list if you select it
  - outFormat -- the values 1/2 correspond to HTML and Java applet respectively
Known Bugs
To Do
Write help and walkthrough
--
LianYan - 19 Jun 2006