|
Principal Components Analysis
See http://taporware.mcmaster.ca/~taporware/betaTools/pca.shtml
|
|
Description
This tool applies Principal Components Analysis to a text, generating relationships between word distributions and text units.
|
Generally, a corpus can be divided into many text units based on subtitles, paragraphs, number of words, etc. The word distribution of each text unit can be generated easily, but the usage (or similarity) of a group (cluster) of words across all the text units cannot be easily measured.
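As a rough illustration of "the word distribution of each text unit", the sketch below splits a text into fixed-size chunks of words and counts the top-frequency words in each chunk. The function name `count_matrix`, the tokenising regular expression, and the default values are assumptions made for this example; the tool's own tokenisation and stop word handling are not described in the source.

```python
import re
from collections import Counter

def count_matrix(text, chunk_size=100, top_n=10, stop_words=()):
    """Return (top_words, matrix), where matrix[i][j] is the count of
    top_words[i] in text unit j. Illustrative only, not the tool's code."""
    # Assumed tokenisation: lower-case runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Split the word sequence into units of chunk_size words each.
    units = [words[i:i + chunk_size] for i in range(0, len(words), chunk_size)]
    stop = set(stop_words)
    # Top-frequency words over the entire text, excluding stop words.
    totals = Counter(w for w in words if w not in stop)
    top_words = [w for w, _ in totals.most_common(top_n)]
    # One Counter per text unit, then one row per top word.
    unit_counts = [Counter(unit) for unit in units]
    matrix = [[uc[w] for uc in unit_counts] for w in top_words]
    return top_words, matrix
```

Each row of the resulting matrix describes how one word is spread over the text units, which is the kind of input a principal components analysis works on.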
|
|
|
The Principal Components Analysis tool applies the corresponding mathematics to reduce the dimension of the text units to 2 or 3, so that the results can be displayed graphically. Words with similar usage are displayed close to each other and form a cluster; words with dissimilar usage are displayed far apart.
|
|
Note 1: This tool works better with large text units (i.e. at least 500 words).
|
|
- Select the subtext unit, then enter or select a value in the related field, or use the default value
- Select the number of top-frequency words you want to investigate
- Check or uncheck the box to exclude or include Glasgow stop words. You can also enter your own stop word list in the specified field
|
- Select output format: HTML or Java applet graphics
|
|
CGI Interface
|
|
| numword | 0/5/10/20/50/100 | selection | 10 | number of top-frequency words. If you select "0", the following field must be filled in with a list of words |
| glasgow | | checkbox | checked | check it to exclude Glasgow stop words |
| userlist | | text | | enter your own stop list here, separated by commas. If you check the box to exclude Glasgow stop words and also enter your own stop words here, the two lists are combined into a new stop word list |
|
| disp | 1/2 | selection | 2 | the output format. 1 -- HTML, 2 -- Java applet graphics |
|
|
| taporface | | checkbox | checked | display result in a new window without graphics interface (default) or with taporware interface in the same window |
Use PCA Tool in Your Web Page
|
|
|
|
|
< < |
- Service Method: find_Collocation_Plain
|
> > |
|
|
|
< < |
-
- textInput -- any text string. If the text format is html or xml, all the tags will be stripped
- pattern -- unix styled pattern or regular expression
- context -- value can be 1/2/3/4 which corresponding to Words/Lines/Sentences/Paragraphs respectively
- contextLength -- length of context
- sorting -- values can be 1/2/3 corresponding to co-occurrence words by frequency/alphabetically/z-score
- outFormat -- values are html/xml/others where others format will "generate XML text in HTML"
|
> > |
-
- textSource -- any text string. If the text format is HTML or XML, all the tags will be stripped
- suboption -- subtext unit selection; the values 1/2/3 correspond to paragraph/percent of characters/chunk of text in words
- percent -- used when "percent of characters" is chosen in the suboption parameter
- chunk -- used when "chunk of text in words" is chosen in the suboption parameter
- topWords -- number of top-frequency words (may or may not exclude stop words, see below) to be investigated
- glasgow -- a boolean value (NO) to exclude the Glasgow stop words from the top-frequency word list
- userwords -- a text field where the user enters his or her own stop words (separated by commas). This list will be combined with the Glasgow stop list if you select it
- outFormat -- the values 1/2 correspond to HTML and Java applet respectively
|
|
Known Bugs
|
|
|
|
Generally, a corpus can be divided into many text units based on subtitles, paragraphs, number of words, etc. The word distribution of each text unit can be generated easily, but the usage (or similarity) of a group (cluster) of words across all the text units cannot be easily measured.
|
The Principal Components Analysis tool applies the corresponding mathematics to reduce the dimension of the text units to 2 or 3, so that the results can be displayed graphically. Words with similar usage are displayed close to each other and form a cluster; words with dissimilar usage are displayed far apart.
|
|
Note 1: This tool works better with large text units (i.e. at least 500 words).
|
|
Note 3: Paragraphs are determined by multiple line breaks, so some paragraphs may be very short (e.g. titles). In this case, it is better to use a subtext unit other than paragraphs.
|
< < |
History
|
|
Pseudocode
|
> > |
- Obtain the source text by URL or from the user's local disk. If the text format is XML or HTML, strip off all the tags
- Chop the text into subtext units based on the user's choice, and generate the word list of each subtext unit
- Get the top-frequency words the user specified from the entire text, which may exclude stop words, and generate the top-word count matrix over the subtext units
- Calculate the mean matrix of the word count matrix
- Generate the covariance matrix and the correlation coefficient matrix based on the mean matrix
- Generate the eigenvectors and eigenvalues of the covariance matrix
- Generate an m x 2 and an m x 3 matrix from the eigenvectors above, where m is the number of top-frequency words
- Obtain the final matrices by transposing each of the two matrices above and multiplying it by the mean matrix
- Generate the result based on the user's selection -- a Java applet or table format
- Note: for the Java applet, only the 3-dimensional graph is generated
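The numerical steps above can be sketched with NumPy as follows. This is a minimal sketch under stated assumptions, not the tool's actual implementation: it assumes the count matrix has one row per top-frequency word (m rows) and one column per subtext unit, and it reads "mean matrix" as the count matrix with each word's mean subtracted. The function name `pca_sketch` is hypothetical. The source does not say whether the tool plots the eigenvector matrix or the projected matrix, so the sketch returns both, and it skips the correlation coefficient matrix because the later steps only use the covariance matrix.

```python
import numpy as np

def pca_sketch(counts, k=3):
    """counts: m x n array (m top-frequency words, n subtext units).
    Returns (eigenvalues, m x k eigenvector matrix, k x n projection)."""
    x = np.asarray(counts, dtype=float)
    # Assumed reading of "mean matrix": subtract each word's mean count.
    centered = x - x.mean(axis=1, keepdims=True)
    # m x m covariance between words (rows are treated as variables).
    cov = np.cov(centered)
    # eigh is meant for symmetric matrices; eigenvalues come back ascending.
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Keep the k eigenvectors with the largest eigenvalues.
    order = np.argsort(eigvals)[::-1][:k]
    loadings = eigvecs[:, order]      # m x k
    # Transpose the eigenvector matrix and multiply by the centered matrix.
    scores = loadings.T @ centered    # k x n
    return eigvals[order], loadings, scores
```

Calling `pca_sketch(counts, k=2)` gives the 2-dimensional variant and `k=3` the 3-dimensional one that the Java applet output would correspond to.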
|
|
Ways of Using
|
> > |
- Enter a URL pointing to the source text, or browse to (or enter) the local text file path in the corresponding field
- Select the subtext unit, then enter or select a value in the related field, or use the default value
- Select the number of top-frequency words you want to investigate
- Check or uncheck the box to exclude or include Glasgow stop words. You can also enter your own stop word list in the specified field
- Select the output format: HTML or Java applet graphics
- Click the submit button
|
|
CGI Interface
|
> > |
If you want to use this tool from your web site, here is the CGI Interface:
(Note: If you want to upload local HTML text to the tool, you need to use the attribute name/value pair enctype="multipart/form-data" within the form tag)
Here are the parameters:
| Parameter Name | Parameter Value | Control Type | Default | Description |
| source | url/local | radio button | url | lets the user select the input text (either a URL or a local path to upload the text source) |
| texturl | | text | | a valid URL pointing to any plain text, HTML or XML text |
| localFile | | file | | the path to your local text file |
| pcatext | 1/2/3 | radio button | 2 | subtext unit, corresponding to paragraph/percent of characters/chunk of words respectively |
| percent | 5/10/20/50 | selection | 10 | used when "percent of characters" is selected in "pcatext" |
| chunk | | text | 100 | used when "chunk of words" is selected in "pcatext" |
| numword | 0/5/10/20/50/100 | selection | 10 | number of top-frequency words. If you select "0", the following field must be filled in with a list of words |
| taporface | | checkbox | checked | display result in a new window without graphics interface (default) or with taporware interface in the same window |
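To show how the parameters in the table fit together, here is a small sketch that assembles them into a URL-encoded query string. The source does not give the form's action URL or say whether the CGI script accepts GET requests, so `CGI_URL` is a hypothetical placeholder and the function `build_pca_query` is illustrative only. The allowed values are taken from the table above.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the real action URL is not given in the source.
CGI_URL = "http://example.org/cgi-bin/pca"

def build_pca_query(text_url, pcatext=2, percent=10, chunk=100, numword=10):
    """Build a query URL from the documented parameter names and defaults."""
    # Validate against the values listed in the parameter table.
    if pcatext not in (1, 2, 3):
        raise ValueError("pcatext must be 1, 2 or 3")
    if percent not in (5, 10, 20, 50):
        raise ValueError("percent must be 5, 10, 20 or 50")
    if numword not in (0, 5, 10, 20, 50, 100):
        raise ValueError("numword must be 0, 5, 10, 20, 50 or 100")
    params = {
        "source": "url",      # input comes from a URL, not a file upload
        "texturl": text_url,
        "pcatext": pcatext,
        "percent": percent,
        "chunk": chunk,
        "numword": numword,
    }
    return CGI_URL + "?" + urlencode(params)
```

Uploading a local file would instead need a multipart/form-data POST, as the note above the table says, which this sketch does not cover.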
|
|
Web Service Interface
|
> > |
Taporware provides web services to any non-profit organization. Here is the Taporware web services information:
- Endpoint URL: http://taporware.mcmaster.ca:9982
- Service URI: http://taporware.mcmaster.ca/~taporware/webservice
- Service Method: find_Collocation_Plain
- parameters:
- textInput -- any text string. If the text format is HTML or XML, all the tags will be stripped
- pattern -- Unix-style pattern or regular expression
- context -- the values 1/2/3/4 correspond to Words/Lines/Sentences/Paragraphs respectively
- contextLength -- length of context
- sorting -- the values 1/2/3 correspond to sorting co-occurrence words by frequency/alphabetically/z-score
- outFormat -- values are html/xml/others, where the others format will "generate XML text in HTML"
|
|
Known Bugs
To Do
|