Main.TAPoRwarePCA (r1.1 vs. r1.6)
Diffs

 <<O>>  Difference Topic TAPoRwarePCA (r1.6 - 29 Apr 2010 - KamalRanaweera)

META TOPICPARENT TAPoRware

Principal Components Analysis

See http://taporware.mcmaster.ca/~taporware/betaTools/pca.shtml
Line: 8 to 8

Description

This tool applies Principal Components Analysis to a text, generating relationships between word distributions and text units.
Changed:
<
<
Generally, one corpus could be divided into many text units based on subtitles, paragraphs, number of words etc. The word distribition of each text unit can be generated easily, but the usage (or similarility) of group (clustter) of word in all the text units can not be easily measured.
>
>
Generally, a corpus can be divided into many text units based on subtitles, paragraphs, number of words, etc. The word distribution of each text unit can be generated easily, but the usage (or similarity) of a group (cluster) of words across all the text units cannot be easily measured.

Changed:
<
<
The Principal Components Analysis tool applies the corresponding mathematics to reduce the dimension of the text units into 2 or 3, so that the results can be displayed graphically. The words with similar usage will be displayed close to each other to form a cluster, otherwiae, they are diaplayed apart away.
>
>
The Principal Components Analysis tool applies the corresponding mathematics to reduce the dimension of the text units to 2 or 3, so that the results can be displayed graphically. Words with similar usage are displayed close to each other to form a cluster; otherwise, they are displayed far apart.

Note 1: This tool works better with large text units (i.e. at least 500 words).

Line: 35 to 35

  • Select the subtext unit, then enter or select the related field or use the default value
  • Select the number of top-frequency words you want to investigate
  • Check to include or exclude Glasgow stop words as you wish. You can also enter your own stop word list in the specified field
Changed:
<
<
  • Select output format: HTML or java applet graphics
>
>
  • Select output format: HTML or Java applet graphics

  • Click the submit button

CGI Interface

Line: 56 to 56

numword 0/5/10/20/50/100 selection 10 number of top-frequency words. If you select "0", the following field must be filled in with a list of words
glasgow   checkbox checked check it to exclude Glasgow stop words
userlist   text   enter your own stop list here, separated by commas. If you check to exclude Glasgow stop words and enter your own stop words here, they will be combined to form a new stop word list
Changed:
<
<
disp 1/2 selection 2 the output fromat. 1 -- HTML, 2 -- java applet graphics
>
>
disp 1/2 selection 2 the output format. 1 -- HTML, 2 -- Java applet graphics

taporface   checkbox checked display result in a new window without graphics interface (default) or with taporware interface in the same window

Use PCA Tool in Your Web Page


 <<O>>  Difference Topic TAPoRwarePCA (r1.5 - 13 Jun 2008 - LianYan)

Line: 59 to 59

disp 1/2 selection 2 the output fromat. 1 -- HTML, 2 -- java applet graphics
taporface   checkbox checked display result in a new window without graphics interface (default) or with taporware interface in the same window
Added:
>
>

Use PCA Tool in Your Web Page

You can add a button to your web page that gets the PCA results for the current page by calling the TAPoRware CGI script.

Here is the code for such a PCA button:

<form method="post" name="textForm" enctype="multipart/form-data" target="_blank" action="http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/tpcacluster.cgi" onsubmit="document.textForm.texturl.value=document.location.href">

<input type="hidden" name="source" value="url" />

<input type="hidden" name="texturl" />

<input type="hidden" name="freetext" value="Y"/>

<input type="hidden" name="pcatext" value="2" />

<input type="hidden" name="percent" value="10" />

<input type="hidden" name="numword" value="30" />

<input type="hidden" name="glasgow" value="Y" />

<input type="hidden" name="disp" value="2" />

<input type="submit" value="Get PCA" />

</form>


Web Service Interface

TAPoRware provides web services to any non-profit organization. Here is the TAPoRware web services information:


 <<O>>  Difference Topic TAPoRwarePCA (r1.4 - 04 Apr 2007 - LianYan)

Line: 65 to 65

Changed:
<
<
  • Service Method: find_Collocation_Plain
>
>
  • Service Method: plainPCA

  • parameters:
Changed:
<
<
    • textInput -- any text string. If the text format is html or xml, all the tags will be stripped
    • pattern -- unix styled pattern or regular expression
    • context -- value can be 1/2/3/4 which corresponding to Words/Lines/Sentences/Paragraphs respectively
    • contextLength -- length of context
    • sorting -- values can be 1/2/3 corresponding to co-occurrence words by frequency/alphabetically/z-score
    • outFormat -- values are html/xml/others where others format will "generate XML text in HTML"
>
>
    • textSource -- any text string. If the text format is HTML or XML, all the tags will be stripped
    • suboption -- subtext unit selection; the values 1/2/3 correspond to paragraph/percent of characters/chunk of text in words
    • percent -- this selection is related to the choice of "percent of characters" in the suboption parameter
    • chunk -- this text field is related to the choice of "chunk of text in words" in the suboption parameter
    • topWords -- number of top-frequency words (may or may not exclude stop words, see below) to be investigated
    • glasgow -- a boolean value (NO) to exclude the Glasgow stop words in the top-frequency word list
    • userwords -- a text field for the user's own stop words (separated by commas). This list will be combined with the Glasgow stop list if you select it
    • outFormat -- values are 1/2, corresponding to HTML and Java applet respectively

Known Bugs


 <<O>>  Difference Topic TAPoRwarePCA (r1.3 - 04 Apr 2007 - LianYan)

Line: 54 to 54

percent 5/10/20/50 selection 10 selection related the selection of percent of character above
chunk   text 100 it is related to the selection of "chunk of words" on "pcatext"
numword 0/5/10/20/50/100 selection 10 number of top frequency words. If you select "0", the following field must be entered with a list of words
Added:
>
>
glasgow   checkbox checked check it to exclude glasgow stop words
userlist   text   enter your own stop list here, separated by comma. If you check to exclude glasgow stop words and enter your own stop words here, they will be combined to form a new stop word list
disp 1/2 selection 2 the output fromat. 1 -- HTML, 2 -- java applet graphics

taporface   checkbox checked display result in a new window without graphics interface (default) or with taporware interface in the same window

Web Service Interface


 <<O>>  Difference Topic TAPoRwarePCA (r1.2 - 04 Apr 2007 - LianYan)

Line: 10 to 10

Generally, one corpus could be divided into many text units based on subtitles, paragraphs, number of words etc. The word distribition of each text unit can be generated easily, but the usage (or similarility) of group (clustter) of word in all the text units can not be easily measured.

Changed:
<
<
The Principal Components Analysis tool applies the corresponding mathematics to reduce the dimension of the text units into 2 or 3, the the results can be displayed. The words with similar usage will be displayed close to each other to form a cluster, otherwiae, they are diaplayed apart away.
>
>
The Principal Components Analysis tool applies the corresponding mathematics to reduce the dimension of the text units into 2 or 3, so that the results can be displayed graphically. The words with similar usage will be displayed close to each other to form a cluster, otherwiae, they are diaplayed apart away.

Note 1: This tool works better with large text units (i.e. at least 500 words).

Line: 18 to 18

Note 3: Paragraphs are determined by multiple line breaks, so some paragraphs may be very short (e.g. titles). In this case, it is better to use a subtext other than paragraphs

Deleted:
<
<

History


Pseudocode

Added:
>
>
  • Obtain the source text by URL or from the user's local disk. If the text format is XML or HTML, strip off all the tags
  • Chop the text into subtexts based on the user's choice, and generate the word list for each subtext unit
  • Get the user-specified number of top-frequency words in the entire text, which may exclude stop words, and generate the top-word count matrix over the subtext units
  • Calculate the mean matrix of the word count matrix
  • Generate the covariance and correlation coefficient matrices based on the mean matrix
  • Generate the eigenvectors and eigenvalues of the covariance matrix
  • Generate m x 2 and m x 3 matrices from the eigenvectors above, where m is the number of top-frequency words
  • Obtain the final matrices by transposing each of the two matrices above and multiplying it by the mean matrix
  • Generate the result based on the user's selection: a Java applet or table format
  • Note: for the Java applet, only the 3-dimensional graph is generated
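
The steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the tool's actual code: the function names are hypothetical, "mean matrix" is read as the mean-centred count matrix, words are treated as the variables (so the covariance matrix is m x m), the correlation matrix and the tag-stripping step are left out, and the eigenvectors are found by simple power iteration with deflation rather than whatever routine TAPoRware uses.

```python
import math
import re
from collections import Counter


def chunk_words(text, size):
    """Split text into units of `size` lowercase words (the last may be shorter)."""
    words = re.findall(r"[a-z']+", text.lower())
    return [words[i:i + size] for i in range(0, len(words), size)]


def count_matrix(units, top_n, stop_words=frozenset()):
    """Return the top_n words and an m x n matrix of counts (word x unit)."""
    totals = Counter(w for unit in units for w in unit if w not in stop_words)
    top = [w for w, _ in totals.most_common(top_n)]
    return top, [[unit.count(w) for unit in units] for w in top]


def top_eigenvectors(sym, k, iters=500):
    """Power iteration with deflation on a symmetric positive semi-definite matrix."""
    n = len(sym)
    a = [row[:] for row in sym]
    vals, vecs = [], []
    for _ in range(k):
        v = [1.0 / (i + 1) for i in range(n)]  # fixed start keeps runs repeatable
        scale = math.sqrt(sum(x * x for x in v))
        v = [x / scale for x in v]
        for _ in range(iters):
            w = [sum(a[i][j] * v[j] for j in range(n)) for i in range(n)]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:  # nothing left to extract
                break
            v = [x / norm for x in w]
        lam = sum(v[i] * a[i][j] * v[j] for i in range(n) for j in range(n))
        vals.append(lam)
        vecs.append(v)
        for i in range(n):  # deflate: remove the component just found
            for j in range(n):
                a[i][j] -= lam * v[i] * v[j]
    return vals, vecs


def pca_sketch(text, chunk_size, top_n, k=2, stop_words=frozenset()):
    """Needs at least two text units and k <= number of top words."""
    units = chunk_words(text, chunk_size)
    top, counts = count_matrix(units, top_n, stop_words)
    n = len(units)
    # Assumption: "mean matrix" means the mean-centred count matrix.
    centered = [[c - sum(row) / n for c in row] for row in counts]
    cov = [[sum(x * y for x, y in zip(ri, rj)) / (n - 1) for rj in centered]
           for ri in centered]
    vals, vecs = top_eigenvectors(cov, k)
    # k x n result: transposed m x k eigenvector matrix times the centred matrix.
    projected = [[sum(vec[w] * centered[w][u] for w in range(len(top)))
                  for u in range(n)] for vec in vecs]
    return top, vals, vecs, projected
```

Calling `pca_sketch(text, 100, 10, k=3)` would correspond to choosing chunks of 100 words, the top 10 words, and a 3-dimensional result. Whether the tool plots the projected units or the per-word eigenvector entries is not stated here, so the sketch returns both.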

Ways of Using

Added:
>
>
  • Enter a URL pointing to the source text or browse (enter) the local text file path in the corresponding field
  • Select the subtext unit, then enter or select the related field or use the default value
  • Select the number of top-frequency words you want to investigate
  • Check to include or exclude Glasgow stop words as you wish. You can also enter your own stop word list in the specified field
  • Select output format: HTML or java applet graphics
  • Click the submit button

CGI Interface

Added:
>
>
If you want to use this tool from your web site, here is the CGI interface. (Note: if you want to upload local HTML text to the tool, you need to use the attribute name/value pair enctype="multipart/form-data" within the form tag.)

Here are the parameters:

| Parameter Name | Parameter Value | Control Type | Default | Description |
| source | url/local | radio button | url | Lets the user select the input text (either a URL or a local path to upload the text source) |
| texturl | | text | | A valid URL pointing to any plain text, HTML or XML text |
| localFile | | file | | The path to your local text file |
| pcatext | 1/2/3 | radio button | 2 | Subtext unit, corresponding to paragraph/percent of characters/chunk of words respectively |
| percent | 5/10/20/50 | selection | 10 | Related to the selection of "percent of characters" in "pcatext" |
| chunk | | text | 100 | Related to the selection of "chunk of words" in "pcatext" |
| numword | 0/5/10/20/50/100 | selection | 10 | Number of top-frequency words. If you select "0", the following field must be filled in with a list of words |
| taporface | | checkbox | checked | Display the result in a new window without the graphics interface (default) or with the TAPoRware interface in the same window |
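
As a rough illustration of how these parameters fit together, the Python sketch below builds (but does not send) a POST request for a text given by URL. The helper name `build_pca_request` is hypothetical. The script address and the "Y" checkbox value are taken from the form example elsewhere on this topic. It is an assumption that the script accepts a URL-encoded body, since the documented form uses multipart/form-data.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Script address as used in the form example on this topic.
CGI_URL = ("http://taporware.mcmaster.ca/~taporware/"
           "cgi-bin/prototype/tpcacluster.cgi")


def build_pca_request(text_url, pcatext="2", percent="10",
                      numword="10", glasgow=True, disp="1"):
    """Build a POST request for a URL source; nothing is sent here."""
    fields = {
        "source": "url",        # input text comes from a URL
        "texturl": text_url,
        "pcatext": pcatext,     # 2 = percent of characters
        "percent": percent,
        "numword": numword,
        "disp": disp,           # 1 = HTML output
    }
    if glasgow:
        fields["glasgow"] = "Y"  # checkbox value assumed from the form example
    body = urlencode(fields).encode("ascii")
    return Request(CGI_URL, data=body, method="POST")
```

Passing the returned object to `urllib.request.urlopen` would submit it, assuming the service is still reachable.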

Web Service Interface

Added:
>
>
TAPoRware provides web services to any non-profit organization. Here is the TAPoRware web services information:

  • Endpoint URL: http://taporware.mcmaster.ca:9982
  • Service URI: http://taporware.mcmaster.ca/~taporware/webservice
  • Service Method: find_Collocation_Plain
  • parameters:
    • textInput -- any text string. If the text format is html or xml, all the tags will be stripped
    • pattern -- unix styled pattern or regular expression
    • context -- value can be 1/2/3/4 which corresponding to Words/Lines/Sentences/Paragraphs respectively
    • contextLength -- length of context
    • sorting -- values can be 1/2/3 corresponding to co-occurrence words by frequency/alphabetically/z-score
    • outFormat -- values are html/xml/others where others format will "generate XML text in HTML"

Known Bugs

To Do


 <<O>>  Difference Topic TAPoRwarePCA (r1.1 - 19 Jun 2006 - LianYan)
Line: 1 to 1
Added:
>
>
META TOPICPARENT TAPoRware

Principal Components Analysis

See http://taporware.mcmaster.ca/~taporware/betaTools/pca.shtml

Description

This tool applies Principal Components Analysis to a text, generating relationships between word distributions and text units.

Generally, one corpus could be divided into many text units based on subtitles, paragraphs, number of words etc. The word distribition of each text unit can be generated easily, but the usage (or similarility) of group (clustter) of word in all the text units can not be easily measured.

The Principal Components Analysis tool applies the corresponding mathematics to reduce the dimension of the text units into 2 or 3, the the results can be displayed. The words with similar usage will be displayed close to each other to form a cluster, otherwiae, they are diaplayed apart away.

Note 1: This tool works better with large text units (i.e. at least 500 words).

Note 2: The total number of units should be less than 150. You can use the List Words tool to determine the total number of words.

Note 3: Paragraphs are determined by multiple line breaks, so some paragraphs may be very short (e.g. titles). In this case, it is better to use a subtext other than paragraphs

History

Pseudocode

Ways of Using

CGI Interface

Web Service Interface

Known Bugs

To Do

Write help and walkthrough

-- LianYan - 19 Jun 2006



Revision r1.1 - 19 Jun 2006 - 14:56 - LianYan
Revision r1.6 - 29 Apr 2010 - 15:08 - KamalRanaweera