Principal Components Analysis

See http://taporware.mcmaster.ca/~taporware/betaTools/pca.shtml

Description

This tool applies Principal Components Analysis to a text, generating relationships between word distributions and text units.

Generally, a corpus can be divided into many text units based on subtitles, paragraphs, number of words, etc. The word distribution of each text unit can be generated easily, but the usage (or similarity) of a group (cluster) of words across all the text units cannot be easily measured.
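The first stage, dividing a text into units and counting the words in each, can be sketched in Python. This is a minimal illustration, assuming fixed-size chunks of words; the function names and the simple tokenizer are hypothetical and are not taken from the tool.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its words (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def chunk_words(words, size):
    """Split a word list into consecutive units of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def word_distributions(text, size=100):
    """Return one Counter of word frequencies per text unit."""
    return [Counter(unit) for unit in chunk_words(tokenize(text), size)]

# Illustrative input only; real units should be much larger.
sample = "the cat sat on the mat the dog sat on the log"
units = word_distributions(sample, size=6)
for index, counts in enumerate(units):
    print(index, counts.most_common(3))
```

Each Counter is the word distribution of one unit. Comparing how groups of words behave across all units is the harder part, which is what the PCA step addresses.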

The Principal Components Analysis tool applies the corresponding mathematics to reduce the dimensions of the text units to 2 or 3, so that the results can be displayed graphically. Words with similar usage are displayed close to each other, forming a cluster; words with dissimilar usage are displayed far apart.

Note 1: This tool works better with large text units (i.e. at least 500 words).

Note 2: The total number of units should be less than 150. You can use the List Words tool to determine the total number of words.

Note 3: Paragraphs are determined by multiple line breaks, so some paragraphs may be very short (e.g. titles). In this case, it is better to use a subtext unit other than paragraphs.
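To see why paragraph units can be problematic, the following Python sketch splits a text on runs of two or more line breaks and reports units that fall below a word threshold. The regular expression and function names are assumptions for illustration; the tool's own splitting rule may differ in detail.

```python
import re

def split_paragraphs(text):
    """Split on runs of two or more line breaks, dropping empty pieces."""
    parts = re.split(r"(?:\r?\n[ \t]*){2,}", text.strip())
    return [part.strip() for part in parts if part.strip()]

def short_units(paragraphs, min_words=500):
    """Return (index, word_count) for every paragraph below `min_words`."""
    counts = [len(paragraph.split()) for paragraph in paragraphs]
    return [(i, n) for i, n in enumerate(counts) if n < min_words]

text = "A Title\n\nFirst paragraph with several words in it.\n\n\nSecond one."
paragraphs = split_paragraphs(text)
# A small threshold is used here only because the sample text is tiny.
print(short_units(paragraphs, min_words=5))
```

If many units are flagged with the default threshold of 500 words (the figure from Note 1), a percent-of-characters or chunk-of-words unit is likely the better choice.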

Pseudocode

  • Obtain the source text by URL or from the user's local disk. If the text format is XML or HTML, strip off all the tags.
  • Chop the text into subtext units based on the user's choice, and generate the word list for each subtext unit.
  • Get the top-frequency words in the entire text (as many as the user specified, optionally excluding stop words), and generate the matrix of top-word counts for each subtext unit.
  • Calculate the mean matrix of the word count matrix.
  • Generate the covariance matrix and the correlation coefficient matrix based on the mean matrix.
  • Generate the eigenvectors and eigenvalues of the covariance matrix.
  • Generate an m x 2 and an m x 3 matrix from the eigenvectors above, where m is the number of top-frequency words.
  • Obtain the final matrix by transposing each of the two matrices above (separately) and multiplying it by the mean matrix.
  • Generate the result based on the user's selection: a Java applet or table format.
  • Note: for the Java applet, only the 3-dimensional graph is generated.
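The numerical steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the tool's implementation: "mean matrix" is read here as the mean-adjusted count matrix, rows are subtext units and columns are top words, eigenvectors are approximated by power iteration with deflation (the source does not say which method the tool uses), and the counts are made up.

```python
import math
import random

def center_columns(rows):
    """Subtract each column's mean, giving the mean-adjusted matrix."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Sample covariance between the columns of a mean-adjusted matrix."""
    n = len(centered)
    cols = list(zip(*centered))
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in cols]
            for ci in cols]

def mat_vec(matrix, vec):
    """Multiply a square matrix by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def top_eigenpairs(matrix, k, iterations=500, seed=0):
    """Approximate the k largest eigenpairs of a symmetric matrix
    using power iteration with deflation."""
    size = len(matrix)
    rng = random.Random(seed)
    work = [row[:] for row in matrix]
    pairs = []
    for _ in range(k):
        vec = [rng.uniform(-1.0, 1.0) for _ in range(size)]
        norm = math.sqrt(sum(x * x for x in vec))
        vec = [x / norm for x in vec]
        for _step in range(iterations):
            nxt = mat_vec(work, vec)
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm < 1e-12:  # nothing left to extract
                break
            vec = [x / norm for x in nxt]
        value = sum(a * b for a, b in zip(vec, mat_vec(work, vec)))
        pairs.append((value, vec))
        # Deflate so the next pass finds the next-largest eigenvalue.
        work = [[work[i][j] - value * vec[i] * vec[j] for j in range(size)]
                for i in range(size)]
    return pairs

def project(centered, vectors):
    """Coordinates of each row of `centered` along each eigenvector."""
    return [[sum(a * b for a, b in zip(row, v)) for v in vectors]
            for row in centered]

# Hypothetical counts: one row per subtext unit, one column per top word.
words = ["love", "death", "king"]
counts = [[4, 0, 1], [3, 1, 0], [0, 5, 2], [1, 4, 3]]

centered = center_columns(counts)
pairs = top_eigenpairs(covariance(centered), k=2)
coords = project(centered, [vec for _, vec in pairs])
for unit, point in enumerate(coords):
    print(unit, [round(x, 3) for x in point])
```

With this orientation the projection gives one 2-dimensional point per subtext unit, and each eigenvector holds one weight per word. Whether the tool plots the projected points or the word weights is not stated in the pseudocode, so treat the orientation as a guess.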

Ways of Using

  • Enter a URL pointing to the source text, or browse to (or enter) the path of a local text file in the corresponding field.
  • Select the subtext unit, then enter or select a value in the related field or use the default value.
  • Select the number of top-frequency words you want to investigate.
  • Check the box to exclude Glasgow stop words, or uncheck it to include them. You can also enter your own stop word list in the specified field.
  • Select the output format: HTML or Java applet graphics.
  • Click the submit button.

CGI Interface

If you want to use this tool from your web site, here is the CGI interface. (Note: if you want to upload local HTML text to the tool, you need to set the attribute enctype="multipart/form-data" in the form tag.)

Here are the parameters:

| Parameter Name | Parameter Value | Control Type | Default | Description |
| source | url/local | radio button | url | Lets the user select the input text (either a URL or a local path to upload the text source) |
| texturl | | text | | A valid URL pointing to any plain text, HTML or XML text |
| localFile | | file | | The path to your local text file |
| pcatext | 1/2/3 | radio button | 2 | Subtext unit: paragraph/percent of characters/chunk of words respectively |
| percent | 5/10/20/50 | selection | 10 | Used when "percent of characters" is selected in "pcatext" |
| chunk | | text | 100 | Used when "chunk of words" is selected in "pcatext" |
| numword | 0/5/10/20/50/100 | selection | 10 | Number of top-frequency words. If you select "0", a list of words must be entered in the following field |
| glasgow | | checkbox | checked | Check it to exclude Glasgow stop words |
| userlist | | text | | Enter your own stop list here, separated by commas. If you check to exclude Glasgow stop words and also enter your own stop words here, the two are combined to form a new stop word list |
| disp | 1/2 | selection | 2 | The output format. 1 -- HTML, 2 -- Java applet graphics |
| taporface | | checkbox | checked | Display the result in a new window without the graphics interface (default) or with the TAPoRware interface in the same window |
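As a rough illustration of how these parameters fit together, the Python sketch below assembles a query string from the documented defaults and checks a few values against the choices listed above. The script URL and the "Y" value for glasgow are taken from the form example on this page. Whether the script accepts a URL-encoded request instead of the multipart form is not confirmed here, so the sketch only builds the string and does not send it. The helper name build_query is hypothetical.

```python
from urllib.parse import urlencode, parse_qs

CGI_URL = ("http://taporware.mcmaster.ca/~taporware/"
           "cgi-bin/prototype/tpcacluster.cgi")

# Defaults as listed in the parameter table.
DEFAULTS = {
    "source": "url",
    "pcatext": "2",
    "percent": "10",
    "chunk": "100",
    "numword": "10",
    "glasgow": "Y",
    "disp": "2",
}

# numword is deliberately not checked: the page's own form example
# passes 30, which is not among the listed choices.
ALLOWED = {
    "source": {"url", "local"},
    "pcatext": {"1", "2", "3"},
    "percent": {"5", "10", "20", "50"},
    "disp": {"1", "2"},
}

def build_query(text_url, **overrides):
    """Return a URL-encoded parameter string for the given source URL."""
    params = dict(DEFAULTS, texturl=text_url)
    params.update({name: str(value) for name, value in overrides.items()})
    for name, allowed in ALLOWED.items():
        if params[name] not in allowed:
            raise ValueError(
                f"{name}={params[name]!r} is not one of {sorted(allowed)}")
    return urlencode(params)

# example.org is a placeholder source text URL.
query = build_query("http://example.org/text.txt", pcatext=3, chunk=500, disp=1)
print(CGI_URL + "?" + query)
```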

Use PCA Tool in Your Web Page

You can add a button to your web page that gets the PCA results for the current page by calling the TAPoRware CGI script.

Here is the code for such a PCA button:

<form method="post" name="textForm" enctype="multipart/form-data" target="_blank" action="http://taporware.mcmaster.ca/~taporware/cgi-bin/prototype/tpcacluster.cgi" onsubmit="document.textForm.texturl.value=document.location.href">

<input type="hidden" name="source" value="url" />

<input type="hidden" name="texturl" />

<input type="hidden" name="freetext" value="Y"/>

<input type="hidden" name="pcatext" value="2" />

<input type="hidden" name="percent" value="10" />

<input type="hidden" name="numword" value="30" />

<input type="hidden" name="glasgow" value="Y" />

<input type="hidden" name="disp" value="2" />

<input type="submit" value="Get PCA" />

</form>

Web Service Interface

TAPoRware provides web services to any non-profit organization. Here is the TAPoRware web services information:

  • Endpoint URL: http://taporware.mcmaster.ca:9982
  • Service URI: http://taporware.mcmaster.ca/~taporware/webservice
  • Service Method: plainPCA
  • Parameters:
    • textSource -- any text string. If the text format is HTML or XML, all the tags will be stripped
    • suboption -- subtext unit selection; the values 1/2/3 correspond to paragraph/percent of characters/chunk of text in words
    • percent -- related to the choice of "percent of characters" in the suboption parameter
    • chunk -- related to the choice of "chunk of text in words" in the suboption parameter
    • topWords -- number of top-frequency words (may or may not exclude stop words, see below) to be investigated
    • glasgow -- a boolean value (NO) indicating whether to exclude the Glasgow stop words from the top-frequency word list
    • userwords -- a text field for the user's own stop words (separated by commas). This list will be combined with the Glasgow stop list if you select it
    • outFormat -- the values 1/2 correspond to HTML and Java applet respectively
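The page does not state the wire protocol. The endpoint/URI/method layout suggests a SOAP RPC service, so the Python sketch below builds a SOAP 1.1 style request body for plainPCA on that assumption. The envelope layout, the parameter order, and the "true" encoding for the glasgow boolean are all guesses to be checked against the real service. Nothing is sent.

```python
import xml.etree.ElementTree as ET

# Assumption: the service speaks SOAP 1.1; this namespace is the standard one.
SOAP_ENV = "http://schemas.xmlsoap.org/soap/envelope/"
SERVICE_URI = "http://taporware.mcmaster.ca/~taporware/webservice"

def build_envelope(method, params):
    """Return a SOAP envelope (as a string) that calls `method` with the
    given (name, value) pairs, in order."""
    ET.register_namespace("soap", SOAP_ENV)
    envelope = ET.Element(f"{{{SOAP_ENV}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_ENV}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_URI}}}{method}")
    for name, value in params:
        ET.SubElement(call, name).text = str(value)
    return ET.tostring(envelope, encoding="unicode")

# Parameter names come from the list above; the values are illustrative.
params = [
    ("textSource", "<p>Some text to analyse</p>"),
    ("suboption", 3),      # chunk of text in words
    ("percent", 10),
    ("chunk", 500),
    ("topWords", 20),
    ("glasgow", "true"),   # assumed encoding of the boolean
    ("userwords", "thou,thee"),
    ("outFormat", 1),      # HTML
]
xml_text = build_envelope("plainPCA", params)
print(xml_text)
```

Building the body with ElementTree rather than string formatting means markup inside textSource is escaped automatically.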

Known Bugs

To Do

Write help and walkthrough

-- LianYan - 19 Jun 2006

