Experiments in Text Analysis II: May 1-2, 2008
Editorial Note: This document was written during an experiment in "extreme text analysis" where, in conversation, one of us did the text analysis and the other documented the research practice here. Because it was written while doing the analysis, this document is often elliptical and has typos, as any living notes do. It is a record of an attempt at open research practice documentation and is shared as such. The outcome of the text analysis experiment, Now, Analyze That: Obama and Wright on Race in America, is the more polished text, though it is still being finished. As we continue these experiments we expect to prepare more detailed explanations about extreme text analysis, the rhetoric of text analysis, and open research.
All passages in italics, including this one, were written after the fact. This being a wiki, you can retrieve the document as it was Friday afternoon when we stopped. Think about it!
For the purposes of this experiment we want to identify a topic of general interest. Note that we aren't necessarily coming to the topic with pre-conceived hypotheses, unlike most traditional text analysis; this is more about exploratory text analysis. The choice of topic seems a bit arbitrary, but we're more concerned with how you might do some quick text analysis on a topic of interest.
Possible topics:
- US Primaries debates
- toxins in plastic containers
- GTA4
- global food prices
- race comments by Obama and Wright
We've decided on the Obama/Wright topic because it will allow us to compare two documents (or at least a small corpus).
Timeline
Initial Gathering of Texts
We have decided initially not to clean the texts, even though the "print versions" have extraneous content, since we're interested in experimenting with speedy analysis.
We should give some thought to how to best find texts for this purpose.
Preliminary Text Analysis
- aggregate all three texts
- open analyze text
- fire word cloud without stop words and sorted by appearance
When asking for more sparklines in TAPoRware we realized that one of the pages included a lot of reader comments, so we decided to use the plain text versions with just the transcripts.
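The word cloud step above (counts without stop words, sorted by appearance) can be approximated outside the portal. The following is only a minimal sketch: the stop list, the tokenizer, and the sample sentences are assumptions for illustration and do not reproduce what TAPoRware actually does.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption, not TAPoRware's own list).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that", "we", "it"}

def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def word_counts(texts, stop_words=STOP_WORDS):
    """Count non-stop words across several texts.

    A Counter is a dict, so it keeps words in order of first appearance
    (Python 3.7+), which is the ordering used for the word cloud above.
    """
    counts = Counter()
    for text in texts:
        counts.update(t for t in tokenize(text) if t not in stop_words)
    return counts

# Hypothetical sample texts standing in for the aggregated speeches.
sample = [
    "We the people, in order to form a more perfect union.",
    "The union is a work in progress.",
]
for word, n in word_counts(sample).items():
    print(word, n)
```

Sorting by frequency instead is a matter of calling `word_counts(sample).most_common()`.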
Comparison on Cleaned Speeches
- text sizes similar
- Wright repeats words more often
- "black" evenly distributed
- more often Obama: white (white house?), racial, time (?), country
- more often Wright: treat, committed, doctor, different, religious, book
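A comparison like the one above ("more often Obama", "more often Wright") could be sketched by ranking words on the difference in relative frequency between two texts. This is an assumed, simplified measure (difference in occurrences per 1,000 tokens), not the one the comparison tool uses, and the function names are hypothetical.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def relative_freqs(text):
    """Return each word's frequency per 1,000 tokens."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid dividing by zero on empty input
    return {w: 1000 * n / total for w, n in counts.items()}

def distinctive_words(text_a, text_b, top=5):
    """Words of text_a ranked by how much more often they occur than in text_b."""
    fa, fb = relative_freqs(text_a), relative_freqs(text_b)
    diffs = {w: fa[w] - fb.get(w, 0.0) for w in fa}
    # Largest difference first; ties broken alphabetically for stable output.
    return sorted(diffs, key=lambda w: (-diffs[w], w))[:top]
```

Calling `distinctive_words(obama_text, wright_text)` and then swapping the arguments would give the two lists, where `obama_text` and `wright_text` are placeholders for the cleaned transcripts. Normalizing per 1,000 tokens matters less here because the text sizes are similar, but it keeps the sketch usable on texts of unequal length.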
Examining "time" More Closely in Obama
- concordance with context of 15
- looking at word series "this time we want to talk about" and "at a time when we need"
- Wright uses time to talk about music (among other things)
- comparative distribution in HyperPo highlights that Obama uses time more often towards the end
- looking at time with visual collocator showed "divisive" as an interesting collocate
- looking at black and time with visual collocator
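The concordance step above can be illustrated with a small keyword-in-context function. It is a sketch under an assumption: the "context of 15" is read here as 15 words on each side, which may not be the unit the tool uses.

```python
import re

def concordance(text, keyword, context=15):
    """Return (left, match, right) tuples with `context` words on each side.

    The unit of context is assumed to be whitespace-separated words.
    """
    tokens = re.findall(r"\S+", text)
    target = keyword.lower()
    hits = []
    for i, tok in enumerate(tokens):
        # Strip punctuation so that "time." still matches "time".
        if re.sub(r"\W+", "", tok).lower() == target:
            left = " ".join(tokens[max(0, i - context):i])
            right = " ".join(tokens[i + 1:i + 1 + context])
            hits.append((left, tok, right))
    return hits

# Hypothetical sample sentence, not a quotation from either speech.
for left, match, right in concordance("Not this time. This time we want to talk.", "time", context=2):
    print(f"{left:>20} [{match}] {right}")
```

Reading down the printed column of matches is what makes repeated word series around a keyword stand out.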
Examining "commit*" More Closely in Wright
- "many of us are committed" and "we are committed"
Looking at Words that Only One Uses:
- Obama: anger, health, generation, woman (singular), future, stories, racism
- Wright: changing, European, tradition, learning
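Words that only one speaker uses amount to a set difference between the two vocabularies. A minimal sketch follows, assuming a simple lowercase tokenizer and no stemming, so singular and plural forms such as "woman" and "women" count as different words.

```python
import re

def vocabulary(text):
    """Return the set of distinct lowercase word forms in the text."""
    return set(re.findall(r"[a-z']+", text.lower()))

def exclusive_words(text_a, text_b):
    """Return (words only in text_a, words only in text_b), each sorted."""
    va, vb = vocabulary(text_a), vocabulary(text_b)
    return sorted(va - vb), sorted(vb - va)
```

On real speeches the raw difference is long and mostly made of words used once, so in practice it would probably need a minimum-frequency filter before anything noteworthy surfaces.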
It seems noteworthy that Wright uses rac* about 4 times whereas Obama uses it about 32 times.
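Wildcard counts such as rac* or commit* can be checked with shell-style pattern matching over the tokens. This sketch uses Python's standard `fnmatch` module; the tokenizer is an assumption, and a pattern like rac* also matches unrelated words beginning with those letters, so the hits are worth eyeballing.

```python
import re
from fnmatch import fnmatchcase

def wildcard_count(text, pattern):
    """Count tokens matching a shell-style pattern such as 'rac*'."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(1 for t in tokens if fnmatchcase(t, pattern.lower()))

def wildcard_matches(text, pattern):
    """Return the distinct matching word forms, for checking what was counted."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return sorted({t for t in tokens if fnmatchcase(t, pattern.lower())})
```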
Googlizer: obama wright race
(?)
Issues and Wishlist
- TAPoRize window opens behind opened window (fault of opening page?) and the window styling should be simplified
- tool for assisting in cleaning and stripping texts found on the web, and for managing derivative texts
- quicker way of pointing to texts in research log once they're gathered (instead of manually getting the texts)
- results of a tool should be savable directly to myTexts (not via data bench) - do we need the data bench?
- reconsider distinction between myTexts and Workbench
- why are counts different between frequency and appearance word clouds for aggregate text (and yet again for frequency lists)?
- can we have an easier way of viewing aggregate texts (the actual texts), as a public url (instead of going back to myTexts and generating the link with Meta)
- can we have a way in the portal of viewing more sparklines for the frequency tool?
- can we have a way of spawning multiple results windows (for comparison)
- need for a tool to have words from same semantic cluster (not just synonyms), ex: time, now, moment, etc.
- generate word list and be able to call it as input to tools
- possible tool: find words with notably irregular distribution
- need for better tool for repeating word sequences
- stop words not working properly in the TAPoRware word pairs tool
- collocates display shows up very differently with z-score
- both speeches have rhetorical arc or structure which none of our tools could show
- structure analyzer: specific terms, repetition clusters, part of speech trends, mva
- categorize research log entries
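One of the wishlist items above, a tool to find words with notably irregular distribution, could start from something like the following. It is only a sketch of one possible approach: the text is cut into equal segments and words are ranked by the coefficient of variation of their per-segment counts. The measure, the segment count, and the minimum-frequency threshold are all assumptions.

```python
import re
from collections import Counter
from statistics import mean, pstdev

def segment_counts(tokens, word, segments=10):
    """Count occurrences of `word` in each of `segments` roughly equal slices."""
    counts = [0] * segments
    n = len(tokens)
    for i, tok in enumerate(tokens):
        if tok == word:
            counts[i * segments // n] += 1
    return counts

def irregularity(tokens, word, segments=10):
    """Coefficient of variation of per-segment counts (0 means perfectly even)."""
    counts = segment_counts(tokens, word, segments)
    m = mean(counts)
    return pstdev(counts) / m if m else 0.0

def most_irregular(text, min_count=3, segments=10, top=10):
    """Rank words occurring at least `min_count` times, most uneven first."""
    tokens = re.findall(r"[a-z']+", text.lower())
    freq = Counter(tokens)
    candidates = [w for w, n in freq.items() if n >= min_count]
    return sorted(candidates, key=lambda w: (-irregularity(tokens, w, segments), w))[:top]
```

The per-segment counts from `segment_counts` are also the kind of data a sparkline would draw, so a word that clusters towards the end of a speech shows up as a run of zeros followed by higher counts.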
Reflections
- is text analysis just good for thought seeds?
- if the text analysis happens at the seeding stage, would text analysis ever show up in the published paper? (see http://philosophi.ca/theorti/?p=1732)
- as developers, we're discovering the limits of the tools and the next functions that would be useful (this doesn't seem to happen otherwise)
- we're working in a very human way: making somewhat arbitrary choices (based on intuition?) - could computers provide more thoroughness through limited intelligence?
- writing forces checking of hypotheses formed at exploratory stage (ex: we thought Obama used white as white house, but that's not the case)
--
StefanSinclair - 01 May 2008