What is Text Analysis?

Short and longer discussions of what text analysis is.


What is text analysis? A very short answer.

by Geoffrey Rockwell

Word processors have searching tools that allow you to find a word or phrase. Such finding tools can be used as a simple text analysis environment. Your word processor is not, however, suited to searching large texts interactively, nor does it show you the results of a search in a way that can help you understand a text. Computer-assisted text analysis environments do three types of things beyond what the "Find" tool of a word processor might do:

  • Text analysis systems can search large texts quickly. They do this by preparing electronic indexes to the text so that the computer does not have to read through the entire text. When finding words can be done so quickly that it is "interactive", it changes how you can work with the text - you can serendipitously explore without being frustrated by the slowness of the search process.

  • Text analysis systems can conduct complex searches. Text analysis systems will often allow you to search for lists of words or for complex patterns of words. For example, you can search for the co-occurrence of two words.

  • Text analysis systems can present the results in ways that suit the study of texts. Text analysis systems can display the results in a number of ways; for example, a Keyword In Context display shows you all the occurrences of the found word with one line of context.
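
The first two points can be illustrated with a minimal sketch. It assumes a very simple notion of a "word" (runs of letters and apostrophes) and uses hypothetical helper names (tokenize, build_index, cooccurrences) that do not come from any particular text analysis system. The index is built once, so later searches look up positions instead of rereading the text.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())

def build_index(text):
    """Map each word to the list of token positions where it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(tokenize(text)):
        index[word].append(position)
    return index

def cooccurrences(index, first, second, window=5):
    """Return (position_of_first, position_of_second) pairs at most `window` tokens apart."""
    return [(p, q)
            for p in index.get(first, [])
            for q in index.get(second, [])
            if abs(p - q) <= window]

sample = ("We sleep, we dream. I have had a dream, "
          "past the wit of man to say what dream it was.")
index = build_index(sample)
print(index["dream"])                         # token positions of "dream"
print(cooccurrences(index, "wit", "dream"))   # "wit" within five tokens of "dream"
```

Real systems store such indexes on disk and handle word forms, markup, and punctuation far more carefully; the sketch only shows why an index makes searching fast enough to feel interactive.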


Why bother with computer-assisted text analysis? A short answer.

by Geoffrey Rockwell

Text analysis tools aid the interpreter asking questions of electronic texts

Much of the evidence used by humanists is in textual form. One way of interpreting texts involves bringing questions to these texts and using various reading practices to reflect on the possible answers. Simple text analysis tools can help with this process of asking questions of a text and retrieving passages that help one think through the questions. The computer does not replace human interpretation; it enhances it. The concordance is an example of a research aid that predates computing. The concordance, like the index, allows the interpreter to find passages that share a common word about which they are asking. Unlike the index, the concordance is a presentation of the passages that "concord" together for reflection. Thus a Key Word In Context display shows one line of text with the key word searched for in the middle so that the reader can see if there are patterns in the way the word is used. Most text analysis tools build on the concordance. They break down the text (analysis) and then represent passages in new arrangements (synthesis) that aid the questioner, like graphs of the distribution of a word.
  1. Text-analysis tools break a text down into smaller units like words, sentences, and passages, and then
  2. Gather these units into new views on the text that aid interpretation.
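
The two steps can be sketched in a few lines. This is only an illustration under simple assumptions: words are runs of letters and apostrophes, the text is cut into equal slices, and the function names (distribution, bar_chart) are hypothetical. It breaks a text into words and then gathers them into a rough text-only graph of where a word occurs.

```python
import re

def distribution(text, keyword, segments=4):
    """Count occurrences of `keyword` in each of `segments` equal slices of the text."""
    # Analysis: break the text down into word units.
    words = re.findall(r"[a-z']+", text.lower())
    # Slice size, rounded up so every word falls into one of the segments.
    size = max(1, -(-len(words) // segments))
    counts = [0] * segments
    for position, word in enumerate(words):
        if word == keyword:
            counts[min(position // size, segments - 1)] += 1
    return counts

def bar_chart(counts):
    """Synthesis: present the counts as a simple text graph, one row per slice."""
    return "\n".join(f"part {i + 1}: {'#' * n} ({n})" for i, n in enumerate(counts))

# A made-up string of placeholder words, used only to exercise the functions.
toy_text = "a dream b c d e f g dream dream h dream"
print(bar_chart(distribution(toy_text, "dream")))
```

A graph like this does not answer a question by itself; it points the reader to the parts of the text worth rereading.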

Text analysis practices encourage reflection on the questions asked and formalization of queries

One of the side-effects of computer-assisted text analysis is that it forces interpreters to think about their interpretative practices in order to use the tool. When you use an index or concordance you have to ask yourself what word or concept you want to follow through the text. In order for a computer to aid in interpretation you need to describe the questions you bring to bear and then formalize them into queries that the computer can handle. To look for an abstract concept in a text with a concordance you have to ask what words or patterns would be indicative of the discussion of that concept. In the formalization the interpreter can learn about what they are asking and develop new questions. With interactive systems you start out thinking you are asking one question, but in the formalization and refinement you discover anomalies or other questions you want to ask. Interacting with a text through text analysis tools then becomes an iterative practice of discovery with its own serendipitous paths comparable to, but not identical to, the serendipitous discoveries that happen in rereading a text.

Text analysis is a way of targeted rereading that tests intuitions

For some, text analysis is a way of "thinking through" or targeted rereading. When interpreting a text you develop intuitions about what may be an interesting rereading of the evidence. Most intuitions have implied quantitative or comparative elements of the sort "Rousseau is more interested in political issues while Diderot is interested in domestic philosophical issues." Such intuitions can be checked with text analysis. What language use would we expect in a philosopher interested in state issues rather than personal or domestic issues? Is there a higher frequency of certain words in Rousseau? In this formal thinking through of intuitions one doesn't necessarily prove the intuition; instead one uses interpretative aids in practices of reflection. Interestingly, these practices are often not shared as formal methods in the humanities. The resulting papers we write often don't mention the text analysis just as they don't mention the library practices. The practices are undertheorized in the humanities, which is one of the reasons for this Text Analysis Developers Alliance wiki. It was set up to provide a place for those interested in thinking about computer-assisted interpretative practices, associated tools, and how they can be developed.
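
Checking an intuition of this comparative sort might look like the following sketch. Everything in it is an assumption made for illustration: the word list standing in for "political" vocabulary, the function name rate_per_thousand, and above all the two short placeholder strings, which are invented and are not quotations from Rousseau or Diderot. With real texts the choice of vocabulary is itself an interpretative decision.

```python
import re

def rate_per_thousand(text, vocabulary):
    """Occurrences of any word in `vocabulary` per 1,000 words of `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    hits = sum(1 for word in words if word in vocabulary)
    return 1000 * hits / len(words)

# Hypothetical vocabulary chosen by the interpreter, not an established list.
political = {"state", "law", "citizen"}

# Invented placeholder texts; substitute the full texts you want to compare.
text_a = "the state and the law bind the citizen"
text_b = "the family gathers at home in the evening"

print(rate_per_thousand(text_a, political))
print(rate_per_thousand(text_b, political))
```

Using a rate rather than a raw count matters because the texts being compared are rarely the same length. A difference in rates is a prompt for rereading, not a proof of the intuition.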

You are already doing text analysis

As an increasing amount of the evidence that we use is accessed electronically online, we are forced to use analytical techniques in everyday research. When you Google for documents and then follow trails through the web you are using sophisticated tools that rank results of queries for words or combinations of words. Few understand how Google operates, but we learn to use it. Likewise when we search bibliographic databases for articles and then read them online we are using computer research tools. To use large full-text databases we learn to search for keywords and we learn to interpret the resulting views. This is simple text analysis. Aren't you bothered by not being able to ask more refined questions of an electronic text? Doesn't it bother you to not know how the search works when you don't get what you expected? Text analysis as a practice is reflective. Text analysis researchers and developers are asking about the tools and the way they constrain or make possible practices of reflection.

Starting points

One resource TAPoR is adding to this wiki is a collection of Recipes which outline research practices that can be enabled by text analysis tools and give concrete examples.


What are electronic texts and how can we analyze them? A longer answer.

Based on Electronic Texts and Text Analysis by Geoffrey Rockwell and Ian Lancashire

The written word is one of the most important ways we communicate and preserve information. Whether it is legal records, novels, historical records, medical case studies, or now WWW pages, written text is an important form of data to preserve. It is one of the primary means by which we communicate in industry, academia or for pleasure and, as an increasing number of the texts that we care about are created in electronic form and accessed in electronic form, Canada needs a well thought out strategy for preserving those electronic texts of use to future generations. The future understanding of our past and understanding of this age of technological change will be incomplete if we do not take steps to preserve one of the most widely used forms of electronic information - the electronic text.

What is an Electronic Text?
Electronic texts digitally represent oral or written language in a form suitable for analysis with a computer. Typically an electronic text is either an electronic version of a written work, an electronic version of a transcript of an oral event, or a document composed on the computer. In any case the information in an electronic text is meant to be in a natural language that can be read by humans when displayed properly. Some examples of electronic texts would be:

An e-mail message
A medical case study that is stored on a computer

WWW pages that contain a significant amount of text so that to be understood they have to be read
A hypertext manual
An electronic edition of a play with markup and links to images of the original manuscript
A corpus of texts collected for linguistic or lexicographical study like a collection of exemplary texts used in the creation of a dictionary
A business document like a report, proposal, or contract
An interactive CD-ROM dictionary or encyclopedia

An interactive text adventure game where you read passages and make decisions
A collection of legal documents accessible through a retrieval system
A transcript of a series of interviews with embedded interpretative information
A transcript of a court case or administrative proceeding

Electronic texts come in four major forms:

    1. A copy of a work that was originally on paper - a digital representation of a literary, dramatic, or other type of written work that was originally in analogue form.
    2. A work composed on the computer and stored in that form but intended to be printed, such as a word-processing file or PDF (Portable Document Format) file.
    3. A work composed on a computer that is meant to be accessed on a computer, such as a WWW page, electronic text database, or hypertext.
    4. A transcript of a conversation or other oral event.

What can we do with electronic texts?
We can use computers to present, manage, and learn from electronic texts in ways difficult to do by hand. We can archive large quantities of text and make reliable copies of these archives. We can quickly retrieve passages from a large text database of millions of pages. We can ask where two or more words occur within the same paragraph. We can link automatically to other information from a hypertext. We can quantify writing style or try to identify the author of a disputed work by his or her style. We can compare written works or study the evolution of language usage over a collection of texts. In general the process of computer-assisted text analysis uses computers to search, retrieve, manipulate, measure and classify natural-language documents for patterns and by author, subject, and genre or type. Here is a partial list of some of the activities researchers do with electronic texts.

  • Editors and translators add interpretative information to electronic versions of historically important texts to create rich electronic editions for use by other scholars, students or interested readers. Such electronic editions can include modern spellings, commentary, variant translations, references, multimedia supplements and images of the original manuscript all available at a click of a button.

  • Researchers across the humanities and social sciences use electronic text collections to find passages where issues are discussed and to retrieve documents that are relevant to their questions.

  • Social scientists use text analysis to study interviews, responses to questionnaires, collections of policy documents, or letters. By qualitative analysis they characterize or model the topics, opinions, or psychological traits exhibited in the texts.


  • Linguists add information to texts about language features so that they can study language use. Using these corpora (collections of texts) they write dictionaries, grammars, studies of language change over time, and analyses of language use in different communities.

  • Researchers who are interested in the meaning of words analyze them by their company, that is, by the terms that co-occur or collocate with them, and use statistical techniques. Their research benefits researchers developing automatic translation tools for global commerce.

  • Sociologists, educationalists, and psychologists sometimes analyze one aspect of human behavior, communicative tasks, by studying the utterances of individual speakers for traits such as sentence length, rate of repetition, pauses, questions, negatives, turns, etc.

  • Experts are called upon in court to use combinations of these techniques to establish the authorship of disputed texts. Forensic linguistics is a growing field as an increasing number of the documents that we exchange are electronic so that traditional ways of establishing the author will not work. (There is no fingerprint on an e-mail message, only patterns of language use.) In general, a quantitative or qualitative profile of the disputed text is compared to profiles of texts known to have been written by candidate authors.

  • Documenters and usability analysts employ such techniques to improve client manuals and business technical reports, and to help customers to summarize documents.


  • Language instruction researchers use text tools to study language learning problems and develop collections of electronic texts with which to teach languages. They work with linguists to develop text collections with which to train translation systems.

  • Researchers from all areas publish in electronic journals creating more electronic texts for others to study and access.
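
The idea of comparing a profile of a disputed text to profiles of texts by candidate authors, mentioned in the list above, can be sketched as follows. This is a deliberately crude illustration and not a forensic method: the six function words, the distance measure (a sum of absolute differences), the function names, and the toy strings are all assumptions chosen to keep the example short.

```python
import re

# Illustrative choice of common function words; not a standard stylometric list.
FUNCTION_WORDS = ("the", "of", "and", "to", "a", "in")

def profile(text):
    """Relative frequency of each function word in the text."""
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # avoid dividing by zero on an empty text
    return [words.count(word) / total for word in FUNCTION_WORDS]

def distance(p, q):
    """Sum of absolute differences between two profiles (smaller means more alike)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def closest_candidate(disputed, candidates):
    """Name of the candidate whose known text has the profile nearest the disputed text."""
    target = profile(disputed)
    return min(candidates, key=lambda name: distance(target, profile(candidates[name])))

# Invented toy strings standing in for texts of known authorship.
known_texts = {
    "Author A": "the cat and the dog and the bird",
    "Author B": "a walk in a park to a lake",
}
print(closest_candidate("the sun and the moon", known_texts))
```

With texts this short the result means nothing; real studies use long samples, many more features, and statistical tests before drawing any conclusion about authorship.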

A brief history of electronic texts and text-analysis tools.
A good way to understand text analysis is to look at the tradition of concordancing from which it evolved. A concordance is a standard study tool where one can look up a word and find references to all the passages in the target work where that word occurs. Concordances are alphabetically sorted lists of the vocabulary of a text (its different words or phrases). Occurrences of each word (the keyword) appear under a headword, each one surrounded by enough context to make out the meaning, and each one identified by a citation to the text that gives its location in the original.

The first text-analysis tools were designed to create paper concordances. Father Roberto Busa in the late 1940s was one of the first to use computers in the production of concordances with his Index Thomisticus, a project that began by using index cards, moved on to analogue information technology in the 50s and migrated to the computer. The results were finally published in the 1970s and a CD was released in 1992.

The concordance, however, goes back to the 13th century. Hugh of St. Cher is credited with directing the production of a concordance to the Vulgate bible by brother Dominican monks in Paris. This concordance, supposedly finished by 1247, suffered in that it only had references and not quotations to give a sense of context. Quotations were apparently added later by English Dominicans to a concordance that has not survived. Finally, a concordance attributed to Conrad of Halberstadt improved on the model, leaving us by the end of the 13th century with a concordance that provided some context along with references.

I.1/577.1       | Four nights will quickly dream away the time; | And
I.1/578.2  Swift as a shadow, short as any dream; | Brief as the
II.2/585.1       | Ay me, for pity! what a dream was here! | Lysander,
III.2/591.1   this derision | Shall seem a dream and fruitless vision, |
IV.1/593.1     as the fierce vexation of a dream.| But first I will
IV.1/594.2   to me | That yet we sleep, we dream. Do not you think | The
IV.1/594.2     rare | vision. I have had a dream, past the wit of man to
IV.1/594.2    the wit of man to | say what dream it was: man is but an
IV.1/594.2   he go | about to expound this dream. Methought I was--there
IV.1/594.2    his heart to report, what my dream  | was. I will get Peter
IV.1/594.2     to write a ballad of | this dream: it shall be called
IV.1/594.2     it shall be called Bottom's dream, | because it hath no
V.1/599.1      | Following darkness like a dream, | Now are frolic: not a
V.1/599.2  theme, | No more yielding but a dream, | Gentles, do not

Example of a Key Word In Context display from an interactive concordance of Shakespeare's A Midsummer Night's Dream
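
A display of this kind can be approximated with a short sketch. It is not how any particular concordance program works: the function name kwic, the fixed character width of the context, and the shortened citation are assumptions, and the passage is a single line taken from the display above.

```python
import re

def kwic(passages, keyword, width=30):
    """Build Key Word In Context rows from (citation, text) pairs.

    Each row shows the citation, up to `width` characters of left context
    (right-aligned so the keywords line up in a column), the keyword, and
    up to `width` characters of right context.
    """
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    rows = []
    for citation, text in passages:
        for match in pattern.finditer(text):
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            rows.append(f"{citation:<8} {left:>{width}}{match.group()}{right}")
    return rows

# One passage with a shortened, illustrative citation label.
passages = [("IV.1", "I have had a dream, past the wit of man to say what dream it was")]
for row in kwic(passages, "dream", width=20):
    print(row)
```

Aligning the keyword in a single column is what lets the eye scan the words to its left and right for recurring patterns.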

To return to text-analysis tools, the first generation of widely available tools were batch tools that were not interactive, but were designed to produce paper concordances. This can be seen in the names and operations of many of these tools. COCOA stands for COunt and COncordance generation on the Atlas. The Oxford University Computing Service took over COCOA in 1978 and produced OCP (the Oxford Concordance Program).

With the availability and increasing power of microcomputers in the 1980s, text-analysis tools migrated from mainframes to personal computers. OCP developed into Micro-OCP and new programs came out for the personal computer, like the Brigham Young Concordance program (BYC), later renamed and commercialized under the name WordCruncher, and the TACT (Text-Analysis and Concordance Tools) environment developed at the University of Toronto. This shift to the microcomputer changed the nature of our use of the tools in two ways. The scholar could now use tools whenever they wanted on a personal computer instead of having to wait for mainframe time or time on a terminal. This meant that textual scholars were no longer dependent on the paper concordance, but could use the electronic tools in their own place of study. This change in the location of computer-assisted text-analysis along with developments in interface technology led developers away from a batch concording model towards interactive models that assumed that the scholar would have access to the tools and a collection of e-texts for personal study. It is the access to research-quality electronic texts that we need to ensure.

One of the best known examples of this shift from batch to interactive concording is TACT, which was developed at the University of Toronto and is still in wide use today. TACT is not meant for producing a printed concordance but for exploring the electronic text interactively through queries and windowed displays. TACT did not just automate the job of the concordancer, but changed the perspective of the user of the concordance. It offers advanced features suitable to careful text analysis not found in text retrieval systems. The WWW accessible version of TACT called TACTweb has further made it possible for researchers to share their textual research data over the Internet.

As interactive concording tools became accessible researchers began to ask more complex questions of text databases. Rather than simply asking for a list of locations of a word in the text, researchers began to ask for patterns of words, parts of words, linguistic features and punctuation. Researchers began to add statistical tools that could count and compare features and now we have visualization tools that display graphs that show usage over large quantities of texts. Many of these techniques are being used for business document management systems and basic Internet tools like the search engines we depend on to find WWW pages.

Why is it important to preserve electronic texts for research?

The primary means by which scholarship in the humanities and certain social sciences is transmitted, studied and stored for future use is in the form of written works like books, journals and manuscripts. Philosophers, historians, literary critics, art historians, political scientists and others in the Social Sciences and Humanities use primary sources that are texts and produce new research in the form of written works. An increasing number of these texts are generated on a computer and are therefore originally in electronic form. Further, a significant number of Canadian scholars have created text research resources that can only be studied in electronic form with the appropriate tools. There is now a critical mass of research resources available as electronic texts, and in some cases, only as electronic texts. It is safe to say that a significant number of researchers now need access to well maintained electronic text services in order to conduct research and the majority will in the near future as computing methods and text services spread through the disciplines.

Here are some specific reasons for preserving electronic text:

  1. It is already the case that certain bibliographic databases (those that include a substantial amount of text information) are being used primarily in electronic form. Few scholars use the MLA Index or the Philosopher's Index on paper. Bibliographic research tools such as Iter, which specializes in references having to do with the Middle Ages and Renaissance, are being developed in Canada. Such rich bibliographic resources need to be preserved to provide future scholars access to research literature in their fields.
  2. Major dictionary projects based in Canada like the Dictionary of Old English are producing electronic text databases of language usage for researchers. The DOE has a full-text database of all the significant works written in Old English. There is no organization at present in Canada that can provide access to, let alone preserve over the long term, this important research resource. The project had to look abroad for electronic text delivery support. Further, the dictionary itself, while it will be published on paper, will also exist in electronic form with additional features that will not be available unless preserved in electronic form. The Comparative Lexicography Project of French and English in Canada is likewise creating a database of Canadian texts in English and French and a bilingual Canadian dictionary. The research text resources of projects like these are crucial to preserve for the ongoing study of our languages. We will not know why we speak and write as we do without such systematic resources.

  3. As mentioned above, one of the better known text-analysis tools, TACT, was developed at the University of Toronto. Text and document tools are not only created at Canadian universities; an important commercial document management environment, Livelink by Open Text, benefited from research into text retrieval at the University of Waterloo. Another company, SoftQuad, produces XMetaL, one of the best SGML/XML editors. Canadian industry and researchers have been at the forefront of the development of tools for the study and retrieval of electronic texts whether for business or academic use. Both development communities benefit from the preservation and dissemination of a variety of electronic texts. For researchers and developers alike, these tools are only as useful as the e-texts available to study with them. An appropriate text service would provide the raw materials for software projects that have had a demonstrated benefit to the Canadian economy.
  4. Canadian researchers are involved in the creation of research quality electronic editions of important works of literature. The Internet Shakespeare Editions, for example, is a project out of the University of Victoria that is creating electronic versions of Shakespeare's works following best practices in the field. The Trésor de la langue française au Québec (TLFQ) at the Université Laval has assembled a corpus of electronic editions of literary texts of importance to the study of French in Québec. These research electronic texts are important not just to Canadian researchers, but also to researchers around the world. TLFQ is typical of the high level of electronic scholarship of Canadian scholars that we need to preserve and make available outside of Canada in order to increase international understanding of our cultures.

  5. Canadian researchers are also developing new forms of electronic research texts that can only be accessed on a computer. The Lyrical Ballads Bicentenary Project is making this work by Wordsworth and Coleridge available in an electronic form that allows for the comparison of versions. This resource can only exist in electronic form. The Performance in Victorian Hamilton project is creating a text database of all records of musical and theatrical performances in Hamilton in order to study entertainment before Cinema. Resources such as these have no analogue on paper and represent new forms of research that can only be preserved in electronic form.
  6. There is a growing number of courses and programmes in the Social Studies and the Humanities that make significant use of electronic texts to train undergraduate and graduate students in the application of computing to research. McMaster University has started an undergraduate Multimedia programme that includes courses on electronic texts and their study. The University of Alberta has announced an M.A. in Humanities Computing that includes core courses in electronic texts. We need access to a well maintained Canadian electronic text service to train future students in electronic literacy. The ability to manage, study, and retrieve textual information in electronic form is an important literacy skill in a world where an increasing amount of information is only available electronically.
  7. Finally, Canadian researchers need a service where they can deposit electronic texts comparable to the national efforts of countries like the United Kingdom which has set up an Arts and Humanities Data Service. What is at stake is the preservation of our textual heritage. The electronic texts of today will provide rich resources for future study including works of interest to those who want to study our age and this transition from analogue to digital scholarship.

Appropriate investment in the preservation of electronic texts created by or for researchers will not only keep valuable research resources accessible. It will also provide a model for other domains that have significant investments in electronic texts from the health sciences to the insurance sector. We need to begin to solve the problem of how to archive our exploding electronic record before it overwhelms us. The pursuit of efficient models for the research community will benefit Canada.

Selected Canadian Links:

Dictionary Projects:
Dictionary of Old English: http://www.doe.utoronto.ca/
Waterloo Centre for the Study of the New OED: http://db.uwaterloo.ca/OED/

Dictionnaire de l'Académie française: http://www.chass.utoronto.ca/~wulfric/academie/
Termium: http://www.translationbureau.gc.ca/pwgsc_internet/english/03_tools/03_termium.htm
Comparative Lexicography of French and English in Canada: http://balzac.sti.uottawa.ca/

Electronic Editions:

Internet Shakespeare Editions: http://web.uvic.ca/shakespeare/
TLFQ: http://www.ciral.ulaval.ca/tlfq/
Canadian Poetry Database: http://www.lib.unb.ca/Texts/projects.html
Representative Poetry On-line: http://www.library.utoronto.ca/utel/rp/intro.html
Laboratoire de Français Ancien: http://www.uottawa.ca/academic/arts/lfa/

Web Joyce - Finnegans Web: http://www.trentu.ca/jjoyce/fw.htm
Lyrical Ballads Project: http://www.dal.ca/~etc/lballads/index_std.html
Complete Poems and Letters of E.J. Pratt: http://www.trentu.ca/pratt/
Canadian Poetry: http://www.library.utoronto.ca/canpoetry/
Early Canadiana Online: http://www.canadiana.org/

Other Projects:
The Orlando Project: http://www.ualberta.ca/ORLANDO/
Iter: http://iter.utoronto.ca/iter/index.htm
Performance in Victorian Hamilton: http://cheiron.humanities.mcmaster.ca/~hamperf

University Programmes:

University of Alberta M.A. in Humanities Computing: http://www.arts.ualberta.ca/huco/
McMaster University Multimedia Programme: http://www.humanities.mcmaster.ca/~macmedia

Text Tools:
Open Text Corporation: http://www.opentext.com/
SoftQuad: http://www.softquad.com/

TACT: http://www.chass.utoronto.ca/cch/tact.html
TACTweb: http://tactweb.humanities.mcmaster.ca
SATO 4.0: http://www.ling.uqam.ca/sato/outils/sato.htm
Hyperpo: http://tapor.ualberta.ca/HyperPo/

Text Projects Elsewhere:

Arts and Humanities Data Service: http://www.ahds.ac.uk/
Oxford Text Archives: http://ota.ahds.ac.uk/
University of Virginia Electronic Text Centre: http://etext.lib.virginia.edu/
University of Virginia Institute for Advanced Technology in the Humanities: http://www.iath.virginia.edu/
Project Gutenberg: http://promo.net/pg/

Text Encoding Initiative: http://www.uic.edu/orgs/tei/
Humbul Humanities Hub: http://www.humbul.ac.uk/

 

-- GeoffreyRockwell - 30 Apr 2005

