Much of the evidence used by humanists is in textual form. One way of interpreting texts involves bringing questions to these texts and using various reading practices to reflect on the possible answers. Simple text analysis tools can help with this process of asking questions of a text and retrieving passages that help one think through the questions. The computer does not replace human interpretation; it enhances it. The concordance is an example of a research aid that predates computing. The concordance, like the index, allows the interpreter to find passages that share a common word about which they are asking. Unlike the index, the concordance presents the passages that "concord" together for reflection. Thus a Key Word In Context display shows one line of text with the key word searched for in the middle so that the reader can see if there are patterns in the way the word is used. Most text analysis tools build on the concordance. They break down the text (analysis) and then represent passages in new arrangements (synthesis) that aid the questioner, such as graphs of the distribution of a word.
- Text-analysis tools break a text down into smaller units like words, sentences, and passages, and then
- Gather these units into new views on the text that aid interpretation.
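The two steps above can be sketched in a few lines of Python. This is only a minimal illustration, not how any particular tool works: the function names (tokenize, kwic, format_kwic), the four-word context window, and the sample sentences are all assumptions made for the example. The text is first broken into word tokens (analysis), and then each occurrence of the key word is gathered into a Key Word In Context display (synthesis).

```python
import re

def tokenize(text):
    """Analysis step: break a text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def kwic(text, keyword, width=4):
    """Collect (left context, key word, right context) for each occurrence.

    `width` is the number of words of context kept on each side; the
    default of 4 is an arbitrary choice for this sketch.
    """
    tokens = tokenize(text)
    target = keyword.lower()
    lines = []
    for i, token in enumerate(tokens):
        if token == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines

def format_kwic(lines):
    """Synthesis step: line the key word up in one column for reading."""
    if not lines:
        return ""
    pad = max(len(left) for left, _, _ in lines)
    return "\n".join(f"{left.rjust(pad)}  {kw}  {right}" for left, kw, right in lines)

# Illustrative sample only; a real study would load a full e-text.
sample = (
    "Four nights will quickly dream away the time. "
    "Swift as a shadow, short as any dream. "
    "I have had a dream, past the wit of man to say what dream it was."
)
print(format_kwic(kwic(sample, "dream")))
```

Because the key word always falls in the same column, the eye can run down the display and compare the words that come before and after it.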
One of the side-effects of computer-assisted text analysis is that it forces the interpreter to think about their interpretative practices in order to use the tool. When you use an index or concordance you have to ask yourself what word or concept you want to follow through the text. For a computer to aid in interpretation you need to describe the questions you bring to bear and then formalize them into queries that the computer can handle. To look for an abstract concept in a text with a concordance you have to ask what words or patterns would indicate discussion of that concept. In the formalization the interpreter can learn about what they are asking and develop new questions. With interactive systems you start out thinking you are asking one question, but in the formalization and refinement you discover anomalies or other questions you want to ask. Interacting with a text through text analysis tools then becomes an iterative practice of discovery with its own serendipitous paths, comparable to, but not identical to, the serendipitous discoveries that happen in rereading a text.
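As a rough sketch of what this formalization can look like, the Python below turns a concept into a list of word patterns and retrieves the passages that match any of them. Everything here is an assumption for illustration: the helper names (split_passages, find_concept), the choice of "dream", "sleep" and "vision" as indicators of a concept of dreaming, and the short sample adapted from A Midsummer Night's Dream. Choosing and revising the pattern list is exactly the interpretative step described above.

```python
import re

def split_passages(text):
    """Split a text into rough sentence-sized passages.

    A passage ends at '.', '!' or '?' followed by whitespace; this is a
    simplification that will misjudge abbreviations and similar cases.
    """
    return [p.strip() for p in re.split(r"(?<=[.!?])\s+", text) if p.strip()]

def find_concept(text, patterns):
    """Return (passage, matched words) for each passage matching any pattern."""
    query = re.compile(r"\b(?:" + "|".join(patterns) + r")\b", re.IGNORECASE)
    hits = []
    for passage in split_passages(text):
        matches = query.findall(passage)
        if matches:
            hits.append((passage, matches))
    return hits

# Hypothetical formalization of the concept "dreaming" as word patterns.
concept_patterns = [r"dream\w*", r"sleep\w*", r"vision\w*"]

# Short adapted sample, for illustration only.
sample = (
    "I have had a most rare vision. I have had a dream. "
    "Methought I was a man. Are you sure that we are awake? "
    "It seems to me that yet we sleep, we dream."
)

for passage, words in find_concept(sample, concept_patterns):
    print(words, "|", passage)
```

Note that the passage about being "awake" is not retrieved. Noticing an omission like this is what prompts a refinement of the query, or a new question about what the concept includes.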
For some, text analysis is a way of "thinking through" or targeted rereading. When interpreting a text you develop intuitions about what may be an interesting rereading of the evidence. Most intuitions have implied quantitative or comparative elements of the sort "Rousseau is more interested in political issues while Diderot is interested in domestic philosophical issues." Such intuitions can be checked with text analysis. What language use would we expect in a philosopher interested in state issues rather than personal or domestic issues? Is there a higher frequency of certain words in Rousseau? In this formal thinking through of intuitions one doesn't necessarily prove the intuition; instead one uses interpretative aids in practices of reflection. Interestingly, these practices are often not shared as formal methods in the humanities. The papers we write often don't mention the text analysis, just as they don't mention the library practices. The practices are undertheorized in the humanities, which is one of the reasons for this Text Analysis Developers Alliance wiki. It was set up to provide a place for those interested in thinking about computer-assisted interpretative practices, associated tools, and how they can be developed.
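A comparative intuition of this kind can be checked with a simple relative frequency count. The sketch below is a hedged illustration only: the function name relative_frequency, the word list standing in for "political" vocabulary, and the two invented one-sentence texts are all assumptions, and the texts are placeholders rather than passages from Rousseau or Diderot. Rates are given per 1,000 words so that texts of different lengths can be compared.

```python
import re
from collections import Counter

def relative_frequency(text, words, per=1000):
    """Occurrences of any of `words` per `per` word tokens of `text`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    hits = sum(counts[w] for w in words)
    return per * hits / len(tokens)

# Hypothetical word list standing in for "political" vocabulary.
political = {"state", "law", "citizen", "sovereign"}

# Invented placeholder texts; a real comparison would use full works.
text_a = "The state makes the law and the citizen obeys the sovereign law."
text_b = "The family gathers at home and the father speaks of law and virtue."

for label, text in [("Text A", text_a), ("Text B", text_b)]:
    rate = relative_frequency(text, political)
    print(f"{label}: {rate:.1f} per 1,000 words")
```

A higher rate in one text does not prove the intuition. It gives the interpreter something to reflect on, and often suggests that the word list itself needs rethinking.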
As an increasing amount of the evidence that we use is accessed electronically online, we are forced to use analytical techniques in everyday research. When you Google for documents and then follow trails through the web you are using sophisticated tools that rank results of queries for words or combinations of words. Few understand how Google operates, but we learn to use it. Likewise, when we search bibliographic databases for articles and then read them online we are using computer research tools. To use large full-text databases we learn to search for keywords and we learn to interpret the resulting views. This is simple text analysis. Aren't you bothered by not being able to ask more refined questions of an electronic text? Doesn't it bother you not to know how the search works when you don't get what you expected? Text analysis as a practice is reflective. Text analysis researchers and developers are asking about the tools and the way they constrain or make possible practices of reflection.
The written word is one of the most important ways we communicate and preserve information. Whether it is legal records, novels, historical records, medical case studies, or now WWW pages, written text is an important form of data to preserve. It is one of the primary means by which we communicate in industry, in academia, or for pleasure. As an increasing number of the texts that we care about are created and accessed in electronic form, Canada needs a well thought out strategy for preserving those electronic texts of use to future generations. The future understanding of our past and of this age of technological change will be incomplete if we do not take steps to preserve one of the most widely used forms of electronic information: the electronic text.
What is an Electronic Text?
Electronic texts come in four major forms:
What can we do with electronic texts?
A brief history of electronic texts and text-analysis tools
The first text-analysis tools were designed to create paper concordances. Father Roberto Busa in the late 1940s was one of the first to use computers in the production of concordances with his Index Thomisticus, a project that began by using index cards, moved on to analogue information technology in the 1950s and then migrated to the computer. The results were finally published in the 1970s and a CD was released in 1992.
The concordance, however, goes back to the 13th century. Hugh of St. Cher is credited with directing the production of a concordance to the Vulgate Bible by fellow Dominican monks in Paris. This concordance, supposedly finished by 1247, suffered in that it only had references and not quotations to give a sense of context. Quotations were apparently added later by English Dominicans to a concordance that has not survived. Finally, a concordance attributed to Conrad of Halberstadt improved on the model, leaving us by the end of the 13th century with a concordance that provided some context along with references.
I.1/577.1         | Four nights will quickly dream away the time; | And
I.1/578.2    Swift as a shadow, short as any dream; | Brief as the
II.2/585.1         | Ay me, for pity! what a dream was here! | Lysander,
III.2/591.1     this derision | Shall seem a dream and fruitless vision, |
IV.1/593.1       as the fierce vexation of a dream.| But first I will
IV.1/594.2     to me | That yet we sleep, we dream. Do not you think | The
IV.1/594.2       rare | vision. I have had a dream, past the wit of man to
IV.1/594.2      the wit of man to | say what dream it was: man is but an
IV.1/594.2     he go | about to expound this dream. Methought I was--there
IV.1/594.2      his heart to report, what my dream | was. I will get Peter
IV.1/594.2       to write a ballad of | this dream: it shall be called
IV.1/594.2       it shall be called Bottom's dream, | because it hath no
V.1/599.1        | Following darkness like a dream, | Now are frolic: not a
V.1/599.2    theme, | No more yielding but a dream, | Gentles, do not
Example of a Key Word In Context display from an interactive concordance of Shakespeare's A Midsummer Night's Dream
To return to text-analysis tools, the first generation of widely available tools were batch tools that were not interactive, but were designed to produce paper concordances. This can be seen in the names and operations of many of these tools. COCOA stands for COunt and COncordance generation on the Atlas. The Oxford University Computing Service took over COCOA in 1978 and produced OCP (the Oxford Concordance Program).
With the availability and increasing power of micro-computers in the 1980s, text-analysis tools migrated from mainframes to personal computers. OCP developed into Micro-OCP, and new programs came out for the personal computer, like the Brigham Young Concordance program (BYC), later renamed and commercialized under the name WordCruncher, and the TACT (Text-Analysis and Concordance Tools) environment developed at the University of Toronto. This shift to the microcomputer changed the nature of our use of the tools in two ways. Scholars could now use tools whenever they wanted on a personal computer instead of having to wait for mainframe time or time on a terminal. This also meant that textual scholars were no longer dependent on the paper concordance, but could use the electronic tools in their own place of study. This change in the location of computer-assisted text-analysis, along with developments in interface technology, led developers away from a batch concording model towards interactive models that assumed that the scholar would have access to the tools and a collection of e-texts for personal study.
It is the access to research-quality electronic texts that we need to ensure.
One of the best known examples of this shift from batch to interactive concording is TACT, which was developed at the University of Toronto and is still in wide use today. TACT is not meant for producing a printed concordance but for exploring the electronic text interactively through queries and windowed displays. TACT did not just automate the job of the concordancer; it changed the perspective of the user of the concordance. It offers advanced features suited to careful text analysis that are not found in text retrieval systems. TACTweb, the version of TACT accessible over the WWW, has further made it possible for researchers to share their textual research data over the Internet.
As interactive concording tools became accessible, researchers began to ask more complex questions of text databases. Rather than simply asking for a list of locations of a word in the text, researchers began to ask for patterns of words, parts of words, linguistic features and punctuation. Researchers began to add statistical tools that could count and compare features, and now we have visualization tools that display graphs showing usage over large quantities of texts. Many of these techniques are being used for business document management systems and for basic Internet tools like the search engines we depend on to find WWW pages.
Why is it important to preserve electronic texts for research?
Here are some specific reasons for preserving electronic text:
Appropriate investment in the preservation of electronic texts created by or for researchers will not only keep valuable research resources accessible. It will also provide a model for other domains that have significant investments in electronic texts from the health sciences to the insurance sector. We need to begin to solve the problem of how to archive our exploding electronic record before it overwhelms us. The pursuit of efficient models for the research community will benefit Canada.
Selected Canadian Links:
Dictionary Projects:
- Termium: http://www.translationbureau.gc.ca/pwgsc_internet/english/03_tools/03_termium.htm
- Comparative Lexicography of French and English in Canada: http://balzac.sti.uottawa.ca/
Electronic Editions:
- TLFQ: http://www.ciral.ulaval.ca/tlfq/
- Canadian Poetry Database: http://www.lib.unb.ca/Texts/projects.html
- Representative Poetry On-line: http://www.library.utoronto.ca/utel/rp/intro.html
- Laboratoire de Français Ancien: http://www.uottawa.ca/academic/arts/lfa/
- Web Joyce - Finnegans Web: http://www.trentu.ca/jjoyce/fw.htm
- Lyrical Ballads Project: http://www.dal.ca/~etc/lballads/index_std.html
- Complete Poems and Letters of E.J. Pratt: http://www.trentu.ca/pratt/
- Canadian Poetry: http://www.library.utoronto.ca/canpoetry/
- Early Canadiana Online: http://www.canadiana.org/
Other Projects:
University Programmes:
- McMaster University Multimedia Programme: http://www.humanities.mcmaster.ca/~macmedia
Text Tools:
- TACTweb: http://tactweb.humanities.mcmaster.ca
- SATO 4.0: http://www.ling.uqam.ca/sato/outils/sato.htm
- Hyperpo: http://tapor.ualberta.ca/HyperPo/
Text Projects Elsewhere:
- Oxford Text Archives: http://ota.ahds.ac.uk/
- University of Virginia Electronic Text Centre: http://etext.lib.virginia.edu/
- University of Virginia Institute for Advanced Technology in the Humanities: http://www.iath.virginia.edu/
- Project Gutenberg: http://promo.net/pg/
- Text Encoding Initiative: http://www.uic.edu/orgs/tei/
- Humbul Humanities Hub: http://www.humbul.ac.uk/