|
The written word is one of the most important
ways we communicate and preserve information.
Whether it is legal records, novels, historical
records, medical case studies, or now WWW pages,
written text is an important form of data to
preserve. It is one of the primary means by which
we communicate in industry, in academia, and for pleasure.
As an increasing number of the texts that
we care about are created and accessed in electronic
form, Canada needs a well-thought-out
strategy for preserving those electronic
texts of use to future generations. The future
understanding of our past and understanding of
this age of technological change will be incomplete
if we do not take steps to preserve one of the
most widely used forms of electronic information
- the electronic text.
What is an Electronic Text?
Electronic texts digitally represent oral or written
language in a form suitable for analysis with
a computer. Typically an electronic text is either
an electronic version of a written work, an electronic
version of a transcript of an oral event, or a
document composed on the computer. In any case
the information in an electronic text is meant
to be in a natural language that can be read by
humans when displayed properly. Some examples
of electronic texts would be:
- An e-mail message
- A medical case study that is stored on a computer
- WWW pages that contain a significant amount
of text so that to be understood they have to
be read
- A hypertext manual
- An electronic edition of a play with markup
and links to images of the original manuscript
- A corpus of texts collected for linguistic or
lexicographical study like a collection of exemplary
texts used in the creation of a dictionary
- A business document like a report, proposal,
or contract
- An interactive CD-ROM dictionary or encyclopedia
- An interactive text adventure game where you
read passages and make decisions
- A collection of legal documents accessible through
a retrieval system
- A transcript of a series of interviews with
embedded interpretative information
- A transcript of a court case or administrative
proceeding
Electronic texts come in four major forms:
- A copy of a work that was originally on
paper - a digital representation of a literary,
dramatic, or other type of written work that
was originally in analogue form.
- A work composed on the computer that is
stored in that form but was intended to be
printed, like a word-processing file or PDF
(Portable Document Format) file.
- A work composed on a computer that is meant
to be accessed on a computer, like a WWW page,
electronic text database, or hypertext.
- A transcript of a conversation or other
oral event.
What can we do with electronic texts?
We can use computers to present, manage, and learn
from electronic texts in ways difficult to do
by hand. We can archive large quantities of text
and make reliable copies of these archives. We
can quickly retrieve passages from a large text
database of millions of pages. We can ask where
two or more words occur within the same paragraph.
We can link automatically to other information
from a hypertext. We can quantify writing style
or try to identify the author of a disputed work
by his or her style. We can compare written works
or study the evolution of language usage over
a collection of texts. In general, computer-assisted
text analysis uses computers
to search, retrieve, manipulate, and measure
natural-language documents for patterns, and to
classify them by author, subject, and genre or type. Here is a partial
list of some of the activities researchers carry out
with electronic texts.
- Editors and translators add interpretative
information to electronic versions of historically
important texts to create rich electronic editions
for use by other scholars, students or interested
readers. Such electronic editions can include
modern spellings, commentary, variant translations,
references, multimedia supplements and images
of the original manuscript all available at
a click of a button.
- Researchers across the humanities and social
sciences use electronic text collections to find
passages where issues are discussed and to retrieve
documents that are relevant to their questions.
- Social scientists use text analysis to study
interviews, responses to questionnaires, collections
of policy documents, or letters. By qualitative
analysis they characterize or model the topics,
opinions, or psychological traits exhibited
in the texts.
- Linguists add information to texts about language
features so that they can study language use.
Using these corpora (collections of texts) they
write dictionaries, grammars, studies of language
change over time, and analyses of language use
in different communities.
- Researchers who are interested in the meaning
of words analyze them by their company, that
is, by the terms that co-occur or collocate with
them, using statistical techniques. Their
research benefits researchers developing automatic
translation tools for global commerce.
- Sociologists, educationalists, and psychologists
sometimes analyze one aspect of human behavior,
communicative tasks, by studying the utterances
of individual speakers for traits such as sentence
length, rate of repetition, pauses, questions,
negatives, turns, etc.
- Experts are called upon in court to use combinations
of these techniques to establish the authorship
of disputed texts. Forensic linguistics is a
growing field: as an increasing number of the
documents that we exchange are electronic,
traditional ways of establishing the author
will not work. (There is no fingerprint on an
e-mail message, only patterns of language use.)
In general, a quantitative or qualitative profile
of the disputed text is compared to profiles
of texts known to have been written by candidate
authors.
- Documenters and usability analysts employ
such techniques to improve client manuals and
business technical reports, and to help customers
to summarize documents.
- Language instruction researchers use text
tools to study language learning problems and
develop collections of electronic texts with
which to teach languages. They work with linguists
to develop text collections with which to train
translation systems.
- Researchers from all areas publish in electronic
journals creating more electronic texts for
others to study and access.
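The authorship comparison described above, in which a profile of a disputed text is compared to profiles of texts by candidate authors, can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the short list of function words, the sum-of-differences distance, and the toy texts are all invented for the example and are not a method named in this document. Real forensic work uses far richer profiles and much longer samples.

```python
import re
from collections import Counter

# A small, illustrative set of function words; real studies use far more.
FUNCTION_WORDS = ["the", "of", "and", "to", "a", "in", "that", "it"]


def profile(text):
    """Return the relative frequency of each function word in `text`."""
    words = re.findall(r"[a-z']+", text.lower())
    counts = Counter(words)
    total = len(words) or 1  # avoid dividing by zero on empty input
    return [counts[w] / total for w in FUNCTION_WORDS]


def distance(p, q):
    """Sum of absolute differences between two profiles."""
    return sum(abs(a - b) for a, b in zip(p, q))


def closest_author(disputed, known_texts):
    """Return the candidate whose profile is nearest to the disputed text."""
    target = profile(disputed)
    return min(
        known_texts,
        key=lambda name: distance(target, profile(known_texts[name])),
    )


# Hypothetical toy data, invented only to exercise the functions.
known = {
    "Author A": "the cat sat on the mat and the dog lay in the sun",
    "Author B": "it is said that it rains and that it pours",
}
disputed = "the bird sang in the tree and the fox hid in the den"
print(closest_author(disputed, known))
```

With these toy samples the disputed text is matched to "Author A", because both lean heavily on "the" and "in". Texts this short prove nothing about real authorship; the point is only the shape of the comparison.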
A brief history of electronic texts and text-analysis tools
A good way to understand text analysis is to look
at the tradition of concordancing from which it
evolved. A concordance is a standard study tool
where one can look up a word and find references
to all the passages in the target work where that
word occurs. Concordances are alphabetically sorted lists
of the vocabulary of a text (its different words
or phrases). Occurrences of each word (the keyword)
appear under a headword, each one surrounded by
enough context to make out the meaning, and each
one identified by a citation to the text that
gives its location in the original.
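The structure of a concordance described above (headword, surrounding context, citation) can be sketched as a small Python program. This is a minimal illustration, not the design of any tool mentioned in this document. The function names, the 30-character context width, and the simplified act.scene citation labels are assumptions made for the example.

```python
import re
from collections import defaultdict


def build_concordance(lines, width=30):
    """Map each word to a list of (citation, context) entries.

    `lines` is a list of (citation, text) pairs; the citation scheme is
    up to the caller (here, a simplified act.scene label).
    """
    concordance = defaultdict(list)
    for citation, text in lines:
        for match in re.finditer(r"[A-Za-z']+", text):
            headword = match.group().lower()
            # Keep up to `width` characters on either side of the keyword.
            left = text[max(0, match.start() - width):match.start()]
            right = text[match.end():match.end() + width]
            context = f"{left:>{width}}{match.group()}{right}"
            concordance[headword].append((citation, context))
    return concordance


def print_entry(concordance, headword):
    """Print every occurrence of one headword with its citation."""
    for citation, context in concordance.get(headword, []):
        print(f"{citation:<8} {context}")


sample = [
    ("I.1", "Four nights will quickly dream away the time;"),
    ("I.1", "Swift as a shadow, short as any dream;"),
    ("IV.1", "I have had a dream, past the wit of man"),
]
concordance = build_concordance(sample)
print_entry(concordance, "dream")

# Headwords in alphabetical order, as in a printed concordance.
headwords = sorted(concordance)
```

Right-aligning the left context lines the keyword up in a single column, which is what makes a keyword-in-context listing easy to scan.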
The first text-analysis tools were designed to
create paper concordances. Father Roberto Busa
in the late 1940s was one of the first to use
computers in the production of concordances
with his Index Thomisticus, a project that began
by using index cards, moved on to analogue information
technology in the 1950s, and then migrated to the computer.
The results were finally published in the 1970s
and a CD was released in 1992.
The concordance, however, goes back to the 13th
century. Hugh of St. Cher is credited with directing
the production of a concordance to the Vulgate
Bible by brother Dominican monks in Paris. This
concordance, supposedly finished by 1247, suffered
in that it only had references and not quotations
to give a sense of context. Quotations were apparently
added later by English Dominicans to a concordance
that has not survived. Finally, a concordance
attributed to Conrad of Halberstadt improved on
the model, leaving us by the end of the 13th century
with a concordance that provided some context
along with references.
The following sample shows what such a concordance looks like: each occurrence of the keyword "dream" in A Midsummer Night's Dream is listed with its citation and surrounding context.
I.1/577.1 | Four nights will quickly dream away the time; | And
I.1/578.2 Swift as a shadow, short as any dream; | Brief as the
II.2/585.1 | Ay me, for pity! what a dream was here! | Lysander,
III.2/591.1 this derision | Shall seem a dream and fruitless vision, |
IV.1/593.1 as the fierce vexation of a dream.| But first I will
IV.1/594.2 to me | That yet we sleep, we dream. Do not you think | The
IV.1/594.2 rare | vision. I have had a dream, past the wit of man to
IV.1/594.2 the wit of man to | say what dream it was: man is but an
IV.1/594.2 he go | about to expound this dream. Methought I was--there
IV.1/594.2 his heart to report, what my dream | was. I will get Peter
IV.1/594.2 to write a ballad of | this dream: it shall be called
IV.1/594.2 it shall be called Bottom's dream, | because it hath no
V.1/599.1 | Following darkness like a dream, | Now are frolic: not a
V.1/599.2 theme, | No more yielding but a dream, | Gentles, do not
To return to text-analysis tools, the first generation
of widely available tools were batch tools that
were not interactive, but were designed to produce
paper concordances. This can be seen in the names
and operations of many of these tools. COCOA stands
for COunt and COncordance generation on the Atlas.
The Oxford University Computing Service took over
COCOA in 1978 and produced OCP (the Oxford Concordance
Program).
With the availability and increasing power of
microcomputers in the 1980s, text-analysis tools
migrated from mainframes to personal computers.
OCP developed into Micro-OCP, and new programs
came out for the personal computer, like the Brigham
Young Concordance program (BYC), later renamed
and commercialized under the name WordCruncher,
and the TACT (Text-Analysis and Concordance Tools)
environment developed at the University of Toronto.
This shift to the microcomputer changed the nature
of our use of the tools in two ways. First, scholars
could now use the tools whenever they wanted on a
personal computer instead of having to wait for
mainframe time or time on a terminal. Second,
textual scholars were no longer dependent
on the paper concordance, but could use the electronic
tools in their own place of study. This change
in the location of computer-assisted text-analysis
along with developments in interface technology
led developers away from a batch concording model
towards interactive models that assumed that the
scholar would have access to the tools and a collection
of e-texts for personal study. It is the access
to research-quality electronic texts that we need to ensure.
One of the best known examples of this shift
from batch to interactive concording is TACT, which
was developed at the University of Toronto and
is still in wide use today. TACT is not meant
for producing a printed concordance but for exploring
the electronic text interactively through queries
and windowed displays. TACT does not just automate
the job of the concordancer; it changes the perspective
of the user of the concordance. It offers advanced
features suited to careful text analysis that are not
found in text retrieval systems. The WWW-accessible
version of TACT, called TACTweb, has further made
it possible for researchers to share their
textual research data over the Internet.
As interactive concording tools became accessible,
researchers began to ask more complex questions
of text databases. Rather than simply asking for
a list of locations of a word in the text, researchers
began to ask for patterns of words, parts of words,
linguistic features and punctuation. Researchers
began to add statistical tools that could count
and compare features and now we have visualization
tools that display graphs that show usage over
large quantities of texts. Many of these techniques
are being used for business document management
systems and basic Internet tools like the search
engines we depend on to find WWW pages.
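Two of the more complex questions mentioned above, finding parts of words and finding where two or more words occur within the same paragraph, can be sketched with regular expressions in Python. This is a rough illustration only. The function names, the blank-line definition of a paragraph, and the sample text are assumptions made for the example, not features of TACT or any other tool named here.

```python
import re


def paragraphs(text):
    """Split a text into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def cooccurring(text, *terms):
    """Return (index, paragraph) pairs in which every term appears as a whole word."""
    hits = []
    for index, para in enumerate(paragraphs(text)):
        if all(
            re.search(rf"\b{re.escape(t)}\b", para, re.IGNORECASE)
            for t in terms
        ):
            hits.append((index, para))
    return hits


def word_pattern(text, pattern):
    """Find whole words matching a regular expression, e.g. a shared prefix."""
    return re.findall(rf"\b(?:{pattern})\b", text, re.IGNORECASE)


# Hypothetical three-paragraph text, invented for illustration.
text = """The archive preserves electronic texts.

Scholars search the archive for a word and its context.

Preserving texts lets future scholars study our age."""

print(cooccurring(text, "scholars", "archive"))
print(word_pattern(text, r"preserv\w*"))
```

Here only the second paragraph contains both "scholars" and "archive", and the prefix pattern picks up both "preserves" and "Preserving". Counting such matches across many texts is the starting point for the statistical and visualization tools described above.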
Why is it important to preserve electronic texts for research?
The primary means by which scholarship in the
humanities and certain social sciences is transmitted,
studied and stored for future use is in the form
of written works like books, journals and manuscripts.
Philosophers, historians, literary critics, art
historians, political scientists and others in
the Social Sciences and Humanities use primary
sources that are texts and produce new research
in the form of written works. An increasing number
of these texts are generated on a computer and
are therefore originally in electronic form. Further,
a significant number of Canadian scholars have
created text research resources that can only
be studied in electronic form with the appropriate
tools. There is now a critical mass of research
resources available as electronic texts, and in
some cases, only as electronic texts. It is safe
to say that a significant number of researchers
now need access to well maintained electronic
text services in order to conduct research and
the majority will in the near future as computing
methods and text services spread through the disciplines.
Here are some specific reasons for preserving
electronic text:
- It is already the case that certain bibliographic
databases (those that include a substantial
amount of text information) are being used primarily
in electronic form. Few scholars use the MLA
Index or the Philosopher's Index on paper. Bibliographic
research tools such as Iter, which specializes
in references having to do with the Middle Ages
and Renaissance, are being developed in Canada.
Such rich bibliographic resources need to be
preserved to provide future scholars access
to research literature in their fields.
- Major dictionary projects based in Canada
like the Dictionary of Old English are producing
electronic text databases of language usage
for researchers. The DOE has a full-text database
of all the significant works written in Old
English. There is no organization at present
in Canada that can provide access to, let alone
preserve over the long term, this important research resource. The project
had to look abroad for electronic text delivery
support. Further, the dictionary itself, while
it will be published on paper, will also exist
in electronic form with additional features
that will not be available unless preserved
in electronic form. The Comparative Lexicography
Project of French and English in Canada is likewise
creating a database of Canadian texts in English
and French and a bilingual Canadian dictionary.
The research text resources of projects like
these are crucial to preserve for the ongoing
study of our languages. We will not know why
we speak and write as we do without such systematic
resources.
- As mentioned above, one of the better known
text-analysis tools, TACT, was developed at the
University of Toronto. Text and document tools
are not only created at Canadian universities;
an important commercial document management
environment, Livelink by Open Text, benefited
from research into text retrieval at the University
of Waterloo. Another company, SoftQuad, produces
XMetaL, one of the best SGML/XML editors. Canadian
industry and researchers have been at the forefront
of the development of tools for the study and
retrieval of electronic texts whether for business
or academic use. Both development communities
benefit from the preservation and dissemination
of a variety of electronic texts. For researchers
and developers alike, these tools are only as
useful as the e-texts available to study with
them. An appropriate text service would provide
the raw materials for software projects that
have had a demonstrated benefit to the Canadian
economy.
- Canadian researchers are involved in the
creation of research quality electronic editions
of important works of literature. The Internet
Shakespeare Editions, for example, is a project
out of the University of Victoria that is creating
electronic versions of Shakespeare's works following
best practices in the field. The Trésor
de la langue française au Québec
(TLFQ) at the Université Laval has assembled
a corpus of electronic editions of literary
texts of importance to the study of French in
Québec. These research electronic texts
are important not just to Canadian researchers,
but also to researchers around the world. TLFQ
is typical of the high level of electronic scholarship
of Canadian scholars that we need to preserve
and make available outside of Canada in order
to increase international understanding of our
cultures.
- Canadian researchers are also developing
new forms of electronic research texts that
can only be accessed on a computer. The Lyrical
Ballads Bicentenary Project is making this work
by Wordsworth and Coleridge available in an
electronic form that allows for the comparison
of versions. This resource can only exist in
electronic form. The Performance in Victorian
Hamilton project is creating a text database
of all records of musical and theatrical performances
in Hamilton in order to study entertainment
before cinema. Resources such as these have
no analogue on paper and represent new forms
of research that can only be preserved in electronic
form.
- There is a growing number of courses and
programmes in the Social Sciences and the Humanities
that make significant use of electronic texts
to train undergraduate and graduate students
in the application of computing to research.
McMaster University has started an undergraduate
Multimedia programme that includes courses on
electronic texts and their study. The University
of Alberta has announced an M.A. in Humanities
Computing that includes core courses in electronic
texts. We need access to a well maintained Canadian
electronic text service to train future students
in electronic literacy. The ability to manage,
study, and retrieve textual information in electronic
form is an important literacy skill in a world
where an increasing amount of information is
only available electronically.
- Finally, Canadian researchers need a service
where they can deposit electronic texts, comparable
to the national services of countries like the
United Kingdom, which has set up an Arts and Humanities
Data Service. What is at stake is the preservation
of our textual heritage. The electronic texts
of today will provide rich resources for future
study, including works of interest to those
who want to study our age and this transition
from analogue to digital scholarship.
Appropriate investment in the preservation of
electronic texts created by or for researchers
will not only keep valuable research resources
accessible. It will also provide a model for other
domains that have significant investments in electronic
texts, from the health sciences to the insurance
sector. We need to begin to solve the problem
of how to archive our exploding electronic record
before it overwhelms us. The pursuit of efficient
models for the research community will benefit
Canada.
Selected Canadian Links:
Dictionary Projects:
Dictionary of Old English: http://www.doe.utoronto.ca/
Waterloo Centre for the Study of the New OED:
http://db.uwaterloo.ca/OED/
Dictionnaire de l'Académie française:
http://www.chass.utoronto.ca/~wulfric/academie/
Termium: http://www.translationbureau.gc.ca/pwgsc_internet/english/03_tools/03_termium.htm
Comparative Lexicography of French and English
in Canada: http://balzac.sti.uottawa.ca/
Electronic Editions:
Internet Shakespeare Editions: http://web.uvic.ca/shakespeare/
TLFQ: http://www.ciral.ulaval.ca/tlfq/
Canadian Poetry Database: http://www.lib.unb.ca/Texts/projects.html
Representative Poetry On-line: http://www.library.utoronto.ca/utel/rp/intro.html
Laboratoire de Français Ancien: http://www.uottawa.ca/academic/arts/lfa/
Web Joyce - Finnegans Web: http://www.trentu.ca/jjoyce/fw.htm
Lyrical Ballads Project: http://www.dal.ca/~etc/lballads/index_std.html
Complete Poems and Letters of E.J. Pratt: http://www.trentu.ca/pratt/
Canadian Poetry: http://www.library.utoronto.ca/canpoetry/
Early Canadiana Online: http://www.canadiana.org/
Other Projects:
The Orlando Project: http://www.ualberta.ca/ORLANDO/
Iter: http://iter.utoronto.ca/iter/index.htm
Performance in Victorian Hamilton: http://cheiron.humanities.mcmaster.ca/~hamperf
University Programmes:
University of Alberta M.A. in Humanities Computing:
http://www.arts.ualberta.ca/huco/
McMaster University Multimedia Programme: http://www.humanities.mcmaster.ca/~macmedia
Text Tools:
Open Text Corporation: http://www.opentext.com/
SoftQuad: http://www.softquad.com/
TACT: http://www.chass.utoronto.ca/cch/tact.html
TACTweb: http://tactweb.humanities.mcmaster.ca
SATO 4.0: http://www.ling.uqam.ca/sato/outils/sato.htm
Hyperpo: http://tapor.ualberta.ca/HyperPo/
Text Projects Elsewhere:
Arts and Humanities Data Service: http://www.ahds.ac.uk/
Oxford Text Archives: http://ota.ahds.ac.uk/
University of Virginia Electronic Text Centre:
http://etext.lib.virginia.edu/
University of Virginia Institute for Advanced
Technology in the Humanities: http://www.iath.virginia.edu/
Project Gutenberg: http://promo.net/pg/
Text Encoding Initiative: http://www.uic.edu/orgs/tei/
Humbul Humanities Hub: http://www.humbul.ac.uk/
|