Main.WhatTAPoR (r1.1 vs. r1.5)
Diffs

 <<O>>  Difference Topic WhatTAPoR (r1.5 - 29 Apr 2008 - GeoffreyRockwell)

META TOPICPARENT WebHome

What is the TAPoR portal?

Line: 12 to 12

Tool Developers

Changed:
<
<
The TAPoR Portal allows tool developers to make their tools available to a wide audience as a web service.
>
>
The TAPoR Portal allows tool developers to make their tools available to a wide audience as a web service. We have draft documentation for developers here. Please help us!

Links


 <<O>>  Difference Topic WhatTAPoR (r1.4 - 08 Dec 2005 - GeoffreyRockwell)

META TOPICPARENT WebHome

What is the TAPoR portal?

Line: 16 to 16

Links

Changed:
<
<
  • Current development version from OpenSky Solutions is at dev.openskysolutions.ca. This is only for project developers.
  • Current stable release version is at test-tapor.mcmaster.ca. To get an account for this, write to Lian Yan - lyan (at) mcmaster (dot) ca.
>
>

  • The development wiki (as opposed to this one, which is for documentation) is at tapor.openskysolutions.ca - Note, you will need an account for this.

 <<O>>  Difference Topic WhatTAPoR (r1.3 - 25 May 2005 - GeoffreyRockwell)

META TOPICPARENT WebHome

What is the TAPoR portal?

Line: 16 to 16

Links

Changed:
<
<
  • Current development version from OpenSky Solutions is at dev.openskysolutions.ca This is only for project developers.
>
>
  • Current development version from OpenSky Solutions is at dev.openskysolutions.ca. This is only for project developers.

  • Current stable release version is at test-tapor.mcmaster.ca. To get an account for this, write to Lian Yan - lyan (at) mcmaster (dot) ca.
Changed:
<
<
  • The development wiki (as opposed to this one, which is for documentation) is at [[http://tapor.openskysolutions.ca][tapor.openskysolutions.ca] - Note, you will need an account for this.
>
>
  • The development wiki (as opposed to this one, which is for documentation) is at tapor.openskysolutions.ca - Note, you will need an account for this.

-- GeoffreyRockwell - 03 May 2005


 <<O>>  Difference Topic WhatTAPoR (r1.2 - 03 May 2005 - GeoffreyRockwell)

META TOPICPARENT WebHome
Changed:
<
<

What is Text Analysis?

>
>

What is the TAPoR portal?


Changed:
<
<
Based on Electronic Texts and Text Analysis by Geoffrey Rockwell and Ian Lancashire
>
>
The TAPoR Portal is an online environment where users can:

Changed:
<
<

The written word is one of the most important ways we communicate and preserve information. Whether it is legal records, novels, historical records, medical case studies, or now WWW pages, written text is an important form of data to preserve. It is one of the primary means by which we communicate in industry, academia or for pleasure and, as an increasing number of the texts that we care about are created and accessed in electronic form, Canada needs a well-thought-out strategy for preserving those electronic texts of use to future generations. The future understanding of our past and of this age of technological change will be incomplete if we do not take steps to preserve one of the most widely used forms of electronic information - the electronic text.

What is an Electronic Text?
Electronic texts digitally represent oral or written language in a form suitable for analysis with a computer. Typically an electronic text is either an electronic version of a written work, an electronic version of a transcript of an oral event, or a document composed on the computer. In any case the information in an electronic text is meant to be in a natural language that can be read by humans when displayed properly. Some examples of electronic texts would be:

An e-mail message
A medical case study that is stored on a computer

WWW pages that contain a significant amount of text so that to be understood they have to be read
A hypertext manual
An electronic edition of a play with markup and links to images of the original manuscript
A corpus of texts collected for linguistic or lexicographical study like a collection of exemplary texts used in the creation of a dictionary
A business document like a report, proposal, or contract
An interactive CD-ROM dictionary or encyclopedia

An interactive text adventure game where you read passages and make decisions
A collection of legal documents accessible through a retrieval system
A transcript of a series of interviews with embedded interpretative information
A transcript of a court case or administrative proceeding

Electronic texts come in four major forms:

    1. A copy of a work that was originally on paper - a digital representation of a literary, dramatic, or other type of written work that was originally in analogue form.
    2. A work composed on the computer that is stored in that form, but was intended to be printed like a word-processing file or PDF (Portable Document Format) file.
    3. A work composed on a computer that is meant to be accessed on a computer like a WWW page, electronic text database, or hypertext

    4. A transcript of a conversation or other oral event

What can we do with electronic texts?
We can use computers to present, manage, and learn from electronic texts in ways difficult to do by hand. We can archive large quantities of text and make reliable copies of these archives. We can quickly retrieve passages from a large text database of millions of pages. We can ask where two or more words occur within the same paragraph. We can link automatically to other information from a hypertext. We can quantify writing style or try to identify the author of a disputed work by his or her style. We can compare written works or study the evolution of language usage over a collection of texts. In general the process of computer-assisted text analysis uses computers to search, retrieve, manipulate, measure and classify natural-language documents for patterns and by author, subject, and genre or type. Here is a partial list of some of the activities researchers do with electronic texts.

  • Editors and translators add interpretative information to electronic versions of historically important texts to create rich electronic editions for use by other scholars, students or interested readers. Such electronic editions can include modern spellings, commentary, variant translations, references, multimedia supplements and images of the original manuscript all available at a click of a button.

  • Researchers across the humanities and social sciences use electronic text collections to find passages where issues are discussed and to retrieve documents relevant to their questions.

  • Social scientists use text analysis to study interviews, responses to questionnaires, collections of policy documents, or letters. By qualitative analysis they characterize or model the topics, opinions, or psychological traits exhibited in the texts.


  • Linguists add information to texts about language features so that they can study language use. Using these corpora (collections of texts) they write dictionaries, grammars, studies of language change over time, and analyses of language use in different communities.

  • Researchers who are interested in the meaning of words analyze them by their company, that is, by the terms that co-occur or collocate with them, and use statistical techniques. Their research benefits researchers developing automatic translation tools for global commerce.

  • Sociologists, educationalists, and psychologists sometimes analyze one aspect of human behavior, communicative tasks, by studying the utterances of individual speakers for traits such as sentence length, rate of repetition, pauses, questions, negatives, turns, etc.

  • Experts are called upon in court to use combinations of these techniques to establish the authorship of disputed texts. Forensic linguistics is a growing field as an increasing number of the documents that we exchange are electronic so that traditional ways of establishing the author will not work. (There is no fingerprint on an e-mail message only patterns of language use.) In general, a quantitative or qualitative profile of the disputed text is compared to profiles of texts known to have been written by candidate authors.

  • Documenters and usability analysts employ such techniques to improve client manuals and business technical reports, and to help customers to summarize documents.


  • Language instruction researchers use text tools to study language learning problems and develop collections of electronic texts with which to teach languages. They work with linguists to develop text collections with which to train translation systems.

  • Researchers from all areas publish in electronic journals creating more electronic texts for others to study and access.
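
One of the questions mentioned above, asking where two or more words occur within the same paragraph, can be sketched in a few lines of Python. This is only an illustration and not the method of any particular tool: the function name paragraphs_with_all, the sample text, and the assumption that paragraphs are separated by blank lines are all hypothetical.

```python
import re


def paragraphs_with_all(text, words):
    """Return (index, paragraph) pairs for paragraphs containing every word."""
    wanted = {w.lower() for w in words}
    hits = []
    # Assumption: paragraphs are separated by one or more blank lines.
    for index, para in enumerate(re.split(r"\n\s*\n", text.strip())):
        # Lowercase the paragraph and collect its distinct word forms.
        tokens = set(re.findall(r"[a-z']+", para.lower()))
        if wanted <= tokens:
            hits.append((index, para))
    return hits


# Made-up sample text, used only to exercise the function.
sample = """The concordance lists each word with its context.

A scholar can search a text for a word and read the context.

Mainframes ran batch jobs overnight."""

for index, para in paragraphs_with_all(sample, ["word", "context"]):
    print(index, para)
```

The sketch matches whole word forms only, so "words" would not count as a hit for "word". A real tool would likely add stemming or wildcard patterns.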

A brief history of electronic texts and text-analysis tools.
A good way to understand text analysis is to look at the tradition of concordancing from which it evolved. A concordance is a standard study tool where one can look up a word and find references to all the passages in the target work where that word occurs. They are alphabetically-sorted lists of the vocabulary of a text (its different words or phrases). Occurrences of each word (the keyword) appear under a headword, each one surrounded by enough context to make out the meaning, and each one identified by a citation to the text that gives its location in the original.

The first text-analysis tools were designed to create paper concordances. Father Roberto Busa in the late 1940s was one of the first to use computers in the production of concordances with his Index Thomisticus, a project that began by using index cards, moved on to analogue information technology in the 1950s and migrated to the computer. The results were finally published in the 1970s and a CD was released in 1992.

The concordance, however, goes back to the 13th century. Hugh of St. Cher is credited with directing the production of a concordance to the Vulgate bible by brother Dominican monks in Paris. This concordance, supposedly finished by 1247, suffered in that it only had references and not quotations to give a sense of context. Quotations were apparently added later by English Dominicans to a concordance that has not survived. Finally, a concordance attributed to Conrad of Halberstadt improved on the model, leaving us by the end of the 13th century with a concordance that provided some context along with references.

I.1/577.1       | Four nights will quickly dream away the time; | And
I.1/578.2  Swift as a shadow, short as any dream; | Brief as the
II.2/585.1       | Ay me, for pity! what a dream was here! | Lysander,
III.2/591.1   this derision | Shall seem a dream and fruitless vision, |
IV.1/593.1     as the fierce vexation of a dream.| But first I will
IV.1/594.2   to me | That yet we sleep, we dream. Do not you think | The
IV.1/594.2     rare | vision. I have had a dream, past the wit of man to
IV.1/594.2    the wit of man to | say what dream it was: man is but an
IV.1/594.2   he go | about to expound this dream. Methought I was--there
IV.1/594.2    his heart to report, what my dream  | was. I will get Peter
IV.1/594.2     to write a ballad of | this dream: it shall be called
IV.1/594.2     it shall be called Bottom's dream, | because it hath no
V.1/599.1      | Following darkness like a dream, | Now are frolic: not a
V.1/599.2  theme, | No more yielding but a dream, | Gentles, do not

Example of a Key Word In Context display from an interactive concordance of Shakespeare's A Midsummer Night's Dream
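
A Key Word In Context display like the one above can be approximated with a short Python sketch. It is a minimal illustration and not the code behind TACT or any other tool named here: the function name kwic, the width parameter, and the choice to flatten line breaks are assumptions, and the sample reuses two lines quoted in the display.

```python
import re


def kwic(text, keyword, width=30):
    """Return one keyword-in-context line per occurrence of keyword."""
    # Collapse line breaks so the context can run across verse lines.
    flat = " ".join(text.split())
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(flat):
        left = flat[max(0, match.start() - width):match.start()]
        right = flat[match.end():match.end() + width]
        # Right-align the left context so every keyword starts in the same column.
        lines.append(f"{left:>{width}}{match.group(0)}{right}")
    return lines


sample = (
    "Four nights will quickly dream away the time;\n"
    "Swift as a shadow, short as any dream;"
)

for line in kwic(sample, "dream"):
    print(line)
```

Unlike the display above, this sketch omits the act, scene and page citations. Producing them would require a text that records those locations.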

To return to text-analysis tools, the first generation of widely available tools were batch tools that were not interactive, but were designed to produce paper concordances. This can be seen in the names and operations of many of these tools. COCOA stands for COunt and COncordance generation on the Atlas. The Oxford University Computing Service took over COCOA in 1978 and produced OCP (the Oxford Concordance Program).

With the availability and increasing power of microcomputers in the 1980s, text-analysis tools migrated from mainframes to personal computers. OCP developed into Micro-OCP and new programs came out for the personal computer, like the Brigham Young Concordance program (BYC), later renamed and commercialized under the name WordCruncher, and the TACT (Text-Analysis and Concordance Tools) environment developed at the University of Toronto. This shift to the microcomputer changed the nature of our use of the tools in two ways. Scholars could now use tools whenever they wanted on a personal computer instead of having to wait for mainframe time or time on a terminal. This meant that textual scholars were no longer dependent on the paper concordance, but could use the electronic tools in their own place of study. This change in the location of computer-assisted text analysis, along with developments in interface technology, led developers away from a batch concording model towards interactive models that assumed that the scholar would have access to the tools and a collection of e-texts for personal study. It is the access to research electronic texts that we need to ensure through the preservation of research text data.

One of the best known examples of this shift from batch to interactive concording is TACT which was developed at the University of Toronto and is still in wide use today. TACT is not meant for producing a printed concordance but for exploring the electronic text interactively through queries and windowed displays. TACT did not just automate the job of the concordancer, but changed the perspective of the user of the concordance. It offers advanced features suitable to careful text analysis not found in text retrieval systems. The WWW accessible version of TACT called TACTweb has further made it possible for a researcher to share his research textual data over the Internet.

As interactive concording tools became accessible researchers began to ask more complex questions of text databases. Rather than simply asking for a list of locations of a word in the text, researchers began to ask for patterns of words, parts of words, linguistic features and punctuation. Researchers began to add statistical tools that could count and compare features and now we have visualization tools that display graphs that show usage over large quantities of texts. Many of these techniques are being used for business document management systems and basic Internet tools like the search engines we depend on to find WWW pages.

Why is it important to preserve electronic texts for research?

The primary means by which scholarship in the humanities and certain social sciences is transmitted, studied and stored for future use is in the form of written works like books, journals and manuscripts. Philosophers, historians, literary critics, art historians, political scientists and others in the Social Sciences and Humanities use primary sources that are texts and produce new research in the form of written works. An increasing number of these texts are generated on a computer and are therefore originally in electronic form. Further, a significant number of Canadian scholars have created text research resources that can only be studied in electronic form with the appropriate tools. There is now a critical mass of research resources available as electronic texts, and in some cases, only as electronic texts. It is safe to say that a significant number of researchers now need access to well maintained electronic text services in order to conduct research and the majority will in the near future as computing methods and text services spread through the disciplines.

Here are some specific reasons for preserving electronic text:

  1. It is already the case that certain bibliographic databases (those that include a substantial amount of text information) are being used primarily in electronic form. Few scholars use the MLA Index or the Philosopher's Index on paper. Bibliographic research tools such as Iter, which specializes in references having to do with the Middle Ages and Renaissance, are being developed in Canada. Such rich bibliographic resources need to be preserved to provide future scholars access to research literature in their fields.
  2. Major dictionary projects based in Canada like the Dictionary of Old English are producing electronic text databases of language usage for researchers. The DOE has a full-text database of all the significant works written in Old English. There is no organization at present in Canada that can provide access, let alone preserve, this important resource. The project had to look abroad for electronic text delivery support. Further, the dictionary itself, while it will be published on paper, will also exist in electronic form with additional features that will not be available unless preserved in electronic form. The Comparative Lexicography Project of French and English in Canada is likewise creating a database of Canadian texts in English and French and a bilingual Canadian dictionary. The research text resources of projects like these are crucial to preserve for the ongoing study of our languages. We will not know why we speak and write as we do without such systematic resources.

  3. As mentioned above, one of the better known text-analysis tools, TACT, was developed at the University of Toronto. Text and document tools are not only being created at Canadian universities; an important commercial document management environment, Livelink by OpenText, benefited from research into text retrieval at the University of Waterloo. Another company, SoftQuad, produces XMetaL, one of the best SGML/XML editors. Canadian industry and researchers have been at the forefront of the development of tools for the study and retrieval of electronic texts whether for business or academic use. Both development communities benefit from the preservation and dissemination of a variety of electronic texts. For researchers and developers alike, these tools are only as useful as the e-texts available to study with them. An appropriate text service would provide the raw materials for software projects that have had a demonstrated benefit to the Canadian economy.
  4. Canadian researchers are involved in the creation of research quality electronic editions of important works of literature. The Internet Shakespeare Editions, for example, is a project out of the University of Victoria that is creating electronic versions of Shakespeare's works following best practices in the field. The Trésor de la langue française au Québec (TLFQ) at the Université Laval has assembled a corpus of electronic editions of literary texts of importance to the study of French in Québec. These research electronic texts are important not just to Canadian researchers, but also to researchers around the world. TLFQ is typical of the high level of electronic scholarship of Canadian scholars that we need to preserve and make available outside of Canada in order to increase international understanding of our cultures.

  5. Canadian researchers are also developing new forms of electronic research texts that can only be accessed on a computer. The Lyrical Ballads Bicentenary Project is making this work by Wordsworth and Coleridge available in an electronic form that allows for the comparison of versions. This resource can only exist in electronic form. The Performance in Victorian Hamilton project is creating a text database of all records of musical and theatrical performances in Hamilton in order to study entertainment before Cinema. Resources such as these have no analogue on paper and represent new forms of research that can only be preserved in electronic form.
  6. There is a growing number of courses and programmes in the Social Studies and the Humanities that make significant use of electronic texts to train undergraduate and graduate students in the application of computing to research. McMaster University has started an undergraduate Multimedia programme that includes courses on electronic texts and their study. The University of Alberta has announced an M.A. in Humanities Computing that includes core courses in electronic texts. We need access to a well maintained Canadian electronic text service to train future students in electronic literacy. The ability to manage, study, and retrieve textual information in electronic form is an important literacy skill in a world where an increasing amount of information is only available electronically.
  7. Finally, Canadian researchers need a service where they can deposit electronic texts comparable to the national efforts of countries like the United Kingdom, which has set up an Arts and Humanities Data Service. What is at stake is the preservation of our textual heritage. The electronic texts of today will provide rich resources for future study, including works of interest to those who want to study our age and this transition from analogue to digital scholarship.

Appropriate investment in the preservation of electronic texts created by or for researchers will not only keep valuable research resources accessible. It will also provide a model for other domains that have significant investments in electronic texts from the health sciences to the insurance sector. We need to begin to solve the problem of how to archive our exploding electronic record before it overwhelms us. The pursuit of efficient models for the research community will benefit Canada.

Selected Canadian Links:

Dictionary Projects:
Dictionary of Old English: http://www.doe.utoronto.ca/
Waterloo Centre for the Study of the New OED: http://db.uwaterloo.ca/OED/

Dictionnaire de l'Académie française: http://www.chass.utoronto.ca/~wulfric/academie/
Termium: http://www.translationbureau.gc.ca/pwgsc_internet/english/03_tools/03_termium.htm
Comparative Lexicography of French and English in Canada: http://balzac.sti.uottawa.ca/

Electronic Editions:

Internet Shakespeare Editions: http://web.uvic.ca/shakespeare/
TLFQ: http://www.ciral.ulaval.ca/tlfq/
Canadian Poetry Database: http://www.lib.unb.ca/Texts/projects.html
Representative Poetry On-line: http://www.library.utoronto.ca/utel/rp/intro.html
Laboratoire de Français Ancien: http://www.uottawa.ca/academic/arts/lfa/

Web Joyce - Finnegans Web: http://www.trentu.ca/jjoyce/fw.htm
Lyrical Ballads Project: http://www.dal.ca/~etc/lballads/index_std.html
Complete Poems and Letters of E.J. Pratt: http://www.trentu.ca/pratt/
Canadian Poetry: http://www.library.utoronto.ca/canpoetry/
Early Canadiana Online: http://www.canadiana.org/

Other Projects:
The Orlando Project: http://www.ualberta.ca/ORLANDO/
Iter: http://iter.utoronto.ca/iter/index.htm
Performance in Victorian Hamilton: http://cheiron.humanities.mcmaster.ca/~hamperf

University Programmes:

University of Alberta M.A. in Humanities Computing: http://www.arts.ualberta.ca/huco/
McMaster University Multimedia Programme: http://www.humanities.mcmaster.ca/~macmedia

Text Tools:
Open Text Corporation: http://www.opentext.com/
SoftQuad: http://www.softquad.com/

TACT: http://www.chass.utoronto.ca/cch/tact.html
TACTweb: http://tactweb.humanities.mcmaster.ca
SATO 4.0: http://www.ling.uqam.ca/sato/outils/sato.htm
Hyperpo: http://tapor.ualberta.ca/HyperPo/

Text Projects Elsewhere:

Arts and Humanities Data Service: http://www.ahds.ac.uk/
Oxford Text Archives: http://ota.ahds.ac.uk/
University of Virginia Electronic Text Centre: http://etext.lib.virginia.edu/
University of Virginia Institute for Advanced Technology in the Humanities: http://www.iath.virginia.edu/
Project Gutenberg: http://promo.net/pg/

Text Encoding Initiative: http://www.uic.edu/orgs/tei/
Humbul Humanities Hub: http://www.humbul.ac.uk/

 

>
>
  • Keep track of texts they want to study. These texts can be on the web or texts you upload.
  • Learn about different tools and try them.
  • Run tools on texts to analyze them in different ways.

Changed:
<
<
>
>
The Portal also has a news service and other features to support your research.

Changed:
<
<
-- GeoffreyRockwell - 30 Apr 2005
>
>

Tool Developers

The TAPoR Portal allows tool developers to make their tools available to a wide audience as a web service.

Links

  • Current development version from OpenSky Solutions is at dev.openskysolutions.ca This is only for project developers.
  • Current stable release version is at test-tapor.mcmaster.ca. To get an account for this, write to Lian Yan - lyan (at) mcmaster (dot) ca.

  • The development wiki (as opposed to this one, which is for documentation) is at [[http://tapor.openskysolutions.ca][tapor.openskysolutions.ca] - Note, you will need an account for this.

-- GeoffreyRockwell - 03 May 2005



 <<O>>  Difference Topic WhatTAPoR (r1.1 - 30 Apr 2005 - GeoffreyRockwell)
Line: 1 to 1
Added:
>
>
META TOPICPARENT WebHome

What is Text Analysis?

Based on Electronic Texts and Text Analysis by Geoffrey Rockwell and Ian Lancashire

The written word is one of the most important ways we communicate and preserve information. Whether it is legal records, novels, historical records, medical case studies, or now WWW pages, written text is an important form of data to preserve. It is one of the primary means by which we communicate in industry, academia or for pleasure and, as an increasing number of the texts that we care about are created and accessed in electronic form, Canada needs a well-thought-out strategy for preserving those electronic texts of use to future generations. The future understanding of our past and of this age of technological change will be incomplete if we do not take steps to preserve one of the most widely used forms of electronic information - the electronic text.

What is an Electronic Text?
Electronic texts digitally represent oral or written language in a form suitable for analysis with a computer. Typically an electronic text is either an electronic version of a written work, an electronic version of a transcript of an oral event, or a document composed on the computer. In any case the information in an electronic text is meant to be in a natural language that can be read by humans when displayed properly. Some examples of electronic texts would be:

An e-mail message
A medical case study that is stored on a computer

WWW pages that contain a significant amount of text so that to be understood they have to be read
A hypertext manual
An electronic edition of a play with markup and links to images of the original manuscript
A corpus of texts collected for linguistic or lexicographical study like a collection of exemplary texts used in the creation of a dictionary
A business document like a report, proposal, or contract
An interactive CD-ROM dictionary or encyclopedia

An interactive text adventure game where you read passages and make decisions
A collection of legal documents accessible through a retrieval system
A transcript of a series of interviews with embedded interpretative information
A transcript of a court case or administrative proceeding

Electronic texts come in four major forms:

    1. A copy of a work that was originally on paper - a digital representation of a literary, dramatic, or other type of written work that was originally in analogue form.
    2. A work composed on the computer that is stored in that form, but was intended to be printed like a word-processing file or PDF (Portable Document Format) file.
    3. A work composed on a computer that is meant to be accessed on a computer like a WWW page, electronic text database, or hypertext

    4. A transcript of a conversation or other oral event

What can we do with electronic texts?
We can use computers to present, manage, and learn from electronic texts in ways difficult to do by hand. We can archive large quantities of text and make reliable copies of these archives. We can quickly retrieve passages from a large text database of millions of pages. We can ask where two or more words occur within the same paragraph. We can link automatically to other information from a hypertext. We can quantify writing style or try to identify the author of a disputed work by his or her style. We can compare written works or study the evolution of language usage over a collection of texts. In general the process of computer-assisted text analysis uses computers to search, retrieve, manipulate, measure and classify natural-language documents for patterns and by author, subject, and genre or type. Here is a partial list of some of the activities researchers do with electronic texts.

  • Editors and translators add interpretative information to electronic versions of historically important texts to create rich electronic editions for use by other scholars, students or interested readers. Such electronic editions can include modern spellings, commentary, variant translations, references, multimedia supplements and images of the original manuscript all available at a click of a button.

  • Researchers across the humanities and social sciences use electronic text collections to find passages where issues are discussed and to retrieve documents relevant to their questions.

  • Social scientists use text analysis to study interviews, responses to questionnaires, collections of policy documents, or letters. By qualitative analysis they characterize or model the topics, opinions, or psychological traits exhibited in the texts.


  • Linguists add information to texts about language features so that they can study language use. Using these corpora (collections of texts) they write dictionaries, grammars, studies of language change over time, and analyses of language use in different communities.

  • Researchers who are interested in the meaning of words analyze them by their company, that is, by the terms that co-occur or collocate with them, and use statistical techniques. Their research benefits researchers developing automatic translation tools for global commerce.

  • Sociologists, educationalists, and psychologists sometimes analyze one aspect of human behavior, communicative tasks, by studying the utterances of individual speakers for traits such as sentence length, rate of repetition, pauses, questions, negatives, turns, etc.

  • Experts are called upon in court to use combinations of these techniques to establish the authorship of disputed texts. Forensic linguistics is a growing field as an increasing number of the documents that we exchange are electronic so that traditional ways of establishing the author will not work. (There is no fingerprint on an e-mail message only patterns of language use.) In general, a quantitative or qualitative profile of the disputed text is compared to profiles of texts known to have been written by candidate authors.

  • Documenters and usability analysts employ such techniques to improve client manuals and business technical reports, and to help customers to summarize documents.


  • Language instruction researchers use text tools to study language learning problems and develop collections of electronic texts with which to teach languages. They work with linguists to develop text collections with which to train translation systems.

  • Researchers from all areas publish in electronic journals creating more electronic texts for others to study and access.

A brief history of electronic texts and text-analysis tools.
A good way to understand text analysis is to look at the tradition of concordancing from which it evolved. A concordance is a standard study tool where one can look up a word and find references to all the passages in the target work where that word occurs. They are alphabetically-sorted lists of the vocabulary of a text (its different words or phrases). Occurrences of each word (the keyword) appear under a headword, each one surrounded by enough context to make out the meaning, and each one identified by a citation to the text that gives its location in the original.

The first text-analysis tools were designed to create paper concordances. Father Roberto Busa in the late 1940s was one of the first to use computers in the production of concordances with his Index Thomisticus, a project that began by using index cards, moved on to analogue information technology in the 1950s and migrated to the computer. The results were finally published in the 1970s and a CD was released in 1992.

The concordance, however, goes back to the 13th century. Hugh of St. Cher is credited with directing the production of a concordance to the Vulgate bible by brother Dominican monks in Paris. This concordance, supposedly finished by 1247, suffered in that it only had references and not quotations to give a sense of context. Quotations were apparently added later by English Dominicans to a concordance that has not survived. Finally, a concordance attributed to Conrad of Halberstadt improved on the model, leaving us by the end of the 13th century with a concordance that provided some context along with references.

I.1/577.1       | Four nights will quickly dream away the time; | And
I.1/578.2  Swift as a shadow, short as any dream; | Brief as the
II.2/585.1       | Ay me, for pity! what a dream was here! | Lysander,
III.2/591.1   this derision | Shall seem a dream and fruitless vision, |
IV.1/593.1     as the fierce vexation of a dream.| But first I will
IV.1/594.2   to me | That yet we sleep, we dream. Do not you think | The
IV.1/594.2     rare | vision. I have had a dream, past the wit of man to
IV.1/594.2    the wit of man to | say what dream it was: man is but an
IV.1/594.2   he go | about to expound this dream. Methought I was--there
IV.1/594.2    his heart to report, what my dream  | was. I will get Peter
IV.1/594.2     to write a ballad of | this dream: it shall be called
IV.1/594.2     it shall be called Bottom's dream, | because it hath no
V.1/599.1      | Following darkness like a dream, | Now are frolic: not a
V.1/599.2  theme, | No more yielding but a dream, | Gentles, do not

Example of a Key Word In Context display from an interactive concordance of Shakespeare's A Midsummer Night's Dream
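A display like the one above can be approximated in a few lines of code. The following is a minimal sketch, not the program that produced the example: the function name kwic, the column widths and the whole-word matching rule are assumptions made for illustration. The citations and passages are taken from the display above.

```python
import re

def kwic(passages, keyword, width=30):
    """Build Key Word In Context rows from (citation, passage) pairs."""
    # Match the keyword as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    rows = []
    for citation, passage in passages:
        for match in pattern.finditer(passage):
            left = passage[max(0, match.start() - width):match.start()]
            right = passage[match.end():match.end() + width]
            # Right-align the left context so every keyword starts in the same column.
            rows.append(f"{citation:<12} {left:>{width}}{match.group(0)}{right}")
    return rows

passages = [
    ("I.1/577.1", "Four nights will quickly dream away the time;"),
    ("I.1/578.2", "Swift as a shadow, short as any dream;"),
    ("IV.1/594.2", "I have had a dream, past the wit of man to"),
]

for row in kwic(passages, "dream"):
    print(row)
```

Each row carries the three elements of a concordance entry described earlier: the citation, the keyword, and enough surrounding context to make out the meaning.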

To return to text-analysis tools, the first generation of widely available tools consisted of batch tools that were not interactive, but were designed to produce paper concordances. This can be seen in the names and operations of many of these tools. COCOA stands for COunt and COncordance generation on the Atlas. The Oxford University Computing Service took over COCOA in 1978 and produced OCP (the Oxford Concordance Program).

With the availability and increasing power of microcomputers in the 1980s, text-analysis tools migrated from mainframes to personal computers. OCP developed into Micro-OCP, and new programs came out for the personal computer, such as the Brigham Young Concordance program (BYC), later renamed and commercialized as WordCruncher, and the TACT (Text-Analysis and Concordance Tools) environment developed at the University of Toronto. This shift to the microcomputer changed the nature of our use of the tools in two ways. Scholars could now use tools whenever they wanted on a personal computer instead of having to wait for mainframe time or time on a terminal. This meant that textual scholars were no longer dependent on the paper concordance, but could use the electronic tools in their own place of study. This change in the location of computer-assisted text analysis, along with developments in interface technology, led developers away from a batch concording model towards interactive models that assumed that the scholar would have access to the tools and a collection of e-texts for personal study. It is this access to research electronic texts that we need to ensure through the preservation of research text data.

One of the best-known examples of this shift from batch to interactive concording is TACT, which was developed at the University of Toronto and is still in wide use today. TACT is not meant for producing a printed concordance but for exploring the electronic text interactively through queries and windowed displays. TACT did not just automate the job of the concordancer, but changed the perspective of the user of the concordance. It offers advanced features suited to careful text analysis that are not found in text retrieval systems. The WWW-accessible version of TACT, called TACTweb, has further made it possible for researchers to share their research textual data over the Internet.

As interactive concording tools became accessible, researchers began to ask more complex questions of text databases. Rather than simply asking for a list of locations of a word in the text, researchers began to ask for patterns of words, parts of words, linguistic features and punctuation. Researchers began to add statistical tools that could count and compare features, and now we have visualization tools that display graphs showing usage over large quantities of text. Many of these techniques are being used in business document management systems and in basic Internet tools like the search engines we depend on to find WWW pages.
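The move from looking up a single word to querying patterns and counting features can be sketched briefly. This is a minimal illustration, assuming a very simple tokenizer and regular expressions as the pattern language; the function names are hypothetical and the second sample text is invented. It is not how TACT or any other tool named here works internally.

```python
import re
from collections import Counter

def tokenize(text):
    """Split text into lower-case word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())

def find_pattern(text, pattern):
    """Distinct word forms that fully match a regular expression."""
    regex = re.compile(pattern)
    return sorted({token for token in tokenize(text) if regex.fullmatch(token)})

def relative_frequency(text, word):
    """Occurrences of a word divided by the total number of tokens."""
    tokens = tokenize(text)
    return tokens.count(word) / len(tokens) if tokens else 0.0

# First text is quoted from the display above; the second is invented.
text_a = "I have had a dream, past the wit of man to say what dream it was"
text_b = "We sleep, we dream. The dreamer dreams of dreaming."

# A query for part of a word: every form that begins with "dream".
print(find_pattern(text_b, r"dream\w*"))

# A simple count-and-compare statistic across two texts.
for name, text in [("text_a", text_a), ("text_b", text_b)]:
    print(name, relative_frequency(text, "dream"))

# The most frequent words in a text, the raw material for a usage graph.
print(Counter(tokenize(text_a)).most_common(3))
```

Counts like these, computed over many texts, are the kind of data that the visualization tools mentioned above turn into graphs.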

Why is it important to preserve electronic texts for research?

The primary means by which scholarship in the humanities and certain social sciences is transmitted, studied and stored for future use is in the form of written works like books, journals and manuscripts. Philosophers, historians, literary critics, art historians, political scientists and others in the Social Sciences and Humanities use primary sources that are texts and produce new research in the form of written works. An increasing number of these texts are generated on a computer and are therefore originally in electronic form. Further, a significant number of Canadian scholars have created text research resources that can only be studied in electronic form with the appropriate tools. There is now a critical mass of research resources available as electronic texts, and in some cases, only as electronic texts. It is safe to say that a significant number of researchers now need access to well maintained electronic text services in order to conduct research and the majority will in the near future as computing methods and text services spread through the disciplines.

Here are some specific reasons for preserving electronic text:

  1. It is already the case that certain bibliographic databases (those that include a substantial amount of text information) are being used primarily in electronic form. Few scholars use the MLA Index or the Philosopher's Index on paper. Bibliographic research tools such as Iter, which specializes in references having to do with the Middle Ages and Renaissance, are being developed in Canada. Such rich bibliographic resources need to be preserved to provide future scholars access to research literature in their fields.
  2. Major dictionary projects based in Canada like the Dictionary of Old English are producing electronic text databases of language usage for researchers. The DOE has a full-text database of all the significant works written in Old English. There is no organization at present in Canada that can provide access to, let alone preserve, this important resource. The project had to look abroad for electronic text delivery support. Further, the dictionary itself, while it will be published on paper, will also exist in electronic form with additional features that will not be available unless preserved in electronic form. The Comparative Lexicography Project of French and English in Canada is likewise creating a database of Canadian texts in English and French and a bilingual Canadian dictionary. The research text resources of projects like these are crucial to preserve for the ongoing study of our languages. We will not know why we speak and write as we do without such systematic resources.

  3. As mentioned above, one of the better-known text-analysis tools, TACT, was developed at the University of Toronto. Text and document tools are not only being created at Canadian universities; an important commercial document management environment, Livelink by OpenText, benefited from research into text retrieval at the University of Waterloo. Another company, SoftQuad, produces XMetaL, one of the best SGML/XML editors. Canadian industry and researchers have been at the forefront of the development of tools for the study and retrieval of electronic texts, whether for business or academic use. Both development communities benefit from the preservation and dissemination of a variety of electronic texts. For researchers and developers alike, these tools are only as useful as the e-texts available to study with them. An appropriate text service would provide the raw materials for software projects that have had a demonstrated benefit to the Canadian economy.
  4. Canadian researchers are involved in the creation of research quality electronic editions of important works of literature. The Internet Shakespeare Editions, for example, is a project out of the University of Victoria that is creating electronic versions of Shakespeare's works following best practices in the field. The Trésor de la langue française au Québec (TLFQ) at the Université Laval has assembled a corpus of electronic editions of literary texts of importance to the study of French in Québec. These research electronic texts are important not just to Canadian researchers, but also to researchers around the world. TLFQ is typical of the high level of electronic scholarship of Canadian scholars that we need to preserve and make available outside of Canada in order to increase international understanding of our cultures.

  5. Canadian researchers are also developing new forms of electronic research texts that can only be accessed on a computer. The Lyrical Ballads Bicentenary Project is making this work by Wordsworth and Coleridge available in an electronic form that allows for the comparison of versions. This resource can only exist in electronic form. The Performance in Victorian Hamilton project is creating a text database of all records of musical and theatrical performances in Hamilton in order to study entertainment before Cinema. Resources such as these have no analogue on paper and represent new forms of research that can only be preserved in electronic form.
  6. There is a growing number of courses and programmes in the Social Sciences and the Humanities that make significant use of electronic texts to train undergraduate and graduate students in the application of computing to research. McMaster University has started an undergraduate Multimedia programme that includes courses on electronic texts and their study. The University of Alberta has announced an M.A. in Humanities Computing that includes core courses in electronic texts. We need access to a well-maintained Canadian electronic text service to train future students in electronic literacy. The ability to manage, study, and retrieve textual information in electronic form is an important literacy skill in a world where an increasing amount of information is only available electronically.
  7. Finally, Canadian researchers need a service where they can deposit electronic texts, comparable to the national efforts of countries like the United Kingdom, which has set up an Arts and Humanities Data Service. What is at stake is the preservation of our textual heritage. The electronic texts of today will provide rich resources for future study, including works of interest to those who want to study our age and this transition from analogue to digital scholarship.

Appropriate investment in the preservation of electronic texts created by or for researchers will not only keep valuable research resources accessible. It will also provide a model for other domains that have significant investments in electronic texts from the health sciences to the insurance sector. We need to begin to solve the problem of how to archive our exploding electronic record before it overwhelms us. The pursuit of efficient models for the research community will benefit Canada.

Selected Canadian Links:

Dictionary Projects:
Dictionary of Old English: http://www.doe.utoronto.ca/
Waterloo Centre for the Study of the New OED: http://db.uwaterloo.ca/OED/

Dictionnaire de l'Académie française: http://www.chass.utoronto.ca/~wulfric/academie/
Termium: http://www.translationbureau.gc.ca/pwgsc_internet/english/03_tools/03_termium.htm
Comparative Lexicography of French and English in Canada: http://balzac.sti.uottawa.ca/

Electronic Editions:

Internet Shakespeare Editions: http://web.uvic.ca/shakespeare/
TLFQ: http://www.ciral.ulaval.ca/tlfq/
Canadian Poetry Database: http://www.lib.unb.ca/Texts/projects.html
Representative Poetry On-line: http://www.library.utoronto.ca/utel/rp/intro.html
Laboratoire de Français Ancien: http://www.uottawa.ca/academic/arts/lfa/

Web Joyce - Finnegans Web: http://www.trentu.ca/jjoyce/fw.htm
Lyrical Ballads Project: http://www.dal.ca/~etc/lballads/index_std.html
Complete Poems and Letters of E.J. Pratt: http://www.trentu.ca/pratt/
Canadian Poetry: http://www.library.utoronto.ca/canpoetry/
Early Canadiana Online: http://www.canadiana.org/

Other Projects:
The Orlando Project: http://www.ualberta.ca/ORLANDO/
Iter: http://iter.utoronto.ca/iter/index.htm
Performance in Victorian Hamilton: http://cheiron.humanities.mcmaster.ca/~hamperf

University Programmes:

University of Alberta M.A. in Humanities Computing: http://www.arts.ualberta.ca/huco/
McMaster University Multimedia Programme: http://www.humanities.mcmaster.ca/~macmedia

Text Tools:
Open Text Corporation: http://www.opentext.com/
SoftQuad: http://www.softquad.com/

TACT: http://www.chass.utoronto.ca/cch/tact.html
TACTweb: http://tactweb.humanities.mcmaster.ca
SATO 4.0: http://www.ling.uqam.ca/sato/outils/sato.htm
Hyperpo: http://tapor.ualberta.ca/HyperPo/

Text Projects Elsewhere:

Arts and Humanities Data Service: http://www.ahds.ac.uk/
Oxford Text Archive: http://ota.ahds.ac.uk/
University of Virginia Electronic Text Centre: http://etext.lib.virginia.edu/
University of Virginia Institute for Advanced Technology in the Humanities: http://www.iath.virginia.edu/
Project Gutenberg: http://promo.net/pg/

Text Encoding Initiative: http://www.uic.edu/orgs/tei/
Humbul Humanities Hub: http://www.humbul.ac.uk/

 

-- GeoffreyRockwell - 30 Apr 2005


Revision r1.1 - 30 Apr 2005 - 03:49 - GeoffreyRockwell
Revision r1.5 - 29 Apr 2008 - 19:51 - GeoffreyRockwell