
Text Analysis Summit

Notes from the May 2005 Text Analysis Summit at McMaster University. For the web site see TAS Web Site. There is a group blog with better notes by the student note takers and with interventions by participants.

This summit set out to bring together people involved in developing text analysis tools, exploring interface issues, and imagining how collaboration could happen.

Day 1

What could one publish in text analysis

We had an interesting discussion about publications coming out of events like this. Ramsay argued that we no longer have the sorts of sustained papers and responses that there used to be. We are too polite and don't argue in public. Acheson argued for posters. Do we need to publish in print?

What are the big questions?

We thrashed the question of the "killer app" in the sense of the killer tool or the killer application of computing to humanities. What is it that needs to be solved to move text analysis forward for others? Do we need to get our tools better? Do we need to build better tools? Do we actually want them to do it at all?

Who is the audience we are preparing tools for and how would they use them? These issues are tied to academic credibility and whether we can get credit in the humanities for working with tools.

Here are some of the various questions that came up:

  • What would an accessible text analysis environment look like? What would it do and who would use it?
  • How do computing-assisted practices change reading and interpretation? How does access to networked computing change humanities research practices in general?
  • How are people, both researchers and the broader community, currently using computing to study and manipulate texts?
  • What are the text analysis problems emerging in disciplines, especially disciplines outside the humanities like medicine that have large textbases?
  • How do we intersect with disciplines that are looking at text processing outside the humanities like text mining, text indexing, text classification and information visualization?
  • What is text analysis? What is a tool? What questions can text analysis help answer?
  • What are the theoretical assumptions of computer assisted text analysis?
  • What is the history of text analysis?

John Bradley: Building Tools for Humanities Scholarship

John Bradley worked on COGS, TACT, and TACTweb. "When I finished TACT I looked around and asked 'who is going to use it?' and decided that probably no one would." He identified three types of text/tools:

  • Dynamic Text: Tools like TACT that turn an electronic text into an interactive text. He points out that these types of interactive texts are "not obviously in keeping with current trends in critical and textual theory..." (from R.G. Siemens).
  • Hypertextual Editions: Editions that have hypertextual annotations and commentary. There is a link between reading and analysis - the editor creates an edition that maps onto an editorial philosophy and theoretically informed interpretative approach. John quotes Lavagnino to the effect that often readers are defeated by, or uninterested in the editorial apparatus or commentary.
  • Electronic Scholarly Editions: Complex encoded electronic editions where the added richness lies in the encoding.

So what do scholars do? John quotes a study to the effect that scholars Read (and Reread), Network, Take Notes, and Write.

So what tools could they use? Annotation and Notetaking? The Online Chopin Variorum Edition is an example of a preliminary exploration of annotation.

What tools could they use? We need to think about interoperation. Also, browser-based tools are limited - they can't track state, for example. Therefore we should be looking at applications. John talked about data-flow models of text analysis and the limitations of data-flow. He then talked about Eclipse as a possible framework for tools. Eclipse has a plug-in model, it has plug-ins from top to bottom. Plug-ins can extend plug-ins.
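
The plug-in idea can be illustrated with a small sketch. This is not Eclipse's actual API (Eclipse plug-ins are Java bundles that declare extension points); the `PluginRegistry` class and the extension-point names below are hypothetical, chosen only to show how one plug-in can contribute to a host while opening a point that other plug-ins extend.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class PluginRegistry:
    """Hypothetical registry: extension points map to contributed callables."""

    def __init__(self) -> None:
        self._points: Dict[str, List[Callable[[str], str]]] = defaultdict(list)

    def contribute(self, point: str, func: Callable[[str], str]) -> None:
        # A plug-in contributes behaviour to a named extension point.
        self._points[point].append(func)

    def run(self, point: str, text: str) -> str:
        # Apply every contribution registered at this point, in order.
        for func in self._points[point]:
            text = func(text)
        return text


registry = PluginRegistry()


# A "tokenizer" plug-in contributes to the host's "analyze" point and
# opens its own point, "analyze.normalize", that other plug-ins can extend.
def tokenizer(text: str) -> str:
    normalized = registry.run("analyze.normalize", text)
    return " ".join(normalized.split())


registry.contribute("analyze", tokenizer)

# A second plug-in extends the first plug-in rather than the host.
registry.contribute("analyze.normalize", str.lower)

print(registry.run("analyze", "Plug-ins   can EXTEND plug-ins"))
```

The host only knows about named extension points, so new behaviour can be added without touching the host or the plug-ins already installed.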

The problem with Eclipse is that it is for the techies. The solution might be a change in culture where the technical staff are no longer treated as someone you hire, but are co-investigators.

He concludes by asking us to think about frameworks more and about collaboration between programmers and scholars more.

Joanna Dacko and Audrey Carr: TAPoR - Interface

Joanna and Audrey talked about their research into usability for the TAPoR project. The questions they set out to answer were:

  • Why will they use it?
  • What features should the portal support?
  • How should the portal support?

From interviews and surveys they created personas of archetypical users. They showed some layout designs created to match different personas. They walked us through the current portal interface in terms of how it would support the three personas. You can see a research blog Audrey kept on personas here. Likewise Joanna has a research blog on portal skinning.

The full text of their report is posted at TAPoR Interface.

Matt Jockers

Matt talked about what the success stories are. He mentions two - Authorship Attribution and Textual Forensics. Burrows did great work, but no one understands it. Foster is understood, but did he do text analysis?

Matt quickly showed a database of Irish American texts that is at http://www.stanford.edu/group/shl/IAW/index.html. Then he showed how he has created tools that can handle the data in XML format. I take it he exported the data from an SQL database to XML. The tools let him create tables and graphs, among other things.

Matt compared what he does to macroeconomics. While most humanists do something akin to microeconomics, what we can do in text analysis is macro work with large databases. He showed how diachronic graphs can help you think through the links with history.

He closed by suggesting that TAPoR might provide the place where killer applications could emerge.

Breakout Sessions

There were three breakout sessions loosely connected to the three sets of speakers (John, Audrey & Joanna, and Matt)

Frameworks: They talked about frameworks and how Eclipse is a framework for frameworks. They moved on to looking at how scientists deal with very large amounts of information (like the genome). They talked about the role of annotation as one way to step away from a scientific view of the text to the human reaction to the text. Finally they talked about relationships with developers. They commented on using the machine in ways you don't anticipate. How can we tempt scholars to try things they wouldn't think of?

Interface: We looked at the persona of the learner - a learner who is not learning text analysis, but using text tools to learn about other things. Can the portal be useful across disciplines? We agreed that using personas is an iterative process. We discussed the Google interface and whether all interfaces should be simple or whether users can put up with complexity. We talked about Raskin's idea of automatism. Then we talked about tracking user behaviour to modify the interface. Also, should we have regular polls on the interface to find out what people want?

Guided by Research Questions: What might we want to walk away with in three days? Do we want standards or standardization or standard practices? They began the discussion with problems - tradeoffs between things like XML (richness) and SQL (speed), and the web as platform vs application frameworks. They talked about the core requirements of our "clients", especially user interface.

They proposed trying to start a standardization process:

  • Standardized way of setting up a tool
  • Standardization of data
  • Middleware - middle ground that allows disparate tools to interoperate

They talked about the tension between building a tool for a particular project and then trying to generalize it. Humanists get credit for the book, not the tool. Finally they discussed collaboration - how does it happen in the humanities?

At the end we followed up questions from these reports. John Bradley discussed "ease of use" - a violin is responsive, but not easy to use.

Day 2

James Hoover: Software Architecture in N Easy Pieces

James gave us an overview of how software architecture could help us. "Software Arch is the study of boxes and lines and how to reuse them."

What can we reuse? Small things to big things, and Implementation to Design things. Libraries have lots of small implementations. Design patterns are big and can be reused intellectually (not implemented code.) "Components were a failure for everything but user-interface widgets." Horizontal frameworks are big chunks of utility. Application frameworks are different how?

"Don't give this diagram any semantics - it doesn't mean much." Services are what boxes typically are about. An architecture is a set of services connected in different relationships. Refactoring is the process of rethinking the services and relationships. In principle we want loose coupling and high cohesion.

  • Loose coupling: minimize number and strength of connections between services
  • High cohesion: maximize the conceptual unity of things in the service
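
A minimal sketch of the two principles, assuming a toy word-counting service. The class names are hypothetical and not drawn from any tool discussed at the summit.

```python
from typing import List, Protocol


class Tokenizer(Protocol):
    """The only thing the counting service knows about its collaborator."""

    def tokenize(self, text: str) -> List[str]: ...


class WhitespaceTokenizer:
    # High cohesion: everything in this class concerns splitting text.
    def tokenize(self, text: str) -> List[str]:
        return text.lower().split()


class WordCounter:
    # Loose coupling: depends on one narrow method, not a concrete class.
    def __init__(self, tokenizer: Tokenizer) -> None:
        self._tokenizer = tokenizer

    def count(self, text: str, word: str) -> int:
        return self._tokenizer.tokenize(text).count(word.lower())


counter = WordCounter(WhitespaceTokenizer())
print(counter.count("The cat saw the other cat", "the"))
```

Because `WordCounter` relies on a single method, a different tokenizer (say, one that understands markup) could be swapped in without changing the counter.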

"There isn't a lot of meaning in these diagrams. They tell you how to navigate through code." Once you build a lot of them then you see patterns. Reference architectures are:

  • Pipes and Filters
  • Data Flow -
  • Blackboard - standard for AI - throw all data onto a blackboard and everything can talk to it
  • DB Centric - Database centric
  • File Centric - everything is files that move around
  • Plug In - like Eclipse
  • GATE - Gen. Arch. Text Engineering - http://gate.ac.uk
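
As an illustration of the first reference architecture in the list, here is a hedged sketch of pipes and filters using Python generators. The filter names and the `pipeline` helper are assumptions made for the example, not part of any system named above.

```python
from typing import Callable, Iterable, Iterator

Filter = Callable[[Iterable[str]], Iterator[str]]


def lowercase(lines: Iterable[str]) -> Iterator[str]:
    for line in lines:
        yield line.lower()


def drop_blank(lines: Iterable[str]) -> Iterator[str]:
    for line in lines:
        if line.strip():
            yield line


def words(lines: Iterable[str]) -> Iterator[str]:
    for line in lines:
        yield from line.split()


def pipeline(source: Iterable[str], *filters: Filter) -> Iterator[str]:
    # Connect filters so each consumes the output of the previous one.
    stream: Iterable[str] = source
    for f in filters:
        stream = f(stream)
    return iter(stream)


text = ["To be, or not to be", "", "That is the question"]
print(list(pipeline(text, lowercase, drop_blank, words)))
```

Each filter knows nothing about its neighbours beyond the shape of the stream, which is what makes the pieces easy to rearrange and reuse.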

Framework Design is about reusing big architectural chunks.

  • What is common?
  • What is variable?
  • How do you balance the architecture between the two?
  • Identification and factoring into services

Mark Olsen: Coding and Bricolage

See philologic

How well suited are the humanities to large scale projects and collaboration? We have disciplinary differences, theory differences, no money. Tinkering, bricolage, and hacking is one way forward. Mark talked about not having formally trained computer scientists. This may be standard in humanities computing - we tend to take people who understand humanities and teach them the programming.

Humanities computing is usually very applied and evolves out of necessity and actual projects/problems.

The data objects of humanities are very difficult. They have layers of history and complexity. This is why formal training in the humanities is needed, to understand specific problems of texts.

The developer is the user.

The downside of just-in-time trained humanities programmers is that these bright young hackers move on leaving ugly code. The second problem is the funding model. Portability is a problem - we tend to build for a particular project. As a result humanities computing is always reinventing itself. Code organization and architecture are the last things we worry about.

Critical theory matters. Philologic has a theoretical model that comes out of a critical model (size matters, don't write for individual texts.) Anastasia has a completely different model. We have systems that will never be compatible because of their theoretical differences.

A second oddity is institutions. We have legacies - legacy code, legacy projects, and "legacy theme parks".

The third oddity is the TEI - it is a standard that has to be addressed, but it is not suited to hacking and bricolage.

Directions

Avoid temptation. Avoid over-engineering. Avoid "what-ifs". Avoid the myth of permanence. Have short term objectives and be prepared to toss stuff.

  • Web-services are the best place to start - Federated database model with web services. Fast and light tools could be shared across the net. Under the web interactions.
  • Linux and Unix based tools - We tried small tools before but everyone wanted Windows. Now we should be able to revisit this. Shared data is another possibility inside the Unix model.
  • Open-source development - the cost of building open-source infrastructure may be too high. It is hard to make your code really open.

We need a simple TAML - don't over engineer it. We also need the CGI arguments to go to the TAPoR tools. The TAPoR web services model is a great place for us to start and it suits the bricolage humanities model.
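
To make the point about CGI arguments concrete, here is a rough sketch of how a tool request could be expressed as a URL. The endpoint and the parameter names (`text`, `sort`, `limit`) are hypothetical and are not the actual TAPoR tool arguments; only the standard-library URL handling is real.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical endpoint and parameter names, for illustration only.
BASE_URL = "http://example.org/cgi-bin/wordlist"


def tool_url(text_url: str, **options: str) -> str:
    # Encode the text location and tool options as CGI query arguments.
    params = {"text": text_url, **options}
    return BASE_URL + "?" + urlencode(params)


url = tool_url("http://example.org/hamlet.xml", sort="frequency", limit="50")
print(url)

# The receiving tool can recover the arguments from the query string.
print(parse_qs(urlsplit(url).query))
```

A shared list of such argument names would let one tool hand a text and its settings to another with nothing more than a link.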

Bradford G. Nickerson: Towards a Text Analysis Language

Brad started with an analogy with image processing. There is a language that is effective for describing algorithms. The operations of image manipulation can be chained together.

Brad asked about text - what is it?

  • A document, a set of lines of words, a set of acts in a play ...
  • A text is in a natural language and there are many
  • Text is artistic, it can have layout, fonts, notation, embellishment, and aesthetic appeal
  • image -> symbol -> word -> sentence -> context -> meaning -> classification and interpretation

How do we represent the text? Brad talked about how Unix handles text with standard tools like grep and diff. There is a consistent representation - a text is lines of symbols in files. There are nice languages like Perl (Practical Extraction and Report Language) that can do things to text in Unix.
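
The "lines of symbols" representation is simple enough that a grep-like search fits in a few lines. The sketch below is an assumption-laden illustration in Python rather than Perl; the `grep` function is a stand-in written for this example, not the Unix utility itself.

```python
import re
from typing import Iterable, Iterator, Tuple


def grep(pattern: str, lines: Iterable[str]) -> Iterator[Tuple[int, str]]:
    # Yield (line number, line) for each line matching the pattern,
    # roughly what `grep -n` reports.
    regex = re.compile(pattern)
    for number, line in enumerate(lines, start=1):
        if regex.search(line):
            yield number, line


poem = [
    "Shall I compare thee to a summer's day?",
    "Thou art more lovely and more temperate:",
    "Rough winds do shake the darling buds of May,",
]
for number, line in grep(r"\bmore\b", poem):
    print(f"{number}:{line}")
```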

What are the main operations that take place in the text analysis community? (This was drawn from the papers from the Face of Text conference):

  • Enter operations: scanning, OCR, editing, cleaning, tagging, database entry
  • Preprocessing: counting (frequencies), filter, index, sort, graph creation, morph
  • Search and retrieval: multilingual, contents, tag content, KWIC, collocation, phrasal repetends
  • Classification: word semantics (part of speech), authorship attribution, identify irony (metaphor and other literary things), clustering, automated summarization
  • User interaction: visualization tools, geographical context, annotate
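
Two of the operations listed above, counting frequencies and KWIC (keyword in context), can be sketched briefly. This is a simplified illustration with a naive tokenizer, assuming plain English text; real tools handle markup, punctuation, and other languages far more carefully.

```python
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    # Lowercase word tokens; apostrophes are kept inside words.
    return re.findall(r"[a-z']+", text.lower())


def frequencies(tokens: List[str]) -> Counter:
    return Counter(tokens)


def kwic(tokens: List[str], keyword: str, width: int = 2) -> List[str]:
    # One line per occurrence, with `width` words of context on each side.
    lines = []
    for i, token in enumerate(tokens):
        if token == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines


tokens = tokenize("The sea is calm. The sea is wide, and the ship is small.")
print(frequencies(tokens).most_common(3))
for line in kwic(tokens, "sea"):
    print(line)
```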

Can we develop a Text Analysis Language?

  • An object-oriented language where the objects are poems, play, act and so on
  • Attribute = eg. images, annotations, xml tags
  • Object methods - text analysis operations
  • Base classes can define default text analysis objects
  • Language elements - grammar (regular expressions?), keywords
  • Use a compiler compiler to compile to something like Java
  • Representation - directories of files of lines of text? database? images?
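
The object-oriented idea in the list above might look something like the following. This is only a sketch of the concept in an existing language, not the proposed TAL; the class names, the `attributes` field, and the `word_frequencies` method are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TextObject:
    """Base class: default analysis operations shared by all text objects."""

    title: str
    attributes: Dict[str, str] = field(default_factory=dict)  # e.g. tags, notes

    def lines(self) -> List[str]:
        raise NotImplementedError

    def word_frequencies(self) -> Counter:
        # Default operation defined once on the base class.
        return Counter(w.lower() for line in self.lines() for w in line.split())


@dataclass
class Act(TextObject):
    text: List[str] = field(default_factory=list)

    def lines(self) -> List[str]:
        return self.text


@dataclass
class Play(TextObject):
    acts: List[Act] = field(default_factory=list)

    def lines(self) -> List[str]:
        # A play's lines are the lines of its acts, in order.
        return [line for act in self.acts for line in act.lines()]


play = Play(
    title="A Hypothetical Play",
    acts=[
        Act(title="Act 1", text=["Who is there", "Answer me"]),
        Act(title="Act 2", text=["Who goes there"]),
    ],
)
print(play.word_frequencies()["who"])
```

The same method works on a whole play or a single act, which is the appeal of putting default operations on a base class.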

The challenge with a TAL - text analysis language is the user interface.

Discussion

There was discussion about a Text Analysis Language. Have we done it before with Icon and Snobol? Think about Susan Hockey's early books and the discussion of programming for humanists.

Martin talked about the vulnerability of systems running on multiple machines with services.

Steve talked about tight iterations, starting by coding and drawing UML diagrams after the fact. Will agile methodologies work with collaboration between large groups at different locations?

There was a discussion of the need for "verbs" for services and a light results language.

James Hoover talked about doing commonality and variability analysis.

Steve talked about the possibility of pattern languages for text analysis and whether it would not end up being another difficult programming language. Do the primitives in text analysis repeat themselves enough to warrant turning them into data types and operations?

James is a big fan of writing tiny languages. We talked about proce55ing. To develop a language we need to think of what the primitives are, both data types and operations. Things like finding irony would be the difficult problems of text analysis.

Tanya asked if humanists really want an irony detector. We might develop a tool that did it, but humanists wouldn't want it. On the other hand it might be an interesting problem that we would learn from in failure. But this would be a project for text analysis people, not colleagues. Would a mediocre irony detector be worth it? It might be worth it for large scale texts which we can't read ourselves.

This got us into machine learning - rote learning by brute statistics. We want algorithms that tell us something so we can change parts. Machine learning algorithms don't lead to exposition. They don't tell you how they got the results.

To attack irony we need to look at functional linguistics. There is a literature about this - we could learn from this.

Eric proposed a different approach - let's build some tools and play with them. The problem seems to be that we have had tools and people don't seem to have played.

Stan Ruecker: TAPoR á la Carte: Interfaces to Support Experimentation

Stan talked about prospect interfaces. Utility interfaces do only one task like Google. Development interfaces have lots of palettes like Photoshop. Process interfaces set out a set of steps - like wizards.

Process interfaces rarely show prospect. A lot of text analysis interfaces are process interfaces like TAPoRware. What we want is a cookbook with recipes with more than one procedure. What is the food we want and what are the processes to get there. "You bring ingredients and processes and you get a result." Each recipe should address:

  • What is the larger purpose
  • What do I start with at each step
  • What do I need to do manually
  • What tools do I use
  • What do I get out of each step

For example: 1) use the googlizer to web scrape, 2) use the word frequency to get list of interesting words, and 3) use collocations to get words that collocate that are near your keywords. This would be a higher level goal oriented task that can be accomplished with utility processes.
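
A recipe like this can be read as a small chain of utility steps. The sketch below is a hedged illustration of steps 2 and 3 only; the web scraping step is replaced by a literal string, and the function names, stopword list, and window size are assumptions rather than the behaviour of the TAPoR tools.

```python
import re
from collections import Counter
from typing import List, Set

STOPWORDS: Set[str] = {"the", "a", "of", "and", "is", "in", "to"}


def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z]+", text.lower())


def interesting_words(tokens: List[str], top: int) -> List[str]:
    # Step 2: most frequent words once common function words are removed.
    counts = Counter(t for t in tokens if t not in STOPWORDS)
    return [word for word, _ in counts.most_common(top)]


def collocates(tokens: List[str], keyword: str, window: int = 2) -> Counter:
    # Step 3: count words appearing within `window` tokens of the keyword.
    found: Counter = Counter()
    for i, token in enumerate(tokens):
        if token == keyword:
            nearby = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            found.update(t for t in nearby if t not in STOPWORDS)
    return found


# Step 1 (gathering the text) is replaced here by a literal string.
text = "The whale is white. The white whale swims in the sea. The sea is cold."
tokens = tokenize(text)
for word in interesting_words(tokens, top=2):
    print(word, dict(collocates(tokens, word)))
```

Each step produces exactly what the next one starts with, which is the information a recipe needs to spell out for the reader.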

How would these recipes be presented to people through the portal?

Gary Shawver: Some Thoughts on Interface Design

It is useful to repeat what we know to make sure we know it.

  • There are no natural or intuitive interfaces - the pursuit of a "natural" interface is misguided
  • Some interfaces are consistent and some are easier to learn

Gary talked about some standard interface guidelines. He put up guidelines for Apple, Gnome, and Microsoft. Some of the principles are:

Reflect User's Mental Model, Metaphors, Direct Manipulation, User Control, WYSIWYG, Consistency, Feedback, Modelessness, Perceived Stability, and Forgiveness

Do users have a model for what they want to do? Are there metaphors that would work? Could we use direct manipulation in text analysis tools? Features should not be hidden.

Raskin: the user's data is sacred and the programmer is a sinner who needs to be forgiven.

The TAPoR portal doesn't save the user's last state. Is the portal modeless? As much as possible let users do what they want with changing things. Gary recommended that we be able to go back to another page to do work while something else is happening.

Gnome has a principle of designing for people - make it accessible.

How does the Fixed Phrase tool fit? He talked about the interface issues to the tool. He also showed 10 X 10 as an example tool. He showed a hyperPoet adaptation to this.

Day 3

The morning session was about collaboration. I presented so my notes aren't very good. Check the blog.

Geoffrey Rockwell: Open (about) Collaboration

I presented a hermeneutical theory of text analysis and the place of automated techniques in humanities research. Mark Olsen and I had an interesting exchange as he feels a social text approach is better than a hermeneutical one.

Steve Ramsay: On Collaboration

Steve talked about the disincentives for collaboration. In English Studies we have created a culture that doesn't know how to collaborate, doesn't like it, and doesn't do it.

However, in humanities computing centres like IATH where Steve was a senior programmer, there was genuine collaboration because of the need for teams in order to develop complex projects. HC projects demand real collaboration. Why is that? What does it mean for the humanities to shift to team-based research?

Collaboration is happening locally - why isn't it happening between groups? Can we think about collaboration as a raw technical issue and solve the between groups problem? Is it harder to do between distant groups?

Steve told the story about Algol and the development of BNF language. BNF was designed as a protocol for talking about designing languages. He pointed out that as we go down in software development from general to concrete code we find less collaboration.

  • Languages to talk about languages - we see collaboration
  • Protocols - collaboration
  • Highly modular systems - collaboration
  • Mouse driver - you won't see collaboration

Steve went on to note that work in HC is messy. Each of us wants to do something different - not dissolve ourselves into each other. We need to collaborate while still getting credit.

James Chartrand: Collaboration at the Code Level

James talked about management of software projects. One thing that is overlooked is the need for leadership. In the private sector everyone knows that you need all the roles from project manager to product marketer. In academic HC we don't put in place the leadership or the roles.

What a project manager does is to establish a process for an individual project. What should come out is good code, well documented, appropriate for the client, and well marketed.

The problem with students is that they do work like assignments. They go off and do something without talking to anyone and then hand it back and think they are done.

Often the academic is not the right person to guide and lead the project. A related issue is that the academic doesn't have the time.

In the academic world the leader is not respected - you don't get credit for managing a process.

In CS there is a different culture. Everyone is in one or more teams. There is a constant heartbeat of meetings. It is often the senior grad students who manage code development.

It's different when there isn't money on the line.

What we need are professional project managers. We also need to define a process. There isn't one proper process - different projects need different processes for collaborating.

James ended by proposing the Apache project as an exemplar. Apache has a tight process.

Discussion

In my breakout group we listed a number of possible collaboration opportunities and then in the larger group voted on them.

Steve talked about how his breakout group moved into questions about disciplinarity. There are now faculty positions in HC. We are now training students formally, while before we had to learn it on our own. They talked about the double burden of HC where you have to be both a technologist and a humanist. You end up being accused of being bad at both. To collaborate do you need to know everything? One has to become comfortable with ignorance. They then talked about presenting to colleagues. They talked about the total inability of colleagues to understand collaboration, which isn't true in other disciplines - can we leverage the collaborative skills of other disciplines - let them lead us?

James' group talked about knowledge of following a process. "If we knew what we were doing it wouldn't be research." CS can probably give sound advice on process if you care to ask. They talked about the importance of people. There needs to be a visionary, but how do others get recognition? Finally they talked about the need for peer review of tool development to legitimize it.

At the end Stéfan thanked us and led us in discussion of next steps, especially the opportunity to create an alliance - a collaboration. Here are some random notes I took on the discussion.

You should instrument your tools to know how people are using them and why.
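
One lightweight way to instrument a tool is to wrap it so that each call is recorded. The sketch below is an assumption about how this could be done, with a hypothetical `instrumented` decorator and an in-memory list standing in for a real log. It captures how a tool is used; why people use it still has to come from asking them.

```python
import functools
import time
from typing import Any, Callable, Dict, List

usage_log: List[Dict[str, Any]] = []  # stand-in for a real log store


def instrumented(tool: Callable[..., Any]) -> Callable[..., Any]:
    """Record each call to a tool: its name, duration, and options used."""

    @functools.wraps(tool)
    def wrapper(*args: Any, **kwargs: Any) -> Any:
        start = time.perf_counter()
        try:
            return tool(*args, **kwargs)
        finally:
            usage_log.append({
                "tool": tool.__name__,
                "seconds": time.perf_counter() - start,
                "options": dict(kwargs),
            })

    return wrapper


@instrumented
def word_count(text: str, *, lowercase: bool = True) -> int:
    return len((text.lower() if lowercase else text).split())


word_count("How are people using the tools", lowercase=False)
print(usage_log[-1]["tool"], usage_log[-1]["options"])
```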

Some other ideas about what needs to be done:

We need to think about how TA can be used for learning. Analyze This?

Create a working group to start looking for funds. Monthly conference calls.

Let's create TADA (we did). What is there to be gained by creating a new structure? What could it provide that other organizations don't?

We then had a discussion about structure. The idea would be to have low organizational structure - keep a developers' alliance.

We need an image and to brand ourselves.

-- GeoffreyRockwell - 09 May 2005

