Jockers offers two examples of text analysis that serve as “points of reference” when discussing the “products” or “results” of humanities computing. The first is authorship attribution studies, which Jockers describes as a “sexy sort of literary forensics,” in which text analysis tools help to identify and confirm the authorship of texts, or to find patterns within a body of work that suggest an authorial ‘fingerprint.’ The second example is XML archives and text/corpus searching. Jockers poses the question, somewhat seriously, “What are rich markup and exciting new analysis tools adding to our appreciation of literature?”
Jockers goes on to describe his own XML-aware tools – CaTools – and how they were created to satisfy a specific research need. The tools were built as part of his Irish-American West project at Stanford University, a hypertext corpus of Irish-American writing and related research. They were designed to work with large data collections spanning a long period (1752–2003), and they allow not only a close, ‘micro’ reading of individual texts, but also an overview, or ‘macro’ view, of corpus-wide data. The tool incorporates a simple search interface to query an XML file generated from his MySQL database. The report generated from this file shows a range of standard information – dates, total records, title length, word counts, statistical data and so on – but is unique in grouping the records based on user-selected criteria. For example, the user selects a grouping category (e.g. year, decade, author gender); the tool first sorts the records by that category and then executes its text-analysis functions on each unique group in the set. The tool also plots these results as bar graphs, and Jockers emphasized this graphing ability as the way to gain a macro perspective (a minimal sketch of the group-then-analyze idea follows below).
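To make the grouping step concrete, here is a minimal Python sketch of that group-then-analyze workflow. The field names (decade, title, word_count) and the chosen statistics are assumptions made for illustration; this is not CaTools code and says nothing about how Jockers’ tool is actually implemented.

```python
# Hypothetical sketch of the group-then-analyze workflow described above.
# Field names (decade, title, word_count) and the chosen statistics are
# illustrative assumptions; this is not CaTools code.
from collections import defaultdict
from statistics import mean

def group_records(records, category):
    """Sort records into groups keyed by a user-selected category."""
    groups = defaultdict(list)
    for record in records:
        groups[record[category]].append(record)
    return groups

def summarize(records, category):
    """Run simple text-analysis functions on each unique group."""
    report = {}
    for key, group in sorted(group_records(records, category).items()):
        report[key] = {
            "total_records": len(group),
            "mean_title_length": mean(len(r["title"].split()) for r in group),
            "mean_word_count": mean(r["word_count"] for r in group),
        }
    return report

# Toy corpus: three records grouped by decade.
records = [
    {"decade": "1900s", "title": "Novel A", "word_count": 85000},
    {"decade": "1900s", "title": "Novel B", "word_count": 92000},
    {"decade": "1910s", "title": "Novel C", "word_count": 78000},
]
print(summarize(records, "decade"))
```

The per-group report could then be fed to any plotting routine to produce the kind of bar graphs Jockers described.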
Jockers provided an example to illustrate the importance of the macro view. It had been suggested that Irish-American literature went through a period of ‘cultural amnesia’ from roughly 1900 to 1930 due to socioeconomic factors. The graphical representation of this period in relation to other periods suggests that this is not the case; indeed, the period was one of great productivity.
Jockers also briefly described other uses for his tools: analyzing the functional change in novel titles from early periods to the modern era, or identifying patterns in publication numbers over time. He notes that CaTools allows for a sort of “macro lit-o-nomics” (macro-economics for literature), suggesting that the cyclical patterns that emerge are very similar to economic models of business cycles that rise and fall. He also points out that such patterns are not predictable and can be difficult to tie to historical events such as famine or economic depression.
Jockers closed his presentation with a discussion of the purpose of the Text Analysis Summit. He hopes the summit will result in some “action items”, perhaps a set of best practices and/or standards. While he does not necessarily believe that the Internet is the ideal place for text-analysis applications to be housed (overwhelming amounts of data are difficult to process across the Net), he does think it is probably the most practical place, and one that lets developers collaborate and work towards the creation of good tools. Jockers points to TAPoR as the seemingly ideal space to bring developers together in this endeavour.
The first issue discussed was a comparison of XML and MySQL as data formats, and the strengths and weaknesses of each. While XML affords richer detail and metadata possibilities, processing it is too slow for large online jobs; MySQL’s speed makes it important for online tools. “Data architecture and tools are geared towards convenience and low load times. Maybe we should be reconsidering our commitment to web-based tools: are the current online tools simply not adequate? Should we be indexing everything?”
Several members of the group offered insight into the mindset of tool users. “If the user has to wait longer than 11 seconds for results, the tool will not be successful.” Users waiting longer than this become discouraged, assume the tool is not functioning, and close the browser. Web delivery matters most to users, so allowing users to access these tools via the Net is crucial for developers.
This was followed by a discussion of possible strategies for combining XML and MySQL in online tools that would offer a bit of both worlds. Ideas included preindexing large corpora, tokenizing documents, and using unique XPaths to locate data within documents (a rough sketch of this indexing idea follows below). Another strategy discussed was letting users run the tools on their own data, either by uploading it to a server or by supplying a URL that points to it.
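As a rough illustration of the preindexing strategy, the following Python sketch tokenizes the text of a small XML document and records a simple XPath-style location for each token, so that later lookups need not reparse the XML. The document structure, index layout, and path-building are assumptions for illustration, not a description of any tool mentioned at the summit.

```python
# A minimal sketch of the "preindex and locate via XPath" idea discussed above.
# The index structure and path-building are illustrative assumptions only.
import xml.etree.ElementTree as ET
from collections import defaultdict

def index_document(doc_id, xml_text, index):
    """Tokenize the text of each element and record a simple path to it."""
    root = ET.fromstring(xml_text)

    def walk(elem, path):
        for token in (elem.text or "").split():
            index[token.lower()].append((doc_id, path))
        for i, child in enumerate(elem):
            walk(child, f"{path}/{child.tag}[{i + 1}]")

    walk(root, f"/{root.tag}")

# Build an index over a toy corpus, then look up where a word occurs.
index = defaultdict(list)
index_document("novel-001",
               "<text><p>famine and emigration</p><p>return west</p></text>",
               index)
print(index["famine"])   # [('novel-001', '/text/p[1]')]
```

The token-to-path index could live in a fast store such as MySQL, while the XML itself is only consulted when the full marked-up context is needed.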
While web-based tools offer users convenience and heightened access, standalone tools provide users with more options and features that are difficult and costly to replicate online. Web-based tools – or rather the browsers they are viewed in – offer developers a flexible interface in which a wide scope of intertextual and design possibilities can be realized. Standalone tools offer much more privacy and security, which may matter to users dealing with sensitive or confidential data. Web-based tools should be geared towards low-level uses and should be easy to use. “People are used to using Google; the Net is a comfortable environment for most users. Online tools, then, should be set up in a similar way to search engines.”
However, the largest segment of current tool users is tool developers themselves. Most tools were originally developed to satisfy an individual’s needs. It is difficult to accommodate other people’s needs, and often the complexity involved in creating ‘functions for all’ becomes too much for a single developer. “I originally made the tool for me, but now my personal playground has to grow to include others – now it has to be a theme park.” Interest in collaboration has often been expressed, but it has rarely materialized. Collaboration is difficult because there is no framework or precedent for interoperable collaboration. How can tools be combined?
One possible solution presented was a plug-in model: a universal central structure that would serve as a hub connecting other tools through interoperable pipelines (a rough sketch of this idea follows below). The development of a universal ‘middle ground’ to accommodate a variety of developer needs was also discussed. Another viewpoint held that tool development is a highly idiosyncratic practice, with developers attempting to do new and different things specific to their own research; that idiosyncrasy works against the collaborative ideal. “Rather than trying to achieve a level of utopian collaboration, perhaps we should concentrate on universal APIs and extensive libraries that will allow for easier tool-building for others in the community.”
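For a sense of what an interoperable pipeline might look like, here is a hedged Python sketch in which each ‘tool’ exposes the same minimal run() interface and a hub function chains them together. The Tool protocol and the two toy tools are hypothetical illustrations of the plug-in idea, not a standard agreed on at the summit.

```python
# A hedged sketch of the plug-in / pipeline idea raised in the discussion.
# The Tool protocol, the toy tools, and the pipeline runner are hypothetical.
from typing import Protocol

class Tool(Protocol):
    def run(self, text: str) -> str:
        """Each plug-in accepts text and returns transformed/annotated text."""
        ...

class Lowercase:
    def run(self, text: str) -> str:
        return text.lower()

class StripPunctuation:
    def run(self, text: str) -> str:
        return "".join(ch for ch in text if ch.isalnum() or ch.isspace())

def pipeline(tools: list[Tool], text: str) -> str:
    """The 'hub': pass a text through any sequence of interoperable tools."""
    for tool in tools:
        text = tool.run(text)
    return text

print(pipeline([Lowercase(), StripPunctuation()],
               "Irish-American Writing, 1752-2003!"))
```

The point of such a shared interface is that any developer’s idiosyncratic tool can join the pipeline without the hub needing to know anything about its internals.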