Matthew Jockers: XML Aware Tools – CaTools

Matthew Jockers
Matthew Jockers has noticed a trend of meta-analysis in the recent works of Humanities Computing researchers, one that focuses on methodologies rather than results. Much of the research and development in the text analysis community centres on the possibilities and future benefits of tools rather than on how these tools have been applied to humanities research. Jockers suggests that we look not only at the possibilities that text analysis tools will facilitate, but also at the users of such tools, in order to advance the design and creation of tools that Humanities scholars will be able to put to good use.

Jockers offers two examples of text-analysis that are used as “points-of-reference” when discussing the “products” or “results” of humanities computing. The first is authorship attribution studies, which Jockers describes as a ‘sexy sort of literary forensics,’ in which text analysis tools help to identify and confirm authorship of texts, or to find patterns within a body of work that suggest an authorial ‘fingerprint.’ The second example is that of XML archives and text/corpus searching. Jockers poses the question, somewhat seriously, “What are rich markup and exciting new analysis tools adding to our appreciation of literature?”

Matthew goes on to describe his own XML aware tools – CaTools – and how the tools were created to satisfy a specific research need. Jockers’ tools were created as part of his Irish-American West project at Stanford University, a hypertext corpus of texts and research on Irish-American writing. They were designed to work with large collections of data spanning a very long time period (1752-2003). Jockers’ tools allow not only a close, ‘micro’ reading of individual texts (data), but also an overview, or ‘macro’ view, of corpus-wide data. The tool incorporates a simple search interface to query an XML file generated from his MySQL database. The report generated from this file shows a range of standard information: dates, total records, title length, word counts, statistical data and so on, but is unique in grouping them based on user-selected criteria. For example, the user selects a grouping category (e.g. year, decade, author-gender) and the tool first sorts the records based on the grouping category and then executes its text-analysis functions on each unique group in the set. The tool also plots these results in bar graphs. Jockers emphasized the tool’s ability to graph these results in order to gain a macro perspective.
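To make the group-then-analyze step concrete, a minimal sketch of this kind of pass over an XML export might look like the following. The file name and the field names “decade” and “title” are illustrative assumptions, not the actual CaTools schema.

    import xml.etree.ElementTree as ET
    from collections import defaultdict

    # Load the XML export; file name and field names are hypothetical.
    tree = ET.parse("corpus_export.xml")
    root = tree.getroot()

    # Group records by a user-selected field, e.g. "decade".
    group_field = "decade"
    groups = defaultdict(list)
    for record in root:  # each child of the root is one record
        key = record.findtext(group_field, default="unknown")
        groups[key].append(record)

    # Run a simple per-group analysis: record count and mean title length in words.
    for key in sorted(groups):
        titles = [r.findtext("title", default="") for r in groups[key]]
        mean_len = sum(len(t.split()) for t in titles) / max(len(titles), 1)
        print(f"{key}: {len(titles)} records, mean title length {mean_len:.1f} words")

The per-group figures could then be handed to any plotting library to produce the kind of bar graphs Jockers described.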

Jockers provided an example to illustrate the importance of the macro view. It had been suggested that Irish American literature went through a period of ‘cultural amnesia’ between 1900 and the 1930s due to socioeconomic factors. The graphical representation of this period in relation to other periods suggests that this is not the case. Indeed, the period was one of great productivity.

Jockers also briefly described other uses for his tools: analyzing the functional change in novel titles from their early stages to modern periods, or identifying patterns in publication numbers over time. Jockers also notes that using CaTools allows for a sort of “macro lit-o-nomics” (macro-economics for literature), suggesting that the cyclical patterns that emerge are very similar to economic models of business cycles, which rise and fall. He also points out that such patterns are not predictable, and can be difficult to tie to historic events such as famine or economic depression.

Matthew closed his presentation with a discussion of the purpose of the Text Analysis Summit. He hopes that this summit will result in the formation of some “action items”, perhaps a set of best practices and/or standards. While Matthew does not necessarily believe that the Internet is the ideal place for text-analysis applications to be housed (as overwhelming amounts of data are difficult to process across the Net), he does think that it is probably the most practical option, and that it provides a perfect place for developers to collaborate and work towards the creation of good tools. Jockers points to TAPoR as the seemingly ideal space to bring developers together in this endeavour.

Breakout Notes:

The first issue discussed was a comparison of XML and MySQL data, and the strengths and weaknesses of both formats. While XML affords more detail and metadata possibilities, processing is too slow for large online jobs. MySQL is important to online tools. “Data architecture and tools are geared towards convenience and low load times. Maybe we should be reconsidering our commitment to web based tools: Are the current online tools simply not adequate? Should we be indexing everything?”

Insight into the mindset of tool-users was offered up by several members of the group. “If the user has to wait longer than 11 seconds for results, the tool will not be successful.” Users waiting longer than this will be discouraged, will believe that the tool is not functioning, and will close the browser. Web delivery is most important to users, and allowing users to access these tools via the Net is crucial to developers.
This was followed by a discussion of possible strategies for combining XML and MySQL for online tools that would offer a bit of both worlds. Ideas included preindexing large corpora, tokenizing documents, and using unique XPaths to locate data within documents. Another strategy discussed was letting users apply the tools to their own data, either by uploading it to a server or by supplying a URL pointing to their uploaded data.
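As a rough illustration of the XPath idea, one could pre-compute a unique positional path for each record in an export and store it in an index, so that later lookups jump straight to the element instead of re-scanning the whole file. The file name, element names, and record ID below are assumptions made for the sake of the example.

    import xml.etree.ElementTree as ET

    tree = ET.parse("corpus_export.xml")
    root = tree.getroot()

    # Build a tiny index mapping record IDs to unique positional paths.
    index = {}
    for pos, record in enumerate(root, start=1):
        rec_id = record.findtext("id", default=str(pos))
        index[rec_id] = f"./{record.tag}[{pos}]"

    # A later lookup uses the stored path rather than iterating the whole tree.
    path = index.get("42")  # hypothetical record ID
    hit = root.find(path) if path else None
    if hit is not None:
        print(ET.tostring(hit, encoding="unicode"))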
While web based tools offer users convenience and heightened access, standalone tools provide users with more options and features that are difficult and costly to replicate online. Web based tools – or rather the browsers that they are viewed in – offer developers a flexible interface in which a wide scope of intertextual and design possibilities can be realized. Standalone tools offer much more privacy and security to users, which may be important if they are dealing with sensitive or confidential data. Web based tools should be geared towards low-level uses, and should be easy to use. “People are used to using Google, the Net is a comfortable environment for most users. Online tools, then, should be set up in a similar way to search engines.”

However, the largest segment of current tool-users is tool developers. Most tools were originally developed to satisfy an individual’s needs. It is difficult to accommodate other people’s needs, and often the complexity involved in creating ‘functions for all’ becomes too much for a single developer. “I originally made the tool for me, but now my personal playground has to grow to include others – now it has to be a theme park.” Interest in collaboration has often been expressed, but it has rarely happened. Collaboration is difficult because there is no framework or precedent for interoperable collaboration. How can tools be combined?

One possible solution presented was a plug-in model built around a universal central structure that would serve as a hub, connecting other tools through interoperable pipelines. The development of a universal ‘middle ground’ to accommodate a variety of developer needs was also discussed. Another viewpoint suggested that tool development is a highly idiosyncratic practice, with developers attempting to do new and different things specific to their own research; this idiosyncrasy works against the collaborative ideal. “Rather than trying to achieve a level of utopian collaboration, perhaps we should concentrate on universal APIs and extensive libraries that will allow for easier tool-building for others in the community.”
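One way to picture the plug-in idea is a shared contract that every tool implements, so that a central hub can chain tools into a pipeline. The sketch below is only an illustration of the general shape, with invented tool names, not a proposal for an actual API.

    from typing import Protocol

    class TextTool(Protocol):
        # Hypothetical plug-in contract: every tool consumes and returns the
        # same record structure, so a hub can chain tools freely.
        def run(self, records: list[dict]) -> list[dict]: ...

    class Tokenizer:
        def run(self, records: list[dict]) -> list[dict]:
            for r in records:
                r["tokens"] = r.get("text", "").split()
            return records

    class WordCounter:
        def run(self, records: list[dict]) -> list[dict]:
            for r in records:
                r["word_count"] = len(r.get("tokens", []))
            return records

    def pipeline(records: list[dict], tools: list[TextTool]) -> list[dict]:
        # The "hub": push the shared data structure through each plug-in in turn.
        for tool in tools:
            records = tool.run(records)
        return records

    result = pipeline([{"text": "The quiet man went west"}], [Tokenizer(), WordCounter()])
    print(result)  # [{'text': ..., 'tokens': [...], 'word_count': 5}]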

3 Responses to “Matthew Jockers: XML Aware Tools – CaTools”

  1. lachance Says:

    Bar graph output can be considered not only as evidence refuting a putative position but also as a point from which to launch further inquiry. In the particular case of the Irish-American corpus, it is not clear from the summary provided here if the data visualized was an absolute count of titles published or whether the visualization also took into account such factors as sales numbers, print run size, column inches of reviews, number of book clubs, etc. The visual representation of data need not follow the dichotomous mode of attribution studies (is or is not author).

    Likewise, users are not uniform. Some users are quite willing to submit a query and _wait_ for a response to be generated. Instead of the search engine model or library catalogue model, take blogging or the seminar model. Some users have drawn upon the asynchronous aspects of blogging (and threaded discussion via news groups in the pre-blog era) to conduct investigations. In this model both questions and answers are open to observation. In a similar manner, the WWW can be used to make available the queries and the responses, all the while the data sets remain protected offline. Yes, this is similar to the mainframe days when a process might be run at non-peak times and a report produced…

    Notwithstanding, this look at the “research chain” raises the question of sharing data sets (or distribution, publication and/or sale). And subsets of datasets …

    There are four categories at play here: user, tool owner, tool developer, observer of the actions of the other three.

  2. mjockers Says:

    A point of clarification regarding CaTools and how it works with data from MySQL. The app does not itself make the queries to MySQL, but instead works with files that are exported out of MySQL (or any other db) into an XML format. The basic format of the file that CaTools requires is one that allows for an unlimited number of siblings within the root element, but they must all share the same element name, and these elements may have an unlimited number of direct children, but there can be no children beyond that level. The structure corresponds to the rows and columns of a data table. Here is the basic structure:

    <root>
      <record>
        <dataField>data</dataField>
        <dataField>data</dataField>
        <dataField>data</dataField>
      </record>
      <record>
        <dataField>data</dataField>
        <dataField>data</dataField>
        <dataField>data</dataField>
      </record>
    </root>

    The program works by first getting the first child of the root (which will always correspond to the first “row” or “record” in the data table) and then getting all of the children of the row as columns or fields.

    After this initial processing the user can select from one or more dropdown lists (depending on the selected tool) and pick a field for categorizing and another for the text analysis.
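    For readers who want to see that traversal spelled out, a minimal sketch of reading this row-and-column format might look like the following (element names taken from the structure above; the file name is hypothetical):

    import xml.etree.ElementTree as ET

    tree = ET.parse("export.xml")
    root = tree.getroot()

    rows = []
    for record in root:  # each first-level child is one row of the table
        row = [field.text or "" for field in record]  # its children are the columns
        rows.append(row)

    # "rows" now mirrors the rows and columns of the original database table;
    # a grouping field and an analysis field can then be chosen by column position.
    print(rows[:3])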

  3. » On Distant Reading and Macroanalysis Matthew L. Jockers Says:

    […] of the toolkit at the inaugural meeting of the Text Analysis Developer’s Alliance. An overview of my project was documented on the TADA blog. Later that summer (2005), I presented a more gener […]
