Mashing Texts: Supporting collections level text analysis

Peter Organisciak, Geoffrey Rockwell, Stan Ruecker, Susan Brown, Stéfan Sinclair

1. Introduction

At the 2005 Summit on Digital Tools for the Humanities, four opportunities for humanities computing tools were identified, including the need for tools for the Exploration of Resources [1]. As a critical mass of evidence useful to humanities research becomes available on the web, researchers need tools for gathering the resources they need to ask questions, assembling and editing the evidence into study collections, and then analyzing those collections. This paper will discuss the Mashing Texts project [2], which has followed a persona usability design process to develop stories and a prototype for how a collections analysis tool might work to support humanities research. In particular, we will:

2. JiTR Demonstration

The JiTR (Just-in-Time Research) prototype is the first pass at realizing the Mashing Texts project. The simple story about JiTR is that it lets you manage collections of digital items and run tools that gather items, clean them, or analyze them. In JiTR you can create collections, add items to collections (manually or with spiders/scrapers), edit the items (automatically or manually), and then pass a collection as a single "text" to a text analysis tool elsewhere. We will demonstrate a rapid research cycle of gathering, editing, and analyzing a collection.

Mashing Texts embraces principles of Internet mashups in conceptualizing a recombinant environment whereby the resultant product is more valuable than the sum of its parts. JiTR is designed so that other tools, such as search engines and spiders, can be plugged into it to generate text collections on demand from within the environment. As part of the project's emphasis on malleability of data, JiTR allows those collections to be organized by various tags and labels. The final part of the demonstration will show some of the analysis tools that JiTR offers.

3. Persona-Scenario Methodology

JiTR is just a working prototype designed to test a design hypothesis about a type of tool that we think is needed. The research in this project has been the extensive design process. If the end purpose is user functionality, why not prototype around stories about users and what they could do? This is the idea behind the personas and scenarios design process, a usability design process that we have used on other projects like the TAPoR portal. Personas are imagined possible users that “act as stand-ins for real users” [3]. They are examples of people who would potentially like to use the system, and they are created to stimulate thinking about people rather than exclusively about concepts.

Once realistic user personas are created, usage scenarios are described, which consider ways users could possibly employ the final product. Scenarios move into specifics, detailing the steps that the user would follow in working with the system. Eventually, the various scenarios are prioritized into primary and secondary uses so that you know what types of tasks the product needs to support. You can also use the scenarios to audit the prototype. In this project we started with two constituencies that we came to call DEEP (Distributed Electronic Editing Platform) and BROAD (which doesn't stand for anything).
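
As a rough illustration of auditing a prototype against scenarios, the sketch below records each scenario as an ordered list of steps and reports the steps that a given set of capabilities does not cover. Everything named here is an assumption made for illustration: the `Scenario` class, the `audit` function, the capability labels, and the example steps (including `assign_editorial_task`) are hypothetical and are not taken from the project's actual scenarios or from JiTR itself.

```python
from dataclasses import dataclass, field

# Hypothetical capability labels, loosely based on the gather/edit/analyze
# cycle described for the prototype. The project does not publish such a list.
PROTOTYPE_CAPABILITIES = {
    "create_collection",
    "gather_items",
    "edit_items",
    "analyze_collection",
}


@dataclass
class Scenario:
    name: str
    constituency: str          # e.g. "DEEP" or "BROAD"
    priority: str              # "primary" or "secondary"
    steps: list[str] = field(default_factory=list)


def audit(scenarios, capabilities):
    """Map each scenario name to the steps not covered by `capabilities`."""
    gaps = {}
    for scenario in scenarios:
        missing = [step for step in scenario.steps if step not in capabilities]
        if missing:
            gaps[scenario.name] = missing
    return gaps


# Invented example scenarios, one per constituency.
scenarios = [
    Scenario(
        name="Web discourse study",
        constituency="BROAD",
        priority="primary",
        steps=["create_collection", "gather_items", "edit_items", "analyze_collection"],
    ),
    Scenario(
        name="DEEP editing workflow",
        constituency="DEEP",
        priority="primary",
        steps=["create_collection", "edit_items", "assign_editorial_task"],
    ),
]

print(audit(scenarios, PROTOTYPE_CAPABILITIES))
```

In this invented example only the editing scenario reports a gap, because its last step has no matching capability. A real audit would use the project's own scenario steps and would likely weigh primary scenarios ahead of secondary ones.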

4. Technical and Political Implications

It is one thing to prototype and test an idea for another tool; it is another to develop a production tool that others can use. We are particularly conscious that a tool like JiTR needs to work within the ecology of tools available. To that end we ran a parallel process in the project to identify the other tools, frameworks, and standards that JiTR needs to work with, so as to develop a viable architectural specification for a production version. This involved identifying the points of articulation between JiTR and other tools. An obvious example is how JiTR should work with repository systems like Fedora. While our prototype has its own MySQL database, a production version should not manage the repository of texts in a collection. Instead it should be able to push texts to and pull texts from a repository, whether it be Fedora or another system. Likewise, JiTR should not include any tools itself, but should have a plug-in architecture for tools ranging from spiders to text mining tools. Our prototype has tools built in, but a production system would have an interface for managing processes.

As with any development project, the design process is partly about deciding what you aren't going to do. We believe that the layer missing between repository tools and text analysis tools is a collaborative research collections management environment. At the end of the paper we will present designs showing how we think a full system could support the research workflow of our two user constituencies. How would an editor develop a workflow for the editing of electronic texts in JiTR?
How would a researcher interested in the discourse on the web about high performance computing gather documents, clean them, and analyze them?

[1] Summit on Digital Tools for the Humanities, http://www.iath.virginia.edu/dtsummit/
[2] Mashing Texts is supported by a Social Science and Humanities Research Council of Canada Research and Development Initiative grant. The project is openly documented at http://tada.mcmaster.ca/Main/MashTexts
[3] Calabria, Tina. "An introduction to personas and how to create them", http://www.steptwo.com.au/papers/kmc_personas/

Old Draft

The constantly growing amount of information online creates problems of retrieval precision for researchers. As more material becomes available, the content most relevant to a researcher becomes obscured. This problem is magnified for humanists, who collect texts from a wide range of sources. What is required is a method to streamline the collection and management of research materials.

The Just-in-Time Research (JiTR) environment addresses this need by maximizing the malleability of digital data. The greater end-user control that results from increased data malleability allows users to filter their searches while managing their texts. In short, JiTR provides tools for processing textual data, resulting in a workflow in which collection is not distinguished from management.

JiTR adopts principles of mashups from online culture. A "mashup" is an artifact created by repurposing and recombining other content in a way that provides additional value in its presentation. JiTR embodies mashup concepts in two chief ways. The first is its approach as a common area for online content from a variety of sources. Rather than keeping textual content at its primary source and simply saving a URL, JiTR works as a sort of research scrapbook, calling the actual content into the service.
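
The difference between bookmarking a URL and keeping a scrapbook can be shown with a small sketch. It assumes a hypothetical `Scrapbook` class that copies a page's text into the collection when the item is added, so the stored text stays available even if the source later changes. The class and method names are invented for illustration and do not describe JiTR's actual implementation; a stand-in dictionary takes the place of a real web fetcher so that the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ScrapbookItem:
    source_url: str  # where the text came from
    content: str     # the text itself, copied in when the item was added


class Scrapbook:
    """Hypothetical collection that stores content, not just links."""

    def __init__(self, fetch: Callable[[str], str]):
        self._fetch = fetch  # any function mapping a URL to its text
        self.items: list[ScrapbookItem] = []

    def add(self, url: str) -> ScrapbookItem:
        # Fetch once, at collection time, and keep the result.
        item = ScrapbookItem(source_url=url, content=self._fetch(url))
        self.items.append(item)
        return item

    def as_text(self) -> str:
        # Present the whole collection as a single text for analysis.
        return "\n\n".join(item.content for item in self.items)


# Stand-in for the web: invented URLs and texts, used so the sketch runs offline.
fake_web = {
    "http://example.org/a": "First page text.",
    "http://example.org/b": "Second page text.",
}

book = Scrapbook(fetch=lambda url: fake_web[url])
book.add("http://example.org/a")
book.add("http://example.org/b")

# The source changes after collection; the scrapbook copy does not.
fake_web["http://example.org/a"] = "Rewritten page."

print(book.as_text())
```

Passing the fetcher in as a parameter is a design choice made for this sketch: it keeps the example self-contained and leaves open whether items arrive through a spider, a scraper, or manual entry.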

In line with the principles of online mashups, JiTR offers functionality that takes advantage of the collected corpus, such as flexible tagging, creating collections that are greater than the sum of their parts. Instead of being sorted through exclusion and duplication, research material in JiTR is intended to be contained within one collection, with filters applied on demand by the user. The relevance of research material is not simply a binary choice of good or bad, but a scale. In JiTR, a user can tag information by importance and content, and recall only the specifically desired sub-groups.

In addition to recombining content with other content, the second way that JiTR adopts mashup concepts is in connecting content to processes. This allows external tools to be accessed from within JiTR for rapid analysis of collections. For example, once the JiTR environment is completed, a linguistic researcher will be able to quickly query processes to collect a corpus of material and then run TAPoR analysis tools on it, all within the environment.

As JiTR progresses, it is developing complex document management, involving editorial workflow. In doing so, a much larger problem must be examined: how, if at all, can a system reconcile the static, traditional demands of an editor with the relative volatility of on-demand content? As this question is considered, the potential link appears to be in the filters. A digital workflow is concerned with what information is hidden from whom. In this respect, it is similar to the act of filtering a large collection based on the needs of the situation.

-- PeterOrganisciak - 23 Oct 2008