XTeXT and the TAPoR Portal
What is XTeXT?
XTeXT is a text search and retrieval/web application platform provided to the TAPoR project by isagn inc. It is based on search technology developed at the University of Waterloo. It is fast, scalable, extensible and under continuous improvement. Read more about XTeXT or XTeXT workshops.
How can you use XTeXT through the portal?
Here is a quick introduction, which assumes you have an appropriate TAPoR account set up. The following steps walk you through creating a repository, adding text to it, and using the XTeXT tools to search the repository and retrieve relevant results for use with other tools.
- Go to myTools
  - click 'Click here to add XTeXT tools'.
- Go to myTexts
  - click 'Add Repository'
    - provide a name and description.
    - click 'Add'
  - click 'Load Text to Repository'.
    - select the repository name.
    - indicate the source of the input text:
      - from a URL.
      - from uploaded file.
      - from typed/pasted input.
      - from your aggregate texts.
    - click 'Add'
- Go to Workbench
  - element names
    - select the repository name from the list.
    - select 'List XML Elements' tool.
    - click 'Use Tool on Source Text'
    - when the Tool Broker panel appears, click submit.
    - result: a list containing the element names and their counts is returned.
  - element retrieval
    - select 'Extract XML elements'.
    - provide an XML element name occurring in the previous results.
    - submit.
    - result: a list of all of the named elements is retrieved.
  - concordance listing
    - still in 'Find in XML element'.
    - provide terms to seek. This can be a word or phrase.
    - optionally, provide span (the number of words to retrieve on either side of the match) and limit (the maximum number of results to return).
    - submit.
    - result: a concordance listing of matches occurring within the selected elements is returned.
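To make the three Workbench results more concrete, here is a minimal Python sketch of what an element list, an element extraction and a concordance listing could look like on a small XML text. It is an illustration only and is not how XTeXT itself works: the sample text and the function names (list_elements, extract_elements, concordance) are hypothetical, and only the span and limit parameters are taken from the steps above.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample text; any well-formed XML would do.
SAMPLE = """<poem>
  <title>The Tyger</title>
  <stanza><line>Tyger Tyger, burning bright,</line>
  <line>In the forests of the night;</line></stanza>
  <stanza><line>What immortal hand or eye,</line>
  <line>Could frame thy fearful symmetry?</line></stanza>
</poem>"""


def list_elements(root):
    """Return element names with their counts (the 'element names' step)."""
    return Counter(el.tag for el in root.iter())


def extract_elements(root, name):
    """Return the text of every element called `name` (the 'element retrieval' step)."""
    return ["".join(el.itertext()).strip() for el in root.iter(name)]


def concordance(root, name, term, span=3, limit=10):
    """Find a word or phrase inside elements called `name`.

    span  -- number of words kept on either side of the match
    limit -- maximum number of results returned
    """
    wanted = term.lower().split()
    hits = []
    if not wanted:
        return hits
    for el in root.iter(name):
        words = re.findall(r"\w+", "".join(el.itertext()))
        lowered = [w.lower() for w in words]
        for i in range(len(words) - len(wanted) + 1):
            if lowered[i:i + len(wanted)] == wanted:
                end = i + len(wanted)
                left = " ".join(words[max(0, i - span):i])
                match = " ".join(words[i:end])
                right = " ".join(words[end:end + span])
                hits.append((left, match, right))
                if len(hits) >= limit:
                    return hits
    return hits


root = ET.fromstring(SAMPLE)
counts = list_elements(root)             # element names and counts
lines = extract_elements(root, "line")   # all <line> elements
hits = concordance(root, "line", "the", span=2, limit=5)

for left, match, right in hits:
    print(f"{left:>12} [{match}] {right}")
```

Each concordance hit is a (left context, match, right context) triple, so raising span widens the context around each match and lowering limit shortens the listing.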
History of XTeXT TAPoR interface
- We demonstrated a working version at the Face of Text conference. We are now working on the full implementation.
Specifications for the interface
- Users can create, delete, edit, and add to repositories. (Will this be for advanced users only? What will be the storage issues? How will the interface for this work?)
- There are XTeXT tools in the TAPoR tools list that can be used on repositories that are public or belong to you. These include:
  - An XML element lister; returns a list of element names and counts.
  - An XML element extractor; returns a list of the matching elements. Elements are specified by XPath or element names.
  - Find tools; return a concordance for a term or phrase in the repository. The search is restricted by XPath or element names.
- These tools are not in place at present:
  - Advanced find
  - A "list the texts" tool that gives you a list of texts in a repository (and the XPath info to restrict a search to just that text)
  - A word list tool that generates a list of words and counts
  - Co-occurrence and collocates (nice, but can be done with two passes)
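As a rough illustration of the word list tool, and of one possible reading of "two passes" for collocates, the Python sketch below first gathers the words around each occurrence of a term (pass one, much like a concordance with a span) and then runs a word list over those contexts (pass two). The function names and sample text are hypothetical, and this is an assumption about how the two passes might be combined, not a description of the planned XTeXT tools.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lower-cased words."""
    return re.findall(r"\w+", text.lower())


def word_list(text):
    """Return each word with its count."""
    return Counter(tokenize(text))


def collocates(text, term, span=2):
    """Count the words found within `span` words of `term`."""
    words = tokenize(text)
    term = term.lower()
    contexts = []
    # Pass 1: gather the context on either side of each match.
    for i, word in enumerate(words):
        if word == term:
            contexts.extend(words[max(0, i - span):i])
            contexts.extend(words[i + 1:i + 1 + span])
    # Pass 2: run the word list over the gathered contexts.
    return word_list(" ".join(contexts))


TEXT = "the cat sat on the mat and the dog sat on the rug"
all_words = word_list(TEXT)
near_sat = collocates(TEXT, "sat", span=1)

print(all_words.most_common(3))
print(near_sat.most_common())
```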
Further neat ideas for later projects
- There might be advanced tools that will be developed later.
- Repositories could be exported and imported
- Comparison tool that compares two subsets of the repository (comparison of relative frequencies) (Later)
- For stuff with date tags like an RSS feed - be able to do a date search and get a distribution graph of some sort. (Later)
- Add data mining and cluster searching facility. Not sure what this would be, but it could be cool. (Later)
- Users will be able to get HTML for a form that they can integrate in another web site that would use any of the XTeXT tools and return results. Thus if I had a small e-text project I could create a repository and then provide a search function at my project site.
- Users should be able to launch a bot that gathers stuff for them.
- Allow XTeXT searching to happen from different sites - so that we can have installations at multiple sites. (????)
- Export and import for basic users without disk space
- Look at indexing RSS feeds and other materials automatically.
- Develop bots that add stuff automatically to XTeXT repositories
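The comparison tool idea above (comparison of relative frequencies between two subsets) could look something like the following Python sketch. It is only an assumption about what such a tool might compute: the function names and the two sample texts are hypothetical, and a real tool would presumably select its subsets from a repository rather than from strings.

```python
import re
from collections import Counter


def relative_frequencies(text):
    """Return each word's share of all the words in the text."""
    words = re.findall(r"\w+", text.lower())
    total = len(words)
    if total == 0:
        return {}
    return {word: count / total for word, count in Counter(words).items()}


def compare(text_a, text_b):
    """List words by how much their relative frequency differs.

    A positive difference means the word is relatively more
    frequent in text_a, a negative one in text_b.
    """
    freq_a = relative_frequencies(text_a)
    freq_b = relative_frequencies(text_b)
    diffs = {
        word: freq_a.get(word, 0.0) - freq_b.get(word, 0.0)
        for word in set(freq_a) | set(freq_b)
    }
    # Largest absolute difference first; ties broken alphabetically.
    return sorted(diffs.items(), key=lambda item: (-abs(item[1]), item[0]))


SUBSET_A = "war and peace and war"
SUBSET_B = "peace and quiet"
result = compare(SUBSET_A, SUBSET_B)

for word, diff in result:
    print(f"{word:>6} {diff:+.3f}")
```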
To do
- Create an XTeXT Tutorial
- Get the public/private model working
- Testing and fixing
- Develop the user manuals and a tutorial on using XTeXT through the portal (shared)
- Develop an installation manual for XTeXT for nodes and train them to install
- Backup of repositories
Done
- Install a version on TAPoR 1 so that it can be adapted to work with the portal
- Install for other nodes a version they can test and play with
- Figure out the interface for managing repositories
- Get XTeXT working as a Web service through the portal. Get the tools working through the portal (even if the repository is a test one).
- Get the managing interface integrated into the portal so we can create, edit, delete and add to repositories.
Notes By Others
-- GeoffreyRockwell - 21 Sep 2005