Anatomy of Web Intelligence
Below is a preliminary attempt to describe – in not overly technical terms – the anatomy of web intelligence, that is, the major components needed to develop an integrated system for delivering useful and strategic knowledge from the masses of data available on the web. Although some of the relevant parts of web intelligence aren't as neatly discrete as the list below may suggest (and the choice of certain technologies at some points has an impact on the choice of technologies elsewhere), the anatomy below (more or less in sequential order) is a useful starting point for understanding what components are needed:
Data Acquisition or Web Crawling
Many corporate knowledge management solutions are predicated on the idea of a pre-existing document collection that needs to be normalized and parsed exhaustively. A common example of this paradigm is the Google Appliance, which is designed primarily to operate on a defined set of documents (such as those on a hard drive or intranet). In contrast, a web intelligence system needs to constitute its collection of texts itself, based on mechanisms for growing a network of interlinked documents. A simple example would be to take a single HTML document and follow each link in that document to a set of second-level documents, then pursue the crawling operation from the second-level documents to third-level documents, and so on. Even this simple example points to some key challenges of web crawling:
- there's no assurance that each link is of equal value and should be followed
- this single entry-point approach doesn't assure that all potentially valuable documents will be found
- some valuable documents won't be referenced by others in the known collection
- some potentially valuable documents are inaccessible or only visible through database queries (the "deep web")
- some documents may not be HTML and may not contain explicit HTML links (though they may contain other types of references, like bibliographies of print resources)
- many web documents are dynamic and unstable, which suggests the need for both archiving and updating mechanisms
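The link-following procedure described above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library, not a recommendation of any particular crawler: the names `LinkExtractor`, `fetch_html` and `crawl` are hypothetical, the depth limit is an assumed stopping rule, and none of the challenges listed above (link value, non-HTML documents, updating) are addressed.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def fetch_html(url):
    """Download a page as text (hypothetical default fetcher, assumes UTF-8)."""
    with urlopen(url, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl(seed_url, max_depth=2, fetch=fetch_html):
    """Breadth-first crawl from seed_url; returns {url: depth at which it was found}."""
    seen = {seed_url: 0}
    queue = deque([seed_url])
    while queue:
        url = queue.popleft()
        depth = seen[url]
        if depth >= max_depth:
            continue  # record the document but do not follow its links
        try:
            html = fetch(url)
        except Exception:
            continue  # inaccessible documents are simply skipped in this sketch
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" parts.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen[absolute] = depth + 1
                queue.append(absolute)
    return seen
```

Passing the fetcher in as a parameter makes it easy to try the traversal on a small in-memory set of pages before pointing it at live sites. Every link is treated as equally valuable here, which is exactly the first limitation noted in the list.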
Despite these challenges, data acquisition is arguably one of the easier components of web intelligence (largely because of the availability of well-established tools), though in most cases at least minimal relevance filtering will be highly desirable. More sophisticated content filtering begins to involve trickier semantic analysis and possibly data mining techniques.
- readings
- commercial and closed-source software
- open-source software
Data Pre-Processing
Some web crawling packages offer built-in facilities for extracting certain metadata and normalizing documents before parsing, though no single package is likely to offer all the desired extraction functionality. Fortunately, any additional metadata can usually be determined through regular expressions. Parsing multiple data formats (HTML, XML, PDF, Word, PowerPoint, OpenOffice, etc.) can be a challenge, though there are some good tools for automated document conversion (such as PDFBox or PDFTOHTML). An additional challenge is dealing with multiple character encodings.
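As a rough illustration of the last two points, the sketch below decodes a raw document using its declared character set (with fallbacks) and then pulls a few metadata fields out with regular expressions. The function names, the fallback encodings and the specific patterns are all assumptions made for the example; real HTML is irregular enough that these patterns will miss many cases.

```python
import re

# Matches <meta name="..." content="..."> with either quote style.
META_RE = re.compile(
    r'<meta\s+name=(["\'])(?P<name>.*?)\1\s+content=(["\'])(?P<content>.*?)\3',
    re.IGNORECASE | re.DOTALL,
)
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
CHARSET_RE = re.compile(rb'charset=["\']?([\w-]+)', re.IGNORECASE)


def decode_document(raw, fallbacks=("utf-8", "cp1252")):
    """Decode bytes with the declared charset if there is one, then the fallbacks."""
    candidates = []
    declared = CHARSET_RE.search(raw[:2048])  # only look near the top of the file
    if declared:
        candidates.append(declared.group(1).decode("ascii"))
    candidates.extend(fallbacks)
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown encoding, try the next candidate
    return raw.decode("latin-1")  # latin-1 maps every byte, so this cannot fail


def extract_metadata(html):
    """Return a dict of <meta name=...> values plus the page title, if present."""
    metadata = {
        match.group("name").lower(): match.group("content")
        for match in META_RE.finditer(html)
    }
    title = TITLE_RE.search(html)
    if title:
        metadata["title"] = " ".join(title.group(1).split())  # collapse whitespace
    return metadata
```

The order of the fallback encodings is a design choice: UTF-8 is tried first because invalid byte sequences make it fail loudly, whereas single-byte encodings such as cp1252 accept almost anything and would mask a wrong guess.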
Data Parsing and Automated Categorization
Google's search engine is obviously very powerful and useful, but it relies mostly on lexical searches (exact word matches). Other types of data that could be indexed and exploited include:
- morphological variants (singular/plural forms, noun/verb forms)
- semantic similarities ("cat" and "feline")
- metadata (author, language, genre, etc.), when available
Essentially there are two types of parsing that must happen: document tagging (extracting metadata) and token tagging (identifying properties of words or "n-grams" – sequences of tokens). Once documents are acquired through a web crawl, useful information needs to be extracted before superfluous information (like formatting tags) is removed; tokenization (the process of identifying words) and indexing come afterwards. The type of information to be determined and indexed should be driven by use case scenarios: what would users want to do? For instance, would they want to limit their searches to a certain genre, like blogs? Would they want to do proximity searches, which would require storing the position of each token in a document?
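To make the proximity-search scenario concrete, here is a minimal sketch of a positional index: formatting tags are stripped, the text is tokenized, and each token's positions are recorded per document. The function names and the very crude tag-stripping and tokenizing patterns are assumptions for illustration; a real system would more likely rely on one of the toolkits listed below.

```python
import re
from collections import defaultdict

TAG_RE = re.compile(r"<[^>]+>")   # crude removal of formatting tags
TOKEN_RE = re.compile(r"\w+")     # a "word" is any run of letters, digits or _


def tokenize(html):
    """Strip tags, then return the lowercased tokens in document order."""
    text = TAG_RE.sub(" ", html)
    return [token.lower() for token in TOKEN_RE.findall(text)]


def build_index(documents):
    """documents: {doc_id: html}. Returns {token: {doc_id: [positions]}}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, html in documents.items():
        for position, token in enumerate(tokenize(html)):
            index[token][doc_id].append(position)
    return index


def near(index, first, second, max_distance):
    """Ids of documents where the two tokens occur within max_distance tokens."""
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    hits = set()
    for doc_id in first_postings.keys() & second_postings.keys():
        if any(
            abs(p - q) <= max_distance
            for p in first_postings[doc_id]
            for q in second_postings[doc_id]
        ):
            hits.add(doc_id)
    return hits
```

Storing positions makes the index noticeably larger than a plain token-to-document mapping, which is why the decision is best driven by whether users actually need proximity queries.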
Tokenization (finding words) and token categorization (describing words) can be done in multiple ways, again depending on use case scenarios. Below is a disparate list of resources that might be useful:
- readings
- open-source software
- GATE: a general Java-based framework for token categorization
- LingPipe: a set of Java tools for token categorization
- Natural Language Toolkit: a Python-based suite of tools
- WordNet: a database of semantic concepts (with tools in various languages)
- TreeTagger: a language-independent morphological tagger by Helmut Schmid (in C)
Data Mining
Although they have some affinities, it is useful to distinguish between Automated Categorization (part of the section above), which is usually concerned with categorizing lexical and semantic information locally, and data mining, which is concerned with finding relevant features globally (as such, data mining is also a form of querying and analysis). However, it's likely that a web intelligence system will want to perform certain data mining operations as part of the indexing process (and not only as a user-defined query).
Data mining can be done in a supervised or unsupervised manner. Supervised data mining involves manually categorizing a subset of documents and then invoking an algorithm to determine which uncategorized documents share features with the categorized documents. Unsupervised data mining categorizes documents automatically, without manually categorized examples, by grouping documents according to the features they share. Wikipedia has a decent list of software solutions, though it has missed t2k (Text to Knowledge), which is in fact a multi-tiered system.
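The supervised case can be illustrated with a deliberately simple sketch: each category gets a word-frequency profile built from its manually categorized documents, and an uncategorized document is assigned to the category whose profile it most resembles (by cosine similarity). This profile-matching approach is chosen here only because it is short; it is not the method used by t2k or any other package mentioned, and the function names are hypothetical.

```python
import math
import re
from collections import Counter


def features(text):
    """Bag-of-words features: lowercase token counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def train(labeled):
    """labeled: iterable of (text, category). Returns {category: summed counts}."""
    profiles = {}
    for text, category in labeled:
        profiles.setdefault(category, Counter()).update(features(text))
    return profiles


def categorize(text, profiles):
    """Assign text to the category whose profile is most similar to it."""
    vector = features(text)
    return max(profiles, key=lambda category: cosine(vector, profiles[category]))


# Usage: a handful of manually categorized documents, then a new one.
profiles = train([
    ("the cat chased the mouse", "animals"),
    ("dogs and cats are pets", "animals"),
    ("the stock market fell sharply", "finance"),
    ("investors sold shares and bonds", "finance"),
])
label = categorize("shares fell as the market closed", profiles)
```

With raw counts, common words such as "the" carry as much weight as distinctive ones, so a practical system would add stop-word removal or term weighting before trusting the results.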
Data Indexing and Storage
Data Querying and Analysis
Data Summary and Visualization
--
StefanSinclair - 25 Feb 2007