Anatomy of Web Intelligence

Below is a preliminary attempt to describe, in not overly technical terms, the anatomy of web intelligence: the major components needed to develop an integrated system for delivering useful and strategic knowledge from the masses of data available on the web. Although some of the relevant parts of web intelligence aren't as neatly discrete as the list below may suggest (and the choice of a technology at one point has an impact on the choice of technologies elsewhere), the anatomy below (more or less in sequential order) is a useful starting point for understanding what components are needed:

Data Acquisition or Web Crawling

Many corporate knowledge management solutions are predicated on the idea of a pre-existing document collection that needs to be normalized and parsed exhaustively. A common example of this paradigm is the Google Appliance, which is designed primarily to operate on a defined set of documents (such as those on a hard drive or intranet). In contrast, a web intelligence system needs to constitute its collection of texts itself, based on mechanisms for growing a network of interlinked documents. A simple example would be to take a single HTML document and follow each link in that document to a set of second-level documents, then to pursue the crawling operation from the second-level documents to third-level documents, and so on. Even this simple example points to some key challenges of web crawling:

  • there's no assurance that each link is of equal value and should be followed
  • this single entry-point approach doesn't assure that all potentially valuable documents will be found
    • some valuable documents won't be referenced by others in the known collection
    • some potentially valuable documents are inaccessible or only visible through database queries (the "deep web")
  • some documents may not be HTML and may not contain explicit HTML links (though they may contain other types of references, like bibliographies of print resources)
  • many web documents are dynamic and unstable, which suggests the need for both archiving and updating mechanisms
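
The level-by-level crawl described above can be sketched as a breadth-first traversal. This is a minimal illustration rather than a production crawler: the names `LinkExtractor`, `crawl` and `PAGES` are hypothetical, the `fetch` argument stands in for a real HTTP client, and the in-memory dictionary replaces the live web so the example runs offline. Documents discovered at the maximum depth are recorded but not fetched, and none of the challenges listed above (link value, hidden documents, non-HTML formats, changing pages) are addressed.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, fetch, max_depth=2):
    """Breadth-first crawl from seed_url, following links up to max_depth levels.

    fetch is a caller-supplied function mapping a URL to HTML text, or None
    when the document cannot be retrieved (an assumption of this sketch).
    Returns a dict mapping each discovered URL to its depth.
    """
    depths = {seed_url: 0}
    queue = deque([seed_url])
    while queue:
        url = queue.popleft()
        depth = depths[url]
        if depth >= max_depth:
            continue  # recorded, but its own links are not followed
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop "#fragment" suffixes.
            target, _fragment = urldefrag(urljoin(url, href))
            if target not in depths:
                depths[target] = depth + 1
                queue.append(target)
    return depths


# A tiny in-memory "web" stands in for real HTTP requests.
PAGES = {
    "http://example.org/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.org/a.html": '<a href="/c.html">C</a> <a href="/">home</a>',
    "http://example.org/b.html": "no links here",
    "http://example.org/c.html": '<a href="/d.html">D</a>',
}

found = crawl("http://example.org/", PAGES.get, max_depth=2)
```

Keeping a dictionary of already-discovered URLs prevents the crawler from looping when pages link back to each other, as the "home" link does here.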

Despite these challenges, data acquisition is arguably one of the easier components of web intelligence (largely because of the availability of well-established tools), though in most cases at least minimal relevance filtering will be highly desirable. More sophisticated content filtering begins to involve trickier semantic analysis and possibly data mining techniques.

Data Pre-Processing

Some web crawling packages offer built-in facilities for extracting certain metadata and normalizing documents before parsing, though no single package is likely to offer all the desired extraction functionality. Fortunately, any additional metadata can usually be determined through regular expressions. Parsing multiple data formats (HTML, XML, PDF, Word, PowerPoint, OpenOffice, etc.) can be a challenge, though there are some good tools for automated document conversion (such as PDFBox or pdftohtml). An additional challenge is dealing with multiple character encodings.
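
As a rough illustration of the two points above (regular-expression metadata extraction and mixed character encodings), here is a small sketch. The function names and the list of candidate encodings are assumptions made for the example, and the `<meta>` pattern only handles tags whose `name` attribute comes directly before `content`. Real pages will need more tolerant patterns or a proper HTML parser.

```python
import re

TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)
META_RE = re.compile(
    r'<meta\s+name=["\']([^"\']+)["\']\s+content=["\']([^"\']*)["\']',
    re.IGNORECASE,
)


def decode_bytes(raw, candidates=("utf-8", "cp1252")):
    """Try each candidate encoding in turn; fall back to latin-1, which never fails."""
    for encoding in candidates:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    return raw.decode("latin-1")


def extract_metadata(html):
    """Return a dict of <meta name=... content=...> pairs plus the page title."""
    metadata = {name.lower(): content for name, content in META_RE.findall(html)}
    match = TITLE_RE.search(html)
    if match:
        # Collapse internal whitespace and line breaks in the title.
        metadata["title"] = " ".join(match.group(1).split())
    return metadata


# The byte 0xE9 is "é" in cp1252 but is not valid UTF-8 in this position,
# so the first decoding attempt fails and the second succeeds.
raw = (
    b'<html><head><title>Caf\xe9 notes</title>'
    b'<meta name="Author" content="A. Writer">'
    b'<meta name="language" content="fr"></head><body>text</body></html>'
)
text = decode_bytes(raw)
info = extract_metadata(text)
```

Trying a strict encoding first and falling back to a permissive one is only a heuristic. When a document declares its encoding in an HTTP header or in the markup, that declaration is usually the better starting point.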

Data Parsing and Automated Categorization

Google's search engine is obviously very powerful and useful, but it relies mostly on lexical searches (exact word matches). Other types of data that could be indexed and exploited include:

  • morphological variants (singular/plural forms, noun/verb forms)
  • semantic similarities ("cat" and "feline")
  • metadata (author, language, genre, etc.), when available

Essentially there are two types of parsing that must happen: document tagging (extracting metadata) and token tagging (identifying properties of words or "n-grams", which are sequences of tokens). Once documents are acquired through a web crawl, useful information needs to be extracted before superfluous information (like formatting tags) is removed, prior to tokenization (the process of identifying words) and indexing. The type of information to be determined and indexed should be driven by use case scenarios: what would users want to do? For instance, would they want to limit their searches to a certain genre, like blogs? Would they want to do proximity searches, which would require information about the position of each token in a document to be stored?
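
To make the proximity-search use case concrete, the sketch below strips formatting tags, tokenizes the remaining text, and stores the position of every token so that a "near" query can be answered later. All names (`tokenize`, `build_index`, `near`, `DOCS`) are hypothetical, and the tag-stripping regular expression is a simplification that assumes reasonably clean markup.

```python
import re
from collections import defaultdict

TAG_RE = re.compile(r"<[^>]+>")
TOKEN_RE = re.compile(r"\w+")


def tokenize(html):
    """Strip formatting tags, lowercase, and return the list of word tokens."""
    text = TAG_RE.sub(" ", html)
    return TOKEN_RE.findall(text.lower())


def build_index(documents):
    """Map each token to {doc_id: [positions]} so proximity queries are possible."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, html in documents.items():
        for position, token in enumerate(tokenize(html)):
            index[token][doc_id].append(position)
    return index


def near(index, first, second, window):
    """Return ids of documents where the two tokens occur within `window` positions."""
    hits = set()
    first_postings = index.get(first, {})
    second_postings = index.get(second, {})
    for doc_id in first_postings.keys() & second_postings.keys():
        if any(
            abs(p - q) <= window
            for p in first_postings[doc_id]
            for q in second_postings[doc_id]
        ):
            hits.add(doc_id)
    return hits


DOCS = {
    "d1": "<p>The <b>web</b> crawler feeds the intelligence system.</p>",
    "d2": "<p>Web pages change so often that any intelligence about them goes stale.</p>",
}
index = build_index(DOCS)
close_hits = near(index, "web", "intelligence", window=5)
loose_hits = near(index, "web", "intelligence", window=10)
```

Storing positions makes the index larger than a plain token-to-document mapping, which is why the decision is best driven by whether users will actually need proximity queries.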

Tokenization (finding words) and token categorization (describing words) can be done in multiple ways, again depending on use case scenarios. Below is a disparate list of resources that might be useful:

Data Mining

Although they have some affinities, it is useful to distinguish between Automated Categorization (part of the section above), which is usually concerned with categorizing lexical and semantic information locally, and data mining, which is concerned with finding relevant features globally (as such, data mining is also a form of querying and analysis). However, it's likely that a web intelligence system will want to perform certain data mining operations as part of the indexing process (and not only as a user-defined query).

Data mining can be done in a supervised or unsupervised manner. Supervised data mining involves manually categorizing a subset of documents and then invoking an algorithm to determine which uncategorized documents share features with the categorized documents. Unsupervised data mining groups documents automatically according to features they share, without relying on a manually categorized subset. Wikipedia has a decent list of software solutions, though it has missed t2k (Text to Knowledge), which is in fact a multi-tiered system.
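
The supervised case can be illustrated with a deliberately simple classifier: word counts from a few manually categorized documents are summed per category, and a new document is assigned to the category whose summed vector it most resembles (cosine similarity). This is one possible technique chosen for brevity, not the method of any particular package mentioned here. The category names and example texts are invented for the illustration.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words vector: token -> count."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def train(labelled):
    """Sum the word counts of the manually categorized documents per category."""
    centroids = {}
    for text, label in labelled:
        centroids.setdefault(label, Counter()).update(vectorize(text))
    return centroids


def classify(text, centroids):
    """Assign the category whose summed vector is most similar to the text."""
    vector = vectorize(text)
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))


# Hypothetical hand-labelled training subset.
LABELLED = [
    ("the crawler follows links between web pages", "crawling"),
    ("each link leads the crawler to new pages", "crawling"),
    ("the index stores tokens and their positions", "indexing"),
    ("tokens are stored in the index for search", "indexing"),
]
centroids = train(LABELLED)
label_one = classify("a crawler discovers pages by following a link", centroids)
label_two = classify("tokens in the index", centroids)
```

With only four training documents the result is fragile. A realistic system would need far more labelled data, some weighting of common words, and an evaluation step on documents held back from training.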

Data Indexing and Storage

Data Querying and Analysis

Data Summary and Visualization

-- StefanSinclair - 25 Feb 2007

