Environmental Scan: Data Mining

To clarify:
Data mining is a process of "nontrivial extraction of implicit, previously unknown, and potentially useful information from data"{{W. Frawley and G. Piatetsky-Shapiro and C. Matheus (Fall 1992). "Knowledge Discovery in Databases: An Overview". AI Magazine: pp. 213-228.}}. Brilliantly expressed, but I am sure this can be put more simply: "Data mining is the entire process of applying computer-based methodology, including new techniques for knowledge discovery, from data."{{Kantardzic, Mehmed (2003). Data Mining: Concepts, Models, Methods, and Algorithms. John Wiley & Sons. ISBN 0471228524.}} One of the crucial aspects of this exercise is to automate entirely a process which in the past would have been carried out by analysts working with tools. Now the tools are paired with an intelligence engine to learn about datasets that are too large for a human to take in. When they work, "They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations." In short, data mining is the process of running data through sophisticated algorithms to uncover meaningful patterns and correlations that may otherwise be hidden.
Not that we are trying to prejudice initial exploration, but the JSR 73 and JSR 247 specifications (the Java Data Mining API, versions 1.0 and 2.0) relate directly to data mining.
Standard data mining operations include:
  • data preprocessing
  • clustering
  • classification
  • regression
  • visualization
  • feature selection
How Data Mining Works
  • Choosing a Model - Analysts can work with a range of models graphically. These include many advanced forms of data mining such as clustering, segmentation, decision trees, random forests, neural nets, and principal component analysis.
  • Adding Data - Value-added features can be added to the data. For example, you can specify thresholds and have the system automatically “bucket” or derive data to create new columns for analysis.
  • Adapting - Each model works to adapt its parameters to attempt a best fit to the sample data. Analysts can let this happen automatically, or manually adjust parameters (depending on the model).
  • Exploratory Data Analysis - I think there is an additional step here, where a cursory examination presents general structural information about the subject dataset.
  • Evaluating - Results can be evaluated by applying the model to historical data to test its predictive power compared to actual results.
  • Perfecting - The cycle of adapting the model until it is optimized is known as “training” the model. Once properly trained, the model will reliably yield the best results for the specific business purpose it is being applied to.
  • Delivering - Output can take a multitude of forms. For example, you might choose to include a simple statement within another application, or output a graphical decision tree that users can navigate.
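
The bucketing, adapting and evaluating steps above can be sketched in a few lines. This is only an illustration, not how Weka or any particular tool does it: the thresholds, the bucket labels, the sample rows and the "most common outcome per bucket" model are all assumptions made up for the example.

```python
from bisect import bisect_right
from collections import Counter, defaultdict

# Hypothetical thresholds: below 500 is "short", 500 to 1999 is "medium",
# 2000 and above is "long".
THRESHOLDS = [500, 2000]
LABELS = ["short", "medium", "long"]


def bucket(value, thresholds=THRESHOLDS, labels=LABELS):
    """Adding Data: derive a nominal column from a numeric one."""
    return labels[bisect_right(thresholds, value)]


def train(rows):
    """Adapting: learn the most common outcome seen in each bucket."""
    counts = defaultdict(Counter)
    for value, outcome in rows:
        counts[bucket(value)][outcome] += 1
    return {b: c.most_common(1)[0][0] for b, c in counts.items()}


def evaluate(model, rows):
    """Evaluating: fraction of historical rows the model predicts correctly."""
    hits = sum(1 for value, outcome in rows if model.get(bucket(value)) == outcome)
    return hits / len(rows)


# Made-up (word_count, discipline) pairs, purely for illustration.
history = [
    (120, "humanities"), (300, "humanities"),
    (900, "social science"), (1500, "social science"),
    (2500, "humanities"), (3100, "humanities"),
    (700, "social science"), (250, "humanities"),
]
training, holdout = history[:6], history[6:]

model = train(training)
score = evaluate(model, holdout)
print(model)
print(score)
```

Keeping the last rows of the historical data out of training is what makes the evaluation meaningful: the model is scored on cases it has not seen.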

Further Exploration

It appears that Weka is the Open Source solution of choice for data mining. It is the foundation for both the Open Source RapidMiner (I think we should give this one a try) and the commercial Pentaho. A variety of plug-ins are available for both that allow each to be tailored to the specific context of the dataset being mined. I have identified a series of additional frameworks and plug-ins that appear to be useful for our experiment with the Globalisation dataset.

The best condensed information on Weka is the Wikipedia article.

Features

  • Multi-relational data mining - It appears that one of the bigger challenges in dealing with datasets using early algorithms is the ability to walk relations in the dataset.
  • Graphical UI - An obvious factor that differentiates flavours, or even versions, of data mining solutions is whether they provide a graphical UI. Why might this be important? Pentaho demonstrates the value of its UI by handling data preprocessing through an interface that displays patterns visually, allowing the user to adapt an existing model to their datastore.

Inputs

The data mining API expects:

  1. a description of the data structure - parameters/attributes
  2. a dataset comprising structured data - in the Weka case, essentially CSV
  3. a model which indicates how parameters are related
  4. a single or series of statistical test(s) to run against the data

Outputs

The dataminer will return:

  1. A table summarizing the statistical results for the dataset
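
To make the inputs and the output concrete, here is a rough sketch that takes a description of the attributes plus CSV data and returns one summary row per attribute. It is not the Weka API: the `schema` format, the column names and the sample rows are assumptions invented for this illustration, and it leaves out the model and the statistical tests.

```python
import csv
import io
import statistics
from collections import Counter


def summarize(csv_text, schema):
    """Return one summary row per attribute.

    schema maps a column name to "numeric" or "nominal" (hypothetical format).
    """
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    table = []
    for name, kind in schema.items():
        values = [row[name] for row in rows]
        if kind == "numeric":
            numbers = [float(v) for v in values]
            table.append({
                "attribute": name, "type": kind, "n": len(numbers),
                "mean": statistics.mean(numbers),
                "min": min(numbers), "max": max(numbers),
            })
        else:
            table.append({
                "attribute": name, "type": kind, "n": len(values),
                "counts": dict(Counter(values)),
            })
    return table


# Made-up sample data, purely for illustration.
SCHEMA = {"word_count": "numeric", "discipline": "nominal"}
SAMPLE = (
    "word_count,discipline\n"
    "100,humanities\n"
    "300,social_science\n"
    "200,humanities\n"
)

summary = summarize(SAMPLE, SCHEMA)
for line in summary:
    print(line)
```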

Scan List

Testing Plan

Our intention is to take the dataset produced by crawling the Globalisation site and run a data mining routine against it to see what happens. I think it would be germane to pose some questions in advance of the analysis, to frame the model applied to the dataset. I believe one of the initial questions concerned the nature of contributions from humanists versus those from social scientists. Differentiating contributions along these lines would be done during the data pre-processing stage.

  1. Need a plan - what do we expect the outcomes of the datamine to be?
  2. In this case, let's make something up - ??
  3. Need to construct an ARFF file which will be a translation of the ARC file generated by Heritrix
  4. The ARFF file has the following components:
    1. Commenting via the '% ' character
    2. Internal name for the dataset: '@relation '
    3. Nominal Attributes:
    4. Numeric Attributes:
    5. Target:
    6. Dataset proper: '@data'
    7. The dataset itself as comma-separated values, one line per case
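
The ARFF layout listed above can be sketched with a small writer. The function name `to_arff`, the relation name, the attribute names and the sample rows are all hypothetical, and the step of pulling values out of the Heritrix ARC file is not shown. Treating the last attribute as the target follows the common Weka default rather than anything in the file format itself.

```python
def to_arff(relation, attributes, rows, comment=None):
    """Build ARFF text.

    attributes is a list of (name, kind) pairs, where kind is either the
    string "numeric" or a list of allowed nominal values.
    """
    lines = []
    if comment:
        lines.append("% " + comment)
    lines.append("@relation " + relation)
    lines.append("")
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            # Nominal attribute: the allowed values go inside braces.
            lines.append("@attribute %s {%s}" % (name, ",".join(kind)))
        else:
            # Numeric attribute.
            lines.append("@attribute %s %s" % (name, kind))
    lines.append("")
    lines.append("@data")
    for row in rows:
        # One line per case, comma separated.
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


# Hypothetical attributes and rows, purely for illustration.
ATTRIBUTES = [
    ("word_count", "numeric"),
    ("link_count", "numeric"),
    ("discipline", ["humanities", "social_science"]),  # target, listed last
]
ROWS = [
    (120, 3, "humanities"),
    (900, 12, "social_science"),
]

arff_text = to_arff("globalisation_crawl", ATTRIBUTES, ROWS,
                    comment="Illustrative sample, not real crawl data")
print(arff_text)
```

Values that contain spaces or commas would need quoting, which this sketch does not handle.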

Test Results

Weka is widely used by open source and commercial applications as the engine for conducting data mining for specialised purposes. Weka provides a richly extensible API that requires specialised development for a particular purpose. A variety of such specialised implementations are available. It is important to note that Weka must know the structure of the data being sent to it for analysis. It cannot accept unstructured data and discover patterns in it. Moreover, it does not carry out untargeted discovery: you must have a sense of what you want to explore and describe this using either Weka's graphical process builder or the Java API. For example, if you were seeking to conduct Bayesian network analysis on a dataset, a series of purpose-built algorithms is available from the University of Waikato. Similarly, Grid Weka is an implementation that provides multi-processor support for Weka.

Sources

  • Open Directory
  • Analytic Bridge - Social Network for Analytic Professionals
  • Command Line tutorial
  • Explorer

%FOOTNOTELIST%

-- ShawnDay - 30 Mar 2008

  • Data_Mining_Business_Cases.gif
