Environmental Scan: Data Mining
To clarify:
Data mining is a process of "nontrivial extraction of implicit, previously unknown, and potentially useful information from data"{{W. Frawley and G. Piatetsky-Shapiro and C. Matheus (Fall 1992). "Knowledge Discovery in Databases: An Overview". AI Magazine: pp. 213-228.}} That is well expressed, but it can be put more simply: "Data mining is the entire process of applying computer-based methodology, including new techniques for knowledge discovery, from data."{{Kantardzic, Mehmed (2003). Data Mining: Concepts, Models, Methods, and Algorithms. John Wiley & Sons. ISBN 0471228524. }}
A crucial aspect of this exercise is to automate entirely a process that in the past may have been carried out by analysts working with tools. The tools are now paired with an intelligence engine that learns about datasets too large for a person to take in. When they work, "They scour databases for hidden patterns, finding predictive information that experts may miss because it lies outside their expectations." {{}} In short, data mining is the process of running data through sophisticated algorithms to uncover meaningful patterns and correlations that may otherwise be hidden.
Without wishing to prejudice the initial exploration, note that the JSR 73 and JSR 247 specifications (the Java Data Mining API, versions 1.0 and 2.0) relate directly to data mining. Standard data mining operations include:
Further Exploration
It appears that Weka is the Open Source solution of choice for data mining. It is the foundation for both the Open Source RapidMiner (I think we should give this one a try) and also for the commercial Pentaho.
A variety of plug-ins is available for both, allowing each to be tailored to the specific context of the dataset being mined. I have identified a series of additional frameworks and plug-ins that appear to be useful for our experiment with the Globalisation dataset.
The best condensed information on Weka is the Wikipedia article on it.
Features
Inputs
The data mining API is looking for:
Outputs
The data miner will return:
Scan List
The following projects have been identified and investigated (comments attached):
Testing Plan
Our intention is to take the dataset produced by crawling the Globalisation site and run a data mining routine against it to see what happens. It may be useful to pose some questions before the analysis, so that they frame the model applied to the dataset. I believe one of the initial questions concerned the nature of contributions from humanists versus those from social scientists. Differentiating contributions along these lines would be done during the data pre-processing stage.
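The pre-processing step described above might look something like the following. This is only a minimal sketch: the record layout, the `department` field, the `discipline_group` label and the two department lists are all hypothetical, since the crawl output format has not been settled.

```python
from collections import Counter

# Hypothetical department lists; the real mapping would come from the project.
HUMANITIES = {"history", "philosophy", "literature"}
SOCIAL_SCIENCES = {"sociology", "economics", "political science"}


def label_discipline(record):
    """Return a copy of the record with a 'discipline_group' label added."""
    department = record.get("department", "").strip().lower()
    if department in HUMANITIES:
        group = "humanities"
    elif department in SOCIAL_SCIENCES:
        group = "social_science"
    else:
        # Keep unclassified records visible rather than dropping them silently.
        group = "unknown"
    return {**record, "discipline_group": group}


def group_counts(records):
    """Count how many records fall into each discipline group."""
    return Counter(label_discipline(r)["discipline_group"] for r in records)


# Invented sample records, for illustration only.
sample = [
    {"author": "A", "department": "History"},
    {"author": "B", "department": "Economics"},
    {"author": "C", "department": "Physics"},
]

if __name__ == "__main__":
    for name, count in sorted(group_counts(sample).items()):
        print(name, count)
```

Labelling each record before mining means the humanist versus social scientist question can later be treated as an ordinary attribute of the dataset rather than something the mining tool has to infer.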
Test Results
Weka is widely used by open source and commercial applications as the engine for conducting data mining for specialised purposes. Weka provides a richly extensible API, but using it for a particular purpose requires specialised development. A variety of such specialised implementations is available.
It is important to note that Weka must know the structure of the data being sent to it for analysis. It cannot accept unstructured data and discover patterns in it. Moreover, it does not carry out untargeted discovery: you must have a sense of what you want to explore and describe this using either Weka's graphical process builder or the Java API. For example, if you were seeking to conduct Bayesian network analysis on a dataset, a series of purpose-built algorithms is available from the University of Waikato. Similarly, an implementation called Grid Weka provides multi-processor support for Weka.
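To illustrate what "knowing the structure" means in practice: Weka's native input format, ARFF, declares every attribute and its type in a header before any data rows appear. The sketch below writes such a file from plain Python without using Weka itself. The relation name, attribute names and sample rows are hypothetical and stand in for whatever the Globalisation crawl actually yields.

```python
def _format_value(value, kind):
    """Format one cell according to its declared attribute type."""
    if value is None:
        return "?"  # ARFF marks a missing value with a question mark
    if isinstance(kind, (list, tuple)):
        # Nominal attribute: the value must be one of the declared labels.
        if value not in kind:
            raise ValueError(f"{value!r} is not one of {list(kind)}")
        return str(value)
    if kind == "string":
        escaped = str(value).replace("\\", "\\\\").replace("'", "\\'")
        return f"'{escaped}'"
    return str(value)


def to_arff(relation, attributes, rows):
    """Render rows as ARFF text.

    attributes is a list of (name, type) pairs, where type is "numeric",
    "string", or a list of the allowed nominal values.
    """
    lines = [f"@relation {relation}", ""]
    for name, kind in attributes:
        if isinstance(kind, (list, tuple)):
            spec = "{" + ",".join(kind) + "}"
        else:
            spec = kind
        lines.append(f"@attribute {name} {spec}")
    lines += ["", "@data"]
    for row in rows:
        if len(row) != len(attributes):
            raise ValueError("row length does not match the declared attributes")
        cells = [_format_value(v, kind) for v, (_, kind) in zip(row, attributes)]
        lines.append(",".join(cells))
    return "\n".join(lines) + "\n"


# Hypothetical structure and rows, for illustration only.
attributes = [
    ("discipline", ["humanities", "social_science"]),
    ("word_count", "numeric"),
    ("title", "string"),
]
rows = [
    ("humanities", 1200, "On globalisation"),
    ("social_science", 950, None),
]

if __name__ == "__main__":
    print(to_arff("contributions", attributes, rows), end="")
```

Every column has to be named and typed in the header, and a nominal column has to list all of its permitted values. This is the sense in which the structure must be decided before Weka sees any data.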
Sources
Open Directory
Analytic Bridge - Social Network for Analytic Professionals
Command Line tutorial
Explorer
%FOOTNOTELIST%
-- ShawnDay - 30 Mar 2008