Large-scale data analysis and predictive modeling in data mining
Google has been doing ‘it’ for more than 10 years, Amazon has adopted it, and we are starting now. I am talking about data analysis with data mining mechanisms. My first ‘fault’ is to speak of data mining only; calling ‘it’ knowledge discovery in databases (KDD) would be more correct, I guess. Data mining describes only the analysis step, and a more comprehensive process has to run before any data mining activity can start. Before I describe the KDD process, I would also like to mention that there are different categories of data mining, each of which stands for a different algorithm. So we have to be careful when discussing data mining and KDD, so that we do not lose sight of the wood for the trees. The picture below illustrates the complete KDD process.
The starting point is a large quantity of data. First, we must select the attributes we will use for our analysis. This step is mandatory for all mining processes; what differs is how the attributes are selected. I will discuss this in a later post. After data selection, a preprocessing step transforms the data into a form suitable for analysis. This step is usually done with the help of an ETL (Extract, Transform, Load) tool. Both steps help to reduce and structure the vast amount of data.
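To make selection and preprocessing a bit more concrete, here is a minimal sketch in plain Python. It is not how an ETL tool works internally; the record layout, the field names and the choice of min-max scaling are all my own assumptions for illustration.

```python
# Hypothetical raw records; all field names and values are invented.
raw_records = [
    {"id": 1, "temperature": "90.5", "speed": "120", "driver": "A", "comment": "ok"},
    {"id": 2, "temperature": "", "speed": "80", "driver": "B", "comment": ""},
    {"id": 3, "temperature": "110.0", "speed": "160", "driver": "A", "comment": "hot"},
]

# Step 1 (selection): the attributes we decide to analyse (an assumption).
SELECTED = ["temperature", "speed"]


def select(records, attributes):
    """Keep only the attributes chosen for the analysis."""
    return [{name: record[name] for name in attributes} for record in records]


def preprocess(records):
    """Drop incomplete rows, convert values to float, scale each attribute to [0, 1]."""
    complete = [r for r in records if all(value != "" for value in r.values())]
    numeric = [{k: float(v) for k, v in r.items()} for r in complete]
    if not numeric:
        return []
    scaled = [{} for _ in numeric]
    for key in numeric[0]:
        values = [r[key] for r in numeric]
        low, high = min(values), max(values)
        span = high - low
        for target, source in zip(scaled, numeric):
            # Min-max scaling; a constant column is mapped to 0.0.
            target[key] = (source[key] - low) / span if span else 0.0
    return scaled


# Step 2 (preprocessing) runs on the selected attributes only.
prepared = preprocess(select(raw_records, SELECTED))
print(prepared)
```

Both functions reduce the data: selection drops columns that are not needed, and preprocessing drops the incomplete row and brings the rest into one numeric shape.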
After that, data mining itself can start. Data mining is the process of detecting new correlations, patterns, and rules by sifting through the preprocessed data with suitable algorithms. The illustration below shows the different principles of data mining.
There are two main principles in data mining: directed and undirected data mining. Each principle produces a different type of result. The directed method is based on machine learning. From sample data that represents the desired scenario (e.g. a spark plug falls out), the system learns the classification patterns. Those patterns are then used to pick out matching data and to forecast expected events. Undirected data mining delivers associations between attributes across multiple dimensions (e.g. temperature, speed, pace, acceleration, …). Statistical methods help to bring the dependencies between the dimensions to light.
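The difference between the two principles can be sketched in a few lines of plain Python. Both techniques are my own picks for illustration, not what any particular tool does: a nearest-centroid classifier stands in for the directed method, and the Pearson correlation coefficient stands in for the statistical methods of the undirected one. All data and names are invented.

```python
from itertools import combinations
from math import sqrt

# --- Directed: learn patterns from labelled samples, then classify new data ---
# Invented samples: (temperature, vibration) -> label.
training = [
    ((70.0, 0.2), "normal"),
    ((75.0, 0.3), "normal"),
    ((72.0, 0.1), "normal"),
    ((110.0, 0.9), "failure"),
    ((115.0, 1.1), "failure"),
    ((105.0, 1.0), "failure"),
]


def learn_centroids(samples):
    """Average the feature vectors of each label (the learned 'pattern')."""
    sums, counts = {}, {}
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: tuple(v / counts[label] for v in acc) for label, acc in sums.items()}


def classify(centroids, features):
    """Return the label whose centroid is closest (Euclidean distance)."""
    def distance(centroid):
        return sqrt(sum((a - b) ** 2 for a, b in zip(features, centroid)))
    return min(centroids, key=lambda label: distance(centroids[label]))


centroids = learn_centroids(training)
# Features are left unscaled here, so temperature dominates the distance.
prediction = classify(centroids, (108.0, 0.8))
print("directed:", prediction)

# --- Undirected: no labels, look for dependencies between dimensions ---
measurements = {
    "temperature": [70.0, 75.0, 72.0, 110.0, 115.0, 105.0],
    "speed": [60.0, 65.0, 62.0, 120.0, 130.0, 110.0],
    "acceleration": [1.2, 0.4, 2.0, 0.9, 1.5, 0.7],
}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


associations = {
    (a, b): pearson(measurements[a], measurements[b])
    for a, b in combinations(measurements, 2)
}
for pair, r in associations.items():
    print("undirected:", pair, round(r, 2))
```

The directed part needs a label for every sample and answers a question we asked in advance. The undirected part gets no labels at all and only reports which dimensions move together.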
The last step is to evaluate the result. Depending on the analytical method, this is usually done with integrated visualization or statistical tools.
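For a directed result, one simple statistical check is to compare the predictions with labels we already know. The sketch below assumes such a labelled test set exists; the labels, the predictions and the choice of accuracy, precision and recall as measures are invented for illustration.

```python
# Invented test data: what really happened vs. what the model predicted.
actual = ["failure", "normal", "failure", "normal", "normal", "failure"]
predicted = ["failure", "normal", "normal", "normal", "failure", "failure"]


def evaluate(actual, predicted, positive):
    """Count hits and misses for one class and derive basic quality measures."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        # Share of predicted failures that were real failures.
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        # Share of real failures that the model found.
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


report = evaluate(actual, predicted, positive="failure")
for name, value in report.items():
    print(name, value)
```

Which measure matters most depends on the scenario: if a missed failure is expensive, recall is the number to watch.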
These four steps are integrated in RapidMiner, which we use for our data mining prototype to find the exciting facts and figures in the data. RapidMiner is an open source tool that was first released in 2001 under the name YALE. It offers flexible and extensive support and service options. I have personally known RapidMiner and its predecessor YALE since 2006. Its biggest strength, in my opinion, is that it supports the whole KDD process. Many data mining algorithms are incorporated in RapidMiner. One thing that still has to be improved is the connectivity to NoSQL databases.
In my opinion, data mining does not (yet) have the status it deserves, especially when it comes to large-scale data. What is your experience?