Google has been doing ‘it’ for more than 10 years, Amazon has adopted it, and we are starting now. I am talking about data analysis with data mining mechanisms. My first ‘fault’ is to speak of data mining only; calling ‘it’ knowledge discovery in databases (KDD) would be more correct, I guess. Data mining describes only the analysis step, and a more comprehensive process has to be completed before data mining activities can start. Before I describe the KDD process, I would also like to mention that several categories of data mining exist, each of which is based on a different algorithm. So we have to be careful when discussing data mining and KDD, so as not to lose sight of the wood for the trees. The picture below illustrates the complete KDD process.

Knowledge discovery in database (KDD) Process

The starting point is a large quantity of data. First, we must select the attributes that we will use for our analysis. This step is mandatory for all mining processes; they differ only in how the attributes are selected. I will discuss this in a later post. After data selection, a preprocessing step transforms the data into a model suitable for analysis. This step is usually done with the help of an ETL (Extract, Transform, Load) tool. Both steps help to reduce and structure the vast amount of data.
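To make the selection and preprocessing steps a bit more tangible, here is a minimal sketch in plain Python. It is not how an ETL tool or RapidMiner works internally; the record layout, the attribute names, and the helper functions (select_attributes, drop_incomplete, standardize) are all assumptions made up for illustration.

```python
from statistics import mean, pstdev

# Hypothetical raw records; the field names are illustrative only.
raw_records = [
    {"id": 1, "temperature": 90.0, "speed": 120.0, "operator": "A"},
    {"id": 2, "temperature": 110.0, "speed": None, "operator": "B"},
    {"id": 3, "temperature": 70.0, "speed": 80.0, "operator": "A"},
    {"id": 4, "temperature": 100.0, "speed": 140.0, "operator": "C"},
]


def select_attributes(records, attributes):
    """Selection: keep only the attributes chosen for the analysis."""
    return [{name: rec.get(name) for name in attributes} for rec in records]


def drop_incomplete(records):
    """Preprocessing: discard records that contain missing values."""
    return [rec for rec in records if all(v is not None for v in rec.values())]


def standardize(records):
    """Preprocessing: rescale each attribute to zero mean and unit variance."""
    if not records:
        return []
    stats = {}
    for name in records[0]:
        values = [rec[name] for rec in records]
        stats[name] = (mean(values), pstdev(values))

    def scale(name, value):
        center, spread = stats[name]
        # A constant attribute carries no information; map it to 0.0.
        return (value - center) / spread if spread else 0.0

    return [{name: scale(name, rec[name]) for name in rec} for rec in records]


selected = select_attributes(raw_records, ["temperature", "speed"])
complete = drop_incomplete(selected)
prepared = standardize(complete)
```

The data shrinks in two ways here: selection removes columns that are not needed, and the cleaning step removes rows that cannot be used. What remains is a smaller, uniformly scaled data set that a mining algorithm can work on.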

After that, data mining itself can start. Data mining is the process of detecting new correlations, patterns, and rules by sifting through the preprocessed data with suitable algorithms. The illustration below shows the different principles of data mining.

Directed and undirected data mining

There are two main principles in data mining: directed and undirected. Each produces a different type of result. The directed method is based on machine learning. From sample data that represents the desired scenario (e.g. a spark plug falls out), the system learns the classification patterns. Those patterns are then used to select the relevant data in order to forecast expected events. Undirected data mining delivers associations between attributes across multiple dimensions (e.g. temperature, speed, pace, acceleration, …). Statistical methods help to bring the dependencies between the dimensions to light.
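The difference between the two principles can be sketched in a few lines of Python. This is only an illustration under my own assumptions: for the directed case I picked a nearest-centroid classifier, and for the undirected case the Pearson correlation coefficient, as simple stand-ins. Neither is meant as the specific algorithm behind the illustration above, and all sample values and function names are made up.

```python
from math import sqrt
from statistics import mean

# --- Directed: learn from labeled samples, then classify new data ---
# Hypothetical training data: (temperature, vibration) -> label.
training = [
    ((90.0, 0.2), "normal"),
    ((95.0, 0.3), "normal"),
    ((130.0, 1.1), "failure"),
    ((125.0, 0.9), "failure"),
]


def train_centroids(samples):
    """Compute one mean vector (centroid) per class label."""
    grouped = {}
    for features, label in samples:
        grouped.setdefault(label, []).append(features)
    return {
        label: tuple(mean(column) for column in zip(*rows))
        for label, rows in grouped.items()
    }


def classify(centroids, features):
    """Assign the label whose centroid is closest (Euclidean distance)."""
    def distance(a, b):
        return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    return min(centroids, key=lambda label: distance(centroids[label], features))


# --- Undirected: look for dependencies between dimensions ---
def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long value lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


centroids = train_centroids(training)
predicted = classify(centroids, (128.0, 1.0))

# Hypothetical measurements of two dimensions, with no labels attached.
speed = [10.0, 20.0, 30.0, 40.0]
temperature = [52.0, 61.0, 73.0, 80.0]
correlation = pearson(speed, temperature)
```

In the directed part the labels tell the system what to look for, and the result is a prediction for a new observation. In the undirected part there are no labels at all; the result is a statement about how strongly two dimensions move together.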

The last step is to evaluate the result. Depending on the analytical method, this is usually done with integrated visualization or statistical tools.
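For a directed method, a very simple statistical evaluation could look like the sketch below. It assumes that we kept some labeled records aside and compares their known labels with the predictions; the labels, the values, and the function names are hypothetical and only serve as an example.

```python
from collections import Counter

# Hypothetical hold-out labels and the corresponding model predictions.
actual = ["failure", "normal", "normal", "failure", "normal"]
predicted = ["failure", "normal", "failure", "failure", "normal"]


def confusion_counts(actual_labels, predicted_labels):
    """Count how often each (actual, predicted) pair occurs."""
    return Counter(zip(actual_labels, predicted_labels))


def accuracy(actual_labels, predicted_labels):
    """Share of records whose predicted label matches the actual one."""
    hits = sum(a == p for a, p in zip(actual_labels, predicted_labels))
    return hits / len(actual_labels)


counts = confusion_counts(actual, predicted)
score = accuracy(actual, predicted)
```

The pair counts show not only how often the model is wrong, but also in which direction, which often matters more than the single accuracy figure.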

These four steps are integrated in RapidMiner, which we use for our data mining prototype to uncover the interesting facts and figures in the data. RapidMiner is an open source tool that was first released in 2001 under the name YALE. It offers flexible and extensive support and service options. I have personally known RapidMiner and its predecessor YALE since 2006. Its biggest strength, in my opinion, is that it supports the whole KDD process. Many data mining algorithms are incorporated in RapidMiner. One thing that still has to be improved is the connectivity to NoSQL databases.

To conclude for now, I would really like to recommend a book by Silvia Fleischmann, although it is only available in German: Silvia Fleischmann: Assoziationsanalysen im Rahmen des Data Mining

In my opinion, data mining does not (yet) have the status it deserves, especially when it comes to large-scale data. What is your experience?

About The Author

Alexander Rieger

In 1997 I was among the six computer scientists who co-founded Bosch Software Innovations. Right from the beginning, I was responsible for the development of the supply chain management system of Europe's 3rd largest retailer, dealing with billions of data sets. I was also responsible for the initial development of Singapore's eMobility charging infrastructure platform. I have many years of experience in software engineering and architecture, especially when it comes to artificial intelligence and data management issues. I have been programming genetic algorithms and neural networks ever since I started studying. After graduating, I was involved in the development of a data mining application for one of the leading integrated financial services providers worldwide. Currently I am developing an enterprise-wide concept for NoSQL data storage topped by a KDD (Knowledge Discovery in Databases) process.