How to avoid pitfalls with data analytics projects
A recent Capgemini study found that 15% of big data initiatives in Europe fail. To make sure your project belongs to the 85% that succeed, I’ve summarized the four major pitfalls to watch out for. (This blog post covers the first two pitfalls; the other two will follow in a separate post.)
Being aware of these pitfalls and taking them into consideration will significantly increase the chances of your data analytics project succeeding. Don’t worry: you are by no means the only one facing these challenges. In our initial data analytics workshops, we regularly see participants encountering them – sometimes right through to the end of a project. Here I’d like to share insights from many successful workshops and projects, point out the major pitfalls, and illustrate them with example use cases.
1. The initiator – IT vs. department
Data analytics and big data are not one and the same – even if they are often used interchangeably.
IT departments often view projects through “big data glasses”. They provide the infrastructure for collecting large amounts of data; for example, in the form of database clusters. These databases store huge volumes of data, which in itself does not create added value for the company. That’s why the data analytics project should always have a clearly defined technological as well as commercial goal. Collecting data just for the sake of it does not bring the company any benefits at all.
Added value only arises when the company leverages the data and the resulting insights. This is where its (non-administrative) departments come in. They define what goals they want to achieve with data analytics – not with big data. They provide the technical understanding that allows data scientists to work with the data in a targeted way. Close cooperation between the ideas provider (department) and the data scientists is therefore an absolute must in order to achieve the defined project goal.
In other words: the success or failure of a data analytics project depends on how much process understanding – and which parts of it – are passed on to the data scientists. Data analytics engineers also play an important role here: they support the “translation” and knowledge transfer between the disciplines, drawing on their operational experience in manufacturing or logistics and a sound basic understanding of data analytics approaches. The data experts must understand not only the project goal, but also, in particular, the correlations in the data – and, above all, how those correlations relate to the real world (machines, sensors, etc.) and the associated process steps.
As the Capgemini study shows, IT departments are often the initiators of data analytics projects. This is not in itself a problem, as long as the other departments are closely involved and define the technical objectives of the project.
2. The data basis
a) Quality of the data
Here it is important to consider what format the data is available in, where to look for what data, and whether the data is consistent and traceable across different sources.
To integrate a data set from several sources, you need a unique identifier that allows the data to be collated correctly. This may be a time stamp or a part number, for instance. Using a time stamp makes integration more complicated if different date/time formats are used in the individual data sources (German vs. US date format, time in UTC, etc.); however, it is still possible. By contrast, it is virtually impossible if different time bases are used. This is the case where there is no uniform time synchronization that generates the time stamps for all data sources.
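To illustrate, here is a minimal Python sketch of such an integration step. The field names, formats, and sample values are purely hypothetical; it only shows the idea of normalizing differently formatted time stamps to one common key before joining.

```python
from datetime import datetime, timezone

# Hypothetical records from two sources: a machine log using the
# German date format and a quality log using the US date format.
machine_log = [
    {"zeit": "24.03.2023 14:05:00", "temp_c": 71.2},          # DD.MM.YYYY
    {"zeit": "24.03.2023 14:06:00", "temp_c": 73.8},
]
quality_log = [
    {"time": "03/24/2023 02:05:00 PM", "roughness_um": 1.9},  # MM/DD/YYYY
    {"time": "03/24/2023 02:06:00 PM", "roughness_um": 2.4},
]

def to_utc(raw: str, fmt: str) -> datetime:
    """Parse a time stamp string into a common UTC key.
    We assume both sources already record the same (synchronized)
    time base; otherwise no format conversion can save the join."""
    return datetime.strptime(raw, fmt).replace(tzinfo=timezone.utc)

# Normalize both sources to one time-stamp key, then join on it.
machines = {to_utc(r["zeit"], "%d.%m.%Y %H:%M:%S"): r for r in machine_log}
quality = {to_utc(r["time"], "%m/%d/%Y %I:%M:%S %p"): r for r in quality_log}

merged = [
    {"timestamp": ts, "temp_c": m["temp_c"],
     "roughness_um": quality[ts]["roughness_um"]}
    for ts, m in machines.items() if ts in quality
]
```

Note that the conversion only works because both sources share the same time base; the comment in `to_utc` marks exactly the assumption that breaks down when clocks are not synchronized.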
b) Quantity of the data
The more, the better – so the saying goes. But with respect to data analytics, this is only partly true. Generally speaking, of course, the more data you have, the better. However, here too there are a number of key aspects to consider.
Depending on the technical goal definition, it may, for example, be important for the underlying data to contain not only positive outcomes, but also a sufficient number of negative outcomes.
Example: predicting a negative outcome
If the goal of the project is to develop a model for predicting a negative outcome, the training data set used to train the prediction model must contain a sufficient number of negative outcomes. Otherwise, the model is not able to learn these negative outcomes and will therefore be incapable of predicting them – consequently, you can’t achieve the project goal with this data set! For this reason, when compiling the training data set, you should make sure it contains a sufficient quantity of the parameter to be predicted (target variable) – in the above example, negative outcomes. One way of achieving this is to expand the time period from which the data is being gathered.
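A quick sanity check of the class balance before training can catch this problem early. The following Python sketch uses made-up labels and an arbitrary 5% threshold purely for illustration – what counts as "sufficient" depends on the project.

```python
from collections import Counter

# Hypothetical binary labels for a quality-prediction task:
# 1 = part OK (positive outcome), 0 = part defective (negative outcome).
labels = [1] * 980 + [0] * 20  # illustrative: only 2% negative outcomes

counts = Counter(labels)
neg_share = counts[0] / len(labels)
print(f"negative outcomes: {counts[0]} of {len(labels)} ({neg_share:.1%})")

# Rough pre-training check; the 5% threshold is an arbitrary
# example value, not a general rule.
if neg_share < 0.05:
    print("Warning: too few negative outcomes - consider extending "
          "the data collection period or rebalancing the training set.")
```

The warning branch corresponds to the remedy mentioned above: expanding the time period from which the data is gathered so that enough negative outcomes end up in the training set.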
c) The “right” data
So it’s clear that the quantity of data isn’t the only criterion. Above all, you need the right data!
What do we mean by the “right data”?
The data must contain the relevant information required to achieve the technical project goal. If, for example, you want to develop a model for predicting the product quality as defined by a surface roughness measurement, this variable must be represented in the data set. If you carry out the measurement without subsequently storing the measured value, you won’t be able to develop a corresponding model. This, too, is not an unsolvable problem, but it can delay progress because an adequate data basis first has to be generated (e.g. with the help of additional sensor technology, saving the relevant data, etc.).
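A simple pre-modeling availability check along these lines can reveal such gaps before any modeling effort is wasted. The record layout and the `roughness_um` field name below are hypothetical:

```python
# Hypothetical check that the target variable (here: surface roughness)
# is actually present and populated before modeling starts.
records = [
    {"part_id": 1, "feed_rate": 0.20, "roughness_um": 1.8},
    {"part_id": 2, "feed_rate": 0.30, "roughness_um": None},  # measured, not stored
    {"part_id": 3, "feed_rate": 0.25},                        # never measured
]

target = "roughness_um"
usable = [r for r in records if r.get(target) is not None]
coverage = len(usable) / len(records)
print(f"target '{target}' available in {coverage:.0%} of records")
```

If coverage turns out to be low, the remedy is the one described above: first generate an adequate data basis, e.g. by adding sensor technology and storing the measured values.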
Who will make sure that your data analytics project will succeed?
To help experts achieve a), b), and c), we have taken the experience we have gained in many successful projects and pooled it in data quality guidelines, which we provide at the start of a project. We also deal with this topic in the initial workshops by identifying those use cases that will deliver quick wins. In this way, we raise manufacturing experts’ awareness of these topics, which always proves to be a clear advantage for the next steps in the process.