Modern analytics, big data and the IoT
The Internet of Things (IoT) and big data are closely related; one could even say that the IoT is big data driven to another extreme. Most big data problems revolve around social or customer data, but the IoT makes things even more complex: it’s all about creating new business models or improving existing services by intelligently integrating devices – and of those we have many, many more. Where most big data problems focus on distributed storage and analysis, IoT applications also need to take things like distributed filtering or aggregation into account before storage is even considered an option.
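To make "filtering or aggregation before storage" a little more concrete, here is a minimal Python sketch of what a device or gateway might do before anything is written to a data store. Everything in it is an illustrative assumption rather than a reference to a specific product: the reading format (`device`, `ts`, `value`), the plausible value range, the 60-second window and the function names are all made up for this example.

```python
from collections import defaultdict
from statistics import mean


def filter_readings(readings, low, high):
    """Drop readings whose value lies outside the plausible range [low, high]."""
    return [r for r in readings if low <= r["value"] <= high]


def aggregate_by_window(readings, window_seconds):
    """Average the readings per device and per fixed-size time window."""
    buckets = defaultdict(list)
    for r in readings:
        # Align each timestamp to the start of its window.
        window_start = (r["ts"] // window_seconds) * window_seconds
        buckets[(r["device"], window_start)].append(r["value"])
    return {key: mean(values) for key, values in sorted(buckets.items())}


def summarize_at_edge(readings, low=-40.0, high=85.0, window_seconds=60):
    """Filter first, then aggregate; only the summary would be sent on for storage."""
    return aggregate_by_window(filter_readings(readings, low, high), window_seconds)


# Hypothetical temperature readings; the 999.0 value stands in for a sensor glitch.
readings = [
    {"device": "sensor-a", "ts": 0, "value": 20.0},
    {"device": "sensor-a", "ts": 30, "value": 22.0},
    {"device": "sensor-a", "ts": 45, "value": 999.0},
    {"device": "sensor-a", "ts": 70, "value": 21.0},
    {"device": "sensor-b", "ts": 10, "value": 18.0},
]

summary = summarize_at_edge(readings)
for (device, window_start), average in summary.items():
    print(device, window_start, average)
```

In this toy run five raw readings shrink to three summary records, and the glitch never reaches storage. A real deployment would have to decide what gets lost by summarizing this early, which is exactly the kind of trade-off that plain storage-centric big data setups rarely face.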
Either way, it is not terribly interesting to join the usual discussion about what “big” really means in this context. Ultimately everybody’s interpretation contains a grain of truth: it is essentially about huge and/or highly heterogeneous and/or semi-structured or unstructured data. However, the real issue is: what do we use this data for?
To answer this question we need to take a look at the different stages of innovation or discovery. In the first stage, all we do is try to poke holes in the darkness. We look for connections of some sort, such as correlations or clusters: basically anything that can help us gain insight, no matter how small, into the underlying system. Sometimes that’s already enough to create business value. In the second stage, data analysis goes on to create more abstract descriptions by focusing on parts of the system. This is classical data analysis, which utilizes more advanced methods. In the third and final stage we have a complete picture of the underlying system and our goal is to fit the model’s parameters. This last stage is no longer really part of data analysis as we already fully understand how the system works. Typical advanced analytics focuses mainly on the second stage, whereas exploratory data analysis concentrates more on the first.
Creating value by discovering insights
Current IoT applications drive us back towards the first phase of that process. Once again, we don’t really know much, if anything, about the underlying system and are looking for interesting connections in the data to gain initial insights. Establishing these connections can create more value than before as they are now derived from much bigger and more complex data sources. This sometimes leads to the claim that in the age of big data all we need are such correlations. True, correlations based on much more, and more diverse, data are often, albeit not always, more meaningful (beware of spurious correlations!). But that’s only part of the story – we should not lose sight of the big picture: finding a model that describes as much as possible of the underlying system is the goal of data science. Big or small.
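The warning about spurious correlations is easy to illustrate. The following Python sketch, which uses only the standard library, generates series of pure random noise and reports the strongest correlation found between any two of them. The series counts, sample sizes, seed and function names are arbitrary choices made for this illustration, not part of any particular analytics tool.

```python
import random
from itertools import combinations


def standardize(xs):
    """Center a series and scale it to unit length, so a dot product gives Pearson's r."""
    n = len(xs)
    mean_x = sum(xs) / n
    centered = [x - mean_x for x in xs]
    norm = sum(c * c for c in centered) ** 0.5
    return [c / norm for c in centered]


def strongest_chance_correlation(n_series, n_samples, seed=0):
    """Largest absolute Pearson correlation among unrelated random series."""
    rng = random.Random(seed)
    series = [
        standardize([rng.gauss(0.0, 1.0) for _ in range(n_samples)])
        for _ in range(n_series)
    ]
    return max(
        abs(sum(a * b for a, b in zip(s1, s2)))
        for s1, s2 in combinations(series, 2)
    )


# 100 unrelated "sensors": first with only 10 observations each, then with 500.
few_samples = strongest_chance_correlation(n_series=100, n_samples=10)
many_samples = strongest_chance_correlation(n_series=100, n_samples=500)

print(f"strongest correlation with 10 samples:  {few_samples:.2f}")
print(f"strongest correlation with 500 samples: {many_samples:.2f}")
```

With many variables and few observations, some pair of completely unrelated series will look strongly correlated by chance alone; with more observations per series that effect shrinks considerably. More data helps, in other words, but a high correlation on its own still says nothing about how the underlying system works.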
Tools to create these initial insights from big data are all the hype right now. However, this also raises an interesting question: if we know how little we know about the big data world, how can we trust any one monolithic, proprietary platform provider to know what will keep us innovating and discovering new insights now and in the future?
Open environments open the doors
The need for open platforms in classic data analytics is therefore even more pressing now, at the dawn of the era of the IoT, when data analysts have easy access to an ever-growing number of internal and external data sources. To tackle this challenge they need quick and easy access to best-of-breed tools to intuitively explore new analysis ideas unburdened by the artificial barriers of closed environments.
Here too, therefore, the five key pillars of an open analytics platform are vital for success:
- Open platforms are integrative. They play well with existing systems (but don’t have to). They support various data sources, both large and small. They integrate new, existing, and legacy tools – in-house or from external specialists. This is instrumental to securing the best of both worlds: emerging big data platforms and advanced analytic tools.
- Open platforms are transparent. They are intuitive to use, which enables quick prototyping, and they can be used in production and as templates for business analysts. They are, of course, open source, so anyone can build on top of them. It is interesting to note that many of the big data storage and processing platforms are open source already!
- Open platforms are flexible and agile. They allow reproducibility and reusability of processes and enable users to quickly and effortlessly explore alternative processes at the same time. They also future-proof your big data investments – there is no need to wait for a proprietary vendor to add desired functionalities: the community or one of the many platform partners will have already done that.
- Open platforms are collaborative at many levels. Users of, but also developers for, open platforms know that they can boost the impact of their work by sharing their latest tools and learnings rather than keeping it all to themselves.
- As a result, open platforms are much more powerful than any monolithic application can ever be. Due to the simplicity of mixing and matching best-of-breed technology within the same intuitive environment, breakthrough discoveries and innovation can come from anyone and anywhere.
And here is another thought: history keeps repeating itself. Expert Systems were supposed to be the solution to knowledge capturing – but ended up “only” being an important piece of a larger puzzle. Data warehouses were supposed to solve the need for, I guess, all of ETL, once and for all; they ended up being a solution for fairly static data structures but never really captured all of the data. And now we believe that by pumping all of our data into a large, distributed data storage environment we will solve all of our “data problems”?
Rather, I foresee an interesting mix of unstructured, messy, heterogeneous, distributed data storage facilities playing in concert with more organized, much better structured data repositories. Do we already know what this mix will look like? Do any of the analytics tool vendors know? Do we really want to repeat the mistake of locking ourselves and/or our data in with one single vendor and trusting that this vendor will know what we will need a year or two from now?
My personal bet is on an open platform that allows selection of the best resources (data, tools, or expertise), unconstrained by a proprietary toolbox. Now and in the exciting years to come.