The Data Genie is out of the bottle
Big Data is seen as one of the most important enabling technologies for the IoT. So this time, my post is all about Big (and not-so-big) Data management for M2M and IoT applications. Please welcome one of the most respected figureheads of the Big Data movement: Mike Olson, Chief Strategy Officer and Founder of Cloudera, whom I recently met for an interview.
DIRK SLAMA Mike, Cloudera is the company behind Apache Hadoop, one of the most widely used Big Data frameworks. Can you talk about some of the most interesting IoT use cases that you’ve seen using your technology?
MIKE OLSON Sure. You mentioned the Large Hadron Collider (LHC) at CERN as one of the case studies in your book, and how the LHC is not only capturing enormous amounts of sensor data, but is also building up an efficient, multitiered processing pipeline for the data from the experiments. As it turns out, at least one of the research facilities in tier 2 of this processing pipeline is using Hadoop: the University of Nebraska-Lincoln. They are using it to perform advanced physics analytics on the massive amounts of data generated by the particle collisions at CERN.
There are many other interesting cases. Take energy, for example; the smart grid is obviously an important IoT application using Big Data technology, and we know that Hadoop is being used by a number of those companies. A good example is Opower, which sells its services to utilities that want to capture, analyze, and use the smart grid observations streaming from their smart meters. Smart meters are now generating an observation every six seconds, which is vastly more than the one reading per month that utilities used to collect. This means that they can establish a very fine-grained signal of demand over the course of a day. They can even determine when significant events happen, such as turning your washing machine or your refrigerator on. As such, they can observe demand in real time, and then use those observations to predict demand fluctuations, not just on the basis of smart grid activity, but also based on weather reports, forecasts, and significant upcoming events and celebrations. They can even use gamification to manage demand – for example, by letting customers know what their usage is as compared to their neighbors’ usage and encouraging people to compete a little to conserve energy.
There’s a pretty broad range of use cases, but generally, all of this data is being generated by sensors and machines at machine scale. Collecting and analyzing it using older technologies is a challenge. Big, scale-out infrastructure makes the processing and analysis of this data much cheaper.
So what challenges does the IoT present that haven’t been faced by previous generations of computing and data management?
My own view is that we are only seeing the very early days of IoT data flows, and already those data flows are almost overwhelming. Take the amount of information streaming from the smart grid: going from one reading a month to 10 readings a minute gives us hundreds of thousands of times more observations per meter per month. Those data volumes are guaranteed to accelerate. In the future, we will be collecting more data at finer grain, and from a lot more devices.
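As a rough check on the scale of that jump, the arithmetic can be sketched in a few lines of Python. The function name and the 30-day month are assumptions made for illustration; the six-second interval is the one mentioned earlier in the interview.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def readings_per_month(interval_seconds: int, days_in_month: int = 30) -> int:
    """Readings one meter produces in a month at a fixed sampling interval.

    The 30-day default is an assumption, not a figure from the interview.
    """
    return days_in_month * SECONDS_PER_DAY // interval_seconds


# Old regime: a single reading per meter per month.
old_readings = 1

# New regime: one observation every six seconds (10 per minute).
new_readings = readings_per_month(interval_seconds=6)

print(f"Readings per meter per month: {new_readings:,}")
print(f"Increase over monthly reads:  {new_readings // old_readings:,}x")
```

Under these assumptions, a single meter produces several hundred thousand readings a month where it used to produce one. The exact multiple depends on the sampling interval and the length of the month.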
Look at the City of San Francisco: some estimate that we already have 2 billion sensors in the city. Not only in smartphones and cars, but in many other places, such as the city’s many high-rise buildings, where they measure air pressure, temperature, vibration, etc. The most interesting thing about those sensors right now is that most of them are not connected to a network. I predict that most of them will be on a network within the next half decade; that those devices will be swapped out with network- and mesh-connected sensors. And that will bring about an absolute tsunami of data! So designing systems that can capture, process, summarize, manage, and then analyze this data is the big challenge for IT. We have never seen a flood of information like the one we are about to see.
So, are there any key advances in data management that are making the IoT a reality?
We are building scale-out storage and compute platforms today that we didn’t have a decade ago. We didn’t need them then, because we weren’t collecting information on this scale, and we weren’t trying to analyze it in the way we are today. The emergence of machine-generated data has forced us to rethink how we capture, store, and process data; it’s now completely commonplace to build very large-scale, highly parallel compute farms. So that transformation has already happened. If we look to the next 5 or 10 years of advances, the state of the software should continue to improve; we will have more and better analytic algorithms; we’ll have cheaper scale-out storage architectures; we will be able to manage with less disk space, because we’ll be smarter about how we encode and replicate data.
But the most interesting thing, I think, is going to be advances in hardware. The proliferation of network-connected sensors in mobile devices and in the environment in general is going to continue or may even explode. That will produce a lot of new data. Think about the Intel Atom chip line and its equivalents from other vendors. On the data capture/storage/analysis side, we will see chips that are better suited to this scale-out infrastructure. Memory densities will increase of course, and we will see solid-state drives replace disks in many applications. Networking interfaces at the chip level will become more ubiquitous and much faster. We will see optical instead of just electrical networks available widely at chip level. The relative latencies of storage – that is, disk to memory – will shift; solid-state disk to RAM is going to be a much more common path in the future. The speed of optical networks will make remote storage much more accessible. So I think we will see a lot of innovation at the hardware level that will enable software to do way more with the data than was ever possible before.
What are the biggest risks for companies that want to build IoT solutions that leverage Big Data? How can they mitigate these risks?
The technologies for generating this type of data – the scale-out proliferation of sensor networks – as well as the infrastructure to capture, process, and analyze this data, are new. Our experience has been that it is a very smart idea to start with a small-scale proof of concept. Instead of a million devices, maybe start with a thousand devices. And then build a data capture and processing infrastructure that can handle that. This should allow you to check that it works, and to educate your people and your organization about how these systems function, and what they are capable of. These are new technologies, and adoption of new technologies requires learning and new processes for successful deployment. It’s important to learn those at small scale before you go for infinite-scale IoT.
We talk to a lot of people who are fascinated by IoT technology. They are excited about Big Data for its own sake. Those are bad people for us to work with because they are not fundamentally driven by a business problem. It’s important when you start thinking about the IoT to think about why it matters. What questions do you want to answer with the sensor data streaming in? What are the business problems you want to solve? What are the optimizations you want to make? And then design your systems to address these problems. The “shiny object syndrome” of engineers who want to play with new technology – I totally get that, I am one of those guys, but those projects generally fail because they don’t have clear success criteria.
Are there clear indicators that tell you when to recommend using Big Data technology, and when not?
If you have a traditional transaction processing or OLAP workload, we will point you toward Oracle, SQL Server, Teradata, etc., because those systems evolved to work on those kinds of problems. New data volumes and new analytical workloads are where Big Data technology works best. When the goal is, “we want to rip out our existing infrastructure and replace it with Hadoop,” we generally walk away from those opportunities; they don’t go well. If you have new problems or business drivers, and new data volumes, those are the cases where we are most successful. A wholesale, blind desire to rip and replace never works.
Can you quantify these indicators? How big does data really have to be to require Big Data technology?
When the industry talks about Big Data, they always talk about volume, variety, velocity; meaning that you can have a very large amount of data, you can have wildly different types of data that you have never been able to bring together before, or you can have data that is arriving at a furious pace. There is one other criterion that we see, which doesn’t fit under the Vs: the analytic algorithm you really should be using. In other words, what is the computational approach you want to take toward the data? Sometimes, if you need to do a lot of computation – such as for machine learning – based on modest amounts of data, a scale-out infrastructure like Hadoop makes sense. Satisfying any one of the volume, variety, velocity, or computational complexity requirements is enough to make a Big Data infrastructure attractive. In our experience, if you have two or more of those requirements, then the new platform is critical.
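The rule of thumb in that answer can be written down as a small decision function. This is only an illustrative sketch: the function name, the boolean inputs, and the "not indicated" label for the zero-criteria case are assumptions, and deciding whether a workload actually meets a criterion remains a judgment call.

```python
def assess_big_data_fit(
    volume: bool,
    variety: bool,
    velocity: bool,
    heavy_compute: bool,
) -> str:
    """Apply the rule of thumb: one criterion met is attractive, two or more is critical."""
    # Booleans sum as 0 or 1, so this counts how many criteria are met.
    criteria_met = sum([volume, variety, velocity, heavy_compute])

    if criteria_met >= 2:
        return "critical"
    if criteria_met == 1:
        return "attractive"
    # Hypothetical label; the interview does not name this case.
    return "not indicated"


# Example: modest data volume, but compute-heavy machine learning.
print(assess_big_data_fit(volume=False, variety=False, velocity=False, heavy_compute=True))
```

A traditional transaction processing workload with none of these pressures falls into the last branch, which matches the advice to stay with an established database in that case.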
What advice would you give to readers who are developing their strategy and implementation for Big Data and the IoT?
My consistent advice to organizations we work with is to take a use case or two – ones that really matter, where a successful result would be meaningful for the business – and then attack those on a modest scale. This will allow you to learn the necessary skills, and will demonstrate that the technology actually solves the problem. Once you have done that, scaling big – given the infrastructure – can be done for a simple linear cost, and works great.
Who stands to gain the most from the IoT and Big Data? And who stands to lose the most?
My heartfelt conviction is that data will transform virtually every field of human endeavor. So, in your lifetime and mine, we will find cures for cancer, because we will be able to analyze genetic and environmental data in ways that we never could before. In the production and distribution of clean water; the production and distribution of energy; in agriculture – where we will be able to grow better, denser crops that feed 9 billion instead of 7 billion people; in every endeavor, I believe that data is going to drive efficiency. If organizations fail to embrace that opportunity, they risk losing out to the competition.
There has been a lot of discussion recently about privacy and Edward Snowden and the NSA, and one concern I have is that we will have an unfair backlash because of those examples. But think about the advantages if we could, for example, monitor student behavior at a very fine grain, and design courses that cater expressly to the learning modalities of individual students. We will have smarter people learning things faster and better than ever before. Think about the quality of the healthcare we could deliver.
Privacy does and will matter; we need sensible policy and meaningful laws and penalties to enforce reasonable guarantees of privacy. The “Data Genie” is kind of out of the bottle; I don’t think we will be able to stuff it back in. Most of all, I think it would be a mistake for us to try to curtail the production and collection of data, because I believe that it is such an opportunity for good in society.
This text was excerpted from the book Enterprise IoT by Dirk Slama, Frank Puhlmann, Jim Morrish, and Rishi M Bhatnagar (O’Reilly, 2015).