Tech of the Town: Big Data From A to Zettabyte
“Space is big. Really big. You just won’t believe how vastly, hugely, mind-bogglingly big it is. I mean, you may think it’s a long way down the road to the chemist, but that’s just peanuts to space.” – Douglas Adams
The human mind isn’t really built to hold the answer to gigantic, imponderable, impossible questions. Say we were to ask “how many stars are there in the observable universe?” Sure, we can estimate the answer and come up with a number – one astronomer guesses about 1,000,000,000,000,000,000,000,000 – but what does that number mean? When I look at it, my eyes glaze over after about the third set of zeroes. That number represents more stars than all the grains of sand on all of the beaches on Earth. It’s fundamentally too big for me to really make any sense of.
Overwhelmed yet? Sit down while you can, because according to IBM, the human race created 2.5 billion gigabytes of data every day in 2012. Nobody actually knows how much data currently exists on Earth, but one estimate is between five and fifty zettabytes. (That’s 1,000,000,000,000,000,000,000 bytes, if you’re scoring at home.) Not quite as many bytes as stars in the sky; even counting the 8 bits in every byte, the stars still win – for now.
Excuse Me, You Broke My Data Model
The traditional tools of statistics and analytics don’t really work when they confront datasets with millions or billions of observations. In my statistics classes, I might analyze a few hundred or a few thousand observations to try to find a causal relationship between variables. Let’s say I run my tests and find that the gas mileage of a car has a statistically significant effect on its price, but the length of the car does not. By seeing which variables are significant and which aren’t, I gradually build a model that predicts the price of cars reasonably well.
But big data breaks that approach: the datasets are so enormous that to traditional analytics tools, everything looks significant. You might have a variable for whether or not the carburetor was made in Alabama, and because there are so darn many observations, you’ll find a statistically significant difference between cars with Alabamian carburetors and cars without them. But just because a difference is statistically significant doesn’t mean it has any practical application or value. I can’t really insert “Alabamian-carburetor cars sell for $25 more” into a brochure. And if every variable you could possibly add looks like it matters, how are you supposed to figure out which ones are actually meaningful and useful to you?
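To see why everything becomes “significant” at scale, here’s a minimal simulation, invented purely for illustration (the $25 carburetor effect and all the numbers are made up): with hundreds of thousands of observations, even a trivial $25 price difference produces an overwhelming test statistic, while remaining a rounding error in practical terms.

```python
import math
import random

random.seed(42)
n = 200_000  # a "big data"-sized sample for each group

# Simulated car prices: cars with the made-up feature sell for just $25 more,
# against $1,000 of everyday price noise.
with_feature = [20_000 + 25 + random.gauss(0, 1_000) for _ in range(n)]
without_feature = [20_000 + random.gauss(0, 1_000) for _ in range(n)]

mean_a = sum(with_feature) / n
mean_b = sum(without_feature) / n
var_a = sum((x - mean_a) ** 2 for x in with_feature) / (n - 1)
var_b = sum((x - mean_b) ** 2 for x in without_feature) / (n - 1)

# Two-sample z statistic: with n this large, even the $25 gap yields a
# huge z score (far past any significance threshold)...
z = (mean_a - mean_b) / math.sqrt(var_a / n + var_b / n)

# ...yet the effect itself is tiny relative to ordinary price variation.
effect_in_sds = (mean_a - mean_b) / math.sqrt((var_a + var_b) / 2)
print(f"z = {z:.1f}, effect = {effect_in_sds:.4f} standard deviations")
```

The test statistic grows with the square root of the sample size, so piling on observations eventually pushes any nonzero difference, however trivial, past the significance threshold.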
The answer turns out to be using correlation instead of causation. It’s hard to figure out whether A causes B. It’s easier to show that A and B are correlated – that they often appear together. That’s admittedly a weaker conclusion, but you don’t have to rely on just one. If some computer looks at my data and determines that I searched for baby powder, maybe it thinks I should get ads for other baby-related products. But if it also determines that I searched for diapers, pacifiers, cribs, Lamaze breathing classes, and “Oh my God I’m about to become a parent what should I do”, it can be pretty sure that I’m in the market for baby supplies.
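The idea of stacking weak correlations into a strong prediction can be sketched in a few lines. This is a toy model invented for this article (the signal list and scoring function are made up, not any real ad platform’s method): one matching search is weak evidence, but several together are nearly conclusive.

```python
# Made-up set of search terms that correlate with expecting a baby.
baby_signals = {"baby powder", "diapers", "pacifiers", "cribs", "lamaze classes"}

def baby_score(search_history):
    """Fraction of the known baby-related signals present in a user's searches."""
    hits = baby_signals & set(search_history)
    return len(hits) / len(baby_signals)

one_search = ["baby powder", "car wax"]
many_searches = ["baby powder", "diapers", "pacifiers", "cribs", "lamaze classes"]

print(baby_score(one_search))    # one weak signal: 0.2
print(baby_score(many_searches)) # many correlated signals: 1.0
```

No single search proves anything, and none of this establishes causation; the confidence comes from how many independent correlations point the same way.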
Imagine that kind of targeted, predictive analytical capability repeated trillions of times over. Different searches, different targets, different predictions. Not all of them will be right; maybe I’m just having some kind of weird fantasy and am not really about to be a parent. But the more of these correlations you have for your customers, and the more you apply them as part of your business, the more powerful your ability to target your sales will be.
Uh, My Hard Drive Can’t Store a Zettabyte
Storage actually turns out not to be a problem. It’s so cheap and easy to store data these days, whether in a traditional data center or distributed in the cloud, that storage is effectively not a barrier to companies with any kind of ready cash. The problem has more to do with how those data can be stored, accessed, and analyzed.
Relational databases are what most people think of when they think of data. Imagine a giant Excel spreadsheet. Each row in the table is a piece of data with a unique identifier and a bunch of information about it; in other words, that data is structured. Each piece of data in your database is defined by how it links to the rest of your data. It’s neat, tidy, and easy to search.
Unfortunately, relational databases have trouble keeping up with big data. If you have thousands of industrial sensors feeding you ridiculous quantities of data every minute, traditional relational databases will struggle to store it all neatly in a relational format. And if you want to correlate it with some other form of data that can’t be easily expressed in a table, like the contents of phone calls or pictures, your relational database is likely to be overwhelmed. To cope, engineers designed NoSQL databases, which operate on a non-relational model. This is also where data lakes, which we talked about last time, come into play; Hadoop’s file system is designed as a data lake, able to store many different kinds of data and retrieve them on command.
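The non-relational idea can be sketched with a plain list of dictionaries standing in for a document store (a toy invented for illustration – this is not how Hadoop or any particular NoSQL product works internally): records with completely different shapes sit side by side, with no fixed schema required.

```python
# Heterogeneous "documents" that would never fit one relational table:
documents = [
    {"id": 1, "type": "sensor", "temp_c": 71.2, "ts": "2024-01-01T00:00"},
    {"id": 2, "type": "call", "transcript": "Hello, I would like to order"},
    {"id": 3, "type": "image", "pixels": b"\x89PNG", "tags": ["factory"]},
]

# Querying means scanning and filtering rather than joining tidy tables.
sensors = [d for d in documents if d["type"] == "sensor"]
print(len(sensors))  # 1
```

The trade-off is the mirror image of the relational one: you give up the neat schema and fast, precise queries in exchange for the ability to store anything at all.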
The Stars Are Not the Limit
We have practically unlimited storage capacity. We’ve invented new ways to store, organize, and query data that allow us to make use of bigger and more diverse datasets than ever before. And we have the power of repeated correlations, which lets our analytics be more granular and more predictive than businesses have ever managed before. As much information as we have, however, we’d still be struggling to make proper use of it, were it not for the power of machine learning. We’ll tackle that topic next time on Tech of the Town.
By Andy Tisdel