Tech of the Town: Understanding Metadata
If you’re anything like me, a non-data person, you know you should understand data but somehow you don’t. I think I understand the concept of metadata, and the idea of a data warehouse makes some sense to me – like one colossal filing cabinet that happens to be a computer – but things like data lakes or data cleansing just sound like Charlie Brown’s teacher in Peanuts: “wah, wah, waoah”. But since data is one of the fundamental parts of our modern business world, today’s Tech of the Town is about getting to know it a little better.
What is Metadata? Is That Data About Philosophy?
Imagine that you’re a librarian. For reasons unknown, you don’t use computers or the Internet; maybe you’re conducting your operations in the middle of Antarctica, or the jungles of the Amazon, or someplace else where wireless coverage isn’t readily available. Imagine further that your library has recently been through a tremendous earthquake and, though the building is okay, the quake has shaken all the books off the shelves and somehow destroyed all the labels with the Dewey decimal numbers. To categorize your library, you have to start from scratch.
In this analogy, each book is like an email or a text message, a Salesforce profile or a Paycom account. The contents of the book constitute the data. It’s impossible for you to read every single book and use what you’ve learned to create a filing system. You’d go insane. But the cover and title page will tell you the title, author’s name, the publisher, the year it was published, and maybe even the genre or the number of pages. Without reading each individual book, you can use the metadata to rank and order the books in your library, turning them into a collection instead of a random jumble.
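For the code-curious, here is a minimal sketch of that idea in Python. The `Book` class, its field names, and the `build_catalog` helper are my own inventions for illustration, not part of any real library system. The point is that the sorting step only ever touches the metadata fields and never opens `contents`.

```python
from dataclasses import dataclass


@dataclass
class Book:
    # Metadata: what the cover and title page tell you.
    title: str
    author: str
    year: int
    genre: str
    # Data: the full text, which the cataloguing code below never reads.
    contents: str = ""


def build_catalog(books):
    """Order books by genre, then author, then year, using metadata only."""
    return sorted(books, key=lambda b: (b.genre, b.author, b.year))


# A small jumble of books, in no particular order.
pile = [
    Book("Neuromancer", "William Gibson", 1984, "Science Fiction"),
    Book("Beloved", "Toni Morrison", 1987, "Fiction"),
    Book("Dune", "Frank Herbert", 1965, "Science Fiction"),
]

for book in build_catalog(pile):
    print(book.genre, "|", book.author, "|", book.year, "|", book.title)
```

Swap the sort key for `lambda b: b.year` and the same pile becomes a chronological collection, still without reading a single page.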
This metadata – data about data – forms a massive amount of the data we generate every day. When I upload a photo to Facebook, the website records the picture’s dimensions, how big the file was, what kind of camera it was captured on, a vague idea of what’s in the image, and of course the date, time, and location it was taken. The website doesn’t need to actually see the image in order to know a great deal about it. Entire industries have been built around collecting and analyzing metadata, even down to website analytics for humble bloggers. If my analytics tools tell me that the most common viewers of this post are Nepalese virtualization engineers and the NSA, you can be absolutely certain that I’ll be posting VMware jokes in Nepali and avoiding references to metadata’s more terrifying applications.
Data Lakes and Data Warehouses
So, let’s return to the jumble of books on your floor. This is obviously an unsatisfactory data storage solution; you can’t find anything. You decide to come up with a classification system, with sections and shelves, and then use the metadata (and the data) to load the books back onto the appropriate shelves. This is basically what a data warehouse looks like. You’re performing the extract, transform, load (ETL) process whereby data is introduced into the warehouse: extracting the books from the floor, labeling them in a way that’ll be useful for your classification system (the transformation), and literally and figuratively loading them onto the shelves. Everything is orderly, preassigned, and neat.
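If it helps to see the three steps spelled out, here is a rough Python sketch of extract, transform, load. Everything in it is an assumption made for the sake of the example: the messy records on the `floor`, the `GENRE_LABELS` lookup, and the three function names are hypothetical, and a real warehouse pipeline would be far more involved.

```python
from collections import defaultdict

# Hypothetical raw records, as found "on the floor": inconsistent and untidy.
floor = [
    {"title": "  dune ", "author": "herbert, frank", "genre": "sci-fi"},
    {"title": "Beloved", "author": "Morrison, Toni", "genre": "FICTION"},
]

# Hypothetical mapping from sloppy genre labels to official section names.
GENRE_LABELS = {"sci-fi": "Science Fiction", "fiction": "Fiction"}


def extract(source):
    """Extract: pick each item up off the floor."""
    yield from source


def transform(record):
    """Transform: relabel a record to fit the classification system."""
    return {
        "title": record["title"].strip().title(),
        "author": record["author"].strip().title(),
        "section": GENRE_LABELS[record["genre"].strip().lower()],
    }


def load(records, shelves):
    """Load: place each relabeled record on the shelf for its section."""
    for record in records:
        shelves[record["section"]].append(record)
    return shelves


shelves = load((transform(r) for r in extract(floor)), defaultdict(list))
print(sorted(shelves))
```

Notice that the classification system is decided up front: by the time a book reaches a shelf, it already has a clean title, a clean author, and a section.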
If you’re a busy library-goer, this is great! You don’t have to be an expert in library science to use the library. You can just sprint to the appropriate section, grab the book or books that catch your fancy, and sprint back out again. (Side note: who is in a rush to get in and out of the library??) Data warehouses are designed for just such busy people who need information that behaves properly, such as business analysts or decision-makers. They’re not concerned with how the data was collected – they just want to query or analyze it for insights about their sales, consumers, or what have you.
But sometimes it’s unnecessary or impossible to do the tedious work of transforming the data and loading it into your library/warehouse. Suppose those books were mixed up with a bunch of magazines, newspapers, printed-out blog posts, CDs, DVDs, Post-Its and crossword puzzles. If you decide to extract, transform, and load every single item using the same criteria – name, author, date published, etc. – you’ll end up with a mess! Your warehouse will be full of mislabeled or incomplete files, your librarians will tear their hair out, and Mr. I’m-Too-Busy-to-Spend-Time-in-a-Library will have a fit of apoplexy and die on your floor.
Of course, with a sufficiently sophisticated ETL process and data warehouse, you can get all of this information “loaded” into your system with the proper labels and categorization. This is also where data cleansing comes in, which is when a person or a program goes through your messily transformed data line by line and rectifies the errors as best they can. But another option is to leave all the junk on your floor and let the library-goers sort through only the parts of it that they need. This is a data lake: a central repository for all data in your organization, from whitepapers to sensor data to cat GIFs. It’s harder for just anyone to look through, but for people who want to run their own analyses instead of analyzing only the pre-transformed data in a warehouse, it can provide a fuller and richer picture of the data you’ve collected. (Endearingly, a disused and un-analyzable data lake is known as a data swamp.)
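Here is one way to picture the difference in Python. This is a toy sketch under my own assumptions: the items in `lake` are made up, and `read_as` is a hypothetical helper, not a function from any real data lake product. The lake keeps everything exactly as it arrived, and each reader applies only the structure they need at the moment they read.

```python
# Hypothetical data lake: items are stored exactly as they arrived.
lake = [
    {"kind": "book", "title": "Dune", "author": "Frank Herbert"},
    {"kind": "dvd", "title": "Home Movies", "runtime_minutes": 90},
    {"kind": "post-it", "text": "Buy milk"},
    {"kind": "sensor", "reading": 21.5, "unit": "C"},
]


def read_as(lake, kind, fields):
    """Apply a structure at read time, to only the items the reader wants."""
    for item in lake:
        if item.get("kind") == kind:
            yield {field: item.get(field) for field in fields}


# A reader who only cares about books asks for just those fields.
titles_and_authors = list(read_as(lake, "book", ["title", "author"]))
print(titles_and_authors)

# Forcing the book schema onto everything shows why one-size-fits-all fails.
forced = [{f: item.get(f) for f in ("title", "author")} for item in lake]
incomplete = [row for row in forced if None in row.values()]
print(len(incomplete), "of", len(lake), "records are incomplete")
```

The second half is the mess from the magazine-and-Post-It scenario: push every item through the book criteria and most of the resulting records come out with holes in them.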
I hope this article has provided you data on what metadata, data warehouses, and data lakes look like in real life. Next time: How does Big Data differ from traditional databases? (Spoiler alert: it’s big.)
By Andy Tisdel