What is a data catalog?
Here’s a short story to help you understand the definition and value of a data catalog.
Two data scientists walk into a library at the end of a long day....
Data scientist #1: “Can I get a copy of this book on
Data scientist #2: “That book is super obscure. They’ll never be able to find it.”
The librarian (clacks away on the keyboard for a couple of seconds before replying): “Found it! Here are the details of its author, publishing house and borrowing history. Oh, and someone left a comment saying they found it super useful for understanding logistic regressions. I can grab it for you in a jiffy.”
Data scientist #1: “Ummmm… why can’t the same thing happen with our data?”
But what if it could? And it turns out, that’s exactly what a data catalog can help you do.
Ok, enough of the simple explanations. Here’s a more serious answer to the question, "What is a data catalog?"
"A data catalog creates and maintains an inventory of data assets through the discovery, description and organization of distributed datasets. The data catalog provides context to enable data stewards, data/business analysts, data engineers, data scientists and other line of business (LOB) data consumers to find and understand relevant datasets for the purpose of extracting business value."
- Gartner, Augmented Data Catalogs 2019. (Access for Gartner subscribers only.)
How does a data catalog work?
A data catalog links data with the assets that make it meaningful — documentation, queries, history, glossaries, etc. By combining metadata with data management, governance, and search capabilities, a data catalog helps a company organize its data, discover the right data assets, and evaluate if an asset is right for a specific use case.
Here’s what a truly powerful data catalog can do:
- Create a repository of all your data from various data sources, including notes on a data set’s structure, quality, definitions and usage.
- Allow users to access the metadata alongside the data itself.
- View and understand the lineage of the data—including the data source, the transformations applied and who has been using it.
- Ensure data consistency and accuracy by updating itself auto-magically, while allowing humans to edit and remain in the loop.
- Simplify data governance and compliance by providing a graphical representation of the lineage of the data assets—tracing it across its lifecycle.
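To make the capabilities above a little more concrete, here is a minimal sketch of a catalog that stores metadata, supports search and traces lineage. It is a toy in-memory illustration, not how any particular product works; the `DataAsset` and `DataCatalog` classes, their fields and the sample asset names are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataAsset:
    """One catalog entry: metadata about a dataset, not the data itself."""
    name: str
    source: str                                        # where the data lives (hypothetical labels)
    description: str = ""
    tags: set[str] = field(default_factory=set)
    upstream: list[str] = field(default_factory=list)  # assets this one is derived from


class DataCatalog:
    """Toy in-memory catalog supporting registration, search and lineage."""

    def __init__(self) -> None:
        self._assets: dict[str, DataAsset] = {}

    def register(self, asset: DataAsset) -> None:
        self._assets[asset.name] = asset

    def search(self, term: str) -> list[str]:
        """Return names of assets whose name, description or tags mention the term."""
        term = term.lower()
        hits = []
        for asset in self._assets.values():
            haystack = " ".join([asset.name, asset.description, *sorted(asset.tags)]).lower()
            if term in haystack:
                hits.append(asset.name)
        return sorted(hits)

    def lineage(self, name: str) -> list[str]:
        """Return every upstream ancestor of an asset, nearest first."""
        seen: list[str] = []
        queue = list(self._assets[name].upstream)
        while queue:
            current = queue.pop(0)
            if current in seen:
                continue
            seen.append(current)
            if current in self._assets:
                queue.extend(self._assets[current].upstream)
        return seen


catalog = DataCatalog()
catalog.register(DataAsset("raw_orders", "postgres", "Raw order events", {"orders"}))
catalog.register(DataAsset("clean_orders", "warehouse", "Deduplicated orders", {"orders"}, ["raw_orders"]))
catalog.register(DataAsset("revenue_report", "bi_tool", "Monthly revenue by region", {"finance"}, ["clean_orders"]))

print(catalog.search("orders"))           # ['clean_orders', 'raw_orders']
print(catalog.lineage("revenue_report"))  # ['clean_orders', 'raw_orders']
```

The point of the sketch is that the catalog holds descriptions of, and relationships between, data assets. A real catalog would add persistence, access control and automated updates on top of this idea.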
Why is a data catalog important?
According to Booz Allen Hamilton’s Data Science Playbook, businesses that deploy analytics across most of the organization, align daily operations with senior management’s goals, and incorporate big data will see a 1,000% increase in ROI.
We all know that data is important. But nowadays, it’s not enough to have data. Only the companies that can actually harness the enormous power of data are expected to win.
The pain of siloed and missing data is real, and it’s felt across organizations. Here’s what we saw on Reddit:
The problem with lack of data curation. Image courtesy: Reddit
Ensuring that teams can easily discover, truly understand and effectively consume the data they need is a huge challenge to using data effectively. The solution? A data catalog.
"The two biggest challenges in data management are centered around data catalogs—finding and identifying data that delivers value, and supporting data governance, data privacy and data security."
- Gartner, Gartner Data Management Strategy Survey 2017
Why do you need a data catalog?
If you're serious about becoming a data-driven organization, you'll need a data catalog. A data catalog helps organizations create a home for their data: a single place where all data and information about the data live. This makes it quicker and easier for teams to access and use data in their daily work.
Here’s a six-step checklist to find out if you need a data catalog:
- Do you spend way more time looking for the data you need than the time you spend using it?
- Do you know less about the data than you think you should?
- Do you know the source of your data?
- Do you know the quality of the data?
- Can you rate your data assets?
- Can you get and give data access easily and securely?
If your answer to any of the above is a big resounding “UMMMMM”, the writing’s on the wall. It’s time to get a data catalog.
A data catalog will help solve team messages like these:
- I requested access to this data 7 days ago. Can you give me access?
- What does this column name mean?
- There are 4 versions of the data files. Which is the final one?
- Can you rerun the report so we can send it to the boss?
- The pipeline has failed. I will need some time to fix it.
How can a data catalog help?
Here are the benefits of a data catalog:
- Save money through productivity gains and improved data asset monitoring.
- Save time—deliver more data projects with 30% less data team time.
- Increase data quality for better business decisions.
- Reduce dependencies and save the IT team’s time by enabling self-service access to data.
- Improve data culture for higher retention of quality data professionals.
- Reduce data risk with improved compliance with regulations such as GDPR and better protection of PII.
Data catalogs help with metadata management. They let you easily access both your data and its important business context across all your data sources, from the cloud to your BI tools.
Here’s what that means in a modern context:
Modern machine-learning-augmented data catalogs automate various tedious tasks involved in data cataloging, including metadata discovery, ingestion, translation, enrichment and the creation of semantic relationships between metadata. These next-generation data catalogs can therefore propel enterprise metadata management projects by allowing business users to participate in understanding, enriching and using metadata to inform and further their data and analytics
- Gartner, Augmented Data Catalogs 2019. (Access for Gartner subscribers only.)
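As a rough illustration of what automated metadata discovery can look like, the sketch below infers a type and a missing-value count for each column of a small CSV sample. It is a simple rule-based stand-in, not the machine-learning approach the quote describes; the function names `infer_type` and `discover_metadata` and the sample data are hypothetical.

```python
import csv
import io


def infer_type(values: list[str]) -> str:
    """Guess a column type from its non-empty string values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "unknown"
    # Try the strictest type first, then fall back to looser ones.
    for type_name, caster in (("integer", int), ("float", float)):
        try:
            for v in non_empty:
                caster(v)
            return type_name
        except ValueError:
            continue
    return "string"


def discover_metadata(csv_text: str) -> dict[str, dict[str, object]]:
    """Build per-column metadata (type, missing count) from CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    metadata: dict[str, dict[str, object]] = {}
    for column in reader.fieldnames or []:
        values = [row[column] or "" for row in rows]
        metadata[column] = {
            "type": infer_type(values),
            "missing": sum(1 for v in values if v == ""),
        }
    return metadata


# Hypothetical sample: one missing amount and one missing region.
sample = "order_id,amount,region\n1,19.99,EU\n2,,US\n3,5.00,\n"
for column, info in discover_metadata(sample).items():
    print(column, info)
```

In practice, a catalog would run this kind of profiling against live sources on a schedule and let people correct or enrich the results, which keeps humans in the loop.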
Examples of data catalog tools
To solve their internal data-handling problems, a number of big companies have built their own data discovery and cataloging solutions, such as Facebook’s Nemo and Shopify’s Artifact. Some of these tools have even been released as free open-source data catalogs, like LinkedIn’s DataHub, Lyft’s Amundsen and WeWork’s Marquez.
While these tools may be free, they come with their own set of challenges, such as difficult deployment, the need for engineering resources to set them up, and the lack of IT teams to handle maintenance and support.
On the other hand, there are paid data catalog tools that take care of most of these challenges, but they may have other downsides, like high upfront costs and license lock-in.
Whether open-source or paid, most of these tools profess to provide the same, oft-lauded features:
- A catalog of your data and metadata in one place
- Mechanisms to govern your data and make it usable
However, don’t forget that simply plugging in an isolated tool within your data lake may not be the answer to your data woes. The problem with many of these data catalog tools is that they fail to deliver on the promise of data democratization.
While they bring your data and metadata into one place, the overall data experience is far from optimal. Technical features are often built at the expense of navigability, so diverse, non-technical data users in the organization find these tools hard to adopt. As a result, the tools are very likely (and ironically) doomed to become siloed themselves!
So what is the answer, you ask? It’s two-fold. First, choose the right data catalog — one that’s built for both technical users and non-technical users. Second, build a culture, rather than just tools, around data.
"Many companies have invested heavily in technology as a first step toward becoming data-oriented, but this alone clearly isn’t enough. Firms must become much more serious and creative about addressing the human side of data if they truly expect to derive meaningful business benefits."
- Randy Bean and Thomas H. Davenport, HBR
Keep in mind that your data users are humans, and consist of both technical and non-technical users. Consider their respective needs and challenges, and build a data culture that will support data teams and help them succeed.
A data catalog will help you:
- Create a single source of truth for your data across all its applications
- Make data cataloging a part of your data processes, not an isolated activity
- Quickly access and share the insights you need via a centralized repository
- Enforce and simplify data security and compliance (GDPR, CCPA, etc.)
And that’s it! Time to go forth and jumpstart your data management strategy to create one source of truth for your data.