The 4 Essential Modern Data Catalog Features to Look for in 2022

June 10th, 2022

The 4 fundamental features of third-gen data catalogs are:

  1. Programmable bots: Let you build custom AI/ML algorithms to automatically tackle the various challenges in data management such as data search and discovery, updates, classification, anomaly detection, and observability
  2. Embedded collaboration: Weaves into the daily workflows of data teams seamlessly, rather than existing as a standalone tool, to simplify data sharing and monitor access requests
  3. End-to-end visibility: One interface to get all the context about data (who owns a data set, where it comes from, how it has changed, and how to use it), thereby supporting data lineage and impact analysis
  4. Open-by-default: An openly accessible API layer to drive and support numerous use cases for the modern data stack such as custom data syncs or sharing query logs and SQL codes

However, if you run a search for essential data catalog features, you’ll be bombarded with aspects such as business glossaries, data lineage, semantic discovery, and built-in integrations for prominent data stack tools.

While we could populate an unending list of features, it doesn’t guarantee anything. Features alone don’t translate into business value. HBR says it best:

The secret to a good experience isn’t the multiplicity of features on offer.

Ultimately, you don’t need “yet another tool” for your data stack. And the last thing you want to do is open up another tool (aka the traditional data catalog), look for the dashboard, and browse through metadata just to understand a data set.

Instead, you want a solution that’s collaborative, enables self-service, and lets you access context wherever you need it — Slack, JIRA, the query editor, Salesforce, or the data warehouse.

That’s where third-generation data catalogs with characteristic features like custom AI/ML bots, two-way data flow, end-to-end visibility and open API layers stand out.

Unlike their predecessors, modern data catalogs won’t passively collect and store metadata in yet another silo, aka expensive shelfware. Instead, they leverage passive and active metadata, and send it back into every tool in the data stack.

Let’s explore each data catalog feature to wrap our heads around their significance to modern cataloging and data management.

What does a modern data catalog system look like? An example | Source: Atlan

1. Programmable bots

Data catalogs should simplify search and discovery and enable effective metadata management. The feature that drives these outcomes is programmability in AI/ML algorithms.

That’s why a key feature for third-generation data catalogs is programmable bots — they let you use AI/ML algorithms to build custom bots. These bots automate several aspects of finding, compiling, and inventorying data, irrespective of the type, format, or source.

Existing data catalogs still require manual data handling processes while juggling massive volumes of data. That’s why a core premise of third-generation data catalogs is programmable bots tackling the various challenges in data search, discovery, and analytics.

Let’s see how by taking a brief detour to the past.

Programmable bots: As it became a need over time

Initially, finding and documenting data sets involved several arduous, manual processes such as consolidating data, updating and enriching data sets, and curating metadata. The toughest aspect was updating data sets — a challenge data practitioners have been facing since the 1980s.

Here’s how industry veteran Andy Hayler puts it:

As soon as you document the applications, who owns them and what data they have, new systems spring up around the company. People running projects putting in new systems usually have no incentive to tell the data catalog team what they are doing, so it inevitably becomes out of date. When business users consult the data catalog and discover it does not have an up-to-date list of data assets, they start to distrust it and go elsewhere to find what they need.

As the number of data sources, types, users, and deployment models exploded, the challenge became more pressing, with data teams spending more time looking for data, understanding what lives where, and demystifying its meaning.

This paved the way for augmented data catalogs.

In 2019, Gartner recognized augmented data catalogs, which use ML to automate manual efforts in cataloging, as a must-have for data teams:

“Modern machine-learning-augmented data catalogs automate various tedious tasks involved in data cataloging, including metadata discovery, ingestion, translation, enrichment and the creation of semantic relationships between metadata.” Gartner, Augmented Data Catalogs, 2019 (access for Gartner subscribers only)

Soon, almost every data catalog solution started offering augmented catalogs with a “one size fits all” approach.

But here’s the catch — one algorithm cannot magically create context, identify anomalies, and achieve the intelligent data management dream — for every industry, company, and use case.

That’s why third-generation tools rely instead on programmable bots — a framework that lets teams create their own machine learning or data science algorithms.

Programmable bots: In action

The best way to visualize this feature in action is to think of Slack bots — you can configure Slack bots to react to channel messages or prompt users to engage with a post.

Similarly, programmable bots in data catalogs can be set up for use cases in security and compliance, classification, and observability. You can use these bots to automatically:

  • Flag risky or bad data and outliers
  • Classify and tag data assets according to data types and geography-specific regulations
  • Deduce the data table owners and experts
  • Offer recommendations using active metadata analysis

Teams can also:

  • Create their own bots based on CIA (Confidentiality, Integrity, Availability) ratings for information security
  • Customize observability algorithms to their data ecosystems and use cases
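
To make the idea of a programmable bot concrete, here is a minimal Python sketch of a classification bot. It is only an illustration under stated assumptions: the `Column` type, the `pii_bot` function, the tag names, and the name patterns are all hypothetical and do not reflect any particular catalog's API. A real bot would likely inspect sample values and regulations as well, not just column names.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Set


@dataclass
class Column:
    """A hypothetical catalog entry for one column."""
    table: str
    name: str
    tags: Set[str] = field(default_factory=set)


# In this sketch a "bot" is any function that inspects a column
# and returns the set of tags it wants to attach.
Bot = Callable[[Column], Set[str]]

# Assumed naming conventions; a team would tune these to its own data.
PII_PATTERNS = {
    "pii:email": re.compile(r"e?mail", re.IGNORECASE),
    "pii:phone": re.compile(r"phone|mobile", re.IGNORECASE),
    "pii:national_id": re.compile(r"ssn|passport", re.IGNORECASE),
}


def pii_bot(column: Column) -> Set[str]:
    """Tag a column as PII when its name matches a known pattern."""
    return {tag for tag, pattern in PII_PATTERNS.items()
            if pattern.search(column.name)}


def run_bots(columns: Iterable[Column], bots: List[Bot]) -> List[Column]:
    """Apply every bot to every column and merge the resulting tags."""
    result = list(columns)
    for column in result:
        for bot in bots:
            column.tags |= bot(column)
    return result


columns = run_bots(
    [
        Column("customers", "email_address"),
        Column("customers", "signup_date"),
        Column("orders", "contact_phone"),
    ],
    bots=[pii_bot],
)

for column in columns:
    print(f"{column.table}.{column.name}: {sorted(column.tags)}")
```

Because each bot is just a function, a team could register its own bots (for example, one that assigns CIA ratings) next to `pii_bot` without changing `run_bots`.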

2. Embedded collaboration

Data catalogs are meant to solve the context problem — what does this data mean and where do I use it? That’s why an essential feature for modern data catalogs is embedded collaboration, which borrows principles from the modern tools that teams already use and love.

A core principle behind these tools is flow — micro-flows powering two-way movement of data. As a result, context will be available wherever and whenever you need it.

But before looking at it in action, let's recap the evolution of embedded collaboration as a need.

Embedded collaboration: As it became a need over time

Historically, data catalogs were built with IT users in mind. Today, data users have become more diverse, and yet data catalogs haven't caught up.

Second-generation data catalogs offer some business-facing UIs, but as an afterthought. However, as Chris Williams and John Bodley at Airbnb said, “designing the interface and user experience of a data tool should not be an afterthought.”

Derek Steer sums it up perfectly using this Yelp analogy in his Forbes article:

Think about Yelp, a consumer app built upon data to allow people to make choices in their daily lives. You don’t have to be an analyst to utilize Yelp. The app creates a seamless experience that enables people to make data-driven decisions when choosing a restaurant, a dentist, a bar or a beauty salon. Nobody has to force people to use Yelp. They just do it on their own because they find the information to be useful.

This is where the idea of embedded collaboration comes in handy — embedded collaboration is about work happening where you are, with the least friction.

Instead of forcing your data practitioners to go to a separate tool, third-generation catalogs bring context to existing tools like Looker, dbt, and Slack with active metadata management.

Need a refresher on the fundamentals of active metadata management? Check out our comprehensive explainer on Active Metadata Management.

Embedded collaboration: In action

Embedded collaboration is the solution to tool fatigue for data teams, as it ensures data flows from various sources to the warehouse or lake, and vice versa. So, business users can find and use the data they need without switching to a separate catalog for context.

With a feature like embedded collaboration, you could implement use cases such as:

  • Requesting and getting access to data assets via a link
  • Approving or rejecting access requests using your favorite collaboration tool
  • Configuring data quality alerts on Slack so that your team can ask questions about a data asset and get context directly in Slack
  • Triggering support requests on JIRA without leaving the screen where you’re investigating a data asset
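
As a rough illustration of the Slack alert use case, the Python sketch below turns a failed data quality check into a message body. The `QualityAlert` fields, the asset name, and the catalog URL are hypothetical. The only Slack-specific assumption is that the message is a JSON object with a `text` field, which is the simplest shape Slack incoming webhooks accept. The sketch stops at building the body and does not send anything.

```python
import json
from dataclasses import dataclass


@dataclass
class QualityAlert:
    """A hypothetical record of one failed data quality check."""
    asset: str
    check: str
    observed: float
    threshold: float
    owner: str
    catalog_url: str


def to_slack_payload(alert: QualityAlert) -> dict:
    """Build a message that carries the asset's context into the channel."""
    text = (
        f":warning: Data quality check `{alert.check}` failed on `{alert.asset}`\n"
        f"Observed {alert.observed:.1%}, threshold {alert.threshold:.1%}\n"
        f"Owner: {alert.owner}\n"
        f"Context: {alert.catalog_url}"
    )
    return {"text": text}


alert = QualityAlert(
    asset="analytics.orders",
    check="null_rate(customer_id)",
    observed=0.072,
    threshold=0.01,
    owner="@data-platform",
    # Placeholder URL, not a real catalog link.
    catalog_url="https://catalog.example.com/assets/analytics.orders",
)

payload = to_slack_payload(alert)
body = json.dumps(payload)  # this string would be POSTed to a webhook URL
print(payload["text"])
```

The point of the design is that the owner and a link back to the asset travel with the alert, so the conversation can happen in the channel instead of in a separate catalog tab.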

Truly democratize data analytics with collaboration. Source: Atlan

3. End-to-end visibility

Another core promise of data catalogs has been setting up a single source of truth, and the feature that makes it possible is end-to-end visibility.

As you add more tools to the data stack, the data sources and volumes will continue to skyrocket. Without knowing the origins of a data set or the transformations it has undergone, it’s not easy to understand, trust, and use that data for anything.

That’s why end-to-end visibility is yet another essential feature of modern data catalogs. Let’s delve deeper into the modern data stack to understand the need for this feature.

End-to-end visibility: As it became a need over time

As the modern data stack became mainstream, it introduced several powerful tools to the data tech stack — data lineage tools, data quality tools, data prep tools, and more.

However, these disparate tools also generated large volumes of data spread across different places. As a result, data consumers need to ask multiple people and check information across multiple tools to get the complete picture of a data asset.

Previous versions of data catalogs improved data discovery, but they still didn’t solve the problem of a “single source of truth”.

As the data catalog was a separate entity, different teams would download data sets, work on them, make changes, and use them, without also updating the catalog. So, multiple versions of the same data sets, inconsistencies, and redundant data sets still existed.

That’s why end-to-end visibility is a vital feature of third-generation data catalogs as it answers questions such as:

  • Who owns a data set or which columns are used most, using auto-generated query histories
  • Where data sets come from, using automated lineage
  • If data is trustworthy, with quality scores and information on the latest updates

As a result, you get a preview of each data set in one place, rather than toggling between various data tools.
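
The "where does this data come from" question boils down to walking a graph of dependencies. Below is a minimal Python sketch, assuming lineage is already available as source-to-target edges between assets. The `LineageGraph` class and the asset names are hypothetical; how a real catalog stores and extracts lineage will differ.

```python
from collections import defaultdict, deque


class LineageGraph:
    """A hypothetical in-memory lineage graph keyed by asset name."""

    def __init__(self):
        self._downstream = defaultdict(set)
        self._upstream = defaultdict(set)

    def add_edge(self, source, target):
        """Record that `target` is built from `source`."""
        self._downstream[source].add(target)
        self._upstream[target].add(source)

    @staticmethod
    def _walk(start, edges):
        """Breadth-first traversal that returns every reachable asset."""
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for neighbor in edges.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        seen.discard(start)  # only relevant if the graph contains a cycle
        return seen

    def upstream(self, asset):
        """Everything the asset depends on, directly or indirectly."""
        return self._walk(asset, self._upstream)

    def downstream(self, asset):
        """Everything affected if the asset changes (impact analysis)."""
        return self._walk(asset, self._downstream)


graph = LineageGraph()
graph.add_edge("raw.orders", "staging.orders")
graph.add_edge("raw.customers", "staging.customers")
graph.add_edge("staging.orders", "marts.revenue")
graph.add_edge("staging.customers", "marts.revenue")
graph.add_edge("marts.revenue", "dashboard.weekly_revenue")

print(sorted(graph.upstream("marts.revenue")))
print(sorted(graph.downstream("staging.orders")))
```

The same traversal serves both directions: `upstream` answers where a data set comes from, and `downstream` shows what would be affected by a change.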

Visualize data lineage — both upstream and downstream. Source: Atlan

End-to-end visibility: In action

End-to-end visibility is what you get with column-level lineage, data quality profiling, visual data previews, and custom metadata from ETL tools, orchestration tools, and more.

Third-generation data catalogs also let you share queries and data sets with a link, so that everyone’s on the same page.

End-to-end-visibility in a data catalog. Source: Atlan

4. Open-By-Default

As data grows in volume and veracity, there’s no way to predict all possible use cases in analytics. However, if the data catalog acts as a fundamental meta store with an openly accessible API layer, data teams can innovate and drive a range of future operational use cases.

Third-generation data catalogs will connect to all parts of the data stack so that data practitioners can understand and trust their data more. That is only possible when being open by default is a core feature.

Open-By-Default: As it became a need over time

The modern data stack became prominent once cloud-native data warehouses with massively parallel processing (MPP) capabilities, like Redshift, became mainstream. This led to the development of low-code, cloud-native tools with low overheads that are easy to integrate and scale: the Modern Data Stack (MDS).

While several characteristics define these tools, two stand out:

  • Open-source core components with paid add-on features
  • Vast communities supporting the creative ecosystems around these tools

When a catalog has to support such an ecosystem of open standards and tools, it must also be open by default. Moreover, it must actively leverage metadata to connect the entire data ecosystem.

Let’s see what that feature looks like in action.

Open-By-Default: In action

Here’s an example of the benefits of an open-by-default data catalog with active metadata management:

Query logs are just one kind of metadata available today. By parsing through the SQL code from query logs in Snowflake, it’s possible to create column-level lineage automatically, assign a popularity score to every data asset, and even deduce the potential owners and experts for each asset.
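
The query-log idea can be sketched in a few lines of Python. This is a deliberately simplified illustration: the log entries, user names, and the `summarize` helper are hypothetical, and the regular expression only catches plain `FROM` and `JOIN` references. A real implementation would use a proper SQL parser to handle subqueries, CTEs, and quoted identifiers, and would need far more work to derive column-level lineage.

```python
import re
from collections import Counter, defaultdict

# Naive pattern: the identifier that follows FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:from|join)\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def tables_in(sql):
    """Return the set of table names referenced in one SQL statement."""
    return {name.lower() for name in TABLE_REF.findall(sql)}


def summarize(query_log):
    """Derive a popularity score and a likely expert for each table."""
    popularity = Counter()
    users_by_table = defaultdict(Counter)
    for entry in query_log:
        for table in tables_in(entry["sql"]):
            popularity[table] += 1
            users_by_table[table][entry["user"]] += 1
    # Assumption: whoever queries a table most often is a likely expert.
    likely_experts = {
        table: users.most_common(1)[0][0]
        for table, users in users_by_table.items()
    }
    return popularity, likely_experts


# Hypothetical log entries; a real log would be exported from the warehouse.
query_log = [
    {"user": "asha", "sql": "SELECT * FROM sales.orders o JOIN sales.customers c ON o.customer_id = c.id"},
    {"user": "asha", "sql": "select count(*) from sales.orders"},
    {"user": "ben", "sql": "SELECT region FROM sales.customers"},
    {"user": "ben", "sql": "SELECT id FROM sales.customers WHERE region = 'EU'"},
]

popularity, likely_experts = summarize(query_log)
print(popularity.most_common())
print(likely_experts)
```

Even this crude pass shows why an open API layer matters: once query logs are accessible, usage-based metadata can be computed without anyone documenting it by hand.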

Think of open-by-default catalogs along the lines of features like custom design plugins on Figma or code sharing on npm. So, if you use MongoDB, you can build custom plugins for new data asset syncs.

Features of future data catalogs: Built around agility, trust and collaboration

Data catalogs should offer reliable context wherever you want it, in real time. They shouldn't be yet another tool built solely to browse for data, because that ultimately makes them redundant.

Josh Wills summarized this issue in a tweet that resonates with data practitioners everywhere:

To my many friends/followers doing metadata/catalog startups, I have a request: please integrate the metadata info with my BI tool so that I can see it while I am doing queries. I have no desire to ever visit a third website to just "browse the metadata."

Building such a future, where data catalogs don’t live in their own “third website” but in your tool of choice, is the driving force behind third-generation data catalogs. These new data catalogs will be built around diverse data assets, “big metadata”, end-to-end data visibility, and embedded collaboration.

That, in turn, means moving away from passive data catalogs and switching to a solution that activates metadata and weaves it into the daily workflows of data teams — like reverse ETL, but for metadata. Metadata is the key to achieving seamless data flow across the data stack.

Activating metadata can support several use cases such as observability, cost management, remediation, quality, security, programmatic governance, auto-tuned pipelines, and more.

With active metadata management working in tandem with the four features covered above, third-generation data catalogs will finally check all the right boxes:

  • Facilitate data search and discovery
  • Enable open knowledge-sharing and collaboration
  • Build trust in data
  • Ensure governance and regulatory compliance
  • Enable data democratization without compromising data security and privacy

Interested in knowing how to find and implement such a catalog and drive business value?

Do take Atlan for a spin. Atlan is a third-generation modern data catalog built on the framework of embedded collaboration that is key in today’s modern workplace, borrowing principles from GitHub, Figma, Slack, Notion, Superhuman, and other modern tools that are commonplace today.

Everything you need to know about modern data catalogs

Adopting a modern data catalog is the first step towards data discovery. In this guide, we explore the evolution of the data management ecosystem, the challenges created by traditional data catalog solutions, and what an ideal, modern-day data catalog should look like. Download now!