What is a Machine Learning Data Catalog (MLDC)?

March 4th, 2021

What is a machine learning data catalog (MLDC)?

A machine learning data catalog (MLDC) is a next-generation data catalog that enables real-time data discovery and automates metadata crawling, cataloging, and the classification of personally identifiable information (PII).
    Once configured, the MLDC would:
  1. Crawl your data sources (on-premise or cloud data warehouses, lakes, databases)
  2. Understand and interpret technical metadata
  3. Create business descriptions and other such information to catalog data with context automatically
  4. Run periodic audits to verify the accuracy, quality and integrity of data
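As a rough illustration of steps 1 and 2, the sketch below crawls technical metadata (table names, column names, and declared types) from a database. It assumes a SQLite source purely so the example is self-contained; the function name `crawl_sqlite_metadata` and the sample tables are hypothetical, and a real catalog would use connectors for each warehouse, lake, or database it supports.

```python
import sqlite3


def crawl_sqlite_metadata(conn):
    """Return {table_name: [(column_name, declared_type), ...]} for user tables."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
        "ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info returns rows of (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        catalog[table] = [(col[1], col[2]) for col in columns]
    return catalog


# Hypothetical source: an in-memory database standing in for a real data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, signup_date TEXT)")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")

catalog = crawl_sqlite_metadata(conn)
for table, columns in catalog.items():
    print(table, columns)
```

The resulting dictionary is the raw technical metadata that later steps (business descriptions, classification, audits) would enrich.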

"Modern machine-learning-augmented data catalogs automate various tedious tasks involved in data cataloging, including metadata discovery, ingestion, translation, enrichment and the creation of semantic relationships between metadata. These next-generation data catalogs can therefore propel enterprise metadata management projects by allowing business users to participate in understanding, enriching and using metadata to inform and further their data and analytics initiatives."

-Gartner, Augmented Data Catalogs 2019

Why do you need machine learning data catalogs?

Machine learning data catalogs (MLDCs) that simplify finding and inventorying siloed data assets are a crucial first step in data and analytics projects. Gartner predicts that over 60% of traditional data catalog projects that don’t use machine learning to find and inventory data will fail.

Challenges with traditional data catalogs

    Maintaining traditional data catalogs is excruciating because organizations:
  1. Generate petabytes of data every day
  2. Store data in messy, unclassified and unusable formats
  3. Handle most aspects of cataloging manually

Data consumers cannot use obsolete, unverified data to inform business decisions. With increasingly tight regulations on data security, integrity, and privacy, mishandled data can also be expensive. Even the less severe tier of GDPR infringements can draw a fine of up to €10 million or 2% of worldwide annual revenue, whichever is higher.
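To make the "whichever is higher" rule concrete, here is a small worked example. It computes the upper limit for the lower tier of GDPR fines (the €10 million / 2% tier), not the fine a regulator would actually impose; the function name and the two revenue figures are hypothetical.

```python
def lower_tier_fine_cap_eur(worldwide_annual_revenue_eur: int) -> int:
    """Upper limit of a lower-tier GDPR fine: EUR 10 million or 2% of revenue, whichever is higher."""
    # Integer arithmetic avoids floating-point rounding on large amounts.
    two_percent = worldwide_annual_revenue_eur * 2 // 100
    return max(10_000_000, two_percent)


# Hypothetical companies: EUR 100 million and EUR 1 billion in annual revenue.
small_company_cap = lower_tier_fine_cap_eur(100_000_000)
large_company_cap = lower_tier_fine_cap_eur(1_000_000_000)
print(small_company_cap, large_company_cap)
```

For the smaller company, 2% of revenue is below €10 million, so the fixed amount sets the limit; for the larger one, the percentage does.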

The consequences of using traditional data catalogs

    Legacy data catalogs require extensive manual intervention, leading to:
  1. Endless delays in projects
  2. Hefty fines for not complying with data regulations (GDPR, CCPA, and similar laws)
  3. Difficulties in cross-team collaborations (in an increasingly distributed environment)

What are the key capabilities of a machine learning data catalog?

    According to G2, a machine-learning-powered data catalog should:
  1. Organize and consolidate data in a single repository (i.e., a single source of truth)
  2. Allow data consumers (especially business users) to search and access the data they need
  3. Let users categorize, comment and share data sets easily to improve collaboration
  4. Offer intelligent recommendations (using machine learning algorithms) to relevant data
  5. Enable user access management (UAM) for better data governance

Six essential features to expect from modern data catalogs

    While evaluating machine learning data catalogs (MLDCs), look for:
  1. Auto-cataloging
  2. Google-like semantic search
  3. Automated data lineage mapping
  4. Easy collaboration
  5. Automated quality audits and governance
  6. Intelligent recommendations

1. Auto-cataloging

A machine learning data catalog should automate tedious aspects of cataloging such as crawling metadata, classifying PII, and profiling data for quality (missing values, outliers, and other anomalies).

Regardless of where the data comes from (cloud warehouses, data lakes, or RDBMS), the catalog must be able to find and organize it.

Auto-classified data sets with adequate context help data consumers interpret the data and use it to make strategic decisions.
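The profiling and classification described in this section can be sketched for a single column as follows. This is a minimal illustration under stated assumptions, not how any particular catalog works: PII detection is reduced to a simple email pattern with an arbitrary 80% match threshold, and outliers are flagged with the common 1.5 × IQR rule. All names (`profile_column`, `EMAIL_RE`, the sample values) are hypothetical.

```python
import re
import statistics

# Deliberately simple pattern; real PII classifiers cover many more types.
EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def profile_column(values):
    """Profile one column: share of missing values, a PII hint, and numeric outliers."""
    present = [v for v in values if v is not None]
    profile = {
        "missing_ratio": 1 - len(present) / len(values) if values else 0.0,
        "looks_like_pii": False,
        "outliers": [],
    }

    strings = [v for v in present if isinstance(v, str)]
    if strings and len(strings) == len(present):
        matches = sum(1 for v in strings if EMAIL_RE.match(v))
        # Assumed threshold: flag the column if at least 80% of values look like emails.
        profile["looks_like_pii"] = matches / len(strings) >= 0.8

    numbers = [v for v in present if isinstance(v, (int, float))]
    if len(numbers) >= 4 and len(numbers) == len(present):
        q1, _, q3 = statistics.quantiles(numbers, n=4)
        iqr = q3 - q1
        low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
        profile["outliers"] = [v for v in numbers if v < low or v > high]

    return profile


# Hypothetical sample columns.
email_profile = profile_column(["a@example.com", None, "b@example.org", "c@example.net"])
amount_profile = profile_column([10, 12, 11, 13, 12, 11, 10, 13, 12, 500])
print(email_profile)
print(amount_profile)
```

A catalog would attach results like these to each column as context, so a data consumer can see at a glance that a field is sensitive or contains suspicious values.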

2. Google-like semantic search

Smart data catalogs like MLDCs should empower business and technical users alike to run “Google-like” searches on the metadata to address business outcomes.

Typing “Sales” in the search window should display a list of relevant data sets, which can be fine-tuned using filters on data type, source, format, and more.

The catalog should also provide one search window for all data and dashboards to improve user experience and make working with data a breeze.
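The search-plus-filters behavior can be approximated with a small sketch. Note that this is plain keyword matching with a naive score, not semantic search; a production catalog would likely rely on a search engine or embeddings. The `DatasetEntry` fields, the scoring weights, and the sample entries are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    name: str
    description: str
    source: str
    fmt: str
    tags: list = field(default_factory=list)


def search(entries, query, **filters):
    """Rank entries by keyword hits, keeping only those that match every filter."""
    terms = query.lower().split()
    results = []
    for entry in entries:
        # Filters are exact matches on metadata fields, e.g. source="s3".
        if any(getattr(entry, key) != value for key, value in filters.items()):
            continue
        name = entry.name.lower()
        text = " ".join([entry.description, *entry.tags]).lower()
        # Assumed weighting: a hit in the name counts double a hit in the description or tags.
        score = sum(2 * (term in name) + (term in text) for term in terms)
        if score > 0:
            results.append((score, entry))
    results.sort(key=lambda pair: (-pair[0], pair[1].name))
    return [entry for _, entry in results]


# Hypothetical catalog entries.
entries = [
    DatasetEntry("sales_q1", "Quarterly sales by region", "snowflake", "table", ["revenue"]),
    DatasetEntry("web_traffic", "Page views that feed the sales funnel", "s3", "parquet", ["marketing"]),
    DatasetEntry("hr_headcount", "Employee headcount by team", "postgres", "table", ["people"]),
]

all_hits = [e.name for e in search(entries, "Sales")]
s3_hits = [e.name for e in search(entries, "Sales", source="s3")]
print(all_hits, s3_hits)
```

Searching for “Sales” returns both sales-related entries, and adding the `source` filter narrows the list, mirroring the fine-tuning described above.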

3. Automated data lineage mapping

Data lineage shows where data sets originate, how they have evolved through their lifecycles, and which assets future changes will affect. Proving lineage, which builds trust in data and supports compliance, requires tracking every transformation a data set undergoes.

Lineage also helps build better, more relevant models. MLDCs should therefore parse the query logs of your data warehouses, data lakes and other data sources automatically to create a visual map of data lineage.

An MLDC would track every transformation that a data set undergoes and represent it graphically to help users verify its lineage.
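Deriving lineage from query logs can be sketched as building a graph from each query's target table to its source tables. The example below is a simplification that uses regular expressions on well-behaved `INSERT INTO ... SELECT` and `CREATE TABLE ... AS SELECT` statements; real SQL (subqueries, CTEs, quoted identifiers, `IF NOT EXISTS`) needs a proper parser. The function names and the sample log are hypothetical.

```python
import re
from collections import defaultdict

TARGET_RE = re.compile(r"\b(?:insert\s+into|create\s+table)\s+([\w.]+)", re.IGNORECASE)
SOURCE_RE = re.compile(r"\b(?:from|join)\s+([\w.]+)", re.IGNORECASE)


def build_lineage(query_log):
    """Map each written table to the set of tables it was directly derived from."""
    upstream = defaultdict(set)
    for query in query_log:
        target = TARGET_RE.search(query)
        if not target:
            continue  # Read-only queries do not create lineage edges.
        for source in SOURCE_RE.findall(query):
            if source != target.group(1):
                upstream[target.group(1)].add(source)
    return upstream


def trace_upstream(upstream, table):
    """Return every direct and indirect ancestor of a table."""
    seen, stack = set(), [table]
    while stack:
        for parent in upstream.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical query log entries.
query_log = [
    "INSERT INTO staging.orders SELECT * FROM raw.orders",
    "CREATE TABLE analytics.revenue AS SELECT o.id, c.region "
    "FROM staging.orders o JOIN raw.customers c ON o.customer_id = c.id",
    "SELECT * FROM analytics.revenue",
]

lineage = build_lineage(query_log)
ancestors = trace_upstream(lineage, "analytics.revenue")
print(sorted(ancestors))
```

The same graph, walked in the opposite direction, would answer the impact question: which downstream assets are affected if a source table changes.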

4. Easy collaboration

In the post-pandemic era, distributed teams are here to stay. MLDCs should facilitate collaboration across teams and geographies through built-in features such as inline chat, comments and annotations, data set ratings, and sharing of data sets via a single URL.

Modern data catalogs let data consumers discuss and collaborate within the platform through features like chats, comments, and more.

5. Automated quality audits and governance

Automated quality audits are an excellent way to ensure data quality, integrity and trustworthiness. Running scheduled audits to spot data regressions, data loss or distributional shifts over time can help certify data accuracy.

Running scheduled audits to verify the quality and integrity of data is a great way to certify its accuracy and usability.
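A scheduled audit of the kind described here can be sketched as a comparison between a baseline snapshot of a column and its current values. The three checks below (row-count drop, rise in missing values, shift of the mean measured in baseline standard deviations) and their thresholds are assumptions chosen for illustration; real audits would use checks and limits suited to each data set.

```python
import statistics


def audit_column(baseline, current, max_null_increase=0.05, max_mean_shift=3.0):
    """Compare a column against its baseline snapshot and return a list of findings."""

    def null_ratio(values):
        return sum(v is None for v in values) / len(values) if values else 0.0

    findings = []

    # Data loss: the column has lost more than half of its rows.
    if len(current) < 0.5 * len(baseline):
        findings.append("possible data loss: row count dropped by more than half")

    # Regression: noticeably more missing values than before.
    if null_ratio(current) - null_ratio(baseline) > max_null_increase:
        findings.append("regression: share of missing values increased")

    # Distributional shift: mean moved by several baseline standard deviations.
    base = [v for v in baseline if v is not None]
    curr = [v for v in current if v is not None]
    if len(base) >= 2 and curr:
        spread = statistics.stdev(base)
        shift = abs(statistics.mean(curr) - statistics.mean(base))
        if spread > 0 and shift / spread > max_mean_shift:
            findings.append("distributional shift: mean moved away from baseline")

    return findings


# Hypothetical snapshots of the same numeric column taken at two points in time.
baseline = [100, 102, 98, 101, 99, 100, 103, 97]
current = [150, None, 149, None, 152, 151, 148, 150]

findings = audit_column(baseline, current)
for finding in findings:
    print(finding)
```

Running such a function on a schedule and recording the findings against the data set is one way a catalog could surface whether the data is still fit for use.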

Tracking data usage right from the source is essential for better governance. Modern data catalogs also simplify governance for IT teams and data stewards by providing a single dashboard to establish policies and manage access logs and requests.

Handling access from a single window reduces delays in authorizing requests, which removes bottlenecks and simplifies the lives of your IT teams.

6. Intelligent recommendations

Intelligent recommendations of other data sets that might be relevant or of interest to data consumers enhance the overall user experience.

Just like the “People also ask” and “Searches related to…” sections on Google search, this feature can show similar data sets or curate data that matches the user’s search criteria to increase the value derived from data.
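One simple way to produce "similar data sets" is to compare the tags attached to each catalog entry. The sketch below uses Jaccard similarity (shared tags divided by all tags) as a stand-in for whatever model a real catalog uses; usage signals such as query history would typically be combined with it. The function names and the sample tags are hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity of two tag collections: |intersection| / |union|."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend(catalog_tags, dataset, top_n=3):
    """Return up to top_n other data sets ranked by tag overlap with the given one."""
    target = catalog_tags[dataset]
    scored = [
        (jaccard(target, tags), name)
        for name, tags in catalog_tags.items()
        if name != dataset
    ]
    # Drop unrelated data sets, then sort by score (highest first) and name.
    scored = [(score, name) for score, name in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_n]]


# Hypothetical tags attached to catalog entries.
catalog_tags = {
    "sales_q1": {"sales", "revenue", "region"},
    "sales_forecast": {"sales", "revenue", "forecast"},
    "regional_targets": {"region", "targets"},
    "hr_headcount": {"people", "headcount"},
}

suggestions = recommend(catalog_tags, "sales_q1")
print(suggestions)
```

A user viewing the quarterly sales data set would be pointed to the forecast and regional targets, while the unrelated HR data set is left out.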

Can machine learning data catalogs increase business value?

    Yes. To increase business value, the machine learning data catalog should:
  1. Demonstrate value that reflects on the business outcomes (i.e., cost savings, efficiency or revenue growth)
  2. Offer a risk-averse, as-a-service model to avoid massive upfront costs and archaic licensing fees
    Enterprises using Atlan’s machine-learning-augmented data catalog have reported:
  1. Up to 60X speed to insight
  2. 70% greater business engagement
  3. ROI realization within two weeks

Quick recap

  1. Machine learning data catalogs (MLDCs) enable real-time data discovery and automatic cataloging of data sets with adequate context.
  2. With MLDCs, organizations can build a single source of truth for all data, track lineage, and search for and access the right data via a single dashboard.
  3. Modern, augmented data catalogs facilitate improved collaboration within the organization, empower all data users to make data-driven decisions, simplify governance, and facilitate data democratization.
  4. While evaluating data catalogs, look for automated ingesting, inventorying, tagging, profiling, lineage mapping, and enrichment of data sets.
  5. Opt for pay-as-you-go pricing to avoid the risks (and regrets) that come with six-figure upfront licensing fees and long-term commitments.