Google Data Catalog: Guide on Everything You Need to Know

Updated September 12th, 2023

If you’re evaluating data catalogs for your GCP environment, Google Data Catalog might seem like a solid bet. However, before zeroing in on Google Data Catalog, look at key factors such as data connectors, cloud-agnostic nature, scalability, and ease of governance.



This article explores the key factors to consider during your GCP data catalog evaluation, then assesses four data catalogs suited to your cloud environment.


Table of contents #

  1. Google cloud Data catalogs: Factors to consider
  2. 4 Popular Google-cloud Data catalogs
  3. Dataplex - Google Data Catalog
  4. Atlan
  5. Apache Atlas
  6. Lyft’s Amundsen
  7. GCP Data catalog: Summing up
  8. Google data catalog: Related Resources

Let’s begin by looking at the primary evaluation criteria for GCP data catalogs.

Google cloud data catalogs: factors to consider #

  1. Data governance capabilities: Granular, role-based access controls for data security, privacy, and compliance with data regulations
  2. Lineage mapping and impact analysis: A visual map of data lineage that shows where data originates and how it is transformed, and helps you monitor the flow of data for impact analysis
  3. Stack maintenance: The entity responsible for managing the data stack, whether that is your engineering team alone or a responsibility shared with your service provider
  4. Data connectors available: The data sources supported and how easy it is to request or set up custom connectors for new sources
  5. Scalability: The ease of scaling resources for your data stack
  6. Cloud agnostic nature: The ability to support and enable interoperability across multiple cloud environments
  7. Hosting model: On-premise or hosted
  8. Technical support: The level of support offered to set up and manage the data catalog
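
One way to apply the criteria above is a weighted scoring matrix. The sketch below is only an illustration: the weights, the 1-5 rating scale, and the candidate names `catalog_a` and `catalog_b` are hypothetical placeholders, not assessments of any real tool.

```python
# Hypothetical weights for the eight criteria above; adjust them to your priorities.
WEIGHTS = {
    "governance": 0.20,
    "lineage": 0.15,
    "stack_maintenance": 0.10,
    "connectors": 0.15,
    "scalability": 0.10,
    "cloud_agnostic": 0.10,
    "hosting_model": 0.10,
    "support": 0.10,
}


def weighted_score(ratings, weights=WEIGHTS):
    """Return the weighted average of 1-5 ratings, one rating per criterion."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[c] * ratings[c] for c in weights) / total_weight


# Placeholder ratings for two made-up candidates (not real product scores).
candidates = {
    "catalog_a": dict.fromkeys(WEIGHTS, 3),
    "catalog_b": {**dict.fromkeys(WEIGHTS, 3), "lineage": 5, "connectors": 4},
}

# Rank candidates from highest to lowest weighted score.
ranking = sorted(candidates, key=lambda name: weighted_score(candidates[name]), reverse=True)
for name in ranking:
    print(f"{name}: {weighted_score(candidates[name]):.2f}")
```

Replacing the placeholder ratings with scores from your own proof of concept keeps the comparison tied to your priorities rather than to a generic feature list.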


While there are several solutions available, we’ll focus on four popular data catalog platforms:

  1. Dataplex - Google Data Catalog
  2. Atlan
  3. Apache Atlas
  4. Lyft’s Amundsen

Each tool supports various data sources and has its pros and cons. Some require further setup to ensure complete compatibility with GCP.

We’ll explore these aspects further to help you pick the right tool for your data stack. Let’s start with GCP’s own Data Catalog - Dataplex.


1. Dataplex - Google Data Catalog #

Google Data Catalog, which is now part of Dataplex, is a metadata management service from GCP. It is serverless and fully managed by GCP.

Supported data sources #


  • Google data sources: Google BigQuery, Pub/Sub
  • On-premise data sources: Connectors developed but not officially supported by GCP. Examples include connectors for MySQL, Amazon Redshift, Apache Hive, and SAP HANA.

Required GCP components #


Since Google Data Catalog is part of the GCP suite of services, it doesn’t require any additional components to ensure compatibility with GCP.

As we’re looking at cataloging data sets on GCP, this might seem to be the obvious choice. However, Google Data Catalog has its fair share of pros and cons.

Pros #


  • Use the Python API to access and modify the metadata components
  • Mask PII and other sensitive data, and access it securely
  • Define roles as per user categories

Cons #


  • Not cloud agnostic and offers limited cross-cloud compatibility
  • Limited support for data source connectors outside Google
  • Tech-centric UI makes it less ideal for business consumers of data
  • Integration isn’t straightforward using the available Python API
  • Data lineage and evolution don’t get mapped

To overcome the shortcomings of the Google Data Catalog, you can consider an enterprise data catalog like Atlan, or open-source tools such as Apache Atlas and Lyft’s Amundsen. Let’s explore each solution further.


Google data catalog resources #


Google Data Catalog overview | Google Data Catalog API reference | Google Data Catalog: How it works



2. Atlan #

Atlan is a third-generation enterprise data catalog powered by active metadata management that serves as a modern data workspace for business and technical data consumers. Atlan is built on the principles of embedded collaboration, programmable bots, and end-to-end visibility, and is open by default.

The user-friendly UI with a Google-like search lets you discover and look up data sets across your data ecosystem. The catalog offers rich context on every data set with active and passive metadata compiled from various data sources and applications.

Also, Atlan’s column-level lineage provides a bird’s-eye view of your data so that you can perform root cause analysis, assess impact, and understand how data sets connect with one another.
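
To make the idea of lineage-based impact and root cause analysis concrete, here is a generic sketch that walks a column-level lineage graph with a breadth-first search. It does not use Atlan's API; the `LINEAGE` mapping and all column names are made up for illustration.

```python
from collections import deque

# Hypothetical column-level lineage: each key feeds the columns listed in its value.
LINEAGE = {
    "raw.orders.amount": ["staging.orders.amount_usd"],
    "staging.orders.amount_usd": ["marts.revenue.daily_total", "marts.finance.gross_sales"],
    "marts.revenue.daily_total": ["dashboard.revenue.total"],
}


def downstream(column, lineage=LINEAGE):
    """Return every column reachable downstream of `column`, in breadth-first order."""
    seen, order = {column}, []
    queue = deque([column])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                order.append(child)
                queue.append(child)
    return order


def upstream(column, lineage=LINEAGE):
    """Return every column that `column` depends on (useful for root cause analysis)."""
    # Invert the edges, then reuse the same traversal in the opposite direction.
    reverse = {}
    for parent, children in lineage.items():
        for child in children:
            reverse.setdefault(child, []).append(parent)
    return downstream(column, reverse)


print(downstream("raw.orders.amount"))        # impact analysis
print(upstream("dashboard.revenue.total"))    # root cause analysis
```

The same traversal answers both questions: following edges forward shows what a change would affect, and following them backward shows where a bad value could have come from.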

Atlan: Modern data catalog for Google Cloud Platform.


Supported data sources #


Atlan supports connecting to various data sources, such as:

  • MySQL
  • PostgreSQL
  • Google BigQuery
  • AWS Redshift
  • Snowflake

Additionally, Atlan offers connectors for BI tools such as Tableau, Power BI, and Looker, and for transformation engines such as dbt Cloud and dbt Core.


A Demo of Atlan Data Catalog for Google Cloud Platform (GCP)


Pros #


  • Get a hyper-personalized experience with personas for user roles, business domains, and projects
  • Get context in your data tool of choice (Slack, Jira, Tableau, Looker, and so on) using Atlan’s Chrome plugin
  • Discuss data sets, raise support tickets, look for business definitions, and get alerts on your workflows without leaving Atlan or switching between apps
  • Set up custom classifications and access policies and propagate them via the relationships created in lineage mapping
  • Control who can read and edit metadata, data definitions, metric formulas, and business taxonomies
  • Host the infrastructure yourself or let Atlan take over the reins
  • Use open APIs to connect Atlan with any tool in your data stack
  • Pay-as-you-go pricing model

Cons #


  • Support for Kafka and other Hadoop data sources is still under development

Atlan data catalog resources #

Atlan data catalog overview | Atlan API reference | Atlan Data Catalog: User stories




3. Apache Atlas #

Apache Atlas is an open-source metadata management service that works within Hadoop. However, it’s not limited to Hadoop data sources and integrates easily with the enterprise data environment.

Supported data sources #


Required GCP components #


Apache Atlas stores metadata as a graph. Its core uses JanusGraph as the graph layer, with HBase as the backing store.

As a result, a Hadoop filesystem is essential for setting up Apache Atlas. This includes:

  • Dataproc Hadoop data storage: To install Apache Atlas on a Hadoop filesystem
  • Google Kubernetes Engine (GKE) pods running Apache Solr or Elasticsearch and Apache Ranger: For the search capability and masking of PII data
  • Cloud Pub/Sub or Confluent Cloud on GCP: A messaging service to communicate and publish the changes to the metadata

Apache Atlas: Data catalog and metadata management for Google Cloud. Source: Atlas


Pros #


  • Supports both Hadoop and non-Hadoop data sources
  • Offers out-of-the-box access to the REST APIs for external integration
  • Has features for the effective handling of PII and sensitive information
  • Lets you observe data lineage and evolution through its UI

Cons #


  • Works within the Hadoop environment
  • Involves a steep learning curve for setting up and managing the data stack
  • Must be integrated with Apache Ranger for data-masking
  • Doesn’t have official Docker images or Helm charts to spawn the whole stack

Apache Atlas resources #


Apache Atlas Overview | Apache Atlas Demo | Apache Atlas GitHub Repository


4. Lyft’s Amundsen #

Amundsen is an open-source data catalog originally developed by Lyft.

Supported data sources #


Amundsen can connect to databases through a DB-API or SQLAlchemy interface, and the project maintains a list of table connectors. Examples include Amazon Redshift, Apache Cassandra, and Google BigQuery.
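
To show what pulling table metadata through a DB-API connection involves, here is a minimal sketch using Python's built-in `sqlite3` driver. It is not Amundsen's own extractor code; the function `extract_table_metadata` and the `rides` table are hypothetical, and the catalog queries are specific to SQLite.

```python
import sqlite3


def extract_table_metadata(connection):
    """Return {table: [(column, declared type), ...]} for a SQLite connection."""
    cursor = connection.cursor()
    cursor.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    tables = [row[0] for row in cursor.fetchall()]
    metadata = {}
    for table in tables:
        # PRAGMA table_info is SQLite-specific; other databases expose
        # information_schema or a vendor catalog instead.
        cursor.execute(f'PRAGMA table_info("{table}")')
        # Each row is (cid, name, type, notnull, dflt_value, pk).
        metadata[table] = [(row[1], row[2]) for row in cursor.fetchall()]
    return metadata


# In-memory database with one made-up table, used only for demonstration.
connection = sqlite3.connect(":memory:")
connection.execute("CREATE TABLE rides (ride_id INTEGER PRIMARY KEY, fare REAL, city TEXT)")
metadata = extract_table_metadata(connection)
print(metadata)
```

A real ingestion job would run a similar query against the source database's own catalog and hand the rows to the metadata store and search index.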

Required GCP components #


Amundsen stores metadata in Neo4j and uses Elasticsearch for data discovery. Metadata is ingested using Amundsen’s Databuilder, which can be orchestrated using Apache Airflow.

The cloud components required to set up Amundsen on GCP include:

  • Google Kubernetes Engine (GKE) pods running Amundsen Helm Charts

OR

Lyft Amundsen: Data catalog and metadata management for Google Cloud. Source: Amundsen


Pros #


  • Supports a variety of Hadoop and non-Hadoop data sources
  • Officially supports Helm charts, which help in scaling the platform
  • Offers connector support for visualization platforms such as Tableau, Redash, Apache Superset, etc.

Cons #


  • No support for masking PII or sensitive information
  • Dependent on Apache Atlas for lineage support
  • Managing the architecture adds operational overhead

Amundsen data catalog resources #


Amundsen Lyft Overview | Amundsen Demo | Amundsen Setup Guide | Amundsen GitHub Repository | A Guide to Configure and Set up Amundsen on Google Cloud Platform (GCP)


Google cloud data catalog: Summing up #

Access to orchestration services such as Google Kubernetes Engine (GKE) has made it easier to host and scale your preferred catalog tools on GCP.

However, adopting a solution with minimal engineering overhead, seamless interoperability across cloud environments, and a simple UI is ideal. In addition to the points mentioned above, also consider our data catalog evaluation criteria to choose the perfect data catalog for your GCP environment.

