Data Catalog for the Google Cloud Platform (GCP) — Everything You Need to Know

August 16th, 2022


If you’re evaluating data catalogs for the Google Cloud Platform (GCP), the Google Data Catalog might seem like a solid bet. However, before zeroing in on Google Data Catalog, you should weigh key factors such as data connectors, cloud-agnostic nature, scalability, and ease of governance.

This article explores the key factors to consider during your data catalog evaluation, then assesses four data catalogs suited to your cloud environment.

Let’s begin by looking at the primary evaluation criteria.

Data catalogs on GCP: Factors to consider

  1. Data governance capabilities: Granular, role-based access controls for data security, privacy, and compliance with data regulations
  2. Lineage mapping and impact analysis: A visual map of data lineage that shows where data originates and how it is transformed, and helps you monitor data flows for impact analysis
  3. Stack maintenance: Who manages the data stack. Is it your engineering team alone, or a shared responsibility between you and your service provider?
  4. Data connectors available: The data sources supported, and how easy it is to request or set up custom connectors for new sources
  5. Scalability: The ease of scaling resources for your data stack
  6. Cloud agnostic nature: The ability to support and enable interoperability across multiple cloud environments
  7. Hosting model: On-premise or hosted
  8. Technical support: The level of support offered to set up and manage the data catalog



Data catalogs for GCP

While there are several solutions available, we’re going to focus on four platforms that are popularly considered:

  1. Google Data Catalog
  2. Atlan
  3. Apache Atlas
  4. Lyft’s Amundsen

Each tool supports various data sources and has its pros and cons. Some require further setup to ensure complete compatibility with GCP.

We’ll explore these aspects further to help you pick the right tool for your data stack. Let’s start with GCP’s Google Data Catalog.


Google Data Catalog

The Google Data Catalog is a metadata management service from GCP. It is serverless and fully managed by GCP.

Supported Data Sources

  • Google data sources: Google BigQuery, Pub/Sub
  • On-premise data sources: Connectors exist but are not officially supported by GCP. Examples include connectors for MySQL, Amazon Redshift, Apache Hive, and SAP HANA.

Required GCP Components

Since Google Data Catalog is part of the GCP suite of services, it doesn’t require any additional components to ensure compatibility with GCP.

As we’re looking at cataloging data sets for GCP, this might seem to be the obvious choice. However, the Google Data Catalog has its fair share of pros and cons.

Pros

  • Use the Python API to access and modify the metadata components
  • Mask and securely access PII and other sensitive data
  • Define roles as per user categories
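As a rough illustration of the Python API mentioned above, the sketch below looks up the catalog entry for a BigQuery table using the google-cloud-datacatalog client library. The project, dataset, and table names are hypothetical. The lookup itself assumes the client library is installed and GCP credentials are configured, so treat this as a starting point rather than a complete integration.

```python
def bigquery_linked_resource(project: str, dataset: str, table: str) -> str:
    """Build the fully qualified resource name for a BigQuery table."""
    return (
        "//bigquery.googleapis.com/"
        f"projects/{project}/datasets/{dataset}/tables/{table}"
    )


def lookup_table_entry(project: str, dataset: str, table: str):
    """Fetch the Data Catalog entry for a BigQuery table.

    Assumes google-cloud-datacatalog is installed and that application
    default credentials are available in the environment.
    """
    # Imported lazily so the helper above works without the client library.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()
    resource = bigquery_linked_resource(project, dataset, table)
    return client.lookup_entry(request={"linked_resource": resource})


# Hypothetical identifiers, for illustration only.
example_resource = bigquery_linked_resource("my-project", "sales", "orders")
```

Calling `lookup_table_entry("my-project", "sales", "orders")` would then return the entry for that table, assuming it exists and the caller has permission to read it.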

Cons

  • Not cloud agnostic and offers limited cross-cloud compatibility
  • Limited support for data source connectors outside Google
  • Tech-centric UI makes it less ideal for business consumers of data
  • Integration through the available Python API isn’t straightforward
  • Data lineage and evolution don’t get mapped

To overcome the shortcomings of the Google Data Catalog, you can consider an enterprise data catalog like Atlan, or open-source tools such as Apache Atlas and Lyft’s Amundsen. Let’s explore each solution further.

Google Data Catalog resources

Google Data Catalog overview | Google Data Catalog API reference | Google Data Catalog: How it works



Atlan

Atlan is a third-generation enterprise data catalog powered by active metadata management that serves as a modern data workspace for business and technical data consumers. Atlan is built on the principles of embedded collaboration, programmable bots, end-to-end visibility, and openness by default.

The user-friendly UI with a Google-like search lets you discover and look up data sets across your data ecosystem. The catalog offers rich context on every data set with active and passive metadata compiled from various data sources and applications.

Also, Atlan’s column-level lineage provides a bird’s-eye view of your data so that you can perform root cause analysis, assess impact, and understand how data sets connect with one another.

Atlan: Modern data catalog for Google Cloud Platform.



Supported data sources

Atlan supports connecting to various data sources, such as:

  • MySQL
  • PostgreSQL
  • Google BigQuery
  • AWS Redshift
  • Snowflake

Additionally, Atlan offers connectors for BI tools such as Tableau, Power BI, and Looker, and for transformation engines such as dbt Cloud and dbt Core.




Pros

  • Get a hyper-personalized experience with personas for user roles, business domains, and projects
  • Get context in your data tool of choice (Slack, Jira, Tableau, Looker, etc.) using Atlan’s Chrome plugin
  • Discuss data sets, raise support tickets, look for business definitions, and get alerts on your workflows without leaving Atlan or switching between apps
  • Set up custom classifications and access policies and propagate them via the relationships created in lineage mapping
  • Control who can read and edit metadata, data definitions, metric formulas, and business taxonomies
  • Host the infrastructure yourself or let Atlan take over the reins
  • Use open APIs to connect Atlan with any tool in your data stack
  • Pay-as-you-go pricing model
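To make the idea of propagating classifications through lineage more concrete, here is a minimal sketch of the general technique: a breadth-first walk over a lineage graph that tags every downstream asset. This is not Atlan's implementation. The graph shape, asset names, and the `propagate_classification` helper are all hypothetical and only illustrate the concept.

```python
from collections import deque


def propagate_classification(lineage, source, tag, tags=None):
    """Apply `tag` to `source` and to every asset downstream of it.

    lineage: dict mapping an upstream asset to its direct downstream assets.
    tags:    optional dict of existing tags per asset (not mutated).
    """
    result = {asset: set(labels) for asset, labels in (tags or {}).items()}
    queue = deque([source])
    visited = {source}
    while queue:
        asset = queue.popleft()
        result.setdefault(asset, set()).add(tag)
        for downstream in lineage.get(asset, []):
            if downstream not in visited:  # guard against cycles and repeats
                visited.add(downstream)
                queue.append(downstream)
    return result


# Hypothetical lineage: raw tables feed staging tables, which feed marts.
lineage = {
    "raw.customers": ["staging.customers"],
    "staging.customers": ["marts.customer_ltv", "marts.churn"],
    "raw.orders": ["marts.customer_ltv"],
}
tagged = propagate_classification(lineage, "raw.customers", "PII")
```

In this example, the tag reaches `staging.customers` and both marts, while `raw.orders` stays untagged because it is not downstream of `raw.customers`.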

Cons

  • Support for Kafka and other Hadoop data sources is still under development.

Atlan data catalog resources

Atlan data catalog overview | Atlan API reference | Atlan Data Catalog: User stories




Apache Atlas

Apache Atlas is an open-source metadata management service that works within Hadoop. However, it’s not limited to Hadoop data sources and integrates easily with the enterprise data environment.

Supported data sources

Required GCP components

Apache Atlas stores metadata as a graph: its core uses JanusGraph on top of HBase. As a result, a Hadoop filesystem is essential for setting up Apache Atlas. On GCP, the required components include:

  • Dataproc Hadoop data storage: To install Apache Atlas on a Hadoop filesystem
  • Google Kubernetes Engine (GKE) pods running Apache Solr or Elasticsearch and Apache Ranger: For the search capability and masking PII data
  • Cloud Pub/Sub or Confluent Cloud on GCP: A messaging service to communicate and publish the changes to the metadata


Apache Atlas: Data catalog and metadata management for Google Cloud. Source: Atlas


Pros

  • Supports both Hadoop and non-Hadoop data sources
  • Offers out-of-the-box access to the REST APIs for external integration
  • Has features for effective handling of PII and sensitive information
  • Lets you observe data lineage and evolution through its UI
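As an illustration of external integration over REST, the sketch below builds and sends a basic search request to an Atlas server using only the Python standard library. It assumes the v2 basic search endpoint under `/api/atlas/v2/search/basic`, HTTP basic authentication, and a server at a hypothetical address. Check the endpoint and authentication scheme against your own deployment before relying on it.

```python
import base64
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def atlas_basic_search_url(base_url, query, type_name=None, limit=10):
    """Build the URL for a basic search against an Atlas server."""
    params = {"query": query, "limit": limit}
    if type_name:
        params["typeName"] = type_name  # e.g. restrict results to one entity type
    return f"{base_url.rstrip('/')}/api/atlas/v2/search/basic?{urlencode(params)}"


def atlas_basic_search(base_url, query, user, password, type_name=None):
    """Run the search and return the decoded JSON response.

    Assumes the server accepts HTTP basic authentication.
    """
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = Request(
        atlas_basic_search_url(base_url, query, type_name),
        headers={"Authorization": f"Basic {token}", "Accept": "application/json"},
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response)


# Hypothetical server address and search terms, for illustration only.
example_url = atlas_basic_search_url(
    "http://localhost:21000/", "orders", type_name="hive_table"
)
```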

Cons

  • Works within the Hadoop environment
  • Involves a steep learning curve for setting up and managing the data stack
  • Must be integrated with Apache Ranger for data-masking
  • Doesn’t have official Docker images or Helm charts to spawn the whole stack

Apache Atlas resources

Apache Atlas Overview | Apache Atlas Demo | Apache Atlas GitHub Repository


Lyft’s Amundsen

Amundsen is an open-source data catalog originally developed by Lyft.

Supported data sources

Amundsen can connect to databases via a DB-API or SQLAlchemy interface, with table connectors available for sources such as Amazon Redshift, Apache Cassandra, and Google BigQuery.
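To show what connecting through a DB-API interface involves in practice, here is a minimal sketch that pulls table and column metadata through Python's built-in `sqlite3` module, which implements DB-API. This is not Amundsen's databuilder code. SQLite stands in for a real warehouse, and the `extract_table_metadata` helper and sample table are hypothetical.

```python
import sqlite3


def extract_table_metadata(conn):
    """Return {table_name: [(column_name, declared_type), ...]} for a SQLite DB."""
    cur = conn.cursor()
    cur.execute("SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
    tables = [row[0] for row in cur.fetchall()]

    metadata = {}
    for table in tables:
        # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
        cur.execute(f'PRAGMA table_info("{table}")')
        metadata[table] = [(row[1], row[2]) for row in cur.fetchall()]
    return metadata


# In-memory database with one sample table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, placed_at TEXT)"
)
metadata = extract_table_metadata(conn)
```

A real connector would run the equivalent metadata queries against the source's own system tables and hand the results to the catalog's ingestion pipeline.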

Required GCP components

Amundsen maintains the metadata in Neo4j and uses Elasticsearch for data discovery. The metadata is ingested using the Amundsen databuilder service, which can be orchestrated using Apache Airflow.

The cloud components required to set up Amundsen for the GCP include:

  • Google Kubernetes Engine (GKE) pods running Amundsen Helm Charts

OR


Lyft Amundsen: Data catalog and metadata management for Google Cloud. Source: Amundsen


Pros

Cons

  • No support for masking PII or sensitive information
  • Dependent on Apache Atlas for lineage support
  • Managing the architecture adds engineering overhead

Amundsen data catalog resources

Amundsen Lyft Overview | Amundsen Demo | Amundsen Setup Guide | Amundsen GitHub Repository | A Guide to Configure and Set up Amundsen on Google Cloud Platform (GCP)


Data catalog for GCP: Summing up

Access to orchestration services such as Google Kubernetes Engine (GKE) makes it easier to host and scale your preferred catalog tools or platforms on GCP.

However, adopting a solution with minimal engineering overhead, seamless interoperability across cloud environments, and a simple UI is ideal. In addition to the points mentioned above, also consider our data catalog evaluation criteria to choose the perfect data catalog for your GCP environment.

