Data Catalog for the Google Cloud Platform (GCP) — Everything You Need to Know
August 16th, 2022
If you’re evaluating data catalogs for the Google Cloud Platform (GCP), the Google Data Catalog might seem like a solid bet. However, before zeroing in on Google Data Catalog, you should weigh key factors such as data connectors, cloud-agnostic design, scalability, ease of governance, and more.
This article will explore the key factors to consider during your data catalog evaluation, followed by an assessment of a list of suitable data catalogs for your cloud environment.
Let’s begin by looking at the primary evaluation criteria.
Data catalogs on GCP: Factors to consider
- Data governance capabilities: Granular, role-based access controls for data security, privacy, and compliance with data regulations
- Lineage mapping and impact analysis: A visual map of data lineage that shows where data originates and which transformations it undergoes, and that helps you monitor the flow of data for impact analysis
- Stack maintenance: The entity responsible for managing the data stack. Is it your engineering team, or is it a shared responsibility between you and your service provider?
- Data connectors available: The data sources supported, and the ease of requesting and setting up custom connectors for upcoming sources
- Scalability: The ease of scaling resources for your data stack
- Cloud-agnostic nature: The ability to support and enable interoperability across multiple cloud environments
- Hosting model: On-premises or hosted
- Technical support: The level of support offered to set up and manage the data catalog
Data catalogs for GCP
While several solutions are available, we’re going to focus on four commonly considered platforms: Google Data Catalog, Atlan, Apache Atlas, and Amundsen.
Each tool supports various data sources and has its pros and cons. Some require further setup to ensure complete compatibility with GCP.
We’ll explore these aspects further to help you pick the right tool for your data stack. Let’s start with GCP’s Google Data Catalog.
Google Data Catalog
The Google Data Catalog is a metadata management service from GCP. It is serverless and fully managed by GCP.
Supported Data Sources
- Google data sources: Google BigQuery, Pub/Sub
- Non-Google data sources: Community-developed connectors exist but are not officially supported by GCP. Examples include connectors for MySQL, Amazon Redshift, Apache Hive, and SAP HANA.
Required GCP Components
Since Google Data Catalog is part of the GCP suite of services, it doesn’t require any additional components to ensure compatibility with GCP.
As we’re looking at cataloging data sets on GCP, this might seem to be the obvious choice. However, the Google Data Catalog has its fair share of pros and cons.
Pros
- Use the Python API to access and modify the metadata components
- Mask PII and other sensitive data, and control access to it
- Define roles as per user categories
Cons
- Not cloud agnostic and offers limited cross-cloud compatibility
- Limited support for data source connectors outside Google
- Tech-centric UI makes it less ideal for business consumers of data
- Integration isn’t straightforward using the available Python API
- Data lineage and evolution don’t get mapped
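As a rough illustration of the Python API mentioned in the list above, the following sketch searches the catalog for BigQuery tables by name. It assumes the third-party google-cloud-datacatalog package is installed and that application default credentials are configured; the helper names build_search_query and search_tables, and the project ID in the usage note, are hypothetical and used only for illustration.

```python
from typing import List


def build_search_query(name_fragment: str,
                       system: str = "bigquery",
                       entry_type: str = "table") -> str:
    """Compose a Data Catalog search string from simple qualifiers."""
    return f"system={system} type={entry_type} name:{name_fragment}"


def search_tables(project_id: str, name_fragment: str) -> List[str]:
    """Return the linked resource names of catalog entries matching the query.

    Requires the google-cloud-datacatalog package and valid GCP credentials.
    """
    # Imported lazily so the query helper above works without the library.
    from google.cloud import datacatalog_v1

    client = datacatalog_v1.DataCatalogClient()

    # Restrict the search to a single project (an assumption for this sketch).
    scope = datacatalog_v1.SearchCatalogRequest.Scope()
    scope.include_project_ids.append(project_id)

    results = client.search_catalog(
        request={"scope": scope, "query": build_search_query(name_fragment)}
    )
    return [result.linked_resource for result in results]
```

Calling search_tables("my-project", "orders") would list matching BigQuery tables in a project named my-project, which is a placeholder here.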
To overcome the shortcomings of the Google Data Catalog, you can consider an enterprise data catalog like Atlan, or open-source tools such as Apache Atlas and Lyft’s Amundsen. Let’s explore each solution further.
Atlan
Atlan is a third-generation enterprise data catalog powered by active metadata management that serves as a modern data workspace for business and technical data consumers. Atlan is built on the principles of embedded collaboration, programmable bots, and end-to-end visibility, and it is open by default.
The user-friendly UI with a Google-like search lets you discover and look up data sets across your data ecosystem. The catalog offers rich context on every data set with active and passive metadata compiled from various data sources and applications.
Also, Atlan’s column-level lineage provides a bird’s-eye view of your data so that you can perform root cause analysis, assess impact, and understand how data sets connect with one another.
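To make the idea of impact analysis and root cause analysis over lineage concrete, here is a minimal sketch that treats column-level lineage as a directed graph. It is not Atlan’s implementation; the column names and the helper functions reachable and invert are hypothetical and exist only for illustration.

```python
from collections import deque
from typing import Dict, List, Set

# Maps each column to the columns derived directly from it.
LineageGraph = Dict[str, List[str]]


def reachable(graph: LineageGraph, start: str) -> Set[str]:
    """Breadth-first traversal: every node reachable from start, excluding start."""
    seen: Set[str] = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, []):
            if neighbor != start and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def invert(graph: LineageGraph) -> LineageGraph:
    """Flip edge direction so the same traversal walks upstream."""
    inverted: LineageGraph = {}
    for source, targets in graph.items():
        for target in targets:
            inverted.setdefault(target, []).append(source)
    return inverted


# Hypothetical column-level lineage, for illustration only.
downstream: LineageGraph = {
    "raw.orders.amount": ["staging.orders.amount_usd"],
    "raw.fx_rates.rate": ["staging.orders.amount_usd"],
    "staging.orders.amount_usd": ["marts.revenue.total_usd"],
    "marts.revenue.total_usd": ["dashboard.revenue_kpi"],
}

# Impact analysis: what breaks downstream if the FX rate column changes?
impact = reachable(downstream, "raw.fx_rates.rate")

# Root cause analysis: which upstream columns feed the revenue total?
root_causes = reachable(invert(downstream), "marts.revenue.total_usd")
```

Impact analysis walks the graph downstream from a changed column, while root cause analysis walks the inverted graph upstream from a broken one, so a single traversal routine serves both.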
Supported data sources
Atlan supports connecting to various data sources, such as:
- Google BigQuery
- Amazon Redshift
Additionally, Atlan offers connectors for BI tools such as Tableau, Power BI, and Looker, and for transformation engines such as dbt Cloud and dbt Core.
Pros
- Get a hyper-personalized experience with personas for user roles, business domains, and projects
- Get context in your data tool of choice (Slack, Jira, Tableau, Looker, etc.) using Atlan’s Chrome plugin
- Discuss data sets, raise support tickets, look for business definitions, and get alerts on your workflows without leaving Atlan or switching between apps
- Set up custom classifications and access policies and propagate them via the relationships created in lineage mapping
- Control who can read and edit metadata, data definitions, metric formulas, and business taxonomies
- Host the infrastructure yourself or let Atlan manage it for you
- Use open APIs to connect Atlan with any tool in your data stack
- Pay-as-you-go pricing model
Cons
- Support for Kafka and other Hadoop-ecosystem data sources is still under development
Apache Atlas
Apache Atlas is an open-source metadata management service that works within Hadoop. However, it’s not limited to Hadoop data sources and integrates easily with the enterprise data environment.
Supported data sources
Required GCP components
Apache Atlas models metadata as a graph: its core uses JanusGraph as the graph layer, with HBase as the underlying store. As a result, a Hadoop filesystem is essential for setting up Apache Atlas. The required components include:
- Dataproc Hadoop data storage: To install Apache Atlas on a Hadoop filesystem
- Google Kubernetes Engine (GKE) pods running Apache Solr or Elasticsearch and Apache Ranger: For the search capability and masking PII data
- Cloud Pub/Sub or Confluent Cloud on GCP: A messaging service to communicate and publish the changes to the metadata
Pros
- Supports both Hadoop and non-Hadoop data sources
- Offers out-of-the-box access to the REST APIs for external integration
- Has features for effective handling of PII and sensitive information
- Lets you observe data lineage and evolution through its UI
Cons
- Needs a Hadoop environment to run
- Involves a steep learning curve for setting up and managing the data stack
- Must be integrated with Apache Ranger for data masking
- Doesn’t have official Docker images or Helm charts to spawn the whole stack
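As an example of the REST access noted in the list above, the sketch below calls the Atlas v2 basic search endpoint using only the Python standard library. The base URL, credentials, and the default hive_table type are assumptions for illustration, and the helper names build_basic_search_url and search_entities are hypothetical; adjust them to your own deployment and authentication setup.

```python
import base64
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List


def build_basic_search_url(base_url: str, query: str,
                           type_name: str = "hive_table",
                           limit: int = 10) -> str:
    """Build the URL for the Atlas v2 basic search endpoint."""
    params = urllib.parse.urlencode(
        {"query": query, "typeName": type_name, "limit": limit}
    )
    return f"{base_url.rstrip('/')}/api/atlas/v2/search/basic?{params}"


def search_entities(base_url: str, query: str,
                    username: str, password: str) -> List[Dict[str, Any]]:
    """Run a basic search with HTTP basic auth and return the matching entities."""
    url = build_basic_search_url(base_url, query)
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    request = urllib.request.Request(
        url,
        headers={"Authorization": f"Basic {token}",
                 "Accept": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        payload = json.load(response)
    # The search result lists matches under "entities"; absent when none match.
    return payload.get("entities", [])
```

Against a running instance, search_entities("http://localhost:21000", "sales", username, password) would return entity headers whose names match "sales"; the host and port shown are placeholders.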
Amundsen
Amundsen is an open-source data catalog originally developed by Lyft.
Supported data sources
Amundsen can connect to databases through a Python DB-API or SQLAlchemy interface, and it ships with a set of table connectors. Examples include Amazon Redshift, Apache Cassandra, and Google BigQuery.
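To show what extracting table metadata through a DB-API connection looks like, here is a minimal sketch using the standard library’s sqlite3 module, which follows the DB-API 2.0 specification. This is not Amundsen’s databuilder code; the function extract_table_metadata and the orders table are hypothetical, and the catalog queries are specific to SQLite.

```python
import sqlite3
from typing import Dict, List


def extract_table_metadata(connection: sqlite3.Connection) -> Dict[str, List[Dict[str, str]]]:
    """Return {table_name: [{"name": column, "type": declared_type}, ...]}."""
    cursor = connection.cursor()

    # SQLite keeps its schema in sqlite_master; other databases expose
    # similar information through information_schema or system tables.
    cursor.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    tables = [row[0] for row in cursor.fetchall()]

    metadata: Dict[str, List[Dict[str, str]]] = {}
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        cursor.execute(f'PRAGMA table_info("{table}")')
        metadata[table] = [
            {"name": row[1], "type": row[2]} for row in cursor.fetchall()
        ]
    return metadata


# Demo against an in-memory database with one hypothetical table.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, placed_at TEXT)"
)
metadata = extract_table_metadata(connection)
connection.close()
```

A catalog’s ingestion job would then publish records like these to its metadata store and search index instead of keeping them in a dictionary.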
Required GCP components
Amundsen stores its metadata in Neo4j and uses Elasticsearch for data discovery. The metadata is ingested using the Amundsen databuilder service, and ingestion can be orchestrated using Apache Airflow.
The cloud components required to set up Amundsen for the GCP include:
- Google Kubernetes Engine (GKE) pods running Amundsen Helm Charts
- Google Cloud VM Instance to host your Amundsen Docker containers
- Elasticsearch running on GKE and assigned to the GKE pods running Amundsen
Pros
- Supports a variety of Hadoop and non-Hadoop data sources
- Officially supports Helm charts, which help in scaling the platform
- Offers connector support for visualization platforms such as Tableau, Redash, Apache Superset, etc.
Cons
- No support for masking PII or sensitive information
- Dependent on Apache Atlas for lineage support
- Managing the architecture adds engineering overhead
Data catalog for GCP: Summing up
Access to orchestration services such as Google Kubernetes Engine (GKE) has made it easier to host and scale your preferred catalog tools or platforms on GCP.
However, adopting a solution with minimal engineering overhead, seamless interoperability across cloud environments, and a simple UI is ideal. In addition to the points mentioned above, also consider our data catalog evaluation criteria to choose the perfect data catalog for your GCP environment.
Google data catalog: Related resources
- 5 popular open source data discovery and catalog tools to evaluate in 2023
- Apache Atlas alternatives — Amundsen, DataHub, Metacat, Databook
- Amundsen vs Atlas: A comparison in architecture, data discovery features, deployment, and data observability
- Amundsen Vs. DataHub: What are the differences and similarities? Which one is better for you?
- A step-by-step guide to Install and deploy Amundsen
- A step-by-step guide to Install and deploy Apache Atlas
- Evaluating a data catalog? Here are the 5 essential features to look for in a modern data catalog
- Learn more about Atlan: The pioneering third-generation data catalog for modern data teams.