Finding the right data is the hardest and the most time-consuming task in a data analyst’s/scientist’s typical work day. Data discovery and search have now become a business-critical factor influencing the success/failure of a data analytics project.
In addition to data catalog and data lineage that help make search and discovery easier and faster. Let’s explore, in the context of Snowflake warehouse, the importance and role of a data dictionary in helping establish trust in a data asset.
What is a data dictionary?
A data dictionary holds the metadata for the database assets. It is the documentation for your database which describes the structure and content of the database objects.
Quoting the Harvard Business Review,
Studies show that knowledge workers waste up to 50% of time hunting for data, identifying and correcting errors, and seeking confirmatory sources for data they do not trust.
It becomes extremely difficult to understand data without metadata which helps explain the source, nature, structure, quality, and usefulness of data.
A data dictionary is a collection of read-only tables and views that gets automatically created with every database. So one doesn’t have to probe source code, SQL queries, BI reports and depend on the tribal knowledge of the few to get more context about the data.
A Guide to Building a Business Case for a Data Catalog
Benefits of a data dictionary
Unified source of truth
Information about databases, schemas, and tables is managed in a single source of truth — a data dictionary that is editable, sharable, auditable, and self-serve, as compared to knowledge spread across various teams, different tools, and chaotic access policies. A data dictionary makes new analysts’ onboarding and training much more easier and productive.
The self-serve nature of the data dictionary helps analysts search and understand data needed for analysis and reporting. Understanding table relationships and foreign key constraints help analysts write better-performing SQL queries. It helps remove removes dependencies and bottlenecks and hence better governance and stewardship.
Trust data better
Data profiling gives a quick snapshot of the nature and quality of data. The usability of data assets can be ascertained through: Data types, min-max values, frequency distribution, unique, duplicate, and null values.
A data dictionary is a rich source of technical metadata such as table relationships, query history, foreign key constraints, and database usage and performance. This helps DataOps engineers to monitor and tune databases for optimal performance and costs (storage and processing)
Snowflake Data Dictionary: The Information Schema
Snowflake has a built-in data dictionary known as and can be accessed through INFORMATION_SCHEMA. The Information Schema is always included within every database created in Snowflake.
The Snowflake data dictionary has the following two components:
Schema views: Metadata about all data assets/objects stored in the database.
Table functions: Metadata about historical information on storage, tasks, query history, and account-level usage.
Key components of a Snowflake data dictionary
- Table schema, names, and descriptions
- Table owners and relationships
- Number of columns, rows, column name, and descriptions
- Permissible values and validation rules for a column
- Data types
- Referential constraints — foreign keys and primary keys
- Data profiling with descriptive statistics — missing values, min-max values, and histogram distribution
- SQL query logs
- Warehouse storage, load, and a metering history
Challenges using Snowflake Data Dictionary
Accessibility for non-technical/business users
Managing and accessing a data dictionary in Snowflake is only possible by running SQL queries. This makes it harder for business users to use dictionaries for data discovery.
Absence of a universal search
For accessing a data dictionary in Snowflake, one still needs to have a general idea of where a particular data asset comes from, this makes it difficult when you have thousands of tables spread over different databases.
Multiple data sources support
Snowflake provides data dictionary only for databases stored within the Snowflake warehouse. When you have data stored at non-snowflake databases, you’ll need a centralized data dictionary tool to assimilate all data sources.
Lack of custom metadata support
Snowflake data dictionary supports only metadata exposed through the API. It is not possible to add custom metadata — ETL logs, Airflow status, data quality checks — to the data dictionary.
Lack of data collaboration
As the volume of data on your Snowflake warehouse increases, access to the data dictionary alone not would be sufficient to find and trust data. Bringing in the human element through embedded collaboration — chats, crowdsourced data stewardship, and integrations with IT/task management tools — helps make data reliable and usable.
[Download] → Forrester Wave™: Enterprise Data Catalog for DataOps, Q2 2022
Atlan: Modern data dictionary for Snowflake
The problem with traditional data dictionaries (that come built-in within databases) is that it is driven by passive metadata: Metadata that is default, limiting, and doesn’t give you a complete picture of a data asset.
On the contrary, the crux of active metadata is its openness and interoperability. Atlan makes it possible to effortlessly move metadata across your data stack — data lakes, warehouses, BI tools, pipelines, ETL — and helps embed this rich context on data catalog, lineage, and data dictionary.
Atlan’s data dictionary helps you build a 360° data profile to understand your data better, search and discover data faster, and to collaborate in and around data assets.
Let’s now look at the key data dictionary capabilities of Atlan:
Auto-generated data dictionary
Apart from Snowflake, Atlan’s bots automatically crawl and pulls data from all the commonly used data sources like Bigquery, Databricks, Redshift, MySQL, and PostgreSQL. Atlan also connects with BI tools like Looker, Tableau, and Power BI and thus giving your data dictionary an end-to-end view of a data asset.
Learn more: How Atlan connects with multiple data sources
Search and discovery on an easy-to-use interface
Atlan’s primary use case is a data catalog. It provides a Google-like search experience that lets anyone on your team — from a data engineer to a business user/analyst — to search and filter data assets on an easy-to-use user interface.
Learn more: Atlan data catalog for data discovery and search
Data dictionary enriched with custom metadata
Atan’s APIs lets you track all kinds of metadata — technical, administrative, business, and social — from your warehouse, ingestion, ETL, and business intelligence tools. Questions like, “when was the last time this data was updated”, and “what are the transformations the data has gone through” are answered right there on the data dictionary.
Learn more: Custom metadata support in Atlan
Everyone is on the same page with a business glossary
Atlan’s data dictionary also has a business glossary — a centralized knowledge bank that lists all of the key business terms, concepts, KPIs, and metrics associated with a data asset.
End-to-end visibility with data lineage
Atlan’s automated data lineage crawls data sources and parses SQL queries to build column-level lineage visualization. Data lineage helps understand relationships and dependencies between various data assets across the entire data life cycle.
Democratize data with embedded collaboration
Atlan seamlessly integrates with your team’s communication and project management tools, this lets you initiate conversations around a data asset right within Atlan. Crowdsource data stewardship tasks like updating documentation/READMEs, adding owners, certifying terms, and adding linked assets.
A Demo of Atlan data dictionary for Snowflake data warehouse
Atlan: A Snowflake validated partner for data discovery and governance
If you are evaluating and looking to deploy a best-in-class data catalog and data dictionary for the Snowflake data warehouse? Do give Atlan a spin.
Atlan is the first data catalog and metadata management solution validated by Snowflake’s technology validation program.
Quoting Tarik Dwiek, Head of Technology Alliances at Snowflake,
As Snowflake continues to help companies mobilize their data, Atlan makes it possible to find, use, document, and collaborate on this data. Our partnership with Atlan was a natural fit, given how its open approach, pay-as-you-go model, and delightful user experience align with Snowflake’s own ethos.
Atlan is more than a metadata management and data cataloging tool. Atlan is built by data engineers solving for the evolving needs of the modern data teams which include faster discovery, transparent data flow, robust governance, and collaboration built on open infrastructure and an easy-to-use user interface.
The deep integration and the open API enable Atlan to solve other modern data governance use cases across DataOps, workflow management, and pipeline automation.
Snowflake data dictionary: Related reads
- Data catalog for Snowflake data warehouse
- How to manage data governance for Snowflake data warehouse
- Snowflake data access control made easy and scalable
- Visualize data lineage for Snowflake data objects
- How to manage metadata for Snowflake data assets
- Data dictionary: Definition, examples, purpose, and why do you need one?