Data Classification and Tagging: How to Marie Kondo Your Data Catalog and Spark Joy

Last updated on: June 15th, 2023, Published on: June 15th, 2023


Data classification and tagging help you classify and organize your data assets so that it’s easier to understand and use data. Efficient classification and tagging of data assets optimize data search and discovery, analysis, management, governance, and compliance.

In this article, we’ll explore the importance of data classification for your catalog, look at the various classification techniques (such as tagging), and see how they can enable data governance.

Let’s start by understanding what data classification and tagging are, and why we need them, with an analogy from the real world: the Marie Kondo approach.

Table of contents

  1. What is data classification and tagging?
  2. Data classification and tagging: Running the KonMari exercise on your data catalog
  3. 4 benefits of data classification and tagging
  4. Tagging and labelling: Data classification techniques
  5. Data classification using tags for the modern data stack (MDS)
  6. Data classification and tagging for enabling data governance
  7. Bringing it all together: How to adopt a streamlined approach to tagging
  8. Way forward on data classification and tagging
  9. In conclusion
  10. Data Classification and Tagging: Related Reads

What is data classification and tagging?

Data classification is the process of categorizing and organizing data assets based on defined criteria or schemes. Meanwhile, tagging is a data classification technique.

Classification can be done in terms of data asset ownership, quality, sensitivity, revenue, and so on. It helps users quickly discover or identify relevant information, create a list of the most critical data assets, assign appropriate levels of protection, and more.

It can also help in data governance. To effectively run a compliance and security audit, data stewards classify assets as sensitive and confidential so that appropriate access controls and archival policies can be applied.

Now let’s look at some data classification examples from data science and product management.

Data scientists always look for the right data assets to build their models.

Instead of searching through scores of datasets to determine the right fit, they can rely on data quality tags like “accurate” or data freshness tags like “latest” to help quickly identify those datasets for their models.
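The tag-based lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not how any particular catalog implements search: the `Dataset` class, the `find_datasets` helper, and the catalog contents are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    """A catalog entry. The fields are illustrative, not a real catalog schema."""
    name: str
    tags: set[str] = field(default_factory=set)


def find_datasets(catalog: list[Dataset], required_tags: set[str]) -> list[str]:
    """Return the names of datasets that carry every tag in required_tags."""
    return [d.name for d in catalog if required_tags <= d.tags]


# Hypothetical catalog contents, made up for this example.
catalog = [
    Dataset("orders_2023", {"accurate", "latest"}),
    Dataset("orders_2019", {"accurate"}),
    Dataset("clickstream_raw", {"latest"}),
]

# Only datasets tagged both "accurate" and "latest" are returned.
print(find_datasets(catalog, {"accurate", "latest"}))
```

Requiring all tags (a subset check) rather than any tag keeps the result narrow, which matches the goal of avoiding a manual search through many datasets.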

Product managers in any organization always have a long list of backlog items or wish lists. Roadmaps can be prepared by classifying ‘features’ with tags such as ‘Must-have’ or ‘Nice-to-have’.

The best way to grasp the significance of data classification and tagging is by comparing it to the KonMari method of tidying things up in the real world. Let’s see why.

Data classification and tagging: Running the KonMari exercise on your data catalog

Marie Kondo, the famous organization expert, has helped people transform their homes and lives through her KonMari method, which follows the principle of decluttering and keeping only the things that spark joy. This has done wonders in the physical world.

What if we took the fundamental principles from the real world and applied them to the world of data?

The need for a decluttered approach in the digital and data world is more pressing than ever, especially in the modern data stack (MDS). Any tool or technology that is built to manage a lot of metadata or data needs a better way to handle its volume and variety.

Data catalogs, by definition, are well suited to addressing data management problems. However, they can quickly become an unmanageable data inventory if assets are not categorized and if unwanted or stale data is not discarded.

Before we delve further, it is important to dispel the common misconception that creating a data asset inventory is equivalent to building a data catalog.

Simply collecting and storing data assets without proper categorization or organization can result in a collection that is difficult to discover and search. This is one critical distinction between a data inventory and a data catalog.

Not having a proper classification may make identifying and protecting sensitive data across your modern data stack (MDS) a herculean task. It can become challenging to locate data assets when needed, resulting in delays in decision-making and data-driven initiatives.

As such, creating and managing a well-defined classification scheme is an important Marie Kondo exercise for the effective and efficient use of a data catalog.

To do so, we need to understand what data classification is and its crucial role in the world of active metadata, its benefits and use cases, and how it can contribute to the overall success of the data strategy within any organization.

Next, let’s explore the benefits of classifying data assets.

4 benefits of data classification and tagging

Ensuring proper classification can help in many ways, such as:

  • Better data management
  • Strong data security
  • Simplified compliance
  • Improved data governance

Let’s uncover the intricacies of each benefit.

1. Better data management

A clear understanding of the content, context, and structure of data helps teams effectively organize, store, and retrieve it. This, in turn, helps them identify patterns and trends within the data, gain valuable insights, and make more informed business decisions.

In a Sales team, sales engineers can rely on tags like “Confirmed Leads” to focus their attention on potential opportunities instead of looking at leads that have either been disqualified or gone cold.

2. Strong data security

Classifying sensitive data can lead to better protection against unintended usage. This can help to reduce the risk of data breaches, data loss, and other security incidents resulting in erosion of trust.

Patient records should always be classified as PII and PHI so that the right level of data protection policies can be applied.

3. Simplified compliance

Complying with regulations and standards can be done by identifying and classifying data according to compliance requirements. This helps reduce the risk of non-compliance and avoid potential penalties.

Creating a classification scheme to mark assets that are Highly sensitive, Confidential, or for long-term storage will help auditors to validate the data and archival policies applied to them.
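As a rough sketch of how such a scheme could support an audit, the Python below maps each classification to the controls it is expected to carry and reports what an asset is missing. The control names and the `REQUIRED_CONTROLS` table are assumptions made for illustration; a real policy would come from your own compliance requirements.

```python
from enum import Enum


class Classification(Enum):
    HIGHLY_SENSITIVE = "Highly sensitive"
    CONFIDENTIAL = "Confidential"
    LONG_TERM_STORAGE = "Long-term storage"


# Hypothetical policy table: the controls each classification is expected to have.
REQUIRED_CONTROLS = {
    Classification.HIGHLY_SENSITIVE: {"encryption", "masking", "access_review"},
    Classification.CONFIDENTIAL: {"encryption", "access_review"},
    Classification.LONG_TERM_STORAGE: {"archival"},
}


def audit_asset(classifications, applied_controls):
    """Return, sorted, the controls an asset lacks given its classifications."""
    required = set()
    for classification in classifications:
        required |= REQUIRED_CONTROLS[classification]
    return sorted(required - set(applied_controls))


# An asset classified twice, with only some of the expected controls applied.
missing = audit_asset(
    {Classification.HIGHLY_SENSITIVE, Classification.LONG_TERM_STORAGE},
    {"encryption", "archival"},
)
print(missing)
```

An auditor could run a check like this across all classified assets and review only those with a non-empty result.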

4. Improved data governance

Building policies based on classification can support data access enablement and enforcement. This will ensure standard and clear use of relevant data based on user personas.

Creating business unit-specific classifications like Engineering, Support, Customer Success, etc. will help in creating the right level of access policies based on the department.
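One way to picture a department-based access policy is as an overlap check between a user's departments and an asset's classifications. The sketch below assumes that simple rule; the asset names and the `visible_assets` helper are hypothetical, and real policies are usually richer than this.

```python
# Hypothetical assets, each classified by the business units that may use it.
ASSET_CLASSIFICATIONS = {
    "build_metrics": {"Engineering"},
    "ticket_history": {"Support", "Customer Success"},
    "renewal_forecast": {"Customer Success"},
}


def visible_assets(user_departments: set[str]) -> list[str]:
    """List assets whose classifications overlap the user's departments."""
    return sorted(
        asset
        for asset, classes in ASSET_CLASSIFICATIONS.items()
        if classes & user_departments
    )


print(visible_assets({"Support"}))
print(visible_assets({"Customer Success"}))
```

Because the policy is expressed against classifications rather than individual assets, newly classified assets are covered without writing a new rule.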

Having understood the concept and benefits of data classification, let’s delve into the various techniques for classifying data assets.

Tagging and labelling: Data classification techniques

As mentioned earlier, tagging is one of the most popular classification techniques. Assigning tags to data assets makes them easier to discover, classify, manage, and protect.

Another useful classification mechanism is labelling, which is more flexible and unstructured in nature. Unlike tags, labels typically do not follow any convention or schema and can be used by anyone within an organization without any need for special privileges.
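The contrast between schema-bound tags and free-form labels can be illustrated with a small Python sketch. The `TAG_SCHEMA` keys and values and the `Asset` class are invented for this example and do not reflect any specific tool.

```python
# Hypothetical tag schema: each tag key has a fixed set of allowed values.
TAG_SCHEMA = {
    "sensitivity": {"public", "confidential", "pii"},
    "freshness": {"latest", "stale"},
}


class Asset:
    def __init__(self, name: str):
        self.name = name
        self.tags: dict[str, str] = {}  # governed, checked against the schema
        self.labels: set[str] = set()   # free-form, no schema

    def add_tag(self, key: str, value: str) -> None:
        """Attach a tag only if the key and value are defined in the schema."""
        if key not in TAG_SCHEMA:
            raise ValueError(f"unknown tag key: {key!r}")
        if value not in TAG_SCHEMA[key]:
            raise ValueError(f"{value!r} is not an allowed value for {key!r}")
        self.tags[key] = value

    def add_label(self, text: str) -> None:
        """Attach a free-form label; only surrounding whitespace is trimmed."""
        self.labels.add(text.strip())


asset = Asset("patient_records")
asset.add_tag("sensitivity", "pii")     # accepted: defined in the schema
asset.add_label("ask Dana before use")  # accepted: labels are unconstrained
```

Calling `asset.add_tag("sensitivity", "secret")` would raise a `ValueError`, which is the kind of guardrail a schema provides and a label does not.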

Classification, tags, and labels: Can they be used interchangeably?

While terms like classification, tags, and labels are often used interchangeably, each has a distinct meaning and usage. Understanding these concepts can help effectively drive the various stages of a data governance maturity program.

The pyramid below summarizes how to think of implementing classifications within any organization.

The difference between a classification, a tag, and a label. Image by Atlan

Data classification using tags for the modern data stack (MDS)

Most MDS tools have a classification or tagging feature which is used to label and categorize data. This makes data assets easier to discover and understand, including their location, structure, and usage.

The tagging process is used throughout the entire data lifecycle, from ingestion to storage, processing, analysis, cataloging, governance, and production for downstream consumption, such as reporting, modelling, and APIs.

While this is not an exhaustive list, here are a few tools that are contributing to a tag ecosystem:

  • Snowflake Tags are a powerful data governance feature that allows organizations to easily manage, classify, and protect their data assets. Tags are schema-level objects that can be assigned to other Snowflake objects, such as tables, columns, and views. Further, column- and row-level policies can be applied to these tags.
  • Tags in dbt are a way to label different parts of a project. These tags can then be used when selecting sets of models, snapshots, or seeds to run. The tags can be applied at the table, column, or test level.
  • Tags in BigQuery and Dataplex are a type of business metadata that provides meaningful context to anyone who needs to use the asset. Pre-defined tag templates are available for use within services like BigQuery and Dataplex.
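To make the Snowflake case concrete, here is a small Python sketch that renders tagging statements as strings. It is based on my understanding of Snowflake's documented `CREATE TAG` and `ALTER TABLE ... MODIFY COLUMN ... SET TAG` syntax, so check it against the current Snowflake documentation before use. The helper functions and the tag, table, and column names are hypothetical, and identifiers are not validated or quoted here.

```python
def quote_literal(value: str) -> str:
    """Quote a SQL string literal, doubling any embedded single quotes."""
    return "'" + value.replace("'", "''") + "'"


def create_tag_sql(tag: str, allowed_values: list[str]) -> str:
    """Render a statement that creates a tag with a fixed set of values."""
    values = ", ".join(quote_literal(v) for v in allowed_values)
    return f"CREATE TAG IF NOT EXISTS {tag} ALLOWED_VALUES {values};"


def tag_column_sql(table: str, column: str, tag: str, value: str) -> str:
    """Render a statement that assigns a tag value to one column."""
    return (
        f"ALTER TABLE {table} MODIFY COLUMN {column} "
        f"SET TAG {tag} = {quote_literal(value)};"
    )


# Hypothetical names, for illustration only.
print(create_tag_sql("sensitivity", ["pii", "confidential"]))
print(tag_column_sql("patients", "email", "sensitivity", "pii"))
```

Generating the statements from one place, rather than writing them by hand per table, is one way to keep tag names and values consistent across a schema.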

From landing raw data to producing data for insights, tags are becoming more valuable and important than ever.

A crucial use case for data classification and tagging is data governance. Let’s see why.

Data classification and tagging for enabling data governance

Among the many use cases for classification, one of the most valuable is effective data governance.

Assigning tags to datasets based on a predefined schema helps organizations identify sensitive, confidential, regulated, or business-critical data, apply archival policies, and address other factors relevant to data privacy and regulatory compliance.

A more modern approach to classification is to classify and enrich data closer to the source. In other words, there is a fundamental and conscious “shift left”: being proactive rather than reactive.

The shift left approach to data governance explained. Source: Atlan on YouTube

Here are a few of the known players in the modern data stack (MDS) who are adopting this approach.

  • Fivetran has enabled column masking across its connectors so that sensitive data is prevented from proliferating before it leaves the source.
  • Snowflake has a set of features to protect both storage and access of data through column- and row-level access controls.
  • dbt’s approach to governance is to standardize across models for simpler downstream consumption.
  • Databricks Unity Catalog provides a unified governance solution in the Lakehouse.

Also, read: The future of the modern data stack in 2023

Bringing it all together: How to adopt a streamlined approach to tagging

Tagging plays a pivotal role in data classification. It yields a host of benefits that range from bolstering data management to fortifying data security, streamlining compliance endeavors, and fostering robust data governance.

Tagging and classification also support data literacy, which is central to effective data governance.

By assigning appropriate tags, organizations pave the way for seamless data understanding and utilization, empowering data practitioners to navigate the data landscape with confidence and derive maximum value from their information assets.

The divergence in tag versions and approaches across various platforms

In this landscape of data management tools, a common challenge emerges: the divergence in tag versions and approaches across various platforms.

As a result, data producers and consumers often find themselves confused about which tag to use or update across their data estate.

Take, for instance, the case of PII and Confidential classifications. Despite sharing similar definitions, the fact that they go by different names can breed confusion, paving the way for inconsistencies.

Such predicaments underscore the need for streamlined approaches to data tagging, enabling organizations to navigate the intricate realm of data classification with clarity and coherence.
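A streamlined approach could start with something as simple as an alias table that resolves each tool's local tag name to one canonical name. The Python below is a sketch under that assumption. Treating “PII” as an alias of “Confidential” follows the example above and is not a general rule; the alias table, helper names, and tool tags are hypothetical.

```python
# Hypothetical alias table: tool-specific tag names mapped to a canonical name.
CANONICAL_TAGS = {
    "pii": "confidential",
    "confidential": "confidential",
    "sensitive": "confidential",
}


def normalize_tag(raw: str) -> str:
    """Map a tag to its canonical name; unknown tags pass through lowercased."""
    key = raw.strip().lower()
    return CANONICAL_TAGS.get(key, key)


def canonical_view(tags_by_tool: dict[str, str]) -> dict[str, list[str]]:
    """Group tools by the canonical tag that their local tag resolves to."""
    view: dict[str, list[str]] = {}
    for tool, raw in tags_by_tool.items():
        view.setdefault(normalize_tag(raw), []).append(tool)
    return view


# The same asset, tagged differently in three tools (made-up values).
print(canonical_view({"snowflake": "PII", "dbt": "confidential", "bigquery": "Public"}))
```

Grouping by canonical name shows at a glance which tools already agree on a classification and which ones use a divergent name for it.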

Way forward on data classification and tagging

The best approach is to manage tag definitions centrally for consistency while keeping the flexibility for other tools to consume and use them for their respective needs. This makes it simpler to increase the adoption of classification schemes.

Modern data catalogs like Atlan achieve exactly that by offering a way to define and manage tags from tools like Snowflake, dbt, and BigQuery, without compromising on flexibility.

For instance, data scientists using the Atlan data catalog can easily locate a specific dbt model by searching for the same tag defined within dbt.

To effectively group data assets under the classification of “Confidential,” data stewards can effortlessly create a corresponding tag within Atlan.

Consequently, the corresponding assets are automatically tagged as “Confidential” within the Snowflake database, which reduces the chance of creating a parallel or duplicate classification.

Data engineers can rely on this single definition of a confidential tag to implement appropriate data masking policies.

In light of emerging trends, synchronization techniques among the various MDS tools will inevitably become an increasingly essential requirement in the foreseeable future.

In conclusion

To meet the demands of volume and variety at scale, it is important to employ methods to ensure that data is continuously and consistently classified and tagged across all data sources.

In essence, data classification and tagging resemble the renowned KonMari method, applied to data management. The practice is no longer a mere option but an urgent mandate for organizations looking to thrive in the data-driven landscape that lies ahead.

While exploring data catalogs, it’s crucial to pay attention to the classification and tagging capabilities. A fundamental requirement would be automating data classification and tagging through low-code/no-code solutions to enable bulk enrichment, which is critical to reducing clutter.

With advances in generative AI, AI-enabled capabilities for analyzing metadata to auto-populate tags on similar data assets and then propagate them via lineage or through hierarchy will become a Day-0 requirement.
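The propagation step mentioned above can be sketched independently of any AI component. The Python below walks a lineage graph breadth-first and copies a tag to every downstream asset. The graph shape (a mapping from an asset to its direct downstream assets), the `propagate_tag` helper, and the asset names are assumptions for illustration, not a description of how a specific catalog does it.

```python
from collections import deque


def propagate_tag(
    lineage: dict[str, list[str]],
    source: str,
    tag: str,
    tags: dict[str, set[str]],
) -> dict[str, set[str]]:
    """Return a copy of `tags` with `tag` added to `source` and all assets downstream of it."""
    result = {asset: set(existing) for asset, existing in tags.items()}
    queue = deque([source])
    seen = {source}
    while queue:
        asset = queue.popleft()
        result.setdefault(asset, set()).add(tag)
        for child in lineage.get(asset, []):
            if child not in seen:  # guard against cycles and repeated visits
                seen.add(child)
                queue.append(child)
    return result


# Hypothetical lineage: each asset maps to its direct downstream assets.
lineage = {
    "raw_patients": ["stg_patients"],
    "stg_patients": ["patient_report", "patient_features"],
    "raw_orders": ["order_report"],
}

tagged = propagate_tag(lineage, "raw_patients", "PII", {"raw_orders": {"Finance"}})
print(sorted(tagged))
```

Assets outside the source's downstream path, such as `raw_orders` here, keep their existing tags and do not receive the propagated one.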

See how Atlan can tidy up your data estate and spark joy for your data practitioners. Get hands-on with Atlan today.

Connect with Srinivasa here.
