What is a data dictionary?
Traditional data dictionaries usually only make sense to engineering, operations or IT, leaving business people in the dark.
Often, the humans of data (aka folks like you) spend an insane amount of time figuring out what data means and whether or not it’s credible. As per HBR, “80% of a data scientist’s valuable time is spent simply finding, cleaning, and organizing data, leaving only 20% to actually perform analysis”.
A traditional data dictionary cannot solve this problem. Then what will? A modern data dictionary. It is a repository of all column descriptions, along with metrics that describe the characteristics of each column, such as mean, median, and count of missing values.
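To make the idea of column metrics concrete, here is a minimal sketch of how such a profile might be computed for a single column. The function name `profile_column`, the sample `order_amount` values, and the choice to treat `None` as a missing value are all assumptions made for illustration, not part of any particular data dictionary product.

```python
from statistics import mean, median


def profile_column(values):
    """Summarize one column; None is treated as a missing value (an assumption)."""
    present = [v for v in values if v is not None]
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "mean": mean(present) if present else None,
        "median": median(present) if present else None,
    }


# Hypothetical column of order amounts with two missing entries.
order_amount = [120.0, 80.0, None, 100.0, None, 60.0]
profile = profile_column(order_amount)
print(profile)
```

A real tool would likely store this profile next to the column's description so that both can be read together.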
What should an ideal data dictionary look like?
A data dictionary should become the go-to tool for the humans of data
in your organization to understand everything about a data set and
check data quality at a glance. It will have information such as:
- Variable names, types, descriptions/definitions and frequency (i.e. how often these variables appear) within data sets.
- Owners and editors of data sets that contain these variables.
- Discussions around each variable (stored as tags or notes).
- A first-level check (i.e. preliminary stats such as minimum, maximum, frequency and more) or calculations with diagrams or charts that help you determine data quality at a glance.
Most importantly, a data dictionary should be right next to your data table with all information easily accessible.
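As a rough illustration of the "frequency" item above, the sketch below counts how many data sets each variable appears in. The `variable_frequency` function and the two tiny data sets (`orders`, `customers`) are hypothetical, and each data set is assumed to be a list of row dictionaries.

```python
from collections import defaultdict


def variable_frequency(datasets):
    """Map each variable name to the sorted list of data sets that contain it."""
    seen = defaultdict(set)
    for dataset_name, rows in datasets.items():
        for row in rows:
            for variable in row:
                seen[variable].add(dataset_name)
    return {variable: sorted(names) for variable, names in seen.items()}


# Hypothetical data sets, each represented as a list of row dictionaries.
datasets = {
    "orders": [{"customer_id": 1, "amount": 120.0}],
    "customers": [{"customer_id": 1, "country": "IN"}],
}
frequency = variable_frequency(datasets)
for variable, names in sorted(frequency.items()):
    print(f"{variable}: appears in {len(names)} data set(s): {', '.join(names)}")
```

Variables that show up in several data sets are good candidates for a single shared definition.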
Why is a data dictionary important?
As per the state of data 2018 report, “The estimated global annual spend on data initiatives by companies in 2018 was $114 billion”. Despite significant investments in data lakes, most organizations don’t have an easy way for humans to discover, access and share data. Collecting vast amounts of data is useless if you can’t interpret or analyze it.
Usually, the database administrator or engineer handles transforming and storing this data in warehouses or databases for further analysis. Now imagine if this person were to suddenly disappear tomorrow. Is there documentation somewhere that will explain everything that you need to know to take over the reins?
If you have a data dictionary in place, this won’t be a problem. A data dictionary can help team members learn everything about a data set.
But this isn’t the only reason that you should care about a data
dictionary. Here are the four biggest benefits of a modern data
dictionary:
- Detect anomalies quickly
- Evaluate data quality
- Get more trustworthy data
- Build transparency within data teams
Detect anomalies quickly
Identifying anomalies in data or missing data is easier with a dictionary, since it displays the results of data checks such as minimum and maximum values or count of distinct values. Spot duplicate, inaccurate or questionable data at a glance.
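The checks mentioned here can be sketched in a few lines. This is only an illustration: the `first_level_checks` function, the sample `ages` column, and the expected range of 0 to 120 are assumptions, and whether repeated values count as a problem depends on whether the column is supposed to be unique.

```python
from collections import Counter


def first_level_checks(values, expected_min, expected_max):
    """Return simple quality signals for one numeric column."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "min": min(present) if present else None,
        "max": max(present) if present else None,
        "distinct": len(counts),
        "missing": len(values) - len(present),
        # Values that occur more than once; only a concern for unique columns.
        "duplicates": sorted(v for v, n in counts.items() if n > 1),
        # Values outside the range the column is expected to stay within.
        "out_of_range": sorted(
            v for v in counts if not expected_min <= v <= expected_max
        ),
    }


# Hypothetical age column containing a missing value and two suspicious entries.
ages = [34, 29, None, 41, 29, -3, 212]
report = first_level_checks(ages, expected_min=0, expected_max=120)
print(report)
```

Here the minimum and maximum alone are enough to reveal the two questionable values, which is the kind of at-a-glance signal a data dictionary is meant to surface.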
Evaluate data quality
Data dictionaries make it easier to create a standard set of variable names and descriptions across an organization. This helps you automatically understand the quality of your data and makes data analysis quicker and easier. Quickly evaluate data quality and speed up your analysis!
Get more trustworthy data
With all of the information about a data set (sources, owners, descriptions, discussions, etc.) recorded in one place, data becomes more reliable. Now you can truly say, “In data we trust!”
Build transparency within data teams
When the entire organization understands what every detail within a data set means, it brings everyone on the same page, reduces dependencies, helps everyone use the data in the same way, and makes onboarding a breeze.
Well, now that you know how handy a data dictionary can be, let’s see how to create one.
How to create a data dictionary
To create a data dictionary, you should have answers to these six
questions:
- What does each variable/element/field/attribute within a data set mean? What is it describing?
- How did you collect each variable? How did you measure it?
- If there are numeric values, are these values raw or are they calculated using a formula?
- What are the tests or checks you need to run to determine whether your data is trustworthy?
- Who collected your data? Are they still the owners or is it somebody else? Who has interacted with your data, and what are the changes that they made? Who oversees the changes made to your data?
- How can you reach out to the owners, admins, and editors of your data?
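One way to keep track of these six questions is to give each variable a record with one field per question and flag whatever is still blank. The sketch below is a hypothetical structure, not a standard schema: the `DictionaryEntry` class, its field names, and the `unanswered` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields


@dataclass
class DictionaryEntry:
    name: str
    meaning: str = ""            # what the variable describes
    collection_method: str = ""  # how it was collected and measured
    derivation: str = ""         # "raw", or the formula used to calculate it
    quality_checks: str = ""     # tests that establish trustworthiness
    ownership: str = ""          # who collected, owns, and changed the data
    contact: str = ""            # how to reach owners, admins, and editors


def unanswered(entry):
    """List the field names that are still blank for an entry."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name).strip()]


# Hypothetical entry with only two of the six questions answered so far.
entry = DictionaryEntry(
    name="order_amount",
    meaning="Total value of an order before tax",
    derivation="raw",
)
print(unanswered(entry))
```

A list of blank fields like this gives a team a simple to-do list for each variable while the data is still being modeled.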
You might notice that it’s harder to find these answers once your data’s already modeled, prepped and being actively used for analysis.
That’s why it’s a best practice to start building a data dictionary right when you’re modeling your data. Doing so makes it a lot easier to define what each variable stands for, how it is being measured or calculated, who can make changes, and who is responsible for monitoring the changes made.