Airbnb Data Catalog — Democratizing Data With Dataportal
September 9th, 2022
What is Dataportal?
Dataportal is a data catalog built at Airbnb to drive data enablement and democratization by improving data discovery. Airbnb employees use it to easily surface desired data with the appropriate context. Thanks to the tool, data is democratized and used in daily flows, rather than residing within the sole domain of just a few data workers.
Why did Airbnb build Dataportal?
Within five years of Airbnb’s 2008 founding, the then-startup had served over 9 million guests en route to becoming the world’s most recognizable online hospitality platform. However, the company’s explosive growth came with challenges, including:
- Too many disparate data sources
- Siloed data
- Tribal knowledge
Too many disparate data sources
In 2013 alone, Airbnb added 250,000 properties to its platform. These additions were just some of the many data points flowing into the company as data tables, dashboards, reports, metrics, and definitions. As the company scaled, the deluge of data sources made it difficult for employees to actually use the data in decision-making.
Siloed data
Airbnb also suffered from a fragmented data landscape where data was siloed, inaccessible, and lacked context, making it virtually unusable for all but a handful of data workers. The problem only grew as the company expanded its operations to include offices around the world. It became very difficult to maintain a single source of truth when it came to data.
Tribal knowledge
Like many organizations, Airbnb had a problem with tribal knowledge: unwritten information and processes that rest in the brains of select individuals. Only data workers understood how to find and use data. Other employees had to turn to them for information, or forgo the use of the data entirely. The situation became untenable as the company added thousands of employees dispersed across the globe.
To overcome these challenges, the company charged its data team with developing a data catalog that would better enable employees to explore, discover, trust, and leverage data. The project became known as Dataportal.
Learn more → Dataportal — Democratizing data at Airbnb
How was Dataportal built?
In building the Airbnb data catalog, the company leveraged a team of data scientists and visualization engineers. Believing the interface and user experience of a data tool should not be an afterthought, the data team ensured the backend and frontend were given equal weight so the product wasn’t just functional, but also easy to use.
The Airbnb data catalog uses Flask, a lightweight Python web framework, for the API. It draws on the company’s data resources to build a graph, stored in Hive, that is composed of nodes (the resources) and edges (the relationships between them). This graph maps relationships between data users and the data itself.
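To make the graph idea concrete, here is a minimal sketch of resources as nodes and relationships as labeled edges. The article does not describe Dataportal’s actual schema, so the `Node` and `CatalogGraph` classes, the relationship names (`CREATED`, `CONSUMED`, `READS_FROM`), and the sample resources are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Node:
    """A resource in the catalog graph (hypothetical schema)."""
    kind: str  # e.g. "user", "table", "dashboard"
    name: str


@dataclass
class CatalogGraph:
    """Maps each source node to a list of (relationship, target) pairs."""
    edges: dict = field(default_factory=dict)

    def add_edge(self, source: Node, relationship: str, target: Node) -> None:
        self.edges.setdefault(source, []).append((relationship, target))

    def neighbors(self, source: Node, relationship: str) -> list:
        # Return only the targets linked by the requested relationship.
        return [t for rel, t in self.edges.get(source, []) if rel == relationship]


# Illustrative resources, not real Airbnb tables or users.
alice = Node("user", "alice")
bookings = Node("table", "core.bookings")
revenue = Node("dashboard", "weekly_revenue")

graph = CatalogGraph()
graph.add_edge(alice, "CREATED", bookings)
graph.add_edge(alice, "CONSUMED", revenue)
graph.add_edge(revenue, "READS_FROM", bookings)

print(graph.neighbors(alice, "CREATED"))
```

Keeping the relationship label on the edge lets the same structure answer both "what did this person create?" and "what did this person consume?" without separate tables.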
From there, the data follows a winding path:
- It starts in Hive;
- Airflow pushes it to Python, where it is represented as an object and PageRank is computed to help with ranking;
- A Neo4j driver then pushes the data to Neo4j (the source of truth), and Neo4j integrates with Elasticsearch, where the nodes are pushed;
- Elasticsearch serves as the search engine, and the web server fetches the results.
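The PageRank step in the pipeline above can be sketched in plain Python. The article does not say how Airbnb computes it, so this is a generic power-iteration version under stated assumptions: the `pagerank` function, the damping factor of 0.85, the edge direction (a resource points at what it depends on), and the sample resource names are all illustrative.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over {node: [nodes it points to]}."""
    nodes = set(links) | {t for targets in links.values() for t in targets}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not links.get(node))
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {node: base for node in nodes}
        for source, targets in links.items():
            for target in targets:
                new_rank[target] += damping * rank[source] / len(targets)
        rank = new_rank
    return rank


# Hypothetical edges: each dashboard points at the tables it reads from.
links = {
    "dashboard.weekly_revenue": ["table.core_bookings"],
    "dashboard.host_growth": ["table.core_bookings", "table.core_listings"],
    "table.core_listings": [],
    "table.core_bookings": [],
}
scores = pagerank(links)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy graph the table that both dashboards read from ends up with the highest score, which is the kind of signal a search engine can use to put widely used resources first.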
Airbnb took pains to ensure that the front-end provided an intuitive, frictionless experience so it would be usable by people of all data literacy levels. The journey began by interviewing employees across the company and creating a range of user personas spanning data knowledge and use cases. Airbnb stressed that the UI had to be free of bugs to build trust so that employees wouldn’t hesitate to use it in their daily workflows.
In terms of technologies, the frontend employs:
- ES6, NPM for application and dependencies
- React for DOM (Document Object Model)
- Redux for application state
- Khan/Aphrodite for styling
- ESLint, Enzyme, Mocha, and Chai for testing
What are the features of the Airbnb data catalog?
Airbnb describes Dataportal as having four primary features:
- Search
- Context and metadata
- Employee-centric data
- Team-centric data
Search
The Airbnb data catalog features a clean and minimalist design for enhanced clarity, as the data itself is already complex. Engineers designed the search experience to mimic that of Google, allowing users to quickly surface the information they need. Search is also intentionally designed to be fast, as a laggy search disincentivizes exploration.
Context and metadata
The Airbnb data catalog doesn’t just display data; it provides context in the form of metadata so users can understand:
- Who created the data
- Who consumed it
- When it was last updated
Users can even trace data lineage by exploring parent and child tables in order to understand relationships between data sets.
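Tracing lineage through parent and child tables amounts to walking a directed graph in either direction. The sketch below assumes a simple parent-to-children mapping; the `CHILDREN` data, the table names, and the `invert` and `reachable` helpers are hypothetical and not taken from Dataportal.

```python
from collections import deque

# Hypothetical lineage: each table maps to the tables built directly from it.
CHILDREN = {
    "raw.events": ["core.sessions", "core.bookings"],
    "core.bookings": ["metrics.weekly_revenue"],
    "core.sessions": [],
    "metrics.weekly_revenue": [],
}


def invert(children):
    """Build the child -> parents map from the parent -> children map."""
    parents = {table: [] for table in children}
    for parent, kids in children.items():
        for kid in kids:
            parents.setdefault(kid, []).append(parent)
    return parents


def reachable(start, adjacency):
    """Breadth-first walk returning every table reachable from start."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        current = queue.popleft()
        for nxt in adjacency.get(current, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
                order.append(nxt)
    return order


# Everything derived from raw.events, and everything weekly_revenue depends on.
downstream = reachable("raw.events", CHILDREN)
upstream = reachable("metrics.weekly_revenue", invert(CHILDREN))
print(downstream)
print(upstream)
```

The same traversal serves both questions: walking the children map shows what a change to a table might affect, and walking the inverted map shows where a metric’s data came from.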
Employee-centric data
Airbnb believes that employees (including ex-employees) are the ultimate holders of tribal knowledge, so it’s important to provide information about them, the data they’ve created, and the data they’ve consumed. All employees have access to the pages of other employees, promoting greater transparency and trust within the company.
Team-centric data
Teams have tables they query regularly, dashboards they review, and metrics they’ve defined. The Airbnb data catalog allows users to search this team info so that people both inside and outside the team can reference it rather than chasing down someone on the team.
Can’t build your own Dataportal? Consider Atlan
When Airbnb understood it had a problem getting employees to use data in their normal workflow, the company embarked on a mission to build a data catalog from scratch that would enable greater data democratization. By assembling the right data scientists and engineers, the company succeeded in creating a tool that’s effective and simple to use.
However, not all organizations have the resources to build a data catalog from the ground up. Luckily, they don’t have to.
If you are a data consumer or producer who wants to help your organization get the most value from a modern data stack, and you are weighing your build vs. buy options, it’s worth taking a look at off-the-shelf alternatives like Atlan, an easy-to-integrate, modern data catalog designed for data-driven teams to discover, understand, trust, and collaborate on data assets.