Amundsen, Lyft's solution to data discovery challenges

June 29th, 2021

What is Amundsen?

Amundsen is a data discovery platform and metadata engine that was developed at Lyft to address the common pain points faced by their data scientists, engineers, and researchers in their typical workflows.

Homegrown by the Lyft engineering team, Amundsen was named after Norwegian explorer Roald Amundsen.

Amundsen has improved the productivity of data scientists, analysts, and researchers at Lyft by about 20%.

Why did Lyft build Amundsen?

Lyft reported 13.49 million active riders in the first quarter of 2021. A rider base of that size generates an enormous volume of data to store, process, and analyze, and a large number of people rely on that data every day to make informed decisions.

At a data-driven company like Lyft, every interaction is powered by data, and the business cannot scale sustainably unless its data teams can use that data productively and effectively.

Lyft recognized this challenge and developed Amundsen, which they introduced in April 2019 as a solution to their data discovery woes.

Amundsen was developed to minimize time spent in discovering and trusting data. Source: Lyft Engineering

Is Amundsen open source?

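Amundsen was open sourced in October 2019, a year after it launched in production at Lyft, and is licensed under the Apache License, Version 2.0. A copy of the license is included in the Amundsen GitHub repository. In brief, the license permits commercial use, modification, distribution, and patent use; it requires that the license and copyright notice be preserved and that changes be stated; and it grants no trademark rights and disclaims warranty and liability.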

Source: Amundsen GitHub

Amundsen was donated to Linux Foundation AI in July 2020.

How does Amundsen work?

  • Easy discovery of trusted data
  • Automated & curated metadata
  • Ability to share context with coworkers
  • Learn and understand from data usage

Easy discovery of trusted data

Amundsen helps users find data within an organization through a simple text search. Its PageRank-inspired algorithm ranks results by popularity and also surfaces recommendations.

Visual search result with text query. Source: Lyft Engineering
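The idea of ranking text-search results by popularity can be sketched in a few lines of Python. This is only an illustration, not Amundsen's actual algorithm: it uses plain substring matching and sorts by a raw query count, and the `Table` class, the `query_count` field, and the sample catalog are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Table:
    name: str
    description: str
    query_count: int  # hypothetical popularity signal, e.g. recent query volume


def search(tables, text):
    """Return tables matching every term in `text`, most-queried first."""
    terms = text.lower().split()
    matches = [
        t for t in tables
        if all(term in f"{t.name} {t.description}".lower() for term in terms)
    ]
    # Rank by popularity; break ties alphabetically for a stable order.
    return sorted(matches, key=lambda t: (-t.query_count, t.name))


# Made-up catalog entries for illustration only.
catalog = [
    Table("rides_raw", "unprocessed ride events", 12),
    Table("rides_daily", "daily ride aggregates", 480),
    Table("drivers", "driver profiles", 150),
]

for table in search(catalog, "ride"):
    print(table.name, table.query_count)
```

Here the heavily queried `rides_daily` table is listed ahead of `rides_raw`, even though both match the search text equally well.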

Automated and curated metadata

Clicking a data asset shows its detailed description and its behaviour. Information such as descriptions and tags is entered manually by users, while information such as the most frequent users is generated automatically by parsing the audit logs.

Metadata describing application context, behaviour and change. Source: Lyft Engineering
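As a rough sketch of how usage metadata might be derived from audit logs, the Python below counts queries per user for each table. The log format (one `(user, table)` pair per query), the sample records, and the `frequent_users` helper are assumptions made for illustration; the article does not describe Lyft's actual log schema.

```python
from collections import Counter, defaultdict

# Hypothetical audit-log records: one (user, table) pair per query.
audit_log = [
    ("alice", "rides_daily"),
    ("bob", "rides_daily"),
    ("alice", "rides_daily"),
    ("carol", "drivers"),
    ("bob", "drivers"),
    ("carol", "drivers"),
    ("carol", "drivers"),
]


def frequent_users(log, top_n=2):
    """Map each table to its `top_n` most frequent users."""
    counts = defaultdict(Counter)
    for user, table in log:
        counts[table][user] += 1
    return {
        table: [user for user, _ in users.most_common(top_n)]
        for table, users in counts.items()
    }


print(frequent_users(audit_log))
```

Because the result comes entirely from logs, it stays current without anyone having to maintain it by hand, unlike descriptions and tags.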

Ability to share context with coworkers

Users can update the descriptions of data assets, which reduces the back-and-forth between coworkers looking for more context about a particular dataset.

Manually fed descriptions for better context to viewer. Source: Amundsen GitHub

Learn and understand from data usage

Users can see which data assets are frequently used, owned, or bookmarked. The dashboards built on a table also show the most common queries run against it.

Visibility of relationship between users and resources. Source: Lyft Engineering

The Amundsen Architecture

Amundsen follows a microservice architecture and consists of five major components:

  1. Metadata Service: Handles metadata requests from the frontend service and from other microservices
  2. Search Service: Backed by Elasticsearch
  3. Frontend Service: Hosts the web application
  4. Databuilder: Ingestion framework that extracts metadata from various sources
  5. Common: Library repository that holds code shared among the microservices

Microservices architecture of Amundsen. Source: Lyft Engineering
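The way these components fit together can be illustrated with a toy, in-memory Python sketch. Every class and function name below is a hypothetical stand-in, not Amundsen's real API: an ingestion job loads records into a metadata store and a search index, and a frontend-style function queries the index and then fetches the full metadata for each hit.

```python
class MetadataStore:
    """Stand-in for the store behind the Metadata Service (hypothetical)."""

    def __init__(self):
        self._tables = {}

    def put(self, record):
        self._tables[record["name"]] = record

    def get(self, name):
        return self._tables[name]


class SearchIndex:
    """Stand-in for the index behind the Search Service (hypothetical)."""

    def __init__(self):
        self._docs = {}

    def add(self, record):
        self._docs[record["name"]] = f'{record["name"]} {record["description"]}'.lower()

    def query(self, text):
        return sorted(name for name, doc in self._docs.items() if text.lower() in doc)


def ingest(source_rows, store, index):
    """Databuilder-style job: load each extracted record into both back ends."""
    for row in source_rows:
        store.put(row)
        index.add(row)


def frontend_search(text, store, index):
    """Frontend-style flow: search for names, then fetch metadata for each hit."""
    return [store.get(name) for name in index.query(text)]


# Made-up source records for illustration only.
rows = [
    {"name": "rides_daily", "description": "daily ride aggregates", "owner": "analytics"},
    {"name": "drivers", "description": "driver profiles", "owner": "marketplace"},
]

store, index = MetadataStore(), SearchIndex()
ingest(rows, store, index)
print(frontend_search("ride", store, index))
```

Keeping search and metadata behind separate interfaces mirrors the separation of services in the list above: each back end can be replaced or scaled without changing the other.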

Democratizing Data Discovery at Lyft

Amundsen is used by 750 data users at Lyft.

True democratization is possible when everyone looking for data resources knows exactly what data is available within the system and how they can use it, but that openness can also pose challenges for data privacy and security. Amundsen seeks to strike a balance between democratization and security by classifying metadata into two groups:

Fundamental metadata

Fundamental metadata, such as the name and description of tables and fields, owners, and last-updated time, is visible to all. This lets users learn that a data asset exists and judge whether it fits their needs.

Richer metadata

Richer metadata, such as column statistics and data previews, is available only to users who have access to the underlying data. Users convinced that a data asset is the right fit for them can also request access to its richer metadata.
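The two-tier split can be sketched as a simple filter in Python. The field names, the `visible_metadata` helper, and the sample record are assumptions chosen to mirror the examples in the text, not Amundsen's real schema or access-control code.

```python
# Fields treated as fundamental metadata in this sketch (assumed names).
FUNDAMENTAL_FIELDS = {"name", "description", "owners", "last_updated"}


def visible_metadata(table, user_has_data_access):
    """Return the metadata a user may see for one table."""
    if user_has_data_access:
        # Users with access to the data see everything, including richer metadata.
        return dict(table)
    # Everyone else sees only the fundamental fields.
    return {key: value for key, value in table.items() if key in FUNDAMENTAL_FIELDS}


# Made-up record for illustration only.
rides_daily = {
    "name": "rides_daily",
    "description": "daily ride aggregates",
    "owners": ["analytics"],
    "last_updated": "2021-06-28",
    "column_stats": {"ride_count": {"min": 0, "max": 9120}},
    "preview": [{"ride_date": "2021-06-27", "ride_count": 8431}],
}

print(sorted(visible_metadata(rides_daily, user_has_data_access=False)))
print(sorted(visible_metadata(rides_daily, user_has_data_access=True)))
```

A user without access still learns that `rides_daily` exists and who owns it, which is enough to decide whether to request access.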