Setting up Amundsen: Follow this step-by-step guide

October 20, 2021

Guide to setting up Amundsen

In this guide, we’ll walk through all the steps required to set up Amundsen, one of the most popular open source data cataloging tools.

For ease of understanding, the setup process has been divided into the following sequential steps:

  • Taking stock of requirements to install Amundsen
  • Understanding Amundsen architecture
  • Cloning the Amundsen repository
  • Loading sample data
  • Using standalone Python scripts
  • Enabling Elasticsearch and Atlas
  • Anticipating common installation issues

Requirements to install Amundsen

To install Amundsen, you need the following:

  • A local instance or a cloud VM (for instance, on AWS, Google Cloud, or Azure) running Docker and Docker Compose.
  • Python ≥ 3.6, Node v10 or v12, npm ≥ 6.x.x, and Elasticsearch 6.x — all of these will be installed using Docker Compose.
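
Since everything else runs inside containers, the only tools you need on the host are Docker and Docker Compose. Here is a minimal sketch of a pre-flight check. It assumes the standalone docker-compose binary used by the commands in this guide, and the names REQUIRED_TOOLS and missing_tools are ours, not part of Amundsen.

```python
import shutil

# Host-side tools named in the requirements above. Python, Node, npm, and
# Elasticsearch run inside the containers, so they are not checked here.
# Assumption: the standalone "docker-compose" binary is what you use.
REQUIRED_TOOLS = ("docker", "docker-compose")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]
```

Calling missing_tools() returns an empty list when both tools are on your PATH. Anything it returns needs to be installed before you continue.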

Amundsen Architecture

Amundsen is composed of five components that interact with each other, each serving a different purpose:

  • Frontend: uses React and Flask to provide an interface to the other components of Amundsen. The frontend is accessible at localhost:5000.
  • Search: uses Elasticsearch to index metadata and lets users of the Frontend search it. The Elasticsearch instance behind the search service is accessible at localhost:9200, but there’s no UI for it.
  • Metadata: stores and retrieves metadata using one of several database backends, including neo4j, MySQL on RDS, and Apache Atlas. With the Atlas backend used in this guide, Atlas is accessible at localhost:21000.
  • Databuilder: helps you extract metadata from different data sources and load it into the metadata backend. You can use this library with standalone Python scripts or an orchestrator like Airflow.
  • Common: holds the code shared across all the microservices mentioned above.
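
Once the stack is running, a quick way to see which pieces are up is to try a TCP connection to each port listed above. The sketch below assumes everything runs on localhost. The dictionary keys are just labels we picked for the frontend, search, and metadata ports named in the list.

```python
import socket

# Ports as listed in the component descriptions above. "localhost" assumes
# you run this on the same machine as Docker Compose.
ENDPOINTS = {
    "frontend": ("localhost", 5000),
    "search": ("localhost", 9200),
    "metadata": ("localhost", 21000),
}


def is_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def report(endpoints=ENDPOINTS, probe=is_port_open):
    """Map each endpoint label to whether it currently accepts connections."""
    return {name: probe(host, port) for name, (host, port) in endpoints.items()}
```

An open port only tells you that something is listening. It doesn’t tell you that the service behind it has finished starting.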

You can read more about the original implementation at Lyft on their engineering blog.


How to set up Amundsen?

Follow these steps:

  • Step 1: Clone the Amundsen repository
  • Step 2: Load sample data using the sample data loader scripts
  • Step 3: Clean existing Docker
  • Step 4: Increase Docker Engine memory beyond the default

Each step is explained in detail below:

Step 1: Cloning the Amundsen Repository

Although you can install the microservices mentioned above in a standalone fashion and configure them to talk to each other, the easiest way to set up Amundsen is by using Docker Compose to do all that work for you. Let’s first clone the Amundsen repository on your local instance or the newly created VM.

git clone --recursive https://github.com/amundsen-io/amundsen.git

You’ll find several docker-amundsen-*.yml files in the root directory of the repository. The main difference between these YAML files is the backend used for the metadata service.
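
If you want to see which variants your checkout ships with, a small sketch like the one below lists the files matching that pattern and the suffix of each. The function name compose_backends is ours. Treating the suffix as a hint about the backend is an assumption based on the file naming, not something the repository guarantees.

```python
from pathlib import Path


def compose_backends(repo_root):
    """Map each docker-amundsen-*.yml file name in `repo_root` to its suffix."""
    prefix = "docker-amundsen-"
    backends = {}
    for path in sorted(Path(repo_root).glob(prefix + "*.yml")):
        # For example, "docker-amundsen-atlas.yml" maps to "atlas".
        backends[path.name] = path.stem[len(prefix):]
    return backends
```

Run it as compose_backends("amundsen") from the directory where you cloned the repository.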

As mentioned earlier, you can choose from a host of databases to store and retrieve your metadata. The original and default option is neo4j, but it only allows for pull-based metadata collection.

On the other hand, Apache Atlas allows for both push-based (via Atlas Hive Hook) and pull-based metadata collection. Atlas has many other benefits as it is a full-fledged data governance service that allows you to view and edit your metadata using the Atlas UI.

Read more about Apache Atlas here.


For the reasons mentioned above, we’ll use Apache Atlas as the backend. Let’s install Amundsen using the following command:

docker-compose -f docker-amundsen-atlas.yml up

Following the command, you will see a screen similar to the one shown below, where all the microservices are deployed:

Install Amundsen using Docker Compose.

The installation process will take a while to complete. You can check the status of the individual services by running the docker ps command as shown below:

Check the status of Docker containers for all the microservices.
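
If you’d rather check the container status from a script than read the docker ps table, here is a rough sketch. It relies on the --format option of docker ps with the Names and Status fields. The helper names are ours, and the container names in your output will depend on the compose file you used.

```python
import subprocess


def parse_docker_ps(output):
    """Turn tab-separated "name<TAB>status" lines into a {name: status} dict."""
    statuses = {}
    for line in output.splitlines():
        if not line.strip():
            continue
        name, _, status = line.partition("\t")
        statuses[name] = status
    return statuses


def container_statuses():
    """Ask Docker for the running containers (requires the docker CLI)."""
    result = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}\t{{.Status}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_docker_ps(result.stdout)
```

Keeping the parsing separate from the docker call makes it easy to try the parser on saved output first.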

Although the containers come up quickly, as the docker ps command shows, you won’t see anything if you visit the Amundsen UI right away because Atlas takes a while to start for the first time. You’ll see the following screen while Atlas is starting:

The last leg of the installation — Apache Atlas takes some time to start.
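
Instead of refreshing the browser while Atlas starts, you can poll until it answers. The sketch below is one way to do that with the standard library. The retry counts are arbitrary defaults of ours, and treating any successful HTTP response as "ready" is only a rough signal.

```python
import http.client
import time
import urllib.request


def http_ok(url, timeout=5.0):
    """True if `url` answers with a success or redirect status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 400
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error status, or a broken response.
        return False


def wait_until(probe, attempts=60, delay=10.0, sleep=time.sleep):
    """Call `probe()` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

For example, wait_until(lambda: http_ok("http://localhost:21000")) returns True as soon as the Atlas port used in this guide responds, or False if it never does within the allotted attempts.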


Step 2: Loading Sample Data

Before logging into the Amundsen UI, you must load some sample data into the backend.

Pro tip: Although the documentation says you don’t need to load sample data if you’re installing Amundsen with the Atlas backend, that doesn’t quite work in practice. You might end up with an empty Amundsen screen with no data loaded. To work around this, you can use one of the pre-built scripts to load sample data into the backend.


Using Standalone Python Scripts

You can use the databuilder library with standalone Python scripts or with orchestration engines like Airflow to keep the metadata updated on a schedule. Think of it as a library that supports an ETL job which extracts, transforms, and loads metadata from a data source.

For simplicity’s sake, let’s use one of the sample data loader scripts to load the sample data into the metadata database. For loading data into Atlas, use the sample_data_loader_atlas.py file as follows:

cd amundsen/databuilder
python3 -m venv venv
source venv/bin/activate
pip3 install --upgrade pip
pip3 install -r requirements.txt
python3 setup.py install
python3 example/scripts/sample_data_loader_atlas.py
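
To make the "ETL for metadata" idea concrete, here is a toy sketch of the three stages. It does not use the databuilder API. It reads column metadata from an SQLite database as a stand-in source and writes it into a plain dict as a stand-in backend, so every name in it is illustrative.

```python
import sqlite3


def extract(connection):
    """Yield (table, column, type) rows from SQLite's own catalog."""
    tables = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info returns (cid, name, type, notnull, dflt_value, pk).
        for _, column, col_type, *_ in connection.execute(
            f'PRAGMA table_info("{table}")'
        ):
            yield table, column, col_type


def transform(rows):
    """Group the flat rows into one metadata record per table."""
    records = {}
    for table, column, col_type in rows:
        record = records.setdefault(table, {"name": table, "columns": []})
        record["columns"].append({"name": column, "type": col_type})
    return list(records.values())


def load(records, store):
    """Write each record into `store`, a dict standing in for the metadata backend."""
    for record in records:
        store[record["name"]] = record
    return store
```

The sample loader scripts follow the same shape at a larger scale: they read metadata from a source, reshape it, and publish it to the backend and the search index.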

Elasticsearch and Atlas

The script mentioned above will not only load data into Atlas but also create the three default indexes on the Elasticsearch instance: table_search_index, user_search_index, and dashboard_search_index. This script will take a few minutes to execute.

Populate Elasticsearch indexes.
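
To confirm that the three names exist after the script finishes, you can ask Elasticsearch directly. The sketch below uses the _cat/indices and _cat/aliases endpoints with format=json. It checks both because, depending on how the loader publishes data, these names may show up as aliases rather than concrete indexes (an assumption worth verifying on your instance). The helper names are ours.

```python
import json
import urllib.request

EXPECTED_NAMES = ("table_search_index", "user_search_index", "dashboard_search_index")


def missing_names(cat_indices, cat_aliases, expected=EXPECTED_NAMES):
    """Return expected names found neither as an index nor as an alias."""
    present = {entry["index"] for entry in cat_indices}
    present.update(entry["alias"] for entry in cat_aliases)
    return [name for name in expected if name not in present]


def fetch_cat(kind, base_url="http://localhost:9200"):
    """Fetch _cat/indices or _cat/aliases (pass kind="indices" or "aliases") as JSON."""
    url = f"{base_url}/_cat/{kind}?format=json"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)
```

With the stack running, missing_names(fetch_cat("indices"), fetch_cat("aliases")) should return an empty list once the loader has completed.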

Once this script completes successfully, you can rest assured that data has been pushed to both Elasticsearch and Atlas and is ready to be queried from both the Atlas and Amundsen frontends. You can access the data via the Amundsen UI on localhost:5000 as shown below:

Amundsen Table Navigation.

You can also access your metadata using the Atlas UI at localhost:21000, as shown below:

Apache Atlas with the legacy frontend.


Common Installation Issues

Aside from the installation issues mentioned in the official documentation, a few issues might come up depending on where you are trying to install Amundsen. We’ll look at some of those issues and their resolutions.


Step 3: Clean Existing Docker

To avoid preventable errors such as ERROR: readlink /var/lib/docker/overlay2: invalid argument, clean up unused Docker resources with docker system prune. This removes stopped containers, unused networks, dangling images, and the build cache, and reclaims the space they occupied. By default, volumes aren’t removed, to prevent you from deleting important data accidentally. If you don’t have any important data, you can run docker volume prune to remove the unused volumes as well. Your Docker will be fresh as a daisy now.
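
If you clean up often, you may want to script it. The sketch below builds the two commands described above and makes the volume cleanup an explicit opt-in. The --force flag skips Docker’s confirmation prompt, and the function names are ours.

```python
import subprocess


def prune_commands(include_volumes=False):
    """Return the cleanup commands described above as argument lists."""
    commands = [["docker", "system", "prune", "--force"]]
    if include_volumes:
        # Destructive: only opt in when no volume holds data you need.
        commands.append(["docker", "volume", "prune", "--force"])
    return commands


def run_all(commands, runner=subprocess.run):
    """Run each command in order and stop at the first failure."""
    for command in commands:
        runner(command, check=True)
```

Calling run_all(prune_commands()) performs the default cleanup. Pass include_volumes=True only when you are sure nothing in your volumes matters.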

Step 4: Increase Docker Engine Memory

If you’re installing this on your local machine (for instance, a Mac), you might want to increase the Docker Engine memory above the default 2GB because the containers started by Docker Compose might run out of memory.

Adjust memory allocation settings in Docker Preferences to prevent memory issues with Docker Compose.
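
You can also check the memory available to Docker from a script. The sketch below reads the MemTotal field from docker info, which is reported in bytes, and compares it with the 2GB default mentioned above. The threshold constant and function names are ours, and how much memory is enough will depend on your machine.

```python
import subprocess

DEFAULT_LIMIT_BYTES = 2 * 1024**3  # the 2GB default mentioned above


def has_more_than_default(mem_total_bytes, default=DEFAULT_LIMIT_BYTES):
    """True when Docker has been given more memory than the default."""
    return mem_total_bytes > default


def docker_memory_bytes():
    """Read the total memory visible to the Docker engine (requires the docker CLI)."""
    result = subprocess.run(
        ["docker", "info", "--format", "{{.MemTotal}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())
```

If has_more_than_default(docker_memory_bytes()) returns False, raise the memory limit in Docker’s settings before running Docker Compose again.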


Next Steps

Now that you have set up Amundsen, you can explore the features with the loaded sample data. To get a better sense of the product and see some real data, you’ll need to connect one or more of your data sources. Amundsen also allows you to preview the data stored in the tables while searching through the metadata database.

If you don’t want to go through this entire process and would rather browse the Amundsen experience quickly, we’ve set up a sample demo environment for you. Feel free to explore:

Click to try Amundsen