A Guide to Configure and Set up Amundsen on GCP (Google Cloud Platform)

Coping with varied data sources and structures is often grueling for data engineering teams, and finding what's in a data warehouse or data lake is harder still for business and analytics teams. Metadata catalogs and data discovery engines help solve these problems. Amundsen is one such open-source metadata catalog. It helps you get more out of your data.
This step-by-step guide takes you through setting up Amundsen on GCP (Google Cloud) using Docker. You'll use Amundsen with its default database backend, Neo4j. Alternatively, Apache Atlas can serve as the backend.
Steps to set up Amundsen on GCP #
- Create a Google Cloud VM
- Configure networking to enable public access to Amundsen
- Log in to the Cloud VM with Cloud Shell and install Git
- Install Docker and Docker Compose on your GCP VM
- Clone the Amundsen GitHub repository
- Deploy Amundsen using Docker Compose on GCP
- Load sample data using Databuilder
Step 1: Create a Google Cloud VM #
Start by logging into your Google Cloud account. You’ll be installing Amundsen on a fresh instance, so go ahead and spin up a new VM instance, as shown in the image below:
Launch a new GCP Cloud VM instance
Once you finish configuring your instance, you’ll be able to get the instance details by clicking on the named link from the list of instances, as shown in the image:
View GCP Cloud VM instance details
Following the link takes you to the page below, which shows details such as Instance Id, Zone, Machine type, and so on.
View information about the GCP Cloud VM instance
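If you prefer the command line to the console, a comparable instance can be created with the gcloud CLI. The following is a minimal sketch, not the exact configuration pictured above: the instance name, zone, machine type, image family, and disk size are all assumed values, so substitute your own. It prints the commands by default; set `DRY_RUN=0` to actually run them.

```shell
# Assumed values for illustration; none of these come from the guide's screenshots.
INSTANCE_NAME="amundsen-vm"
ZONE="us-central1-a"
MACHINE_TYPE="e2-standard-2"

# Dry run by default: print each gcloud command instead of executing it.
if [ "${DRY_RUN:-1}" = "1" ]; then GCLOUD="echo gcloud"; else GCLOUD="gcloud"; fi

# Create a Debian VM to host the Amundsen containers.
$GCLOUD compute instances create "$INSTANCE_NAME" \
  --zone="$ZONE" \
  --machine-type="$MACHINE_TYPE" \
  --image-family=debian-11 \
  --image-project=debian-cloud \
  --boot-disk-size=30GB

# Show the instance details (ID, zone, machine type, and so on).
$GCLOUD compute instances describe "$INSTANCE_NAME" --zone="$ZONE"
```

Keeping the dry run as the default is a deliberate choice here, because creating an instance starts billing.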
Step 2: Configure networking to enable public access to Amundsen #
Set inbound & outbound traffic rules #
Because this guide focuses on getting you started with Amundsen, you can allow all ingress and egress traffic for the VPC associated with your VM. Allowing all traffic is not recommended for production. To enable it, open the VPC network and select the Firewall option in the left panel to see the list of firewall rules. A few rules might already be present. Remove them and add the two rules shown in the image below:
Set inbound & outbound traffic rules
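For anything beyond a quick trial, a narrower rule is safer than allowing all traffic. As a hedged sketch, the gcloud command below opens only TCP port 5000 (the port checked in the connectivity test later in this step) to a single source address. The rule name, the `default` network, and the example IP are assumptions; adjust them to your project.

```shell
RULE_NAME="allow-amundsen-frontend"   # hypothetical rule name
NETWORK="default"                     # assumes the VM is on the default VPC network
SOURCE_RANGE="203.0.113.10/32"        # documentation-range placeholder; use your own IP

# Dry run by default: print the gcloud command instead of executing it.
if [ "${DRY_RUN:-1}" = "1" ]; then GCLOUD="echo gcloud"; else GCLOUD="gcloud"; fi

# Allow inbound TCP traffic on port 5000 from one address only.
$GCLOUD compute firewall-rules create "$RULE_NAME" \
  --network="$NETWORK" \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:5000 \
  --source-ranges="$SOURCE_RANGE"
```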
Verify networking details #
Use the kebab (three vertical dots) menu for your instance to navigate to the View network details option, as shown in the image below:
Verify networking details
Analyze the network using connectivity tests #
Because you've allowed traffic on all ports, reaching the VM from the internet will not be a problem. However, if you decide to limit incoming and outgoing traffic, you can navigate to the VPC network and create a connectivity test in the Network analysis section shown below:
Create a connectivity test in the Network analysis
In the example above, the source is an arbitrary public IP address from the internet, and the test checks whether requests from that IP can reach your VM on port 5000. You can view the result summary under the Last configuration analysis result column. You can also view the details of the connectivity test by clicking the VIEW link, as shown in the image below:
View connectivity test results
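A similar connectivity test can also be created from the command line. The sketch below uses the flag names as I understand them from the gcloud reference, so confirm them with `gcloud network-management connectivity-tests create --help` before relying on it. The test name, project ID, zone, instance name, and source IP are hypothetical placeholders.

```shell
TEST_NAME="amundsen-port-5000"   # hypothetical test name
PROJECT_ID="my-project"          # hypothetical project ID
ZONE="us-central1-a"             # assumed zone
INSTANCE_NAME="amundsen-vm"      # assumed instance name
SOURCE_IP="203.0.113.10"         # documentation-range placeholder for a public IP

# Connectivity tests identify the destination VM by its resource path.
DEST_INSTANCE="projects/$PROJECT_ID/zones/$ZONE/instances/$INSTANCE_NAME"

# Dry run by default: print each gcloud command instead of executing it.
if [ "${DRY_RUN:-1}" = "1" ]; then GCLOUD="echo gcloud"; else GCLOUD="gcloud"; fi

# Check whether TCP traffic from SOURCE_IP can reach the VM on port 5000.
$GCLOUD network-management connectivity-tests create "$TEST_NAME" \
  --source-ip-address="$SOURCE_IP" \
  --destination-instance="$DEST_INSTANCE" \
  --destination-port=5000 \
  --protocol=TCP

# Read back the analysis result.
$GCLOUD network-management connectivity-tests describe "$TEST_NAME"
```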
Step 3: Log in to Google Cloud VM and install Git #
Connect to the Google Cloud VM #
There are several ways to interact with your VM. The simplest is Cloud Shell, as it doesn't require setting any passwords or managing SSH keys. You can connect using SSH in your browser by pressing the SSH link or by choosing one of the options in the dropdown shown in the image below:
Connect to the GCP Cloud VM
Google Cloud creates temporary SSH keys and transfers them to your VM, enabling you to access it:
Transfer SSH keys to VM
Once Google Cloud transfers your SSH keys to the VM, you will land at the following screen inside your VM:
Accessing your GCP cloud VM using CLI
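If you would rather connect from a terminal where the gcloud CLI is installed, `gcloud compute ssh` also takes care of generating and propagating SSH keys. This is a minimal sketch; the instance name and zone are assumed values, and the command is only printed unless you set `DRY_RUN=0`.

```shell
INSTANCE_NAME="amundsen-vm"   # assumed instance name
ZONE="us-central1-a"          # assumed zone

# Dry run by default: print the gcloud command instead of executing it.
if [ "${DRY_RUN:-1}" = "1" ]; then GCLOUD="echo gcloud"; else GCLOUD="gcloud"; fi

# Open an SSH session to the VM.
$GCLOUD compute ssh "$INSTANCE_NAME" --zone="$ZONE"
```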
Install Git #
Because this is a completely fresh installation, it won't have many of the standard tools you might use. You first need to install Git on the machine. Update the package index with Debian's apt package manager, then install Git, using the following commands:
$ sudo apt update
$ sudo apt install git
Verify that Git has been installed correctly by checking its version.
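One way to run that check is sketched below. `check_tool` is a hypothetical helper introduced only for this example; running `git --version` directly works just as well.

```shell
# Report whether a command is available on PATH.
check_tool() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "$1: installed"
  else
    echo "$1: missing"
    return 1
  fi
}

# Print the Git version only when Git is actually present.
if check_tool git; then
  git --version
fi
```

The same helper can be reused later for other tools, for example `check_tool docker`.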
Step 4: Install Docker and Docker Compose on your GCP VM #
Install Docker Engine #
Amundsen runs in Docker containers, so you first need to install Docker Engine on your Google Cloud VM. Because Amundsen is a multi-container application, Docker Compose will also be handy. Start by updating the apt-get package index and installing the prerequisite tools with the following commands:
$ sudo apt-get update
$ sudo apt-get install ca-certificates curl gnupg lsb-release
$ curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /usr/share/keyrings/docker-archive-keyring.gpg