Pinterest Querybook 101: A Step-by-Step Tutorial and Explainer for Mastering the Platform's Analytics Tool

Last Updated on: May 05th, 2023, Published on: May 05th, 2023

Querybook is an open-source, flexible, and intuitive SQL-based data analysis and visualization application. It was designed by Pinterest for SQL querying and managing data workflows.

This article aims to provide readers with a comprehensive understanding of Querybook’s features, performance, integrations, and ease of use. It explains how Querybook can be used to streamline data analysis and visualization processes.

What is Querybook? #

Querybook is a web-enabled data analysis tool that allows users to convert SQL queries into DataDocs (i.e., natural language reports and graphs).

With Querybook, you can analyze large volumes of data, visualize insights, and create customizable dashboards for streamlined data analysis.

Here’s how Oliver John Bright, a Quantitative Researcher at GrandRounds, puts it:

“I know I am so incredibly late to the party, but I made my first DataDoc yesterday and I think I’m in love…”

Querybook’s notebook interface. Source: Data Science Notebook.

As one of the highest-rated internal tools at Pinterest, Querybook overcomes the shortcomings of other querying interfaces, such as:

Limited functionality: Unlike interfaces like PostgreSQL Administrator, Querybook offers extensive functionality suitable for both beginner and experienced users. The platform comes equipped with features that simplify complex queries for easy understanding.
Ineffective visualizations: As opposed to other interfaces, such as pgAdmin or DBeaver Community Edition, Querybook comes with intuitive charting capabilities that make it easier for you to understand and interpret your data.
Lacklustre collaborative features: Unlike popular querying tools like Navicat Premium Essentials and HeidiSQL, Querybook offers robust collaborative elements, ensuring seamless teamwork while tackling intricate projects. Moreover, you get immediate updates when alterations are made to shared content.
Steep learning curve: Compared to competitors like MySQL Workbench, Querybook is easy to use and navigate. So, you spend less time trying to get acquainted with its functionalities.

Querybook origins: From an intern project in 2017 to a full-fledged open-source, next-gen IDE by 2021 #

Querybook started in 2017 as an internship project within the Data Engineering team at Pinterest to build a query UI.

However, experienced professionals such as Lena Ryoo (Software Engineer, Analytics Platform), Charles Gu (Engineering Manager, Analytics Platform), Justin Mejorada-Pier (Head of the Analytics Platform), and Dave Burgess (Head of Data Engineering) identified its potential to be a solution for effective data management.

By 2018, Querybook had become the official solution for big data querying at Pinterest.

Charles Gu explains the team’s motivations behind developing Querybook:

“We built Querybook to provide a responsive and simple web UI for such analysis so data scientists, product managers, and engineers can discover the right data, compose their queries, and share their findings.”

In 2021, Pinterest open-sourced Querybook to transform it into a community-driven project. The goal is to build a tool that “brings together the power of metadata with the simplicity of a note-taking app for a better querying interface, where teams can compose queries and write analyses all in one place.”

The contributions from various developers and organizations have helped to improve the platform’s functionality, scalability, and customization options.

Co-founder at Lowdefy Gerrie van Wyk’s observations on open-sourcing Querybook. Source: Twitter.

Initially concentrating on optimizing financial reporting processes, Querybook’s first iterations successfully addressed common complaints about complicated spreadsheet upkeep.

Encouraged by favorable feedback, further advancements followed, expanding horizontally and vertically.

Later in 2021, the platform branched out from its original scope, catering to a broader range of functions essential to modern organizations.

By implementing robust self-service functionality, intelligent document generation methods, and broad compatibility options, Querybook transformed into a multifunctional toolset ideal for countless scenarios throughout varying industries.

Let’s look at some use cases across industries:

The healthcare industry can use Querybook to track illnesses and treatment outcomes based on factors such as age and location.
Marketers may make better decisions by analyzing large sets of structured and unstructured data about customers, product preferences, and buying histories.
In the financial sector, Querybook can support analysts in examining markets and trades, finding new opportunities, and detecting fraudulent activities through pattern detection.
Scientists working in environmental science could gather and process vast amounts of climatic data from sensors spread around the Earth.
Government agencies like census bureaus can conduct surveys and collect information via modern methods like web scraping or API connections rather than going door-to-door.

Staff data engineer at Zendesk, Mahdi Karabiben, finds Querybook to be the next-level IDE. Source: Twitter.

Handling large volumes of data, however, depends on a robust architecture, which makes the architecture critical to the tool’s success.

So, let’s look at Querybook’s architecture.

Querybook architecture: Key components and infrastructure setup #

Querybook’s architecture defines how its primary components interact to provide users with data exploration and visualization capabilities. According to Pinterest’s Head of Data Engineering, Dave Burgess, Querybook is a human-friendly IDE.

The infrastructure on which Querybook runs supports these components and enables them to operate efficiently and securely.

The three primary components are the web server, the worker, and the scheduler. Let’s delve into the specifics of each component:

Web server: The web server is responsible for handling HTTP requests, sending and receiving WebSocket (i.e., a bi-directional, real-time communication protocol for web browsers) messages, and serving static assets for the Querybook web interface.
Worker: The worker component executes long-running queries and scheduled DataDocs. It also performs auxiliary tasks, such as updating Elasticsearch (i.e., an open-source search engine) documents and analyzing query lineage (i.e., capturing the complete history of a query).
Scheduler: The scheduler reads the task schedule from the database and sends it to the Celery workers (i.e., task executing processes within a Celery distributed task queue). Its main function is to ensure that scheduled tasks are executed efficiently and on time.
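
To make the division of labor concrete, here is a minimal Python sketch of the handoff between the scheduler and a worker described above. It is not Querybook’s actual code: a standard-library queue.Queue stands in for the Celery task queue, and the task names and the run_scheduler and run_worker helpers are hypothetical.

```python
import queue
import threading

# Stand-in for the task schedule that the scheduler reads from the database.
# The task names below are hypothetical.
SCHEDULE = [
    {"name": "run_scheduled_datadoc", "args": {"doc_id": 1}},
    {"name": "update_search_index", "args": {"table": "sales"}},
]


def run_scheduler(schedule, task_queue):
    """Scheduler role: read the schedule and hand each task to the queue."""
    for task in schedule:
        task_queue.put(task)
    task_queue.put(None)  # sentinel: no more tasks


def run_worker(task_queue, results):
    """Worker role: pull tasks off the queue and execute them."""
    while True:
        task = task_queue.get()
        if task is None:
            break
        # A real worker would run a query or refresh a search document here.
        results.append(f"done: {task['name']}")


task_queue = queue.Queue()
results = []
worker = threading.Thread(target=run_worker, args=(task_queue, results))
worker.start()
run_scheduler(SCHEDULE, task_queue)
worker.join()
print(results)
```

Keeping the scheduler and the worker as separate roles means that slow tasks never block the process that decides what should run next.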

Schematic representation of Querybook architecture. Source: Querybook.

The infrastructure that supports these Querybook components includes a database (like MySQL), a search interface (Elasticsearch), remote storage, and a data management solution (like Redis).

Let’s look into the specifics:

The database

At its core, Querybook is a tool for querying databases, so it relies heavily on database infrastructure to function. This component is responsible for storing data related to DataDocs, query execution history, and other data required by Querybook.

While any SQLAlchemy-compatible database can be used, Querybook recommends using MySQL.
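
As a rough illustration of the kind of bookkeeping this database does, the sketch below records query executions and reads the history back. It uses Python’s built-in sqlite3 module only to stay self-contained (the article recommends MySQL), and the table and column names are hypothetical rather than Querybook’s real schema.

```python
import sqlite3

# sqlite3 keeps the example self-contained; the article recommends MySQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE query_execution (
        id INTEGER PRIMARY KEY,
        query_text TEXT NOT NULL,
        status TEXT NOT NULL
    )
    """
)


def record_execution(conn, query_text, status):
    """Store one query run and return its generated id."""
    cur = conn.execute(
        "INSERT INTO query_execution (query_text, status) VALUES (?, ?)",
        (query_text, status),
    )
    conn.commit()
    return cur.lastrowid


def execution_history(conn):
    """Return all recorded runs, newest first."""
    rows = conn.execute(
        "SELECT id, query_text, status FROM query_execution ORDER BY id DESC"
    )
    return rows.fetchall()


record_execution(conn, "SELECT 1", "DONE")
record_execution(conn, "SELECT * FROM missing_table", "ERROR")
history = execution_history(conn)
print(history)
```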

Elasticsearch

Elasticsearch indexes metadata about your data sources so that you don’t end up with unconnected silos of analytics.

Querybook does so by providing an elastic metadata layer on top of various relational databases, files, and cloud storage services that have been cataloged according to a common schema that is flexible enough for modern analytics.

Remote storage

A significant component of Querybook’s framework revolves around remote storage solutions due to its capability to manage extensive amounts of data. These resources not only provide convenience but also help maintain cost competitiveness. As a result, organizations gain flexibility in accessing information across geographical boundaries, promoting efficient collaboration practices.

Data management (Redis)

A key aspect of managing a large number of asynchronous tasks within Querybook involves utilizing the power of Redis. Specifically, this tool serves two main purposes:

Storing cached versions of live queries for parallelized editing amongst multiple team members working together
Facilitating communication between separate instances of Querybook

By leveraging this in-memory data management system, teams gain quick access to shared query definitions while enabling smooth interactions.
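
The two roles above, caching and messaging between instances, can be sketched with a toy in-memory stand-in. The InMemoryBroker class, the key names, and the channel names below are all hypothetical. A real deployment would talk to Redis itself, and this sketch does not use the Redis client API.

```python
from collections import defaultdict


class InMemoryBroker:
    """Toy stand-in for Redis: a key-value cache plus publish/subscribe."""

    def __init__(self):
        self.cache = {}
        self.subscribers = defaultdict(list)

    def set(self, key, value):
        self.cache[key] = value

    def get(self, key):
        return self.cache.get(key)

    def subscribe(self, channel, callback):
        self.subscribers[channel].append(callback)

    def publish(self, channel, message):
        """Deliver the message to every subscriber; return how many got it."""
        for callback in self.subscribers[channel]:
            callback(message)
        return len(self.subscribers[channel])


broker = InMemoryBroker()
received = []

# A second server instance listens for edits to DataDoc 42 (hypothetical id).
broker.subscribe("datadoc:42", received.append)

# The first instance caches the live query text, then announces the change.
broker.set("datadoc:42:cell:1", "SELECT COUNT(*) FROM sales")
delivered = broker.publish("datadoc:42", "cell 1 updated")

print(broker.get("datadoc:42:cell:1"), received, delivered)
```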

Querybook’s architecture in action #

Imagine you want to find data from many different sources — databases, spreadsheets, and more.

You could use Querybook to ask questions (like “How many sales did we make last year?”) and search across all of your data assets.

With Querybook, you can:

Write your questions in a specific way (by creating a DataDoc)
Send your questions to special computers (workers)
Use these computers to talk to big programs (database engines) and look for the data you need
See the answers you were looking for with the help of the metadata store and/or an output
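
The flow above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Querybook’s implementation: a DataDoc is modeled as a plain list of cells, sqlite3 stands in for the query engine, the run_datadoc helper is hypothetical, and the sales figures are made up.

```python
import sqlite3

# Stand-in query engine with a small sales table (made-up data).
engine = sqlite3.connect(":memory:")
engine.execute("CREATE TABLE sales (year INTEGER, amount INTEGER)")
engine.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [(2022, 100), (2022, 250), (2023, 75)],
)

# A DataDoc modeled as an ordered list of text and query cells.
datadoc = [
    {"type": "text", "content": "How many sales did we make last year?"},
    {"type": "query", "content": "SELECT COUNT(*) FROM sales WHERE year = 2022"},
]


def run_datadoc(doc, engine):
    """Worker role: execute each query cell and collect its rows."""
    outputs = []
    for cell in doc:
        if cell["type"] == "query":
            outputs.append(engine.execute(cell["content"]).fetchall())
    return outputs


outputs = run_datadoc(datadoc, engine)
print(outputs)
```

Text cells are skipped at execution time because they only carry the narrative around the queries.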

Querybook: A tool that’s a blend of a data catalog, a notebook, and a collaboration SQL editor. Source: Twitter.

What are the key features of Querybook? #

The eight key features of Querybook are:

Collaborative querying
Advanced SQL editor
Query execution history
Data visualization
Powerful query engine
Customizability and extensibility
Data discovery and collaboration
Integration with other tools in your data stack

Collaborative querying #

Querybook’s real-time collaboration features make it easy for multiple users to work on queries together, with changes being propagated in real-time. Users can easily share queries and collaborate seamlessly, speeding up workflows and enabling faster insights.

Searching query executions in Querybook. Source: Querybook.

Advanced SQL editor #

With Querybook’s advanced SQL editor, writing queries is simpler. Syntax highlighting distinguishes valid code from invalid code, so you can catch mistakes before running a query.

You’re able to finish scripts much quicker by using intelligent autocomplete and custom keyboard shortcuts. Since there might be many programmers operating on similar big data infrastructure, Querybook’s multi-language offering guarantees everyone remains productive.

Querybook editor’s intelligent autocomplete. Source: React.js Examples.

Query execution history #

Querybook’s UI keeps track of query executions, so you can find past runs and their results without digging through history by hand.

If you want to look at what has happened previously, then sharing queries (and their respective outcomes) could help you and your team. Increased visibility into historical projects means that teams can now collaborate effectively and keep ideas moving forward.

Query execution in Querybook. Source: React.js Examples.

Data visualization #

Querybook enables you to generate graphics and tables that show real-time trends, relationships, and patterns through dynamic updates. This makes it easier to share outputs with others and provides greater transparency during review cycles when seeking clarification or feedback.

Additionally, you can export data to compatible software applications. This makes examining and interpreting data less laborious, ultimately resulting in more informed decisions based on accurate and well-understood datasets.

Chart configurations in Querybook. Source: Querybook.

Powerful query engine #

With its ability to seamlessly integrate with popular databases (like Presto, Hive, and SparkSQL), Querybook offers unparalleled flexibility and power.

Whether you need to run simple or complex queries, Querybook ensures that your data analysis process runs smoothly and efficiently. And thanks to advanced optimization algorithms, it can process even massive datasets.

Query engine of Querybook. Source: Querybook.

Customizability and extensibility #

Querybook offers a high degree of customizability, including plugin integrations that streamline specific tasks and UI configurations tailored to your team’s preferences.

Furthermore, its code base benefits from widespread community contributions. For example, members of the community have identified bugs and made feature requests that have been incorporated into Querybook’s code base.

In one such case, a community member who used Querybook’s DataDoc Run All feature to execute multiple queries at once asked for a way to customize the notification preferences for that feature.

Customizing the landing page in Querybook. Source: Querybook.

Data discovery and collaboration #

Querybook provides a metadata store that enables discovery and search for data sources and tables, leading to quicker insight generation for users.

Furthermore, it supports effective collaboration among users, streamlining query processing and swiftly delivering findings. These features include:

Shared queries
Comments and annotations
Query version control
Query history
Access control

Users can save and share queries, leave comments and annotations, track query changes, view query history, and set access control options for team members.
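
To illustrate what keyword search over a metadata store involves, here is a minimal sketch. The table entries and the search_tables helper are hypothetical, and a plain substring match stands in for a real search index.

```python
# Hypothetical table metadata, as a metadata store might hold it.
TABLES = [
    {"name": "sales.orders", "description": "One row per customer order"},
    {"name": "sales.refunds", "description": "Refunded orders"},
    {"name": "marketing.campaigns", "description": "Ad campaign spend"},
]


def search_tables(tables, keyword):
    """Return names of tables whose name or description mentions the keyword."""
    needle = keyword.lower()
    return [
        t["name"]
        for t in tables
        if needle in t["name"].lower() or needle in t["description"].lower()
    ]


matches = search_tables(TABLES, "order")
print(matches)
```

Searching descriptions as well as names is what lets a table such as sales.refunds surface for a query about orders.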

Installing the metastore plugin. Source: Querybook.

Integration with other tools in your data stack #

Querybook’s versatility extends beyond its query engine and customizable interface.

Querybook integrates with other tools, such as Jupyter Notebook, PowerBI, and Excel, so that you can access your data from a variety of sources without needing to switch between different platforms.

As a result, you can work using your preferred tools and collaborate effectively with your team without leaving Querybook.

Transpile queries between Presto and SparkSQL. Source: Querybook.

What are the key benefits of using Querybook? #

The main benefits of using Querybook are:

Time savings
Quicker insights and analysis
Error reduction
Improved collaboration and scalability
Enhanced security control
Real-time insights for decision-making
Quick deployment and adaptation
High flexibility

Time savings #

Querybook automates repetitive tasks and reduces the manual effort spent on tedious activities, freeing up valuable hours per week that can be allocated towards higher value work, such as:

Prioritizing analyzing and interpreting data: Automation frees up time for data analysts to delve deeper into data, uncover hidden insights, and inform better decisions.
Building and testing complex queries: Efficiently construct and fine-tune queries using Querybook’s powerful features, allowing data teams to optimize data quality and query performance.
Developing new data projects: Leverage spare capacity gained by automation to embark upon ambitious data initiatives like pipeline construction, ML implementation, and KPI tracking.
Enhancing cross-functional collaboration: Foster tighter working relationships among diverse stakeholders through easier collaboration with colleagues and partners alike.
Sharpening data governance and security measures: Implement stringent rules and protocols, and safeguard sensitive information via Querybook’s various connectors, permission structures, and monitoring utilities.

With Querybook, you are able to build, test, and deploy queries faster than ever before, enabling you to get insights in a fraction of the time it would take with traditional Business Intelligence (BI) tools.

Quicker insights and analysis #

With Querybook, you can instantly pivot tables and charts with drag & drop interactions, giving you a real-time view of your data. This eliminates the need to spend time pulling together individual reports before being able to start working with real data.

Querybook lets you quickly analyze data and get insights that can be used to make informed decisions. For example, marketing teams can effortlessly examine sales figures and pinpoint successful items and marketing strategies.

Additionally, the drag & drop functionality can be used to create custom data models or machine learning models by combining data from multiple sources.

For instance, a data scientist can use the drag & drop feature to easily combine data from different databases and use it to train a machine learning model to make predictions or generate recommendations.

Error reduction #

By minimizing human involvement in reporting processes, Querybook significantly lowers the likelihood of mistakes creeping into analyses. This is because Querybook automates many of the data processing steps.

For instance, in addition to automatically handling tasks that could lead to human error, Querybook provides built-in quality checks before any dataset leaves its system.

These precautions reduce the chance of incorrect information entering the workflow, further decreasing the likelihood of errors leading to unreliable conclusions and decisions based on bad data.

Improved collaboration and scalability #

As everyone works on one common data set inside Querybook, teams can share and build upon each other’s efforts seamlessly. This leads to more efficient and effective collaboration, as well as greater scalability for your data analysis efforts.

Moreover, as business requirements grow, Querybook makes it easy to expand usage to new groups or regions without adding complexity to IT management.

Enhanced security control #

Querybook provides fine-grained access settings to help meet GDPR and other regulatory mandates while still facilitating knowledge exchange.

This means that sensitive data can remain protected even when sharing information with others. By granting access only to specific sheets within DataDocs, you can ensure that your data stays secure.

Real-time insights for decision-making #

Live connectivity provided via WebSocket technology enables you to explore the most recent statistics immediately. With Querybook, you can make faster and more informed decisions based on the most up-to-date information available.

Quick deployment and adaptation #

Querybook delivers functional capabilities rapidly — generally within weeks — thereby allowing organizations to capitalize on their investment faster.

Unlike traditional BI tools that require complex and lengthy implementations, Querybook is quick to deploy and easy to adapt to your changing business needs.

High flexibility #

Thanks to its architecture, Querybook operates alongside existing solutions already in use, which removes the obsolescence concerns (risks related to technology becoming outdated) associated with traditional BI overhauls.

Querybook is highly flexible and can be easily integrated with your existing solutions, allowing you to leverage your existing investments and maximize ROI.

How to use Querybook #

To install Querybook, there are several steps you must take:

Check the prerequisites
Clone the repository
Install dependencies
Configure the database
Build the Docker image
Start Querybook

Check the prerequisites: Before running any setup commands, verify that your system meets the following requirements:
1. Check that Python 3.6+ is installed on your computer. If it isn’t, follow the official Python installation guide to download the latest version of Python.
2. Next, install Node.js and update npm by running npm install -g npm.
3. Additionally, create a MySQL database server instance using a tool such as Docker or Vagrant.
4. Lastly, configure the docker-compose file to start the Querybook application.
Only after these prerequisites have been met should you proceed with installing Querybook.
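
If you prefer to check these prerequisites with a script, the following Python sketch is one way to do it. The check_prerequisites helper is hypothetical, and the list of command names is an assumption based on the tools mentioned above. It only reports whether each executable is on your PATH and does not verify their versions.

```python
import shutil
import sys


def check_prerequisites(commands=("node", "npm", "docker", "docker-compose")):
    """Return a dict mapping each requirement to True or False."""
    status = {"python>=3.6": sys.version_info >= (3, 6)}
    for command in commands:
        # shutil.which returns the executable's path, or None if not on PATH.
        status[command] = shutil.which(command) is not None
    return status


report = check_prerequisites()
for name, ok in report.items():
    print(f"{name}: {'found' if ok else 'missing'}")
```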
Clone the repository: The first step is to clone the Querybook repository from GitHub. You can do this using the following command in your terminal:

  git clone https://github.com/pinterest/querybook.git

Install dependencies: Once you have cloned the repository, navigate to the querybook/webapp directory and run the following command to install the required dependencies:

 npm install

Then, navigate to the querybook/api directory and run the following command to install the required Python dependencies:

 pip install -r requirements.txt

Configure the database: Querybook uses a MySQL database to store its data. You need to create a new database and user for Querybook and then update the configuration file with the appropriate values. You can find the configuration file at querybook/api/querybook/config.py.
Build the Docker image: If you plan to use Querybook in a containerized environment, you need to build the Docker image. Navigate to the root directory of the Querybook repository and run the following command:

  docker-compose build

Start Querybook: You can start Querybook using the following command:

  docker-compose up

This command will start the Querybook web server, worker, scheduler, and other components. You can access Querybook by navigating to http://localhost:3000 in your web browser.
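
For the database configuration step above, the connection details usually end up in a SQLAlchemy-style URL. The sketch below builds one with Python’s standard library. The build_mysql_url helper, the credentials, and the database name are hypothetical, the PyMySQL driver is an assumption, and where the resulting URL belongs in your Querybook configuration is not shown here.

```python
from urllib.parse import quote_plus


def build_mysql_url(user, password, host, database, port=3306):
    """Build a SQLAlchemy-style MySQL URL using the PyMySQL driver."""
    # quote_plus escapes characters such as '@' or '/' in the credentials.
    return (
        f"mysql+pymysql://{quote_plus(user)}:{quote_plus(password)}"
        f"@{host}:{port}/{database}"
    )


url = build_mysql_url("querybook", "p@ss/word", "localhost", "querybook")
print(url)
```

Escaping the password matters because an unescaped @ or / would otherwise be read as part of the host or path.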

Pinterest primarily uses Querybook for data discovery, analysis, and sharing, and it does so by leveraging its collaboration features.

Access Querybook: After setting up Querybook (as specified in the tutorial above), you can go to the Querybook website and join their Slack community.
Connect to data sources: Navigate to the “Connections” tab and select the relevant data source. Querybook supports various data sources, including PostgreSQL, MySQL, BigQuery, and more.
Discover data: After connecting your data sources, you can discover data by browsing through the available tables and views. Alternatively, you can use Querybook’s search feature to find the relevant data quickly.
Analyze data: Once you have identified the relevant data, you can start analyzing it using Querybook’s data visualization tools, filtering, and querying capabilities. Querybook supports various querying languages, such as SQL and Hive.
Collaborate with teams: Querybook enables you to collaborate with your teams by sharing data insights and analysis with other members of the organization. The information shared with other members can be in the form of queries, views, and charts, which can help them make informed decisions based on data insights.
Document analysis processes: Querybook supports the documentation of data analysis processes so that you can keep track of your progress and standardize analysis workflows. You can use Querybook’s customizable workflows and templates to automate and streamline data analysis processes, ensuring consistency across the organization.
Export data: Lastly, Querybook enables you to export your data to various formats such as CSV, JSON, and Excel. This feature allows you to share your data insights with stakeholders outside of the organization or use them for further analysis.
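
As an illustration of the export step, the sketch below serializes a small query result to CSV and JSON with Python’s standard library. The column names and rows are made up, and the to_csv and to_json helpers are hypothetical. They show the general idea rather than Querybook’s own exporter.

```python
import csv
import io
import json

# Made-up query result: column names plus rows.
columns = ["product", "units_sold"]
rows = [("pins", 120), ("boards", 45)]


def to_csv(columns, rows):
    """Serialize the result as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


def to_json(columns, rows):
    """Serialize the result as a JSON array of objects."""
    return json.dumps([dict(zip(columns, row)) for row in rows])


csv_text = to_csv(columns, rows)
json_text = to_json(columns, rows)
print(csv_text)
print(json_text)
```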

Senior Data Scientist at Fable Kellie Ottoboni on Querybook being Pinterest’s collaborative big data hub. Source: Twitter.

Querybook alternatives: Comparing Querybook with other data analysis tools #

Tool	Installation	Data Structure Abstraction	Multi-tenancy	Cost	Performance	Language	Integrations & Features	Commercial Support	Ease of Use & Scalability
Querybook	Open Source	Yes	Yes	Free	High	Python	GitHub, Slack, Jira	No	User-friendly
Tableau	On-Premises, Cloud	Yes	Yes	Paid	High	Proprietary	Salesforce, AWS, Google Cloud, Microsoft Azure	Yes	Easy to use but can be limited
Power BI	Cloud, On-Premises	Yes	Yes	Paid	High	Proprietary	Microsoft, Salesforce	Yes	User-friendly, scalable
Looker	Cloud, On-Premises	Yes	Yes	Paid	High	SQL	GitHub, Slack, Jira	Yes	User-friendly, scalable
Mode Analytics	Cloud, On-Premises	Yes	Yes	Paid	High	SQL	GitHub, Slack, Jira	Yes	User-friendly, scalable
Databricks	Cloud, On-Premises	Yes	Yes	Paid	High	Scala, Python, R	AWS, Azure, Google Cloud	Yes	User-friendly, scalable
Apache Superset	Open Source	Yes	Yes	Free	High	Python	Redshift, BigQuery, Snowflake, Presto	No	User-friendly, scalable
Metabase	Open Source	Yes	Yes	Free	Medium	SQL	MySQL, PostgreSQL, Redshift, BigQuery, Snowflake, MongoDB	No	User-friendly, but limited
Redash	Open Source	Yes	Yes	Free	High	SQL	GitHub, Slack, Jira, Zapier	No	User-friendly, scalable
Kibana	Open Source	No	Yes	Free	Medium	JavaScript	Elasticsearch, Logstash, Beats	No	User-friendly
Grafana	Open Source	No	Yes	Free	High	Go	Prometheus, Elasticsearch, InfluxDB, Graphite	Yes	User-friendly, scalable

Note: The information provided in this table is subject to change and should be verified on the respective websites of each tool. Table by Atlan

Querybook stands out from the competition because of:

A user-friendly interface
Efficient data processing and analysis
Multi-tenancy capability (i.e., a feature allowing one instance of software to support multiple client environments while isolating them from each other)
An open-source platform
Robust integration with popular tools such as GitHub, Slack, and Jira, making it convenient to use for software development teams

As mentioned in the above table, there are several alternatives to Querybook, such as Looker, Tableau, Power BI, and Grafana. Let’s compare Querybook to the most popular alternative — Looker.

Querybook vs. Looker: What’s the difference? #

To compare Querybook with Looker, we’ll consider their features, performance, integrations, and ease of use.

Features #

Querybook and Looker are both powerful business intelligence (BI) tools that offer similar core functionalities such as dashboarding, reporting, data exploration, and visualization.

However, there are some key differences in their feature sets:

Data modeling: Looker has strong support for modeling complex and dynamic relationships between tables/entities using LookML, its proprietary language. This allows users to build more robust BI models than those possible with just SQL alone. Querybook, on the other hand, relies solely on standard SQL for data modeling, which may limit its flexibility compared to Looker’s approach.
Analysis and visualization: While both platforms allow for interactive exploration of data, Looker emphasizes this capability with stronger drill-down functionality from high-level dashboards down to transactional details. Meanwhile, Querybook places greater focus on streamlined report generation by automatically generating common chart types and layout elements. Ultimately, these differences come down to personal preference and use case.

Performance #

Both Querybook and Looker are built to handle large datasets and scale with growing usage. Performance comparisons can vary depending on the specific hardware configuration and workload scenarios involved.

In general, however, both products have solid track records for handling BI needs within enterprises.

As always, careful planning and monitoring infrastructure resources remain crucial factors when evaluating any software system’s suitability for your environment.

Integrations #

In terms of integration capabilities, both Querybook and Looker rely heavily on open APIs enabling connections to many external systems and databases beyond their native connectors.

These include cloud services like AWS Redshift, Azure Synapse, Google BigQuery, Snowflake, etc., along with popular on-premise relational database vendors like Oracle, Microsoft SQL Server, MySQL, PostgreSQL, etc.

By maintaining relatively small footprints and working primarily via API connectivity, both tools minimize disruptions to existing workflows while maximizing flexibility for client implementations.

Ease of use #

Both platforms have an intuitive interface allowing users to quickly analyze large amounts of data and offer powerful data discovery capabilities, making it easier than ever to gain valuable insights.

Overall, choosing between Querybook and Looker largely depends on individual comfort level with available training materials, demos, support communities, and trial periods.

In summary #

Pinterest’s Querybook is a robust and user-friendly analytics tool that is specifically designed to meet the modern business needs of data analysts and developers. Its drag & drop interface, collaboration features, and machine learning integration make it a powerful asset for any organization looking to make sense of its data.

The article highlights the benefits of using Querybook, including time savings, quicker insights, error reduction, improved collaboration and scalability, enhanced security control, and real-time insights for decision-making. The article also provides instructions on how to use Querybook and compares it with other data analysis tools in the market.
