Open-Source Modern Data Stack: 5 Steps to Build
The modern data stack is a vast landscape of technologies that enable engineering centered around data. Building an open-source modern data stack requires you to understand how your business plans to store and consume data, i.e., what the various data-centered business use cases are.
Exploring the modern data stack landscape can be tricky, as there are so many technologies solving the same set of problems. From those, you have to pick the ones that would work best for your team and your business in terms of the tool's capabilities, your team's capabilities, the cost of maintenance and operation, support for future use cases, and so on.
This article will take you through the following steps to build an open-source modern data stack:
- Step 1. Identify and list business use cases
- Step 2. Collate your team’s technical capabilities
- Step 3. Survey the open-source data landscape and make choices
- Step 4. Run a proof of concept on the chosen data stack
- Step 5. Take your open-source modern data stack to production
We will discuss the modern data stack in an abstraction agnostic to specific tool and technology choices. This will allow you to use this article to navigate the modern data stack-building process based on requirements specific to your use cases and your business. Let’s dive right in.
Table of contents
- Building an open-source modern data stack
- Identify and list business use cases
- Collate your team’s technical capabilities
- Survey the open-source data landscape and make choices
- Run a proof of concept on the chosen data stack
- Take your open-source modern data stack to production
- Related reads
Building an open-source modern data stack
The best way to start building your own data stack is to identify the necessary components. The most common modern data stack setup includes the components in the following areas:
- Storage — file system, object-based storage, databases, data warehouses, etc.
- Processing — query engines like Trino, Presto, etc., and APIs like the Spark DataFrame API, SparkSQL API, etc.
- Movement — ETL tools, reverse ETL tools, orchestration engines, streaming engines, etc.
- Visualization — reporting, and business intelligence tools like Metabase, Superset, Redash, etc.
- Governance — authentication, authorization, data cataloging, discovery, lineage, quality, and governance tools.
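The component areas above can be sketched as a simple checklist. This is a minimal illustration, not a real inventory tool; the tool names are just the examples mentioned above, and the dict layout is an assumption for the sketch.

```python
# A minimal sketch of the component checklist above, using a plain dict.
# Tool names are illustrative examples from the article, not recommendations.
STACK = {
    "storage": ["MinIO"],
    "processing": ["Trino", "Spark"],
    "movement": [],            # e.g., an ETL tool or orchestration engine
    "visualization": ["Metabase"],
    "governance": [],          # e.g., a data catalog or lineage tool
}

def missing_components(stack):
    """Return the component areas that still have no tool chosen."""
    return sorted(area for area, tools in stack.items() if not tools)

print(missing_components(STACK))  # areas still to be filled in
```

A checklist like this makes it obvious which areas your use cases demand but your shortlist does not yet cover.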
But as mentioned earlier, the components you’d need will be decided primarily by your business use cases. In this section, we’ll talk about the steps you must follow to build your data stack from scratch. Please note that these steps don’t strictly have to come one after the other. You’ll need to keep revisiting these to correct the course several times, which is expected. Let’s begin!
Step 1. Identify and list business use cases
The first step towards building your data stack is to identify the problems you are solving for. Some businesses have a use case for real-time data streaming, while others need a batch-based data warehouse or a data lake. Increasingly, you’ll see most companies end up needing a combination of different data storage and processing methodologies.
Dive a bit deeper into each use case and identify aspects such as data volume, data security, speed of movement, frequency, and so on. Later, in the technology discovery phase, all of this information will help you with cost and sizing estimations. At the end of this exercise, you should be able to answer questions like:
- How much data do you have in your organization, and how is that data volume projected to increase in the future?
- If the primary goal of setting up a data stack is to deliver reports and business intelligence dashboards, how frequently do you want those reports and dashboards to be refreshed?
- If you have streaming data, how “real-time” does it need to be? Are you okay with “near real-time” with a 5-minute delay? Or do you want the data right away? Do you want data pre-processing or not?
- What are the different kinds of read-and-write workloads you want to support through your open-source modern data stack?
Once you've answered questions like the ones above, map the use cases to data stack components without going into specific technologies: an orchestration engine, a data streaming platform, a data warehousing platform, an observability tool, a data governance tool, and so on. Then move to the next step, where you'll gauge your own team's technical capabilities.
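One lightweight way to record this use-case-to-component mapping, before committing to any specific technology, is a small rules function. Everything below is hypothetical: the use cases, the latency threshold, and the component names are placeholders for whatever your Step 1 answers produce.

```python
# Map hypothetical business use cases to abstract stack components based on
# the requirements gathered in Step 1. All names and thresholds are examples.
def required_components(use_case):
    components = {"storage"}                      # every use case needs storage
    if use_case.get("reporting"):
        components |= {"visualization", "processing"}
    if use_case.get("max_latency_seconds", float("inf")) <= 300:
        components.add("streaming")               # "near real-time" or faster
    if use_case.get("scheduled"):
        components.add("orchestration")
    return components

dashboards = {"reporting": True, "scheduled": True}
clickstream = {"max_latency_seconds": 60}

print(sorted(required_components(dashboards)))
print(sorted(required_components(clickstream)))
```

The point is not the code itself but the discipline: each use case should translate into a concrete, technology-agnostic list of components before you look at any tools.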
Step 2. Collate your team’s technical capabilities
Open-source technologies, especially the ones that don’t have managed offerings, can be challenging to deploy and manage if your team doesn’t have the right technical skills. This is why it is essential to collate your team’s capabilities before looking into the technology landscape for your data stack.
You should also gauge your team's agility in learning new tools and technologies. For instance, if someone on your team has worked in a traditional database administrator role using PL/SQL daily, they won't find it hard to adopt other flavors of SQL. Similarly, someone who has worked with an orchestration tool like Apache Airflow would not have great difficulty moving to, say, Dagster. That said, switching technologies isn't always easy, even within the same domain, because every technology comes with its own pros and cons, and it can take time to get used to. At the end of this step, you should have a clear picture of your team's skills. You should be able to answer questions like:
- What are the top skills of your team? Do people prefer writing SQL more than they prefer writing Python or Scala code for data processing and transformation?
- Do you have many people who have experience with data warehouse modeling, query optimization, and performance tuning?
- Do you have people who have worked with open-source data technologies before? If yes, involve them in the discussions from the get-go.
After answering these questions and finishing the skill collation process, you should have good insight into how broadly and deeply skilled your team is, and how readily it can adapt. With this information in mind, you can move on to exploring the open-source data landscape to pick the right technologies for your use cases, budget, and capabilities.
Step 3. Survey the open-source data landscape and make choices
Exploring the open-source data landscape can be cumbersome, so you must keep your search for the right tools focused. The exploration you did in the previous steps should drive your search for the right storage, processing, movement, visualization, and governance tools. Most teams with some exposure to open-source technologies will know to look for tools from established incubators and maintainers like the LF AI & Data Foundation, the Apache Software Foundation, and more.
As a way to support the community of developers, many companies that are working with open-source end up writing detailed walkthroughs and blog posts on their engineering blogs with findings about what worked, what didn’t, and how they solved the problems they encountered. This could be a great hook to dive deep into the technologies you’re considering. In addition to going through the technical blogs, you should also look at the following things when considering open-source technologies:
- Does the open-source project have well-written documentation? Does it have tutorials that allow you to do a quick proof of concept?
- Who’s maintaining the technologies? Do they have enough community support or sponsors to keep the project going?
- What’s the future product roadmap for the open-source technology you’re considering? Is there a system to upvote and prioritize feature requests?
- What’s the maintenance and operation overhead that comes with this project? Is the underlying architecture based on archaic technologies that will make issues hard to debug and fix?
- How well does this open-source technology sit with the rest of the technologies you’re considering? Is it compatible with the rest of your open-source modern data stack?
Answering these questions, along with the ones from Step 1 and Step 2, should inform most of your tool and technology choices. Once you've made those choices, you'll be in a position to run a proof of concept. If a data stack is extremely popular and has already proven to work well, you may not need a proof of concept; in such cases, you'll need a proof of value instead, i.e., evidence that the chosen stack makes sense for your business. With that in mind, let's go to the next section, which discusses running a proof of concept.
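When several tools pass the checklist above, a rough weighted scorecard can make the comparison explicit. This is only a sketch: the criteria mirror the questions above, but the weights and scores are invented for illustration, and any real evaluation would use your own.

```python
# A rough scorecard for comparing candidate open-source tools against the
# checklist above. Criteria weights and scores are illustrative only.
WEIGHTS = {
    "documentation": 2,
    "community": 2,
    "roadmap": 1,
    "maintenance_overhead": 2,   # scored so that higher = less overhead
    "compatibility": 3,          # fit with the rest of your stack
}

def weighted_score(scores):
    """Combine per-criterion scores (0-5) into a single weighted total."""
    return sum(WEIGHTS[c] * scores.get(c, 0) for c in WEIGHTS)

candidate_a = {"documentation": 5, "community": 4, "roadmap": 3,
               "maintenance_overhead": 2, "compatibility": 5}
candidate_b = {"documentation": 3, "community": 5, "roadmap": 4,
               "maintenance_overhead": 4, "compatibility": 3}

print(weighted_score(candidate_a), weighted_score(candidate_b))
```

Weighting compatibility highest reflects the last question above: a tool that scores well in isolation but sits poorly with the rest of your stack is rarely the right choice.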
Step 4. Run a proof of concept on the chosen data stack
When running a proof of concept, build your stack one step at a time; wiring up the whole stack in one shot can go severely wrong. The good thing about the open-source ecosystem is that you can spin up most of the stack on your laptop. You can mimic the storage layer using a service like MinIO and the processing layer using Docker containers running a database or a data processing engine like Spark. You'll find that you can deploy every component of the data stack on Docker and Kubernetes, so you don't need external infrastructure to try it out.
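As a rough sketch, a laptop-scale version of such a stack might look like the Docker Compose file below, assuming MinIO for object storage, Trino for the query layer, and Metabase for reporting. Ports, credentials, and the absence of version pins are all placeholders, not production settings.

```yaml
# docker-compose.yml - illustrative local stack; credentials and ports
# are placeholders, not production settings.
services:
  minio:                      # mimics the object storage layer
    image: minio/minio
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: admin
      MINIO_ROOT_PASSWORD: change-me
    ports:
      - "9000:9000"
      - "9001:9001"
  trino:                      # query engine for the processing layer
    image: trinodb/trino
    ports:
      - "8080:8080"
  metabase:                   # reporting / BI layer
    image: metabase/metabase
    ports:
      - "3000:3000"
```

A single `docker compose up` then gives you storage, processing, and visualization layers to wire together, swap out, or tear down cheaply.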
While running this proof of concept, you need to ensure that all the components of the data stack work well independently and in concert. You’ll need to run end-to-end workflows on your data stack with mock data files, transformations, query workloads, tests, and reports. None of these need to be very comprehensive, as the goal is only to test the flow of data and the coordination between components.
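A smoke test of this kind can be as small as a script that pushes a mock file through one transformation and checks the output. Everything below is stand-in logic, not a real connector or job; the point is only to exercise the extract-transform-check flow end to end.

```python
import csv
import io

# A tiny end-to-end smoke test: a mock "extract" produces raw rows, a
# "transform" step cleans them, and a check verifies the output.
# All data and logic are mock stand-ins for real connectors and jobs.
def extract():
    raw = "order_id,amount\n1,10.5\n2,\n3,7.25\n"
    return list(csv.DictReader(io.StringIO(raw)))

def transform(rows):
    # Drop rows with missing amounts and cast types, as a real job might.
    return [{"order_id": int(r["order_id"]), "amount": float(r["amount"])}
            for r in rows if r["amount"]]

def check(rows):
    assert all(row["amount"] > 0 for row in rows), "amounts must be positive"
    return rows

result = check(transform(extract()))
print(len(result))  # rows that survived the pipeline
```

Once a trivial flow like this passes, you can gradually replace each mock with the real component (object storage reads, engine-backed transformations, a reporting query) while keeping the same end-to-end shape.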
Once you finish a local proof of concept, you can package what you have into a Docker Compose file or a Kubernetes Helm chart. With that, you can open up the modern data stack to your wider engineering team for usage and feedback. At that point, you'll probably deploy your data stack on a cloud platform, which might require some configuration-level changes, say, when you decide to use a managed Kubernetes service such as Azure AKS, AWS EKS, or Google GKE.
Getting your modern data stack to production is a different ball game altogether. Before doing that, you'll need to open it up to business users for a second round of feedback and return to Step 4. The real test of your data stack begins when business users start using it for a small number of real-world use cases. To do that, you'll probably need connections to real data sources and some actual business requirements, which lets data engineers and business users work together and help each other. You'll probably need to tweak and improvise as you learn from business feedback, which will change your data stack, preferably only slightly. Once it all settles down, you can start working on taking your data stack to production.
Step 5. Take your open-source modern data stack to production
By this point, production readiness shouldn't be a new consideration. While making stack choices, you need to spend enough time looking into the scalability and reliability of the technologies. This goes back to exploring the data stacks that have worked for other companies with similar requirements; you can find this information in case studies and engineering blog posts.
If this is a greenfield project, you don't need to worry about migrating anything to your new setup, but if you have an existing system, you need to plan the migration carefully to minimize disruption. If a workload is reasonably small and non-critical, you can start moving it to your new stack for further testing; critical workloads should be migrated only after the system has been sufficiently tested.
Once the stack is finalized, give your engineering and business teams enough time to learn and adapt to the new tools and technologies you're introducing. This might involve brown bag sessions, immersion days, good documentation, and continuous communication and support.
Taking a system to production is just day 1; the real work of maintenance and upkeep starts after that, along with the challenges of scale, security, and reliability. That's why, after you finish creating a new workload or migrating an existing one, you need to monitor and observe the performance of your stack closely to mitigate any performance issues in time.
In this article, we discussed how to build an open-source modern data stack with all its components. As you will have seen over the course of this article, making technology choices for your data stack depends on many factors.
If building your own modern data stack is not feasible for your business, there are a number of off-the-shelf tools that have been successful for other organizations. These include Snowflake, Databricks, BigQuery, Azure Synapse, Redshift, Airflow, Fivetran, dbt, Sigma, PowerBI, Tableau, Looker, and Atlan.
Having said that, there’s no universal data stack that works for every business. What works for your business might not work for another similar company for a variety of reasons, such as differences in core skill set, business strategy, learning agility, and budget, among other constraints. And even for your own business, several different stack choices might work, and then you’ll have to choose between all the solutions that work well!
Open source modern data stack: Related reads
- What Is a Data Catalog? & Do You Need One?
- AI Data Catalog: Exploring the Possibilities That Artificial Intelligence Brings to Your Metadata Applications & Data Interactions
- 8 Ways AI-Powered Data Catalogs Save Time Spent on Documentation, Tagging, Querying & More
- Data Catalog Market: Current State and Top Trends in 2024
- 15 Essential Data Catalog Features to Look For in 2024
- What is Active Metadata? — Definition, Characteristics, Example & Use Cases
- Data catalog benefits: 5 key reasons why you need one
- Open Source Data Catalog Software: 5 Popular Tools to Consider in 2024
- Data Catalog Platform: The Key To Future-Proofing Your Data Stack
- Top Data Catalog Use Cases Intrinsic to Data-Led Enterprises
- Business Data Catalog: Users, Differentiating Features, Evolution & More