Cloud Data Warehouses: Cornerstone of the Modern Data Stack
Cloud data warehouses enable big data processing, support analytics use cases, and are central to the modern data stack. They require substantially less engineering and capital investment than traditional data warehouses while offering a higher ROI.
So, for anyone setting up a modern data stack, an essential “rite of passage” is modernizing the traditional data warehouse.
Understanding this need requires:
- A throwback to the 80s — the history of data warehouses and their challenges
- Followed by the transformation of the data landscape due to cloud computing and big data in the early 2000s
- Leading right up to the early 2010s and the rise of cloud data warehousing
But before we begin, let’s explore the concept of cloud data warehousing.
What is a cloud data warehouse? #
Cloud data warehouses are cloud-based repositories that provide the benefits of a traditional warehouse without its constraints on user accessibility, storage, computational capability, and scalability.
The primary goal of cloud data warehouses is to manage different data types and workloads at scale while offering advanced analytical capabilities.
The cloud data warehouse architecture is becoming a critical component of the modern data stack by overcoming the challenges stemming from traditional data warehouses.
Comprehending these challenges requires a brief detour to the origins of data warehousing and the events that led to the rise of cloud data warehouses.
Let’s start with data warehouses.
What is data warehousing? #
According to Gartner, a “data warehouse is a storage architecture capable of storing data extracted from various resources such as transaction systems, and relational databases.”
The process of setting up such a repository is data warehousing.
A data warehouse acts as a central repository for data aggregated from various sources. Data teams can use this data for analytics and BI. The data warehouse can be on-premise or cloud-native and mainly stores historical data.
According to Ralph Kimball’s book on data warehousing, warehouses emerged to tackle recurring business themes such as:
- We have mountains of data in this company, but we can’t access it.
- We need to slice and dice the data every which way.
- You’ve got to make it easy for business people to get at the data directly.
- We want people to use information to support more fact-based decision-making.
The rise of data warehousing in the 80s #
Until the 1950s, processing and managing data was simple and involved storing data physically with punched cards.
In 1952, IBM developed computers with magnetic tape drives that could read several hundred records per second.
The year 1956 saw the emergence of the first disk drive, capable of holding about 5 MB of data.
Several years later, in 1963, the first database management system (DBMS), called the Integrated Data Store (IDS), was developed by Charles Bachman at General Electric (GE). However, it was confined to GE mainframes. The DBMS could operate on only a single file per database, and the data tables required manual coding.
Similar advances were observed in 1968 when IBM developed a hierarchical database for their mainframe, called the Information Management System (IMS). According to Keith Gordon, the author of Principles of Data Management, the IMS is still in use for several legacy mainframe-hosted applications used in accounting and inventory control.
Then, in 1970, IBM researcher Edgar F. Codd proposed the relational model as a better approach to organizing data.
Until 1970, finding the right data required extensive knowledge of computers and computer programming. Codd’s proposal envisioned a reality where data users didn’t have to be computer programming or engineering specialists.
Codd’s paper, “A Relational Model of Data for Large Shared Data Banks”, was regarded as revolutionary by other computer scientists.
In the early 1970s, the need for a separate database to support decision-making started gaining attention. In their book “The Data Warehouse Toolkit”, Ralph Kimball and Margy Ross highlight how ACNielsen and IRI started offering dimensional data marts for retail sales data. Originally, data marts were seen as highly aggregated subsets of data that answered a specific business question, according to Ralph Kimball.
At the same time, Bill Inmon, often referred to as the father of the data warehouse, began working on defining the term data warehousing. Inmon described it as “a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision making process”.
Then, in 1976, Honeywell shipped the first-ever commercial relational database — Multics Relational Data Store (MRDS).
This was followed by Oracle, then known as Relational Software Inc., introducing their commercial SQL relational database management system (RDBMS) in 1979.
The term “data warehouse” officially emerged in the 1980s, when IBM researchers Barry Devlin and Paul Murphy developed the business data warehouse (BDW).
The BDW architecture supported data flow from different operational systems into decision support environments. Devlin and Murphy described it as “the single logical storehouse of all the information used to report on the business”, laying the foundation for what we call a data warehouse today.
Data warehousing made it possible to set up environments that use data to support decision-making. As a result, related data processes such as acquisition, cleaning, integration, and transformation also took shape.
This led to the rise of traditional data warehouses that stored historical data by organizing it in tables and columns.
Traditional data warehouse architecture
A traditional data warehouse is available on-premise, with the necessary hardware, software, and server rooms to run it. One of the core characteristics of traditional data warehouses is that they are read-only — the data can’t be updated or deleted once loaded.
Typically, this data warehouse is based on a three-tier architecture:
- Bottom tier: The bottom level is the storage layer that contains the database server, metadata repository, data marts, and storage media.
- Middle tier: The middle layer handles computations. It mainly refers to the online analytical processing (OLAP) server that can tackle all the complex queries from various analytical operations.
- Top tier: The top tier contains the business intelligence (BI) tools and acts as the front end for business reporting.
The critical components of a traditional data warehouse include:
- Operational systems such as CRM, billing, and transaction systems
- Extract, transform, load (ETL) pipelines (a minimal sketch follows this list)
- Enterprise data warehouses with data marts for each function
- Business intelligence (BI) platforms for reporting
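To make the ETL flow above concrete, here is a minimal, self-contained sketch in Python. It uses in-memory SQLite databases to stand in for a hypothetical billing system and for the warehouse itself (real traditional warehouses ran on dedicated on-premise servers), but the extract, transform, and load steps follow the same pattern.

```python
import sqlite3

# Stand-in for an operational source system (e.g., a billing database).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE invoices (id INTEGER, customer TEXT, amount_cents INTEGER, paid INTEGER)")
source.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?, ?)",
    [(1, "acme", 12000, 1), (2, "acme", 4500, 0), (3, "globex", 9900, 1)],
)

# Stand-in for the enterprise data warehouse.
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE fact_revenue (customer TEXT, revenue REAL)")

# Extract: pull raw rows from the operational system.
rows = source.execute("SELECT customer, amount_cents FROM invoices WHERE paid = 1").fetchall()

# Transform: aggregate and convert units before loading (schema-on-write).
totals = {}
for customer, amount_cents in rows:
    totals[customer] = totals.get(customer, 0) + amount_cents / 100.0

# Load: write the conformed, analysis-ready rows into the warehouse table.
warehouse.executemany("INSERT INTO fact_revenue VALUES (?, ?)", list(totals.items()))

# A BI tool would now query the warehouse for reporting.
print(warehouse.execute("SELECT customer, revenue FROM fact_revenue").fetchall())
```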
Challenges of early data warehouses #
Traditional data warehouses suffered from data integration problems and engineering overhead.
For instance, the complex structure of traditional data warehouses wasn’t flexible: data structures and use cases had to be defined upfront, and any changes at a later stage required additional hardware and complex workarounds.
Moreover, as the data types and volumes grew, integrating them — especially unstructured data — became a problem.
Maintaining such an ecosystem and ensuring zero inaccuracies or inconsistencies in data quality was yet another engineering challenge. Besides, these warehouses weren’t easy to scale and had storage limits.
According to Gartner:
“Legacy data warehouses lack the agility needed to deliver data requirements, depend on specialist resources for integration, and lack trust in data. They also require upfront and ongoing licensing costs, ongoing infrastructure maintenance and housekeeping. In some cases, the current product lines could be reaching the end of support. In other cases, a huge license renewal cost may be fast approaching.”
Organizations needed a warehousing solution that was easy to set up and scale, and that could support large volumes of data in various sources and formats. That’s where cloud computing came to the rescue.
The evolution of cloud computing and big data #
Cloud computing led to an exponential growth in the volumes of data generated every day. It also simplified several engineering constraints as systems built on the cloud can easily be scaled up and down.
For instance, scaling the number of processing servers makes it easier to process vast amounts of structured, semi-structured, and unstructured data.
In addition, since the servers are on the cloud, they’re maintained by the cloud service providers, and as a result, your engineers don’t have to worry about setting up and scaling the infrastructure.
Other benefits of cloud computing include easy data restoration and backup, low maintenance cost, and on-demand services.
The rise of computing power led to increased volumes of semi-structured and unstructured data — big data. Traditional data warehouses couldn’t handle big data. Besides, the cost of processing big data using traditional data warehouses would be astronomical.
To fill the need, distributed computing platforms such as Apache Hadoop became popular for storing and managing big data in the early 2000s.
Doug Cutting (hired by Yahoo! to lead the Hadoop project) and Mike Cafarella released Hadoop in 2006. By 2009, Hadoop powered several Yahoo! properties — Yahoo Search, Yahoo Mail, and more.
However, Hadoop came with its fair share of challenges — the main issue being processing speed. Hadoop was built to handle data at scale, but it couldn’t serve many concurrent users or interactive queries; it was designed for batch processing.
According to Snowflake, the other top challenges with Hadoop are:
- High levels of customization required for the various components of Hadoop
- A need for special skill sets to deploy, manage and use Hadoop
- Continuous support necessary to maintain the Hadoop infrastructure
So, other solutions for data warehousing started emerging, and that’s how cloud data warehousing rose to popularity.
Meet cloud data warehouses: The next stage in the evolution of data warehousing #
Here’s how Snowflake introduces cloud data warehouses (CDWs):
“Data warehousing built for the cloud enables the big data processing that today’s enterprises require, but with simplicity and ease unmatched by Hadoop or on-premises and cloud-washed traditional data warehouse solutions.”
Cloud data warehouses eliminate the hardware and software constraints that traditional data warehouses suffer from, as they are managed and hosted by cloud service providers (CSPs) like Amazon, Microsoft, or Google.
They can ingest data from various sources and formats and can be queried using SQL, instead of more complex Hadoop components like MapReduce libraries.
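As a toy illustration of that difference (plain Python standing in for hand-written MapReduce code; this is not actual Hadoop code), the same aggregation that would need custom map and reduce functions collapses into one SQL statement of the kind any cloud data warehouse accepts:

```python
import sqlite3
from collections import Counter

events = [("page_view", "us"), ("page_view", "de"), ("signup", "us"), ("page_view", "us")]

# MapReduce-style: emit (key, 1) pairs in the map step, then sum them per key in the reduce step.
mapped = [((event, country), 1) for event, country in events]
counts = Counter()
for key, one in mapped:
    counts[key] += one

# SQL-style: the same result expressed declaratively, the way a cloud warehouse is queried.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE events (event TEXT, country TEXT)")
db.executemany("INSERT INTO events VALUES (?, ?)", events)
sql_counts = db.execute(
    "SELECT event, country, COUNT(*) FROM events GROUP BY event, country"
).fetchall()

print(dict(counts))
print(sql_counts)
```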
As cloud data warehousing technologies started popping up in the early 2010s, they quickly became popular as fast, highly scalable, and affordable options for processing big data.
When Amazon Redshift came out in October 2012, it brought several other data products to prominence:
- Looker (a BI tool, now part of Google Cloud) was launched in 2011, but really took off in 2014
- Fivetran (a data integration tool) launched in 2012
- Periscope Data (an analytics platform, now a part of Sisense) also launched in 2012
- Mode Analytics (a collaborative analytics platform) launched in 2013
- Metabase (a BI tool) launched in 2014
- dbt (a data engineering solution) in 2016
- Stitch Data (an ETL tool, now a part of Talend) launched in 2016
Together, the above tools make up the various layers of the modern data stack — ingestion, integration, storage, analytics, BI, and more. dbt Labs founder and CEO Tristan Handy calls this phenomenon the Cambrian Explosion I (2012 to 2016).
In 2019, Gartner declared that the “cloud is now the default platform for managing data. On-premises is the past, and only legacy compatibility or special requirements should keep you there.”
Cloud data warehouses play a role in that vision and are becoming vital to modern analytics.
Cloud data warehouses: architecture #
Cloud data warehouses don’t follow the traditional tier-based architecture of data warehouses.
While each cloud data warehouse is designed differently, some components are common. These include a cloud database, ingestion services, a SQL processing environment, multiple compute clusters, and several options for analyzing data without moving it.
If we look at cloud data warehousing architectures, there are two main types:
- Cluster-based: A cluster-based cloud data warehouse runs on a group of shared computing resources, known as nodes, which together host the warehouse in a public or hybrid cloud environment. Examples include Amazon Redshift and Azure SQL Data Warehouse.
- Serverless: In a serverless architecture, the cloud service provider allocates and manages the compute behind the scenes, sharing it across many customers, so there are no clusters for you to provision, as sketched below. Google BigQuery is one of the most popular serverless cloud data warehousing solutions.
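Here is what the serverless model looks like in practice: a minimal sketch using the google-cloud-bigquery Python client. It assumes application-default credentials are configured; the project, dataset, and table names are hypothetical.

```python
from google.cloud import bigquery  # pip install google-cloud-bigquery

# The client picks up the project and credentials from the environment.
client = bigquery.Client()

# You submit SQL; the service allocates the compute. There are no clusters or nodes to provision.
query = """
    SELECT country, COUNT(*) AS orders
    FROM `my_project.sales.orders`   -- hypothetical table
    GROUP BY country
    ORDER BY orders DESC
"""

for row in client.query(query).result():
    print(row.country, row.orders)
```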
What are the benefits of cloud data warehousing? #
Cloud data warehousing has several benefits to offer, such as:
- Low cost of ownership in managing a cloud data warehouse
- Better speed and performance with enterprise-grade security
- Increased storage and compute capabilities
- Scalability and elasticity to meet changing demands and analytics use cases
- Backup and disaster recovery
According to The Forrester Wave: Cloud Data Warehouse, Q1 2021, the most common cloud data warehousing use cases are:
- Customer analytics
- AI/ML-based analytics
- Vertical-specific analytics
- Real-time analytics
- Customer intelligence
- Data science
- BI acceleration
- Data collaboration
What are the cloud data warehousing tools available today? #
Snowflake, Amazon Redshift, Google BigQuery, and Microsoft Azure Synapse Analytics are the major cloud data warehousing offerings available today:
- Google BigQuery: Launched in May 2010, Google BigQuery adopts a serverless cloud data warehouse architecture — you don’t have to configure anything, as Google handles the infrastructure. Using SQL commands, you decide how you wish to query your data and work with varying workloads on BigQuery.
- Snowflake Data Cloud: Founded in July 2012, Snowflake is a cloud-agnostic warehousing solution that runs smoothly on AWS, GCP, and Azure. It’s a SaaS offering — you don’t have to maintain any infrastructure. Snowflake splits compute and storage into different tiers so you can scale quickly and purchase only the resources you need to operate. Using Snowflake requires familiarity with SQL commands and data warehousing concepts.
- Amazon Redshift: Launched in October 2012, AWS Redshift operates on clusters of dense storage nodes, so you have to set up and maintain an environment of nodes and clusters. With Redshift, you can handle large volumes of big data and get quick query responses; however, it’s built to work best with other AWS technologies. Since Redshift is built on top of Postgres, familiarity with PostgreSQL or a similar RDBMS helps (see the connection sketch after this list).
- Microsoft Azure Synapse Analytics: Launched in April 2015 (originally as Azure SQL Data Warehouse), Azure Synapse Analytics offers both serverless and dedicated warehousing options. Previously, it only offered resource provisioning, so, like Redshift, you had to set up a cluster of nodes. Now it has a serverless option similar to BigQuery. Synapse Analytics also separates storage from compute so that you can scale them independently. Using Azure Synapse Analytics requires a solid command of SQL.
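To illustrate how SQL-centric access to these warehouses looks from code, here is a hedged sketch: Snowflake via its Python connector and Redshift over the standard PostgreSQL protocol. All account identifiers, hostnames, credentials, and table names below are placeholders, not real endpoints.

```python
import snowflake.connector  # pip install snowflake-connector-python
import psycopg2             # pip install psycopg2-binary

# Snowflake is a SaaS warehouse; you name a virtual warehouse (compute) at connect time.
sf = snowflake.connector.connect(
    account="my_account",        # placeholder account identifier
    user="analyst",
    password="...",
    warehouse="ANALYTICS_WH",
    database="SALES",
    schema="PUBLIC",
)
print(sf.cursor().execute("SELECT CURRENT_VERSION()").fetchone())

# Redshift has a Postgres lineage, so standard Postgres drivers work;
# you connect to a cluster endpoint that you provisioned yourself.
rs = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder endpoint
    port=5439,
    dbname="sales",
    user="analyst",
    password="...",
)
with rs.cursor() as cur:
    cur.execute("SELECT COUNT(*) FROM orders")  # placeholder table
    print(cur.fetchone())
```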
Interested in learning more about the best cloud data warehouses for your modern data stack? Then check out this comprehensive comparison and evaluation guide.
Having understood the evolution of data warehousing, let’s recap the most significant differences between traditional and cloud data warehousing.
Traditional data warehousing vs. cloud data warehousing #
We’ve put together a table to highlight the major differences in traditional data warehousing vs. cloud data warehousing:
Factors | Traditional data warehouse | Cloud data warehouse |
---|---|---|
Architecture | Traditional data warehouses have three tiers for storage, computations, analytics, and reporting functions, respectively. | Cloud data warehouses have different components such as databases, ingestion services, self-service solutions, and machine learning. |
Costs | Upfront costs are high since you must set up and maintain the various components, along with the server rooms required to run the warehouses. | The cloud service provider (CSP) handles the infrastructure setup, maintenance, and servers. So, the initial capital investment isn’t high. |
Scalability | New hardware and storage requirements arise when the existing resources are maxed out or as demand fluctuates. | You can easily scale your resources up or down as per your requirements. Also, the pay-as-you-go model ensures that you only pay for the resources you use. |
Security | Since traditional data warehouses are managed in-house, the chances of data breaches and other cyber threats are lower. | Even though the CSP adheres to the best security standards and ensures compliance with regulations, breaches can occur. So, data governance and security require more thought. |
Engineering roles | Maintaining traditional data warehouses requires a large team of network experts, data engineers, system administrators, and data scientists. | The most common roles for maintaining cloud data warehousing are cloud architects and security engineers. |
Performance | There’s a limit to the volumes of data that traditional data warehouses can process and store. | Cloud data warehouses are well-optimized to process virtually unlimited volumes of big data. |
Cloud data warehouses vs. data lakes: What’s the difference? #
Before wrapping up, there’s one more advancement in data storage that we must consider — the data lake.
Data lakes rose to prominence as they can store unstructured data without transformations or validations. Let’s explore the differences between data lakes and cloud data warehouses:
Factors | Data lakes | Cloud data warehouses |
---|---|---|
Data type | Raw data is stored in data lakes. | Structured and semi-structured data from various sources is stored in cloud data warehouses. This data is ready for analysis and BI. |
Processing | You can use Extract Load Transform (ELT) pipelines for processing data. | Cloud data warehouses require the Extract Transform Load (ETL) process. |
Accessibility | Data lakes are flexible and have fewer restrictions for access and changes. This leads to faster updating of data. | Cloud data warehouses include several components and more granular access rules. So, updating data on warehouses isn’t as fast. |
Use | Ideal for predictive analytics. | Good for prescriptive analytics and BI reporting. |
Cost | Cost-effective as it has fewer components. | Costly, as compared to data lakes. |
Cloud data warehouses and the modern data stack #
Ever since the 70s, there has been a need for a separate engine that turns data into analytical insights for decision-making. This led to the rise of data warehousing.
When the cloud and big data revolution reimagined the data management landscape, this need for a central warehouse still existed.
For instance, the shift to cloud computing and virtually unlimited storage and compute has led to organizations housing their data across different clouds. In this environment, there’s still a need to leverage the data spread across platforms from a single place.
Cloud data warehouses are capable of being that single repository. Unlike traditional data warehouses, they can:
- Scale resources on-demand, within minutes, irrespective of the size
- Support (nearly) unlimited storage and compute
- Separate storage and compute resources to scale affordably
- Auto-tune queries
- Upgrade automatically, without any heavy engineering overhead
Today, cloud data warehouses form the bedrock of the modern data stack — cloud-native data tools that are low code, easy to integrate, and scale.
As we saw earlier, Redshift spurred the rise of several products constituting the various layers of the modern data stack. Its massively parallel processing (MPP) capabilities and first-class SQL support made processing large volumes of varied data sets faster and cheaper.
So, what’s next in the evolution of data storage and processing for the modern data stack?
One avenue to watch out for is the convergence of technologies such as data warehouses and lakes to form lakehouses. With a lakehouse, you can implement data warehousing over open data lake file formats.
Google Cloud describes it as “a low-cost storage in an open format accessible by a variety of processing engines (like Spark) while also providing powerful management and optimization features.”
Cloud data warehouses like Google’s BigQuery and Snowflake’s Data Cloud have already started offering data lakehouse solutions, combining the best of both worlds.
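As a concrete, hedged sketch of the lakehouse pattern, the snippet below uses the google-cloud-bigquery client to define an external table over open-format Parquet files sitting in object storage and then query them with ordinary SQL. The bucket, dataset, and table names are made up for illustration.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Point an external table definition at open-format (Parquet) files in a data lake bucket.
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = ["gs://my-data-lake/events/*.parquet"]  # hypothetical bucket

table = bigquery.Table("my_project.lakehouse.events")  # hypothetical dataset and table
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)

# Warehouse-style SQL over files that never left the lake.
rows = client.query("SELECT COUNT(*) AS n FROM `my_project.lakehouse.events`").result()
print(list(rows))
```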
To know more about what the future holds for the modern data stack, check out The Future of the Modern Data Stack in 2023.
Cloud data warehousing: Related reads #
- What is a data warehouse: Purpose, components, and benefits
- What is a data lake: Definition, examples, architecture, and solutions.
- What is a data lakehouse: Definition, architecture, components, and use cases.
- Data Warehouse vs Data Lake vs Data Lakehouse: What are the key differences?
- Data mesh vs data lake: What are the differences in architecture, use cases, and benefits?
- Data Catalog: Does Your Business Really Need One?
- Why does a data lake need a data catalog?
- Modern data team 101: A roster of diverse talent