According to IDC, we’re projected to create more than twice the amount of data created since we invented digital storage.
Cloud data warehouse platforms have become the bedrock of the modern data platform and enable gathering, processing, organizing, and managing zettabytes of data to eventually inform business decisions.
Moreover, using cloud data warehousing ensures that you can process huge volumes of data without having to worry about engineering the complex infrastructure required.
There are abundant choices at your disposal with enhanced features such as data compression, serverless compute, and BI integration capabilities. While evaluating, you must also consider finding the perfect equilibrium between various factors such as cost, capability, elasticity, volume, and being cloud-agnostic.
So, how do you find the right cloud data warehouse?
Here's an overview of the most popular cloud-based data warehouses & a framework to evaluate them.
The 4 best cloud data warehouses in 2022
- Amazon Redshift
- Google BigQuery
- Azure Synapse SQL Pools
Redshift is a columnar analytical data warehouse packed with features such as data compression to reduce disk I/O and data distribution styles to distribute data across nodes. Using Redshift Spectrum, data can also be read directly from Amazon S3 by creating external tables, instead of copying data into tables.
Redshift has also introduced serverless capabilities, where you pay for the workloads you run in RPU-hours (Redshift Processing Units). It also provides self-tuning capabilities to choose the right partition key and distribution style for your tables, boosting querying performance.
Interestingly, Redshift's naming alludes to Oracle, often referred to as "Big Red because of its prominent brand color. Encapsulating the action of customers choosing Amazon Redshift - as choosing to "shift" from the "big red".
What are the main capabilities of Amazon Redshift?
- Massively Parallel Processing (MPP): With the help of sort keys and distribution styles, you can easily distribute the data across multiple nodes and process it in parallel.
- Data Compression: Compressed data is synonymous with reduced Disk I/O, which in turn increases the query performance.
- Workload Management (WLM): Redshift takes care of allocating resources to the query workloads so that the short-running queries are not blocked in queues by the long-running ones.
- Storage vs. Compute: Depending on your workloads and data volume, you can choose between storage and compute-optimized nodes. Redshift’s serverless compute (in preview) dynamically tunes the configuration between storage and compute-optimized configuration to accommodate the workloads.
Amazon Redshift demo
Amazon Redshift data warehouse resources
Google’s BigQuery is a serverless data warehousing solution that facilitates scalability and high availability. Users only have to pay for the data processed by BigQuery — costing as low as \$5 per TB (as of April 2022).
In its latest Data Cloud Summit, Google unveiled BigLake. BigLake extends BigQuery's fine-grained row- and column-level security to tables on data resident object stores such as Amazon S3, Azure Data Lake Storage Gen2, and Google Cloud Storage. Essentially moving Biqquery a step toward the data lakehouse realm.
What are the main capabilities of Google BigQuery?
- Multi-cloud capabilities (BigQuery Omni): BigQuery Omni helps analyze data in multi-cloud environments. This removes the caveat of setting up dedicated pipelines for data transfer.
- Serverless: A serverless architecture behind the scenes means that the GCP takes care of all the resource provisioning activities.
- Control over Storage: BigQuery doesn’t require any intermediate staging layer to run data transformation. That’s because you run the query directly on the right data sources, giving the user complete control over data storage.
- Connector Capabilities (BigQuery Transfer): The BigQuery Transfer tool retrieves data on a regular basis from predefined sources, which include block storage from other cloud environments and Google services.
- BI Cache Layer (BI Engine): The BI Cache Layer is an in-memory analysis service for improved query response time and high concurrency, to be used by BI tools.
Google Bigquery demo
Google BigQuery data warehouse resources
Azure Synapse SQL Pools
Azure Synapse SQL Pools (formerly known as SQL DW) is an Azure-native data warehousing solution. The data is brought into the SQL pools using Polybase — a feature of Azure (a data virtualization technology that virtually connects to the data sources) — and queried using T-SQL.
The usage of SQL pools is measured in DWH Units (Data Warehouse Units), comprising a combination of I/O, CPU, and memory.
What are the main capabilities of Azure Synapse SQL Pools?
- Massively Parallel Processing (MPP): You can define the distribution strategy and keys for your loaded data.
- Storage vs. Compute: With Polybase in the picture, users can define their blob storage wherever the data resides. As compute and memory are measured in terms of DWH Units, you get the flexibility to choose the right capacity for your warehousing needs.
Azure Synapse demo
Azure Synapse SQL Pools resources
Snowflake was founded in July 2012 by three data warehousing experts: Benoit Dageville, Thierry Cranes, and Marcin Żukowski.
Snowflake works based on a central persisted storage repository. The data is ingested into this repository using multiple data connectors. The queries are run on top of these repositories using Massive Parallel Processing compute clusters. The pricing is based on the amount of compute and storage used, measured in terms of Snowflake credits per hour.
Snowflake has a truly plug-and-play capability since it takes care of all the data maintenance tasks such as vacuuming and removing the maintenance overhead.
What are the main capabilities of Snowflake?
- Cloud Agnostic: The storage, compute and cloud services for Snowflake can be deployed entirely on the cloud. Snowflake can be hosted on AWS, Azure, and GCP.
- NoSQL Support: Snowflake can automatically unmarshal the JSON objects, removing the overhead of writing transformers.
- Concurrency and workloads separation: Cluster-based workload split helps in automatically scaling up and down the clusters based on the queries.
Snowflake data warehouse resources
How to choose the right data warehouse: An evaluation framework
Before we begin, make sure that you've mapped your business requirements and cloud data warehousing use cases. This will dictate the evaluation framework criteria for your organization. While mapping, you should also look at the essential compliance and security factors.
Now, let’s look at the six major characteristics of data warehouse solutions to evaluate any data warehouse platform.
6 key features to factor in during your cloud data warehouse evaluation
- Data warehousing costs
- Data formats
- BI Support
- Pricing model
Now, ask yourself the following questions for each characteristic:
Data warehousing costs
- How much can you spend currently and how much budget do you have for data warehousing as your business grows?
- Is your data warehouse spend budget flexible and dynamic or constrained by reservations?
- What matters the most to you — optimizing storage or performance?
Is the data warehouse solution you’re considering dynamic and elastic enough to scale up and get running?
- Which process must be faster for your use cases — reads or writes?
- What’s the scale of data you expect your warehouse to hold?
- What kind of data will your data warehousing solution hold?
- How often does your data schema change?
- What’s the choice of query engine for your data professionals?
- What is the support for BI tools?
- How well does the data warehouse fit in your existing data stack?
- What’s the value you get for your money?
- How does the pricing vary as your data grows?
- How will the pricing vary, in terms of long commitments?
We’ve put together a table that uses the above evaluation framework to highlight the strengths of each cloud data warehouse solution covered in this article.
Cloud data warehouse platforms comparison
|Data Warehouse Characteristics||AWS Redshift||Google BigQuery||Azure Synapse SQL Pools||Snowflake|
|Cost||On-demand and serverless choices are available.||Purely Serverless. Pay-as-you-go model.||On-demand and pay-as-you-go model. The usage is measured in DWH (Data Warehousing Units).||It works on on-demand and pay-as-you-go.|
|Elasticity||Scaling compute is a manual process. It requires downtimes and can be scaled both vertically and horizontally.||Scaling compute is automatic.||Scaling compute is a manual process.||Autoscaling is feasible.|
|Volume||The volume of the data stored is decided by the type of instance chosen. For external tables, the volume is not of concern.||Being a serverless data warehouse, Google BigQuery can scale the volume dynamically.||Because of the Polybase feature, Synapse SQL Pools don’t have a restriction on volume.||Scaling volume is a manual process.|
|Formats||Gives a relational structure to non-relational columnar formats.||All columnar formats are converted to BigQuery’s native capacitor format.||Supports columnar formats and relational formats.||Supports relational formats and JSONs, and is most suited for JSON.|
|Cloud-agnostic||No||Partially yes (Pulls data from different cloud environments using BigQuery Omni).||No||Yes|
|BI needs||There are JDBC connectors for Redshift. However, Redshift doesn’t have a cache layer to cache the results.||With a dedicated BI Engine in place, integration with BI tools to analyze BigQuery with sub-second latencies is feasible.||Connecting to Microsoft PowerBI is straightforward. As for the other visualization tools, JDBC connectors are the way to go.||With an unlimited number of connections and support for queries from BI tools, Snowflake supports it out of the box.|
Data warehouse platforms and their impact on the modern data stack
As data-driven decision-making becomes the norm, the competitive edge lies in the speed of analytics and insights.
This requires businesses to build a modern data stack that enables their diverse humans of data to collaborate effortlessly and boost their productivity.
The modern data stack is built around a central cloud data warehouse or lake. All tooling is expected to seamlessly integrate with chosen warehouse or lake. That's why it's crucial to pick the right cloud data warehouse for your stack. We hope this evaluation framework helps you get the ball rolling in the right direction.
To know more about the modern data stack, check out our comprehensive beginner's guide to the modern data stack.
Written by Sathiya Sarathi Gunasekaran
Cloud data warehouses: Related reads
- Data warehouse 101
- Warehouse vs. Data Lake vs. Data Lakehouse | Differences
- Snowflake data management with Atlan
- Catalog 3.0: Metadata for the modern data stack
Related deep dives on popular data tools
- 7 popular open-source ETL tools
- 5 popular open-source data catalog tools to consider in 2022
- 7 popular open-source data governance tools to consider in 2022
- 10 popular transformation tools in 2022
- 9 best data discovery tools
- 12 Popular Observability Tools in 2022