Data Warehouse: Definition, Concept, Components, and Architecture

Updated June 27th, 2023


What is a Data Warehouse?

A data warehouse is a type of data repository used to store large amounts of structured data from various data sources. This includes relational databases and transactional systems, such as customer relationship management (CRM) tools and enterprise resource planning (ERP) software. Similar to an actual warehouse, a data warehouse is highly organized and allows everything to be labeled and categorized in a logical order.


What is the primary purpose of a data warehouse?

Data warehouses are designed to feed information into decision support systems, business intelligence (BI) software, data dashboards, and other types of analytics and reporting tools. The data stored in a given warehouse is formatted for a specific purpose, which is defined before engineers start modeling data and begin loading it into the warehouse. As such, the primary purpose of a data warehouse is to enable an organization to easily access and analyze relevant data to extract key business insights and plan for the future.

Recent developments in data warehousing, such as Google’s BigQuery ML, also allow data engineers and scientists to build and deploy machine learning (ML) models inside data warehouses — there is no longer a need to export the data to develop and train these models. Useful applications of this technology for business include sales forecasting, advanced customer segmentation, and other types of predictive analytics.

Ultimately, data warehousing helps businesses improve their bottom line by enabling historical insight, ensuring data quality and consistency, boosting the power and speed of data analysis processes, and facilitating greater scalability across the entire organization.

What is a cloud data warehouse?

A cloud data warehouse is a type of data warehouse that is managed and hosted by a cloud service provider (CSP). Because cloud data warehousing does not have the physical constraints associated with on-premise data warehousing (where physical servers must be purchased, set up, and maintained), it is significantly faster, cheaper, and more scalable. Some of the most popular cloud data warehousing solutions available today include Amazon Redshift, Snowflake, Google BigQuery, and Microsoft Azure SQL Data Warehouse.

If your business is considering migrating its data warehousing to a cloud-based solution, it is critical to have a clear migration strategy in place. This includes outlining how the business will benefit from the desired end state, selecting a cloud data warehouse that fits your needs, and determining how you will redesign your current data model. Here is more guidance on how to handle the initial preparation and discovery phases to kick off a cloud data warehouse migration.

An essential feature of cloud data warehousing is elasticity, or the ability to acquire resources as you need them and release resources when you no longer need them. Elasticity is a big reason why on-premise data warehousing is not a great fit for organizations that need to scale up or scale down quickly, as is usually the case with growing businesses. To that point, developing a scaling strategy for physical infrastructure requires complicated forecasting methodologies that may ultimately prove inaccurate, costing the business time and money.

What are the components of a data warehouse architecture?

A data warehouse has four core components: a central database, data integration tools, metadata, and data access tools.

Central database: A central database typically serves as the backbone of a data warehouse. In many instances, this will be a relational database running on-premise or in the cloud. However, with the rise of Big Data and the need for real-time computing, in-memory databases are becoming increasingly popular. This is because in-memory databases allow rapid response times by removing the need to access disks.

Data integration tools: Data integration tools are used to pull data from source systems and transform the information into a unified format that fits within the data warehouse. The traditional approach used here is called extract, transform, and load (ETL). A growing number of organizations are also leveraging an extract, load, and transform (ELT) technique so non-engineers are able to use the data faster. ELT does this by providing access to data as soon as it’s loaded, and data transformation (processing and organization) occurs after.
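The difference between the two approaches can be sketched in a few lines. This is a minimal, purely illustrative example: the source records, field names, and list-based "warehouse" are all hypothetical stand-ins, not a real integration tool.

```python
# Minimal sketch of the ETL pattern: data is transformed *before* loading.
# All records, field names, and the list-based "warehouse" are hypothetical.

def extract():
    # Pull raw rows from a source system (e.g., a CRM export)
    return [
        {"customer": "Acme Corp", "amount": "1200.50", "currency": "usd"},
        {"customer": "Globex", "amount": "300.00", "currency": "USD"},
    ]

def transform(rows):
    # Normalize into the warehouse's unified format
    return [
        {
            "customer": r["customer"],
            "amount_usd": float(r["amount"]),
            "currency": r["currency"].upper(),
        }
        for r in rows
    ]

def load(rows, warehouse):
    warehouse.extend(rows)

warehouse = []
load(transform(extract()), warehouse)  # ETL: transform happens in flight

# Under ELT, the same raw rows would be loaded first and transformed later,
# inside the warehouse itself, typically with SQL.
```

The practical consequence of the ELT ordering is that raw data is queryable the moment it lands, while cleanup happens downstream on the warehouse's own compute.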

Metadata: Simply put, metadata is data about your data. It delineates the data’s source, contents, usage, and other key features that describe its purpose and how it fits into the data warehouse. Broadly, metadata can be classified as 1) business metadata, which includes contextual information that is easily understandable, and 2) technical metadata, which describes precisely how to access and manage the data. Beyond these, modern data systems can also record newer forms of metadata, such as operational metadata, which describes how the data has existed and evolved through its lifecycle, and social metadata, which indicates patterns of data usage.
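To make the four categories concrete, here is a hypothetical metadata record for a single warehouse table. The structure and every field name are invented for illustration; real catalogs and warehouses each have their own schemas.

```python
# Illustrative only: how the four kinds of metadata for one table might
# be recorded. All names, values, and fields are hypothetical.

table_metadata = {
    "business": {  # easily understandable context
        "name": "Monthly Revenue",
        "description": "Recognized revenue per customer per month",
        "owner": "Finance team",
    },
    "technical": {  # how to access and manage the data
        "location": "warehouse.finance.monthly_revenue",
        "format": "columnar",
        "refresh_schedule": "daily at 02:00 UTC",
    },
    "operational": {  # how the data has evolved through its lifecycle
        "last_loaded": "2023-06-27",
        "row_count": 48213,
    },
    "social": {  # patterns of usage
        "top_querying_teams": ["analytics", "sales-ops"],
    },
}

# A data catalog might surface the business layer for search while using
# the technical layer to generate access instructions.
print(table_metadata["business"]["name"])
```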

Data access tools: Most businesses use a variety of no-code data access tools to interact with the data stored in their data warehouses. Data warehouse access tools can include query tools, application development tools, data mining tools, and online analytical processing (OLAP) tools.

On-premises vs. cloud data warehouse architectures

There are a few key differences to consider when comparing on-premise data warehouses to cloud data warehouses. A typical on-premise data warehouse features three tiers:

  • Bottom tier: Where data is loaded and stored via the database server
  • Middle tier: Holds the analytics engine (e.g., an OLAP server) where data is analyzed and transformed
  • Top tier: The front end where reporting and analytics tools are situated

An on-premise data warehouse may take one of the following forms:

  • Virtual: This type of data warehouse involves separate databases that are simultaneously queried, offering the functionality of a single data warehouse.
  • Enterprise: What many imagine when they picture data warehousing, the enterprise data warehouse aggregates data from all units of the business.
  • Data mart: This approach entails splitting the data warehouse into modules that each focus on a specific business unit or department.

Having examined the three main models of a traditional data warehouse, let’s take a look at the two types of cloud data warehouse architecture seen today.

  • Cluster-based: A cluster-based cloud data warehouse is a set of shared computing resources, known as nodes. In cluster-based warehousing, multiple nodes are combined to host the cloud data warehouse within a public or hybrid cloud environment. Amazon Redshift and Azure SQL Data Warehouse can be categorized as cluster-based cloud architectures.
  • Serverless: In this model, the database cluster is abstracted away because it is spread across many clients and managed entirely by the cloud service provider. Google BigQuery and Snowflake can be categorized as serverless cloud architectures.

While there are many factors to consider when choosing a cloud data warehouse architecture, one of the biggest is cost. If your business needs a clear estimate of the associated costs, a provider with cluster-based architecture may be the best choice. However, if you’re less concerned about cost forecasting and only want to pay for what you use, serverless architecture is likely your best option.

Database vs data warehouse

As mentioned above, a database is a core component of a data warehouse. The term database often refers to a relational database, which is a collection of data that is organized into tables that group together related objects. Data warehouses, on the other hand, are information systems that pull from multiple sources and are used to rapidly analyze the data to support business decision-making. While it is possible to use a database as a data warehouse in and of itself, this is usually not advised because a database doesn’t provide the performance needed for analytics. Additionally, venturing outside the typical use case of a database can negatively impact uptime.
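The analytical workload a warehouse is optimized for can be sketched with a toy star schema: a fact table of measurements joined to a dimension table of descriptive attributes. Python's built-in `sqlite3` is used here purely as a stand-in for a warehouse engine, and the table names and figures are invented.

```python
import sqlite3

# Toy star-schema illustration: a fact table (sales events) joined to a
# dimension table (regions). sqlite3 stands in for a warehouse engine;
# the schema and numbers are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_region (region_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fact_sales (region_id INTEGER, amount REAL);
    INSERT INTO dim_region VALUES (1, 'North'), (2, 'South');
    INSERT INTO fact_sales VALUES (1, 100.0), (1, 250.0), (2, 75.0);
""")

# An OLAP-style aggregation: total sales per region
rows = conn.execute("""
    SELECT d.name, SUM(f.amount)
    FROM fact_sales f JOIN dim_region d USING (region_id)
    GROUP BY d.name ORDER BY d.name
""").fetchall()

print(rows)  # [('North', 350.0), ('South', 75.0)]
```

A transactional database tunes for many small reads and writes of individual rows; a warehouse tunes for scans and aggregations like the one above across very large fact tables, which is why repurposing one as the other tends to disappoint.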

Data mart vs data warehouse

A data mart is an optional element of a data warehouse that stores data for a particular function or department, such as marketing, sales, or finance. By partitioning a subset of data from the data warehouse, a data mart simplifies data access for the end-user and provides faster access to department-level insights. Data marts can also be used to facilitate a more efficient cost strategy for a data warehouse.
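The partitioning idea can be sketched as filtering a warehouse-wide table down to one department's slice. The records and field names below are hypothetical, and a real data mart would be a managed subset of warehouse tables rather than an in-memory list.

```python
# Sketch of carving a department-level data mart out of a warehouse table.
# All rows and field names are hypothetical.

warehouse_spend = [
    {"dept": "marketing", "item": "spring campaign", "spend": 5000},
    {"dept": "finance", "item": "audit fees", "spend": 12000},
    {"dept": "marketing", "item": "summer campaign", "spend": 7000},
]

def build_mart(rows, department):
    # A mart holds only the subset one team needs, which keeps its
    # queries smaller and faster than scanning the full warehouse.
    return [r for r in rows if r["dept"] == department]

marketing_mart = build_mart(warehouse_spend, "marketing")
print(len(marketing_mart))  # 2
```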

Data warehouse example

Let’s say an airline is having difficulty retaining customers and wants to use data warehousing to improve its operations. They could aggregate customer data from online sales, website interactions, email lists, and satisfaction surveys into a data warehouse for further analysis. From there, they might discover that frequent flight delays in certain areas are upsetting customers and causing them to switch to a different airline.

This sets up another use case for a data warehouse: determining why the flights are being delayed. The airline could feed data about delays by the airport, state, weather conditions, technical issues, security factors, and other conditions into a data warehouse. Then they might connect that data warehouse to a data visualization tool, such as Tableau, to compare different datasets, create different types of graphs, and identify various trends. Doing so could reveal that a common reason for delays is a certain engine component that needs to be repaired.
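The delay analysis described above amounts to grouping incidents by recorded cause and ranking them. Here is a minimal sketch of that aggregation; the delay records are invented for illustration.

```python
from collections import Counter

# Hypothetical sketch of the delay analysis described above: count delay
# incidents by recorded cause. All records are invented.
delays = [
    {"airport": "ORD", "cause": "engine component"},
    {"airport": "DEN", "cause": "weather"},
    {"airport": "ORD", "cause": "engine component"},
    {"airport": "JFK", "cause": "security"},
]

by_cause = Counter(d["cause"] for d in delays)
most_common_cause, count = by_cause.most_common(1)[0]
print(most_common_cause)  # engine component
```

In practice this grouping would run as a warehouse query and feed a visualization tool, but the logic is the same: aggregate by cause, then rank.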

These insights also lend themselves to enabling the airline to leverage predictive analytics. Based on historical data in the warehouse, the airline could better anticipate when that part is approaching failure and schedule preventative maintenance to avoid those delays.

How the modern data warehouse fits into the modern data stack

Data warehousing — cloud data warehousing, to be specific — is one of the foundational components of the modern data stack. Centered around the cloud data warehouse are various cloud-native tools: data transformation, querying, and processing software; modeling and analysis tools; visualization and business intelligence platforms; and AI/ML applications. Together, all of these tools and the cloud data warehouse form the modern data platform.

Some organizations may also choose to supplement (or in certain cases replace) their data warehouses with data lakes. A data lake is another type of data repository that can be used to store both structured and unstructured data of many different varieties. Data lakes are a way to store raw data for a rainy day, so to speak — you never know what you might need it for in the future. We have also recently seen the emergence of the data lakehouse, which helps unite siloed systems that may exist when a business uses both data warehouses and data lakes.

The future of modern data management

A data warehouse is an essential technology for organizations if they want to succeed in today’s business environment. And cloud data warehousing, specifically, is ideal for maximum elasticity, speed, and cost-effectiveness. In most cases, the modern data stack cannot exist without a cloud data warehouse.

As organizations continue to streamline their data operations for maximum effectiveness, a new challenge has emerged: the people who make up data teams — data engineers, analysts, scientists, product managers, and more — each use different methodologies and toolsets to manage and interact with data.

Metadata management is of chief importance when it comes to enabling data democratization, that is, empowering all data users equitably. In the end, data isn’t very useful without the assets that make it comprehensible, such as queries, documentation, key dates, history, and terminology. That’s why a rapidly growing number of organizations are adopting a next-generation data workspace like Atlan to tame collaborative chaos and ensure everyone who handles data is speaking the same language.

Photo by Tiger Lily from Pexels
