Snowflake + AWS: A Practical Guide for Using Storage and Compute Services

Updated October 15th, 2023
Snowflake on AWS


Snowflake is a Data Cloud company that enables you to deploy different data architectures, such as data warehouses, data lakes, and lakehouses, with the help of various data storage, processing, and consumption features.

AWS was the first major public cloud platform, emerging in the mid-2000s with the launch of services like Amazon SQS and Amazon S3. It now offers over 200 services, including managed databases, Kubernetes, remote workstations, machine learning and artificial intelligence workspaces, and more.

This article will take you through some common themes between Snowflake and the AWS cloud platform while discussing the specifics of deploying and running Snowflake on AWS. You will learn about features for data pipelining, privacy, and security. Throughout the article, we’ll cover the themes listed in the table of contents below.

Let’s get into it!




Table of contents #

  1. Deploying Snowflake on AWS
  2. Staging data in Amazon S3
  3. Loading and unloading data in Amazon S3
  4. Using external AWS Lambda functions
  5. Using the Snowflake Business Critical Edition on AWS
  6. Summary

Deploying Snowflake on AWS #

Although Snowflake allows you to natively bring in your on-prem data sources, the platform itself can only be deployed in the cloud. Snowflake lets you create virtual warehouses to run your workloads; each warehouse runs on one or more compute clusters. For the most part, workloads related to data processing and consumption need a warehouse up and running. However, Snowflake also has several features that run on serverless compute, which is managed and scaled separately from your warehouses.
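
For illustration, here’s a minimal sketch of creating a small warehouse that suspends itself when idle; the warehouse name and sizing are hypothetical and should be adjusted to your workload:

CREATE WAREHOUSE my_wh
 WAREHOUSE_SIZE = 'XSMALL'
 AUTO_SUSPEND = 60
 AUTO_RESUME = TRUE;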

You can deploy Snowflake by using on-demand resources or signing up for longer-term contracts to get discounts. You also need to consider the features you need for your business use cases. Based on that, you can decide on a Snowflake Edition. Here’s what the Snowflake pricing page looks like for the AWS US East (Northern Virginia) region:


Snowflake pricing page showing credit prices in the AWS US East region. Image from the Snowflake website.

When deciding on a region, you also need to consider data residency, security, compliance, and usage requirements. Although most mature features are available across cloud platforms and regions, some newly released features may not yet be available in every cloud region. Another major consideration is that the cost of data movement, compute, and so on differs significantly across regions.


Staging data in Amazon S3 #

Snowflake uses the concept of stages to load and unload data from and to other data systems. You can either use a Snowflake-managed internal stage to load data into a Snowflake table from a local file system, or you can use an external stage to load data from object storage. The unloading process involves the same steps in reverse.

If your data is located in Amazon S3 buckets, you can create an external stage over S3 to load your data into Snowflake. To do that, you’ll first need to create a storage integration, which stores the AWS IAM role details Snowflake uses to access your buckets, so you don’t have to pass credentials every time you link Snowflake and S3 to create a stage. You can create a storage integration to enable external stages in S3 using the following SQL statement:

CREATE STORAGE INTEGRATION <storage_integration_name>
 TYPE = EXTERNAL_STAGE
 STORAGE_PROVIDER = 'S3'
 ENABLED = TRUE
 STORAGE_AWS_ROLE_ARN = '<aws_iam_role_arn>'
 STORAGE_ALLOWED_LOCATIONS = ('s3://<bucket>/<path>/', 's3://<bucket>/<path>/')
 [ STORAGE_BLOCKED_LOCATIONS = ('s3://<bucket>/<path>/', 's3://<bucket>/<path>/') ];
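
Before a stage can use this integration, you also need to update the IAM role’s trust policy in AWS so Snowflake can assume the role. You can retrieve the values Snowflake generates for that trust policy, namely STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID, by describing the integration:

DESC INTEGRATION <storage_integration_name>;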

Using that storage integration, you can create a stage by specifying the S3 location and the format of the files you want to stage into Snowflake, using the following SQL statement:

CREATE STAGE <stage_name>
 STORAGE_INTEGRATION = <storage_integration_name>
 URL = 's3://<bucket>/<path>/'
 FILE_FORMAT = <csv_format_name>;
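
The FILE_FORMAT parameter above references a named file format that must already exist. Here’s a minimal sketch of creating one for comma-separated files with a header row; the delimiter and header settings are assumptions about your files:

CREATE FILE FORMAT <csv_format_name>
 TYPE = CSV
 FIELD_DELIMITER = ','
 SKIP_HEADER = 1;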

Using the storage integration and the external stages, you can load data from and unload data into S3. Let’s look at that in a bit more detail.


Loading and unloading data in Amazon S3 #

Once the stage is created, you can use the COPY INTO command for bulk-loading the data into a table, as shown in the SQL statement below:

COPY INTO <table_name>
 FROM @<stage_name>
 PATTERN = '.*address.*[.]csv';

The pattern matches all files whose names contain address and end in .csv, such as address.part001.csv, address.part002.csv, and so on, and loads them in parallel into the specified Snowflake table.
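
If you want to verify what a bulk load actually ingested, you can query the COPY_HISTORY table function; the sketch below assumes the files were loaded into a table named address within the last hour:

SELECT file_name, row_count, status
 FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
  TABLE_NAME => 'ADDRESS',
  START_TIME => DATEADD(hour, -1, CURRENT_TIMESTAMP())));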

You don’t necessarily have to create a separate stage for unloading data, but it’s considered best practice to unload into a dedicated directory (prefix), as shown in the SQL statement below:

COPY INTO @<stage_name>/unload/ FROM address;

After this, you can run the GET command to download the unloaded files from the stage to a local directory. Note that GET only works with internal stages; for an external S3 stage, you can fetch the unloaded files directly from the bucket:

GET @mystage/unload/address.csv.gz file://<local_directory_path>/;
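
Either way, you can confirm which files were unloaded by listing the contents of the stage path:

LIST @mystage/unload/;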

The Snowflake + AWS integration also allows you to run queries on an external stage without bringing the data into Snowflake. To do this, you’ll need to create external tables in Snowflake, which store metadata about the staged files in Snowflake’s data catalog while the data itself stays in S3.

It is common practice to store files under year, month, and day partitions in object stores like Amazon S3. You can parse the METADATA$FILENAME pseudocolumn when creating the external table to derive a partition column and enforce a partitioning scheme, as shown in the SQL statement below:

CREATE EXTERNAL TABLE orders (
 date_part date AS TO_DATE(SPLIT_PART(METADATA$FILENAME, '/', 3)
 || '/' || SPLIT_PART(METADATA$FILENAME, '/', 4)
 || '/' || SPLIT_PART(METADATA$FILENAME, '/', 5), 'YYYY/MM/DD'),
 order_time bigint AS (value:order_time::bigint),
 order_quantity int AS (value:order_quantity::int),
 reference_number varchar AS (value:reference_number::varchar))
 PARTITION BY (date_part)
 LOCATION = @<stage_name>/orders/
 AUTO_REFRESH = TRUE
 FILE_FORMAT = (TYPE = PARQUET)
 AWS_SNS_TOPIC = 'arn:aws:sns:<region>:<aws_account_id>:<sns_topic_name>';


With the AUTO_REFRESH and AWS_SNS_TOPIC options, the external table’s metadata is refreshed automatically whenever an S3 event notification is delivered through Amazon SNS, which makes it easy to keep the table up-to-date.
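
If you ever need to refresh the metadata on demand, say after a bulk backfill, you can also trigger a manual refresh of the external table created above:

ALTER EXTERNAL TABLE orders REFRESH;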


Using external AWS Lambda functions #

In addition to external tables, where you don’t have to move your data into Snowflake, you can also use external functions, where you don’t have to move your business logic into Snowflake. An external function, or remote service, lets you offload parts of your data platform to services in AWS, the most prominent of which is AWS Lambda, AWS’s serverless compute option.

Integrating with AWS Lambda works much like integrating with Amazon S3: you have to create an integration, in this case an API integration, as shown in the SQL statement below:

CREATE OR REPLACE API INTEGRATION <api_integration_name>
 API_PROVIDER = aws_api_gateway
 API_AWS_ROLE_ARN = '<aws_iam_role_arn>'
 API_ALLOWED_PREFIXES = ('https://<api_gateway_url>/')
 ENABLED = TRUE;

Snowflake uses AWS API Gateway as a proxy service to send requests for Lambda invocation, which is why you have the API_PROVIDER listed as aws_api_gateway.

Several use cases exist for external functions: enriching, cleaning, protecting, and masking data. You can call third-party APIs from these external functions or run your own code behind them. One common example is External Tokenization for data masking, where you can use AWS Lambda as the tokenizer.
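
To make this concrete, here’s a minimal sketch of exposing a Lambda-backed tokenizer as an external function; the function name, signature, and endpoint path are hypothetical, and the URL must point to your own API Gateway deployment:

CREATE EXTERNAL FUNCTION tokenize_ssn(ssn varchar)
 RETURNS varchar
 API_INTEGRATION = <api_integration_name>
 AS 'https://<api_gateway_url>/prod/tokenize';

Once created, tokenize_ssn can be called like any scalar function in your SQL queries.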


Using the Snowflake Business Critical Edition on AWS #

The Business Critical Edition of Snowflake builds on the Enterprise Edition with stronger security, availability, and support. Concerning AWS, the Business Critical Edition allows you to use AWS PrivateLink, which connects your VPCs to Snowflake without traversing the public internet, keeping traffic on AWS and its partner network infrastructure.

The whole point of using PrivateLink is to help you comply with regulations, such as PCI DSS and HIPAA, and reduce the risk of data exposure. AWS PrivateLink also allows you to access Snowflake’s internal stages and call external functions through private endpoints, making them more secure.
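
On a Business Critical (or higher) account, you can retrieve the configuration details needed to set up the corresponding VPC interface endpoints on the AWS side with a system function:

SELECT SYSTEM$GET_PRIVATELINK_CONFIG();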

Some businesses, especially those dealing with financial and medical data, have rigorous data storage and residency requirements. These companies sometimes store critical data in their own on-premises data centers. In such cases, you can use AWS Direct Connect, a dedicated connection between your data center and AWS, to bring all your infrastructure into a single private network.


Summary #

This article took you through deploying Snowflake on AWS and using AWS services, such as AWS API Gateway, Amazon S3, AWS Lambda, AWS PrivateLink, and AWS Direct Connect, to enhance Snowflake-native features. The article also explored how the Snowflake Business Critical Edition works on AWS using some of the AWS services and Snowflake constructs mentioned above. Finally, the article walked you through the different ways of offloading parts of your workload to AWS for integration, privacy, and security.

Running Snowflake and the rest of your data stack on AWS is great because you can gather metadata using Snowflake’s data governance features and the AWS Glue Data Catalog. This metadata can then flow into Atlan, an active metadata platform, enabling your business to realize the true potential of using metadata. As part of a recent effort to promote shift-left data governance, a two-way tag sync feature between Atlan and Snowflake was launched, allowing Atlan to become the control plane for tags.

You can learn more about the Snowflake + AWS partnership from Snowflake’s official website and the Snowflake + Atlan partnership here.


