Amazon MSK To Snowflake v1.3
Table of Contents
Introduction
Kafka Overview
High-Level Architecture
AWS Configurations
AWS Account User Roles and Privileges
Cluster Configuration
Scaling Configurations with Amazon MSK
Partitions
Ensuring High Availability
Broker and Partition Recommendations
Topic Configuration
Authentication
Encryption
Logging
Security Groups
Apply with Terraform
SharePlex Configuration
Consuming Events Using MSK Connect/Snowflake Connector
Ingestion into Snowflake
Controlling Snowflake Costs
Transformation with dbt Cloud
Risks
Introduction
Qualifacts currently leverages a multi-tenant solution with 13 transactional Oracle servers, accommodating approximately 400 customers. Each of these customers shares the same schema definition, which comprises around 2,000 tables. From these, data is surfaced to customers in multiple ways, including an operational data store (ODS) of real-time tables which is then replicated to Snowflake and Looker, a Hive connector which takes in data from the ODS as well as Hadoop, SFTP extracts of batched data via Talend, and embedded reporting solutions.
The goal of the production pipeline POC is to demonstrate the ability to replicate the source data from these Oracle tables into Snowflake. Longer term, the goal is to house all reporting data in the Snowflake data platform, which will be the source of all reporting across the organization. Pursuing this strategy will allow Qualifacts to eliminate auxiliary reporting mechanisms, reducing cost and complexity, and will allow reporting data to be leveraged in new ways with expanded business implications.
Kafka Overview
Apache Kafka software uses a publish and subscribe model to write and read streams of records,
similar to a message queue or enterprise messaging system. Kafka allows processes to read and
write messages asynchronously. A subscriber does not need to be connected directly to a
publisher; a publisher can queue a message in Kafka for the subscriber to receive later.
The key abstraction in Kafka is the topic. Producers publish their records to a topic, and consumers
subscribe to one or more topics. A Kafka topic is like a sharded write-ahead log. Producers append
records to these logs and consumers subscribe to changes. Each record is a key-value pair. The key
is used for assigning the record to a log partition (unless the publisher specifies the partition).
Here is a simple example of a single producer and consumer reading and writing from a
two-partition topic.
This shows a producer process appending to the logs for the two partitions, and a consumer
reading from the same logs. Each record in the log has an associated entry number that we call the
offset. This offset is used by the consumer to describe its position in each of the logs. Partitions are
spread across a cluster of machines, allowing a topic to hold more data than can fit on any one
machine.
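To make this pattern concrete, a minimal sketch in Python using the confluent-kafka client is shown below; the broker address, topic name, key, and value are placeholders rather than values from this POC.

from confluent_kafka import Consumer, Producer

# Producer: the record key determines which partition the record is appended to.
producer = Producer({"bootstrap.servers": "broker-1:9092"})
producer.produce("example-topic", key="customer-42", value='{"op": "insert"}')
producer.flush()

# Consumer: subscribes to the topic and tracks its position (offset) per partition.
consumer = Consumer({
    "bootstrap.servers": "broker-1:9092",
    "group.id": "example-consumer-group",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["example-topic"])
msg = consumer.poll(timeout=10.0)
if msg is not None and msg.error() is None:
    print(msg.partition(), msg.offset(), msg.key(), msg.value())
consumer.close()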
High-Level Architecture
Region: We are going to work in a specific region, and everything is going to be created in that region (“us-east-1” in the diagram).
VPC: We are going to leverage the existing VPC with VPC ID vpc-0cd4982a93bdf3e0f and IPv4 CIDR block 172.28.0.0/20.
(Typically, for a new MSK cluster setup, you would first create a VPC and devise an IP strategy with a CIDR block declaration to configure subnets. However, because this POC uses Qualifacts’ existing VPC, the VPC, subnets, internet gateway and route entries are not declared as AWS resources in the MSK Terraform module.)
Subnets and Availability Zones: Inside the VPC, for the purposes of this POC, we utilize 3 subnets, 1 subnet per Availability Zone. We have 3 availability zones in the us-east-1 region:
1. us-east-1c
2. us-east-1d
3. us-east-1b
The 3 subnets we will utilize will have the following IPv4 CIDR blocks allocated:
MSK Cluster: This POC creates an MSK cluster. For the purposes of the POC, this cluster will have
3 brokers, each one living in its own availability zone.
EBS Volume: Each one of the brokers is going to have its own EBS volume storage.
AWS Configurations
AWS Account User Roles and Privileges
In the IAM Management Console, attach to the intended user account the permission policies shown below.
Qualifacts is already leveraging Terraform, so IAM permissions are likely already configured.
Generally Terraform requires the following permissions policies for Amazon MSK:
As well, the following policy is used for the EC2 instance to connect to MSK for admin/testing
purposes.
{
  "Statement": [
    {
      "Action": [
        "kafka-cluster:Connect",
        "kafka-cluster:AlterCluster",
        "kafka-cluster:DescribeCluster"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:kafka:us-east-1:101443393375:cluster/msk-cluster-dev/*"
    },
    {
      "Action": [
        "kafka-cluster:*Topic*",
        "kafka-cluster:WriteData",
        "kafka-cluster:ReadData"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:kafka:us-east-1:101443393375:cluster/msk-cluster-dev/*",
        "arn:aws:kafka:us-east-1:101443393375:topic/msk-cluster-dev/*",
        "arn:aws:kafka:us-east-1:101443393375:group/msk-cluster-dev/*"
      ]
    },
    {
      "Action": [
        "kafka-cluster:AlterGroup",
        "kafka-cluster:DescribeGroup"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:kafka:us-east-1:101443393375:cluster/msk-cluster-dev/*",
        "arn:aws:kafka:us-east-1:101443393375:group/msk-cluster-dev/*"
      ]
    }
  ],
  "Version": "2012-10-17"
}
Cluster Configuration
In addition to the MSK documentation on cluster configuration, please reference Amazon’s suggested best practices for working with Amazon MSK.
Scaling Configurations with Amazon MSK
Scaling horizontally can be accomplished by adding brokers to the cluster. Additional brokers must be provisioned as a multiple of the number of AZs. Note that the number of brokers can only be increased, not subsequently decreased, so horizontally scaling with additional brokers should be carefully considered beforehand. Also note that partitions must be re-assigned after brokers are added. Please refer to the official MSK documentation for more information on expanding a cluster.
M5 brokers should typically be used for production instances. Please refer to the Amazon
reference for right sizing the cluster based on the recommended number of partitions per broker.
As well, Amazon has provided a spreadsheet-based tool for determining an optimal number of
brokers and sizing. Always test workloads on newly-provisioned clusters.
Partitions
For the purposes of this POC, a single partition is employed, which preserves the order of records as they are produced.
MSK clusters can support up to 200k partitions per cluster. An optimal number of partitions per topic can be determined from a future target throughput, based on expected production and consumption rates. Although it is possible to increase the number of partitions over time, care must be taken if messages are produced with keys: when publishing a keyed message, Kafka deterministically maps the message to a partition based on the hash of the key.
A rough formula for picking the number of partitions is based on throughput. Measure the throughput that you can achieve on a single partition for production (p) and for consumption (c). If your target throughput is t, then you need at least max(t/p, t/c) partitions. The per-partition throughput that can be achieved on the producer depends on configurations such as the batching size, compression codec, type of acknowledgement, replication factor, etc. The consumer throughput is often application dependent, since it corresponds to how fast the consumer logic can process each message, so benchmarking should be performed.
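As a hypothetical illustration: if benchmarking shows that a single partition sustains p = 10 MB/s on the producer side and c = 20 MB/s on the consumer side, and the target throughput is t = 100 MB/s, then at least max(100/10, 100/20) = max(10, 5) = 10 partitions are required.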
Please refer to the MSK best practices for right-sizing MSK Kafka clusters, as well as Confluent’s
post on How to Choose the Number of Topics/Partitions in a Kafka Cluster.
Ensuring High Availability
Brokers are always a multiple of the number of AZs, so if there are 3 AZs there can potentially be 3, 6, 9, 12… brokers configured. When creating a topic in MSK, the partitions are balanced across AZs. (When adding brokers, however, partitions need to be re-assigned.) Refer to the Amazon MSK official documentation for more information on maintaining high availability.
Three brokers allow one broker per Availability Zone, and the number of brokers is always a multiple of the number of Availability Zones provisioned to the cluster. A replication factor of 3 allows the cluster to be highly available. More information on choosing replication factor and partition count can be found here. Further testing can be performed by Qualifacts if additional optimization of these settings is desired.
Assuming the configuration of 1 topic per schema, this means each broker contains a copy (either the leader or a replica) of every topic.
Broker and Partition Recommendations
Broker size should take into account the total intended number of partitions per broker, which is the number of topics (in this case, one schema per customer, with each schema produced as a topic), multiplied by the number of partitions per topic, multiplied by the replication factor, divided by the number of brokers.
For example, if you had 100 customers ~ 100 topics (t), 3 partitions per topic (p), 3 brokers (b), and a replication factor of 3 (r):
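Partitions per broker = (t × p × r) / b = (100 × 3 × 3) / 3 = 300, meaning each broker would host roughly 300 partition replicas and should be sized against the per-broker partition recommendations accordingly.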
Note 1: The above table is derived from the official MSK documentation, table provided here.
Note 2: A DEV instance might be provisioned for a smaller size since it would be intended for testing
purposes. Reduction in resources provisioned for DEV would need to be determined based on intended
functionality in the DEV instance.
More information on right-sizing the MSK cluster can be found in the official MSK documentation here.
Topic Configuration
For the purpose of this POC, the configuration is aligned with the intention of one topic per client schema.
auto.create.topics.enable = true
delete.topic.enable = true
default.replication.factor = 3
num.partitions = 1
For more information on topic-level configuration properties for new and existing topics, see
Topic-Level Configs in the Apache Kafka documentation.
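Although auto.create.topics.enable = true allows topics to be created implicitly when SharePlex first produces to them, topics can also be created explicitly with the same settings. Below is a minimal sketch using the confluent-kafka Python admin client; the broker address and topic name are placeholders, not values from this POC.

from confluent_kafka.admin import AdminClient, NewTopic

# Connect to a broker (placeholder address) and create one topic per client schema:
# a single partition to preserve record order, replication factor 3 for availability.
admin = AdminClient({"bootstrap.servers": "broker-1:9092"})
topic = NewTopic("customer_a_schema", num_partitions=1, replication_factor=3)
futures = admin.create_topics([topic])
futures["customer_a_schema"].result()  # raises an exception if creation failed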
Authentication
Because SharePlex 10.0 does not support SASL, this POC contains two deployment versions for reference: one with unauthenticated access enabled (Terraform module msk), and another with SCRAM authentication enabled (Terraform module msk_ae). Salted Challenge Response Authentication Mechanism (SCRAM), or SASL/SCRAM, is a family of SASL mechanisms that addresses the security concerns with traditional mechanisms, such as PLAIN, that perform username/password authentication. More information on client authentication can be found in the official Amazon MSK documentation.
If Qualifacts updates to SharePlex version 10.1 or higher, client authentication with SASL can be
enabled.
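For reference, a client connecting to the SCRAM-enabled deployment (Terraform module msk_ae) would authenticate with SASL/SCRAM over TLS. The sketch below uses the confluent-kafka Python client; the broker hostname and credential values are placeholders (for Amazon MSK, SASL/SCRAM credentials are stored in AWS Secrets Manager and associated with the cluster, and brokers listen for SASL/SCRAM traffic on port 9096).

from confluent_kafka import Producer

# SASL/SCRAM over TLS; the username and password come from the Secrets Manager
# secret associated with the MSK cluster (values shown here are placeholders).
producer = Producer({
    "bootstrap.servers": "b-1.msk-cluster-dev.example.amazonaws.com:9096",
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "SCRAM-SHA-512",
    "sasl.username": "shareplex_user",
    "sasl.password": "<retrieved from AWS Secrets Manager>",
})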
Encryption
The MSK cluster applies both encryption at rest and encryption in transit. Amazon MSK always encrypts your data at rest; an AWS-managed KMS key is generated during the MSK cluster creation process and stored in the AWS Key Management Service for the AWS region we are working in. Amazon MSK uses Transport Layer Security (TLS) for encryption in transit. In-transit encryption via TLS 1.2 can be enabled with configuration settings in Terraform.
This POC contains two deployment versions for reference: one with plaintext enabled and no
encryption applied in-transit (Terraform module msk), and another with TLS encryption
(Terraform module msk_ae).
Note: If desired, you can use customer managed keys. In that case, you would need to create the key beforehand and provide it inside the encryption_info block as the value of the encryption_at_rest_kms_key_arn attribute.
Logging
CloudWatch Log Group: In addition, we create a CloudWatch Log Group where the broker logs are collected. It can be used to inspect broker activity and to dig into any problems.
For longer-term or indefinite retention of logs, S3 can also be configured if desired, as long-term storage of logs within CloudWatch can be cost prohibitive. Logs can be forwarded from CloudWatch or S3 to New Relic if desired, but this forwarding functionality is not configured for the purposes of this POC.
Security Groups
MSK Security Group (user-managed): We will create a security group and attach it to the MSK
cluster. This security group will allow our clients to access our MSK brokers.
Client Security Group: We will create an SSH security group and attach it to the EC2 client.
Apply with Terraform
The Terraform project for this POC is organized as follows:
<root>
└> dev.env - contains the values of the variables defined in the variables.tf files, across all modules
└> <module name>
    └> datasources.tf - contains data resource declarations (used to access data sources) and local values (value assignments for expressions which can be referenced within the module).
    └> main.tf - the main config file; describes the MSK cluster and other necessary pieces.
    └> outputs.tf - contains output values to make information about your infrastructure available on the command line, and can expose information for other Terraform configurations to use. Similar to return values in programming languages.
    └> variables.tf - contains the Terraform variables used to parameterize the declarations, making it easier to customize the whole project; values are assigned to these variables in the dev.env file.
We can apply our infrastructure configuration with the command below (note: the MSK cluster can take up to an hour to be created):
$ terraform apply
To run Terraform commands, Terraform requires an initialized working directory, and the commands act only upon the currently selected workspace.
Terraform will present the list of actions that the configuration corresponds to, and we accept them by answering yes. Once approved, Terraform creates the resources by connecting to our AWS account. Further information can be found in the POC Terraform project README.md file as well as the official Terraform documentation.
SharePlex Configuration
Qualifacts currently uses SharePlex version 10.0. Existing SharePlex replication from Oracle can be configured to publish to a Kafka topic. For more information, please refer to the SharePlex technical documentation.
Consuming Events Using MSK Connect/Snowflake Connector
The MSK Connect Snowflake Sink Connector is created as a resource from the Snowflake Sink custom plugin, which is built from a Maven artifact (repository), as shown in the sequence below. This resource exists as part of the Terraform configuration and deployment for MSK Connect.
Maven artifact → archive file → S3 object → custom plugin → MSK Connect Snowflake Sink Connector
The Snowflake connector for Kafka is designed to run in a Kafka Connect cluster to read data from
Kafka topics and write the data into Snowflake tables. SharePlex produces messages to MSK that
represent single rows.
MSK Connect’s Snowflake connector consumes messages from topics and provides a choice of configuration options for how these messages are written to Snowflake:
a) Map one database schema to a topic, and then consume these topics into one table per topic (this is what is currently configured in the POC for the MSK Connect settings)
b) Map one database schema to one topic, and then map these topics to a single table in Snowflake via a hard-coded mapping in MSK Connect, using the topic2table.map configuration setting in connector_configuration for the aws_mskconnect_connector resource (an example mapping is shown after this list). More information on this configuration with topic2table.map can be found in the official Snowflake documentation here.
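As a hypothetical illustration of option b), topic2table.map is a comma-separated list of topic:table pairs; the topic and table names below are placeholders, not values from this POC:
topic2table.map = customer_a_schema:ALL_CUSTOMERS,customer_b_schema:ALL_CUSTOMERS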
For this POC, MSK Connect and the Snowflake Connector will also be configured and deployed
with a Terraform module. Please refer to the official documentation for more information on
Configuration and Management of the Snowflake Connector.
Ingestion into Snowflake
For the purposes of this POC, tables are created in the RAW_DATA_DEV database / RAW_CUSTOMER schema. Each table that exists in every customer schema in Oracle is represented in Snowflake as a single table containing the data for all customers.
Controlling Snowflake Costs
The connector buffers records before writing them to Snowflake; the buffer size defaults to 5000000 bytes (5 MB). The records are compressed when they are written to data files; as a result, the size of the records in the buffer may be larger than the size of the data files created from the records.
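The connector’s buffering properties (buffer.count.records, buffer.flush.time, and buffer.size.bytes) are the main levers for balancing latency against file count: larger buffers generally produce fewer, larger files and therefore less per-file Snowpipe overhead, at the cost of higher latency. Appropriate values should be confirmed against the Snowflake connector documentation and tested against this workload.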
If satisfactory throughput and cost performance cannot be achieved for all pipelines using the Snowflake Connector for Kafka, pipelines that can tolerate higher latency (~1 hour) can use a different ingestion approach.
Transformation with dbt Cloud
A dbt project informs dbt about the context of your project and how to transform your data (build your data sets). By design, dbt enforces the top-level structure of a dbt project, such as the dbt_project.yml file, the models directory, the snapshots directory, and so on. Within these top-level directories, you can organize your project in any way that meets the needs of your organization and data pipeline.
At a minimum, all a project needs is the dbt_project.yml project configuration file. dbt supports a number of different resources, but for this POC the project will specifically include:
Resources and descriptions:
models: Each model lives in a single file and contains logic that either transforms raw data into a dataset that is ready for analytics or, more often, is an intermediate step in such a transformation.
seeds: CSV files with static data that you can load into your data platform with dbt.
tests: SQL queries that you can write to test the models and resources in your project.
sources: A way to name and describe the data loaded into your warehouse by your Extract and Load tools.
For the purposes of this POC, the data copied to Snowflake tables from MSK will represent the
raw data. From here, dbt incremental models will be used to create a staging layer which can then
be used for purposes of analytics and reporting (see diagram below).
dbt model files can be manually written; however, as the number of tables utilized at QSI is very large, these models will be programmatically generated by a Python utility script, which will also be a component within the dbt project. The utility script will contain the following functions:
● Generate metadata - a flat file extract from Oracle containing table metadata will be used to generate a preliminary JSON output containing definitions for fields and data types for Snowflake staging tables. A mapping of relevant Oracle data types to Snowflake data types will be hard coded within the function.
● Refresh run frequencies - for each Snowflake table, a run frequency tag will be defined within the model file which will designate its frequency with the dbt Cloud Scheduler. From a CSV input defining the frequencies, a dbt seed file will be generated containing these definitions to be applied when the models are run in dbt.
● Generate sources - for all of the tables available in Snowflake in the RAW_DATA_DEV database / RAW_CUSTOMER schema, a schema.yml file will be created which names and describes the data loaded into Snowflake, effectively defining the models to be built.
● Generate staging models - this will programmatically generate the model files, which are templated SQL statements. When the dbt project is run, these model files take inputs from the project for the templated values and compile into SQL, which is then executed in the Snowflake environment to create new tables. Compaction also occurs so that the most recent version of a given record is used. (A sketch of this function is shown after this list.)
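A minimal, hypothetical sketch of the Generate staging models function is shown below; the metadata format, column mapping, incremental filter column, and file layout are illustrative assumptions rather than the final QSI implementation.

import json
from pathlib import Path

# Hypothetical incremental model template. The _loaded_at filter column and the
# source/schema naming are placeholders, not the actual QSI conventions.
MODEL_TEMPLATE = """{{{{ config(materialized='incremental', unique_key='{unique_key}') }}}}

select
{columns}
from {{{{ source('{source_name}', '{table_name}') }}}}
{{% if is_incremental() %}}
where _loaded_at > (select max(_loaded_at) from {{{{ this }}}})
{{% endif %}}
"""

def generate_staging_model(metadata: dict, output_dir: Path) -> Path:
    """Render one staging model file from table metadata (name, columns, key)."""
    columns = ",\n".join(
        f"    {col['name']}::{col['snowflake_type']} as {col['name'].lower()}"
        for col in metadata["columns"]
    )
    sql = MODEL_TEMPLATE.format(
        unique_key=metadata["primary_key"],
        columns=columns,
        source_name=metadata["source_name"],
        table_name=metadata["table_name"],
    )
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / f"stg_{metadata['table_name'].lower()}.sql"
    path.write_text(sql)
    return path

if __name__ == "__main__":
    meta = json.loads(Path("metadata/clients.json").read_text())
    print(generate_staging_model(meta, Path("models/staging")))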
The POC will be configured within dbt Cloud, which provides a development environment to help
you build, test, run, and version control your project faster. It also includes an easier way to share
your dbt project's documentation with your team. These development tasks are directly built into
dbt Cloud for an integrated development environment (IDE). Refer to Develop in the Cloud for more
details.
With the project configured, you execute commands on the dbt project to create its intended
outputs. The commands you commonly use are:
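Typically these include:
● dbt seed - load the CSV seed files (for example, the run-frequency definitions) into Snowflake
● dbt run - compile the models and execute them against Snowflake
● dbt test - run the tests defined for models and sources
● dbt docs generate - build the project documentation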
For information on all dbt commands and their arguments (flags), see the dbt command reference.
If you want to list all dbt commands from the command line, run dbt --help. To list a dbt command’s specific arguments, run dbt COMMAND_NAME --help.
Please refer to the official dbt documentation for more information on using dbt.
Risks
● Production MSK clusters without client authentication should be avoided if possible.
● The MSK cluster should always be configured for both encryption at rest and in-transit.
● MSK clusters should be implemented with at least 3 brokers in 3 availability zones, with a
replication factor of 3 in order to ensure redundancy.