Amazon MSK to Snowflake


Qualifacts

v.1.3

Table of Contents
Introduction
Kafka Overview
High-Level Architecture
AWS Configurations
AWS Account User Roles and Privileges
Cluster Configuration
Scaling Configurations with Amazon MSK
Partitions
Ensuring High Availability
Broker and Partition Recommendations
Topic Configuration
Authentication
Encryption
Logging
Security Groups
Apply with Terraform
SharePlex Configuration
Consuming Events Using MSK Connect/Snowflake Connector
Ingestion into Snowflake
Controlling Snowflake Costs
Transformation with dbt Cloud
Risks

Introduction
Qualifacts currently leverages a multi-tenant solution with 13 transactional Oracle servers,
accommodating approximately 400 customers. Each of these customers shares the same schema
definition, which comprises around 2000 tables. From these, data is surfaced to customers in
multiple ways, including an operational data store (ODS) of real-time tables which is then replicated to
Snowflake and Looker, a Hive connector which takes in data from the ODS as well as Hadoop,
SFTP extracts of batched data via Talend, and embedded reporting solutions.

The goal of the production pipeline POC is to enable replication of the source data from
these Oracle tables into Snowflake. Longer term, the overall goal is to house all data for reporting in
the Snowflake data platform, which will be the source of all reporting across the organization.
Pursuing this strategy will allow Qualifacts to eliminate auxiliary reporting mechanisms, reducing
cost and complexity, and to leverage reporting data in new ways with expanded business
implications.

Kafka Overview
Apache Kafka software uses a publish and subscribe model to write and read streams of records,
similar to a message queue or enterprise messaging system. Kafka allows processes to read and
write messages asynchronously. A subscriber does not need to be connected directly to a
publisher; a publisher can queue a message in Kafka for the subscriber to receive later.

An application publishes messages to a topic, and an application subscribes to a topic to receive
those messages. Topics can be divided into partitions to increase scalability. Producers send
records to the cluster, which holds on to these records and hands them out to consumers.

The key abstraction in Kafka is the topic. Producers publish their records to a topic, and consumers
subscribe to one or more topics. A Kafka topic is like a sharded write-ahead log. Producers append
records to these logs and consumers subscribe to changes. Each record is a key-value pair. The key
is used for assigning the record to a log partition (unless the publisher specifies the partition).

Here is a simple example of a single producer and consumer reading and writing from a
two-partition topic.

This shows a producer process appending to the logs for the two partitions, and a consumer
reading from the same logs. Each record in the log has an associated entry number that we call the
offset. This offset is used by the consumer to describe its position in each of the logs. Partitions are
spread across a cluster of machines, allowing a topic to hold more data than can fit on any one
machine.

High-Level Architecture

As you can see in the above diagram:

AWS Account: Everything is created in the existing Qualifacts AWS Account.

Region: We work within a single, specific Region; everything is created in this region
(“us-east-1” in the diagram).

VPC: We leverage the existing VPC with VPC ID vpc-0cd4982a93bdf3e0f and IPv4
CIDR block 172.28.0.0/20.

(Typically, for a new MSK cluster setup, you would first create a VPC and devise an IP strategy
with a CIDR block declaration to configure subnets. However, for the POC, this VPC already
exists in Qualifacts’ AWS account, so the VPC, subnets, internet gateway, and route entry are not
declared as AWS resources in the MSK Terraform module.)

Subnets and Availability Zones: Inside the VPC, for the purposes of this POC, we utilize 3 subnets,
one subnet per Availability Zone. We have 3 Availability Zones in the us-east-1 region:

1. us-east-1c
2. us-east-1d
3. us-east-1b

The 3 subnets we will utilize will have the following IPv4 CIDR blocks allocated:

1. us-east-1c subnet: 172.28.3.0/24
2. us-east-1d subnet: 172.28.4.0/24
3. us-east-1b subnet: 172.28.5.0/24

MSK Cluster: This POC creates an MSK cluster. For the purposes of the POC, this cluster will have
3 brokers, each one living in its own availability zone.

EBS Volume: Each one of the brokers is going to have its own EBS volume storage.
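For reference, the broker layout described above corresponds roughly to an aws_msk_cluster declaration like the following. This is a minimal sketch rather than the POC's actual module: the Kafka version, volume size, and the variable and security group references are illustrative placeholders, and the storage attribute name depends on the AWS provider version in use.

resource "aws_msk_cluster" "msk_cluster_dev" {
  cluster_name           = "msk-cluster-dev"
  kafka_version          = "2.8.1"              # illustrative; use the version selected for the POC
  number_of_broker_nodes = 3                    # one broker per AZ/subnet

  broker_node_group_info {
    instance_type   = "kafka.m5.large"          # see Broker and Partition Recommendations below
    client_subnets  = var.subnet_ids            # the 3 subnets listed above (hypothetical variable)
    security_groups = [aws_security_group.msk.id]

    storage_info {
      ebs_storage_info {
        volume_size = 1000                      # EBS GiB per broker; illustrative
      }
    }
  }
}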

AWS Configurations

AWS Account User Roles and Privileges


When executing Terraform commands to create AWS resources, we use a non-root AWS
account, and this account needs to have the appropriate IAM Permission Policies attached. (We
don’t want to use the root account because this is considered a bad practice for security reasons.)

In the IAM Management Console, attach the Permission Policies shown below to the intended
user account.

Qualifacts is already leveraging Terraform, so IAM permissions are likely already configured.
Generally, Terraform requires the following permission policies for Amazon MSK:

● AmazonECS_FullAccess (AWS managed)
● AmazonMSKFullAccess (AWS managed)
● AmazonS3FullAccess (AWS managed)
● SystemAdministrator (AWS managed)
● DatabaseAdministrator (AWS managed)

As well, the following policy is used for the EC2 instance to connect to MSK for admin/testing
purposes.

● msk-client-iam-policy-dev (Customer managed, defined below)

msk-client-iam-policy-dev JSON definition:

{
  "Statement": [
    {
      "Action": [
        "kafka-cluster:Connect",
        "kafka-cluster:AlterCluster",
        "kafka-cluster:DescribeCluster"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:kafka:us-east-1:101443393375:cluster/msk-cluster-dev/*"
    },
    {
      "Action": [
        "kafka-cluster:*Topic*",
        "kafka-cluster:WriteData",
        "kafka-cluster:ReadData"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:kafka:us-east-1:101443393375:cluster/msk-cluster-dev/*",
        "arn:aws:kafka:us-east-1:101443393375:topic/msk-cluster-dev/*",
        "arn:aws:kafka:us-east-1:101443393375:group/msk-cluster-dev/*"
      ]
    },
    {
      "Action": [
        "kafka-cluster:AlterGroup",
        "kafka-cluster:DescribeGroup"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:kafka:us-east-1:101443393375:cluster/msk-cluster-dev/*",
        "arn:aws:kafka:us-east-1:101443393375:group/msk-cluster-dev/*"
      ]
    }
  ],
  "Version": "2012-10-17"
}
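If the customer-managed policy is also maintained in Terraform (optional; the file path and role reference below are hypothetical), it could be declared and attached to the EC2 client's role along these lines:

resource "aws_iam_policy" "msk_client_dev" {
  name   = "msk-client-iam-policy-dev"
  policy = file("${path.module}/policies/msk-client-iam-policy-dev.json")  # the JSON shown above
}

resource "aws_iam_role_policy_attachment" "msk_client_dev" {
  role       = aws_iam_role.msk_client_ec2.name   # hypothetical role used by the EC2 client instance
  policy_arn = aws_iam_policy.msk_client_dev.arn
}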

Cluster Configuration
In addition to the MSK documentation on cluster configuration, please reference Amazon’s
suggested best practices for Amazon MSK.

Scaling Configurations with Amazon MSK


An MSK cluster can be scaled vertically, horizontally, or automatically. Vertically, MSK clusters
can be scaled on demand by changing the size or family of the brokers without reassigning Apache
Kafka partitions. Changing the size or family of brokers allows the MSK cluster’s compute
resources to be adjusted based on changes in workloads, without interfering with ongoing cluster
operations.

Scaling horizontally is accomplished by adding brokers to the cluster. Additional brokers
must be provisioned as a multiple of the number of AZs. Note that you can only increase the
number of brokers, not decrease it later, so horizontal scaling with additional brokers should be
carefully considered beforehand. Also note that partitions must be reassigned after brokers are
added. Please refer to the official MSK documentation for more information on expanding a
cluster.

M5 brokers should typically be used for production instances. Please refer to the Amazon
reference for right sizing the cluster based on the recommended number of partitions per broker.
As well, Amazon has provided a spreadsheet-based tool for determining an optimal number of
brokers and sizing. Always test workloads on newly-provisioned clusters.

In addition to provisioned scaling configurations, scaling can be accomplished automatically with
MSK Serverless. More information on creating, configuring, and monitoring MSK Serverless
clusters can be found in the official Amazon MSK documentation. The choice of whether to
provision MSK clusters or allow them to autoscale with a serverless solution is multifaceted and
discussed at length in this AWS Big Data Blog post.

Partitions
For the purposes of this POC, a single partition is employed, which preserves the order of records
as they are produced.

MSK clusters can support up to 200k partitions per cluster. An optimal number of partitions per
topic can be determined from a future target throughput, based on expected production and
consumption rates. Although it is possible to increase the number of partitions over time, one has
to be careful if messages are produced with keys: when publishing a keyed message, Kafka
deterministically maps the message to a partition based on the hash of the key.

A rough formula for picking the number of partitions is based on throughput. You measure the
throughput that you can achieve on a single partition for production (p) and consumption (c). Let’s
say your target throughput is t. Then you need at least max(t/p, t/c) partitions. The
per-partition throughput that one can achieve on the producer depends on configurations such as
the batching size, compression codec, type of acknowledgement, replication factor, etc. The
consumer throughput is often application dependent since it corresponds to how fast the
consumer logic can process each message, so benchmarking should be performed.
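For example, if benchmarking showed a single partition sustaining roughly p = 10 MB/s on the producer side and c = 20 MB/s on the consumer side, with a target throughput of t = 100 MB/s (all figures illustrative only), then:

max(t/p, t/c) = max(100/10, 100/20) = max(10, 5) = 10 partitions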

Please refer to the MSK best practices for right-sizing MSK Kafka clusters, as well as Confluent’s
post on How to Choose the Number of Topics/Partitions in a Kafka Cluster.

Ensuring High Availability


In ensuring high availability for production workloads, three Availability Zones should be utilized,
as is done in this POC, and a replication factor of 3 should be applied. This way, two AZs can still
serve traffic if one goes down. With Amazon MSK, in-cluster replication traffic between AZs is not
charged (unlike self-managed Kafka on EC2), so in this case a higher replication factor will not
affect cluster costs. More information can be found in the official Amazon MSK documentation on
replication.

The number of brokers is always a multiple of the number of AZs, so if there are 3 AZs there can
be 3, 6, 9, 12… brokers configured. When creating a topic in MSK, its partitions are balanced
across AZs. (When adding brokers, however, existing partitions need to be reassigned.) Refer to the
official Amazon MSK documentation for more information on maintaining high availability.

Broker and Partition Recommendations


In distributed clusters, you are better off having a smaller number of larger nodes than a larger
number of smaller nodes. In this regard, Bytecode recommends an initial setting of 3 brokers, with
3 partitions per topic, and a default replication factor of 3.

Three brokers allow one broker per Availability Zone, and the number of brokers is always a
multiple of the number of Availability Zones provisioned to the cluster. A replication factor of 3
allows the cluster to be highly available. More information on choosing replication factor and
partition count can be found here. Further testing can be performed by Qualifacts if additional
optimization of these settings is desired.

Assuming the configuration of 1 topic per schema, this means each broker contains a copy (either
the leader or replica) of every topic.

Broker size can be chosen based on the total intended number of partitions per broker, which is
the number of topics (in this case, one schema per customer, with each schema produced as a
topic), multiplied by the number of partitions per topic, divided by the number of brokers,
multiplied by the replication factor.

For example, if you had 100 customers ~ 100 topics (t), 3 partitions per topic, (p), 3 brokers
(b), and a replication factor of 3 (r):

t * p / b * r = 100 * 3 / 3 * 3 = 300 partitions per broker

Broker Type         Recommended # of partitions        # of customers (assumes 1 schema per customer,
                    (leader + replicas) per broker     3 brokers, and a replication factor of 3)

kafka.m5.large      1000                               333
kafka.m5.xlarge     1000                               333
kafka.m5.2xlarge    2000                               666
kafka.m5.4xlarge    4000                               1333

Note 1: The above table is derived from the official MSK documentation, table provided here.

Note 2: A DEV instance might be provisioned for a smaller size since it would be intended for testing
purposes. Reduction in resources provisioned for DEV would need to be determined based on intended
functionality in the DEV instance.

More information can be found in the official MSK documentation here for right-sizing the MSK cluster.

Topic Configuration
For the purpose of this POC, configuration is aligned with the intention of one topic per schema
per client.

Topics are configured to be created automatically via the server_properties setting
auto.create.topics.enable = true in the aws_msk_configuration resource in Terraform for
the MSK module.

auto.create.topics.enable = true
delete.topic.enable = true
default.replication.factor = 3
num.partitions = 1

For more information on topic-level configuration properties for new and existing topics, see
Topic-Level Configs in the Apache Kafka documentation.
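As an illustration, these properties are carried in the server_properties argument of the aws_msk_configuration resource mentioned above; the configuration name and Kafka version below are placeholders:

resource "aws_msk_configuration" "msk_cluster_dev" {
  name           = "msk-cluster-dev-configuration"   # placeholder name
  kafka_versions = ["2.8.1"]                         # illustrative

  server_properties = <<PROPERTIES
auto.create.topics.enable = true
delete.topic.enable = true
default.replication.factor = 3
num.partitions = 1
PROPERTIES
}

The cluster resource then points at this configuration through its configuration_info block (arn and revision).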

Authentication
Because SharePlex 10.0 does not support SASL, this POC contains two
deployment versions for reference: one with unauthenticated access enabled (Terraform module
msk), and another with SCRAM authentication enabled (Terraform module msk_ae). Salted
Challenge Response Authentication Mechanism (SCRAM), or SASL/SCRAM, is a family of SASL
mechanisms that addresses the security concerns with traditional mechanisms that perform
username/password authentication, such as PLAIN. More information on client authentication can
be found in the official Amazon MSK documentation.

If Qualifacts updates to SharePlex version 10.1 or higher, client authentication with SASL can be
enabled.
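In the msk_ae module, SCRAM is enabled on the cluster and the SCRAM credentials are associated from AWS Secrets Manager. A minimal sketch, assuming a secret whose name begins with the required AmazonMSK_ prefix and which is encrypted with a customer managed KMS key (the resource labels below are hypothetical):

# Inside the aws_msk_cluster resource (msk_ae module):
client_authentication {
  sasl {
    scram = true
  }
}

# Secrets Manager credentials are then attached to the cluster:
resource "aws_msk_scram_secret_association" "msk_cluster_dev" {
  cluster_arn     = aws_msk_cluster.msk_cluster_dev.arn
  secret_arn_list = [aws_secretsmanager_secret.msk_scram_dev.arn]   # hypothetical secret
}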

Encryption
The MSK cluster applies both encryption at rest and encryption in transit. Amazon MSK always
encrypts your data at rest; for this, an AWS-managed KMS key is generated during the MSK
cluster creation process and stored in the AWS Key Management Service for the AWS region we
are working in. Amazon MSK uses Transport Layer Security for encryption in transit. In-transit
encryption via TLS 1.2 can be enabled with configuration settings in Terraform.

This POC contains two deployment versions for reference: one with plaintext enabled and no
encryption applied in-transit (Terraform module msk), and another with TLS encryption
(Terraform module msk_ae).

Note: If desired, you can use a customer managed KMS key. In that case, you would need to create
the key beforehand and provide it inside the encryption_info block as the value of the
encryption_at_rest_kms_key_arn attribute.
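A sketch of the corresponding encryption_info block inside the aws_msk_cluster resource; the KMS key reference is a hypothetical customer managed key, and omitting the attribute lets MSK fall back to an AWS-managed key:

encryption_info {
  encryption_at_rest_kms_key_arn = aws_kms_key.msk_dev.arn   # optional; omit to use the AWS-managed key

  encryption_in_transit {
    client_broker = "TLS"    # "PLAINTEXT" in the unauthenticated msk module, "TLS" in msk_ae
    in_cluster    = true     # encrypt traffic between brokers
  }
}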

Logging
CloudWatch Log Group: In addition, we are going to create a CloudWatch Log Group. This is the
log group where the broker logs will be collected. It can be used to inspect what the brokers are
doing and to dig into any problems.

For longer-term or indefinite storage of logs, S3 can also be configured if desired, as long-term
storage of logs within CloudWatch can be cost prohibitive. Logs can be forwarded from
CloudWatch or S3 to New Relic if desired, but this log forwarding functionality is not configured
for the purposes of this POC.
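A sketch of the log group and the broker log settings inside the aws_msk_cluster resource; the log group name and retention period are illustrative, and the S3 block is shown disabled to match the POC:

resource "aws_cloudwatch_log_group" "msk_broker_logs_dev" {
  name              = "msk-cluster-dev-broker-logs"   # placeholder name
  retention_in_days = 30                              # illustrative retention
}

# Inside the aws_msk_cluster resource:
logging_info {
  broker_logs {
    cloudwatch_logs {
      enabled   = true
      log_group = aws_cloudwatch_log_group.msk_broker_logs_dev.name
    }
    s3 {
      enabled = false   # could be enabled with a bucket/prefix for long-term storage
    }
  }
}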

Security Groups
MSK Security Group (user-managed): We will create a security group and attach it to the MSK
cluster. This security group will allow our clients to access our MSK brokers.

Client Security Group: We will create an SSH security group and attach it to the EC2 client
instance.
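A sketch of the two security groups; names, the VPC variable, and the CIDR range are illustrative, and the broker ports to open depend on the listener in use (for example 9092 for plaintext, 9094 for TLS, 9096 for SASL/SCRAM):

resource "aws_security_group" "msk" {
  name   = "msk-cluster-dev-sg"    # placeholder
  vpc_id = var.vpc_id              # the existing VPC noted above (hypothetical variable)

  ingress {
    description     = "Kafka clients to brokers"
    from_port       = 9092
    to_port         = 9096
    protocol        = "tcp"
    security_groups = [aws_security_group.msk_client.id]
  }
}

resource "aws_security_group" "msk_client" {
  name   = "msk-client-ssh-sg"     # placeholder
  vpc_id = var.vpc_id

  ingress {
    description = "SSH to the EC2 client"
    from_port   = 22
    to_port     = 22
    protocol    = "tcp"
    cidr_blocks = ["172.28.0.0/20"]   # illustrative; restrict as appropriate
  }
}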

Apply with Terraform


We use Terraform to describe everything that we want to be part of the AWS architecture for this
POC. Terraform allows specification of infrastructure as code; you can write code that represents
the desired state of your infrastructure, rather than using the console.

Each sub-directory in a Terraform project is considered a separate module and is deployed
independently, but modules can also share information with each other about the resources deployed.
Terraform is a stateful application and maintains the state of deployed resources. The following
shows the typical files that we will use (not all files apply to every module):

<root>
└> dev.env - contains the values of the variables defined in the variables.tf files, across all modules
└> <module name>
└> datasources.tf - Contains data resource declarations (used to access data sources)
and local values (value assignment for expressions which can be referenced within the
module).
└> main.tf - the main config file; describes the MSK cluster and other necessary pieces
└> outputs.tf - Contains output values to make information about your infrastructure
available on the command line, and can expose information for other Terraform
configurations to use. Similar to return values in programming languages.
└> variables.tf - contains the Terraform variables used to parameterize the declarations
so that it will make it easier to customize the whole project; values are assigned to these
variables in the dev.env file.
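For example, outputs.tf in the msk module might expose the connection strings that clients and MSK Connect need; the attribute names come from the aws_msk_cluster resource, while the resource label is a placeholder:

output "bootstrap_brokers_tls" {
  description = "TLS bootstrap broker string for the MSK cluster"
  value       = aws_msk_cluster.msk_cluster_dev.bootstrap_brokers_tls
}

output "zookeeper_connect_string" {
  description = "ZooKeeper connection string for admin operations"
  value       = aws_msk_cluster.msk_cluster_dev.zookeeper_connect_string
}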

We can apply our infrastructure configuration with the command (Note: The MSK cluster can take
up to an hour to be created):
$ terraform apply

To run Terraform commands, Terraform requires an initialized working directory and the
commands only act upon the currently selected workspace.

Terraform will present a list of actions that the configuration corresponds to. We will have to
accept the actions by answering yes. Once we approve, Terraform will create the resources by
connecting to our AWS account. Further information can be found in the POC Terraform project
README.md file as well as the official Terraform documentation.

SharePlex Configuration
Qualifacts currently uses SharePlex version 10.0. Existing SharePlex replication from Oracle can
be configured to a Kafka topic. For more information please refer to the SharePlex technical
documentation.

Consuming Events Using MSK Connect/Snowflake Connector


Kafka Connect is a framework for connecting Kafka with external systems, including databases. A
Kafka Connect cluster is a separate cluster from the Kafka cluster. The Kafka Connect cluster
supports running and scaling out connectors (components that support reading and/or writing
between external systems).

The MSK Connect Snowflake Sink Connector is created as a resource from the Snowflake Sink Custom
Plugin, which is built from a Maven artifact (repository), shown in the sequence below. This resource
exists as part of the Terraform configuration and deployment for MSK Connect.

Maven artifact → archive file → S3 object → custom plugin → MSK Connect Snowflake Sink Connector

The Snowflake connector for Kafka is designed to run in a Kafka Connect cluster to read data from
Kafka topics and write the data into Snowflake tables. SharePlex produces messages to MSK that
represent single rows.

MSK Connect’s Snowflake connector consumes messages from topics and provides a choice of
configuration options for how these messages are written to Snowflake:

a) To map one database schema to a topic, and then to consume these topics into one table
per topic (this is what is currently configured in the POC for MSK Connect settings)

b) To map one database schema to one topic, and then to map these topics to a single table in
Snowflake via a hard-coded mapping in MSK Connect, using the snowflake.topic2table.map
configuration setting in connector_configuration for the aws_mskconnect_connector
resource. More information on this configuration with snowflake.topic2table.map can be found
in the official Snowflake documentation here.

MSK Connect is configured to auto-scale via the capacity settings in the
aws_mskconnect_connector resource in Terraform for the msk_connect module.

For this POC, MSK Connect and the Snowflake Connector will also be configured and deployed
with a Terraform module. Please refer to the official documentation for more information on
Configuration and Management of the Snowflake Connector.
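A condensed sketch of that connector resource is shown below. Only a representative subset of settings appears; the connector properties follow the Snowflake Connector for Kafka documentation, and the account URL, user, key variable, topic list, resource labels, and IAM role are placeholders:

resource "aws_mskconnect_connector" "snowflake_sink_dev" {
  name                 = "snowflake-sink-dev"   # placeholder
  kafkaconnect_version = "2.7.1"                # illustrative

  capacity {
    autoscaling {
      mcu_count        = 1
      min_worker_count = 1
      max_worker_count = 2
      scale_in_policy  { cpu_utilization_percentage = 20 }
      scale_out_policy { cpu_utilization_percentage = 80 }
    }
  }

  connector_configuration = {
    "connector.class"         = "com.snowflake.kafka.connector.SnowflakeSinkConnector"
    "topics"                  = "customer_schema_topic"                     # placeholder topic list
    "snowflake.url.name"      = "qualifacts.snowflakecomputing.com:443"     # placeholder account URL
    "snowflake.user.name"     = "KAFKA_CONNECTOR"                           # placeholder user
    "snowflake.private.key"   = var.snowflake_private_key                   # hypothetical variable
    "snowflake.database.name" = "RAW_DATA_DEV"
    "snowflake.schema.name"   = "RAW_CUSTOMER"
    # Option (b): route several topics into one table
    # "snowflake.topic2table.map" = "topic1:TARGET_TABLE,topic2:TARGET_TABLE"
  }

  kafka_cluster {
    apache_kafka_cluster {
      bootstrap_servers = aws_msk_cluster.msk_cluster_dev.bootstrap_brokers_tls
      vpc {
        security_groups = [aws_security_group.msk.id]
        subnets         = var.subnet_ids
      }
    }
  }

  kafka_cluster_client_authentication { authentication_type = "NONE" }
  kafka_cluster_encryption_in_transit { encryption_type = "TLS" }

  plugin {
    custom_plugin {
      arn      = aws_mskconnect_custom_plugin.snowflake_sink.arn       # custom plugin built from the Maven artifact
      revision = aws_mskconnect_custom_plugin.snowflake_sink.latest_revision
    }
  }

  service_execution_role_arn = aws_iam_role.msk_connect_dev.arn        # hypothetical service execution role
}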

Ingestion into Snowflake


The Snowflake Connector for Kafka utilizes Snowflake Snowpipe to ingest data into Snowflake. A
pipe is a named, first-class Snowflake object that contains a COPY statement used by Snowpipe.
The COPY statement identifies the source location of the data files (i.e., a stage) and a target table.
Please refer to the official Snowflake documentation here for more information on Snowpipe.

For the purposes of this POC, tables are created in the RAW_DATA_DEV database /
RAW_CUSTOMER schema. Each table across every customer schema in Oracle is represented in
Snowflake as a single table comprising data from all customers.

Controlling Snowflake Costs


The Snowflake-managed compute for Snowpipe is priced higher than warehouse compute time.
To achieve optimal ingest throughput and cost performance, Snowflake recommends that file sizes
to be ingested are 100-250 MB or larger. The following properties in the Snowflake Connector for
Kafka can be used to influence file size (an illustrative tuning example follows the list below):

● buffer.count.records - Number of records buffered in memory per Kafka partition before
ingesting to Snowflake. The default value is 10000 records.
● buffer.flush.time - Number of seconds between buffer flushes, where the flush is from
Kafka’s memory cache to the internal stage. The default value is 120 seconds.
● buffer.size.bytes - Cumulative size in bytes of records buffered in memory per Kafka
partition before they are ingested into Snowflake as data files. The default value is
5000000 (5 MB). The records are compressed when they are written to data files; as a
result, the size of the records in the buffer may be larger than the size of the data files
created from the records.
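For example, the buffer settings could be raised in connector_configuration so that flushed files approach the recommended size. The values below are illustrative only and should be validated against actual message volumes, connector limits, and acceptable latency:

# Added to connector_configuration in the aws_mskconnect_connector resource
"buffer.count.records" = "100000"      # allow more records per partition before flushing
"buffer.flush.time"    = "300"         # flush at most every 5 minutes
"buffer.size.bytes"    = "100000000"   # ~100 MB of buffered records; files are compressed on write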

If satisfactory throughput and cost performance cannot be achieved for all pipelines using the
Snowflake Connector for Kafka, pipelines which can have higher latency (~1 hour) can use a
different approach to ingest.

Transformation with dbt Cloud


dbt (data build tool) is used to transform data once it is staged in Snowflake. dbt transforms the
data in the warehouse using simple select statements, effectively creating your entire
transformation process with code. You can write custom business logic using SQL, automate data
quality testing, deploy the code, and deliver trusted data with data documentation side-by-side
with the code.

A dbt project informs dbt about the context of your project and how to transform your data (build
your data sets). By design, dbt enforces the top-level structure of a dbt project such as the
dbt_project.yml file, the models directory, the snapshots directory, and so on. Within the
directories of the top-level, you can organize your project in any way that meets the needs of your
organization and data pipeline.

At a minimum, all a project needs is the dbt_project.yml project configuration file. dbt supports
a number of different resources, but this POC’s project will specifically include:

Resource Description

models Each model lives in a single file and contains logic that either transforms raw
data into a dataset that is ready for analytics or, more often, is an intermediate
step in such a transformation.

seeds CSV files with static data that you can load into your data platform with dbt.

tests SQL queries that you can write to test the models and resources in your
project.

macros Blocks of code that you can reuse multiple times.

docs Docs for your project that you can build.

sources A way to name and describe the data loaded into your warehouse by your
Extract and Load tools.

For the purposes of this POC, the data copied to Snowflake tables from MSK will represent the
raw data. From here, dbt incremental models will be used to create a staging layer which can then
be used for purposes of analytics and reporting (see diagram below).

dbt model files can be written manually; however, as the number of tables utilized at QSI is very
large, these models will be programmatically generated using Python code as a utility script, which
will also be a component within the dbt project. The utility script will contain the following
functions:
● Generate metadata - a flat file extract from Oracle containing table metadata will be used
to generate a preliminary json output, containing definitions for fields and data types for
Snowflake staging tables. A mapping of relevant Oracle data types to Snowflake data types
will be hard coded within the function.
● Refresh run frequencies - for each Snowflake table, a run frequency tag will be defined
within the model file which will designate its frequency with dbt Cloud Scheduler. From a
csv input defining the frequencies, a dbt seed file will be generated containing these
definitions to be applied when the models are run in dbt.
● Generate sources - for all of the tables available in Snowflake in the RAW_DATA_DEV
database / RAW_CUSTOMER schema, a schema.yml file will be created, which names and
describes the data loaded into Snowflake, effectively defining the models to be built.
● Generate staging models - this will programmatically generate the model files, which are
templated sql statements. Upon running the dbt project, these model files take inputs from
the project for the templated values and compile into SQL files, which are then executed in
the Snowflake environment to create new tables. Compaction also will occur so that we
utilize the most recent version of a given record.

The POC will be configured within dbt Cloud, which provides a development environment to help
you build, test, run, and version control your project faster. It also includes an easier way to share
your dbt project's documentation with your team. These development tasks are directly built into
dbt Cloud as an integrated development environment (IDE). Refer to Develop in the Cloud for more
details.

With the project configured, you execute commands on the dbt project to create its intended
outputs. The commands you commonly use are:

dbt seed — Creates the seeds defined in your project
dbt run — Runs the models defined in your project
dbt build — Builds and tests selected resources such as models, seeds, snapshots, and tests
dbt test — Executes the tests defined for your project

For information on all dbt commands and their arguments (flags), see the dbt command reference.
If you want to list all dbt commands from the command line, run dbt --help. To list a dbt
command’s specific arguments, run dbt COMMAND_NAME --help .

Please refer to the official dbt documentation for more information on using dbt.

Risks
● Production MSK clusters without client authentication should be avoided if possible.
● The MSK cluster should always be configured for both encryption at rest and in-transit.
● MSK clusters should be implemented with at least 3 brokers in 3 availability zones, with a
replication factor of 3 in order to ensure redundancy.
