Amazon MSK To Snowflake v1.3
Table of Contents
Introduction
Kafka Overview
High-Level Architecture
AWS Configurations
AWS Account User Roles and Privileges
Cluster Configuration
Scaling Configurations with Amazon MSK
Partitions
Ensuring High Availability
Broker and Partition Recommendations
Topic Configuration
Authentication
Encryption
Logging
Security Groups
Apply with Terraform
SharePlex Configuration
Consuming Events Using MSK Connect/Snowflake Connector
Ingestion into Snowflake
Controlling Snowflake Costs
Transformation with dbt Cloud
Risks
Introduction
Qualifacts currently leverages a multi-tenant solution with 13 transactional Oracle servers, accommodating approximately 400 customers. Each of these customers shares the same schema definition, which comprises around 2,000 tables. From these, data is surfaced to customers in multiple ways, including an operational data store (ODS) of real-time tables which is then replicated to Snowflake and Looker, a Hive connector which takes in data from the ODS as well as Hadoop, SFTP extracts of batched data via Talend, and embedded reporting solutions.
The goal of the production pipeline POC is to demonstrate the ability to replicate the source data from these Oracle tables into Snowflake. Longer term, the goal is to house all reporting data in the Snowflake data platform, which will be the source of all reporting across the organization. Pursuing this strategy will allow Qualifacts to eliminate auxiliary reporting mechanisms, reducing cost and complexity, and will allow reporting data to be leveraged in new ways with expanded business implications.
Kafka Overview
Apache Kafka software uses a publish and subscribe model to write and read streams of records,
similar to a message queue or enterprise messaging system. Kafka allows processes to read and
write messages asynchronously. A subscriber does not need to be connected directly to a
publisher; a publisher can queue a message in Kafka for the subscriber to receive later.
The key abstraction in Kafka is the topic. Producers publish their records to a topic, and consumers
subscribe to one or more topics. A Kafka topic is like a sharded write-ahead log. Producers append
records to these logs and consumers subscribe to changes. Each record is a key-value pair. The key
is used for assigning the record to a log partition (unless the publisher specifies the partition).
Here is a simple example of a single producer and consumer reading and writing from a
two-partition topic.
This shows a producer process appending to the logs for the two partitions, and a consumer
reading from the same logs. Each record in the log has an associated entry number that we call the
offset. This offset is used by the consumer to describe its position in each of the logs. Partitions are
spread across a cluster of machines, allowing a topic to hold more data than can fit on any one
machine.
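To make this pattern concrete, a minimal sketch in Python using the confluent-kafka client is shown below; the broker address, topic name, key, and value are placeholders rather than values from this POC.

from confluent_kafka import Consumer, Producer

# Producer: the record key determines which partition the record is appended to.
producer = Producer({"bootstrap.servers": "broker-1:9092"})
producer.produce("example-topic", key="customer-42", value='{"op": "insert"}')
producer.flush()

# Consumer: subscribes to the topic and tracks its position (offset) per partition.
consumer = Consumer({
    "bootstrap.servers": "broker-1:9092",
    "group.id": "example-consumer-group",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["example-topic"])
msg = consumer.poll(timeout=10.0)
if msg is not None and msg.error() is None:
    print(msg.partition(), msg.offset(), msg.key(), msg.value())
consumer.close()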
High-Level Architecture
Region: We are going to work in a specific region, and everything is going to be created in that region (“us-east-1” in the diagram).
VPC: We are going to leverage the existing VPC with VPC ID vpc-0cd4982a93bdf3e0f and IPv4 CIDR block 172.28.0.0/20.
(Typically, for a new MSK cluster setup, you would first create a VPC and devise an IP strategy with a CIDR block declaration to configure subnets. However, because this POC uses Qualifacts’ existing VPC, the VPC, subnets, internet gateway and route entries are not declared as AWS resources in the MSK Terraform module.)
Subnets and Availability Zones: Inside the VPC, for the purposes of this POC, we utilize 3 subnets, 1 subnet per Availability Zone. We have 3 availability zones in the us-east-1 region:
1. us-east-1c
2. us-east-1d
3. us-east-1b
The 3 subnets we will utilize will have the following IPv4 CIDR blocks allocated:
MSK Cluster: This POC creates an MSK cluster. For the purposes of the POC, this cluster will have
3 brokers, each one living in its own availability zone.
EBS Volume: Each one of the brokers is going to have its own EBS volume storage.
AWS Configurations
AWS Account User Roles and Privileges
In the IAM Management Console, attach to the intended user account the permission policies shown below.
Qualifacts is already leveraging Terraform, so IAM permissions are likely already configured.
Generally Terraform requires the following permissions policies for Amazon MSK:
As well, the following policy is used for the EC2 instance to connect to MSK for admin/testing
purposes.
{
  "Statement": [
    {
      "Action": [
        "kafka-cluster:Connect",
        "kafka-cluster:AlterCluster",
        "kafka-cluster:DescribeCluster"
      ],
      "Effect": "Allow",
      "Resource": "arn:aws:kafka:us-east-1:101443393375:cluster/msk-cluster-dev/*"
    },
    {
      "Action": [
        "kafka-cluster:*Topic*",
        "kafka-cluster:WriteData",
        "kafka-cluster:ReadData"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:kafka:us-east-1:101443393375:cluster/msk-cluster-dev/*",
        "arn:aws:kafka:us-east-1:101443393375:topic/msk-cluster-dev/*",
        "arn:aws:kafka:us-east-1:101443393375:group/msk-cluster-dev/*"
      ]
    },
    {
      "Action": [
        "kafka-cluster:AlterGroup",
        "kafka-cluster:DescribeGroup"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:kafka:us-east-1:101443393375:cluster/msk-cluster-dev/*",
        "arn:aws:kafka:us-east-1:101443393375:group/msk-cluster-dev/*"
      ]
    }
  ],
  "Version": "2012-10-17"
}
Cluster Configuration
In addition to the MSK documentation on cluster configuration, please reference Amazon’s suggested best practices for working with Amazon MSK.
Scaling Configurations with Amazon MSK
Scaling horizontally can be accomplished by adding brokers to the cluster. Additional brokers must be provisioned as a multiple of the number of AZs. Note that the number of brokers can only be increased, not subsequently decreased, so horizontally scaling with additional brokers should be carefully considered beforehand. Also note that partitions must be re-assigned after brokers are added. Please refer to the official MSK documentation for more information on expanding a cluster.
M5 brokers should typically be used for production instances. Please refer to the Amazon
reference for right sizing the cluster based on the recommended number of partitions per broker.
As well, Amazon has provided a spreadsheet-based tool for determining an optimal number of
brokers and sizing. Always test workloads on newly-provisioned clusters.
Partitions
For the purposes of this POC, a single partition is employed, which preserves the order of records as they are produced.
MSK clusters can support up to 200k partitions per cluster. An optimal number of partitions per topic can be determined from a future target throughput, based on expected production and consumption rates. Although it is possible to increase the number of partitions over time, care must be taken if messages are produced with keys: when publishing a keyed message, Kafka deterministically maps the message to a partition based on the hash of the key.
A rough formula for picking the number of partitions is based on throughput. Measure the throughput that you can achieve on a single partition for production (p) and for consumption (c). If your target throughput is t, then you need at least max(t/p, t/c) partitions. The per-partition throughput that can be achieved on the producer depends on configurations such as the batching size, compression codec, type of acknowledgement, replication factor, etc. The consumer throughput is often application dependent, since it corresponds to how fast the consumer logic can process each message, so benchmarking should be performed.
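As a hypothetical illustration: if benchmarking shows that a single partition sustains p = 10 MB/s on the producer side and c = 20 MB/s on the consumer side, and the target throughput is t = 100 MB/s, then at least max(100/10, 100/20) = max(10, 5) = 10 partitions are required.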
Please refer to the MSK best practices for right-sizing MSK Kafka clusters, as well as Confluent’s
post on How to Choose the Number of Topics/Partitions in a Kafka Cluster.
Ensuring High Availability
Brokers are always a multiple of the number of AZs, so if there are 3 AZs there can potentially be 3, 6, 9, 12… brokers configured. When creating a topic in MSK, the partitions are balanced across AZs. (When adding brokers, however, partitions need to be re-assigned.) Refer to the Amazon MSK official documentation for more information on maintaining high availability.
Three brokers allow one broker per Availability Zone, and the number of brokers is always a multiple of the number of Availability Zones provisioned to the cluster. A replication factor of 3 allows the cluster to be highly available. More information on choosing replication factor and partition count can be found here. Further testing can be performed by Qualifacts if additional optimization of these settings is desired.
Assuming the configuration of 1 topic per schema, this means each broker contains a copy (either the leader or a replica) of every topic.
Broker and Partition Recommendations
Broker size should take into account the total intended number of partitions per broker, which is the number of topics (in this case, one schema per customer, with each schema produced as a topic), multiplied by the number of partitions per topic, multiplied by the replication factor, divided by the number of brokers.
For example, if you had 100 customers ~ 100 topics (t), 3 partitions per topic (p), 3 brokers (b), and a replication factor of 3 (r):
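Partitions per broker = (t × p × r) / b = (100 × 3 × 3) / 3 = 300, meaning each broker would host roughly 300 partition replicas and should be sized against the per-broker partition recommendations accordingly.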
Note 1: The above table is derived from the official MSK documentation, table provided here.
Note 2: A DEV instance might be provisioned for a smaller size since it would be intended for testing
purposes. Reduction in resources provisioned for DEV would need to be determined based on intended
functionality in the DEV instance.
More information on right-sizing the MSK cluster can be found in the official MSK documentation here.
Topic Configuration
For the purpose of this POC, the configuration is aligned with the intention of one topic per client schema.
auto.create.topics.enable = true
delete.topic.enable = true
default.replication.factor = 3
num.partitions = 1
For more information on topic-level configuration properties for new and existing topics, see
Topic-Level Configs in the Apache Kafka documentation.
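Although auto.create.topics.enable = true allows topics to be created implicitly when SharePlex first produces to them, topics can also be created explicitly with the same settings. Below is a minimal sketch using the confluent-kafka Python admin client; the broker address and topic name are placeholders, not values from this POC.

from confluent_kafka.admin import AdminClient, NewTopic

# Connect to a broker (placeholder address) and create one topic per client schema:
# a single partition to preserve record order, replication factor 3 for availability.
admin = AdminClient({"bootstrap.servers": "broker-1:9092"})
topic = NewTopic("customer_a_schema", num_partitions=1, replication_factor=3)
futures = admin.create_topics([topic])
futures["customer_a_schema"].result()  # raises an exception if creation failed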
Authentication
Because SharePlex 10.0 does not support SASL, this POC contains two deployment versions for reference: one with unauthenticated access enabled (Terraform module msk), and another with SCRAM authentication enabled (Terraform module msk_ae). Salted Challenge Response Authentication Mechanism (SCRAM), or SASL/SCRAM, is a family of SASL mechanisms that addresses the security concerns with traditional mechanisms, such as PLAIN, that perform username/password authentication. More information on client authentication can be found in the official Amazon MSK documentation.
If Qualifacts updates to SharePlex version 10.1 or higher, client authentication with SASL can be
enabled.
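For reference, a client connecting to the SCRAM-enabled deployment (Terraform module msk_ae) would authenticate with SASL/SCRAM over TLS. The sketch below uses the confluent-kafka Python client; the broker hostname and credential values are placeholders (for Amazon MSK, SASL/SCRAM credentials are stored in AWS Secrets Manager and associated with the cluster, and brokers listen for SASL/SCRAM traffic on port 9096).

from confluent_kafka import Producer

# SASL/SCRAM over TLS; the username and password come from the Secrets Manager
# secret associated with the MSK cluster (values shown here are placeholders).
producer = Producer({
    "bootstrap.servers": "b-1.msk-cluster-dev.example.amazonaws.com:9096",
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "SCRAM-SHA-512",
    "sasl.username": "shareplex_user",
    "sasl.password": "<retrieved from AWS Secrets Manager>",
})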
Encryption
The MSK cluster applies both encryption at rest and encryption in transit. Amazon MSK always encrypts your data at rest; an AWS-managed KMS key is generated during the MSK cluster creation process and stored in the AWS Key Management Service for the AWS region we are working in. Amazon MSK uses Transport Layer Security (TLS) for encryption in transit. In-transit encryption via TLS 1.2 can be enabled with configuration settings in Terraform.
This POC contains two deployment versions for reference: one with plaintext enabled and no
encryption applied in-transit (Terraform module msk), and another with TLS encryption
(Terraform module msk_ae).
Note: If desired, you can use customer managed keys. In that case, you would need to create the key beforehand and provide it inside the encryption_info block as the value of the encryption_at_rest_kms_key_arn attribute.
Logging
CloudWatch Log Group: In addition, we create a CloudWatch Log Group where the broker logs are collected. It can be used to inspect broker activity and to dig into any problems.
For longer-term or indefinite retention of logs, S3 can also be configured if desired, as long-term storage of logs within CloudWatch can be cost prohibitive. Logs can be forwarded from CloudWatch or S3 to New Relic if desired, but this forwarding functionality is not configured for the purposes of this POC.
Security Groups
MSK Security Group (user-managed): We will create a security group and attach it to the MSK
cluster. This security group will allow our clients to access our MSK brokers.
Client Security Group: We will create an SSH security group and attach it to the EC2 client.
Apply with Terraform
The Terraform project for this POC is organized as follows:
<root>
└> dev.env - contains the values of the variables defined in the variables.tf files, across all modules
└> <module name>
    └> datasources.tf - contains data resource declarations (used to access data sources) and local values (value assignments for expressions which can be referenced within the module).
    └> main.tf - the main config file; describes the MSK cluster and other necessary pieces.
    └> outputs.tf - contains output values to make information about your infrastructure available on the command line, and can expose information for other Terraform configurations to use. Similar to return values in programming languages.
    └> variables.tf - contains the Terraform variables used to parameterize the declarations, making it easier to customize the whole project; values are assigned to these variables in the dev.env file.
We can apply our infrastructure configuration with the command below (note: the MSK cluster can take up to an hour to be created):
$ terraform apply
To run Terraform commands, Terraform requires an initialized working directory, and the commands act only upon the currently selected workspace.
Terraform will present the list of actions that the configuration corresponds to, and we accept them by answering yes. Once approved, Terraform creates the resources by connecting to our AWS account. Further information can be found in the POC Terraform project README.md file as well as the official Terraform documentation.
SharePlex Configuration
Qualifacts currently uses SharePlex version 10.0. Existing SharePlex replication from Oracle can be configured to publish to a Kafka topic. For more information, please refer to the SharePlex technical documentation.
Consuming Events Using MSK Connect/Snowflake Connector
The MSK Connect Snowflake Sink Connector is created as a resource from the Snowflake Sink custom plugin, which is built from a Maven artifact (repository), as shown in the sequence below. This resource exists as part of the Terraform configuration and deployment for MSK Connect.
Maven artifact → archive file → S3 object → custom plugin → MSK Connect Snowflake Sink Connector
The Snowflake connector for Kafka is designed to run in a Kafka Connect cluster to read data from
Kafka topics and write the data into Snowflake tables. SharePlex produces messages to MSK that
represent single rows.
MSK Connect’s Snowflake connector consumes messages from topics and provides a choice of configuration options for how these messages are written to Snowflake:
a) Map one database schema to a topic, and then consume these topics into one table per topic (this is what is currently configured in the POC for the MSK Connect settings)
b) Map one database schema to one topic, and then map these topics to a single table in Snowflake via a hard-coded mapping in MSK Connect, using the topic2table.map configuration setting in connector_configuration for the aws_mskconnect_connector resource (an example mapping is shown after this list). More information on this configuration with topic2table.map can be found in the official Snowflake documentation here.
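As a hypothetical illustration of option b), topic2table.map is a comma-separated list of topic:table pairs; the topic and table names below are placeholders, not values from this POC:
topic2table.map = customer_a_schema:ALL_CUSTOMERS,customer_b_schema:ALL_CUSTOMERS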
For this POC, MSK Connect and the Snowflake Connector will also be configured and deployed
with a Terraform module. Please refer to the official documentation for more information on
Configuration and Management of the Snowflake Connector.
Ingestion into Snowflake
For the purposes of this POC, tables are created in the RAW_DATA_DEV database / RAW_CUSTOMER schema. Each table that exists in every customer schema in Oracle is represented in Snowflake as a single table containing the data for all customers.
Controlling Snowflake Costs
The connector buffers records before writing them to Snowflake; the buffer size defaults to 5000000 bytes (5 MB). The records are compressed when they are written to data files; as a result, the size of the records in the buffer may be larger than the size of the data files created from the records.
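The connector’s buffering properties (buffer.count.records, buffer.flush.time, and buffer.size.bytes) are the main levers for balancing latency against file count: larger buffers generally produce fewer, larger files and therefore less per-file Snowpipe overhead, at the cost of higher latency. Appropriate values should be confirmed against the Snowflake connector documentation and tested against this workload.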
If satisfactory throughput and cost performance cannot be achieved for all pipelines using the Snowflake Connector for Kafka, pipelines that can tolerate higher latency (~1 hour) can use a different ingestion approach.
Transformation with dbt Cloud
A dbt project informs dbt about the context of your project and how to transform your data (build your data sets). By design, dbt enforces the top-level structure of a dbt project, such as the dbt_project.yml file, the models directory, the snapshots directory, and so on. Within these top-level directories, you can organize your project in any way that meets the needs of your organization and data pipeline.
At a minimum, all a project needs is the dbt_project.yml project configuration file. dbt supports a number of different resources, but for this POC the project will specifically include:
Resources and descriptions:
models: Each model lives in a single file and contains logic that either transforms raw data into a dataset that is ready for analytics or, more often, is an intermediate step in such a transformation.
seeds: CSV files with static data that you can load into your data platform with dbt.
tests: SQL queries that you can write to test the models and resources in your project.
sources: A way to name and describe the data loaded into your warehouse by your Extract and Load tools.
For the purposes of this POC, the data copied to Snowflake tables from MSK will represent the
raw data. From here, dbt incremental models will be used to create a staging layer which can then
be used for purposes of analytics and reporting (see diagram below).
dbt model files can be manually written; however, as the number of tables utilized at QSI is very large, these models will be programmatically generated by a Python utility script, which will also be a component within the dbt project. The utility script will contain the following functions:
● Generate metadata - a flat file extract from Oracle containing table metadata will be used to generate a preliminary JSON output containing definitions for fields and data types for Snowflake staging tables. A mapping of relevant Oracle data types to Snowflake data types will be hard coded within the function.
● Refresh run frequencies - for each Snowflake table, a run frequency tag will be defined within the model file which will designate its frequency with the dbt Cloud Scheduler. From a CSV input defining the frequencies, a dbt seed file will be generated containing these definitions to be applied when the models are run in dbt.
● Generate sources - for all of the tables available in Snowflake in the RAW_DATA_DEV database / RAW_CUSTOMER schema, a schema.yml file will be created which names and describes the data loaded into Snowflake, effectively defining the models to be built.
● Generate staging models - this will programmatically generate the model files, which are templated SQL statements. When the dbt project is run, these model files take inputs from the project for the templated values and compile into SQL, which is then executed in the Snowflake environment to create new tables. Compaction also occurs so that the most recent version of a given record is used. (A sketch of this function is shown after this list.)
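A minimal, hypothetical sketch of the Generate staging models function is shown below; the metadata format, column mapping, incremental filter column, and file layout are illustrative assumptions rather than the final QSI implementation.

import json
from pathlib import Path

# Hypothetical incremental model template. The _loaded_at filter column and the
# source/schema naming are placeholders, not the actual QSI conventions.
MODEL_TEMPLATE = """{{{{ config(materialized='incremental', unique_key='{unique_key}') }}}}

select
{columns}
from {{{{ source('{source_name}', '{table_name}') }}}}
{{% if is_incremental() %}}
where _loaded_at > (select max(_loaded_at) from {{{{ this }}}})
{{% endif %}}
"""

def generate_staging_model(metadata: dict, output_dir: Path) -> Path:
    """Render one staging model file from table metadata (name, columns, key)."""
    columns = ",\n".join(
        f"    {col['name']}::{col['snowflake_type']} as {col['name'].lower()}"
        for col in metadata["columns"]
    )
    sql = MODEL_TEMPLATE.format(
        unique_key=metadata["primary_key"],
        columns=columns,
        source_name=metadata["source_name"],
        table_name=metadata["table_name"],
    )
    output_dir.mkdir(parents=True, exist_ok=True)
    path = output_dir / f"stg_{metadata['table_name'].lower()}.sql"
    path.write_text(sql)
    return path

if __name__ == "__main__":
    meta = json.loads(Path("metadata/clients.json").read_text())
    print(generate_staging_model(meta, Path("models/staging")))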
The POC will be configured within dbt Cloud, which provides a development environment to help
you build, test, run, and version control your project faster. It also includes an easier way to share
your dbt project's documentation with your team. These development tasks are directly built into
dbt Cloud for an integrated development environment (IDE). Refer to Develop in the Cloud for more
details.
With the project configured, you execute commands on the dbt project to create its intended
outputs. The commands you commonly use are:
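Typically these include:
● dbt seed - load the CSV seed files (for example, the run-frequency definitions) into Snowflake
● dbt run - compile the models and execute them against Snowflake
● dbt test - run the tests defined for models and sources
● dbt docs generate - build the project documentation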
For information on all dbt commands and their arguments (flags), see the dbt command reference.
If you want to list all dbt commands from the command line, run dbt --help. To list a dbt command’s specific arguments, run dbt COMMAND_NAME --help.
Please refer to the official dbt documentation for more information on using dbt.
Risks
● Production MSK clusters without client authentication should be avoided if possible.
● The MSK cluster should always be configured for both encryption at rest and in-transit.
● MSK clusters should be implemented with at least 3 brokers in 3 availability zones, with a
replication factor of 3 in order to ensure redundancy.