
InfoSphere CDC

Latency & Performance Tuning

© 2011 IBM Corporation



Agenda
• Definitions – Latency vs. Throughput
• Bottleneck identification
• Resolving bottlenecks
• Scalability Options
• Advanced Analysis

Latency & Throughput - Definitions



What is Latency?
• Sometimes referred to as “replication lag”
• The amount of time delay between an update applied in the
source database and the same update applied in the backup
target
• The shorter the duration, the lower the latency
• High latency means the target is not up-to-date, but it is not
an indication of any synchronization issues. The
target may still accurately reflect what the
source was at some point in the (recent) past

What is Throughput?
• Throughput is the quantity of data processed within a given
period of time
• High volumes of data changes may require high throughput
• If throughput is insufficient, latency increases whenever more
changes arrive than replication can process
• Latency ≠ Throughput
• Issues with throughput in high-volume environments
typically cause latency to increase, hence the relationship between the two
• Note: Tuning for high throughput may be very different than
tuning for low latency
• This document focuses on measuring throughput of the
replication process and the steps to take to improve it

Performance Tuning Basics


• Performance tuning is an iterative process!
• If one process is optimized and the bottleneck is removed,
another bottleneck can appear further down (or up) the line
• Tuning can seldom be completed with a single change
• Performance tuning should be aimed at meeting business
requirements, and those requirements should be specific
• Example: Someone may want one-minute latency even during a batch
run, while the business may not require the batch data until 8 a.m.
This means latency during the midnight batch run is
acceptable as long as replication has caught up before start of business.

Bottleneck Identification

The first step – Identify your Bottleneck(s)


• You may be tempted to immediately work with the system tools to detect system
bottlenecks
• CPU utilization
• Memory utilization & paging
• Disk IO
This is rarely a good idea; system tools seldom identify the bottleneck (unless it is extremely obvious)
• First use the CDC Management Console Performance View to identify potential
bottlenecks
• This prevents drawing conclusions based on system activity generated by other applications on
the system
• The internal monitor provides details on the pipeline components of the CDC processes
• Your goal is to improve CDC performance, and the only way to do this is to remove CDC bottlenecks
• Alternatively, you can identify bottlenecks by running tests that isolate CDC components
• You can disable apply to rule out a target database bottleneck
• In the Java engines you can bypass CDC operations to rule out both CDC and target database
bottlenecks
• You can set up intra-system replication to rule out communications bottleneck
• The above bypass methods are very useful to quickly determine the area in which
further investigation must occur!!!

Sanity Check Datastore


• The first step is to determine the
overall health of the datastore
and whether the performance
metrics are being collected at
regular intervals
• Select the source and target
datastores (Datastore - <Datastore
Name>), and monitor the Time
check missed count. If this
number continuously increases,
it indicates CDC is not obtaining
enough resources (CPU or
memory) to perform collection.
This must be addressed prior to
further tuning.

Datastore Memory
• After checking the Time check missed count, proceed to
investigating the instance memory
• Collect the following statistics for the source datastore:
• Datastore - <datastore name> - Free memory – bytes
• Datastore - <datastore name> - Maximum memory – bytes
• Datastore - <datastore name> - Total memory – bytes
• Log Parser – Disk writes – bytes (per sec)
• Single scrape – Disk writes – bytes (per sec)

When to Adjust Datastore Memory


• If there is disk writing and resources are available, increase
the instance memory until writing to disk is no longer necessary
or is minimized. After restarting, wait some time before
deciding to increase it again, as it will take a while to work
through what has already been written to disk.
• If free memory is low, and total memory is constantly close
to maximum memory, increase the instance memory
• Generally, free memory should range between 20%-30% of
the maximum memory. If the free memory is consistently
higher, even during the peak loads, you may consider
reducing the instance memory.
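
The thresholds above can be expressed as a simple decision rule. Below is a minimal, hypothetical Java sketch of that rule; the class and method names are invented for illustration, and the input values are assumed to have been read manually from the Management Console statistics listed on the previous slide.

```java
/**
 * Minimal sketch of the instance-memory sizing heuristic described above.
 * Inputs are the datastore memory statistics from the Management Console;
 * the names here are hypothetical and exist only for illustration.
 */
public final class MemoryHeuristic {

    /** Suggest an action based on the datastore memory statistics. */
    public static String suggest(long freeBytes, long maxBytes,
                                 double stagingStoreDiskWritesPerSec) {
        double freeRatio = (double) freeBytes / maxBytes;

        if (stagingStoreDiskWritesPerSec > 0) {
            // Staging store is spilling to disk: add memory if resources allow.
            return "Increase instance memory (disk writes observed)";
        }
        if (freeRatio < 0.20) {
            // Less than 20% headroom: the instance is running close to its limit.
            return "Increase instance memory (free memory below 20% of maximum)";
        }
        if (freeRatio > 0.30) {
            // Consistently more than 30% free, even at peak: memory may be oversized.
            return "Consider reducing instance memory (free memory above 30% of maximum)";
        }
        return "Memory allocation looks healthy (20-30% free, no disk writes)";
    }

    public static void main(String[] args) {
        // Example: 2 GB free of an 8 GB maximum, with no staging-store disk writes.
        System.out.println(suggest(2_000_000_000L, 8_000_000_000L, 0.0));
    }
}
```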

Healthy Instance Memory


• Zero disk writes and 20-30% free memory

CDC Management Performance View - Bottlenecks



Bottlenecks
• The InfoSphere CDC engine uses a pipeline architecture; the
bottleneck pane displays the component that is not keeping
pace with the rest of the pipeline
• The five components that can be a bottleneck are:
• Log Reader
• Source Engine
• Communications
• Target Engine
• Target Database
• The actions taken to address a performance issue depend
on which component is the predominant bottleneck

Steady State Operation - Performance Tuning


• Use the identified bottleneck to help select the statistics that are best
suited to investigating a performance issue

Identifying Busy Tables


• The performance view can display the tables with the
highest activity, which can aid in determining which tables
deserve particular attention, or possibly which should be
moved into a separate subscription

 The view displays the 10 busiest tables by data volume

Table Activity Breakdown by Table


• You can select one of the busy
tables and drill down by
selecting specific metrics to
track, such as types of DML, to
potentially tune the database for
the specific type of data pattern

Performance Tuning – Addressing Source Side Bottlenecks


• The 3 main areas to focus on are reading the log, parsing, and derived columns. Leaving
tracing enabled will decrease performance drastically; ensure it is off before addressing
performance.
• Log reading is usually constrained by disk I/O speed and the time it takes to read the logs.
NFS mounts have been observed to hinder performance. If the StagingStore directory is
growing, consider increasing the amount of memory available to the instance.
• Log parsing is usually CPU constrained, especially when there are wide tables. Row
selection may also be CPU intensive; consider moving filtering to the target with a user exit
or simplifying the row selection expression.
• Derived columns may be moved to the target in some cases or made more efficient.
• An increase in the number of subscriptions can increase throughput but will require
additional resources.
• LOB-type objects are retrieved from the database and can slow down source processing
and increase the impact on the source; determine whether they can be removed from scope
• Oracle online redo logs can be a source of physical contention when both Oracle and CDC
are simultaneously accessing the files
• File system caching allows the log reader to process logs without incurring physical I/O

Performance Tuning – Addressing Communications Bottlenecks

• A communications bottleneck may be perceived as a CDC
target issue
• No or few transactions applied per second
• When setting up intra-system replication, the throughput is acceptable
and subscriptions are keeping up, yet the same subscription setup to
another system is slow
• One cause of transaction latency is the lack of sufficient
bandwidth in the network
• Often transaction latency due to network limitations is
perceived as CDC not utilizing the network effectively
• This is not true most of the time
• Almost all communication issues can be traced back to
configuration of the environment
• CDC is limited by the amount of traffic generated at the source
(transactions) versus the bandwidth available for replication

Analyzing Communications Bottlenecks – 1


• Definitions:
• MTU (Maximum Transmission Unit) is the size of the largest packet that can be sent over a
network
• If a message is larger than the MTU, it must be fragmented into multiple smaller packets
• In TCP/IP network status (netstat), you can see the number of bytes flowing
in and out of your system
• # netstat -i gives you information about network traffic across network interfaces
• Specifically, watch the incoming packets and outgoing packets
• By taking snapshots at intervals, you’ll be able to estimate the number of bytes sent across per
second (see the sketch after this list):
• bytes per second ≈ (number of packets in the interval * MTU) / interval in seconds
• Multiply by 8 for bits per second; this is an upper-bound estimate, since not every packet is MTU-sized
• Also, be on the lookout for incoming errors and outgoing errors
• If there are errors in the transmission, re-transmission of TCP packets will occur; this can slow
down traffic significantly
• We have seen cases where relatively few re-transmissions of TCP packets have a large impact
on the actual bandwidth available
• The larger the packet size, the more “costly” the retransmissions
• Re-transmissions are most often caused by network devices not operating in sync
(a combination of full- and half-duplex devices across the route, different TCP/IP buffer sizes)
• There are also other more sophisticated network administrator tools
(sniffers, network statistics) which may help you analyze communications
throughput in a more granular way
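
As a rough illustration of the estimate above, here is a minimal, hypothetical Java sketch. It assumes you have recorded an interface’s packet counters from two `netstat -i` snapshots yourself; the counter values and MTU in the example are made up for illustration.

```java
/**
 * Rough bandwidth estimate from two netstat -i snapshots, as described above.
 * Packet counters and the MTU must be read from the netstat output manually;
 * the numbers used in main() are illustrative only.
 */
public final class NetstatEstimate {

    /** Upper-bound estimate of bits per second between two snapshots. */
    public static double bitsPerSecond(long packetsBefore, long packetsAfter,
                                       int mtuBytes, double intervalSeconds) {
        long packets = packetsAfter - packetsBefore;
        double bytesPerSecond = (packets * (double) mtuBytes) / intervalSeconds;
        return bytesPerSecond * 8; // convert bytes/s to bits/s
    }

    public static void main(String[] args) {
        // Example: 450,000 outgoing packets in 60 seconds on a 1500-byte MTU interface.
        double bps = bitsPerSecond(10_000_000L, 10_450_000L, 1500, 60.0);
        System.out.printf("~%.1f Mbps (upper bound)%n", bps / 1_000_000);
    }
}
```

The result is an upper bound because the calculation assumes every packet is MTU-sized; compare it against the bandwidth the link is supposed to provide.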

Analyzing Communications Bottlenecks – 2


• A straightforward method to further investigate CDC
communications limitations is to use the same route to FTP a large
file
• A simple calculation can determine the available bandwidth in your
communication network
• Formula to calculate transmission speed in bits per second:
(file_size_in_bytes*8)/(number_of_seconds_transmission_took)
• Bandwidth in Mbps is the outcome of this formula divided by (1,000,000)
• Example: If it took 25 minutes to transfer a file of 10,237,194,382 bytes, the bandwidth is:
(10,237,194,382*8)/(25*60) = 54,598,370 bits per second (~54.6 Mbps)
• FTP is generally considered a trusted protocol for proving the
throughput of the network
• If FTP cannot utilize the expected available bandwidth, neither can CDC
• Do not be tempted to use “ping” to measure throughput; it measures only
response times and round-trip times for small packets,
not high-volume throughput

Performance Tuning – Addressing Target Side Bottlenecks


• Determine whether a specific table, particularly one of the busy ones, has slow
operations by viewing the table-specific performance statistics (Target
Apply slow DB operations – count (per sec))

Performance Tuning – Addressing Target Side Bottlenecks

• If a table has been identified as slow by the performance
monitor, check the database tuning of the target table. Check whether
the table has a unique index, whether the table or index is fragmented,
when statistics were last collected, and whether the table has a large
number of indexes.
• Remove as many indexes as possible on CDC destination tables
• Splitting tables into more subscriptions will increase the number
of applies and can increase the throughput when there are many
busy tables.
• When there are only a few tables, and many operations of the
same type, increasing global_max_batch_size can result in CDC
grouping a larger number of operations and increasing
throughput.

Interpreting Oracle Database Traces (raw trace)


• In order to enable database tracing of CDC database sessions enable the following
parameters:
• Source Scrape: mirror_scrape_db_trace = true (to enable, false by default)
• Target Apply: mirror_apply_db_trace = true (to enable, false by default)
• Oracle monitoring capabilities can be used to determine if there is database tuning that
could improve the performance of the apply
• Demonstrates how Oracle handles the operations
• First parse (PARSE), then execute (EXEC)
• Identifies how long an individual operation/transaction takes
• Look for “e=nnnnnn” data in the trace (see the sketch below)
• This gives you an idea how long an operation took (in microseconds)
• Is this expected?
• Does the operation take the same amount of time at the source?
• CDC cannot apply faster than the database can process
• Identifies wait times
• Logged when Oracle waits for:
• More information from the client (SQL*Net message from client)
• In this case, CDC is the client
• Updates of an index
• Latch free
• Disk
• Data spread unevenly, or across too few disks, is a common problem
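
To avoid scanning a large raw trace by eye, the elapsed-time tokens can be extracted programmatically. The following is a minimal, hypothetical Java sketch that sums and ranks the e=<microseconds> values in a raw trace file; it assumes the tokens appear in the form described above and is not a substitute for tkprof (covered on the next slide).

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Minimal sketch: extract "e=<microseconds>" elapsed-time tokens from a raw
 * Oracle trace file and report the total and the largest single value.
 * This only illustrates the idea described above; use tkprof for real analysis.
 */
public final class TraceElapsedScan {

    private static final Pattern ELAPSED = Pattern.compile("\\be=(\\d+)");

    public static void main(String[] args) throws IOException {
        long totalMicros = 0;
        long maxMicros = 0;

        for (String line : Files.readAllLines(Paths.get(args[0]))) {
            Matcher m = ELAPSED.matcher(line);
            while (m.find()) {
                long micros = Long.parseLong(m.group(1));
                totalMicros += micros;
                maxMicros = Math.max(maxMicros, micros);
            }
        }

        System.out.printf("Total elapsed in e= tokens: %.3f s%n", totalMicros / 1_000_000.0);
        System.out.printf("Slowest single call:        %.3f s%n", maxMicros / 1_000_000.0);
    }
}
```

Run it against the raw trace file (for example, `java TraceElapsedScan mytrace.trc`) to see quickly whether a few slow calls or many uniformly slow calls dominate the elapsed time.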

Condensing the Raw Trace


• Raw traces can be tedious to go through
• If you want to identify the operations which consume most
of the time, use the Oracle tkprof utility
• Syntax
• # tkprof <TraceFile> <OutputFile>
• The default for this command is typically what you will want
to see
• The statement whose duration was the longest is at the top of the list
• Shows how many times the statement was executed, how
much time was consumed parsing the statement, and how
much time executing it
• All statements sharing the (exact) same format (same columns
updated, same where clause) are grouped together

Interpreting Oracle Database Traces (condensed trace)


UPDATE "ERP"."CUSTOMER" SET "ID" = :1, "ID_MINOR" = :2, "OPENDAY" = :3, "PRODAY" = :4, "UNSEENOPCNT" = :5, "OPNO" = :6, "OPDAY"
= :7, "OPTRANSDAY" = :8, "BALANCE" = :9, "LASTVISITDAY" = :10, "PAYROLLDAY" = :11, "EXPIRATIONDAY" = :12, "CAPYEAR" = :13,
"PU9DAY" = :14, "PU9CASH" = :15, "LASTOPNO" = :16, "CAPDAY" = :17, "SBOOKNO" = :18, "OPROW_MINOR" = :19, "OPROW" = :20,
"ASSIGNDAY" = :21 WHERE "ID" = :22 AND "ID_MINOR" = :23

call count cpu elapsed disk query current rows


------- ------ -------- ---------- ---------- ---------- ---------- ----------
Parse 31918 0.33 0.39 0 0 0 0
Execute 31918 11.38 401.82 25230 114508 521516 66882
Fetch 0 0.00 0.00 0 0 0 0
------- ------ -------- ---------- ---------- ---------- ---------- ----------
total 63836 11.71 402.21 25230 114508 521516 66882

Misses in library cache during parse: 1


Optimizer mode: CHOOSE
Parsing user id: 539

Rows Row Source Operation


------- ---------------------------------------------------
1 UPDATE
1 PARTITION HASH SINGLE PARTITION: KEY KEY
1 INDEX UNIQUE SCAN PKDEPOSIT PARTITION: KEY KEY (object id 34496)

********************************************************************************

What the above means:


- It took 401.82 seconds of elapsed time to do 31,918 executes; 66,882 rows were affected by the updates
- It took roughly 6 milliseconds to update one row (401.82 s / 66,882 rows)
- CDC applied arraying in this statement (# of rows affected > # of executes)

Considering Alternative Architectures



Other Scalability Options


• Sometimes a standard configuration of CDC will not scale
well enough
• Multiple parallel subscriptions may be considered
• A single table can also be handled by multiple parallel subscriptions
• Employ row filtering to split the operations into ranges
• A Java user exit can determine the modulo of a numeric column (see the sketch below)
• Caveat: this does not work if the filtering column is updated!!!
• Or, even multiple CDC instances to spread the parsing
workload
• Do not spread the workload of a single table across CDC instances
(will have a counter-productive effect)
• An N-tier architecture can also sometimes help to improve
scalability (especially when working with older hardware)
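
As a rough illustration of the modulo-based row filtering mentioned above, here is a minimal, hypothetical Java sketch. Only the modulo logic is shown; the class and method names are invented for illustration, and the wiring into the actual CDC user-exit interfaces (and the matching row-filter expressions on each subscription) is deliberately omitted.

```java
/**
 * Minimal sketch of modulo-based row filtering for splitting one busy table
 * across N parallel subscriptions. The names are hypothetical; the actual
 * CDC user-exit interface wiring is not shown here.
 */
public final class ModuloRowFilter {

    /**
     * Returns true if a row with the given numeric key belongs to the
     * subscription identified by subscriptionIndex (0 .. totalSubscriptions-1).
     */
    public static boolean belongsTo(long numericKey, int subscriptionIndex,
                                    int totalSubscriptions) {
        // Math.floorMod keeps the result non-negative even for negative keys.
        return Math.floorMod(numericKey, totalSubscriptions) == subscriptionIndex;
    }

    public static void main(String[] args) {
        // Split CUSTOMER rows across 4 subscriptions by the ID column:
        // subscription 2 only replicates rows where ID mod 4 == 2.
        System.out.println(belongsTo(10_006L, 2, 4)); // true  (10006 % 4 == 2)
        System.out.println(belongsTo(10_007L, 2, 4)); // false (10007 % 4 == 3)
    }
}
```

Remember the caveat above: if the filtering column can be updated, a row could move between ranges and this approach breaks down.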

Standard Architecture

[Diagram: one subscription. A single source CDC instance (log reader plus subscription process) reads the redo/archive logs and sends changes to a single target CDC instance.]

 Simplest configuration to maintain, and most efficient use of resources
 One source instance reads and parses the Oracle database log entries
 Instance uses one connection to send to the target
 Subscription applies transactions using one database session on the target


Spreading Workload Among Subscriptions when Apply is the Bottleneck

[Diagram: multiple subscriptions. One source CDC instance reads the redo/archive logs and sends changes over several subscriptions to one target CDC instance.]

 Multiple subscriptions allow each subscription to have its own database
connection and apply data in parallel.
 Minimal impact on the source when subscriptions share the staging store and one
log reader and parser.
 If subscriptions cannot share the store, there may be times with multiple readers
and parsers, increasing parallelism on the source as well.
 Each subscription sends data through a separate network connection and
applies with a dedicated database session on the target


Spreading Workload Among Instances when Source is the Bottleneck

[Diagram: single subscription per instance. Two source CDC instances each read the redo/archive logs and send changes to one target CDC instance.]

 Adding an instance ensures the source subscriptions from the two instances read
the logs in parallel
 Each source instance reads the entire log, but discards out-of-scope entries prior to
parsing
 Each subscription sends data through a separate network connection and
applies with a dedicated database session on the target


Spreading Workload Among Instances and Subscriptions

[Diagram: multiple instances, multiple subscriptions. Several source CDC instances read the redo/archive logs and send changes over several subscriptions to one target CDC instance.]

 After tuning for the source bottleneck and then for the target one, or vice-versa, the
eventual configuration may have multiple instances with many subscriptions
each
 Ensures subscriptions from the multiple instances read the entire log and parse only
in-scope data in parallel
 Multiple subscriptions send to the target and apply data with dedicated
database sessions


Growth Strategy

Moving right in the table increases the number of sessions on the target database
(employ database scalability); moving down increases parsing capacity (more tables
in scope).

1 Instance:  1 Subscription | 2 Subscriptions | 3 Subscriptions | n Subscriptions
2 Instances: 1 Subscription per Instance | 2 Subscriptions per Instance | 3 Subscriptions per Instance | n Subscriptions per Instance
3 Instances: 1 Subscription per Instance | 2 Subscriptions per Instance | 3 Subscriptions per Instance | n Subscriptions per Instance
4 Instances: 1 Subscription per Instance | 2 Subscriptions per Instance | 3 Subscriptions per Instance | n Subscriptions per Instance

 Every instance allocates a minimum amount of memory
 Every instance will read the database logs


Remote Configuration with Oracle Source

[Diagram: four placement options for the scraper and apply components]

1. Scraper on source, Apply on target
2. Scraper and Apply on target *
3. Scraper and Apply on CDC server *
4. Scraper and Apply on source

* For configurations 2 & 3, optimal using a shared SAN for the logs


Advantages of N-tier Architecture


• CDC engine running on higher class hardware
• Faster processors
• More memory
• Faster disks
• Avoid limitations of the customer’s production hardware
• Customer may have relatively slow processors installed (for example HP PA-RISC)
• There may not be sufficient memory to house a CDC instance
• Customer may have concerns about CDC running in their production environment
• Advantages
• Reduced impact on production hardware
• Leverage IBM hardware to improve project
• Limitations
• N-tier currently only supported for DB2 LUW and Oracle databases
• If the source database server hardware and CDC server hardware differ, configure a system
parameter to inform CDC that the source logs come from a specific platform
• Endianness of operating systems must be the same (AIX server can process logs from AIX,
HP-UX and Solaris servers, but not from Linux or Windows servers)

Statistics collection for Advanced Analysis


• In CDC 6.5, performance statistics are collected by default
• If detailed performance analysis is required, use dmsupportinfo to gather the statistics,
and attach to a PMR
• The following parameters can also be set to adjust how statistics are gathered:
• stats_collect [True] Controls whether statistics are logged to CSV files. It is a dynamic parameter and will take effect after a
STATS_INTERVAL. Default is true.
• stats_interval [30] Determines approximately how often statistics are collected. Value is in seconds. Default is 30
seconds, minimum is 1 second. Generally you do not need to set this value lower unless trying to do focused
analysis over a small interval (one hour or less).
• stats_max_file_length_kbytes [not set] Maximum disk size for a single CSV file. A new CSV file will be created per
source or target subscription when the current file exceeds this size, or the number of rows in the file exceeds the tool limit
(65,535)
• stats_max_dir_space_mbytes [10] Determines how much directory space to use for statistics per subscription.
Default is 10 MB.
• stats_subscriptions [<not set>] Comma separated list of subscriptions to log. If empty, logs all subscriptions.
Defaults to empty. When logging a target subscription, please ensure the source ID is used, which may not be the
same as the subscription name. This name can be changed by the user and must not be longer than 8 characters in
length.
• stats_log_directory [<install.dir>/log] Determines the location where the log files will be generated. The default is
[install.dir]/log. The path must be an absolute path and cannot contain any environment variables. If the path does
not exist, CDC will attempt to create it.
• stats_file_label [<not set>] A user-specified label that can be applied to statistics file names. The default is unset.
Changing the file label changes the file name. This provides a mechanism for statistics users to force the creation of
new files and the retention of old files.
• stats_user_label [<not set>] The value to put in the user label property of the CSV file. This property allows users
to inject a specific value into their CSV files for their analysis.

Questions?
