Oracle Streams: a High Performance
Implementation for Near Real Time Asynchronous
Replication
Lik Wong, Nimar S. Arora, Lei Gao, Thuvan Hoang, Jingwei Wu
Oracle USA, 500 Oracle Parkway, M/S 4op10,
Redwood Shores, CA, U.S.A.
{[Link],[Link],[Link],[Link],[Link]}@[Link]
Abstract— We present the architectural design and recent performance optimizations of a state of the art commercial database replication technology provided in Oracle Streams. The underlying design of Streams replication is a pipeline of components that are responsible for capturing, propagating, and applying logical change records (LCRs) from a source database to a destination database. Each LCR encapsulates a database change. The communication in this pipeline is now latch-free to increase the throughput of LCRs. In addition, the apply component now bypasses SQL whenever possible and uses a new latch-free metadata cache. We outline the algorithms behind these optimizations and quantify the replication performance improvement from each optimization. Finally, we demonstrate that these optimizations improve the replication performance by more than a factor of four and achieve replication throughput of over 20,000 LCRs per second with sub-second latency on commodity hardware.

I. INTRODUCTION
Almost every digital enterprise has a compelling reason to employ some form of database replication for disaster recovery or high availability. These enterprises can be replicating data as diverse as credit card transactions, email messages, court orders, cell phone calls, or even the results of experiments on particle accelerators. However, they have almost identical requirements: to replicate a high volume of transactions with low replication latency, often over a wide area network (WAN). In fact, many non-traditional replication uses, such as online upgrade or migration of applications, have similar requirements. These stringent requirements severely restrict the replication strategies that are viable.

A. Replication Strategies
Replication strategies can be broadly categorized as synchronous or asynchronous. In synchronous replication, a user transaction is not allowed to commit unless it is guaranteed that it will be successfully applied on all the replicas, or on a critical subset of them. This guarantee can be achieved by relying on the well known two-phase commit protocol [15], or by middleware that imposes a global order on all user transactions via an atomic broadcast service [13][18]. In asynchronous replication, the replicated transactions on the replica databases are not coordinated with the user transactions on the source. These two categories of replication are similar to eager and lazy replication, respectively, as described in the previous literature [9]. However, the definition of eager does not allow for synchronous replication based on atomic broadcast. Also, the definition of lazy does not allow for asynchronous replication where the changes are propagated while the transaction is ongoing. We call this last form of replication streaming asynchronous replication (see Fig. 1).

[Figure: timelines contrasting the two modes; under lazy replication, the replicated transaction's writes (A, B, C) and commit start only after the user transaction commits, while under streaming replication the writes are replicated while the user transaction is still ongoing]
Fig. 1 Lazy vs. streaming asynchronous replication

The advantage of synchronous replication is that it automatically handles concurrent conflicting transactions on different replica databases. The main disadvantage is that it greatly reduces the throughput of dependent transactions on each replica. The critical issue is the time period when a transaction is complete and is seeking to commit. Normally, in asynchronous replication, such a transaction writes its commit record to the redo logs, releases all locks, waits for acknowledgement from the storage system, and finally sends an acknowledgement to the user. In synchronous replication, on the other hand, the transaction must first broadcast its intention to commit to the other replicas and wait for an acknowledgement. The precise mechanics of this communication depend on the particular implementation, but in all implementations, database locks cannot be released until the minimum time required for one network round-trip has elapsed. This network round-trip means that dependent transactions can commit no faster than the network latency permits.

Asynchronous replication, on the other hand, has to reconcile conflicting transactions. Although the reconciliation procedure depends on the application, in many cases simple approaches such as maximum-timestamp work quite well in practice [9][16]. Both synchronous and asynchronous replication strategies have drawbacks, but many enterprise customers do not consider these equal. In many applications, the rate of global concurrent conflicting transactions is much lower than the rate of local concurrent conflicting transactions. As a consequence, synchronous replication is not justifiable for such applications.

Lazy asynchronous replication works well for small transactions, but for a large transaction the replication latency can be unacceptably high. Streaming asynchronous replication can better handle large transactions because the replication latency is mainly determined by network latency and not transaction size.

B. Oracle Streams Replication
Designing streaming asynchronous replication can be challenging because it needs to handle an almost unbounded and ever-increasing transaction workload [21]. We will discuss the design of one such replication technology, Oracle Streams, and its optimizations to meet these challenges. The architecture of Oracle Streams involves three components – capture, propagation, and apply. These are responsible for capturing database changes, propagating them to a replica, and applying the database changes on the replica database, respectively. The three components are arranged in a pipeline, and they stream LCRs (Logical Change Records – a generalized form of a database change record) between them using in-memory queues1 (Fig. 2).

1 Unless specified otherwise, all queues discussed in this paper are in-memory queues.

[Figure: the replication pipeline; at the source database, capture feeds a queue, propagation carries LCRs across the network to a queue at the replica database, where apply consumes them]
Fig. 2 Oracle Streams: replication components

The advantage of dividing the replication task among a number of components is that it exploits the available concurrency on the computer systems. The disadvantage is that these components must communicate through the intermediate queues, which might become contention hot spots. In an early implementation of these queues, a latch [8] was used to serialize access. However, the latch limited concurrency and did not scale well. We present here a new, latch-free2 implementation. Further, the enqueue operation completes in a finite number of steps if the queue is not full, and the browse completes in a finite number of steps if the queue is not empty. Hence, our implementation can be considered a wait-free [3], fixed-size queue. Our queue semantics is different from that of Herlihy et al. [10]: in our design each enqueued message is dequeued by every consumer, and there is only one enqueuer.

2 We use the term latch-free instead of the usual lock-free in the distributed computing literature because locks have a different meaning in the database literature [8].

Our new queue implementation can handle tens of thousands of messages per second. In practice, however, the rate of messages is lower than the maximum limit of the queue because of two factors. First, the database changes are mined from the redo logs, and hence LCRs cannot be generated faster than the rate at which the redo can be written to disk. Second, the messages in the pipeline only flow as fast as the slowest consumer. In this regard, the performance of the apply component is most critical. Capture and propagation typically spend a fixed amount of processing time per LCR, which allows them to keep up with the redo generation rate, unless there are insufficient hardware resources, such as CPU, disk IO, or network bandwidth. Apply, on the other hand, has to un-interleave the redo stream to extract transactions that must be applied concurrently in order to keep up with the source, and at the same time must respect the dependencies of the transactions to avoid generating deadlocks [1]. In order to compensate for the extra processing requirements of apply, the apply component now bypasses the database SQL layer and uses an internal data layer API directly on tables whenever possible.

Using the latch-free queue and the data layer API, Streams can replicate tens of thousands of LCRs per second. Such high throughput places much more stringent demands on the replication metadata access and caching in the apply component. These demands have led to significant latch contention when multiple apply processes access the replication metadata. To eliminate such latch contention and improve the concurrency among the apply processes, we have designed a new latch-free, single-writer, shared hash table algorithm to allow latch-free access to the shared metadata cache. Our latch-free hash table is different from the published latch-free hash tables [14][19] because our latch-free hash table is single-writer, only requires the atomic read/write of a word, and does not require the atomic compare-and-swap (CAS), which is more expensive. Each hash entry in our hash table has a lifetime and can be invalidated independently due to schema evolution. The writer can safely assume that no
reader is referencing a hash entry after some point without using reference counting3. Instead, the apply component leverages the unique and monotonically increasing property of an LCR stream.

3 A latch-free reference counting scheme [6] requires the atomic double CAS, while our latch-free hash table does not require the atomic CAS.

As a result of these optimizations in Oracle Database 11g, Streams can replicate over 20,000 LCRs per second on commodity hardware. Independent analysis by a customer [7] demonstrates a more than two-fold performance improvement. To the best of our knowledge, the highest throughput published for replication systems similar to Oracle Streams is 12,000 LCRs per second [11].

The rest of this paper is structured as follows. Section II provides an overview of the Oracle Streams architecture and some terminology. Section III discusses the latch-free and wait-free queue design, section IV discusses the usage of the data layer API in the apply component for bypassing the SQL layer, and section V discusses the latch-free metadata cache and invalidation in the apply component. Finally, section VI presents the performance of some unidirectional replication benchmarks using Oracle Streams.

II. OVERVIEW OF ORACLE STREAMS
Oracle Streams is a unified, information-sharing infrastructure that provides generalized components for the capture, propagation, and consumption of information. A full overview of the many features of Streams and its uses is beyond the scope of this paper; the interested reader is referred to the Oracle Streams manual [17]. We limit this discussion to the data replication aspects of Streams.

A. Database Change Records
In Oracle Streams replication, the information that represents a change to a relational database is known as a logical change record (LCR). An LCR is a generalized representation of all possible changes to a database, and is independent of the database vendor. A change record (CR), on the other hand, is a term we use to denote a database-specific representation of a change.

B. Rule
The user can specify which LCRs to replicate by using a set of rules, with rule conditions that are similar to SQL WHERE clauses. These rules are evaluated against all the changes made to the database to filter out the irrelevant LCRs.
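For example (an illustration in plain SQL-predicate style, not exact Streams rule syntax), a condition such as owner = 'HR' AND table_name = 'EMPLOYEES' would retain only the LCRs for that one table and filter out every other change mined from the redo stream.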
C. Capture, Propagation and Apply
The three components of Streams are capture, propagation, and apply. The behaviour of each component is controlled by rules:
• Capture reads CRs contained in the redo generated by the database. It converts these CRs into LCRs and enqueues them.
• Propagation consumes LCRs from one queue and enqueues them into another queue, usually on a different database.
• Apply consumes LCRs from a queue and performs the database change specified in the LCR. Because all database changes are recorded in the redo, apply can be thought of as writing CRs into the redo. In that sense, apply is the inverse of capture.

Fig. 3 illustrates the Oracle Streams architecture for unidirectional replication. We use this architecture to describe each of the Streams components in this subsection. By connecting appropriate components, we could configure other replication topologies. For example, we could share the capture component and add propagation and apply components to get another replica.

The capture component consists of a log reader process, multiple preparer processes, a builder process, and a capture process. The log reader process reads the redo log and divides the redo log into regions. The preparer processes scan the regions defined by the log reader in parallel and perform prefiltering of changes found in the redo log based upon user-defined rules. The builder process merges redo records from the preparers and passes the merged redo records to the capture process. The capture process then formats each change into an LCR and enqueues it into a queue if it satisfies the defined rules.

[Figure: at the source database, the log reader, preparers, and builder read CRs from the redo logs and feed the capture process, which stages LCRs (not yet grouped into transactions) in the Streams pool for the propagation sender; at the replica database, the propagation receiver stages LCRs in the Streams pool for the apply reader, and the coordinator hands committed transactions, grouped and sorted in commit order, to the appliers, which perform conflict detection, error handling, and custom code]
Fig. 3 Streams process architecture for single capture, single propagation, and single apply

The propagation component consists of one sender process at the source database and one receiver process at the replica database. The propagation sender process dequeues LCRs that satisfy the propagation rules and streams those LCRs over the network to the propagation receiver process. The propagation receiver process receives LCRs from the network and enqueues them.

The apply component consists of an apply reader process, a coordinator process, and multiple applier processes. The apply reader process dequeues LCRs, assembles them into complete transactions, and passes the transactions to the coordinator. The coordinator process assigns transactions to available appliers based on the transaction dependencies and commit ordering. Each applier process executes an entire transaction
at the replica database before requesting another one. The applier processes can process independent transactions concurrently when apply parallelism is enabled.

D. In-memory Staging and Recovery
Oracle Streams uses queues as temporary staging areas for LCRs as they move between different Streams components and across databases. The user configures Streams by first creating the queues and then attaching various Streams components as producers to or consumers of the queues. For instance, a capture process can enqueue into a queue from which multiple propagation senders dequeue.

Oracle Streams reduces the performance overhead of replication by staging LCRs in in-memory queues while guaranteeing no LCR loss in the presence of failures, such as system crashes, database instance crashes, or unexpected reboots. Using in-memory queues allows Oracle Streams to stage LCRs without paying the high cost of disk operations, which occurs when persistent queues are used.

Using persistent queues can simplify the recovery of the replication state [5]. Once a message is enqueued into a persistent queue, the enqueuer can assume that the message will eventually arrive at the destination system. However, the disk operations in persistent queues impose significant performance overhead.

Oracle Streams relies on the database redo logs to recover LCRs that are transient in the in-memory queues. During recovery, the capture component obtains the apply progress information from the apply component, determines a recovery point in the redo log, mines the redo log, and transmits LCRs after the recovery point to the apply component. The apply component maintains enough persistent metadata to suppress duplicate transactions and thus consume each transaction exactly once. Because database redo generation is part of regular database activities, Oracle Streams' recovery mechanism imposes little overhead.

III. LATCH-FREE (WAIT-FREE) SINGLE PRODUCER QUEUES
A key design in Oracle Streams that enables high performance LCR processing among the different Oracle Streams components (capture, propagation, and apply) is the latch-free queue. This design takes advantage of an Oracle Streams configuration in which there is always only one producer to the queue. Although latch-free single producer queues have been presented in earlier work [2][20], our queue is based on the semantics that every consumer consumes each message. In fact, our queue implementation is wait-free whenever the producer or consumer can perform an action, i.e. the queue is not full or not empty, respectively. Our algorithm relies only on atomic reads and writes to memory, similar to the wait-free queue published in previous work [10]. But our algorithm is different in that it is based on a fixed-size buffer.

Oracle Streams supports general-purpose queues with multiple producers and consumers. However, such queues use latches that severely limit throughput, even when only one consumer and one producer are concurrently active. Because most customers configure Streams with single producer queues, we can improve the throughput of such common configurations by using latch-free single producer queues. Fig. 4 shows a typical scenario with one capture process enqueuing LCRs into a queue and multiple propagation senders consuming these LCRs.

[Figure: one capture process enqueues LCRs into a single queue; multiple propagation senders dequeue from it and stream LCRs over the network to their propagation receivers, each of which stages LCRs in a queue for an apply component]
Fig. 4 Common Streams configuration: single capture, multiple propagations and multiple applies

A. Queue Data Structure
The actual latch-free, single-producer queue is a fixed-size circular buffer with N (>1) slots, items[0],..,items[N-1]. The producer uses a tail pointer for the next free slot, and each of the C consumers has a head pointer, head[0],..,head[C-1], for the next queue entry to be consumed, as shown in Fig. 5. The semantics of the data structure includes C logical queues, Q[0],..,Q[C-1], such that Q[i] is
• empty, if head[i] == tail;
• [items[head[i]], items[head[i]+1 mod N], ..., items[tail-1 mod N]], otherwise.
The queue is considered full if and only if for some consumer i, length(Q[i]) == N-1, or equivalently, head[i] == tail+1 mod N (note that one slot is wasted in the buffer).
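As an illustrative example (our numbers): with N = 4 slots and C = 2 consumers, suppose tail = 3, head[0] = 0, and head[1] = 2. Then Q[0] is [items[0], items[1], items[2]], whose length is N-1 = 3, so the queue is full (head[0] == tail+1 mod N) and enqueue must wait, even though Q[1] is only [items[2]].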
B. Queue Algorithm
The queue operations are summarized below; see Fig. 6 for the pseudo code.
• Enqueue: The producer invokes enqueue to add an item to each of the C logical queues of the consumers. It is blocked if the queue is full.
• Browse: Each consumer invokes browse to get the first item in its logical queue, or blocks if its logical queue is empty. The returned item is immutable for each consumer.
• Dequeue: After browse returns an item, the consumer is permitted to call dequeue to remove this item (i.e., the first item) from its logical queue. The idea behind this two-step consume API is that the consumer indicates that
it no longer references the memory of the browsed item once it calls dequeue. Hence, the producer is free to recycle the memory of consumed items.

[Figure: a circular buffer; the producer's tail pointer marks the slots available for enqueue, and each of consumers 1..C has its own head pointer, head[0],..,head[C-1], marking the slots available for browse]
Fig. 5 Single producer and multiple consumers circular buffer

The wait operations shown in the code are all bounded time operations so that, if a notification is missed, the relevant operation does not hang. In practice, additional wakeup flags [20] are used to limit the odds of a lost notification. The modifications required to use these wakeup flags and to adjust the wait time exponentially are beyond the scope of this paper.

C. Safety
The enqueue code proceeds as follows. Lines 2-4 cause enqueue to wait if the queue is full. It ensures this by checking to see that none of the logical queues of the consumers is full. Since a consumer can't modify its logical queue from not full to full, it follows that on reaching line 5 the queue is not full. Since none of the logical queues refer to items[tail], line 5 has no effect on the semantics of the queue. However, line 6 atomically adds the new item to all the logical queues. This is the linearization point of the enqueue operation (the write to tail). The rest of the enqueue code (lines 7 to 9) checks if any consumer has only one item in its logical queue. If so, the consumer is woken up since it might have been waiting for an empty queue.

The browse code in lines 13 and 14 waits while the logical queue of the consumer is empty. Once the queue becomes non-empty, none of the other concurrent operations can make this logical queue empty. Hence, upon reaching line 15, the consumer's logical queue is not empty, and it is correct to return items[head[i]] as the first item. The linearization point of browse is the read of tail on line 13 such that the loop condition is false.

Since a consumer invokes dequeue after browse, and no concurrent operation can make this consumer's queue empty, it follows that the queue is not empty when the dequeue operation is invoked. Thus, on line 20 the first item is dequeued from the consumer's logical queue. The write to head[i] on line 20 is the linearization point of dequeue. On line 21, dequeue checks if the consumer's logical queue has N-2 items, i.e. it has just gone from full to not full. In that case, it might be time to wake up the sleeping producer. However, lines 23-26 check to see if there is a consumer with N-1 items in its queue. In such a case, there would be no point in waking the producer yet.

D. Liveness
If the queue is not full at any point, then the enqueue operation will eventually exit the loop in lines 2-4 since no other concurrent operation can make the queue full. Similarly, the browse operation eventually exits from the loop in lines 13-14 if the queue is not empty since no other concurrent operation can make the queue empty. Note that our latch-free queues do not require notifications for liveness since all the waits are for a bounded time.

 1 void enqueue(void *new_item)
 2   for i in 0 .. C-1
 3     while head[i] == tail + 1 mod N
 4       wait
 5   items[tail] = new_item
 6   tail = tail + 1 mod N
 7   for i in 0 .. C-1
 8     if tail == head[i] + 1 mod N
 9       notify consumer i
10 end enqueue
11
12 void *browse(int i)
13   while head[i] == tail
14     wait
15   return items[head[i]]
16 end browse
17
18 void dequeue(int i)
19   boolean last = false
20   head[i] = head[i] + 1 mod N
21   if head[i] == tail + 2 mod N
22     last = true
23   for j in 0 .. C-1
24     if head[j] == tail + 1 mod N
25       last = false
26       break
27   if last == true
28     notify producer
29 end dequeue

Fig. 6 Circular buffer operation pseudo code
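For concreteness, the following C11 sketch (ours, not the production implementation) renders the queue of Fig. 6 using only atomic loads and stores. The names spmc_queue, NSLOTS, and NCONS are our own, and we substitute busy-waiting for the bounded-time waits and notifications of lines 3-4, 7-9, and 21-28.

#include <stdatomic.h>
#include <stddef.h>

#define NSLOTS 1024   /* N: slot count; the queue holds at most NSLOTS-1 items */
#define NCONS  4      /* C: number of consumers */

typedef struct {
    void *items[NSLOTS];        /* the circular buffer */
    atomic_size_t tail;         /* next free slot; written only by the producer */
    atomic_size_t head[NCONS];  /* next slot to consume; head[i] written only by consumer i */
} spmc_queue;

/* Producer: logically enqueue new_item into all C logical queues. */
void enqueue(spmc_queue *q, void *new_item)
{
    size_t tail = atomic_load_explicit(&q->tail, memory_order_relaxed);
    for (int i = 0; i < NCONS; i++)
        while (atomic_load_explicit(&q->head[i], memory_order_acquire)
               == (tail + 1) % NSLOTS)
            ;  /* consumer i's logical queue is full: wait (lines 2-4) */
    q->items[tail] = new_item;  /* no logical queue refers to items[tail] yet (line 5) */
    /* linearization point: the release store publishes the item (line 6) */
    atomic_store_explicit(&q->tail, (tail + 1) % NSLOTS, memory_order_release);
}

/* Consumer i: return the first item of its logical queue. */
void *browse(spmc_queue *q, int i)
{
    size_t head = atomic_load_explicit(&q->head[i], memory_order_relaxed);
    while (atomic_load_explicit(&q->tail, memory_order_acquire) == head)
        ;  /* logical queue is empty: wait (lines 13-14) */
    return q->items[head];  /* line 15 */
}

/* Consumer i: remove the previously browsed item; the producer may then
 * recycle that slot once all consumers have dequeued past it. */
void dequeue(spmc_queue *q, int i)
{
    size_t head = atomic_load_explicit(&q->head[i], memory_order_relaxed);
    /* linearization point: the release store to head[i] (line 20) */
    atomic_store_explicit(&q->head[i], (head + 1) % NSLOTS, memory_order_release);
}

The acquire/release pairing guarantees that a consumer observing the new tail also observes the item written into items[tail], and that the producer reuses a slot only after the dequeuing consumer has published its new head pointer.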
IV. DATA LAYER API FOR REPLICATING ROW CHANGES

A. The SQL Approach for Replication
In an RDBMS, end-users make row changes by executing SQL statements [4]. Typically, when a SQL statement is executed, the RDBMS parses and analyses the statement to generate an execution plan. This plan is cached and is reused later when an identical statement is executed. During execution, runtime structures are created, initialized, and passed to the data layer to make the requested change to the rows and update all relevant indexes.

A typical replication system constructs a SQL statement from an LCR to execute a row change. Then the SQL statement is parsed and executed after binding the column
values obtained from the LCR. Since the replication system itself has some overhead, such as transaction assembly and scheduling, it might not be able to keep up with the data manipulation language (DML) changes at the source if it uses SQL to apply changes. Fig. 7 shows a simple RDBMS with both a SQL and a data layer interface.

B. The Data Layer Interface
The data layer interface is a set of internal APIs that allows upper RDBMS components, such as replication, to call directly into the data layer to make fast row changes to a table. This interface supports insert and update operations on a single row or multiple rows within a single table. It also provides APIs to query an index to get row IDs based on a single key or a range scan. Subsequently, the returned row IDs can be passed to the update API. In addition, this interface also supports table scanning and fetching a row based on a row ID.

[Figure: layered RDBMS architecture; a user's SQL goes through the SQL engine, while DB server components such as replication may call the data layer interface directly; both paths sit on the data layer, which sits on the storage layer]
Fig. 7 SQL and data layer interface in an RDBMS
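To make the shape of this interface concrete, the following C declarations sketch a hypothetical signature set in the spirit of the operations just listed. The names and types are ours for illustration; they are not Oracle's internal API.

/* Hypothetical data layer signatures (illustrative only; not Oracle's API). */
typedef struct table_handle table_handle;  /* an open table plus its cached metadata */
typedef struct rowid rowid;                /* a physical row identifier */

/* Insert nrows rows into one table; col_vals holds the column values. */
int dl_insert_rows(table_handle *t, int nrows, int ncols,
                   const int *col_ids, void *const *col_vals);

/* Update the listed columns of the given rows. */
int dl_update_rows(table_handle *t, int nrows, const rowid *rows,
                   int ncols, const int *col_ids, void *const *col_vals);

/* Query an index by a single key (key_lo == key_hi) or by a range scan;
 * returns up to max_out row IDs. */
int dl_index_lookup(table_handle *t, int index_id,
                    const void *key_lo, const void *key_hi,
                    rowid *out, int max_out);

/* Fetch one row by row ID; table scans are exposed similarly. */
int dl_fetch_row(table_handle *t, const rowid *row, void **col_vals_out);

Under this sketch, an apply process replicating an update LCR would first call dl_index_lookup with the LCR's key column values and then pass the returned row IDs to dl_update_rows, which is exactly the two-step pattern described above.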
C. Comparing SQL and the Data Layer Interface
Since the data layer provides the ACID properties for transactions, SQL and the data layer interface have no difference in transaction semantics. The SQL engine accesses and caches the underlying table metadata (e.g., column and index information) on a user's behalf and transparently handles the metadata cache invalidations due to schema evolution. Since the SQL engine is bypassed, the data layer interface must access and cache similar table metadata so that it can construct and pass the proper column values to the data layer. This cached metadata for a table can be shared by multiple concurrent data layer API invocations for a particular table, regardless of the DML operation or the columns referenced by each invocation. This behaviour applies even when the list of modified columns is different.

The data layer API provides time and space benefits to Oracle Streams replication. The data layer interface eliminates SQL statement generation, parsing, and traversing the SQL call stack. A major advantage of using the data layer interface over the SQL approach is that the cached metadata can be used for all DML changes to a particular table, regardless of which columns are updated or inserted during each DML operation. In SQL, to avoid generating and compiling new statements for updates to a given table, one can either create a huge update statement with case expressions or manage multiple statements to handle different subsets of modified columns. The former requires more execution time and the latter needs more memory. In either case, one still needs to pay the per-column cost.

However, the data layer interface does not have the expressive power of the SQL interface. For example, it does not handle joins, multi-table operations, etc. Fortunately, Streams row-level replication does not require the extra functionality provided by the SQL interface.

V. LATCH-FREE METADATA CACHING IN APPLY
This section describes the latch-free metadata caching algorithm used in the apply component. When replicating DMLs, each apply process in the apply component must access some replication-specific metadata for each table. For simplicity, we refer to this replication-specific metadata as the metadata. This metadata includes the metadata for using the data layer interface and the conflict detection and resolution directives. When the schema evolves, this metadata is invalidated. The database memory manager4 caches this metadata. The memory manager provides read access to the metadata through shared latches and write access to the metadata through exclusive latches5. The memory manager also provides latch-free notification of schema evolution for any table through a status variable. Any database process can periodically check the variable to detect schema evolution. Since the metadata is shared by multiple processes and stored in shared memory, the memory manager requires the list of the metadata sharers to recover from the loss of any involved process. Even when a process requests a shared latch to read the metadata, the memory manager takes some exclusive latches to maintain the list of sharers.

4 For simplicity, we refer to the database memory manager as the "memory manager" in the rest of the paper.
5 The read and write access refer to memory access, not disk access.

When applying a DML to a table, an apply process must hold a latch on the metadata in shared mode to exclude schema evolution on the table. Since the latch-free queue increases the LCR delivery throughput more than four times (Fig. 14) and the data layer API reduces the instruction count of an apply process by 28% to 70% (Table I), such high throughput has led to severe latch contention and impacted performance. With 20,000 LCRs per second, if one latch must be acquired per LCR, there will be 20,000 latch acquisitions per second. Eliminating the latch contention in accessing the metadata is essential for the apply component to achieve high throughput when executing DMLs. Our metadata caching algorithm eliminates such latch contention and uses a latch-free, single-writer, multiple-reader hash table to maintain a consistent view of the metadata. Although latch-free hash tables have been presented in earlier work [14][19], our latch-free hash table does not require the atomic CAS, and each hash entry has a lifetime (see Fig. 8). Our hash table leverages
the unique and monotonically increasing property of an LCR stream and splits a typical hash delete operation into invalidate and purge operations. The reader has an alternative source from which to get the metadata if there is no valid hash entry for it.

[Figure: lifetime of a hash entry; an entry is inserted by the writer, accessed by readers and the writer, invalidated by the writer due to entry eviction or schema evolution, moved to the purge list, and finally purged and moved to the free list when no more references remain]
Fig. 8 Lifetime of a hash entry

The metadata cache is built on top of the database metadata cache, which is managed by the memory manager. The apply reader process (as illustrated in Fig. 3) is the only writer to the metadata cache, while all apply processes are readers of the metadata cache. When caching a new entry in the hash table, the writer takes a shared latch on the metadata maintained by the memory manager and copies the metadata to the hash table. The writer checks a status variable maintained by the memory manager to detect schema evolution. Upon schema evolution or cache entry eviction, the writer moves the hash entry to the purge list and uses the apply low watermark (defined in section V-B-1) to purge the cached metadata when no reader will ever access the copy. The writer maintains a free list to reuse hash entries. A reader reads the metadata from the hash table without acquiring any latches. For concurrency control, a row exclusive lock6 on the table is required before applying a DML to the table. A reader leverages this lock on the table to exclude any schema evolution on the table and ensure that the metadata obtained from the cache is consistent with the table structure. We discuss the design details of latch-free metadata caching in this section.

6 A row exclusive lock on the table is also known as an intent exclusive (IX) lock.

A. A Naive Latch-free Algorithm
One simple solution to address the latch contention is to cache a copy of the metadata in the private memory of each apply process. Each apply process can then access its own private copy without latching the database metadata cache managed by the memory manager. The apply process can refresh its copy whenever the underlying cache is invalidated. A major problem with this approach is that multiple copies are cached in memory, leading to excess memory consumption, especially for cases that involve large metadata size, a large number of replicated tables, or high apply parallelism.

B. A Latch-free Single Writer/Multiple Reader Algorithm
To save memory, there is only one cached copy of the metadata in shared memory for all apply processes of a given apply component. To handle metadata for multiple tables, we use a hash table with linked lists for collision resolution. To eliminate latch contention when writing to this cached copy (e.g., inserting, invalidating, purging), the only writer is the apply reader process. Only this process populates and purges the cached metadata. Other apply processes simply read from this cached copy.

In the rest of this subsection, we first introduce some terminology and the important data structures. We then provide the algorithm and pseudo code.

1) Apply Low Watermark and High Watermark
Each LCR has an associated system change number (SCN). SCNs order all changes in an Oracle database. A Lamport clock [12] synchronizes SCNs between communicating databases. During a given run, a capture component generates an LCR stream with monotonically increasing SCN values. The apply component maintains two SCNs in memory, namely the low watermark and the high watermark, as illustrated in Fig. 9. The low watermark refers to the earliest SCN of a change (e.g., a DML) that the apply component might require from the LCR stream. In other words, any LCR with an SCN less than the low watermark is no longer needed by the apply component. When the apply component consumes the earliest transactions, the low watermark rises. Here, the earliest transactions refer to those transactions having LCRs with the smallest SCN values. The high watermark refers to the highest SCN of a change known to the apply component in the LCR stream.

[Figure: the LCR stream in monotonically increasing SCN order; the apply component's window lies between the low watermark (lower SCN) and the high watermark (higher SCN)]
Fig. 9 Apply low watermark and high watermark
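As an illustrative example (our numbers): if the apply component has fully consumed every transaction whose LCRs have SCNs below 1,000 and has seen LCRs up to SCN 1,500 in the stream, then the low watermark is 1,000 and the high watermark is 1,500. An LCR with SCN 900 will never be needed again, while one with SCN 1,200 may still be awaiting execution. As section V-C explains, a hash entry invalidated when the high watermark is 1,500 may be purged only after the low watermark rises above 1,500.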
2) Data Structures for Latch-free Metadata Caching
We assume that the read and write of a word are atomic. In our implementation, we also use read/write memory barriers to prevent the re-ordering of the atomic operations in latch-free metadata caching. Our hash table cache provides insert, get, invalidate, and purge functions. Fig. 10 shows the pseudo code for these functions. The line numbers mentioned in this section refer to the pseudo code in Fig. 10. Some important fields in the hash table are the following:
• GlobalVersionNo: The logical clock for insert operations of the hash table (line 5).
• PurgeList: A pointer to the list of hash table entries to be purged. When an entry is deleted from the hash table, it is moved to the PurgeList (lines 14-15). The memory for the cached metadata in this entry cannot yet be de-allocated (line 17) because a reader might still be accessing it.
• FreeList: A pointer to the list of free hash entries for the hash table.

Each entry in the hash table contains the following fields:
• EntryVersionNo: The version number for this hash entry.
• MetaData: A pointer to the cached copy. This is a deep copy from the original metadata in the database metadata cache (line 3).
• EntryHighWatermark (EntryHWM): An SCN used to determine the purge condition of hash entries in the purge list.
• IsSchemaEvolved: An indication of whether there is any schema evolution on this cached copy. Upon schema evolution, the memory manager sets IsSchemaEvolved to TRUE without taking any latches.
• IsValid: An indication of whether this hash entry is valid. This field allows the writer to communicate the validity of the entry to all readers atomically, without acquiring and releasing latches (lines 8 and 13). The writer sets IsValid to FALSE during invalidation.

 1 void insert(void *hash_table, void *key)
 2   get an available entry, possibly from FreeList
 3   deep copy from db metadata cache
 4   init IsSchemaEvolved
 5   increment the global version#
 6   initialize entry version# with current global version#
 7   initialize EntryHWM to ∞
 8   set valid bit
 9   add entry to the linked list of corresponding hash slot
10 end insert
11
12 void invalidate(void *hash_table, void *entry)
13   clear valid bit
14   delete the entry from the hash_table
15   move the entry to the purge list
16   update the EntryHWM with current apply high watermark
17   /* cannot free cached metadata until LWM > EntryHWM */
18 end invalidate
19
20 void *get(void *hash_table, void *key)
21   scan hash table linked lists by key
22   if valid entry found and no schema evolution
23     if the caller is a writer
24       adjust cache replacement policy
25     return meta data from this hash entry
26
27   if schema evolution detected and
28      the caller is a writer
29     invalidate this hash entry
30
31   /* valid hash entry not found or schema evolved */
32   if the caller is a reader or there is no memory
33     return meta data from the database metadata cache
34
35   if the caller is a writer and there is more memory
36     if hash_table is full
37       pick a victim based on cache replacement policy
38       invalidate the victim
39     insert this new hash entry
40     return meta data from this new hash entry
41 end get
42
43 void purge(void *hash_table, number LWM)
44   for each entry e in the PurgeList of hash_table
45     if (LWM > EntryHWM of e)
46       remove e from the PurgeList
47       perform needed bookkeeping
48       free deep copied cache meta data
49       move e to the FreeList
50 end purge

Fig. 10 Insert, invalidate, get, and purge function pseudo code
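The following C sketch (our rendering; the layout is illustrative) collects these fields into concrete data structures. The atomic qualifiers mark the single-word fields through which the writer and the readers communicate.

#include <stdatomic.h>
#include <stdint.h>

typedef uint64_t scn_t;                      /* system change number */

typedef struct hash_entry {
    struct hash_entry *next;                 /* collision chain, purge list, or free list link */
    void       *key;                         /* e.g., the replicated table's identifier */
    void       *metadata;                    /* MetaData: deep copy of the database metadata */
    uint64_t    entry_version_no;            /* EntryVersionNo: stamped from GlobalVersionNo */
    scn_t       entry_hwm;                   /* EntryHWM: infinity until invalidation */
    atomic_bool is_valid;                    /* IsValid: writer-to-readers validity flag */
    atomic_bool is_schema_evolved;           /* IsSchemaEvolved: set on schema evolution */
} hash_entry;

#define HASH_SLOTS 256                       /* illustrative table size */

typedef struct {
    hash_entry *slots[HASH_SLOTS];           /* one linked list per hash slot */
    uint64_t    global_version_no;           /* GlobalVersionNo: insert logical clock */
    hash_entry *purge_list;                  /* PurgeList: entries awaiting LWM > EntryHWM */
    hash_entry *free_list;                   /* FreeList: recycled entries, never freed */
} metadata_cache;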
[Figure: flowchart of the writer's get function; the writer looks for a valid hash entry, invalidates a stale entry if one exists (doing cache replacement bookkeeping and setting its EntryHWM), falls back to the database memory cache when no memory is available for caching, and otherwise evicts a victim if needed, inserts the new element, and returns the metadata from the hash entry]
Fig. 11 Flowchart of get function for writer

C. Algorithm Details
Both the writer and the readers can call the get function. Fig. 11 shows a flow chart of the get function for the writer. Only the writer invokes the insert, invalidate, and purge functions.

The Writer: the Invalidation and Purge Condition
Only the writer writes to this hash table. Fig. 12 shows the value transitions of the entry high watermark of a hash entry. The writer invokes the insert procedure (see the insert function in Fig. 10) to cache metadata for a table and initializes the entry high watermark to infinity (line 7). When the writer detects schema evolution for a table or evicts a cache entry for a table, it invalidates the corresponding cache entry (lines 27-29 and 37-38). The writer updates the entry high watermark of this hash entry to the current apply high watermark (line 16) and moves this hash table entry to the purge list. However, a reader might still be accessing this copy. The writer delays the purge of this hash entry until the apply low watermark rises above the entry high watermark of this hash entry. When this condition is established, the readers no longer access the copy, and the writer can safely purge this entry and free the memory occupied by the metadata in this entry (lines 45-49).

After a reader gets a hash entry, it is not possible for the writer to purge and reuse this hash entry before the reader has applied its LCR, because the LCR's SCN satisfies the following condition:

apply low watermark ≤ SCN of this LCR ≤ apply high watermark ≤ entry high watermark

Hence, the apply low watermark cannot rise above the entry high watermark, and this hash entry cannot be purged and
reused. If this hash entry was evicted due to cache replacement after the reader got it, the apply high watermark in the above formula takes the value of the apply high watermark at invalidation time (line 16).

The Readers: Synchronization with Schema Evolution
An applier process calls the get function to access the metadata in the hash table without taking any latches. When the cached metadata is invalid or the schema has evolved, a reader falls back to the database metadata cache to access the metadata (line 33).

After a reader has access to a cached copy of the metadata without taking any latch, and before the reader performs the actual DML operations, the cached copy could become invalid due to schema evolution. To avoid this race condition, the reader first obtains a row exclusive table lock to exclude schema evolution and uses this cached copy only if there was no recent schema evolution. Otherwise, the metadata in this cached copy could be inconsistent with the actual table structure.

[Figure: value transitions of the entry high watermark; it is initialized to ∞ during insert, set to the current apply high watermark during invalidation (schema evolution or entry eviction), and the entry is purged and moved to the free list once the apply low watermark rises above the entry high watermark]
Fig. 12 Value transitions of entry high watermark

The Synchronization in Linked List Manipulation

1 get a pointer to an entry in the linked list
2 if (entry is valid)
3   read next pointer, key, entry version number, etc
4   if (entry is still valid and
5       entry version number has not been changed)
6     safe to access key, next pointer in this entry

Fig. 13 Reader's linked list traversal pseudo code
Fig. 13 shows the pseudo code for the linked list traversal by a reader. When a reader scans a linked list for a corresponding hash table slot and is reading an entry in this list, the writer could just have added this new entry without proper initializations, or the writer could have moved this entry to the purge list, freed this entry, and recycled it for another table. To avoid reading invalid entries, the reader re-reads the valid bit after getting the needed information from this entry, e.g., the pointer to the next entry in the linked list, the key, and the entry version number (line 4). To avoid reading a valid, but recycled, entry, the reader ensures that the entry version number has not been changed (line 5). If the writer recycled an entry for another table, the entry version number for this entry must have been increased (lines 5-6 of Fig. 10). If the entry is not valid or has been recycled, the reader re-hashes the key and tries again. Alternatively, the reader can fall back to the database metadata cache. Note that even without the entry version number, if the current entry has been re-used for another table and the reader traverses to the wrong linked list, the reader will not return the wrong entry, because a matching entry must have the same key. The reader will then fall back to the database metadata cache to retrieve the metadata.

To avoid crashing a reader, the writer never frees the memory allocated for a hash entry and recycles those entries using the free list. However, the deep-copied metadata inside a hash entry can be freed (line 48, Fig. 10).
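A C11 sketch of this validated traversal follows (ours, not the production code; key comparison is simplified to pointer equality, and the acquire loads stand in for the read barriers of section V-B-2). It builds on the hash_entry and metadata_cache types sketched in section V-B-2.

/* Reader-side lookup: returns the cached metadata for key, or NULL, in
 * which case the reader falls back to the database metadata cache (or
 * re-hashes the key and retries, as described above). */
void *reader_get(metadata_cache *c, const void *key, uint64_t slot)
{
    hash_entry *e = c->slots[slot];
    while (e != NULL) {
        /* first read of the valid bit (Fig. 13, line 2) */
        if (!atomic_load_explicit(&e->is_valid, memory_order_acquire))
            return NULL;                     /* invalid entry: fall back */
        hash_entry *next = e->next;          /* copy next, key, version (line 3) */
        const void *k    = e->key;
        void       *md   = e->metadata;
        uint64_t    ver  = e->entry_version_no;
        /* re-read: still valid and not recycled for another table? (lines 4-5) */
        if (!atomic_load_explicit(&e->is_valid, memory_order_acquire) ||
            ver != e->entry_version_no)
            return NULL;                     /* invalidated or recycled: fall back */
        if (k == key)
            return md;  /* safe until the apply LWM passes this entry's EntryHWM */
        e = next;                            /* validated next pointer */
    }
    return NULL;                             /* no entry for this key */
}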
VI. PERFORMANCE EVALUATION
In this section, we use performance benchmarks to illustrate the performance benefits of the algorithms discussed in this paper. The throughput is measured in terms of the number of LCRs replicated per second from a source database to a destination database.

A. Latch-free Queue
In this subsection, we describe an experiment that evaluates the throughput of the latch-free queue alone in the context of Oracle Streams while avoiding other components that might cause a bottleneck. Each database runs on a Dell PE6850 computer system with four 3.66GHz Intel Xeon processors and 12 GB of RAM, running the Linux OS. The two computers are in the same Local Area Network (LAN) with 100Mb Ethernet. A workload consisting of a mix of insert, update, and delete transactions to a table with five columns is first applied to the source database. After the workload is completely staged in the redo log of the source database (i.e. the transactions have been processed by the source database), Oracle Streams is started on both the source and the destination databases to replicate the transactions. Note that we stage the workload before starting the replication process so that Oracle Streams will not compete with the workload at the source database for hardware resources.

The latch-free queues are used by two pairs of Oracle Streams components, namely the capture and the propagation sender for one queue, and the propagation receiver and the apply reader for the other queue. This experiment focuses on the rate at which LCRs are captured at the source database and propagated from the source database to the destination database. Because the applier processes do not directly interact with the latch-free queue, they are irrelevant in this experiment. Hence, we configure the applier processes to discard the LCRs received instead of applying them at the destination database so that the applier processes do not become the bottleneck in this experiment.

[Figure: throughput (K LCRs/s) over 600 s; the curve with the latch-free queue runs at roughly 21-25K LCRs/s, well above the curve with the latch-based queue]
Fig. 14 Latch-based queue vs. latch-free queue

Fig. 14 shows Oracle Streams' average throughput over 30-second intervals before and after the latch-free queue optimization is applied. Without the latch-free queue, the maximum throughput is roughly 5500 LCRs/s, as illustrated by the bottom curve. With the latch-free queue, Oracle Streams is able to achieve throughputs between 21,000 and 25,000 LCRs/s, at which point the capture component becomes the bottleneck. This experiment demonstrates that by taking advantage of Oracle Streams' unique architecture, namely having only one queue producer, we are able to use the special latch-free queue to achieve a four to five times higher LCR delivery rate.

B. Data Layer API
We evaluate the benefit of the data layer API by comparing its instruction counts with the counts that are required when SQL is used. Compared to SQL, the data layer API's instruction counts are lower across all tables that we measured. The reason is that the data layer API does not pay a high per-column cost like the variable binding in SQL. Instruction counts are measured for both the data layer API and SQL on inserts, non-key-column updates, and key-column updates to three tables: a table with 5 small columns, a table with 4 small columns and 1 large column, and a table with 100 small columns. Table I presents the average instruction count of each operation and the percentage reduction in instruction count attained by the data layer API. Note that we do not have the instruction count for deletes because we have not yet implemented data layer API support for the delete operation.

Table I. Average instruction count and reduction rate

                                  Insert                         Update (non-key columns)       Update (key columns)
                                  SQL    Data Layer  Reduction   SQL    Data Layer  Reduction   SQL    Data Layer  Reduction
                                         API                            API                            API
Narrow table (5 columns)          82K    59K         28%         88K    45K         49%         147K   93K         36%
Narrow table w/ a large column    88K    60K         32%         133K   73K         45%         197K   129K        34%
Wide table (100 columns)          386K   165K        57%         750K   192K        74%         809K   241K        70%

There are two key points we would like to make regarding the instruction count reduction rates for the data layer API. First, for the data layer API, the reduction percentage of inserts to a table with 100 small columns is greater than that to a 5-column table with a large column. For inserts to the table with a large column, copying the column data to the database redo log is the most expensive part of the operation, and it applies equally to both the data layer API and SQL. For inserts to the table with a large number of columns, the dominating factor is the per-column manipulation, which has a higher cost in SQL than in the data layer API. Second, when using the data layer API, the reduction rate for inserts is lower than that for non-key-column updates and is similar to that of key-column updates for the same tables. Although the data layer API reduces the instruction counts by eliminating the variable binding in SQL, it still has to perform index maintenance for inserts and key-column updates. The cost of the index maintenance limits the reduction rate for inserts and key-column updates.

C. Latch-free Metadata Caching
To evaluate the effectiveness of the metadata caching, we designed a micro benchmark to replicate changes to a table from one database to another. The benchmark runs an insert-only workload to a table containing five simple columns and only one index. The hardware configuration is the same as that used in the latch-free queue experiment.

[Figure: latch contention (counts/s) vs. apply parallelism from 0 to 9; with no metadata cache, contention grows steadily with parallelism, while with the metadata cache it stays near zero]
Fig. 15 Apply parallelism vs. latch contention

When metadata caching is not used, each applier process must acquire a latch on the table metadata to execute every DML. Because the insert-only workload does not introduce transactional dependencies, transactions can be applied in parallel. As we increase the apply parallelism to allow a larger number of applier processes to execute DMLs in parallel transactions, the latch contention for accessing the metadata increases, as illustrated in Fig. 15. The increasing latch contention consequently decreases the apply throughput.
When metadata caching is used, concurrent applier processes are able to access the metadata in the cache without acquiring latches. Latch contention is eliminated among the applier processes.

D. End-to-end Throughput Evaluation
In this subsection, we demonstrate the end-to-end throughput improvement of Oracle Streams yielded by the optimizations described in this paper. The hardware configuration, the setup of the two databases, and the workload are the same as those used in the experiment for the latch-free queue. Similarly, the workload is staged before Oracle Streams is started to avoid competition for hardware resources between the workload at the source database and Oracle Streams.

[Figure: end-to-end throughput (K LCRs/s) over 600 s for four configurations, from bottom to top: no optimizations; latch-free queue; latch-free queue + data layer API; latch-free queue + data layer API + metadata cache]
Fig. 16 End-to-end throughput with various optimizations

Fig. 16 illustrates Oracle Streams' average end-to-end throughput over 30-second intervals for different combinations of optimizations. Because Oracle Streams captures, propagates, and applies LCRs in a pipelined fashion, the bottleneck component limits the end-to-end throughput of the system. The bottom curve of Fig. 16 shows the throughput of the system before any optimization is applied, which is roughly 5100 LCRs/s. The latch-based queue that we use to stage LCRs among the capture, propagation, and apply components is the limiting factor. After the latch-free queue is used, the throughput is increased up to three times the original rate, as indicated by the second curve from the bottom. After the latch-free queue eliminates the bottleneck in staging LCRs among Streams components, the apply component becomes the bottleneck in applying LCRs at the destination database. Then we observe that the data layer API improves the apply rate by 3000 - 6000 LCRs/s because of the instruction reduction in the data layer API, as illustrated by the third curve from the bottom. Both the second and the third curves from the bottom show large fluctuations in throughput due to latch contention on the replica metadata at the destination database. There are multiple applier processes applying transactions in parallel, and they need to access the metadata, which is latch-protected for concurrent accesses. When there is latch contention in accessing the metadata, the overall throughput drops sharply. The metadata cache optimization eliminates this latch contention so that the throughput is higher and steadier, as illustrated by the topmost curve.

E. Replication of Concurrent Workload in LAN and WAN
In this subsection, we present both the end-to-end throughput and the replication latency of Oracle Streams replicating a concurrent workload in a LAN and a WAN. Although we use similar hardware and workload as in the previous experiment, we run the workload concurrently with Oracle Streams. Note that we reduce the workload generation rate in the WAN experiment to fit the maximum WAN bandwidth. Replication latency is the elapsed time between when a row change was made at a source database and when the same row change was applied at the destination database. Note that the clock granularity of the latency measurement was one second. Furthermore, in the WAN configuration, an actively used corporate network between California and Texas, approximately 1800 miles apart with a 50 ms network Round Trip Time (RTT), separates the source and destination databases.

[Figure: throughput (K LCRs/s) over 1200 s; the LAN curve holds steady near the staged-workload rate, while the WAN curve runs lower and fluctuates between roughly 14 and 22K LCRs/s]
Fig. 17 LAN vs. WAN throughput

As shown in Fig. 17, Streams' throughput in the LAN with the concurrent workload at the source database is the same as when the workload is staged. Compared to the throughput in the LAN, Streams' throughput in the WAN is limited by the available bandwidth in the WAN. Furthermore, the throughput in the WAN fluctuates more because the available bandwidth in the WAN is volatile.

Fig. 18 shows that Streams achieves sub-second latency in the LAN. To minimize network latencies in the WAN configuration, the propagation component streams LCRs over the network with a time-based acknowledgement. The interval for the time-based acknowledgement is 30 seconds in this experiment. In other words, each network RTT is amortized over a 30-second interval. However, the volatility of the available network bandwidth in the WAN leads to the relatively larger fluctuations in Streams replication latency. During our WAN experiment, Streams fully utilizes the
available bandwidth, which fluctuates between 4.5MB/s and 5.2MB/s. When the WAN bandwidth drops below the rate at which the workload is generated in the source database and captured by Streams, some LCRs must be delayed at the source database. Therefore, the replication latencies for those delayed LCRs increase, as shown in Fig. 18. The latency fluctuations in the WAN correspond to the fluctuations of the available bandwidth in the WAN.

[Figure: replication latency (s) over 1200 s; the LAN latency stays below one second, while the WAN latency fluctuates between roughly 0.5 and 4.5 seconds]
Fig. 18 LAN vs. WAN latencies

VII. CONCLUSION
In this paper, we described Oracle Streams, a high performance replication solution for Oracle databases, and its recent performance optimizations. Oracle Streams employs a pipeline of components to replicate database transactions asynchronously. Through a number of optimizations on shared memory, both the delay in transporting LCRs among Streams components and the LCR execution time at database replicas are reduced greatly. In addition, the LCR execution time at a database replica is further shortened when the apply component uses a data layer API, which bypasses SQL. Through simple replication benchmarks, we demonstrated how Oracle Streams can replicate over 20,000 LCRs per second on commodity hardware. Independent analysis from one of our customers demonstrates a more than two-times performance improvement in Oracle Database 11gR1 compared with their prior deployment.

ACKNOWLEDGMENT
We would like to thank Jim Stamos for numerous discussions that helped shape the arguments presented in this paper. Randy Urbano provided valuable comments that helped improve the layout of this paper. This work would not have been possible without the encouragement of our manager, Alan Downing.

REFERENCES
[1] N. Arora, "Oracle Streams for Near Real Time Asynchronous Replication," VLDB Workshop on Design, Implementation and Deployment of Database Replication, VLDB 2005.
[2] N. Arora, R. Blumofe, and C. Plaxton, "Thread Scheduling for Multiprogrammed Multiprocessors," ACM Symposium on Parallel Algorithms and Architectures, 1998, pp. 119-129.
[3] J. Aspnes and M. Herlihy, "Wait-Free Data Structures in the Asynchronous PRAM Model," Proceedings of the 2nd Annual Symposium on Parallel Algorithms and Architectures, Crete, Greece, July 1990, pp. 340-349.
[4] E. F. Codd, "A Relational Model of Data for Large Shared Data Banks," Communications of the ACM, vol. 13, no. 6, June 1970.
[5] D. Daniels, L. Doo, A. Downing, C. Elsbernd, G. Hallmark, S. Jain, B. Jenkins, P. Lim, G. Smith, B. Souder, and J. Stamos, "Oracle's Symmetric Replication Technology and Implications for Application Design," Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, May 24-27, 1994.
[6] D. Detlefs, P. Martin, M. Moir, and G. Steele, "Lock-free Reference Counting," Distributed Computing, vol. 15, no. 4, pp. 255-271, 2002.
[7] D. Duellmann, "Oracle Streams for the Large Hadron Collider at CERN," Oracle Open World, San Francisco, November 2007.
[8] J. Gray, "Notes on Database Operating Systems," Operating Systems, An Advanced Course, vol. 60, Springer-Verlag, New York, 1978.
[9] J. Gray, P. Helland, P. E. O'Neil, and D. Shasha, "The Dangers of Replication and a Solution," SIGMOD Conf. 1996, pp. 173-182; MSR-TR-96-17.
[10] M. Herlihy and J. Wing, "Linearizability: A Correctness Condition for Concurrent Objects," ACM Transactions on Programming Languages and Systems, 12(3):463-492, July 1990.
[11] IBM, "WebSphere Information Integrator Q Replication," [Link]/developerworks/db2/library/techarticle/dm-0503aschoff/
[12] L. Lamport, "Time, Clocks and the Ordering of Events in a Distributed System," Communications of the ACM, 21(7):558-565, July 1978.
[13] Y. Lin, B. Kemme, M. Patiño-Martínez, and R. Jiménez-Peris, "Middleware based Data Replication providing Snapshot Isolation," ACM Int. Conf. on Management of Data (SIGMOD), Baltimore, Maryland, June 2005.
[14] M. Michael, "High Performance Dynamic Lock-free Hash Tables and List-based Sets," Proceedings of the Fourteenth Annual ACM Symposium on Parallel Algorithms and Architectures, 2002, pp. 73-82.
[15] C. Mohan, B. Lindsay, and R. Obermarck, "Transaction Management in the R* Distributed Database Management System," ACM Trans. Database Systems, vol. 11, no. 4, pp. 378-396, 1986.
[16] Oracle Database Advanced Replication 11g Release 1 (11.1), Conflict Resolution Concepts and Architecture. [Link]
[17] Oracle Streams Concepts and Administration 11g Release 1 (11.1). [Link]
[18] F. Pedone, R. Guerraoui, and A. Schiper, "Exploiting Atomic Broadcast in Replicated Databases," Proceedings of EuroPar (EuroPar'98), Sept. 1998.
[19] O. Shalev and N. Shavit, "Split-ordered Lists: Lock-free Extensible Hash Tables," Journal of the ACM, 53(3):379-405, 2006.
[20] A. Tanenbaum and A. Woodhull, Operating Systems: Design and Implementation, Prentice Hall, Englewood Cliffs, NJ, 1987.
[21] TPC-C Benchmark Results, [Link]