MODERN DATA ARCHITECTURES
FOR BIG DATA II
APACHE SPARK
STREAMING API
Agenda
● Event and Processing Time
● Window Operations
● Late Data and Watermarking
● Join Operations
● Stream Deduplication
1.
EVENT AND
PROCESSING TIME
Handling Processing time
● Processing time refers to the moment at which
Spark is processing the data.
● current_timestamp returns the current
timestamp at the start of query evaluation
as a TimestampType column.
Handling Processing time
[Figure: the processing-time column in the query output]
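To make this concrete, here is a minimal sketch that tags each incoming row with its processing time. It uses Spark's built-in rate source and console sink; the application name and rate are arbitrary choices for the illustration.

from pyspark.sql import SparkSession
from pyspark.sql.functions import current_timestamp

spark = SparkSession.builder.appName("processing-time-demo").getOrCreate()

# The rate source generates rows continuously; handy for demos.
events = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Tag each row with the timestamp taken at query evaluation time.
with_proc_time = events.withColumn("proc_time", current_timestamp())

query = (with_proc_time.writeStream
         .format("console")
         .outputMode("append")
         .start())
query.awaitTermination()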
Handling Event time
● Event time is embedded in the data itself,
but it might be referenced differently
depending on the source we’re using.
● Event time is a column value in the row,
with TimestampType data type (see the sketch below).
● Window-based aggregations ➡ a type of
grouping and aggregation on this column:
○ Each time window is a group, and each row
can belong to multiple windows/groups.
Handling Event time
[Figure: the event-time column inside the data]
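For instance, a source may deliver the event time as a plain string that has to be parsed into a TimestampType column before windowing. A minimal sketch, where the schema, field names and input directory are all hypothetical:

from pyspark.sql import SparkSession
from pyspark.sql.functions import to_timestamp
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("event-time-demo").getOrCreate()

# Hypothetical schema: each JSON event carries its own timestamp field.
schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("when", StringType()),  # e.g. "2024-05-01 12:07:31"
])

raw = (spark.readStream
       .format("json")
       .schema(schema)
       .load("/tmp/events"))  # hypothetical input directory

# Event time must live in a TimestampType column.
events = raw.withColumn("event_time", to_timestamp("when"))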
2.
WINDOW
OPERATIONS
Window Operations* on Event Time
● Time-based windowing strategies available in
Structured Streaming:
○ Tumbling windows
○ Sliding windows
○ Session windows (recently added; we won’t cover them)
* Naming convention in “Streaming Data - Understanding the Real-Time Pipeline”.
Tumbling time-based window
pyspark.sql.functions.window(timeColumn, windowDuration)
● Bucketize rows into one time window given a
timestamp-specifying column (see the sketch below).
● Window starts are inclusive but window ends
are exclusive ➡ for example [12:05, 12:10)
● Durations are provided as strings: ‘week’, ‘day’, ‘hour’,
‘minute’, ‘second’, ‘millisecond’ and ‘microsecond’.
Tumbling time-based window
[Figure: tumbling-window code example over a Kafka stream]
● key & value arrive as arrays of bytes ➡ convert
them into strings before working with them.
● The event-time column determines the window each
row falls into; a window is defined by its start and end.
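Putting both slides together, a sketch of a tumbling-window count over a Kafka stream. The broker address and topic are placeholders, and the spark-sql-kafka connector is assumed to be on the classpath:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

spark = SparkSession.builder.appName("tumbling-demo").getOrCreate()

# key and value arrive as byte arrays ➡ cast them to strings first.
lines = (spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "events")
         .load()
         .selectExpr("CAST(key AS STRING)",
                     "CAST(value AS STRING)",
                     "timestamp"))

# Tumbling window: each row falls into exactly one 5-minute bucket.
counts = (lines
          .groupBy(window(col("timestamp"), "5 minutes"), col("value"))
          .count())

query = (counts.writeStream
         .format("console")
         .outputMode("complete")
         .start())
query.awaitTermination()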
Sliding window
pyspark.sql.functions.window(timeColumn, windowDuration,
slideDuration)
● Bucketize rows into one or more time windows
given a timestamp-specifying column (see the sketch below).
[Figure: the same events bucketized into sliding windows, over event time and processing time]
Sliding window
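A sketch of the sliding variant, self-contained on the rate source; the 10-minute window and 5-minute slide are arbitrary:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("sliding-demo").getOrCreate()

events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# 10-minute windows starting every 5 minutes ➡ each row can belong
# to two windows/groups.
counts = (events
          .groupBy(window(events.timestamp, "10 minutes", "5 minutes"))
          .count())

query = counts.writeStream.format("console").outputMode("complete").start()
query.awaitTermination()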
3.
LATE DATA AND
WATERMARKING
Events can be late
● Due to multiple factors, events can arrive late
to the analytics tier, Spark in our case.
● The stream processing engine can maintain
the intermediate state for partial aggregates.
● This makes it possible to correctly update the
aggregates of old windows; it is not done
forever, though, but only for a certain amount of time ➡
watermarks.
Events can be late
* More on this at Handling Late Data and Watermarking
Events can be late
[Figure: windowed aggregation with a watermark of 10 minutes]
* More on this at Handling Late Data and Watermarking
Watermarking in Spark Streaming
DataFrame.withWatermark(eventTime, delayThreshold)
● Defines an event time watermark for a DataFrame.
● A watermark tracks a point in time before which we
assume no more late data is going to arrive.
● eventTime is a string with the name of the
event-time column.
● delayThreshold is a string with an interval: “1 minute”,
“5 hours”, ...
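A minimal sketch combining withWatermark with a windowed count, again on the rate source; the thresholds are arbitrary:

from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.appName("watermark-demo").getOrCreate()

events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

# Accept events up to 10 minutes late; older aggregation state can be
# dropped. Note: withWatermark uses the same timestamp column as the
# window, and is called before the aggregation.
counts = (events
          .withWatermark("timestamp", "10 minutes")
          .groupBy(window(events.timestamp, "5 minutes"))
          .count())

query = counts.writeStream.format("console").outputMode("update").start()
query.awaitTermination()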
Watermarking in Spark Streaming
Watermarking in Spark Streaming
● Conditions for watermarking to clean
aggregation state:
○ Output mode* must be Append or Update, not
Complete (Complete requires all aggregate data
to be preserved ➡ more resources).
○ The aggregation must involve the event-time column
or a window on the event-time column.
○ withWatermark must be called on the same
timestamp column as the one used in the aggregation.
○ withWatermark must be called before the
aggregation.
* Append is the default value if no output mode is specified.
4.
JOIN
OPERATIONS
Join Operations
● Spark Structured Streaming supports:
○ Joining a streaming DataFrame with a static one
○ Joining a streaming DataFrame with another streaming one
● Streaming joins ➡ incremental results, like
streaming aggregations.
Stream-Static Joins
● Supports inner joins and some types of
outer joins (see the sketch below).
Stream-Static Joins
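A sketch of the inner case: the static side is a hypothetical lookup table (path and columns invented for the illustration), while the rate source stands in for a real event stream:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-static-demo").getOrCreate()

# Static side, loaded once (hypothetical path; columns device_id, owner).
devices = spark.read.parquet("/data/devices")

# Streaming side; the rate source's value column plays the join key.
events = (spark.readStream.format("rate").load()
          .withColumnRenamed("value", "device_id"))

# Inner stream-static join on the common key.
enriched = events.join(devices, on="device_id", how="inner")

query = enriched.writeStream.format("console").outputMode("append").start()
query.awaitTermination()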
Stream-Stream Joins
● Main challenge:
○ During the join, the view of the data may be
incomplete on both sides
○ Harder to find matches between inputs
● Any row received from one input stream can
match with any future, yet-to-be-received
row from the other input stream.
Stream-Stream Joins
● Approach followed:
○ Past input is buffered for “a while”
○ Every future input can match with past input
and accordingly generate joined results
○ Late, out-of-order data is handled automatically
○ How long “a while” lasts is controlled with watermarks.
● We’re not going to go into detail on this type
of joins.
Stream-Stream Joins
● Example of Inner Join with Watermarking:
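A sketch in the spirit of the ad-monetization example from the Spark documentation, with rate sources standing in for real impression and click streams; key names and thresholds are illustrative:

from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.appName("stream-stream-demo").getOrCreate()

impressions = (spark.readStream.format("rate").load()
               .selectExpr("value AS impressionAdId",
                           "timestamp AS impressionTime"))
clicks = (spark.readStream.format("rate").load()
          .selectExpr("value AS clickAdId", "timestamp AS clickTime"))

# Watermarks bound how long past input is buffered on each side; the
# time-range condition tells Spark how far apart matches may be.
joined = (impressions
          .withWatermark("impressionTime", "2 hours")
          .join(clicks.withWatermark("clickTime", "3 hours"),
                expr("""
                  clickAdId = impressionAdId AND
                  clickTime >= impressionTime AND
                  clickTime <= impressionTime + interval 1 hour
                """)))

query = joined.writeStream.format("console").outputMode("append").start()
query.awaitTermination()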
5.
STREAM
DEDUPLICATION
Stream Deduplication
● In computing, data deduplication* is a
technique for eliminating duplicate copies of
repeating data.
● If part of your end-to-end solution provides an
at-least-once guarantee ➡ data duplication.
● We can turn at-least-once into exactly-once
by using stream deduplication.
* Data deduplication definition from Wikipedia.
Stream Deduplication
● There has to be a unique identifier in events,
determined by one or multiple columns.
● The query keeps history from previous events
in order to filter duplicates.
● Deduplication can be used:
○ With Watermarking - bounds the size of the
history the query has to maintain
○ Without Watermarking - the query stores the
data from all the past events
Stream Deduplication
pyspark.sql.DataFrame.dropDuplicates(subset=None)
● Valid for both static and streaming DataFrames.
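A sketch of both variants, where the rate source's value column plays the role of a unique event identifier:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("dedup-demo").getOrCreate()

events = (spark.readStream.format("rate").load()
          .selectExpr("value AS event_id", "timestamp AS event_time"))

# Without watermarking: state for every past event_id is kept forever.
deduped_unbounded = events.dropDuplicates(["event_id"])

# With watermarking: history older than 10 minutes can be evicted.
deduped_bounded = (events
                   .withWatermark("event_time", "10 minutes")
                   .dropDuplicates(["event_id", "event_time"]))

query = (deduped_bounded.writeStream
         .format("console")
         .outputMode("append")
         .start())
query.awaitTermination()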