SCD Type 1 and Type 2 Implementation in PySpark
Slowly changing dimensions (SCDs) refer to data in dimension tables that
changes slowly over time and not at a regular cadence. A common example is a
customer profile: a customer's email address or phone number doesn't change
that often, which makes these attributes perfect candidates for an SCD.
Here are some of the main aspects we will need to consider while designing
an SCD:
Should we keep track of the changes? If yes, how much of the history should we maintain?
Or should we just overwrite the changes and ignore the history?
Based on our requirements for maintaining the history, there are about seven
ways in which we can keep track of changes. They are named SCD1, SCD2, SCD3,
and so on, up to SCD7.
Designing SCD1
In SCD type 1, the values are overwritten and no history is maintained, so
once the data is updated, there is no way to find out what the previous value
was. The new queries will always return the most recent value. Here is an
example of an SCD1 table:
print("Full data...")
df_full.show()
print("daily data...")
df_daily_update.show()
Result:
1. We have the full (historical) data, which must be changed as the user or entity requests.
2. The user's changes arrive as daily data, and we need to apply them to the full data.
Both datasets have been read into the DataFrames shown above.
Result:
We have performed the operation that inserts new records and updates existing records using PySpark code.
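To make the SCD1 behaviour concrete, here is a minimal plain-Python sketch of the row-level logic (overwrite matching records, append new ones). It is an illustration under stated assumptions, not the PySpark job itself: the function name `scd1_merge`, the `id` key, and the sample rows are all hypothetical.

```python
def scd1_merge(full_rows, daily_rows, key="id"):
    """Overwrite rows whose key matches and append new ones; no history is kept."""
    merged = {row[key]: dict(row) for row in full_rows}
    for row in daily_rows:
        # A daily row replaces any existing row with the same key,
        # or is added if the key has not been seen before.
        merged[row[key]] = dict(row)
    return sorted(merged.values(), key=lambda r: r[key])


# Hypothetical sample data standing in for df_full and df_daily_update
full_rows = [
    {"id": 1, "name": "Asha", "email": "asha@old.example"},
    {"id": 2, "name": "Ben", "email": "ben@example.com"},
]
daily_rows = [
    {"id": 1, "name": "Asha", "email": "asha@new.example"},  # update
    {"id": 3, "name": "Chen", "email": "chen@example.com"},  # new record
]

result = scd1_merge(full_rows, daily_rows)
for row in result:
    print(row)
```

In PySpark, one common way to express the same idea is to keep the rows of df_full that have no match in df_daily_update (a "left_anti" join on id) and union them with df_daily_update. Whether that matches the original job is an assumption.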
SCD Type-2
In SCD type 2, every change is stored as a new record. If a detail in the data changes, a new row is added to
the table with a new primary key. The natural key, however, remains the same, so that all versions of a
record can be linked to one another. Type 2 dimensions are the most common approach to tracking historical records.
SCD 2 Implementation
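import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit, current_date
spark = SparkSession.builder.appName('sparkdf').getOrCreate()
# Mark every existing record as active, valid from today, with an open-ended To_date
df_full = df_full.withColumn("Active_Flag", lit("Y")) \
    .withColumn("From_date", current_date()) \
    .withColumn("To_date", lit(None).cast("date"))  # a real null, not the string "Null"
df_full.show()
# Give the daily records the same three columns so they can be unioned later
df_daily = df_daily.withColumn("Active_Flag", lit("Y")) \
    .withColumn("From_date", current_date()) \
    .withColumn("To_date", lit(None).cast("date"))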
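Using the no_change and update_ds DataFrames (the unchanged and the updated records, respectively), select the daily records whose id does not appear in no_change, then union the three sets: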
insert_ds = df_daily.join(no_change,"id","left_anti")
insert_ds.show()
df_final=update_ds.union(insert_ds).union(no_change)
df_final.show()
The final DataFrame contains the unchanged records together with the updated records and the new entries.
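The three-way split used above (no_change, update_ds, insert_ds) can be sketched in plain Python to check the expected rows without a Spark session. This is a sketch under assumptions: the text does not show how update_ds is built, so it is taken here to mean the old versions of changed records, closed with Active_Flag "N" and a To_date. The function `scd2_merge`, the `tracked` column list, the `today` parameter, and the sample rows are hypothetical.

```python
from datetime import date


def scd2_merge(full_rows, daily_rows, today, key="id", tracked=("email",)):
    """Split records into no_change / update_ds / insert_ds and combine them."""
    current = {r[key]: r for r in full_rows if r["Active_Flag"] == "Y"}
    history = [dict(r) for r in full_rows if r["Active_Flag"] != "Y"]
    daily = {r[key]: r for r in daily_rows}

    no_change, update_ds, insert_ds = [], [], []
    for k, old in current.items():
        new = daily.get(k)
        if new is None or all(old[c] == new[c] for c in tracked):
            no_change.append(dict(old))  # untouched active record
        else:
            # Assumption: the old version is expired rather than deleted
            update_ds.append({**old, "Active_Flag": "N", "To_date": today})

    # Mirrors the left_anti join: daily rows whose key is not in no_change
    unchanged_keys = {r[key] for r in no_change}
    for k, new in daily.items():
        if k not in unchanged_keys:
            insert_ds.append(
                {**new, "Active_Flag": "Y", "From_date": today, "To_date": None}
            )

    return history + update_ds + insert_ds + no_change


# Hypothetical sample data
start = date(2024, 1, 1)
today = date(2024, 6, 1)
full_rows = [
    {"id": 1, "email": "asha@old.example", "Active_Flag": "Y",
     "From_date": start, "To_date": None},
    {"id": 2, "email": "ben@example.com", "Active_Flag": "Y",
     "From_date": start, "To_date": None},
]
daily_rows = [
    {"id": 1, "email": "asha@new.example"},  # changed
    {"id": 2, "email": "ben@example.com"},   # unchanged
    {"id": 3, "email": "chen@example.com"},  # new
]

final_rows = scd2_merge(full_rows, daily_rows, today)
for row in final_rows:
    print(row)
```

Each natural key ends up with exactly one active row, and the changed record keeps its previous version as an expired row, which is what distinguishes SCD2 from the overwrite behaviour of SCD1.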