This approach allows the financial institution to present
a clear set of characteristics that led to the identification
of fraud that is compliant with the General Data
Protection Regulation (GDPR). However, this approach
also poses numerous difficulties. The implementation
of the detection pattern using a hardcoded set of rules
is very brittle. Any changes to the pattern would take a
very long time to update. This, in turn, makes it difficult
to keep up with and adapt to the shift in fraudulent
behaviors that are happening in the current marketplace.
Additionally, the systems in the workflow described above are often siloed, with the domain experts, data scientists, and data engineers all compartmentalized. The data engineer is responsible for maintaining massive amounts of data and translating the work of the domain experts and data scientists into production-level code. Due to the lack of a common platform, the domain experts and data scientists have to rely on sampled-down data that fits on a single machine for analysis. This leads to difficulty in communication and ultimately a lack of collaboration.

In this eBook, we will showcase how to convert several such rule-based detection use cases to machine learning use cases on the Databricks platform, unifying the key players in fraud detection: domain experts, data scientists, and data engineers. We will learn how to create a fraud-detection data pipeline and visualize the data, leveraging a framework for building modular features from large data sets. We will also learn how to detect fraud using decision trees and Apache Spark MLlib. We will then use MLflow to iterate and refine the model to improve its accuracy.
SOLVING WITH ML
There is a certain degree of reluctance with regard to machine learning models in the financial world, as they are believed to offer a "black box" solution with no way of justifying the identified fraudulent cases. GDPR requirements, as well as financial regulations, make it seemingly impossible to leverage the power of machine learning. However, several successful use cases have shown that applying machine learning to detect fraud at scale can solve a host of the issues mentioned, starting with three challenges:

1. The lack of training labels,
2. The decision of what features to use, and
3. Having an appropriate benchmark for the model.

Training a machine learning model to recognize the rule-based fraudulent behavior flags offers a direct comparison with the expected output via a confusion matrix. Provided that the results closely match the rule-based detection pattern, this approach helps gain confidence in machine learning-based fraud detection with the skeptics. The output of this model is very easy to interpret and may serve as a baseline discussion of the expected false negatives and false positives when compared to the original detection pattern.
EXPLORING THE DATA
Creating the DataFrames – Now that we have uploaded the data to the Databricks File System (DBFS), we can quickly and easily create DataFrames using Spark SQL.
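As a minimal sketch (the table name sim_fin_fraud_detection and the column names here are illustrative of a simulated transactions dataset):

# Create the df DataFrame from the uploaded table using Spark SQL
df = spark.sql("""
  SELECT step, type, amount, nameOrig, oldbalanceOrg, newbalanceOrig,
         nameDest, oldbalanceDest, newbalanceDest
  FROM sim_fin_fraud_detection
""")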
Now that we have created the DataFrame, let's take a look at the schema and the first thousand rows to review the data.

TYPES OF TRANSACTIONS
Let's visualize the data to understand the types of transactions the data captures and their contribution to the overall transaction volume.
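One way to build such a view, as a sketch using the df DataFrame from above (display renders the aggregate as a chart in a Databricks notebook):

# Aggregate transaction count and total dollar amount by transaction type
from pyspark.sql import functions as F
display(df.groupBy("type").agg(F.count("*").alias("count"), F.sum("amount").alias("total_amount")))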
RULES-BASED MODEL
We are not likely to start with a large data set of known fraud cases to train our model. In most practical applications, fraud detection patterns are identified by a set of rules established by the domain experts. Here, we create a column called label based on these rules.
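A sketch of how such a label column can be derived (the specific conditions and thresholds below are placeholders for the domain experts' actual rules):

from pyspark.sql import functions as F

# Flag a transaction as fraudulent (label = 1) when it matches the experts' rules
# (the condition and threshold here are illustrative placeholders)
df = df.withColumn("label",
    F.when((F.col("type") == "TRANSFER") & (F.col("amount") > 200000), 1)
     .otherwise(0))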
VISUALIZING DATA FLAGGED BY RULES
These rules often flag quite a large number of fraudulent cases. Let's visualize the number of flagged transactions. We can see that the rules flag about 4% of the cases and 11% of the total dollar amount as fraudulent.

After this ETL process is completed, you can use the display command again to review the cleansed data in a scatterplot.
Selecting the Appropriate Machine Learning Models
In many cases, a black box approach to fraud detection cannot be used. First, the domain experts need to be able to understand why a transaction was identified as fraudulent. Then, if action is to be taken, the evidence has to be presented in court. The decision tree is an easily interpretable model and is a great starting point for this use case. Read the blog "The wise old tree" on decision trees to learn more.

CREATING THE TRAINING SET
To build and validate our ML model, we will do an 80/20 split using .randomSplit. This will set aside a randomly chosen 80% of the data for training and the remaining 20% to validate the results.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

# Index the categorical transaction type and assemble the feature vector
# (the feature columns follow the transaction schema used above)
indexer = StringIndexer(inputCol="type", outputCol="typeIndexed")
va = VectorAssembler(inputCols=["typeIndexed", "amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest"], outputCol="features")
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", seed=54321, maxDepth=5)

# Fit the pipeline on the training set and visualize the fitted tree (the last stage)
dt_model = Pipeline(stages=[indexer, va, dt]).fit(train)
display(dt_model.stages[-1])
MODEL TUNING
To ensure we have the best fitting tree model, we will cross-validate the model with several parameter variations. Given that our data consists of 96% negative and 4% positive cases, we will use the Precision-Recall (PR) evaluation metric to account for the unbalanced distribution.

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluators for area under the PR and ROC curves
evaluatorPR = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="prediction", metricName="areaUnderPR")
evaluatorAUC = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="prediction", metricName="areaUnderROC")

# Build the grid of different parameters
paramGrid = ParamGridBuilder() \
    .addGrid(dt.maxDepth, [5, 10, 15]) \
    .addGrid(dt.maxBins, [10, 20, 30]) \
    .build()

# Build out the cross-validation
crossval = CrossValidator(estimator=dt,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluatorPR,
                          numFolds=3)

# Build the CV pipeline
pipelineCV = Pipeline(stages=[indexer, va, crossval])

# Train the model using the pipeline, parameter grid, and preceding BinaryClassificationEvaluator
cvModel_u = pipelineCV.fit(train)

MODEL PERFORMANCE
We evaluate the model by comparing the Precision-Recall (PR) and Area under the ROC curve (AUC) metrics for the training and test sets. Both PR and AUC appear to be very high.

# Apply the best model to the training and test datasets
train_pred = cvModel_u.transform(train)
test_pred = cvModel_u.transform(test)

# Evaluate the model on the training dataset
pr_train = evaluatorPR.evaluate(train_pred)
auc_train = evaluatorAUC.evaluate(train_pred)

# Evaluate the model on the test dataset
pr_test = evaluatorPR.evaluate(test_pred)
auc_test = evaluatorAUC.evaluate(test_pred)

# Print out the PR and AUC values
print("PR train:", pr_train)
print("AUC train:", auc_train)
print("PR test:", pr_test)
print("AUC test:", auc_test)

---
# Output:
# PR train: 0.9537894984523128
# AUC train: 0.998647996459481
# PR test: 0.9539170535377599
# AUC test: 0.9984378183482442
To see which results the model misclassified, let's use matplotlib and pandas to visualize our confusion matrix.
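A minimal sketch of one way to do this, assuming the test predictions from above (the pivot-based approach and plotting details here are illustrative):

import matplotlib.pyplot as plt
import numpy as np

# Count (label, prediction) pairs and pivot them into a 2x2 pandas DataFrame
cm_pd = (test_pred.groupBy("label")
         .pivot("prediction", [0.0, 1.0])
         .count()
         .orderBy("label")
         .fillna(0)
         .toPandas()
         .set_index("label"))

# Plot the confusion matrix, annotating each cell with its count
fig, ax = plt.subplots()
ax.matshow(cm_pd.values, cmap="Blues")
for (i, j), v in np.ndenumerate(cm_pd.values):
    ax.text(j, i, "{:,}".format(int(v)), ha="center", va="center")
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
display(fig)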
BALANCING THE CLASSES
To improve the model further, we can balance the training set by undersampling the far more common non-fraud cases.

# Reset the DataFrames for no fraud (`dfn`) and fraud (`dfy`)
dfn = train.filter(train.label == 0)
dfy = train.filter(train.label == 1)

# Undersample the no-fraud cases at the fraud rate, then union with the fraud cases
# (the seed here is for reproducibility)
N = train.count()
y = dfy.count()
p = y / N
train_b = dfn.sample(False, p, seed=92285).union(dfy)

print("Total count: %s, Fraud cases count: %s, Proportion of fraud cases: %s" % (N, y, p))
print("Balanced training dataset count: %s" % train_b.count())

---
# Output:
# Total count: 5090394, Fraud cases count: 204865, Proportion of fraud cases: 0.040245411258932016
# Balanced training dataset count: 401898
---
UPDATING THE PIPELINE
Now let's update the ML pipeline and create a new cross-validator. Because we are using ML pipelines, we only need to update it with the new dataset, and we can quickly repeat the same pipeline steps.

# Re-create the cross-validator and pipeline, this time fit on the balanced training set
crossval_b = CrossValidator(estimator=dt,
                            estimatorParamMaps=paramGrid,
                            evaluator=evaluatorAUC,
                            numFolds=3)
pipelineCV_b = Pipeline(stages=[indexer, va, crossval_b])
cvModel_b = pipelineCV_b.fit(train_b)

# Apply the best model to the balanced training and full test datasets
train_pred_b = cvModel_b.transform(train_b)
test_pred_b = cvModel_b.transform(test)

# Evaluate PR and AUC with evaluatorPR and evaluatorAUC, as before

---
# Output:
# PR train: 0.999629161563572
# AUC train: 0.9998071389056655
# PR test: 0.9904709171789063
# AUC test: 0.9997903902204509

REVIEW THE RESULTS
Now let's look at the results of our new confusion matrix. The model misidentified only one fraudulent case. Balancing the classes seems to have improved the model.
MODEL FEEDBACK AND USING MLFLOW
Once a model is chosen for production, we want to continuously collect
feedback to ensure that the model is still identifying the behavior of interest.
Since we are starting with a rule-based label, we want to supply future
models with verified true labels based on human feedback. This stage
is crucial for maintaining confidence and trust in the machine learning
process. Since analysts are not able to review every single case, we want to
ensure we are presenting them with carefully chosen cases to validate the
model output. For example, predictions where the model has low certainty are good candidates for analysts to review. The addition of this type of
feedback will ensure the models will continue to improve and evolve with
the changing landscape.
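To make this iteration trackable, we can log each model version with MLflow. A minimal sketch, assuming the predictions and evaluators from the previous sections (the run name, parameter, and metric keys are illustrative):

import mlflow
import mlflow.spark

# Log the balanced-training run: parameters, evaluation metrics, and the fitted model
with mlflow.start_run(run_name="fraud_dt_balanced"):
    mlflow.log_param("class_balancing", "undersampled")
    mlflow.log_metric("PR_test", evaluatorPR.evaluate(test_pred_b))
    mlflow.log_metric("AUC_test", evaluatorAUC.evaluate(test_pred_b))
    mlflow.spark.log_model(cvModel_b, "model")

Runs logged this way can be compared side by side in the MLflow UI, so the balanced and unbalanced models, and any future models retrained on analyst-verified labels, can be evaluated against a common record.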
CONCLUSION
We have reviewed an example of how to use a rule-based fraud
detection label and convert it to a machine learning model using
Databricks with MLflow. This approach allows us to build a scalable,
modular solution that will help us keep up with ever-changing
fraudulent behavior patterns. Building a machine learning model to
identify fraud lets us create a feedback loop through which the model can evolve and identify new potential fraudulent patterns. We have
seen how a decision tree model, in particular, is a great starting point
to introduce machine learning to a fraud detection program due to its
interpretability and excellent accuracy.
A major benefit of using the Databricks platform for this effort is that it allows data scientists, engineers, and business users to work together seamlessly throughout the process. Preparing the data, building
models, sharing the results, and putting the models into production
can now happen on the same platform, allowing for unprecedented
collaboration. This approach builds trust across the previously siloed
teams, leading to an effective and dynamic fraud detection program.
Try this notebook by signing up for a free trial, which takes just a few minutes, and get started creating your own models.