This approach allows the financial institution to present
a clear set of characteristics that led to the identification
of fraud that is compliant with the General Data
Protection Regulation (GDPR). However, this approach
also poses numerous difficulties. The implementation
of the detection pattern using a hardcoded set of rules
is very brittle. Any changes to the pattern would take a
very long time to update. This, in turn, makes it difficult
to keep up with and adapt to the shift in fraudulent
behaviors that are happening in the current marketplace.
Additionally, the systems in the workflow described above are often siloed, with the domain experts, data scientists, and data engineers all compartmentalized. The data engineer is responsible for maintaining massive amounts of data and translating the work of the domain experts and data scientists into production-level code. Due to the lack of a common platform, the domain experts and data scientists have to rely on sampled-down data that fits on a single machine for analysis. This leads to difficulty in communication and ultimately a lack of collaboration.

In this eBook, we will showcase how to convert several such rule-based detection use cases to machine learning use cases on the Databricks platform, unifying the key players in fraud detection: domain experts, data scientists, and data engineers. We will learn how to create a fraud-detection data pipeline and visualize the data, leveraging a framework for building modular features from large data sets. We will also learn how to detect fraud using decision trees and Apache Spark MLlib. We will then use MLflow to iterate and refine the model to improve its accuracy.
SOLVING WITH ML
There is a certain degree of reluctance with regard to machine learning models in the financial world, as they are believed to offer a "black box" solution with no way of justifying the identified fraudulent cases. GDPR requirements, as well as financial regulations, make it seemingly impossible to leverage the power of machine learning. However, several successful use cases have shown that applying machine learning to detect fraud at scale can solve a host of the issues mentioned, starting with three challenges:

1. The lack of training labels,
2. The decision of what features to use, and
3. Having an appropriate benchmark for the model.

Training a machine learning model to recognize the rule-based fraudulent behavior flags offers a direct comparison with the expected output via a confusion matrix. Provided that the results closely match the rule-based detection pattern, this approach helps gain confidence in machine learning-based fraud detection with the skeptics. The output of this model is very easy to interpret and may serve as a baseline discussion of the expected false negatives and false positives when compared to the original detection pattern.
EXPLORING THE DATA
Creating the DataFrames – Now that we have uploaded the data to the Databricks File System (DBFS), we can quickly and easily create DataFrames using Spark SQL.
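As a minimal sketch (the table name sim_fin_fraud_detection and the column names here are illustrative of a simulated transactions dataset):

# Create the df DataFrame from the uploaded table using Spark SQL
df = spark.sql("""
  SELECT step, type, amount, nameOrig, oldbalanceOrg, newbalanceOrig,
         nameDest, oldbalanceDest, newbalanceDest
  FROM sim_fin_fraud_detection
""")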
Now that we have created the DataFrame, let's take a look at the schema and the first thousand rows to review the data.

TYPES OF TRANSACTIONS
Let's visualize the data to understand the types of transactions the data captures and their contribution to the overall transaction volume.
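One way to build such a view, as a sketch using the df DataFrame from above (display renders the aggregate as a chart in a Databricks notebook):

# Aggregate transaction count and total dollar amount by transaction type
from pyspark.sql import functions as F
display(df.groupBy("type").agg(F.count("*").alias("count"), F.sum("amount").alias("total_amount")))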
RULES-BASED MODEL
We are not likely to start with a large data set of known fraud cases to train our model. In most practical applications, fraud detection patterns are identified by a set of rules established by the domain experts. Here, we create a column called label based on these rules.
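A sketch of how such a label column can be derived (the specific conditions and thresholds below are placeholders for the domain experts' actual rules):

from pyspark.sql import functions as F

# Flag a transaction as fraudulent (label = 1) when it matches the experts' rules
# (the condition and threshold here are illustrative placeholders)
df = df.withColumn("label",
    F.when((F.col("type") == "TRANSFER") & (F.col("amount") > 200000), 1)
     .otherwise(0))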
VISUALIZING DATA FLAGGED BY RULES
These rules often flag quite a large number of fraudulent cases. Let's visualize the number of flagged transactions. We can see that the rules flag about 4% of the cases and 11% of the total dollar amount as fraudulent.

After this ETL process is completed, you can use the display command again to review the cleansed data in a scatterplot.
Selecting the Appropriate Machine Learning Models
In many cases, a black box approach to fraud detection cannot be used. First, the domain experts need to be able to understand why a transaction was identified as fraudulent. Then, if action is to be taken, the evidence has to be presented in court. The decision tree is an easily interpretable model and is a great starting point for this use case. Read the blog "The wise old tree" on decision trees to learn more.

CREATING THE TRAINING SET
To build and validate our ML model, we will do an 80/20 split using .randomSplit. This will set aside a randomly chosen 80% of the data for training and the remaining 20% to validate the results.
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

# Index the categorical transaction type and assemble the feature vector
# (the feature columns follow the transaction schema used above)
indexer = StringIndexer(inputCol="type", outputCol="typeIndexed")
va = VectorAssembler(inputCols=["typeIndexed", "amount", "oldbalanceOrg", "newbalanceOrig", "oldbalanceDest", "newbalanceDest"], outputCol="features")
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features", seed=54321, maxDepth=5)

# Fit the pipeline on the training set and visualize the fitted tree (the last stage)
dt_model = Pipeline(stages=[indexer, va, dt]).fit(train)
display(dt_model.stages[-1])
MODEL TUNING
To ensure we have the best fitting tree model, we will cross-validate the model with several parameter variations. Given that our data consists of 96% negative and 4% positive cases, we will use the Precision-Recall (PR) evaluation metric to account for the unbalanced distribution.

from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Evaluators for area under the PR and ROC curves
evaluatorPR = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="prediction", metricName="areaUnderPR")
evaluatorAUC = BinaryClassificationEvaluator(labelCol="label", rawPredictionCol="prediction", metricName="areaUnderROC")

# Build the grid of different parameters
paramGrid = ParamGridBuilder() \
    .addGrid(dt.maxDepth, [5, 10, 15]) \
    .addGrid(dt.maxBins, [10, 20, 30]) \
    .build()

# Build out the cross-validation
crossval = CrossValidator(estimator=dt,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluatorPR,
                          numFolds=3)

# Build the CV pipeline
pipelineCV = Pipeline(stages=[indexer, va, crossval])

# Train the model using the pipeline, parameter grid, and preceding BinaryClassificationEvaluator
cvModel_u = pipelineCV.fit(train)

MODEL PERFORMANCE
We evaluate the model by comparing the Precision-Recall (PR) and Area under the ROC curve (AUC) metrics for the training and test sets. Both PR and AUC appear to be very high.

# Apply the best model to the training and test datasets
train_pred = cvModel_u.transform(train)
test_pred = cvModel_u.transform(test)

# Evaluate the model on the training dataset
pr_train = evaluatorPR.evaluate(train_pred)
auc_train = evaluatorAUC.evaluate(train_pred)

# Evaluate the model on the test dataset
pr_test = evaluatorPR.evaluate(test_pred)
auc_test = evaluatorAUC.evaluate(test_pred)

# Print out the PR and AUC values
print("PR train:", pr_train)
print("AUC train:", auc_train)
print("PR test:", pr_test)
print("AUC test:", auc_test)

---
# Output:
# PR train: 0.9537894984523128
# AUC train: 0.998647996459481
# PR test: 0.9539170535377599
# AUC test: 0.9984378183482442
To see which results the model misclassified, let's use matplotlib and pandas to visualize our confusion matrix.
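A minimal sketch of one way to do this, assuming the test predictions from above (the pivot-based approach and plotting details here are illustrative):

import matplotlib.pyplot as plt
import numpy as np

# Count (label, prediction) pairs and pivot them into a 2x2 pandas DataFrame
cm_pd = (test_pred.groupBy("label")
         .pivot("prediction", [0.0, 1.0])
         .count()
         .orderBy("label")
         .fillna(0)
         .toPandas()
         .set_index("label"))

# Plot the confusion matrix, annotating each cell with its count
fig, ax = plt.subplots()
ax.matshow(cm_pd.values, cmap="Blues")
for (i, j), v in np.ndenumerate(cm_pd.values):
    ax.text(j, i, "{:,}".format(int(v)), ha="center", va="center")
ax.set_xlabel("Predicted")
ax.set_ylabel("Actual")
display(fig)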
BALANCING THE CLASSES
To improve the model further, we can balance the training set by undersampling the far more common non-fraud cases.

# Reset the DataFrames for no fraud (`dfn`) and fraud (`dfy`)
dfn = train.filter(train.label == 0)
dfy = train.filter(train.label == 1)

# Undersample the no-fraud cases at the fraud rate, then union with the fraud cases
# (the seed here is for reproducibility)
N = train.count()
y = dfy.count()
p = y / N
train_b = dfn.sample(False, p, seed=92285).union(dfy)

print("Total count: %s, Fraud cases count: %s, Proportion of fraud cases: %s" % (N, y, p))
print("Balanced training dataset count: %s" % train_b.count())

---
# Output:
# Total count: 5090394, Fraud cases count: 204865, Proportion of fraud cases: 0.040245411258932016
# Balanced training dataset count: 401898
---
UPDATING THE PIPELINE
Now let's update the ML pipeline and create a new cross-validator. Because we are using ML pipelines, we only need to update it with the new dataset, and we can quickly repeat the same pipeline steps.

# Re-create the cross-validator and pipeline, this time fit on the balanced training set
crossval_b = CrossValidator(estimator=dt,
                            estimatorParamMaps=paramGrid,
                            evaluator=evaluatorAUC,
                            numFolds=3)
pipelineCV_b = Pipeline(stages=[indexer, va, crossval_b])
cvModel_b = pipelineCV_b.fit(train_b)

# Apply the best model to the balanced training and full test datasets
train_pred_b = cvModel_b.transform(train_b)
test_pred_b = cvModel_b.transform(test)

# Evaluate PR and AUC with evaluatorPR and evaluatorAUC, as before

---
# Output:
# PR train: 0.999629161563572
# AUC train: 0.9998071389056655
# PR test: 0.9904709171789063
# AUC test: 0.9997903902204509

REVIEW THE RESULTS
Now let's look at the results of our new confusion matrix. The model misidentified only one fraudulent case. Balancing the classes seems to have improved the model.
MODEL FEEDBACK AND USING MLFLOW
Once a model is chosen for production, we want to continuously collect
feedback to ensure that the model is still identifying the behavior of interest.
Since we are starting with a rule-based label, we want to supply future
models with verified true labels based on human feedback. This stage
is crucial for maintaining confidence and trust in the machine learning
process. Since analysts are not able to review every single case, we want to
ensure we are presenting them with carefully chosen cases to validate the
model output. For example, predictions where the model has low certainty are good candidates for analysts to review. The addition of this type of
feedback will ensure the models will continue to improve and evolve with
the changing landscape.
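To make this iteration trackable, we can log each model version with MLflow. A minimal sketch, assuming the predictions and evaluators from the previous sections (the run name, parameter, and metric keys are illustrative):

import mlflow
import mlflow.spark

# Log the balanced-training run: parameters, evaluation metrics, and the fitted model
with mlflow.start_run(run_name="fraud_dt_balanced"):
    mlflow.log_param("class_balancing", "undersampled")
    mlflow.log_metric("PR_test", evaluatorPR.evaluate(test_pred_b))
    mlflow.log_metric("AUC_test", evaluatorAUC.evaluate(test_pred_b))
    mlflow.spark.log_model(cvModel_b, "model")

Runs logged this way can be compared side by side in the MLflow UI, so the balanced and unbalanced models, and any future models retrained on analyst-verified labels, can be evaluated against a common record.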
CONCLUSION
We have reviewed an example of how to use a rule-based fraud
detection label and convert it to a machine learning model using
Databricks with MLflow. This approach allows us to build a scalable,
modular solution that will help us keep up with ever-changing
fraudulent behavior patterns. Building a machine learning model to
identify fraud lets us create a feedback loop through which the model can evolve and identify new potential fraudulent patterns. We have
seen how a decision tree model, in particular, is a great starting point
to introduce machine learning to a fraud detection program due to its
interpretability and excellent accuracy.
A major benefit of using the Databricks platform for this effort is that it allows data scientists, engineers, and business users to work together seamlessly throughout the process. Preparing the data, building
models, sharing the results, and putting the models into production
can now happen on the same platform, allowing for unprecedented
collaboration. This approach builds trust across the previously siloed
teams, leading to an effective and dynamic fraud detection program.
Try this notebook by signing up for a free trial, which takes just a few minutes, and get started creating your own models.