
Divorce Prediction System

SUMMER TRAINING/SEMINAR REPORT

Submitted by

Devansh Kapoor

179202050

BACHELOR OF TECHNOLOGY

ELECTRONICS & COMMUNICATION ENGINEERING

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING

School of Electrical, Electronics and Communication Engineering


Manipal University Jaipur

2020
Abstract

In the modern era of digitalization, where demand for artificial intelligence and machine learning is growing rapidly, companies achieve many of their goals using machine learning and are continuously trying to extend its application to the legal domain. This remains an active field of research with several outstanding open problems. One major concern in the legal domain is the prediction of divorce, and researchers have tried to address this problem with machine learning. This project aims to predict whether a divorce is likely to occur using machine learning models.

The model uses a classification algorithm (Random Forest) applied to a dataset of 54 parameters, derived from statements made by couples in front of their legal advisor. As machine learning technology matures further, such a predictor could estimate the probability of divorce quite accurately and save legal institutes a great deal of time.

Literature Review

Pros:
• It can reduce human dependence in the long run.
• As more data is provided, the model's accuracy and its efficiency in making decisions improve with subsequent training.
• Supervised and unsupervised learning algorithms can identify trends and patterns in huge amounts of data.

Cons:
• Dependence on humans cannot be completely eliminated, as machines are not yet sufficiently evolved.
• The data may contain a large volume of incorrect entries, and imbalance in the data can lead to poor model accuracy.
• A machine learning problem can be solved with many different algorithms; running models with each algorithm and identifying the most accurate one from the results is a manual, tedious task.

Working Principle
• Algorithms:
1. Decision Tree: A decision tree is a supervised learning technique that can be used for both classification and regression problems, though it is mostly preferred for classification. It is a tree-structured classifier in which internal nodes represent the features of a dataset, branches represent decision rules, and each leaf node represents an outcome. A decision tree has two kinds of nodes: decision nodes and leaf nodes. Decision nodes are used to make a decision and have multiple branches, whereas leaf nodes are the outputs of those decisions and contain no further branches. The decisions or tests are performed on the features of the given dataset, so the tree is a graphical representation of all the possible solutions to a problem under the given conditions. It is called a decision tree because, like a tree, it starts at a root node and expands through further branches into a tree-like structure. To build the tree, we use the CART algorithm, which stands for Classification And Regression Tree. A decision tree simply asks a question and, based on the answer (yes/no), splits further into subtrees. Because decision trees mimic human reasoning while deciding, and because their logic is exposed as a tree-like structure, they are easy to understand.

Example: A candidate has a job offer and wants to decide whether to accept it. To solve this problem, the decision tree starts with a root node (the Salary attribute, chosen by an attribute selection measure). The root node splits into one decision node (distance from the office) and one leaf node, based on the corresponding labels. The next decision node splits further into one decision node (cab facility) and one leaf node. Finally, that decision node splits into two leaf nodes (offer accepted and offer declined).
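
As an illustration, the following is a minimal sketch of training a decision tree classifier on the divorce dataset with scikit-learn. The file name divorce.csv, the ";" separator, and the column layout (54 questionnaire answers followed by a class label) are assumptions based on the UCI dataset cited in the references, not code taken from the report.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumed layout: 54 questionnaire answers per row, last column is the label.
data = pd.read_csv("divorce.csv", sep=";")
X = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# A CART-style tree, as described above.
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
```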
2. Random Forest: A random forest is a classifier that builds a number of decision trees on various subsets of the given dataset and aggregates their outputs to improve predictive accuracy. Instead of relying on a single decision tree, the random forest takes the prediction from each tree and predicts the final output from the majority vote of those predictions. A greater number of trees in the forest generally leads to higher accuracy and helps prevent overfitting. Random forests take less training time than many other algorithms, predict with high accuracy, run efficiently even on large datasets, and can maintain accuracy when a large proportion of the data is missing. For the classifier to predict accurate rather than guessed results, the feature variables of the dataset must contain some actual (informative) values, and the predictions of the individual trees should have very low correlation with one another.

Example: Suppose a dataset contains images of several kinds of fruit and is given to a random forest classifier. The dataset is divided into subsets, one for each decision tree. During the training phase each decision tree learns to produce a prediction result, and when a new data point arrives, the random forest predicts the final class from the majority of the trees' results.
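
A minimal sketch of this step, assuming the same X_train, y_train, and X_test arrays as in the decision tree sketch above; n_estimators=100 is an illustrative choice, not a value stated in the report.

```python
from sklearn.ensemble import RandomForestClassifier

# 100 trees, each trained on a bootstrap sample of the training set;
# the final class is the majority vote across trees.
forest = RandomForestClassifier(n_estimators=100, criterion="entropy",
                                random_state=0)
forest.fit(X_train, y_train)
y_pred = forest.predict(X_test)
```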
3. Kernel PCA: Principal Component Analysis (PCA) is a dimensionality reduction technique. We are usually surrounded by data with many variables, some of which may be correlated, and this correlation introduces redundancy in the information the dataset carries. To reduce computational cost and complexity, PCA transforms the original variables into linear combinations of those variables that are uncorrelated with one another. PCA is a powerful tool for finding patterns in data (feature extraction); as the name "principal component" suggests, it captures the directions of maximum information. It also reduces the number of dimensions without much loss of information, which helps with data reduction, noise rejection, visualization, and data compression. PCA finds an orthonormal basis for the data, sorts the dimensions in order of importance, and discards the dimensions of low significance. It is used in many applications, such as face recognition and image compression. Kernel PCA (KPCA) is the nonlinear form of PCA, which better exploits the complicated spatial structure of high-dimensional features: a kernel function implicitly defines a nonlinear transformation into a feature space in which standard PCA is performed. Despite its success and flexibility, conventional KPCA may not perform well on a large training dataset, since it imposes a high computational load and requires significant storage, because the elements used for modelling have to be saved and reused for monitoring as well.
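
A minimal sketch of the feature extraction step with scikit-learn's KernelPCA, continuing from the arrays above. The RBF kernel matches the kernel family mentioned in the conclusions, but n_components=2 is an assumption made here for illustration and visualization, not a parameter fixed by the report.

```python
from sklearn.decomposition import KernelPCA

# Nonlinear projection of the 54 questionnaire features onto 2 components.
kpca = KernelPCA(n_components=2, kernel="rbf")
X_train_kpca = kpca.fit_transform(X_train)
X_test_kpca = kpca.transform(X_test)
```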

4. Confusion Matrix: A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. Suppose there are two possible predicted classes, "yes" and "no". If we were predicting the presence of a disease, "yes" would mean a patient has the disease and "no" would mean they do not. Say the classifier made a total of 165 predictions (e.g., 165 patients were tested for the disease): out of those 165 cases, the classifier predicted "yes" 110 times and "no" 55 times, while in reality 105 patients have the disease and 60 do not. True positives (TP) are cases in which we predicted yes and the patient does have the disease. True negatives (TN) are cases in which we predicted no and the patient does not have the disease. False positives (FP) are predicted yes, but the patient does not actually have the disease (also known as a "Type I error"). False negatives (FN) are predicted no, but the patient does have the disease (also known as a "Type II error").

5. K-Fold Cross Validation: Cross-validation is a resampling procedure used to estimate the skill of a machine learning model on unseen data, that is, to use a limited sample to estimate how the model is expected to perform in general when making predictions on data not used during training. It is commonly used in applied machine learning to compare and select models for a given predictive modelling problem because it is simple to understand, easy to implement, and generally yields a less biased and less optimistic estimate of model skill than other methods such as a simple train/test split. The procedure has a single parameter, k, which refers to the number of groups a given data sample is to be split into; for this reason it is often called k-fold cross-validation. When a specific value of k is chosen, it may be used in place of k in the name of the method, so that k=10 becomes 10-fold cross-validation.
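
A minimal sketch of measuring accuracy with 10-fold cross-validation in scikit-learn, reusing the forest classifier and training arrays from the sketches above; cv=10 matches the common k=10 example mentioned here, and the report does not state which k it actually used.

```python
from sklearn.model_selection import cross_val_score

# Split the training data into 10 folds; each fold serves once as the
# validation set while the other 9 are used for training.
scores = cross_val_score(estimator=forest, X=X_train, y=y_train, cv=10)
print("Mean accuracy: {:.2f}%".format(scores.mean() * 100))
print("Std deviation: {:.2f}%".format(scores.std() * 100))
```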
• Design Methodology (a code sketch tying these steps together follows the list):

1. Importing the essential libraries
2. Importing the dataset
3. Data pre-processing: splitting the dataset into the training set and test set
4. Feature selection: Kernel PCA
5. Training the Random Forest classification model on the training set
6. Predicting the test set results
7. Creating the confusion matrix
8. Measuring the accuracy using k-fold cross-validation
9. Visualizing the training set and test set results
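
A minimal end-to-end sketch of this pipeline, combining the individual steps above into one script; the file name, separator, and parameter choices are illustrative assumptions, not values confirmed by the report.

```python
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.decomposition import KernelPCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# 1-2. Import libraries and the dataset (assumed file name and format).
data = pd.read_csv("divorce.csv", sep=";")
X, y = data.iloc[:, :-1].values, data.iloc[:, -1].values

# 3. Split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# 4. Feature selection with Kernel PCA (RBF kernel assumed).
kpca = KernelPCA(n_components=2, kernel="rbf")
X_train, X_test = kpca.fit_transform(X_train), kpca.transform(X_test)

# 5. Train the random forest on the training set.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# 6-7. Predict the test set and build the confusion matrix.
print(confusion_matrix(y_test, clf.predict(X_test)))

# 8. Measure accuracy with 10-fold cross-validation.
scores = cross_val_score(clf, X_train, y_train, cv=10)
print("CV accuracy: {:.2f}%".format(scores.mean() * 100))
```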


Conclusions and future works: Random forest and RBF classification methods were applied directly to the divorce dataset used in the study. In addition, to find the most significant features, a correlation-based feature selection method was applied together with the classification methods, and the effective features and their significance values were obtained from it. The k-fold cross-validation algorithm was used to measure the accuracy of the predictor model. When the overall results of the study are examined, the Divorce Predictor model is seen to predict divorce with 98.40% accuracy. According to the results of this project, the Divorce Prediction System (DPS) can predict divorce. This may be beneficial for ministries that have direct contact with families, such as the Ministry of Family and Social Policies, the Ministry of Education, and the Ministry of Health, which could use DPS in their screening activities. Counselling staff working on family counselling and family therapy can use this scale as a means of getting to know the individual, and scores obtained from the scale may contribute to the preparation of a case formulation and an intervention plan.

References:

1. https://colab.research.google.com/drive/1sWIdmcTHzyQNbkH3wL1QDu2agF3ae-Bc?usp=sharing
2. https://www.dataschool.io/simple-guide-to-confusion-matrix-terminology/
3. https://machinelearningmastery.com/k-fold-cross-validation/
4. https://archive.ics.uci.edu/ml/datasets/Divorce+Predictors+data+set
