Model Building for Data Analytics
Last Updated: 29 May, 2023
Prerequisite - Life Cycle Phases of Data Analytics
After formulating the problem and preprocessing the data accordingly, we select the type of model to build. For example, if the problem requires the result to be highly explainable, we use models such as linear regression or a decision tree; if it instead requires higher accuracy, we build models such as XGBoost or a deep neural network.
Model Building In Data Analytics
Model building is an essential part of data analytics and is used to extract insights and knowledge from the data to make business decisions and strategies. In this phase of the project, the data science team needs to develop data sets for training, testing, and production purposes. These data sets enable data scientists to develop an analytical method and train it while holding aside some of the data for testing the model. Model building in data analytics is aimed at achieving not only high accuracy on the training data but also the ability to generalize and perform well on new, unseen data. Therefore, the focus is on creating a model that can capture the underlying patterns and relationships in the data, rather than simply memorizing the training data.
To do this, we divide our dataset into two parts:
- Training dataset
- Test dataset
Note: Depending on the quality and quantity of the data, one may instead choose to divide the dataset into three parts: training, validation, and testing data. A minimal sketch of such a three-way split is shown below.
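As a minimal sketch of such a three-way split (the 60/20/20 ratio and variable names here are our own assumptions, not something fixed by this article), train_test_split can simply be called twice:
Python3
from sklearn.model_selection import train_test_split
import numpy as np

# illustrative random features and target (placeholders for a real dataset)
X = np.random.rand(1000, 2)
y = np.random.rand(1000)

# first hold out 20% of the rows as the test set
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.2)

# then keep 25% of the remainder (20% of the whole) as the validation set
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval,
                                                  test_size=0.25)

print(len(X_train), len(X_val), len(X_test))  # 600 200 200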
Dividing The Dataset For Model Building
To divide the dataset we will use the Python sklearn library, which helps us split the dataset into training and testing sets. Here we choose the ratio in which we want to divide the dataset; by default, train_test_split uses a 3:1 ratio for training and testing (test_size=0.25).
Python code for creating and dividing the dataset
We will first create a random array with 2 columns and 1000 rows and convert it into a dataframe using pandas. After that, we will separate the dataframe into independent variables (X) and a dependent variable (y) and use the sklearn package to divide them into train and test datasets.
Python3
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd

# create 1000 random rows with two feature columns
data = np.random.randint(low=10, high=100,
                         size=2000).reshape(1000, 2)
data = pd.DataFrame(data, columns=['x', 'y'])

# independent variables (features) and a dependent variable (target)
X = data[['x', 'y']]
y = np.random.rand(1000)

# hold out 25% of the rows for testing
train_data_x, test_data_x, train_data_y, test_data_y = \
    train_test_split(X, y, test_size=0.25)
Scaling The Dataset
Scaling the dataset is an important preprocessing step before feeding the data to the model. There are several benefits of scaling the data:
- It prevents features with different scales from dominating the model. For example, suppose column A has data ranging from 1 to 1000 and column B has data ranging from 0 to 1; in that case, column A can influence the model's decisions even if it is not an important feature. After scaling, all columns come into a similar range.
- It speeds up model convergence. Many optimization algorithms, such as gradient descent, are very sensitive to the scale of the data. By scaling the data between 0 and 1, these algorithms converge faster.
[Figure: Effect of scaling on gradient descent]
- Scaling the dataset makes our model more robust to outliers.
- Some algorithms, such as K-nearest neighbors (KNN), use the distance between data points to make predictions; if the columns have very different scales, the features with the largest ranges dominate that distance, as the short sketch after this list shows.
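As a small illustration of the distance point above (the numbers here are made up purely for demonstration), an unscaled large-range column dominates the Euclidean distance between two points, while min-max scaling balances the contributions:
Python3
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# column A ranges roughly 1-1000, column B ranges 0-1
points = np.array([[1000.0, 0.2],
                   [10.0, 0.9]])

# before scaling, the distance is driven almost entirely by column A
print(np.linalg.norm(points[0] - points[1]))  # ~990.0

# after min-max scaling, both columns contribute comparably
points_scaled = MinMaxScaler().fit_transform(points)
print(np.linalg.norm(points_scaled[0] - points_scaled[1]))  # ~1.41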
Python code for scaling the columns
We will use the MinMaxScaler object from the sklearn library to scale the independent features of the dataset.
Python3
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
# fit the scaler on the training features only, then apply it to both sets
train_data_x_scaled = scaler.fit_transform(train_data_x.to_numpy())
test_data_x_scaled = scaler.transform(test_data_x.to_numpy())
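MinMaxScaler rescales every feature into the 0 to 1 range. If zero-mean, unit-variance features are preferred instead, sklearn's StandardScaler can be used as a drop-in replacement for the block above (this alternative is only a sketch, not part of the original pipeline):
Python3
from sklearn.preprocessing import StandardScaler

# drop-in alternative to MinMaxScaler: standardize to zero mean, unit variance
scaler = StandardScaler()
train_data_x_scaled = scaler.fit_transform(train_data_x.to_numpy())
test_data_x_scaled = scaler.transform(test_data_x.to_numpy())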
Modeling The Data
After scaling and splitting, the data is ready to be fitted to a model. The choice of model depends entirely on our problem formulation. There are a variety of models to choose from; however, before choosing one we should first identify these points about the data:
- Whether our problem is a regression problem or a classification problem (a rough way to check this from the target values is sketched after this list)
- Whether we want a model that is more explainable or one that has higher accuracy
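As a rough way to check the first point (this heuristic and its threshold are our own assumption, not a standard rule), we can inspect the target values: a numeric target with many distinct values usually indicates regression, while a small set of discrete labels indicates classification:
Python3
import pandas as pd

target = pd.Series(train_data_y)

# numeric target with many distinct values -> likely regression
if pd.api.types.is_numeric_dtype(target) and target.nunique() > 20:
    print("continuous target -> treat as a regression problem")
else:
    print("few discrete values -> treat as a classification problem")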
Python code for modeling the data
Since our target value is continuous, we will treat this as a regression problem. To keep the model simple and explainable, we will use a decision tree.
Python3
from sklearn.tree import DecisionTreeRegressor, plot_tree

# a small, constrained tree keeps the model easy to visualize
reg = DecisionTreeRegressor(min_samples_split=4,
                            max_leaf_nodes=10)
reg.fit(train_data_x_scaled, train_data_y)

# predict on the scaled test features
y_pred = reg.predict(test_data_x_scaled)
After building the model, we evaluate it using an evaluation metric. In our case, we will use the mean squared error to measure how well the model performs.
Python code for evaluation
Python3
from sklearn.metrics import mean_squared_error

# compare the true test targets with the model's predictions
print(mean_squared_error(test_data_y, y_pred))
Output:
0.109
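Mean squared error is only one possible metric; for a fuller picture, other common regression metrics from sklearn.metrics can be computed on the same predictions (an optional sketch):
Python3
from sklearn.metrics import mean_absolute_error, r2_score

# both functions expect (y_true, y_pred)
print(mean_absolute_error(test_data_y, y_pred))
print(r2_score(test_data_y, y_pred))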
Since our dataset was randomly generated, a mean squared error of 0.109 is not bad. One good thing about the decision tree is that we can also see the decisions that were made to model the data.
Plotting The Decision Graph
We can use the plot_tree function from the sklearn library to visualize the basis on which the decisions were made.
Python3
import matplotlib.pyplot as plt

# draw the fitted tree; a high dpi keeps the small node text readable
fig, axes = plt.subplots(nrows=1, ncols=1, figsize=(4, 4), dpi=800)
plot_tree(reg, filled=True, ax=axes, fontsize=2)
plt.show()
Output:
Decision tree diagram for the model
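If a rendered figure is not needed, the same fitted tree can also be inspected as plain text using sklearn's export_text helper (shown here as an optional extra, not part of the original walkthrough):
Python3
from sklearn.tree import export_text

# print the learned split rules as indented text
print(export_text(reg, feature_names=['x', 'y']))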