INDUSTRIAL TRAINING
REPORT
Submitted in partial fulfillment of the
Requirements for the award of the degree
of
Bachelor of Technology
in
Artificial Intelligence & Data Science
By:
PRAKHAR SHARMA (50113211921/AIDS/21)
Department of Artificial Intelligence & Data Science
Guru Tegh Bahadur Institute of Technology
Guru Gobind Singh Indraprastha University
Dwarka, New Delhi
Year 2021-2025
MACHINE LEARNING INTERN
Duration
5th August 2023 – 5th September 2023
By:
Prakhar Sharma (50113211921/AIDS/2021)
At
Prodigy InfoTech
DECLARATION
I declare that the content presented in this Industrial Training Report, which contributes to the
fulfillment of the requirements for the Bachelor of Technology degree in Artificial
Intelligence & Data Science at Guru Tegh Bahadur Institute of Technology, affiliated to Guru
Gobind Singh Indraprastha University Delhi, represents an authentic account of my
independent work during the internship at Prodigy Infotech. This endeavor took place from
August 5, 2023, to September 5, 2023.
Date: 2 December 2023
Prakhar Sharma (50113211921/AIDS/2021)
CERTIFICATE
PRAKHAR SHARMA
ACKNOWLEDGEMENT
I would like to express my sincere gratitude to Ms. Nandini, who provided constant
support and suggestions. Without her help, I could not have brought this work up to its
present standard. I also take this opportunity to thank everyone else who supported me
during the project and in other aspects of my studies at Guru Tegh Bahadur
Institute of Technology.
Date: December 2023
Prakhar Sharma (50113211921/AIDS/2021)
prakharsharma479@[Link]
ABSTRACT
At Prodigy Infotech, I served as a Machine Learning Intern. The internship revolved around
two impactful projects: "House Prices - Advanced Regression Techniques" and "Mall
Customer Segmentation Data Market Basket Analysis." These projects not only enriched my
understanding of machine learning concepts but also provided valuable insights into their
practical applications in the industry.
In the first project, "House Prices - Advanced Regression Techniques," the objective was to
predict sales prices by employing advanced regression techniques. The project involved a
multifaceted approach, including feature engineering, Random Forests (RFs), and gradient
boosting. The application of these techniques allowed for a nuanced understanding of the
factors influencing house prices and provided a hands-on experience in optimizing predictive
models.
The second project, "Mall Customer Segmentation Data Market Basket Analysis," delved
into the realm of customer behavior analysis using market basket analysis techniques. The
goal was to extract meaningful insights into customer segmentation based on their purchasing
patterns. By employing advanced analytics, the project aimed to enhance marketing strategies
and optimize business operations for improved customer satisfaction.
Throughout the internship, I was exposed to the challenges and intricacies of real-world
machine learning applications. The hands-on experience not only strengthened my technical
skills but also honed my ability to collaborate effectively within a professional team
environment. The projects required a combination of domain knowledge, data preprocessing,
and model optimization, highlighting the interdisciplinary nature of machine learning in
solving practical business problems.
This internship at Prodigy Infotech provided a holistic understanding of machine learning's
application in addressing real-world challenges. The projects not only added significant value
to the organization but also equipped me with invaluable skills and experiences that will
undoubtedly shape my future endeavors in the field of data science and machine learning.
LIST OF FIGURES AND TABLES
Fig No Figure Name Page
1 PRODIGY INFOTECH LOGO 1
CONTENTS
Chapter Page No.
Title Page i
Declaration and Certificate ii
Acknowledgement iii
Abstract iv
Tables and figures v
1. Introduction 1
1.1 About Prodigy Infotech 1
1.2 Services 2
2. Results and observations 4
3. Conclusions 6
4. References 9
5. Appendix 11
INTRODUCTION
1.1 About Prodigy InfoTech
The company is dedicated to providing students with valuable work experience,
offering them the opportunity to gain practical skills and knowledge that can
significantly benefit their future careers. The internships are meticulously designed to
simulate real-world work experiences, providing students with the chance to engage in
projects and assignments that are directly relevant to their chosen fields of study.
The mission of the organization is to create innovative and accessible learning solutions
that empower individuals of all ages and backgrounds to realize their full potential.
Whether individuals are students seeking academic improvement, professionals aiming
to upskill, or organizations looking to enhance employee training, the company
provides the necessary tools and resources for success.
Prospective participants are invited to embark on an innovative and dynamic learning
experience that will aid them in achieving their goals and unlocking their full potential.
The company envisions a transformative journey where, together, they can
revolutionize the way people learn, ultimately contributing to the creation of a better
future for all.
1.2 Services
Exploring Opportunities: A Comprehensive Overview of Internship Offerings at
Prodigy Infotech
Prodigy Infotech takes pride in offering a diverse range of internship opportunities across
various fields, providing students with hands-on experience and insights into their chosen
industries.
These internships are meticulously crafted to simulate real-world scenarios, allowing interns
to actively contribute to projects, collaborate with experienced teams, and gain practical
knowledge that goes beyond the classroom setting.
Machine Learning Internship:
Overview:
The Machine Learning Internship at Prodigy Infotech is a comprehensive program that covers
various facets of machine learning. Interns are immersed in a structured curriculum that
includes Data Analysis, Supervised Learning, Unsupervised Learning, and Deep Learning.
This hands-on experience equips interns with the skills needed to navigate the intricacies of
machine learning and apply them to real-world challenges.
For:
Enthusiastic individuals seeking to delve into the realm of machine learning are encouraged
to apply. This internship promises a dynamic learning environment where participants can
actively engage with cutting-edge technologies and contribute to innovative projects.
Web Development Internship:
Overview:
The Web Development Internship at Prodigy Infotech caters to those aspiring to become
proficient in web development. The program covers HTML5 & CSS3, JavaScript, Responsive
Website Design, and Web Applications. Interns gain practical experience in developing
websites and applications, honing their skills in both front-end and back-end development.
For:
Individuals passionate about creating visually appealing and functional web solutions are
invited to apply. This internship offers practical, sought-after experience in the
ever-evolving field of web development.
Data Science Internship:
Overview:
The Data Science Internship at Prodigy Infotech provides a comprehensive exploration of the
field. Interns engage in Exploratory Data Analysis (EDA), Data Pre-processing, Data
Visualization, and gain exposure to Business Intelligence (BI) tools. This internship equips
participants with the skills required for meaningful data analysis and interpretation.
For:
Aspiring data scientists looking to enhance their analytical capabilities are encouraged to
apply. This internship offers a unique opportunity to work with real-world datasets and
develop a profound understanding of the data science workflow.
Android Development Internship:
Overview:
The Android Development Internship at Prodigy Infotech is designed for those interested in
mobile app development. Interns delve into Kotlin programming, creating Simple Apps,
Advanced Apps, and Cloud Apps. This program provides practical experience in building
scalable and innovative Android applications.
For:
Individuals passionate about mobile app development are invited to apply. This internship
promises exposure to the latest trends in Android development and an opportunity to
contribute to the creation of diverse and functional mobile applications.
Internships provide a platform for aspiring professionals to bridge the gap between
theoretical knowledge and practical application.
RESULTS AND OBSERVATIONS
Project 1: House Prices - Advanced Regression Techniques
Predictive Model Development:
Implemented advanced regression techniques, including Linear Regression, Random Forests
(RFs), and Gradient Boosting.
Utilized feature engineering to enhance the predictive power of the model.
Experimented with various algorithms to identify the most effective approach for predicting
house prices.
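As a minimal sketch (not the exact notebook code), the three regression families named above can be compared on a common validation split with scikit-learn; the file path and feature list below are illustrative placeholders.

import pandas as pd
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

df = pd.read_csv("train.csv")                        # assumed local copy of the Kaggle training file
X = df[["OverallQual", "GrLivArea", "GarageCars"]]   # illustrative numeric features
y = df["SalePrice"]
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

models = {
    "Linear Regression": LinearRegression(),
    "Random Forest": RandomForestRegressor(random_state=0),
    "Gradient Boosting": GradientBoostingRegressor(random_state=0),
}
for name, model in models.items():
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_val, model.predict(X_val))
    print(f"{name}: validation MSE = {mse:.2f}")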
Model Evaluation and Optimization:
Conducted thorough model evaluations using metrics such as Mean Squared Error (MSE)
and R-squared.
Employed cross-validation techniques to ensure the robustness of the models.
Optimized hyperparameters to enhance the overall performance of the predictive models.
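A minimal sketch of this evaluation workflow, assuming the X_train and y_train arrays from the previous sketch; the parameter grid here is illustrative, not the exact one used during the internship.

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, cross_val_score

rf = RandomForestRegressor(random_state=0)

# 5-fold cross-validated R-squared for the untuned model
scores = cross_val_score(rf, X_train, y_train, cv=5, scoring="r2")
print("Mean R-squared:", scores.mean())

# small grid search over two hyperparameters, scored by negative MSE
grid = GridSearchCV(
    rf,
    param_grid={"n_estimators": [100, 300], "max_depth": [None, 10, 20]},
    cv=5,
    scoring="neg_mean_squared_error",
)
grid.fit(X_train, y_train)
print("Best parameters:", grid.best_params_)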
Insights and Interpretations:
Extracted valuable insights into the factors influencing house prices.
Identified key features that significantly impact the accuracy of predictions.
Provided actionable recommendations based on the model's findings for potential
improvements in the real estate domain.
Results:
After thorough experimentation and tuning, the best-performing model was linear regression,
with an RMSE of 50045.870 on the test data. Detailed results and insights can be found in the
notebooks. [Link]
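For reference, an RMSE figure like the one above is computed from held-out predictions as in the sketch below; this is a generic illustration using the names from the earlier sketches, not the notebook's exact code.

import numpy as np
from sklearn.metrics import mean_squared_error

y_val_pred = model.predict(X_val)                      # any fitted regressor from the sketches above
rmse = np.sqrt(mean_squared_error(y_val, y_val_pred))
print(f"RMSE: {rmse:.3f}")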
Project 2: Mall Customer Segmentation Data
Market Basket Analysis:
Conducted thorough exploratory data analysis (EDA) to understand patterns in customer
behavior.
Implemented market basket analysis techniques to identify associations between products and
customer segments.
Uncovered meaningful relationships and purchasing patterns that informed strategic decision-making.
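The appendix code for this project focuses on clustering; purely as an illustration of the market basket technique named here, the sketch below mines association rules with the Apriori algorithm, assuming a small one-hot encoded basket table and the third-party mlxtend library (not part of the internship deliverables).

import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# toy one-hot basket table: one row per transaction, one boolean column per product
transactions = pd.DataFrame(
    [[1, 1, 0], [1, 0, 1], [1, 1, 1], [0, 1, 1]],
    columns=["bread", "milk", "butter"],
).astype(bool)

frequent_itemsets = apriori(transactions, min_support=0.5, use_colnames=True)
rules = association_rules(frequent_itemsets, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])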
Customer Segmentation:
Utilized clustering algorithms to segment customers based on their buying behavior.
Developed comprehensive customer profiles, allowing for targeted marketing strategies.
Provided insights into the preferences and characteristics of different customer segments.
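A minimal sketch of profiling the resulting segments, assuming a DataFrame df with the Kaggle mall columns and a hypothetical 'Cluster' column holding labels from a clustering model such as the K-means fit in Appendix B:

profile_cols = ["Age", "Annual Income (k$)", "Spending Score (1-100)"]
profile = df.groupby("Cluster")[profile_cols].mean()   # average profile of each segment
print(profile)
print(df["Cluster"].value_counts())                    # segment sizes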
Business Impact and Recommendations:
Discussed the practical implications of the analysis for marketing and sales strategies.
Offered actionable recommendations for optimizing product placements, promotions, and
customer engagement.
Demonstrated how market basket analysis can be a valuable tool for enhancing the overall
shopping experience and increasing revenue.
Results:
After thorough experimentation and tuning, the best-performing model was the KNN classifier,
with an accuracy of 0.925 on the test data. Detailed results and insights can be found in the
notebooks. [Link]
SUMMARY & CONCLUSIONS
Mall Customer Segmentation Data Analysis
This project focuses on understanding and segmenting customers based on their shopping
behavior using a dataset from Kaggle.
Customer segmentation is a crucial strategy in marketing and business. This project aims to
analyze customer data from a mall and group customers into distinct segments based on their
purchasing patterns and demographics. By understanding these segments, businesses can
tailor their marketing strategies to specific customer groups.
Dataset
The dataset used in this project is the Mall Customer Segmentation Data from Kaggle. It
contains information about customers, including their age, gender, annual income, and
spending score.
1. Explore the distribution of variables such as age, annual income, and spending score.
2. Analyze relationships between variables using scatter plots, histograms, and correlation analysis.
3. Identify potential outliers and understand their impact on the analysis.
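A minimal sketch of these three EDA steps on the mall dataset (the file name is assumed to be a local copy of the Kaggle CSV):

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("Mall_Customers.csv")   # assumed local copy of the Kaggle file

# 1. distributions of age, annual income, and spending score
df[["Age", "Annual Income (k$)", "Spending Score (1-100)"]].hist(bins=20)
plt.show()

# 2. relationships between variables
sns.scatterplot(data=df, x="Annual Income (k$)", y="Spending Score (1-100)", hue="Gender")
plt.show()
sns.heatmap(df.select_dtypes("number").corr(), annot=True)
plt.show()

# 3. outlier check with box plots
sns.boxplot(data=df[["Age", "Annual Income (k$)", "Spending Score (1-100)"]])
plt.show()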
Customer Segmentation
Utilize unsupervised learning techniques like K-means clustering to segment customers.
Determine the optimal number of clusters using techniques such as the elbow method.
Visualize customer segments using scatter plots or other relevant visualizations.
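A minimal sketch of this step, mirroring the approach in Appendix B and assuming df is loaded as in the previous sketch:

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = df[["Annual Income (k$)", "Spending Score (1-100)"]].values

# elbow method: inertia (within-cluster sum of squares) for k = 1..10
inertias = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in range(1, 11)]
plt.plot(range(1, 11), inertias, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("Inertia")
plt.show()

# fit the chosen number of clusters and visualise the segments
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(X)
plt.scatter(X[:, 0], X[:, 1], c=labels)
plt.xlabel("Annual Income (k$)")
plt.ylabel("Spending Score (1-100)")
plt.show()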
Insights
The project aims to uncover insights such as:
1. Characteristics of different customer segments (e.g., high-income, low-spending
customers).
2. How age and gender correlate with spending behavior.
3. Strategies that can be employed to target different customer segments effectively.
Housing Prices Prediction Using Random Forest
The project is based on a dataset from Kaggle, specifically the Housing Prices Competition
dataset.
Predicting housing prices is a fundamental problem in the field of machine learning and data
science. This project aims to explore various regression techniques to predict housing prices
using features provided in the dataset. By experimenting with different algorithms and
techniques, we aim to find the most accurate model for this specific problem.
Dataset
The dataset used in this project is the Housing Prices Competition dataset from Kaggle. It
contains a comprehensive set of features related to residential properties. The dataset includes
both training and testing data, with corresponding target values.
Explore the Jupyter notebooks in the notebooks/ directory to see the step-by-step process of
data preprocessing, feature engineering, model training, and evaluation.
Techniques
The following techniques are implemented in this project:
1. Random Forest
2. KNNImputer
3. Heatmap (correlation graph)
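A minimal sketch combining these three techniques on the housing data, assuming df is the training DataFrame from the Kaggle competition (feature handling is simplified relative to the appendix code):

import pandas as pd
import seaborn as sns
from sklearn.impute import KNNImputer
from sklearn.ensemble import RandomForestRegressor

numeric = df.select_dtypes("number")
features = numeric.drop(columns=["SalePrice"])
target = numeric["SalePrice"]

# 2. KNNImputer fills missing values using nearest-neighbour averages
features = pd.DataFrame(KNNImputer().fit_transform(features), columns=features.columns)

# 3. heatmap of the correlations between the imputed features
sns.heatmap(features.corr())

# 1. Random Forest regressor trained on the imputed features
rf = RandomForestRegressor(random_state=0).fit(features, target)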
REFERENCES
House Prices - Advanced Regression Techniques:
Kaggle. (n.d.). House Prices: Advanced Regression Techniques. Retrieved from:
[Link]
Brownlee, J. (2016). Feature Engineering for Machine Learning: A Comprehensive
Overview. Retrieved from: [Link]
engineering-how-to-engineer-features-and-how-to-get-good-at-it/
Chen, T., & Guestrin, C. (2016). XGBoost: A Scalable Tree Boosting System. In
Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining.
Friedman, J. H. (2001). Greedy Function Approximation: A Gradient Boosting Machine.
The Annals of Statistics, 29(5), 1189-1232.
Liaw, A., & Wiener, M. (2002). Classification and Regression by randomForest. R News,
2(3), 18-22.
Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of
Machine Learning Research, 12, 2825-2830.
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2013). An Introduction to Statistical
Learning. Springer.
Mall Customer Segmentation Data - Market Basket Analysis:
Kaggle. (n.d.). Mall Customer Segmentation Data. Retrieved from:
[Link]
Agrawal, R., Imielinski, T., & Swami, A. (1993). Mining Association Rules Between
Sets of Items in Large Databases. In Proceedings of the 1993 ACM SIGMOD
International Conference on Management of Data.
Tan, P. N., Steinbach, M., & Kumar, V. (2005). Introduction to Data Mining. Pearson.
Ransbotham, S., & Kiron, D. (2017). Analytics as a Source of Business Innovation. MIT
Sloan Management Review, 58(1), 1-14.
Berry, M. J. A., & Linoff, G. (2004). Data Mining Techniques: For Marketing, Sales, and
Customer Relationship Management. John Wiley & Sons.
Fournier-Viger, P., et al. (2016). A Survey of Sequential Pattern Mining. Data Science
and Pattern Recognition, 3(2), 54-77.
Jain, P., et al. (2017). A Comprehensive Review on Apriori Algorithm and its
Improvements. International Journal of Computer Applications, 173(8), 43-47.
Turban, E., et al. (2015). Data Mining for Business Intelligence: Concepts, Techniques,
and Applications in Microsoft Office Excel with XLMiner. Wiley.
General Machine Learning and Data Science References:
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
VanderPlas, J. (2016). Python Data Science Handbook. O'Reilly Media.
McKinney, W. (2018). Python for Data Analysis. O'Reilly Media.
Müller, A. C., & Guido, S. (2016). Introduction to Machine Learning with Python: A
Guide for Data Scientists. O'Reilly Media.
APPENDIX A
(Screenshots Results)
House Prices - Advanced Regression Techniques:
(Screenshots of the notebook results are not reproduced in this text version.)
Mall Customer Segmentation Data - Market Basket Analysis:
(Screenshots of the notebook results are not reproduced in this text version.)
APPENDIX B
(Source Code)
House Prices - Advanced Regression Techniques:
#importing all the necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# %% [markdown]
# **Import the training data**
df = pd.read_csv(r"/kaggle/input/house-prices-advanced-regression-techniques/train.csv")
df.head()
# let's check the information and columns of this data
df.info()
df2 = pd.read_csv(r"/kaggle/input/house-prices-advanced-regression-techniques/sample_submission.csv")
df2
# as we can observe:
# 72  PoolQC       7 non-null
# 73  Fence        281 non-null
# 74  MiscFeature  54 non-null
# these features have almost 90 percent null values, so we can remove them.
# (kept commented out because the columns have already been dropped and cannot be dropped again)
# df = df.drop('PoolQC', axis=1)
# df = df.drop('Fence', axis=1)
# df = df.drop('MiscFeature', axis=1)
# df = df.drop('Alley', axis=1)
df.info()
# %% [markdown]
# ## Feature Selection
# check the correlation between the target column and all other columns to see their significance
df[df.columns[1:]].corr()['SalePrice'][:]
print("The important features are :\n")
dfc = df[['Id','OverallQual','GrLivArea','GarageCars','GarageArea','TotalBsmtSF','1stFlrSF','FullBath',
          'TotRmsAbvGrd','YearBuilt','LotArea','SalePrice']]
dfc
plt.figure(figsize=(16, 5))
sns.heatmap(dfc.corr(), annot=True)
plt.hist(df['SalePrice'], bins=100)
print("Right Skewed Data: More houses with price between 1 million and 3 million ")
# %% [markdown]
# ## Outliers in Data
#using box plot
plt.figure(figsize=(16, 5))
sns.boxplot(x='OverallQual', y='SalePrice', data=dfc)
# %% [markdown]
# ## Imputation using sklearn (handling missing values)
from sklearn.impute import KNNImputer
imp = KNNImputer()
# assign the imputed values back so the missing entries are actually filled
dfc = pd.DataFrame(imp.fit_transform(dfc), columns=dfc.columns)
print(dfc.isnull().sum())
print("\n\n\nNo missing values")
# %% [markdown]
# # Random Forest Regressor
# dividing the dataset into independent and dependent variables
X = dfc.iloc[:, :-1]
y = dfc.iloc[:, -1]  # TARGET
X
y
#splitting the dataset
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
rf_reg = RandomForestRegressor()
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
#feature scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
rf_reg.fit(X_train, y_train)
# predicting the test set results
y_pred = rf_reg.predict(X_test)
# Plotting a scatter graph to show the prediction
plt.scatter(y_test, y_pred)
plt.xlabel("Price: in $1000's")
plt.ylabel("Predicted value")
plt.title("True value vs predicted value : Random Forest Regression")
plt.show()
import math
from sklearn.metrics import mean_squared_error, mean_absolute_error
print("New RMSE: ", math.sqrt(mean_squared_error(y_pred, y_test)))
y_pred.size
# %% [markdown]
# # TEST Data
#cleaning test data
df1 = pd.read_csv(r"/kaggle/input/house-prices-advanced-regression-techniques/test.csv")
df1
df1.info()
df1 = df1[['Id','OverallQual','GrLivArea','GarageCars','GarageArea','TotalBsmtSF','1stFlrSF','FullBath',
           'TotRmsAbvGrd','YearBuilt','LotArea']]
df1
from sklearn.impute import KNNImputer
imp = KNNImputer()
# assign the imputed values back so the missing entries are actually filled
df1 = pd.DataFrame(imp.fit_transform(df1), columns=df1.columns)
df1[df1['GarageCars'].isnull()]
df1[['GarageCars','GarageArea','TotalBsmtSF']] = df1[['GarageCars','GarageArea','TotalBsmtSF']].fillna(0)
df1.isnull().sum()
# apply the same scaling used on the training data before predicting
y_pred1 = rf_reg.predict(sc.transform(df1))
# SUBMISSIONS
submission = pd.DataFrame({
    "Id": range(1461, 2920),
    "SalePrice": y_pred1
})
submission.to_csv("/kaggle/working/submission.csv", index=False)
Mall Customer Segmentation Data - Market Basket Analysis:
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))
df = pd.read_csv(r'/kaggle/input/customer-segmentation-tutorial-in-python/Mall_Customers.csv')
df
df.info()
print("\n\n\n NO missing values")
df['Gender'] = pd.get_dummies(df['Gender'], drop_first=True)
# one-hot encoding successful, as we can see in the dtype
df.info()
import matplotlib.pyplot as plt
import seaborn as sns
# let's see the plots between different columns of the dataset
sns.pairplot(df)
sns.heatmap(df.corr(), annot=True)
# outliers: 0 = Female, 1 = Male
sns.boxplot(x='Gender', y='Age', data=df)
sns.scatterplot(data=df, x='Age', y='Spending Score (1-100)', hue='Gender')
plt.title("Blue is female and orange is Male")
plt.show()
Gen = df.groupby('Gender')
print("\t\t\t0 is female and 1 is male")
Gen.mean()
# max and min
print(Gen.max())
print('\n\n')
print(Gen.min())
# %% [markdown]
# # K-Means Clustering and KNN Classification
X = df.iloc[:, [3, 4]].values
from sklearn.cluster import KMeans

# elbow method: within-cluster sum of squares for k = 1 to 10
wcss = []
for i in range(1, 11):
    km = KMeans(n_clusters=i)
    km.fit_predict(X)
    wcss.append(km.inertia_)

km = KMeans(n_clusters=5)
y_means = km.fit_predict(X)
plt.scatter(X[y_means == 0, 0], X[y_means == 0, 1], color='blue')
plt.scatter(X[y_means == 1, 0], X[y_means == 1, 1], color='red')
plt.scatter(X[y_means == 2, 0], X[y_means == 2, 1], color='green')
plt.scatter(X[y_means == 3, 0], X[y_means == 3, 1], color='yellow')
plt.scatter(X[y_means == 4, 0], X[y_means == 4, 1], color='black')
plt.title('Clusters of customers')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.show()
model = KMeans(n_clusters=5)
model.fit(df)
pre = model.predict(df)
df["Target"] = y_means
df.head()
X = df.iloc[:, 1:5]
y = df.iloc[:, -1]
X.head()
y.head()
#splitting the dataset
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=0)
# Standardize the variables
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
from sklearn.neighbors import KNeighborsClassifier

# find the k with the lowest error rate on the test split
error_rate = []
for i in range(1, 40):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    pred_i = knn.predict(X_test)
    error_rate.append(np.mean(pred_i != y_test))
plt.figure(figsize=(10, 5))
plt.plot(range(1, 40), error_rate, color='blue', linestyle='dashed', marker='o', markersize=12)
plt.title("Error rate vs k value")
plt.xlabel("k")
plt.ylabel("Error rate")
plt.show()
knn =KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
pred_5 = knn.predict(X_test)
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, pred_5)
accuracy