Internship Report
CHAPTER 1
COMPANY PROFILE
As part of the Computer Science and Engineering degree course prescribed by
Visvesvaraya Technological University, an internship is undertaken. The details of the
company that provided the internship are given.
1.1 About Company
Board Infinity is a career-tech platform based in Mumbai, India, founded in 2017
by Abhay Gupta and Sumesh Nair. The company aims to enhance career development and
job readiness for students and professionals through personalized learning experiences and
mentorship. Board Infinity offers a wide range of courses in fields such as data science,
digital marketing, software development, and business management. These programs are
designed to be flexible and adaptive to individual learner needs, incorporating one-on-one
mentoring sessions with industry experts, hands-on projects, and career coaching.
The platform has raised approximately $3.2 million in funding and connects learners with
over 2,000 industry experts to ensure a focused and practical learning approach. Board
Infinity operates with the vision of bridging the gap between academic knowledge and
industry requirements, thereby improving employability and career growth for its users.
Board Infinity also provides career transition support, helping learners shift from their
current roles to new and more desirable positions within the industry. The company
emphasizes outcomes, aiming to offer tangible improvements in job placements and career
advancements for its users.
Dept of C.S.E
“Report on Data Science”
INSERT, UPDATE, and DELETE, which are used to retrieve, add, modify,
and remove data, respectively. These commands are fundamental for interacting with
databases and managing data efficiently. I also explored advanced SQL concepts like joins,
subqueries, and functions, which enable efficient data manipulation and retrieval from
multiple tables. Understanding joins allowed me to combine data from different tables,
while subqueries and functions provided advanced ways to handle complex queries and data
transformations.
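The basic commands described above can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the students table, its columns, and the sample rows are hypothetical and were not part of the course material.

```python
import sqlite3


def run_crud_demo():
    """Run INSERT, UPDATE, DELETE and SELECT against an in-memory database."""
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()
    # Hypothetical table used only for this illustration.
    cur.execute(
        "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, score REAL)"
    )
    # INSERT: add rows.
    cur.executemany(
        "INSERT INTO students (name, score) VALUES (?, ?)",
        [("Asha", 82.5), ("Ravi", 67.0), ("Meena", 91.0)],
    )
    # UPDATE: modify an existing row.
    cur.execute("UPDATE students SET score = ? WHERE name = ?", (70.0, "Ravi"))
    # DELETE: remove a row.
    cur.execute("DELETE FROM students WHERE name = ?", ("Asha",))
    # SELECT: retrieve what is left, sorted by name.
    rows = cur.execute(
        "SELECT name, score FROM students ORDER BY name"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(run_crud_demo())  # [('Meena', 91.0), ('Ravi', 70.0)]
```

Passing values through ? placeholders rather than building the SQL string by hand keeps the statements safe from malformed input.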
In the data science segment, I was introduced to the entire data science lifecycle,
encompassing data collection, cleaning, analysis, and visualization. I learned about various
types of data, including structured, unstructured, and semi-structured data, and the
appropriate methods for handling each type. The course emphasized the importance of data
cleaning, as it is a critical step to ensure accuracy and reliability in subsequent analyses.
I gained proficiency in data preprocessing techniques such as handling missing values,
outlier detection, and data normalization.
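The three preprocessing techniques named above can be sketched with Pandas as follows. The choices made here (median imputation, the 1.5 x IQR rule for outliers, and min-max scaling) are common defaults assumed for illustration, not necessarily the ones used in the course, and the sample values are made up.

```python
import pandas as pd


def preprocess(values):
    """Fill missing values, flag IQR outliers, and min-max normalize a column."""
    df = pd.DataFrame({"value": values})
    # 1. Missing values: fill with the median of the observed values.
    df["value"] = df["value"].fillna(df["value"].median())
    # 2. Outlier detection: flag points more than 1.5 * IQR outside the quartiles.
    q1 = df["value"].quantile(0.25)
    q3 = df["value"].quantile(0.75)
    iqr = q3 - q1
    df["is_outlier"] = (df["value"] < q1 - 1.5 * iqr) | (df["value"] > q3 + 1.5 * iqr)
    # 3. Normalization: rescale to the range [0, 1].
    #    Assumes the column is not constant (otherwise hi == lo).
    lo, hi = df["value"].min(), df["value"].max()
    df["normalized"] = (df["value"] - lo) / (hi - lo)
    return df


if __name__ == "__main__":
    # Made-up sample with one missing value and one extreme value.
    print(preprocess([10.0, 12.0, float("nan"), 11.0, 13.0, 100.0]))
```

In practice the outlier flag would usually be acted on (for example by capping or removing the row) before normalization, since a single extreme value compresses all the other normalized values toward zero.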
The course provided hands-on experience with key programming languages and tools
like Python, Pandas, NumPy, and Matplotlib. Python served as the primary programming
language, and I learned how to use it for data manipulation, analysis, and visualization.
Pandas was particularly useful for data wrangling, allowing me to work with large datasets
efficiently. NumPy provided capabilities for numerical computing, making it easier to
perform mathematical operations on arrays and matrices. Matplotlib, along with other
visualization libraries, enabled me to create insightful charts and graphs to communicate
data findings effectively.
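As a small illustration of the array and matrix operations mentioned above, the sketch below uses NumPy to compute summary statistics and a matrix-vector product without explicit loops. The function names and sample numbers are hypothetical.

```python
import numpy as np


def standardize(scores):
    """Return the mean, population standard deviation, and z-scores of an array."""
    arr = np.asarray(scores, dtype=float)
    mean = arr.mean()
    std = arr.std()  # population standard deviation (ddof=0)
    return mean, std, (arr - mean) / std


def weighted_totals(marks, weights):
    """Multiply a (students x subjects) matrix by a vector of subject weights."""
    return np.asarray(marks, dtype=float) @ np.asarray(weights, dtype=float)


if __name__ == "__main__":
    mean, std, z = standardize([2, 4, 4, 4, 5, 5, 7, 9])
    print(mean, std)  # 5.0 2.0
    print(weighted_totals([[1, 2], [3, 4]], [1, 1]))  # [3. 7.]
```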
1.1 Objectives
The primary objective of the data science course was to equip students with the
necessary skills to analyze, interpret, and leverage data for informed decision-making. Key
learning outcomes included:
Master essential SQL commands: SELECT, INSERT, UPDATE, DELETE.
Learn data filtering and sorting techniques using WHERE, ORDER BY, GROUP BY.
Explore advanced SQL concepts like joins, subqueries, and functions.
Gain proficiency in data collection, cleaning, and preprocessing.
Develop skills in data analysis using statistical methods.
Learn to visualize data effectively using Python libraries like Matplotlib.
CHAPTER 2
TASK PERFORMED
The tasks performed each week are explained in detail below, giving an overview of the
concepts covered during the internship.
Data Science is a multidisciplinary field that combines computer science, statistics,
mathematics, and domain-specific knowledge to extract insights and knowledge from data. It
involves using various techniques, tools, and methods to collect, process, analyze, and
interpret large amounts of data to gain a deeper understanding of the underlying patterns,
trends, and correlations.
Data Science has become a crucial part of many industries. Its applications include
extracting insights from large datasets, identifying patterns and trends, making predictions
and recommendations, informing business decisions, and driving innovation and growth. Data
Science is applied in healthcare for personalized medicine, disease diagnosis, and treatment;
in finance for risk management, portfolio optimization, and fraud detection; in marketing for
customer segmentation, targeted advertising, and campaign optimization; and in environmental
science for climate modeling, predictive analytics, and sustainability. By leveraging Data
Science, organizations can gain a competitive edge, improve decision-making, and drive
business success.
Detailed description of various Data Science roles such as Data Scientist, Data Analyst,
Data Engineer, etc.
Introduction to essential tools and technologies used in Data Science (e.g., Python, R,
SQL).
Data Science encompasses various job roles, including Data Scientist, Data Analyst, Data
Engineer, and more. A Data Scientist extracts insights from data, develops predictive models, and
informs business decisions. A Data Analyst interprets data to identify trends and patterns, while a
Data Engineer designs and implements data pipelines. Other roles include Data Architect,
Business Analyst, and Machine Learning Engineer.
Essential tools and technologies in Data Science include Python, R, SQL, and more. Python is a
popular programming language used for data analysis, machine learning, and visualization. R is a
language and environment for statistical computing and graphics. SQL is a language for managing
and analyzing relational databases. Additionally, tools like Tableau, Power BI, and Excel are used
for data visualization and analysis.
Discussion on the motivations and inspirations for pursuing a career in this field.
4. Traditional Approach vs. Data Science Approach
Pursuing a career in Data Science/AI/ML requires motivation and inspiration. Many are drawn to
the field by the opportunity to work with data, drive business decisions, and innovate. Others are
inspired by the potential to solve complex problems and make a meaningful impact.
Traditional approaches often rely on intuition and experience, whereas Data Science-driven
approaches rely on data-driven insights and statistical analysis. The traditional approach may lead
to biased decision-making, whereas Data Science approaches provide objective, data-backed
solutions. By embracing Data Science, organizations can unlock new opportunities, drive
innovation, and gain a competitive edge.
Data Science has numerous real-world applications, and one such example is predicting
customer churn for a telecom company. By analyzing customer data, such as usage patterns
and billing information, a Data Science model can identify high-risk customers and enable
targeted retention strategies. This case study demonstrates how Data Science can solve a
complex business problem and drive significant revenue savings.
MODULE - 3: Business Analytics with Microsoft Excel
2. Understanding Business Metrics
Business analytics with Excel enables users to analyze and visualize data, making it easier to
understand business performance. Excel provides various tools and functions, such as pivot tables,
charts, and formulas, to facilitate data analysis. By learning business analytics with Excel, users
can unlock insights, identify trends, and drive business growth. This module introduces the
fundamentals of business analytics using Excel, empowering users to make data-driven decisions.
This module delves deeper into business analytics concepts in Excel, covering basics, functions,
pivot tables, dashboard creation, and statistical analysis. Students will learn to leverage Excel's
capabilities to analyze and visualize data, creating informative dashboards and reports. Topics
include data manipulation, chart creation, and advanced functions like VLOOKUP and INDEX-
MATCH. Pivot tables will be explored in depth, enabling students to summarize and analyze large
datasets efficiently.
Advanced data visualization techniques will also be covered, allowing students to create
interactive and dynamic visualizations in Excel. This includes Power Pivot within Excel,
along with related tools outside Excel such as Power BI and the D3.js JavaScript library.
Students will learn to effectively communicate insights and trends to stakeholders,
enhancing their ability to drive business decisions with data-driven storytelling. By
mastering advanced data visualization in Excel, students will be able to unlock new insights,
identify areas for improvement, and drive business growth.
Overview of MySQL and NoSQL databases and their relevance to Data Science.
SQL joins and techniques for combining data from multiple tables.
This module dives deeper into SQL and Python for data science, building on the foundational
knowledge gained earlier. Students will solidify their understanding of database concepts,
including data types, schema design, and normalization.
The module covers basic and advanced SQL queries, enabling students to extract insights from
databases effectively. Topics include filtering data with the WHERE clause, sorting results
with ORDER BY, and techniques for aggregating and grouping data. Students will learn to
write efficient SQL queries, including subqueries, window functions, and common table
expressions.
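The query techniques listed above can be sketched against an in-memory SQLite database through Python's sqlite3 module. The sales table and its rows are hypothetical, and SQLite stands in here for the MySQL setup the module refers to; the SQL shown is standard enough to carry over with little change.

```python
import sqlite3


def build_sales_db():
    """Create a small, made-up sales table for the queries below."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?, ?)",
        [
            ("North", "A", 100.0), ("North", "B", 250.0),
            ("South", "A", 300.0), ("South", "B", 50.0),
            ("East", "A", 120.0),
        ],
    )
    return conn


def large_sales(conn, minimum):
    # WHERE filters rows; ORDER BY sorts the rows that remain.
    return conn.execute(
        "SELECT region, product, amount FROM sales "
        "WHERE amount >= ? ORDER BY amount DESC",
        (minimum,),
    ).fetchall()


def totals_by_region(conn):
    # GROUP BY collapses rows into one aggregate per region.
    return conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()


def regions_above_average(conn):
    # The common table expression computes per-region totals once;
    # the subquery compares each total with the average of those totals.
    return conn.execute(
        """
        WITH region_totals AS (
            SELECT region, SUM(amount) AS total FROM sales GROUP BY region
        )
        SELECT region, total FROM region_totals
        WHERE total > (SELECT AVG(total) FROM region_totals)
        ORDER BY region
        """
    ).fetchall()
```

Calling large_sales(conn, 120) on this data returns the three rows of at least 120 in descending order of amount, while regions_above_average(conn) returns only North and South, whose totals exceed the average regional total.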
Additionally, the module explores SQL joins and techniques for combining data from multiple
tables. Students will learn to use INNER JOIN, LEFT JOIN, RIGHT JOIN, and FULL
OUTER JOIN to integrate data from different tables, enabling them to analyze complex
relationships and extract valuable insights. By mastering advanced SQL techniques, students
will be able to work with large datasets and extract meaningful information to inform data-
driven decisions.
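The difference between an INNER JOIN and a LEFT JOIN can be seen in the following sketch, again using Python's sqlite3 module with hypothetical customers and orders tables. RIGHT JOIN and FULL OUTER JOIN are left out because older SQLite versions do not support them; in MySQL and similar systems they follow the same pattern.

```python
import sqlite3


def join_demo():
    """Return the results of an INNER JOIN and a LEFT JOIN on sample tables."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical tables: Ravi (id 2) has no orders.
    conn.executescript(
        """
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
        INSERT INTO customers VALUES (1, 'Asha'), (2, 'Ravi'), (3, 'Meena');
        INSERT INTO orders VALUES (10, 1, 500.0), (11, 1, 150.0), (12, 3, 75.0);
        """
    )
    # INNER JOIN keeps only customers that have at least one matching order.
    inner = conn.execute(
        "SELECT c.name, o.total FROM customers c "
        "INNER JOIN orders o ON o.customer_id = c.id ORDER BY o.id"
    ).fetchall()
    # LEFT JOIN keeps every customer; those without orders get a count of 0.
    left = conn.execute(
        "SELECT c.name, COUNT(o.id) FROM customers c "
        "LEFT JOIN orders o ON o.customer_id = c.id "
        "GROUP BY c.id, c.name ORDER BY c.id"
    ).fetchall()
    conn.close()
    return inner, left


if __name__ == "__main__":
    inner_rows, left_rows = join_demo()
    print(inner_rows)  # [('Asha', 500.0), ('Asha', 150.0), ('Meena', 75.0)]
    print(left_rows)   # [('Asha', 2), ('Ravi', 0), ('Meena', 1)]
```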
CHAPTER 3
PROJECT DETAILS
3.1 Project Overview
3.2 Types of Diabetes:
1) In type 1 diabetes, the immune system attacks the insulin-producing cells of the
pancreas, so the body cannot make enough insulin. There are no convincing studies that
demonstrate the causes of type 1 diabetes, and there are no effective preventative
measures so far.
2) Type 2 diabetes is characterised by either insufficient insulin production by the cells or
improper insulin use by the body. About 90% of people with diabetes have this kind of
diabetes, making it the most prevalent type. Both genetic and lifestyle factors contribute
to its occurrence.
3) Gestational diabetes appears in pregnant women who unexpectedly develop high blood sugar
levels. It returns in two-thirds of patients during subsequent pregnancies. There is a
high likelihood that type 1 or type 2 diabetes will develop after a pregnancy affected by
gestational diabetes.
Genetic conditions also contribute to diabetes: it is associated with at least two defective
genes on chromosome 6, the chromosome that controls the body's response to numerous
antigens. The incidence of type 1 and type 2 diabetes may also be influenced by viral
infection. Infection with viruses such as rubella, mumps, hepatitis B virus, and
cytomegalovirus increases the risk of diabetes.
The goal of this study is to create a system that, by fusing the findings of several machine
learning approaches, can more accurately conduct early diabetes prediction for a patient. To
predict diabetes, we use a variety of machine learning classification and ensemble
techniques. Machine learning is a technique for training computers to learn from data. By
building various classification and ensemble models from the obtained dataset, these
techniques efficiently capture knowledge. Many machine learning techniques are capable of
making predictions, but selecting the right method can be challenging. Therefore, we apply
well-known classification and ensemble algorithms to the dataset to make predictions.
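One simple way of fusing the findings of several classifiers, as described above, is majority voting. The sketch below is an assumption about how such a combination could look, not the project's actual model: the three rule-based classifiers, their feature names, and their thresholds are hypothetical placeholders with no clinical meaning.

```python
from collections import Counter


# Hypothetical stand-ins for trained classifiers. Each returns 1 (at risk)
# or 0 (not at risk). The thresholds are illustrative, not clinical guidance.
def glucose_rule(patient):
    return 1 if patient["glucose"] >= 140 else 0


def bmi_rule(patient):
    return 1 if patient["bmi"] >= 30 else 0


def age_rule(patient):
    return 1 if patient["age"] >= 50 else 0


def majority_vote(classifiers, patient):
    """Return the label predicted by most classifiers.

    An odd number of classifiers is assumed so that a tie cannot occur.
    """
    votes = [classify(patient) for classify in classifiers]
    return Counter(votes).most_common(1)[0][0]


if __name__ == "__main__":
    ensemble = [glucose_rule, bmi_rule, age_rule]
    print(majority_vote(ensemble, {"glucose": 150, "bmi": 32, "age": 35}))  # 1
    print(majority_vote(ensemble, {"glucose": 100, "bmi": 31, "age": 40}))  # 0
```

In a real system each rule would be replaced by a model trained on the dataset, but the voting step stays the same.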
3.4 Objectives:
The purpose of this study is to evaluate the Diabetes dataset and to develop and implement a
diabetes prediction and recommendation system built on machine learning classification
algorithms. Many chronic and blood-related illnesses exist, including leukemia, anaemia,
diabetes, haemophilia, high blood cholesterol, cancer, and HIV/AIDS. Diabetes mellitus alone
affects around 400 million individuals worldwide. Prediction systems of this kind are
intended to help detect such medical conditions.
1.5.1 Goals:
● To raise awareness about the significance of diabetes as a worldwide public health
concern.
● To examine the literature on diabetes diagnosis and prediction, and to create a model
using machine learning techniques.
● To promote diabetes prevention and management in underserved populations.
● To support diagnosis of diabetes at an early stage using food intake.
● To highlight the importance of lifestyle, especially health and food, in identifying
individuals with diabetes and avoiding complications.
Serious actions must be undertaken to reduce the impacts of diabetes at an initial stage,
which also helps to reduce the number of diabetic patients. Aside from that, anyone who
believes they have diabetes should focus on preventing complications such as blindness,
kidney failure requiring dialysis, amputation, or even death. A balanced diet is therefore
necessary to prevent the progression of diabetes. Accurate classification of diabetes is a
fundamental step towards diabetes prevention and control in healthcare, and early
identification is even more beneficial in controlling it. The identification process can be
tedious at an early stage because a patient has to visit a physician regularly. Advances in
machine learning approaches help address this critical problem in healthcare by predicting
disease, and several techniques have been proposed in the literature for diabetes prediction.
3.5 Motivation and Problem Definition
The problem of diabetes prediction using machine learning revolves around accurately
identifying individuals at risk of developing diabetes before clinical symptoms appear.
Given the increasing prevalence of diabetes worldwide, early detection is crucial for
effective intervention and management. Machine learning offers a sophisticated approach to
analyzing complex and multifaceted health data, including genetic, metabolic, and lifestyle
factors, which traditional methods may overlook. The motivation behind employing machine
learning in this context is the potential to enhance predictive accuracy, enable
personalized healthcare strategies, and ultimately reduce the global burden of diabetes. By
leveraging advanced algorithms, healthcare providers can identify high-risk individuals
early, tailor preventative measures, and improve patient outcomes, thereby addressing a
significant public health challenge.
3.7 Motivation
1. Early Detection: Early identification of at-risk individuals can significantly reduce
the incidence of diabetes and its complications. Machine learning models can predict
diabetes before the onset of symptoms, allowing for timely intervention.
2. Personalized Medicine: Machine learning enables the development of personalized
risk profiles by considering a wide range of variables, including genetic information, blood
biomarkers, dietary habits, and physical activity levels. This personalized approach can
lead to more effective prevention and treatment strategies.
3. Efficiency and Scalability: Machine learning models can analyze large datasets
quickly and accurately, making them scalable solutions for healthcare systems. This
efficiency can help in managing the growing number of diabetes cases worldwide.
4. Cost-Effectiveness: Early prediction and intervention can reduce healthcare costs
by preventing severe diabetes-related complications that require expensive treatments.
5. Data Utilization: The increasing availability of health data from electronic health
records, wearable devices, and genetic testing presents an opportunity to leverage machine
learning for comprehensive diabetes prediction. This data-rich environment enhances the
predictive power of machine learning models.
Applications
1. Risk Assessment Tools: Developing user-friendly tools for clinicians and patients that
provide real-time risk assessments based on machine learning predictions.
2. Preventive Programs: Designing targeted preventive programs for high-risk individuals
identified by machine learning models, focusing on lifestyle modifications and regular
monitoring.
3. Clinical Decision Support: Integrating machine learning models into clinical decision
support systems to aid healthcare providers in making informed decisions about diabetes
prevention and management.
4. Research and Development: Using machine learning to identify new biomarkers and
risk factors for diabetes, contributing to the ongoing research and development in the field.
3.8 Challenges
1. Data Quality and Integration: Ensuring the quality and consistency of data from multiple
sources is crucial for accurate predictions. Integrating heterogeneous data types (e.g.,
genetic, clinical, lifestyle) poses a significant challenge.
2. Model Interpretability: Developing models that are not only accurate but also
interpretable to healthcare providers is essential for gaining trust and facilitating clinical
adoption.
3. Privacy and Security: Protecting patient data privacy and ensuring secure data handling
are critical when dealing with sensitive health information.
4. Bias and Fairness: Addressing potential biases in machine learning models to ensure fair
and equitable predictions across diverse populations.
By addressing these challenges and leveraging the capabilities of machine learning,
significant advancements can be made in the early prediction and management of diabetes,
ultimately imp
3. Data Input: The pre-processed data is fed into the system for further processing.
4. Data Division: The data is split into two sets: Training data: Used to train the
forecasting model. Testing data: Used to evaluate the model’s performance on unseen data.
5. Forecasting Model: A forecasting model is built using the training data. This model
learns patterns and relationships within the data to predict future electricity consumption.
6. Hyperparameter Tuning: The model’s parameters are optimized to achieve the best
possible performance. This involves adjusting settings that control the model’s behavior.
7. Is Forecasting Accurate? The model’s predictions are compared to the actual values in
the testing data. If the accuracy is not satisfactory, the process may iterate back to
hyperparameter tuning or model selection.
8. Forecasted Output: Once the model’s accuracy is deemed acceptable, it generates
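The data division, model building, hyperparameter tuning, and accuracy check steps listed above can be sketched in plain Python. A moving-average forecaster is used here purely as a stand-in model, with its window size as the single hyperparameter; the model, the error metric (mean absolute error), and the sample series are all assumptions made for illustration.

```python
def split_series(series, train_fraction=0.75):
    """Data division: the earlier part trains the model, the later part tests it."""
    cut = int(len(series) * train_fraction)
    return series[:cut], series[cut:]


def moving_average_forecast(history, window):
    """Forecasting model: predict the next value as the mean of the last `window` values."""
    recent = history[-window:]
    return sum(recent) / len(recent)


def mean_absolute_error(history, test, window):
    """Accuracy check: walk through the test data, forecasting one step at a time."""
    history = list(history)
    errors = []
    for actual in test:
        errors.append(abs(moving_average_forecast(history, window) - actual))
        history.append(actual)  # reveal the true value before the next step
    return sum(errors) / len(errors)


def tune_window(train, candidates):
    """Hyperparameter tuning: pick the window with the lowest error on a
    validation slice held out from the end of the training data."""
    fit, validation = split_series(train, 0.75)
    return min(candidates, key=lambda w: mean_absolute_error(fit, validation, w))


if __name__ == "__main__":
    # Made-up series used only to exercise the steps.
    series = [10, 12, 11, 13, 12, 14, 13, 15, 14, 16, 15, 17]
    train, test = split_series(series)
    best = tune_window(train, [1, 2, 3])
    print(best, mean_absolute_error(train, test, best))  # 2 1.0
```

If the test error is not acceptable, the loop described in step 7 corresponds to calling tune_window again with a wider set of candidates or swapping in a different model.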
4. Requirements
• Software Requirement:
The following tools and libraries are needed to build the machine learning model for
diabetes prediction, along with their respective purposes:
1. Python 3.7 or Higher: Core programming language. Python is essential for writing and
executing the code for data manipulation, model building, and interface creation.
2. Streamlit: For creating web application interfaces. Streamlit allows for the creation of
interactive and user-friendly web applications to visualize data and model predictions.
3. NumPy: Numerical operations and array handling. NumPy is used for performing
efficient numerical computations, handling arrays, and performing mathematical
operations.
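A quick way to confirm that an environment meets the requirements above is to check the interpreter version and whether each package can be found. The helper below is a hypothetical convenience script, not part of the project; it only reports what is installed and does not install anything.

```python
import importlib.util
import sys


def check_requirements(min_python=(3, 7), packages=("streamlit", "numpy")):
    """Report whether the Python version and each listed package are available."""
    report = {"python": sys.version_info[:2] >= min_python}
    for name in packages:
        # find_spec returns None when a top-level package is not installed.
        report[name] = importlib.util.find_spec(name) is not None
    return report


if __name__ == "__main__":
    for item, ok in check_requirements().items():
        print(f"{item}: {'OK' if ok else 'missing'}")
```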
CHAPTER 4
CONCLUSION AND FUTURE SCOPE
CERTIFICATION