Data Science_Chapter 3

The document discusses the applications and importance of Data Science in various fields, including finance, genetics, internet search, and advertising. It outlines the process of leveraging Data Science for problem-solving, particularly in predicting food waste in restaurants, and emphasizes the significance of data collection, exploration, and modeling. Additionally, it highlights the use of Python and its packages for data analysis and AI development.

DATA SCIENCE

Domains of AI
Rock, Paper & Scissors Game
https://www.afiniti.com/corporate/rock-paper-scissors

Try to play the game of Rock, Paper & Scissors against an AI model.


The challenge here is to win 20 games against the AI.

• Did you manage to win?


• What was the strategy that you applied to win this game against the AI machine?
• Was it different playing Rock, Paper & Scissors with an AI machine as compared to a human?
• What approach was the machine following while playing against you?
Applications of Data Science
Data Science is not a new field. It majorly revolves around analysing data, and when it comes to AI, this analysis helps make a machine intelligent enough to perform tasks by itself. There exist various applications of Data Science in today's world.
Fraud and Risk Detection: The earliest applications of data science were in finance. Companies were fed up with the bad debts and losses they suffered every year. However, they had a lot of data which used to get collected during the initial paperwork while sanctioning loans, so they decided to bring in data scientists to rescue them from these losses.
Over the years, banking companies learned to divide and conquer data via customer profiling, past expenditures, and other essential variables to analyse the probabilities of risk and default. Moreover, it also helped them push their banking products based on customers' purchasing power.

Genetics & Genomics: Data Science applications also enable an advanced level of treatment personalization through research in genetics and genomics. The goal is to understand the impact of DNA on our health and find individual biological connections between genetics, diseases, and drug response. Thus, the use of Data Science has done wonders in predicting genetic risks in advance and has provided better individual care.
Internet Search: When we talk about search engines, we think 'Google', right? But there are many other search engines like Yahoo, Bing, Ask, AOL, and so on. All these search engines make use of data science algorithms to deliver the best results for our search query in a fraction of a second. Google processes more than 20 petabytes of data every day.

Targeted Advertising: The entire digital marketing spectrum, from the display banners on various websites to the digital billboards at airports, is almost entirely decided using data science algorithms. This is the reason why digital ads have been able to get a much higher CTR (Click-Through Rate) than traditional advertisements: they can be targeted based on a user's past behaviour.
• Website Recommendations: Recommendation engines not only help us find relevant products from the billions of products available but also add a lot to the user experience. Many companies use such engines to promote their products in accordance with the user's interests and the relevance of information. Internet giants like Amazon, Twitter, Google Play, Netflix, LinkedIn, IMDb and many more use this system to improve the user experience. The recommendations are made based on the user's previous search results.

Airline Route Planning: The airline industry across the world is known to bear heavy losses. Except for a few airline service providers, companies are struggling to maintain their occupancy ratios and operating profits. With the steep rise in air-fuel prices and the need to offer heavy discounts to customers, the situation has become worse. It wasn't long before airline companies started using Data Science to identify the strategic areas where it can help them:
• Predict flight delays
• Decide which class of airplanes to buy
• Decide whether to fly directly to the destination or take a halt in between (for example, a flight can take a direct route from New Delhi to New York, or it can choose to halt in another country on the way)
• Effectively drive customer loyalty programs
• Image Recognition: When you upload a picture, the automatic tag suggestion system used by applications like Facebook suggests people you might want to tag. It uses a face recognition algorithm for this.
• Speech Recognition: Siri, Google Voice, Cortana, etc. use speech recognition algorithms from Data Science to simplify our lives by converting our spoken messages into text.
• Medicine: Data Science is used in medical image analysis, such as detecting tumours, artery stenosis, etc., to study different parameters.
• Gaming: Some games are designed using machine learning algorithms that allow the game to improve or upgrade its levels based on the player's previous moves and accordingly raise the game to the next level.
• Virtual Reality: A VR headset uses computing knowledge, algorithms and data to provide the best viewing experience.
Getting Started
Data Science is a combination of Python and mathematical concepts like statistics, data analysis, probability, etc. Concepts of Data Science can be used in developing applications around AI, as they give a strong base for data analysis in Python.

Revisiting AI Project Cycle


How Data Science can be leveraged to solve a problem around us.
The Scenario:
Humans are social animals. We tend to organise and/or participate in various kinds of social gatherings all the time. We love eating out with friends and family, which is why restaurants can be found almost everywhere, and many of them arrange buffets to offer a variety of food items to their customers. Be it small shops or big outlets, every restaurant prepares food in bulk as they expect a good crowd to come and enjoy their food. But in most cases, after the day ends, a lot of food is left over, which becomes unusable for the restaurant as they do not wish to serve stale food to their customers the next day. So, every day, they prepare food in large quantities keeping in mind the probable number of customers walking into their outlet. But if these expectations are not met, a good amount of food gets wasted, which eventually becomes a loss for the restaurant as they either have to dump it or give it to hungry people for free. If this daily loss is added up over a year, it becomes quite a big amount.
Problem Scoping
Take a deeper look into the problem to find out more about various factors around it.
Problem Statement Template
Data Acquisition
After finalising the goal of our project, let us now look at the various data features which affect the problem in some way or the other. Since any AI-based project requires data for training and testing, we need to understand what kind of data is to be collected to work towards the goal. In our scenario, the various factors that would affect the quantity of food to be prepared for the next day's consumption in buffets would be:
Now let us understand how these factors are related to our problem statement. For this, we can
use the System Maps tool to figure out the relationship of elements with the project's goal.
Here is the System map for our problem statement.
Data Exploration
After creating the database, we now need to look at the data collected and understand
what is required out of it. In this case, since the goal of our project is to be able to predict
the quantity of food to be prepared for the next day, we need to have the following data:

Thus, we extract the required information from the curated dataset and clean it up in such
a way that there exist no errors or missing elements in it.
Modelling
Once the dataset is ready, we train our model on it. In this case, a regression model is chosen, in which the dataset is fed as a dataframe and the model is trained accordingly. Regression is a Supervised Learning model which takes in continuous values of data over a period of time. Since in our case we have continuous data for 30 days, we can use a regression model to predict the next values in a similar manner. The dataset of 30 days is divided in a ratio of 2:1 for training and testing respectively: the model is first trained on the 20-day data and then evaluated on the remaining 10 days.
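The split and training step described above can be sketched in Python. This is a minimal, illustrative example assuming a simple linear regression from scikit-learn and synthetic 30-day quantities; the variable names and numbers are assumptions, not the chapter's actual dataset.

```python
# Minimal sketch: train a regression model on 30 days of (synthetic) buffet data,
# using the 2:1 split described above (20 days for training, 10 days for testing).
import numpy as np
from sklearn.linear_model import LinearRegression

days = np.arange(1, 31).reshape(-1, 1)                                     # day number, 1..30
quantity_consumed = 50 + 0.8 * days.ravel() + np.random.normal(0, 2, 30)   # kg consumed per day (synthetic)

# 2:1 split -> first 20 days for training, last 10 days for testing
X_train, X_test = days[:20], days[20:]
y_train, y_test = quantity_consumed[:20], quantity_consumed[20:]

model = LinearRegression()
model.fit(X_train, y_train)             # train on the 20-day data

predictions = model.predict(X_test)     # predict the remaining 10 days
print(predictions.round(1))
```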
Evaluation
Once the model has been trained on the training dataset of 20 days, it is time to see if it is working properly. Let us see how the model works and how it is tested.
Step 1: Feed data to the trained model, e.g. the name of a dish and the quantity produced for it.
Step 2: Feed the data on the quantity of unconsumed food of the same dish on previous days.
Step 3: The model then works upon the entries according to the training it received at the modelling stage.
Step 4: The model predicts the quantity of food to be prepared for the next day.
Step 5: The prediction is compared to the testing dataset value. From the testing dataset, ideally, the quantity of food to be produced for the next day's consumption should be the total quantity minus the unconsumed quantity.
Step 6: The model is tested with different datasets at least 10 times during training.
Step 7: Prediction values and actual values are compared to check the efficiency of the model.
Step 8: If the prediction value is the same as or very close to the actual values, the model is said to be accurate. If not, either the model selection is changed or the model is trained on more data for better accuracy.
Once the model is able to achieve optimum efficiency, it is ready to be deployed in the restaurant for real-time usage.
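A small sketch of Steps 5 and 7: comparing predicted values with the actual test values and computing a simple error measure. All numbers below are made-up, illustrative values, not data from the chapter.

```python
# Sketch of the evaluation step: compare predicted quantities with the actual
# quantities from the 10 test days and report a simple error measure.
from sklearn.metrics import mean_absolute_error

actual    = [62.0, 58.5, 61.0, 64.5, 60.0, 63.0, 59.5, 62.5, 61.5, 60.5]   # kg consumed (assumed)
predicted = [61.2, 59.0, 60.4, 63.8, 60.9, 62.1, 60.2, 61.9, 62.0, 61.1]   # model output (assumed)

mae = mean_absolute_error(actual, predicted)
print(f"Mean absolute error over the 10 test days: {mae:.2f} kg")

# Rule of thumb from Step 5: quantity to prepare for the next day is the
# total quantity prepared minus the unconsumed quantity (illustrative values).
total_prepared = 80.0   # kg prepared on a given day (assumed)
unconsumed = 12.5       # kg left unconsumed (assumed)
print(f"Suggested quantity for tomorrow: {total_prepared - unconsumed} kg")
```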
Data Collection
• Data collection is nothing new in our lives; it has existed in our society for ages. Even when people did not have much knowledge of calculations, records were still maintained in some way or the other to keep an account of relevant things. Data collection is an exercise which does not require even a tiny bit of technological knowledge. But when it comes to analysing the data, it becomes a tedious process for humans, as it is all about numbers and alphanumeric data. That is where Data Science comes into the picture. It not only gives us a clearer idea about the dataset, but also adds value to it by providing deeper and clearer analyses of it. And as AI gets incorporated in the process, predictions and suggestions by the machine become possible as well.
• Now that we have gone through an example of a Data Science based project, we have some clarity regarding the type of data that can be used to develop such a project. For data-domain-based projects, the type of data used is mostly in numerical or alphanumeric format, and such datasets are curated in the form of tables. Such databases are very commonly found in any institution for record maintenance and other purposes.
Examples of datasets

All the types of data mentioned above are in the form of tables, which contain numeric or alphanumeric data.
Sources of Data
There exist various sources from which we can collect any type of data required, and the data collection process can be categorised in two ways: offline and online.

While accessing data from any of these sources, the following points should be kept in mind:
1. Only data which is available for public usage should be taken up.
2. Personal datasets should only be used with the consent of the owner.
3. One should never breach someone's privacy to collect data.
4. Data should only be taken from reliable sources, as data collected from random sources can be wrong or unusable.
5. Reliable sources of data ensure the authenticity of the data, which helps in the proper training of the AI model.
Types of Data
For Data Science, data is usually collected in the form of tables. These tabular datasets can be stored in different formats. Some of the commonly used formats are:
1. CSV: CSV stands for Comma Separated Values. It is a simple file format used to store tabular data. Each line of the file is a data record, and each record consists of one or more fields separated by commas. Since the values of the records are separated by commas, these files are known as CSV files.
2. Spreadsheet: A spreadsheet is a piece of paper or a computer program used for accounting and recording data in rows and columns into which information can be entered. Microsoft Excel is a program which helps in creating spreadsheets.
3. SQL: SQL, or Structured Query Language, is a domain-specific programming language designed for managing data held in different kinds of DBMS (Database Management Systems). It is particularly useful for handling structured data.
Many other database formats also exist. A short sketch of reading a CSV file in Python is shown below.
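As a brief illustration of the CSV format described above, the sketch below reads a comma-separated file into a Pandas table. The file name and column names are assumptions for illustration, not a dataset provided with the chapter.

```python
# Sketch: reading a CSV file of tabular restaurant data into Python.
import pandas as pd

# Each line of the CSV is one record; fields are separated by commas, e.g.
#   date,dish,quantity_prepared,quantity_unconsumed
#   2024-01-01,Rice,40,6
df = pd.read_csv("buffet_data.csv")   # hypothetical file name

print(df.head())      # first few records
print(df.columns)     # field (column) names
```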
Issues related to data
• Erroneous Data: the data received is not as expected for a given position. Two ways in which data can be erroneous are:
➢ Incorrect Values: the values at random places in the dataset are not correct. Either the data is mismatched or it is not relevant for that position.
➢ Invalid or Null Values: a value is either corrupted or has no meaning. These values have to be removed as they hold no value for data processing.
• Missing Data: data is not present at the desired location in a dataset. This is considered an incomplete dataset; these are not erroneous data.
• Outliers: data which differs drastically from the rest of the data. Such values need to be removed or replaced; if not, the result will not be accurate.
Issues related to data: Examples
➢ Incorrect Value: the Marks column does not have its value in decimal form, or the phone number column has only 8 digits instead of 10.
➢ Invalid or Null Values: an e-mail address without an @ sign.
➢ Missing Data: e-mail address and PIN code missing for a set of students.
➢ Outliers: a value of zero entered in the marks of a student who was absent, instead of an exemption. This will not give an accurate class average.
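The issues listed above can be spotted programmatically. Here is a minimal sketch using Pandas on a small, made-up table containing a missing mark, an invalid e-mail address and a zero entered for an absent student; all column names and values are illustrative assumptions.

```python
# Sketch: spotting missing values, invalid values and outliers in a tiny table.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name":  ["Asha", "Ravi", "Meena", "John"],
    "marks": [78.5, 82.0, np.nan, 0.0],                           # NaN = missing; 0 for an absent student = outlier
    "email": ["asha@x.com", "ravi.x.com", None, "john@x.com"],    # 'ravi.x.com' is invalid (no @ sign)
})

print(df.isnull().sum())                            # count missing (null) values per column
print(df[~df["email"].str.contains("@", na=True)])  # rows whose e-mail lacks an '@' (missing handled separately)
print(df.dropna())                                  # rows with no missing values

# Treat a mark of 0 for an absent student as missing before averaging,
# so the class average is not distorted by the outlier.
marks = df["marks"].replace(0.0, np.nan)
print("Class average (ignoring absentees):", marks.mean())
```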
Python for Data Science
Data Science uses a combination of Python and mathematical concepts like statistics, data analysis, probability, etc. Python is a simple, easy and very suitable language for writing the code, and it can handle the highly complex mathematical processing required to develop applications using AI.

A file created in Python and saved with the extension .py is called a module. A collection of related modules saved under the same directory and name is called a package. There are various packages, related to various purposes, available for free to be used in Python.
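To illustrate the idea of a module, here is a minimal sketch. The file name food_stats.py and the function in it are hypothetical examples, not part of the chapter's material.

```python
# food_stats.py -- a single Python file saved with a .py extension is a module.
def average(values):
    """Return the mean of a list of numbers."""
    return sum(values) / len(values)


# Another script can then import the module and reuse its functions.
# If several related modules are saved in one directory (with an __init__.py
# file), that directory becomes a package.
# import food_stats
# print(food_stats.average([40, 55, 47]))
```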
Open-source packages needed for AI

• NumPy: numerical array data handling package. It is used for data analysis and calculations related to large numerical datasets.
• OpenCV: image processing package. It is used for manipulating and processing images, for example cropping, resizing and editing them.
• Matplotlib: data visualization package. It is used for graphical representation, producing high-quality visualizations of numerical data.
• NLTK (Natural Language Toolkit): natural language processing package. It helps in tasks related to textual data.
• Pandas: data with two or more dimensions is handled using Pandas. The source data is arranged in tabular form, either in spreadsheets or in database software. After the data is collected through different methods, it needs to be accessed through Python code so that it can be arranged in a structured manner and analysed as required by the AI model.
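A short sketch showing a few of these packages working together on a tiny, made-up table of daily consumption values (all numbers and column names are illustrative assumptions):

```python
# Sketch: NumPy, Pandas and Matplotlib used together on a small synthetic table.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

data = pd.DataFrame({
    "day": np.arange(1, 8),
    "quantity_consumed": [52, 55, 49, 60, 58, 63, 57],   # kg per day (assumed)
})

print(data.describe())                                   # Pandas: quick numerical summary
print("Mean:", np.mean(data["quantity_consumed"]))       # NumPy: calculation on the array

data.plot(x="day", y="quantity_consumed", marker="o")    # Matplotlib (via Pandas): line chart
plt.title("Daily consumption")
plt.xlabel("Day")
plt.ylabel("Quantity (kg)")
plt.show()
```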
Questions
• In what ways is Data Science helpful to the airline industry?
• What are the important points to remember when data is collected?
• Name the 5 stages of the AI project cycle.
• What is erroneous data? Explain its two types.
• Explain the different formats used for storing data.
• What are packages in Python? Explain any three packages.
