Data Science_Chapter 3
Domains of AI
Rock, Paper & Scissors Game
https://www.afiniti.com/corporate/rock-paper-scissors
Airline Route Planning: The airline industry across the world is known to bear heavy
losses. Except for a few airline service providers, companies are struggling to maintain
their occupancy ratio and operating profits. With the sharp rise in air-fuel prices and the
need to offer heavy discounts to customers, the situation has got worse. It wasn't long
before airline companies started using Data Science to identify strategic areas of
improvement. With Data Science, airline companies can:
• Predict flight delays
• Decide which class of airplanes to buy
• Decide whether to fly directly to the destination or take a halt in between (for example,
a flight can have a direct route from New Delhi to New York; alternatively, it can halt
in another country on the way)
• Effectively drive customer loyalty programs
• Image Recognition: When you upload a picture, the automatic tag suggestion system
used by applications like Facebook suggests people to tag. It uses a face recognition
algorithm for this.
• Speech Recognition: Siri, Google Voice, Cortana, etc. use speech recognition algorithms
to simplify our lives by converting our spoken messages into text.
• Medicine: Data Science is used in medical image analysis, such as detecting tumors and
artery stenosis, to study different parameters.
• Gaming: Some games are designed using machine learning algorithms that let the game
improve or upgrade itself based on the player's previous moves and shape the next level
accordingly.
• Virtual Reality: A VR headset uses computing knowledge, algorithms and data to
provide the best viewing experience.
Getting Started
Data Science is a combination of Python and mathematical concepts like Statistics, Data Analysis,
Probability, etc. Concepts of Data Science can be used in developing applications around AI, as they
give a strong base for data analysis in Python.
Thus, we extract the required information from the curated dataset and clean it up in such
a way that there exist no errors or missing elements in it.
Modelling
Once the dataset is ready, we train our model on it. In this case, a regression model is
chosen: the dataset is fed to it as a dataframe and the model is trained accordingly.
Regression is a Supervised Learning technique that learns from continuous values and
predicts continuous values. Since the data we have is continuous data recorded over
30 days, we can use a regression model to predict the values that follow in a similar
manner. The 30-day dataset is divided in a ratio of 2:1 for training and testing
respectively: the model is first trained on the 20-day data and then evaluated on the
remaining 10 days.
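The 2:1 split and the training step can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the 30 daily values are synthetic, the names (unconsumed, fit_line) are hypothetical, and a straight-line least-squares fit stands in for whichever regression model is actually chosen.

```python
# Hypothetical data: unconsumed quantity (kg) of one dish over 30 days.
# The values are synthetic and exist only to illustrate the split.
days = list(range(1, 31))
unconsumed = [12.0 - 0.2 * d + (0.5 if d % 2 else -0.5) for d in days]

# 2:1 split: the first 20 days for training, the last 10 days for testing.
train_x, test_x = days[:20], days[20:]
train_y, test_y = unconsumed[:20], unconsumed[20:]


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Train on the 20-day data only, then predict the 10 held-out days.
slope, intercept = fit_line(train_x, train_y)
predictions = [slope * x + intercept for x in test_x]
```

The held-out values in test_y are never seen during fitting, so comparing them with predictions gives a fair picture of how the model behaves on new days.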
Evaluation
Once the model has been trained on the training dataset of 20 days, it is time to see whether the model is
working properly. Let us see how the model works and how it is tested.
Step 1: Feed data to the trained model, e.g. the name of the dish and the quantity produced for the same.
Step 2: Feed the data of quantity of unconsumed food of the same dish on previous days.
Step 3: The model then works upon the entries according to the training it got at the modelling stage.
Step 4: The Model predicts the quantity of food to be prepared for the next day.
Step 5: The prediction is compared to the testing dataset value. From the testing dataset, ideally, we can say
that the quantity of food to be produced for next day's consumption should be the total quantity minus the
unconsumed quantity.
Step 6: The model is tested with different datasets at least 10 times during training.
Step 7: Prediction values and actual values are compared to check the efficiency of the model.
Step 8: If the predicted values are the same as, or close to, the actual values, the model is said to be accurate.
If not, either a different model is selected or the model is trained on more data for better accuracy.
Once the model is able to achieve optimum efficiency, it is ready to be deployed in the restaurant for real-
time usage.
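The comparison in Steps 5 to 8 can be sketched as below. The records, the predicted values and the tolerance are all hypothetical, and mean absolute error is used here only as one possible way to decide whether predictions are "almost similar" to the actual values; the notes do not prescribe a specific measure.

```python
def required_quantity(produced, unconsumed):
    """Step 5: quantity needed = total quantity produced - unconsumed quantity."""
    return produced - unconsumed


def mean_absolute_error(actual, predicted):
    """Average size of the gap between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def is_accurate(actual, predicted, tolerance):
    """Steps 7-8: treat the model as accurate if the average gap is small enough."""
    return mean_absolute_error(actual, predicted) <= tolerance


# Hypothetical testing records: (quantity produced, quantity unconsumed) per day.
test_days = [(50, 8), (50, 10), (48, 6), (52, 9)]
actual = [required_quantity(p, u) for p, u in test_days]

# Hypothetical model output for the same four days.
predicted = [41, 41, 43, 42]

mae = mean_absolute_error(actual, predicted)
accurate = is_accurate(actual, predicted, tolerance=2.0)
```

If accurate turns out to be False, Step 8 applies: choose a different model or train on more data, then evaluate again.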
Data Collection
• Data collection is nothing new in our lives; it has been part of our society for ages.
Even when people did not have fair knowledge of calculations, records were still
maintained in some way or the other to keep an account of relevant things. Data
collection is an exercise which does not require even a tiny bit of technological
knowledge. But when it comes to analysing the data, it becomes a tedious process for
humans, as it is all about numbers and alphanumeric data. That is where Data Science
comes into the picture. It not only gives us a clearer idea of the dataset, but also adds
value to it by providing deeper and clearer analyses. And as AI gets incorporated into
the process, predictions and suggestions by the machine become possible on the same data.
• Now that we have gone through an example of a Data Science based project, we have a
bit of clarity regarding the type of data that can be used to develop a Data Science related
project. For data domain-based projects, the data used is mostly in numerical or
alphanumeric format, and such datasets are curated in the form of tables. Such
databases are very commonly found in any institution for record maintenance and other
purposes.
Examples of datasets
All the types of data mentioned above are in the form of tables that contain numeric or
alphanumeric data.
Sources of Data
There exist various sources of data from where we can collect any type of data required and the
data collection process can be categorised in two ways: Offline and Online.
While accessing data from any of the data sources, the following points should be kept in mind:
1. Only data which is available for public usage should be taken up.
2. Personal datasets should only be used with the consent of the owner.
3. One should never breach someone's privacy to collect data.
4. Data should only be taken from reliable sources, as the data collected from random sources
can be wrong or unusable.
5. Reliable sources of data ensure the authenticity of data, which helps in proper training of the
AI model.
Types of Data
For Data Science, usually the data is collected in the form of tables. These tabular datasets can be stored
in different formats. Some of the commonly used formats are:
1. CSV: CSV stands for comma separated values. It is a simple file format used to store tabular data. Each
line of this file is a data record and each record consists of one or more fields which are separated by
commas. Since the values of records are separated by commas, they are known as CSV files.
2. Spreadsheet: A Spreadsheet is a piece of paper or a computer program which is used for accounting
and recording data using rows and columns into which information can be entered. Microsoft Excel is a
program which helps in creating spreadsheets.
3. SQL: SQL stands for Structured Query Language. It is a domain-specific programming language
designed for managing data held in different kinds of DBMS (Database Management Systems). It is
particularly useful in handling structured data.
Many other database formats also exist.
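As a small illustration of the CSV and SQL formats described above, the sketch below reads CSV text with Python's built-in csv module and then queries the same records with SQL through the built-in sqlite3 module. The student names, the column names and the table name students are made up for this example.

```python
import csv
import io
import sqlite3

# Hypothetical CSV content: each line is a record, fields are separated by commas.
csv_text = """name,marks,city
Asha,82.5,Delhi
Ravi,74.0,Mumbai
Meena,91.0,Delhi
"""

# Read every record into a dictionary keyed by the header row.
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Load the same records into an in-memory SQL database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE students (name TEXT, marks REAL, city TEXT)")
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?)",
    [(r["name"], float(r["marks"]), r["city"]) for r in rows],
)

# A SQL query: names of students from Delhi, highest marks first.
delhi_students = [
    name
    for (name,) in conn.execute(
        "SELECT name FROM students WHERE city = ? ORDER BY marks DESC",
        ("Delhi",),
    )
]
conn.close()
```

In practice the CSV text would come from a file on disk rather than a string; io.StringIO is used here only so that the example is self-contained.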
Issues related to data
• Erroneous Data-The data received is not what is expected at that position. Two
ways in which data can be erroneous are:
➢Incorrect Value-The values at random places in the dataset are not correct. Either the
data is mismatched or it is not relevant for that position.
➢Invalid or Null Values-The value is either corrupted or has no meaning. These values
have to be removed as they hold no value for data processing.
• Missing Data-Data is not present at the desired location in a dataset. The dataset is
then considered incomplete. Missing data is not erroneous data.
• Outliers-Data which differs drastically from the rest of the data. Such data needs
to be removed or replaced; otherwise the result will not be accurate.
Issues related to data-Examples
➢Incorrect Value-The marks column does not have its value in decimal; the phone number
column has only 8 digits instead of 10 digits.
➢Invalid or Null Values-An email address without the @ sign.
➢Missing Data-Email address and pin code missing for a set of students.
➢Outliers-A value of zero entered as the marks of a student who was absent, instead of
marking an exemption. This will not give an accurate class average.
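The checks behind these examples can be sketched in plain Python as below. The student records are invented, and the rules used (a 10-digit phone number, an "@" in the email address, marks more than 30 away from the class median counted as an outlier) are assumptions chosen for illustration, not fixed standards.

```python
import re
from statistics import median

# Hypothetical student records; None stands for a missing value.
students = [
    {"name": "A", "marks": 78.5, "phone": "9876543210", "email": "a@example.com"},
    {"name": "B", "marks": 81.0, "phone": "98765432", "email": "b@example.com"},
    {"name": "C", "marks": 74.0, "phone": "9123456780", "email": "c.example.com"},
    {"name": "D", "marks": 80.0, "phone": "9012345678", "email": None},
    {"name": "E", "marks": 0.0, "phone": "9988776655", "email": "e@example.com"},
]


def find_issues(record, typical_marks, max_gap=30.0):
    """Return a list of the data issues found in one record."""
    issues = []
    if record["email"] is None:
        issues.append("missing email")            # missing data
    elif "@" not in record["email"]:
        issues.append("invalid email")            # invalid value
    if not re.fullmatch(r"\d{10}", record["phone"]):
        issues.append("incorrect phone")          # incorrect value
    if abs(record["marks"] - typical_marks) > max_gap:
        issues.append("outlier marks")            # outlier
    return issues


# The median is used as the "typical" marks because one extreme value barely moves it.
typical = median(s["marks"] for s in students)
report = {s["name"]: find_issues(s, typical) for s in students}
```

Records with an empty list in report are clean; the others need to be corrected, filled in, removed or replaced before the data is used for training.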
Python for Data Science
Data Science uses a combination of Python and mathematical concepts like
Statistics, Data Analysis, Probability, etc. Python is a suitable, simple and easy
language in which to write the code, and it can handle the highly complex
mathematical processing required to develop applications using AI.
A file created in Python and saved with the extension .py is called a module. A
collection of related modules saved under the same directory and name is called a
package. Various packages for different purposes are available for free use in
Python.
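To make the difference between a module and a package concrete, the sketch below builds a tiny package inside a temporary directory and imports a module from it. The package name kitchen_tools, the module name wastage and the function leftover are all hypothetical; a real project would keep these files in its own folder instead of creating them from code.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # A package is a directory of related modules (here: kitchen_tools).
    package_dir = Path(tmp) / "kitchen_tools"
    package_dir.mkdir()
    (package_dir / "__init__.py").write_text("")

    # A module is a single .py file (here: wastage.py).
    (package_dir / "wastage.py").write_text(
        "def leftover(produced, consumed):\n"
        "    return produced - consumed\n"
    )

    # Make the temporary directory importable, then import the module.
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        wastage = importlib.import_module("kitchen_tools.wastage")
        result = wastage.leftover(50, 42)
    finally:
        sys.path.remove(tmp)
```

With the files saved normally on disk, the same import is usually written as: from kitchen_tools import wastage.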
Open-source packages needed for AI
• NumPy-Numerical array data handling package. It is used for data analysis and
calculations on large numerical datasets.
• OpenCV-Image processing package. It is used for manipulating and processing
images, such as cropping, resizing, editing, etc.
• Matplotlib-Data visualization package. It is used to produce high-quality graphical
representations of numerical data.
• NLTK (Natural Language Toolkit)-Natural language processing package. It helps in
tasks related to textual data.
• Pandas-Data with two or more dimensions is handled using Pandas. The source
data is arranged in tabular form using either spreadsheets or database software.
After the data is collected through different methods, it needs to be
accessed through Python code so that it can be arranged in a structured manner
and analysed as required by the AI model.
Questions
• In what ways is Data Science helpful to the airline industry?
• What are the important points to remember when data is collected?
• Name the 5 stages of an AI project.
• What is erroneous data? Explain its two types.
• Explain the different formats used for storing data.
• What are packages in Python? Explain any three packages.