Data Science Fundamentals QB
CS3353
DATA SCIENCE FUNDAMENTALS
DEPARTMENT OF MECHANICAL ENGINEERING
ACADEMIC YEAR: 2024 - 2025 (ODD SEMESTER)
OCS353 DATA SCIENCE FUNDAMENTALS
UNIT I
PART A
Data science is the study of data. It involves developing methods of recording, storing, and
analyzing data to effectively extract useful information. The goal of data science is to gain insights
and knowledge from any type of data — both structured and unstructured.
Every business needs data science to make forecasts and predictions based on facts and figures, which are collected as data and processed through data science. Data science is also important for marketing promotions and campaigns.
Data science is necessary for research and analysis in health care, making it easier for practitioners to understand the challenges and extract results through analysis and insights based on data.
The advanced technologies of machine learning and artificial intelligence have been infused into medical science to conduct biomedical and genetic research more easily.
A patient's clinical history and the medicinal treatment given are computerized and kept as digitalized records, which makes it easier for practitioners and doctors to detect complex diseases at an early stage and understand their complexity.
Online personality tests and career counseling help students and individuals select a career path based on their choices. This is done with the help of data science, which has designed algorithms and predictive models that draw conclusions from the provided set of choices.
Data science is also used in sentiment analysis, which determines a user's mood, behavior, and opinions towards a specific product, incident, or event with the help of texts, feedback, thoughts, views, reactions, and suggestions.
The new technology emerging around us to ease life and alter our lifestyle is all due to data science.
Some applications collect sensory data to track our movements and activities and also provide
knowledge and suggestions concerning our health, such as blood pressure and heart rate. Data
science is also being implemented to advance security and safety technology, which is the most
important issue nowadays.
Financial institutions use data science to predict stock markets, determine the risk of lending
money, and learn how to attract new clients for their services.
* Structured
* Unstructured
* Natural language
* Machine-generated
* Graph-based
* Streaming
Yes, natural language is a special type of unstructured data; it's challenging to process because it requires knowledge of specific data science techniques and linguistics. Natural language processing is used in entity recognition, topic recognition, summarization, text completion, and sentiment analysis.
Examples of machine data are web server logs, call detail records, network event logs, and
telemetry.
Graph or network data is, in short, data that focuses on the relationship or adjacency of objects. The
graph structures use nodes, edges, and properties to represent and store graphical data. Graph-based
data is a natural way to represent social networks.
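As a sketch of this idea, a tiny hypothetical social network can be stored as an adjacency list, where each node maps to the set of nodes it is connected to (the names below are invented for illustration):

```python
# A minimal sketch of graph-based data: a small social network as an
# adjacency list. The people and connections are hypothetical examples.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice"},
    "carol": {"alice", "bob"},
}

def mutual_friends(g, a, b):
    """Nodes adjacent to both a and b (set intersection of neighbours)."""
    return g.get(a, set()) & g.get(b, set())

print(mutual_friends(friends, "bob", "carol"))
```

Queries about relationships (mutual friends, shortest paths, communities) follow naturally from this representation, which is why graph databases suit social-network data.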
13. List the steps involved in the data science process.
Setting the research goal, retrieving data, data preparation (cleansing, integrating, and transforming), data exploration, model building, and presentation and automation.
An outlier is an observation that seems to be distant from other observations or, more specifically,
one observation that follows a different logic or generative process than the other observations.
The two operations that are performed in various data types to combine them are as follows:
i. Joining- enriching an observation from one table with information from another table.
ii. Appending or stacking- adding the observations of one table to those of another table.
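The two operations might be sketched with pandas (a tool choice assumed here, not named in the source), using made-up sales and manager tables:

```python
import pandas as pd

# Joining: enrich observations in one table with columns from another.
sales = pd.DataFrame({"region": ["N", "S"], "revenue": [100, 80]})
managers = pd.DataFrame({"region": ["N", "S"], "manager": ["Ann", "Ben"]})
joined = sales.merge(managers, on="region")   # adds the manager column

# Appending (stacking): add the observations of one table to another.
q1 = pd.DataFrame({"region": ["N"], "revenue": [100]})
q2 = pd.DataFrame({"region": ["S"], "revenue": [80]})
stacked = pd.concat([q1, q2], ignore_index=True)

print(joined)
print(stacked)
```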
16. What is big data?
Big data is a blanket term for any collection of data sets so large or complex that it becomes
difficult to process using traditional data management techniques. They are characterized by the
four Vs: velocity, variety, volume, and veracity.
Data cleansing is a sub process of the data science process that focuses on removing errors in your
data so your data becomes a true and consistent representation of the processes.
18. Define the term setting the research goal (in the data science process).
Setting the research goal means defining the what, why, and how of your project in a project charter.
Retrieving data means finding and getting access to the data needed in your project. This data is either found within the company or retrieved from a third party.
It is the process of checking and remediating data errors, enriching the data with data from other data sources, and transforming it into a suitable format for your models.
Data exploration is the process of diving deeper into your data using descriptive statistics and visual techniques.
Model building is an iterative process that involves selecting the variables for the model, executing the model, and model diagnostics. It generally uses machine learning and statistical techniques to achieve the project goal.
Finally, present the results to the business. These results can take many forms, ranging from presentations to research reports. This step covers presenting the results to the stakeholders and industrializing the analysis process for repetitive reuse and integration with other tools.
24. What are the types of databases?
Column databases
Document stores
Streaming data
Key-value stores
SQL on Hadoop
New SQL
Graph databases.
In this method, data is stored in columns, which allows algorithms to perform much faster queries. Newer technologies use cell-wise storage. Table-like structures are still important.
Document stores no longer use tables, but store every observation in a document. This allows for a
much more flexible data scheme.
Data is collected, transformed, and aggregated not in batches but in real time. Although we’ve
categorized it here as a database to help you in tool selection, it’s more a particular type of problem
that drove creation of technologies such as Storm.
Data isn't stored in a table; rather, you assign a key to every value, such as org.marketing.sales.2015: 20000. This approach scales well but places almost all the implementation effort on the developer.
Batch queries on Hadoop are written in a SQL-like language that uses the MapReduce framework in the background.
This class combines the scalability of NoSQL databases with the advantages of relational
databases. They all have a SQL interface and a relational data model.
32. Explain Graph databases?
Not every problem is best stored in a table. Particular problems are more naturally translated into
graph theory and stored in graph databases. A classic example of this is a social network.
UNIT II
PART A
Uses:
A frequency distribution helps us to detect any pattern in the data (assuming a pattern exists) by
superimposing some order on the inevitable variability among observations.
When observations are sorted into classes of more than one value, the result is referred to as a
frequency distribution for grouped data.
When observations are sorted into classes of single values the result is referred to as a frequency
distribution for ungrouped data.
Cumulative frequency distributions show the total number of observations in each class and in all
lower-ranked classes. This type of distribution can be used effectively with sets of scores, such as
test scores for intellectual or academic aptitude, when relative standing within the distribution
assumes primary importance.
6. What is relative frequency distribution?
Relative frequency distributions show the frequency of each class as a part or fraction of the total
frequency for the entire distribution.
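The three kinds of distribution described above can be sketched with Python's standard library over a made-up set of scores:

```python
from collections import Counter

scores = [90, 92, 87, 88, 87, 92, 98, 90]
freq = Counter(scores)                      # frequency distribution (ungrouped)
n = len(scores)
rel = {k: v / n for k, v in freq.items()}   # relative frequency (fraction of total)

# Cumulative frequency: total observations in each class and all lower classes.
cum, total = {}, 0
for k in sorted(freq):
    total += freq[k]
    cum[k] = total

print(freq[87], rel[87], cum[98])
```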
The percentile rank of a score indicates the percentage of scores in the entire distribution with
similar or smaller values than that score.
8. What are histograms?
A histogram is the most commonly used graph to show frequency distributions. It is a bar-type graph for quantitative data. The common boundaries between adjacent bars emphasize the continuity of the data, as with continuous variables.
Equal units along the horizontal axis (the X axis, or abscissa) reflect the various class intervals of
the frequency distribution. Equal units along the vertical axis (the Y axis, or ordinate) reflect
increases in frequency. (The units along the vertical axis do not have to be the same width as those
along the horizontal axis.). The body of the histogram consists of a series of bars whose heights
reflect the frequencies for the various classes.
Frequency Polygon is a line graph for quantitative data that also emphasizes the continuity of
continuous variables. Frequency polygons may be constructed directly from frequency
distributions.
The mean is found by adding all scores and then dividing by the number of scores.
The median reflects the middle value when observations are ordered from least to most. The
median splits a set of ordered observations into two equal parts, the upper and lower halves. In
other words, the median has a percentile rank of 50, since observations with equal or smaller values
constitute 50 percent of the entire distribution.
The mode reflects the value of the most frequently occurring score. It is easy to assign a value to the mode if the data are organized.
14. What if a distribution has more than one mode or no mode at all?
Distributions can have more than one mode (or no mode at all). Distributions with two obvious
peaks, even though they are not exactly the same height, are referred to as bimodal. Distributions
with more than two peaks are referred to as multimodal. The presence of more than one mode
might reflect important differences among subsets of data.
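The measures of central tendency above can be illustrated with Python's statistics module (the scores below are made up for the example):

```python
import statistics as st

scores = [2, 3, 3, 5, 7, 7, 7, 9]
print(st.mean(scores))     # sum of all scores divided by the number of scores
print(st.median(scores))   # middle value of the ordered scores
print(st.mode(scores))     # the most frequently occurring score

# A bimodal data set: multimode returns every peak, not just one.
print(st.multimode([1, 1, 2, 2, 3]))
```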
Range: The range is the difference between the largest and smallest scores
Degrees of freedom (df) refers to the number of values that are free to vary, given one or more
mathematical restrictions, in a sample being used to estimate a population characteristic.
The normal curve is a theoretical curve defined for a continuous variable, and noted for its
symmetrical bell-shaped form.
The normal curve is symmetrical; its lower half is the mirror image of its upper half. Being bell shaped, the normal curve peaks above a point midway along the horizontal spread and then tapers off gradually in either direction from the peak (without actually touching the horizontal axis, since, in theory, the tails of a normal curve extend infinitely far).
19. What is Z-score?
A z score is a unit-free, standardized score that, regardless of the original units of measurement, indicates how many standard deviations a score is above or below the mean of its distribution:
z = (X − μ) / σ
where X is the original score and μ and σ are the mean and the standard deviation, respectively.
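A minimal sketch of the z-score computation:

```python
def z_score(x, mu, sigma):
    """Standard deviations that x lies above (+) or below (-) the mean mu."""
    return (x - mu) / sigma

# e.g. an IQ of 130 on a scale with mean 100 and standard deviation 15:
print(z_score(130, 100, 15))  # 2.0
```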
PART B & C
3. Specify the real limits for the lowest class interval in this frequency distribution
4. Analyze how graph can be used to represent qualitative and quantitative data?
5. Generate the ungrouped and grouped frequency table for the following data
90,92,87,88,87,92,98,90,90,87,87,88,88,89,90,87,89,92,92,92,98,90,95,87,87
(i) How many people scored 98?
(ii) How many people scored 90 or less?
(iii) What proportion scored 87?
6. (i) Calculate the sum of squares and the population standard deviation for the given x data values 13,10,11,7,9,11,9
(ii) Calculate the sample standard deviation for the given data 7,3,1,0,4
7. Suppose IQ scores have a bell-shaped distribution with a mean of 100 and a standard deviation of 15. Calculate the following:
(i) What percentage of people should have an IQ score between 85 and 115?
(ii) What percentage of people should have an IQ score between 70 and 130?
(iii) What percentage of people should have an IQ score more than 130?
(iv) A person with an IQ score greater than 145 is considered a genius. Does the empirical rule support this statement?
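As a sketch (not the official solution), the empirical-rule percentages behind question 7 can be checked against the exact normal-curve areas using the error function:

```python
import math

def pct_within(k):
    """Percentage of a normal distribution within k standard deviations of the mean."""
    return 100 * math.erf(k / math.sqrt(2))

print(round(pct_within(1), 1))               # ~68.3 -> IQ 85 to 115
print(round(pct_within(2), 1))               # ~95.4 -> IQ 70 to 130
print(round((100 - pct_within(2)) / 2, 2))   # ~2.28 -> IQ above 130 (one tail)
```

The empirical rule's rounded 68-95-99.7 figures are approximations of these exact areas.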
UNIT III
PART A
1. What is correlation?
Correlation is a statistical measure that expresses the extent to which two variables are linearly
related. It’s a common tool for describing simple relationships without making a statement about
cause and effect.
The sample correlation coefficient, r, quantifies the strength of the relationship. Correlations are also
tested for statistical significance.
2. Define Scatterplots?
A scatterplot is a graph containing a cluster of dots that represents all pairs of scores. With a little training, you can use any dot cluster as a preview of a fully measured relationship.
A correlation coefficient is a number between –1 and 1 that describes the relationship between pairs of variables. The correlation coefficient designated as r describes the linear relationship between pairs of variables for quantitative data.
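For illustration, r can be computed with NumPy's corrcoef function (the data here are made up):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])    # y is perfectly linear in x

# corrcoef returns the 2x2 correlation matrix; [0, 1] is r between x and y.
r = np.corrcoef(x, y)[0, 1]
print(r)
```

A perfect positive linear relationship yields r = 1, a perfect negative one yields r = −1, and no linear relationship yields r near 0.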
4. Define Regression.
A predictive modeling technique that evaluates the relation between the dependent (i.e., target) variable and independent variables is known as regression analysis. Regression analysis can be used for forecasting, time series modeling, or finding the relation between variables and predicting continuous values.
4. Logistic Regression
5. Ridge Regression
6. Lasso Regression
Linear regression is the most basic form of regression algorithm in machine learning. The model consists of a single independent variable and a dependent variable that share a linear relationship. We denote simple linear regression by the following equation:
y = mx + c + e
where m is the slope of the line, c is the intercept, and e represents the error in the model.
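A minimal sketch of fitting this line by least squares with NumPy; the data are synthetic, generated from known m and c so the recovered estimates can be checked:

```python
import numpy as np

# Synthetic data from y = 2x + 1 (slope m = 2, intercept c = 1) plus small error e.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 * x + 1 + rng.normal(0, 0.1, x.size)

# Least-squares fit of a degree-1 polynomial returns (slope, intercept).
m, c = np.polyfit(x, y, 1)
print(m, c)  # close to 2 and 1
```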
When the number of independent variables increases, the model is called multiple linear regression:
y = b0 + b1x1 + b2x2 + … + bnxn + e
Ridge Regression is another type of regression in machine learning and is usually used when there is a high correlation between the parameters. This is because as the correlation increases, the least-squares estimates remain unbiased but their variances grow large; ridge regression adds a penalty that trades a small bias for a much smaller variance.
The decision tree, as the name suggests, works on the principle of conditions. It is efficient and has strong algorithms used for predictive analysis. Its main attributes include internal nodes, branches, and terminal (leaf) nodes. Every internal node holds a “test” on an attribute, branches hold the conclusions of the test, and every leaf node holds a class label. It is used for both classification and regression, which are both supervised learning tasks.
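As an illustrative sketch (not a full decision-tree implementation), the “test” an internal node performs can be chosen by minimising Gini impurity over candidate thresholds on a single feature; the data below are invented:

```python
def gini(labels):
    """Gini impurity of a list of class labels: 0 means the list is pure."""
    n = len(labels)
    return 1 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_split(xs, ys):
    """Threshold t on one feature minimising the weighted Gini impurity
    of the two partitions x <= t and x > t. Returns (t, impurity)."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        if best is None or score < best[1]:
            best = (t, score)
    return best

xs = [1, 2, 3, 10, 11, 12]
ys = ["a", "a", "a", "b", "b", "b"]
print(best_split(xs, ys))  # splits cleanly at x <= 3
```

A real tree repeats this search recursively on each partition until the leaves are (nearly) pure.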
PART B & C

UNIT IV
PART A
import numpy as np
np.array([1,4,2,5,3])
OUTPUT: array([1,4,2,5,3])
i. np.array([3.14, 4, 2, 3])
ii. np.array([1, 2, 3, 4], dtype='float32')
iii. np.array([range(i, i + 3) for i in [2, 4, 6]])
iv. np.zeros(10, dtype=int)
v. np.ones((3, 5), dtype=float)
vi. np.full((3, 5), 3.14)
vii. np.arange(0, 20, 2)
viii. np.linspace(0, 1, 5)
ix. np.random.random((3, 3))
x. np.random.normal(0, 1, (3, 3))
OUTPUT:
i.array([3.14, 4. , 2. , 3. ])
Example:
val = np.array([1, np.nan, 3, 4])
print(val.dtype)
Output: dtype('float64')
9. How the operations can be performed on null values in pandas data structure?
There are several useful methods for detecting, removing, and replacing null values in Pandas data
structures.
They are:
isnull() - Generate a Boolean mask indicating missing values
notnull() - Opposite of isnull()
dropna() - Return a filtered version of the data
fillna() - Return a copy of the data with missing values filled or imputed
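A short example of these four methods on a made-up Series:

```python
import pandas as pd
import numpy as np

s = pd.Series([1, np.nan, 3, None])   # both np.nan and None count as null
print(s.isnull().tolist())            # Boolean mask of missing values
print(s.dropna().tolist())            # filtered version without nulls
print(s.fillna(0).tolist())           # copy with nulls imputed as 0
```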
The pivot table takes simple column-wise data as input, and groups the entries into a two-
dimensional table that provides a multidimensional summarization of the data.
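A minimal sketch with a made-up DataFrame: the rows of the input become one dimension of the table, the columns another, and the cell values are aggregated.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["A", "A", "B", "B"],
    "year": [2023, 2024, 2023, 2024],
    "sales": [10, 12, 7, 9],
})
table = df.pivot_table(values="sales", index="city", columns="year", aggfunc="sum")
print(table)
```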
PART B & C
1.Briefly explain the basics of numpy arrays with example
2.Describe about fancy indexing with example
3. Explain structured data in numpy array.
4. What is universal function? Explain clearly each function with example.
5. Explain aggregate functions with example
6. What is broadcasting? Explain the rules with examples.
7. Explain data objects in pandas.
8. Briefly explain the hierarchical indexing with examples
9. What is pivot table? Explain it clearly
UNIT V
PART A
1. What is the purpose of matplotlib?
Matplotlib is a cross-platform, data visualization and graphical plotting library for Python and its
numerical extension NumPy.
One of Matplotlib’s most important features is its ability to play well with many operating systems
and graphics backends.
The dual interfaces of matplotlib are: a convenient MATLAB-style state-based interface, and a more
powerful object-oriented interface.
import matplotlib.pyplot as plt
import numpy as np
plt.style.use('seaborn-whitegrid')
fig = plt.figure()
ax = plt.axes()
x = np.linspace(0, 10, 100)
ax.plot(x, np.sin(x))
A scatter plot can be created with the plt.plot function by passing a marker character instead of a line style, producing point-based plots. Example:
y = np.sin(x)
plt.plot(x, y, 'o', color='black')
Output:
A second, more powerful method of creating scatter plots is the plt.scatter function, which can be
used very similarly to the plt.plot function.
Example:
plt.scatter(x, y, marker='o')
The primary difference of plt.scatter from plt.plot is that it can be used to create scatter plots where
the properties of each individual point (size, face color, edge color, etc.) can be individually
controlled or mapped to data.
A contour plot is used to represent three-dimensional data in two dimensions. A contour plot can be created with the plt.contour function. It takes three arguments: a grid of x values, a grid of y values, and a grid of z values.
7. What are the functions can be used to draw the contour plots?
plt.contour, plt.contourf, and plt.imshow are the functions used to draw the contour plots.
Example:
x = np.linspace(0, 5, 50)
y = np.linspace(0, 5, 40)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)   # f is any function of two variables, e.g. np.sin(X) * np.cos(Y)
plt.contour(X, Y, Z, colors='black')
Output:
A simple histogram is very useful in understanding a dataset. It shows the frequency distribution of a single variable.
Example:
plt.style.use('seaborn-white')
data = np.random.randn(1000)
plt.hist(data)
Output:
10. How to create a three-dimensional wireframe plot?
Wireframes are three-dimensional plots that work on gridded data. A wireframe takes a grid of values and projects it onto the specified three-dimensional surface, and can make the resulting three-dimensional forms quite easy to visualize.
Example:
fig = plt.figure()
ax = plt.axes(projection='3d')
ax.plot_wireframe(X, Y, Z, color='black')
ax.set_title('wireframe')
Output:
A surface plot is like a wireframe plot, but each face of the wireframe is a filled polygon. Adding a
colormap to the filled polygons can aid perception of the topology of the surface being visualized.
Example:
ax = plt.axes(projection='3d')
ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap='viridis', edgecolor='none')
ax.set_title('surface')
Output:
Seaborn is an open-source Python library built on top of matplotlib. It is used for data visualization and
exploratory data analysis. Seaborn works easily with dataframes and the Pandas library. The graphs
created can also be customized easily.
Graphs can help us find data trends that are useful in any machine learning or forecasting project.
Visually attractive graphs can make presentations and reports much more appealing to the reader.
PART B & C
5. How can graphical data be plotted using matplotlib? Explain with an example.