Mark Anthony Legaspi - Module 2 - The Field of Data Science
City of Olongapo
GORDON COLLEGE
Olongapo City Sports Complex, Donor St., East Tapinac, Olongapo City
www.gordoncollege.edu.ph
Data Science
Module 2: The Field of Data Science
I. Introduction
This module is a follow-up on the fundamental concepts covered in the
previous module. You will find in this module a thorough discussion of the
different fields of data science, which are summarized using the data science
infographic available in your Google Classroom class.
The data science infographic is a good visualization of all the related fields of
data science that illustrates the what, when, why, where, who and how of
data science. It is expected that after completing this module, you will have a
greater understanding of the data science fields especially the tools that you
need to learn and apply data science concepts and techniques.
In fact, because no one definition fits the bill seamlessly, it is up to those who
do data science to define it.
Recognizing the need for a clear-cut explanation of data science, the 365
Data Science Team designed the What-Where-Who infographic. It defines
the key processes in data science and maps out the field. Here is our
interpretation of data science.
Of course, this might look like a lot of overwhelming information, but it really
isn’t.
Please watch Video Lecture 7 - Applying Traditional Data, Big Data, BI,
Traditional Data Science and ML
Please watch Video Lecture 8: The Reason Behind Studying Data Science
Disciplines.
Before anything else, there is always data. Data is the foundation of data
science; it is the material on which all the analyses are based. In the
context of data science, there are two types of data: traditional, and big
data.
Traditional data is structured data, most often stored in databases and
manageable from a single computer; it comes in a table format containing
numeric or text values.
Big data, on the other hand, is… bigger than traditional data, and not in
the trivial sense. From variety (numbers, text, but also images, audio,
mobile data, etc.), to velocity (retrieved and computed in real time), to
volume (measured in tera-, peta-, or exabytes), big data is usually
distributed across a network of computers.
That said, before being ready for processing, all data goes
through pre-processing. This is a necessary group of
operations that convert raw data into a format that is more
understandable and hence, useful for further processing.
Common processes are:
Collect raw data and store it on a server
This is untouched data that scientists cannot analyze straight
away. This data can come from surveys, or through the more
popular automatic data collection paradigm, like cookies on a
website.
Class-label the observations
This consists of arranging data by category or labelling data
points to the correct data type. For example, numerical, or
categorical.
Data cleansing/data scrubbing
Dealing with inconsistent data, like misspelled categories and
missing values.
Data balancing
If the data is unbalanced such that the categories contain an
unequal number of observations and are thus not
representative, applying data balancing methods, like
extracting an equal number of observations for each category,
and preparing that for processing, fixes the issue.
Data shuffling
Re-arranging data points to eliminate unwanted patterns and
improve predictive performance further on. This is applied
when, for example, the first 100 observations in the data are
from the first 100 people who have used a website; the data
isn’t randomized, and patterns due to sampling emerge.
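As a concrete sketch, the cleansing, balancing, and shuffling steps above can be combined on a small made-up dataset (the categories and values below are purely illustrative):

```python
import random
from collections import Counter

# Toy "raw" observations as (category, value) pairs; some records are
# inconsistent (a misspelled category, a missing value).
raw = [("yes", 10), ("Yes", 12), ("no", None), ("no", 7),
       ("no", 8), ("no", 9), ("yes", 11), ("no", 6)]

# Data cleansing: normalize the misspelled category, drop missing values.
clean = [(c.lower(), v) for c, v in raw if v is not None]

# Data balancing: keep an equal number of observations per category.
counts = Counter(c for c, _ in clean)
smallest = min(counts.values())
balanced = []
for cat in counts:
    balanced += [obs for obs in clean if obs[0] == cat][:smallest]

# Data shuffling: re-arrange the observations to remove ordering patterns.
random.seed(42)
random.shuffle(balanced)
```

After these steps the data is clean (no missing values), balanced (equal observations per category), and shuffled, ready for further processing.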
Please watch Video Lecture 11 - Techniques for Working with Big Data
Data Architects and Data Engineers (and Big Data Architects, and Big
Data Engineers, respectively) are crucial in the data science market.
The former creates the database from scratch; they design the way
data will be retrieved, processed, and consumed. Consequently, the
data engineer uses the data architects’ work as a stepping stone and
processes (pre-processes) the available data. They are the people who
ensure the data is clean and organized and ready for the analysts to
take over.
That said, once data processing is done, and the databases are clean
and organized, the real data science begins.
E. Data Science
There are also two ways of looking at data: with the intent to explain
behavior that has already occurred and for which you have gathered data;
or to use the data you already have in order to predict future
behavior that has not yet happened.
The starting point of all data science is data. Once the relevant data is
in the hands of the BI Analyst (monthly revenue, customer, sales
volume, etc.), they must quantify the observations, calculate KPIs and
examine measures to extract insights from their data.
a. Linear Regression
In data science, the linear regression model is used for quantifying
causal relationships among the different variables included in the
analysis. Like the relationship between house prices, the size of
the house, the neighborhood, and the year built. The model
calculates coefficients with which you can predict the price of a
new house, if you have the relevant information available.
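A minimal sketch of this idea, using made-up house sizes and prices (all numbers are illustrative): the least-squares coefficients are computed directly, then used to price a new house.

```python
# Made-up data: house size in square meters vs. price in $1000s.
sizes = [50, 70, 80, 100, 120]
prices = [150, 200, 220, 280, 320]

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# Least-squares estimates of the regression coefficients.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices))
         / sum((x - mean_x) ** 2 for x in sizes))
intercept = mean_y - slope * mean_x

# With the coefficients, the price of a new 90 m^2 house can be predicted.
predicted = intercept + slope * 90
```

A real analysis would of course include more predictors (neighborhood, year built), but the principle is the same: the fitted coefficients turn the relevant information about a new house into a price estimate.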
b. Logistic regression
Since it’s not possible to express all relationships between
variables as linear, data science makes use of methods like the
logistic regression to create non-linear models. Logistic regression
operates with 0s and 1s. Companies apply logistic regression
algorithms to filter job candidates during their screening process.
If the algorithm estimates that the probability that a prospective
candidate will perform well in the company within a year is above
50%, it would predict 1, or a successful application. Otherwise, it
will predict 0.
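A minimal sketch of this 50% decision rule, with made-up screening data (years of relevant experience vs. hiring outcome) and a plain gradient-descent fit; the data, learning rate, and epoch count are purely illustrative:

```python
import math

# Made-up screening data: years of relevant experience, and whether the
# hire performed well within a year (1) or not (0).
xs = [0.5, 1.0, 1.5, 3.0, 4.0, 5.0]
ys = [0, 0, 0, 1, 1, 1]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Fit one weight and a bias with plain stochastic gradient descent.
w, b = 0.0, 0.0
lr = 0.5
for _ in range(2000):
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)       # current probability estimate
        w -= lr * (p - y) * x        # gradient step on the log-loss
        b -= lr * (p - y)

def predict(x):
    # Predict 1 (a successful application) when the probability exceeds 50%.
    return 1 if sigmoid(w * x + b) > 0.5 else 0
```

The predict function outputs 1 only when the estimated probability exceeds 0.5, exactly the cutoff described above.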
c. Cluster analysis
This exploratory data science technique is applied when the
observations in the data form groups according to some criteria.
Cluster analysis takes into account that some observations exhibit
similarities, and facilitates the discovery of new significant
predictors, ones that were not part of the original
conceptualization of the data.
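One common clustering method is k-means; a minimal sketch on a single made-up feature (customer ages) shows the idea, and the groups that emerge could then serve as a new predictor:

```python
import statistics

# Made-up one-dimensional observations: customer ages that happen to
# form two natural groups.
ages = [19, 21, 22, 24, 45, 47, 50, 52]

# Minimal k-means for k = 2: assign each point to the nearest centroid,
# move each centroid to the mean of its cluster, and repeat.
centroids = [ages[0], ages[-1]]          # crude initialization
for _ in range(10):
    clusters = [[], []]
    for a in ages:
        nearest = min(range(2), key=lambda i: abs(a - centroids[i]))
        clusters[nearest].append(a)
    centroids = [statistics.mean(c) for c in clusters]
```

After a few iterations the centroids settle at the two group means, splitting the customers into a younger and an older segment.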
d. Factor analysis
If clustering is about grouping observations together, factor
analysis is about grouping features together. Data science resorts
to using factor analysis to reduce the dimensionality of a problem.
For example, if in a 100-item questionnaire each 10 questions
pertain to a single general attitude, factor analysis will identify
these 10 factors, which can then be used for a regression that will
deliver a more interpretable prediction. A lot of the techniques in
data science are integrated like this.
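Real factor analysis estimates which items load on which factor from the correlations in the data; the simplified sketch below assumes the grouping is already known (a shortened 30-item questionnaire with made-up answers) and merely collapses each block of related items into one score, which is the dimensionality reduction the text describes:

```python
# One respondent's made-up answers to a 30-item questionnaire, where
# each consecutive block of 10 items measures a single general attitude.
answers = [4, 5, 3, 4, 4, 5, 4, 3, 4, 4,   # attitude A items
           2, 1, 2, 2, 3, 1, 2, 2, 1, 2,   # attitude B items
           5, 5, 4, 5, 5, 4, 5, 5, 4, 5]   # attitude C items

# Collapse each block of 10 items into a single factor score, reducing
# 30 features to 3; these scores could then feed a regression.
block = 10
factors = [sum(answers[i:i + block]) / block
           for i in range(0, len(answers), block)]
```

Instead of 30 correlated predictors, a regression now works with 3 interpretable attitude scores.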
The data analyst, on the other hand, is the person who prepares
advanced types of analyses that explain the patterns in the data that have
already emerged, and oversees the basic part of the predictive analytics.
The main advantage machine learning has over any of the traditional data
science techniques is the fact that at its core resides the algorithm. These
are the directions a computer uses to find a model that fits the data as
well as possible. The difference between machine learning and traditional
data science methods is that we do not give the computer instructions on
how to find the model; it takes the algorithm and uses its directions to
learn on its own how to find said model. Unlike in traditional data science,
machine learning needs little human involvement. In fact, machine
learning, especially deep learning algorithms are so complicated, that
humans cannot genuinely understand what is happening “inside”.
Once the training is complete, the machine will be able to apply the
complex computational model it has learned to novel data, still with the
result of highly reliable predictions.
a. Supervised Learning
Supervised learning rests on using labeled data. The machine gets
data that is associated with a correct answer; if the machine’s
performance does not get that correct answer, an optimization
algorithm adjusts the computational process, and the computer
does another trial. Bear in mind that, typically, the machine does
this on 1000 data points at once.
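The trial-and-adjust loop described above can be sketched with a simple perceptron on a tiny labeled dataset (here, learning the logical AND; the learning rate and number of trials are arbitrary choices):

```python
# Tiny labeled dataset: inputs paired with their correct answers (AND).
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

w = [0.0, 0.0]                 # weights the machine adjusts
b = 0.0
lr = 0.1                       # size of each correction

for _ in range(20):            # repeated trials over the labeled data
    for (x1, x2), label in data:
        prediction = 1 if w[0] * x1 + w[1] * x2 + b > 0 else 0
        error = label - prediction     # zero when the answer was correct
        w[0] += lr * error * x1        # adjust only after a wrong answer
        w[1] += lr * error * x2
        b += lr * error
```

Each wrong answer triggers a small correction to the weights; after enough trials the model reproduces every correct answer in the training data.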
b. Unsupervised Learning
When the data is too big, or the data scientist is too pressed for time
or resources to label the data, or they do not know what the labels are
at all, data science resorts to unsupervised learning. This consists of
giving the machine unlabeled data and asking it to extract insights from
it. This often results in the data being divided in a certain way
according to its properties. In other words, it is clustered.
c. Reinforcement Learning
Instead of labeled data, reinforcement learning relies on a reward
system: the machine receives a reward whenever it performs well, and
it adjusts its behavior to maximize that reward.
b. Client retention
With machine learning algorithms, corporate organizations can
know which customers may purchase goods from them. This
means the store can offer discounts and a ‘personal touch’ in an
efficient way, minimizing marketing costs and maximizing profits.
A couple of prominent names come to mind: Google, and Amazon.
The machine learning engineer is the specialist who looks for ways to
apply state-of-the-art computational models, developed in the field of
machine learning, to solving complex problems such as business tasks,
data science tasks, computer vision, self-driving cars, robotics, and
so on.
Two main categories of tools are necessary to work with data and data
science: programming languages and software.
a. Programming languages in data science
Knowing a programming language enables the data scientist to
devise programs that can execute specific operations. The biggest
advantage programming languages have is that we can reuse the
programs created to execute the same action multiple times.
R and Python are the two most popular tools across all data
science sub-disciplines. Their biggest advantage is that they can
manipulate data and are integrated within multiple data and data
science software platforms. They are not just suitable for
mathematical and statistical computations; they are adaptable.
In fact, Python was deemed “the big Kahuna” of 2019 by IEEE (the
world’s largest technical professional organization for the
advancement of technology) and was listed at number 1 in its
annual interactive ranking of the Top 10 Programming Languages.
Big data in data science is handled with the help of R and Python,
of course, but people working in this area are often proficient in
other languages like Java or Scala. These two are very useful when
combining data from multiple sources.
Power BI, SAS, Qlik, and especially Tableau are top-notch examples
of software designed for business intelligence visualizations.
The organization that Dhanurjay Patil has worked for is the
White House Office of Science and Technology Policy.
Healthcare
Cybersecurity and the health ecosystem
Ending suicide and improving mental health
End cancer - Cancer Moonshot
Criminal justice
Increase trust between law enforcement & citizens - Police Data
Initiative - 44M Americans and 130 jurisdictions (launched by the
President in Camden New Jersey)
End the endless cycle of incarceration - Data Driven Justice
Initiative - 94M Americans, 141 jurisdictions and 10 States
Big data and Artificial Intelligence National Strategy
All data courses must have ethics & security
Ensuring data isn’t used for discrimination
Data and improving life of Americans
Using data to help local communities connect to opportunities -
launched the Opportunity Project
Addressing the drastic rise in traffic fatalities
Increasing Federal capacity to be data-driven
Helped establish ~40 Chief Data Scientists/Officers across the
Federal Government
Established the Data Cabinet and associated leadership group
across national security
National Security
Encryption
Bring Silicon Valley & the Pentagon closer together - helped
establish DIUx and the Defense Digital Service
c. What is his/her job title and how did he/she get the job? Its pre-
requisites? Skill set?
Dhanurjay Patil was appointed as the first U.S. Chief Data Scientist and
established the mission of the office: To responsibly unleash the power of
data for the benefit of the American public and maximize the nation’s
return on its investment in data.
Dhanurjay Patil stayed in the job for two years, from February 2015 to
January 2017. He is now a former U.S. Chief Data Scientist.
e. Is it a fulfilling job that aspiring data scientist must aim for? Why?
For me, I think it is really a fulfilling job if you are truly passionate
about it. We know that being a data scientist is not an easy job, because
it takes a lot of work and energy, and the expectations on your
performance are high. If you are happy working with data and math, then
you will be fulfilled in this job. The demand for data scientists is high,
and the world is generating a massive amount of data every day. It is a
fast-growing field and provides high salaries. Getting this job is a big
accomplishment, because only a few can get it, and those who do are true
data enthusiasts. If you really want this job, you won't care about how
hard it is; you will only focus on achieving it.
Your answer:
Link/Source https://www.linkedin.com/in/dpatil/
https://en.wikipedia.org/wiki/DJ_Patil
Traditional Data
What: DATA COLLECTION
PREPROCESSING:
- Collect raw data and store it on a server: this data comes from
surveys and is not yet touched by the scientist.
- Class-label the observations: arranging the data into its correct
data type, e.g. numerical or categorical.
- Data cleansing/data scrubbing: correcting misspelled data and
missing values.
- Data balancing: applying balancing methods, i.e. extracting an equal
number of observations for each category, and preparing the data for
processing.
- Data shuffling: eliminating unwanted patterns by re-arranging the
data.

Big Data
What: DATA COLLECTION
PREPROCESSING:
- Class-label the data: big data is extremely varied, so instead of
'numerical' vs 'categorical', the labels are 'text', 'digital image
data', etc.
- Data cleansing: detecting and correcting corrupt or inaccurate
records from a record set, table, or database.
- Data masking: the process of hiding original data with modified
content.
CASE SPECIFIC:
- Text data mining: the process of deriving high-quality information
from text.
- Confidentiality.

Business Intelligence
What: ANALYZE THE DATA, extract info, and present it in the form of:
- Metrics
- KPIs
- Reports
- Dashboards
in order to make data-driven recommendations.

Traditional Data Science Methods
- Linear regression: attempts to model the relationship between two
variables by fitting a linear equation to observed data.
- Logistic regression: a statistical analysis method used to predict a
data value based on prior observations of a data set; it operates
with 0s and 1s.
- Cluster analysis: a technique to group similar observations.
- Time series analysis: a time series is simply a series of data
points ordered in time.

Machine Learning
- Supervised learning: the input is provided as a labelled dataset, so
a model can learn from it to provide the result of the problem
easily. Examples: SVMs, NNs, deep learning, random forests, Bayesian
networks.
- Unsupervised learning: there is no complete and clean labelled
dataset; it is self-organized learning whose main aim is to explore
underlying patterns and predict the output. Examples: k-means, deep
learning.
- Reinforcement learning.
Given the table above, your task is to fill out the table and provide at
least one example for each data science field.
You should also provide a brief definition using your own words for all the
techniques and processes that will be part of your answer.
C. Explore
1. Based on the data science infographic discussed, fill out the table below
by listing down at least two (2) software tools or programming languages,
not given in the infographic, that can be used in each of the following data
science processes or techniques: (20 points)
Programming Languages:
-PL/SQL
-Perl
Amazon Redshift
Programming languages:
-Ruby
Programming languages:
-Julia
-GNU Octave
Sisense Software
Highcharts Software
Programming languages:
-C#
-Rust
V. References
1. Udemy. 2020. “Complete Data Science Training: Mathematics, Statistics, Python,
Advanced Statistics in Python, Machine & Deep Learning”. Retrieved from:
https://www.udemy.com/course/the-data-science-course-complete-data-
science-bootcamp/learn/lecture/
2. 365 Data Science. “Defining Data Science: The What, Where and How of Data
Science”. Retrieved from: https://365datascience.com/defining-data-science/