Mark Anthony Legaspi - Module 2 - The Field of Data Science

This document discusses the field of data science and its related topics. It begins by introducing the module and its learning objectives, which are to understand the different fields of data science, their applications, and the tools used. It then defines some key concepts. The document is organized into sections that describe the different types of data used in data science, traditional and big data, as well as the processes involved in working with each type of data, including collecting, classifying, cleaning, and preprocessing the data for analysis. Real-life examples are also discussed.

Republic of the Philippines

City of Olongapo
GORDON COLLEGE
Olongapo City Sports Complex, Donor St., East Tapinac, Olongapo City
www.gordoncollege.edu.ph

Data Science
Module 2: The Field of Data Science

Name (LN,FN,MN): Legaspi, Mark Anthony A. Program/Yr/Block: BSIT-3A

I. Introduction
This module is a follow-up on the fundamental concepts covered in the
previous module. You will find in this module a thorough discussion of the
different fields of data science which is summarized using the data science
infographic available in your Google Classroom class.

The data science infographic is a good visualization of all the related fields of
data science that illustrates the what, when, why, where, who and how of
data science. It is expected that after completing this module, you will have a
greater understanding of the data science fields especially the tools that you
need to learn and apply data science concepts and techniques.

II. Learning Objectives


After completing this module, you should be able to:
1. Explain the different fields of data science in simple terms.
2. Determine when a specific data science field is applied.
3. Explain the importance of learning a specific data science field and its
relationship with other data science fields.
4. Enumerate and explain the different techniques and processes that are
involved in each field of data science.
5. Explore how each data science field is applied in real life.
6. Enumerate the different tools available to apply the data science
techniques and processes.
7. Differentiate the different job roles related to data science.

III. Topics and Key Concepts


Data Science is a term that escapes any single complete definition, which
makes it difficult to use, especially if the goal is to use it correctly. Most
articles and publications use the term freely, with the assumption that it is
universally understood. However, data science (its methods, goals, and
applications) evolves with time and technology. Twenty-five years ago, data
science referred to gathering and cleaning datasets and then applying
statistical methods to that data. By 2018, data science had grown into a field
that encompasses data analysis, predictive analytics, data mining, business
intelligence, machine learning, and much more.

In fact, because no one definition fits the bill seamlessly, it is up to those who
do data science to define it.

Prepared by: Mr. Arnie Armada



Recognizing the need for a clear-cut explanation of data science, the 365
Data Science Team designed the What-Where-Who infographic. We define
the key processes in data science and disseminate the field. Here is our
interpretation of data science.


Of course, this might look like a lot of overwhelming information, but it really
isn’t.

Please watch Video Lecture 6: The Data Science Infographic.

Please watch Video Lecture 7 - Applying Traditional Data, Big Data, BI,
Traditional Data Science and ML

Please watch Video Lecture 8: The Reason Behind Studying Data Science
Disciplines.

A. The Data in Data Science

Before anything else, there is always data. Data is the foundation of data
science; it is the material on which all the analyses are based. In the
context of data science, there are two types of data: traditional, and big
data.

Traditional data is data that is structured and stored in databases which
analysts can manage from one computer; it is in table format, containing
numeric or text values. Actually, the term “traditional” is something we
are introducing for clarity. It helps emphasize the distinction between big
data and other types of data.

Big data, on the other hand, is… bigger than traditional data, and not in
the trivial sense. From variety (numbers, text, but also images, audio,
mobile data, etc.), to velocity (retrieved and computed in real time), to
volume (measured in tera-, peta-, exa-bytes), big data is usually
distributed across a network of computers.

B. What do you do to Data in Data Science?


a. Traditional data in Data Science
Traditional data is stored in relational database management
systems.


That said, before being ready for processing, all data goes
through pre-processing. This is a necessary group of
operations that convert raw data into a format that is more
understandable and hence, useful for further processing.
Common processes are:
• Collect raw data and store it on a server
This is untouched data that scientists cannot analyze straight
away. This data can come from surveys, or through the more
popular automatic data collection paradigm, like cookies on a
website.
• Class-label the observations
This consists of arranging data by category or labelling data
points with the correct data type, for example, numerical or
categorical.
• Data cleansing/data scrubbing
Dealing with inconsistent data, like misspelled categories and
missing values.
• Data balancing
If the data is unbalanced, so that the categories contain an
unequal number of observations and are thus not
representative, data balancing methods fix the issue, for
example by extracting an equal number of observations for
each category and preparing that subset for processing.
• Data shuffling
Re-arranging data points to eliminate unwanted patterns and
improve predictive performance further on. This is applied
when, for example, the first 100 observations in the data come
from the first 100 people who have used a website; such data
isn’t randomized, and patterns due to sampling emerge.
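
The cleansing, balancing, and shuffling steps above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library; the sample records, the SPELLING_FIXES table, and the helper names are hypothetical and are not part of the module or of any specific tool.

```python
import random
from collections import defaultdict

# Hypothetical raw observations: (category, value); None marks a missing value.
raw = [
    ("mobile", 12.0), ("Mobile", 15.0), ("mobil", None),
    ("desktop", 30.0), ("desktop", 28.0), ("desktop", 31.0),
    ("desktop", 29.0), ("mobile", 14.0),
]

# Assumed mapping of known misspellings to the correct category.
SPELLING_FIXES = {"mobil": "mobile"}


def cleanse(rows):
    """Normalize category spelling and drop rows with missing values."""
    cleaned = []
    for category, value in rows:
        if value is None:
            continue  # simplest strategy: discard incomplete observations
        category = category.strip().lower()
        category = SPELLING_FIXES.get(category, category)
        cleaned.append((category, value))
    return cleaned


def balance(rows, rng):
    """Keep an equal number of observations per category (undersampling)."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[0]].append(row)
    smallest = min(len(group) for group in groups.values())
    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))
    return balanced


def shuffle(rows, rng):
    """Return the rows in random order to remove ordering patterns."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return shuffled


rng = random.Random(0)  # fixed seed so the run is repeatable
prepared = shuffle(balance(cleanse(raw), rng), rng)
```

Dropping incomplete rows and undersampling the larger category are the simplest possible choices here; real projects often fill in missing values or oversample the smaller category instead.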

Please watch Video Lecture 9 - Techniques for Working with Traditional Data

Please watch Video Lecture 10 - Real Life Examples of Traditional Data

b. Big Data in Data Science


When it comes to big data and data science, there is some
overlap with the approaches used in traditional data handling,
but there are also a lot of differences.
First of all, big data is stored on many servers and is far
more complex.


In order to do data science with big data, pre-processing is
even more crucial, as the data is far more complex.
You will notice that conceptually, some of the steps are similar
to traditional data pre-processing, but that’s inherent to
working with data.
• Collect the data
• Class-label the data
Keep in mind that big data is extremely varied, therefore
instead of ‘numerical’ vs ‘categorical’, the labels are ‘text’,
‘digital image data’, ‘digital video data’, ‘digital audio data’, and
so on.
• Data cleansing
The methods here are massively varied, too; for example, you
can verify that a digital image observation is ready for
processing; or a digital video, or…
• Data masking
When collecting data on a mass scale, this aims to ensure that
any confidential information in the data remains private,
without hindering the analysis and extraction of insight. The
process involves concealing the original data with random and
false data, allowing the scientist to conduct their analyses
without compromising private details. Naturally, the scientist
can do this to traditional data too, and sometimes does, but with
big data the information can be much more sensitive, which
makes masking a lot more urgent.
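
As a rough illustration of data masking, the sketch below replaces a confidential field with random tokens while keeping the rest of each record usable for analysis. The records, the field names, and the mask helper are all hypothetical; production masking tools are considerably more elaborate.

```python
import secrets

# Hypothetical customer records containing a confidential field ("name").
records = [
    {"name": "Ana Cruz", "city": "Olongapo", "purchase": 250.0},
    {"name": "Ben Reyes", "city": "Subic", "purchase": 90.0},
    {"name": "Ana Cruz", "city": "Olongapo", "purchase": 40.0},
]


def mask(rows, field):
    """Replace each distinct value of `field` with a random token.

    The same original value always maps to the same token, so analyses
    such as "total purchases per customer" still work on masked data.
    """
    tokens = {}
    masked = []
    for row in rows:
        original = row[field]
        if original not in tokens:
            token = "id_" + secrets.token_hex(4)
            while token in tokens.values():  # avoid the unlikely reuse of a token
                token = "id_" + secrets.token_hex(4)
            tokens[original] = token
        masked.append({**row, field: tokens[original]})
    return masked


masked_records = mask(records, "name")
```

Because the mapping from names to tokens is kept only inside the function, the masked records no longer reveal who the customers are, yet the two purchases by the same person remain linked.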

C. Where does data come from?


Traditional data may come from basic customer records, or historical
stock price information.

Big data, however, is all around us. A consistently growing number of
companies and industries use and generate big data. Consider online
communities, for example, Facebook, Google, and LinkedIn; or
financial trading data. Temperature measuring grids in various
geographical locations also amount to big data, as well as machine
data from sensors in industrial equipment. And, of course, wearable
tech.


Please watch Video Lecture 11 - Techniques for Working with Big Data

Please watch Video Lecture 12 - Real Life Examples of Big Data

D. Who handles the data?


The data specialists who deal with raw data and pre-processing, and with
creating and maintaining databases, go by different names.
Although their titles sound similar, there are palpable
differences in the roles they occupy. Consider the following.

Data Architects and Data Engineers (and Big Data Architects and Big
Data Engineers, respectively) are crucial in the data science market.
The former create the database from scratch; they design the way
data will be retrieved, processed, and consumed. The data engineer
then uses the data architect’s work as a stepping stone and
processes (pre-processes) the available data. Data engineers are the
people who ensure the data is clean, organized, and ready for the
analysts to take over.

The Database Administrator, on the other hand, is the person who
controls the flow of data into and from the database. Of course, with
Big Data almost the entirety of this process is automated, so there is
no real need for a human administrator. The Database Administrator
deals mostly with traditional data.

That said, once data processing is done, and the databases are clean
and organized, the real data science begins.

E. Data Science
There are also two ways of looking at data: with the intent to explain
behavior that has already occurred, and you have gathered data for it;
or to use the data you already have in order to predict future
behavior that has not yet happened.


F. Data Science explaining the past


a. Business Intelligence
Before data science jumps into predictive analytics, it must
look at the patterns of behavior the past provides, analyze
them to draw insight and inform the path for forecasting.
Business intelligence focuses precisely on this: providing data-
driven answers to questions like: How many units were sold?
In which region were the most goods sold? Which type of
goods sold where? How did the email marketing perform last
quarter in terms of click-through rates and revenue
generated? How does that compare to the performance in the
same quarter of last year?

Although Business Intelligence does not have “data science” in
its title, it is part of data science, and not in any trivial sense.

G. What does Business Intelligence do?


Of course, Business Intelligence Analysts can apply Data Science to
measure business performance. But in order for the Business
Intelligence Analyst to achieve that, they must employ specific data
handling techniques.

The starting point of all data science is data. Once the relevant data is
in the hands of the BI Analyst (monthly revenue, customer, sales
volume, etc.), they must quantify the observations, calculate KPIs and
examine measures to extract insights from their data.
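
To make the idea of quantifying observations and calculating KPIs concrete, here is a small sketch that answers a few of the BI questions listed earlier (units sold, best region, click-through rate). All figures and variable names are made up for illustration.

```python
from collections import defaultdict

# Hypothetical sales records: (region, units_sold, revenue).
sales = [
    ("North", 120, 2400.0),
    ("South", 80, 1800.0),
    ("North", 60, 1500.0),
    ("East", 100, 1900.0),
]

# Hypothetical email campaign figures for one quarter.
emails_delivered = 5000
email_clicks = 150

# How many units were sold?
total_units = sum(units for _, units, _ in sales)

# In which region were the most goods sold?
units_by_region = defaultdict(int)
for region, units, _ in sales:
    units_by_region[region] += units
top_region = max(units_by_region, key=units_by_region.get)

# Two simple KPIs: average revenue per unit and email click-through rate.
revenue_per_unit = sum(revenue for _, _, revenue in sales) / total_units
click_through_rate = email_clicks / emails_delivered
```

In practice a BI analyst would usually compute these measures with SQL or a BI tool rather than by hand, but the arithmetic behind each KPI is the same.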

a. Data Science is about telling a story


Apart from handling strictly numerical information, data
science, and specifically business intelligence, is about
visualizing the findings, and creating easily digestible images
supported only by the most relevant numbers. After all, all
levels of management should be able to understand the
insights from the data and inform their decision-making.


Business intelligence analysts create dashboards and reports,
accompanied by graphs, diagrams, maps, and other
comparable visualizations to present the findings relevant to
the current business objectives.

H. Where is business intelligence used?


a. Price optimization and data science
Notably, analysts apply data science to inform things like price
optimization techniques. They extract the relevant information in
real time, compare it with historical data, and take actions
accordingly. Consider hotel management behavior: management
raise room prices during periods when many people want to visit
the hotel and reduce them when the goal is to attract visitors in
periods with low demand.

b. Inventory management and data science


Data science, and business intelligence, are invaluable for handling
over and undersupply. In-depth analyses of past sales transactions
identify seasonality patterns and the times of the year with the
highest sales, which results in the implementation of effective
inventory management techniques that meet demands at
minimum cost.

Please watch Video Lecture 13 - Business Intelligence (BI) Techniques

Please watch Video Lecture 14 - Real Life Examples of Business Intelligence (BI)

I. Who does the BI branch of data science?


A BI analyst focuses primarily on analyses and reporting of historical
data.


The BI consultant is often just an ‘external’ BI analyst. Many companies
outsource their data science departments as they don’t need or want to
maintain one. BI consultants would be BI analysts had they been
employed; however, their job is more varied as they hop on and off
different projects. The dynamic nature of their role provides the BI
consultant with a different perspective, and whereas the BI analyst has
highly specialized knowledge (i.e., depth), the BI consultant contributes to
the breadth of data science.

The BI developer is the person who handles more advanced programming
tools, such as Python and SQL, to create analyses specifically designed for
the company. It is the third most frequently encountered job position in
the BI team.

J. Data Science predicting the future


Predictive analytics in data science rests on the shoulders of explanatory
data analysis, which is precisely what we were discussing up to this point.
Once the BI reports and dashboards have been prepared and insights have
been extracted from them, this information becomes the basis for predicting
future values. The accuracy of these predictions depends on the methods
used.

Recall the distinction between traditional data and big data in data
science. We can make a similar distinction regarding predictive analytics
and their methods: traditional data science methods vs. Machine Learning.
The former deal primarily with traditional data, and the latter with big data.

K. Traditional forecasting methods in Data Science: What are they?


Traditional forecasting methods comprise the classical statistical methods
for forecasting: linear regression analysis, logistic regression analysis,
clustering, factor analysis, and time series. The output of each of these
feeds into the more sophisticated machine learning analytics, but let’s
first review them individually.

A quick side-note. Some in the data science industry refer to several of
these methods as machine learning too, but in this module machine
learning refers to newer, smarter, better methods, such as deep learning.


a. Linear Regression
In data science, the linear regression model is used for quantifying
relationships among the different variables included in the
analysis, such as the relationship between house prices, the size of
the house, the neighborhood, and the year built. Whether such a
relationship is causal depends on how the data was gathered; the model
itself only measures association. The model calculates coefficients
with which you can predict the price of a new house, if you have the
relevant information available.
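
As a minimal sketch of how such coefficients are calculated, the example below fits a straight line by ordinary least squares using a single predictor (house size). The data is invented and deliberately lies exactly on a line; a realistic model would also include the neighborhood and the year built, and the function name fit_line is a hypothetical helper.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: house size in square meters and price in thousands.
sizes = [50, 80, 100, 120]
prices = [110, 170, 210, 250]

slope, intercept = fit_line(sizes, prices)

# Predict the price of a new 90-square-meter house from the coefficients.
predicted_price = intercept + slope * 90
```

Here the slope is the estimated price change per additional square meter, and the intercept is the baseline price the model assigns when the size is zero.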

b. Logistic regression
Since it’s not possible to express all relationships between
variables as linear, data science makes use of methods like the
logistic regression to create non-linear models. Logistic regression
operates with 0s and 1s. Companies apply logistic regression
algorithms to filter job candidates during their screening process.
If the algorithm estimates that the probability that a prospective
candidate will perform well in the company within a year is above
50%, it would predict 1, or a successful application. Otherwise, it
will predict 0.
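
The 0-or-1 decision described above can be sketched as follows. The coefficients are hypothetical stand-ins for values that would normally be estimated from past hiring data, and the two input features are assumptions made only for this illustration.

```python
import math

# Hypothetical coefficients, as if already estimated from past hiring data.
INTERCEPT = -4.0
WEIGHT_EXPERIENCE = 0.8   # per year of experience
WEIGHT_TEST_SCORE = 0.03  # per point on a 0-100 screening test


def success_probability(years_experience, test_score):
    """Logistic function applied to a linear combination of the inputs."""
    z = (INTERCEPT
         + WEIGHT_EXPERIENCE * years_experience
         + WEIGHT_TEST_SCORE * test_score)
    return 1.0 / (1.0 + math.exp(-z))


def predict(years_experience, test_score, threshold=0.5):
    """Return 1 (successful application) if the probability exceeds the threshold."""
    probability = success_probability(years_experience, test_score)
    return 1 if probability > threshold else 0


strong_candidate = predict(5, 80)
weak_candidate = predict(1, 40)
```

The logistic function squeezes any number into the range between 0 and 1, which is why its output can be read as a probability and compared against the 50% cut-off.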

c. Cluster analysis
This exploratory data science technique is applied when the
observations in the data form groups according to some criteria.
Cluster analysis takes into account that some observations exhibit
similarities, and facilitates the discovery of new significant
predictors, ones that were not part of the original
conceptualization of the data.
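
One common clustering technique is k-means, which the module does not name explicitly; the sketch below applies a bare-bones, one-dimensional version of it to invented customer spending figures. The function name and the starting centers are assumptions for illustration.

```python
def k_means_1d(values, centers, iterations=10):
    """Group numbers around the nearest center, then re-center, repeatedly."""
    centers = list(centers)
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for value in values:
            # Assign each observation to the closest current center.
            nearest = min(range(len(centers)),
                          key=lambda i: abs(value - centers[i]))
            clusters[nearest].append(value)
        # Move each center to the mean of its cluster (keep it if empty).
        centers = [
            sum(cluster) / len(cluster) if cluster else center
            for cluster, center in zip(clusters, centers)
        ]
    return centers, clusters


# Hypothetical monthly spend per customer.
spend = [12, 15, 14, 90, 95, 88]
centers, clusters = k_means_1d(spend, centers=[0, 100])
```

The two groups that emerge (low spenders and high spenders) were not labeled in advance; they are discovered from the similarity of the observations alone.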

d. Factor analysis
If clustering is about grouping observations together, factor
analysis is about grouping features together. Data science resorts
to using factor analysis to reduce the dimensionality of a problem.
For example, if every 10 questions in a 100-item questionnaire
pertain to a single general attitude, factor analysis will identify
these 10 factors, which can then be used for a regression that will
deliver a more interpretable prediction. A lot of the techniques in
data science are integrated like this.

e. Time series analysis


Time series is a popular method for following the development of
specific values over time. Experts in economics and finance use it
because their subject matter is stock prices and sales volume –
variables that are typically plotted against time.
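
A very simple time series technique is the moving average, sketched below as a way of producing a one-step-ahead forecast. The sales figures and the function name are hypothetical, and real forecasting models also account for trend and seasonality.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if len(series) < window:
        raise ValueError("series is shorter than the averaging window")
    recent = series[-window:]
    return sum(recent) / window


# Hypothetical monthly sales volume, ordered from oldest to newest.
monthly_sales = [100, 110, 105, 120, 130, 125]
forecast = moving_average_forecast(monthly_sales)
```

A longer window gives a smoother but slower-reacting forecast, while a shorter window follows recent changes more closely.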

Please watch Video Lecture 15 - Techniques for Working with Traditional Methods

Please watch Video Lecture 16 - Real Life Examples of Traditional Methods

L. Where does data science find application for traditional forecasting
methods?
The application of the corresponding techniques is extremely broad; data
science is finding a way into an increasingly large number of industries.
That said, two prominent fields deserve to be part of the discussion.

a. User experience (UX) and data science


When companies launch a new product, they often design surveys
that measure the attitudes of customers towards that product.
Analyzing the results after the BI team has generated their
dashboards includes grouping the observations into segments
(e.g. regions), and then analyzing each segment separately to
extract meaningful predictive coefficients. The results of these
operations often corroborate the conclusion that the product
needs slight but significantly different adjustments in each
segment in order to maximize customer satisfaction.

b. Forecasting sales volume


This is the type of analysis where time series comes into play.
Sales data has been gathered until a certain date, and the data
scientist wants to know what is likely to happen in the next sales
period, or a year ahead. They apply mathematical and statistical
models and run multiple simulations; these simulations provide
the analyst with future scenarios. This is at the core of data
science, because based on these scenarios, the company can
make better predictions and implement adequate strategies.

M. Who uses traditional forecasting methods?


The data scientist. But bear in mind that this title also applies to the
person who employs machine learning techniques for analytics, too. A lot
of the work spills from one methodology to the other.


The data analyst, on the other hand, is the person who prepares
advanced types of analyses that explain the patterns in the data that have
already emerged and oversees the basic part of the predictive analytics.

N. Machine Learning and Data Science


Machine learning is the state-of-the-art approach to data science. And
rightly so.

The main advantage machine learning has over any of the traditional data
science techniques is the fact that at its core resides the algorithm. These
are the directions a computer uses to find a model that fits the data as
well as possible. The difference between machine learning and traditional
data science methods is that we do not give the computer instructions on
how to find the model; it takes the algorithm and uses its directions to
learn on its own how to find said model. Unlike in traditional data science,
machine learning needs little human involvement. In fact, machine
learning algorithms, especially deep learning algorithms, are so complicated
that humans cannot genuinely understand what is happening “inside”.

O. What is machine learning in data science?


A machine learning algorithm is like a trial-and-error process, but the
special thing about it is that each consecutive trial is at least as good as
the previous one. Bear in mind that in order to learn well, the
machine has to go through hundreds of thousands of trials,
with the frequency of errors decreasing throughout.

Once the training is complete, the machine will be able to apply the
complex computational model it has learned to novel data and still
produce highly reliable predictions.

There are three major types of machine learning: supervised,
unsupervised, and reinforcement learning.


a. Supervised Learning
Supervised learning rests on using labeled data. The machine gets
data that is associated with a correct answer; if the machine’s
performance does not get that correct answer, an optimization
algorithm adjusts the computational process, and the computer
does another trial. Bear in mind that, typically, the machine does
this on 1000 data points at once.

Support vector machines, neural networks, deep learning, random
forest models, and Bayesian networks are all instances of
supervised learning.
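
The loop of "predict, compare with the correct answer, adjust, try again" can be illustrated with a perceptron, one of the simplest supervised learners. It is used here purely as an example and is not a method the module singles out; the toy data and function names are hypothetical.

```python
# Labeled data: (features, correct answer). Hypothetical toy examples in
# which the answer is 1 only when both features are 1.
labeled = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]


def predict(weights, bias, features):
    """Return 1 if the weighted sum of the features is positive, else 0."""
    total = bias + sum(w * x for w, x in zip(weights, features))
    return 1 if total > 0 else 0


def train(data, learning_rate=0.1, max_epochs=100):
    """Adjust weights after every wrong answer until a full pass has no errors."""
    weights = [0.0] * len(data[0][0])
    bias = 0.0
    for _ in range(max_epochs):
        errors = 0
        for features, answer in data:
            error = answer - predict(weights, bias, features)
            if error != 0:
                errors += 1
                # Nudge the model toward the correct answer.
                weights = [w + learning_rate * error * x
                           for w, x in zip(weights, features)]
                bias += learning_rate * error
        if errors == 0:
            break  # every labeled example is now answered correctly
    return weights, bias


weights, bias = train(labeled)
```

This sketch updates the model after every single example; as noted above, practical systems usually process many data points at once.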

b. Unsupervised Learning
When the data is too big, or the data scientist is under too much
pressure for resources to label the data, or they do not know what
the labels are at all, data science resorts to using unsupervised
learning. This consists of giving the machine unlabeled data and
asking it to extract insights from it. This often results in the data
being divided in a certain way according to its properties. In other
words, it is clustered.

Unsupervised learning is extremely effective for discovering
patterns in data, especially things that humans using traditional
analysis techniques would miss.

Data science often makes use of supervised and unsupervised
learning together, with unsupervised learning labelling the data,
and supervised learning finding the best model to fit the data. One
instance of this is semi-supervised learning.

c. Reinforcement Learning


This is a type of machine learning where the focus is on
performance (to walk, to see, to read), instead of accuracy.
Whenever the machine performs better than it has before, it
receives a reward, but if it performs sub-optimally, the
optimization algorithms do not adjust the computation. Think of a
puppy learning commands. If it follows the command, it gets a
treat; if it doesn’t follow the command, the treat doesn’t come.
Because treats are tasty, the dog will gradually improve in
following commands. That said, instead of minimizing an error,
reinforcement learning maximizes a reward.
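
The idea of maximizing a reward can be sketched with a small "multi-armed bandit" experiment, a standard textbook setting that the module itself does not mention. The learner repeatedly chooses between actions, receives a reward of 1 or 0, and gradually favors the action that pays off more often. The reward probabilities and the function name are assumptions for illustration.

```python
import random


def run_bandit(reward_probabilities, steps=2000, epsilon=0.1, seed=0):
    """Epsilon-greedy learner: mostly exploit the best-known action,
    occasionally explore a random one."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    n_actions = len(reward_probabilities)
    counts = [0] * n_actions       # how often each action was chosen
    estimates = [0.0] * n_actions  # running average reward per action
    total_reward = 0
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore
        else:
            action = max(range(n_actions), key=lambda a: estimates[a])  # exploit
        # The environment hands out a reward (the "treat") with some probability.
        reward = 1 if rng.random() < reward_probabilities[action] else 0
        counts[action] += 1
        estimates[action] += (reward - estimates[action]) / counts[action]
        total_reward += reward
    return estimates, counts, total_reward


# Hypothetical setting: action 1 is rewarded far more often than action 0.
estimates, counts, total_reward = run_bandit([0.2, 0.8])
```

No correct answers are ever shown to the learner; it only sees rewards, and its behavior improves because choosing the better action earns more of them.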

Please watch Video Lecture 17 - Machine Learning Techniques

Please watch Video Lecture 18 - Types of Machine Learning


P. Where is Machine Learning applied in the world of data science &
business?
a. Fraud detection
With machine learning, specifically supervised learning, banks can
take past data, label the transactions as legitimate, or fraudulent,
and train models to detect fraudulent activity. When these models
detect even the slightest probability of theft, they flag the
transactions, and prevent the fraud in real time.

b. Client retention
With machine learning algorithms, corporate organizations can
predict which customers are likely to purchase goods from them. This
means the store can offer discounts and a ‘personal touch’ in an
efficient way, minimizing marketing costs and maximizing profits.
A couple of prominent names come to mind: Google and Amazon.

Please watch Video Lecture 19 - Real Life Examples of Machine Learning

Q. Who uses machine learning in data science?


As mentioned above, the data scientist is deeply involved in designing
machine learning algorithms, but there is another star on this stage.

The machine learning engineer. This is the specialist who is looking for
ways to apply state-of-the-art computational models developed in the
field of machine learning to solving complex problems such as business
tasks, data science tasks, computer vision, self-driving cars, robotics, and
so on.

R. Programming languages and Software in data science


Two main categories of tools are necessary to work with data and data
science: programming languages and software.
a. Programming languages in data science
Knowing a programming language enables the data scientist to
devise programs that can execute specific operations. The biggest
advantage programming languages have is that we can reuse the
programs created to execute the same action multiple times.

R, Python, and MATLAB, combined with SQL, cover most of the
tools used when working with traditional data, BI, and
conventional data science.

R and Python are the two most popular tools across all data
science sub-disciplines. Their biggest advantage is that they can
manipulate data and are integrated within multiple data and data
science software platforms. They are not just suitable for
mathematical and statistical computations; they are adaptable.

In fact, Python was deemed “the big Kahuna” of 2019 by IEEE (the
world’s largest technical professional organization for the
advancement of technology) and was listed at number 1 in its
annual interactive ranking of the Top 10 Programming Languages.

SQL is king, however, when it comes to working with relational
database management systems, because it was specifically
created for that purpose.

SQL is at its most advantageous when working with traditional,
historical data, for example when preparing a BI analysis.
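
As a small illustration of SQL applied to historical data, the sketch below uses Python's built-in sqlite3 module to create an in-memory relational table and total the revenue per region. The table name, columns, and figures are hypothetical.

```python
import sqlite3

# An in-memory relational database with a hypothetical sales table.
connection = sqlite3.connect(":memory:")
connection.execute(
    "CREATE TABLE sales (region TEXT, quarter TEXT, revenue REAL)"
)
connection.executemany(
    "INSERT INTO sales VALUES (?, ?, ?)",
    [
        ("North", "2019-Q1", 1200.0),
        ("North", "2019-Q2", 1500.0),
        ("South", "2019-Q1", 900.0),
        ("South", "2019-Q2", 700.0),
    ],
)

# A typical BI-style question: total revenue per region, highest first.
rows = connection.execute(
    "SELECT region, SUM(revenue) AS total "
    "FROM sales GROUP BY region ORDER BY total DESC"
).fetchall()
connection.close()
```

The query itself is plain SQL and would look much the same in any relational database management system; Python is used here only to run it.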

MATLAB is the fourth most indispensable tool for data science. It
is ideal for working with mathematical functions or matrix
manipulations.

Big data in data science is handled with the help of R and Python,
of course, but people working in this area are often proficient in
other languages like Java or Scala. These two are very useful when
combining data from multiple sources.

JavaScript, C, and C++, in addition to the ones mentioned above,
are often employed when the branch of data science the specialist
is working in involves machine learning. They are faster than R and
Python and provide greater freedom.

b. Software in data science


In data science, software, or software solutions, are tools
adjusted for specific business needs.

Excel is a tool applicable to more than one category: traditional
data, BI, and Data Science. Similarly, SPSS is a very famous tool for
working with traditional data and applying statistical analysis.

Apache Hadoop, Apache HBase, and MongoDB, on the other
hand, are software designed for working with big data.

Power BI, SAS, Qlik, and especially Tableau are top-notch examples
of software designed for business intelligence visualizations.

In terms of predictive analytics, EViews is mostly used for working
with econometric time-series models, and Stata for academic
statistical and econometric research, where techniques like
regression, cluster, and factor analysis are constantly applied.

Please watch Video Lecture 20 - Programming Languages Used in Data Science

Please watch Video Lecture 21 - Careers in Data Science

Please watch Video Lecture 22 - Debunking Common Misconceptions about Data Science

IV. Learning Tasks


A. Engage (30 points)
1. You are tasked to know someone who is working as a data scientist or any
related job role. You may conduct an interview, search and read Internet
articles/blogs or watch a documentary or vlog. What you need to do is to
describe and document the job role(s) of your subject. You must be able
to answer the following questions:
a. What is the name of the company or organization he/she is working
for?

• The organization that Dhanurjay Patil has worked for is the
White House Office of Science and Technology Policy.

• He also worked as VP of RelateIQ and Head of Data Products and
Chief Scientist at LinkedIn. He also held different positions
at PayPal, eBay, and Skype.

b. What is (are) the main products/services of the company he/she is
working for?


• The White House is the official residence and workplace of the president of the United States. He worked there as Deputy Chief Technology Officer for Data Policy and Chief Data Scientist. The areas he worked on at the White House Office of Science and Technology Policy included:

• Healthcare
-Cybersecurity and the health ecosystem
-Ending suicide and improving mental health
-End cancer - Cancer Moonshot
• Criminal justice
-Increase trust between law enforcement and citizens - Police Data Initiative - 44M Americans and 130 jurisdictions (launched by the President in Camden, New Jersey)
-End the endless cycle of incarceration - Data-Driven Justice Initiative - 94M Americans, 141 jurisdictions, and 10 states
• Big data and Artificial Intelligence National Strategy
-All data courses must have ethics and security
-Ensuring data isn’t used for discrimination
• Data and improving the lives of Americans
-Using data to help local communities connect to opportunities - launched the Opportunity Project
-Addressing the drastic rise in traffic fatalities
• Increasing Federal capacity to be data-driven
-Helped establish ~40 Chief Data Scientists/Officers across the Federal Government
-Established the Data Cabinet and associated leadership group across national security
• National security
-Encryption
-Bring Silicon Valley and the Pentagon closer together - helped establish DIUx and the Defense Digital Service

c. What is his/her job title and how did he/she get the job? What are its prerequisites? Skill set?

• Dhanurjay Patil was appointed as the first U.S. Chief Data Scientist and established the mission of the office: to responsibly unleash the power of data for the benefit of the American public and maximize the nation’s return on its investment in data.

• He got the job because of his education and work experience. He graduated from the University of California, San Diego with a bachelor’s degree in mathematics and earned a PhD in applied mathematics from the University of Maryland, College Park. He worked as a research scientist at eBay, PayPal, and Skype from July 2006 to May 2008. He then worked at LinkedIn from May 2008 to March 2011 as Head of Data Products, Chief Scientist, and Chief Security Officer. He was also a Data Scientist in Residence at Greylock. This extensive experience in data science is why he was hired at the White House as Chief Data Scientist.

• To become a data scientist, you typically need to earn a bachelor's degree in IT, computer science, math, physics, or another related field; earn a master's degree in data science or a related field; and gain experience in the field you intend to work in (e.g., healthcare, physics, business).

• A data scientist needs technical skills such as Python coding, the Hadoop platform, SQL databases, R programming, Apache Spark, machine learning and AI, data visualization, and working with unstructured data. The role also requires non-technical skills like intellectual curiosity, business acumen, good communication skills, and teamwork.

d. How long did he/she stay on the job?

• Dhanurjay Patil stayed in the job for about two years, from February 2015 to January 2017. He is now a former U.S. Chief Data Scientist.

e. Is it a fulfilling job that aspiring data scientists should aim for? Why?

• For me, it is a fulfilling job if you are really passionate about it. Being a data scientist is not easy: it takes a lot of work and energy, and the expectations for your performance are high. If you are happy working with data and math, then you will be fulfilled in this job. The demand for data scientists is high, and the world generates a massive amount of data every day. It is a fast-growing field that offers high salaries. Getting this job is a big accomplishment because only a few people get it, and those who do are data enthusiasts. If you really want this job, you won’t care how hard it is; you will focus only on achieving it.

Your answer:
Link/Source: https://www.linkedin.com/in/dpatil/
https://en.wikipedia.org/wiki/DJ_Patil

Name of your Subject: Dhanurjay “DJ” Patil

Job Title/Role: U.S. Chief Data Scientist and Deputy Chief Technology Officer for Data Policy

Company/Organization: White House Office of Science and Technology Policy

Company Location: Eisenhower Executive Office Building, 725 17th Street NW, Washington, D.C., U.S.


Salient Points/Summary:

Dhanurjay “DJ” Patil (born August 3, 1974) is an American mathematician and computer scientist who served as the Chief Data Scientist of the United States Office of Science and Technology Policy from 2015 to 2017. He has held a variety of roles in academia, industry, and government. He is Head of Technology for Devoted Health, a Senior Fellow at the Belfer Center at the Harvard Kennedy School, and an advisor to Venrock Partners.

Dr. Patil was appointed by President Obama to be the first U.S. Chief Data Scientist, where his efforts led to the establishment of nearly 40 Chief Data Officer roles across the Federal government. He also established new health care programs, including the Precision Medicine Initiative and the Cancer Moonshot, and new criminal justice reforms, including the Data-Driven Justice and Police Data Initiatives that cover more than 94 million Americans, as well as leading the national data efforts. He has also been active in national security, and for his efforts Secretary Carter awarded him the Department of Defense Medal for Distinguished Public Service, which is the highest honor the department bestows on a civilian.

In industry, he led the product teams at RelateIQ, which was acquired by Salesforce; was a founding board member of Crisis Text Line, which works to use new technologies to provide on-demand mental health and crisis support; and was a member of the venture firm Greylock Partners. He has also been Chief Scientist, Chief Security Officer, and Head of Analytics and Data Product Teams at the LinkedIn Corporation, where he co-coined the term Data Scientist. He has also held a number of roles at Skype, PayPal, and eBay.


Photo

(Optional, if you did an interview) Attach:


1. Your guide questions
2. Transcript of interview
B. Explain
1. The Data Science Infographic (50 points)
Columns: Data | Big Data | Business Intelligence | Traditional Methods | Machine Learning

When
Data: This is structured data stored in databases that analysts can manage from one computer. It is applied at the beginning of the analysis.
Big Data: This is also applied at the beginning of the analysis, but its data are bigger than traditional data.
Business Intelligence: After the data has been gathered and organized, BI looks at the patterns of past behavior to draw insights.
Traditional Methods: After the BI reports and dashboards have been prepared and insights have been extracted, these become the basis for predicting future values.
Machine Learning: After BI reports have been created and discussed. It is also applied when there is more limited, structured data available.

Why
Data: Data-driven decisions require well-organized and relevant raw data stored in a digital format, although preparing it can increase the time before business value can be realized from the data.
Big Data: It is more accurate, and data-driven decisions require well-organized and relevant raw data stored in a digital format.
Business Intelligence: It uses reports and dashboards to easily visualize the data and gain business insights.
Traditional Methods: Predictive analytics
-It uses advanced statistical methods to evaluate future scenarios.
Machine Learning: Predictive analytics
-It utilizes artificial intelligence to predict behavior in unprecedented ways. It helps us handle immense amounts of data by letting the computer analyze them and make data-driven recommendations.

What
Data: DATA COLLECTION AND PREPROCESSING
Collect raw data and store it on a server
-This is data that comes from surveys and has not yet been touched by the scientist.
Class-label the observations
-This is arranging the data by their correct data type, either numerical or categorical.
Data cleansing/data scrubbing
-Correcting misspelled data and missing values.
Data balancing
-This is applying data balancing methods that extract an equal number of observations for each category, which prepares the data for fixing such issues.
Data shuffling
-This is eliminating unwanted patterns by randomly reordering the data points, which can also improve predictive performance.
CASE SPECIFIC
-Balancing and shuffling data sets
-Entity Relationship Diagram
-Relational Diagram

Big Data: DATA COLLECTION AND PREPROCESSING
Class-label the data
-Big data is extremely varied. Instead of ‘numerical’ vs. ‘categorical’, the labels are ‘text’, ‘digital image data’, etc.
Data cleansing
-Detecting and correcting corrupt or inaccurate records from a record set, table, or database.
Data masking
-The process of hiding original data with modified content.
CASE SPECIFIC
Text data mining
-The process of deriving high-quality information from text.
Confidentiality

Business Intelligence: ANALYZE THE DATA
Extract information and present it in the form of:
-Metrics
-KPIs
-Reports
-Dashboards

Traditional Methods:
Linear regression
-Linear regression attempts to model the relationship between two variables by fitting a linear equation to observed data.
Logistic regression
-Logistic regression is a statistical analysis method used to predict a data value based on prior observations of a data set. It operates with 0s and 1s.
Cluster analysis
-Cluster analysis is a technique to group similar observations into a number of clusters based on the observed values of several variables for each individual.
Factor analysis
-Factor analysis is a way to take a mass of data and shrink it to a smaller data set that is more manageable and more understandable.
Time series analysis
-A time series is simply a series of data points ordered in time.

Machine Learning:
Supervised learning
-The input is provided as a labelled dataset, so a model can learn from it to provide the result of the problem.
-SVMs
-NNs
-Deep learning
-Random forests
-Bayesian networks
Unsupervised learning
-There is no complete and clean labelled dataset in unsupervised learning. Unsupervised learning is self-organized learning. Its main aim is to explore the underlying patterns and predict the output.
-k-means
-Deep learning
Reinforcement learning
-It is based on neither supervised nor unsupervised learning. Here the algorithms learn to react to an environment on their own. It is growing rapidly and producing a variety of learning algorithms.

Where
Data: -Basic Customer Data; -Historical Stock Price Data
Big Data: -Social Media; -Financial Trading Data
Business Intelligence: -Price Optimization; -Inventory Management
Traditional Methods: -User Experience (UX); -Sales Forecasting
Machine Learning: -Fraud Detection; -Client Retention

Who
Data: -Data Architect; -Database Engineer; -Database Administrator
Big Data: -Big Data Architect; -Big Data Engineer
Business Intelligence: -BI Analyst; -BI Consultant; -BI Developer
Traditional Methods: -Data Scientist; -Data Analyst
Machine Learning: -Data Scientist; -Machine Learning Engineer

How
Data:
Programming languages: -R, Python, SQL, and MATLAB
Software: -Excel and IBM SPSS
Big Data:
Programming languages: -R, Python, Java, and Scala
Software: -Apache Hadoop, HBase, and MongoDB
Business Intelligence:
Programming languages: -R, Python, SQL, and MATLAB
Software: -Excel, Qlik, Power BI, Tableau, and SAS
Traditional Methods:
Programming languages: -R, Python, and MATLAB
Software: -Excel, IBM SPSS, EViews, and Stata
Machine Learning:
Programming languages: -R, Python, Java, MATLAB, JavaScript, C, C++, and Scala
Software: -Microsoft Azure and RapidMiner
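The preprocessing steps listed for traditional data in the table (class labeling, cleansing, balancing, shuffling) can be sketched in Python, one of the languages the table names. This is a minimal illustration under stated assumptions: the survey records, the field names, and the helper functions label_type, cleanse, and balance are all made up for the example, and it uses only the standard library rather than any tool mentioned in this module.

```python
import random
from collections import defaultdict

# Hypothetical raw survey records; None marks a missing value.
raw = [
    {"age": "34", "city": " olongapo ", "churned": "yes"},
    {"age": None, "city": "Manila", "churned": "no"},
    {"age": "41", "city": "MANILA", "churned": "no"},
    {"age": "29", "city": "Olongapo", "churned": "no"},
    {"age": "52", "city": "manila", "churned": "yes"},
    {"age": "38", "city": "Olongapo", "churned": "no"},
]


def label_type(values):
    """Class-label a column as 'numerical' or 'categorical'."""
    present = [v for v in values if v is not None]
    try:
        for v in present:
            float(v)
        return "numerical"
    except ValueError:
        return "categorical"


def cleanse(records):
    """Drop records with missing values and normalize the text fields."""
    cleaned = []
    for r in records:
        if any(v is None for v in r.values()):
            continue
        cleaned.append({
            "age": int(r["age"]),
            "city": r["city"].strip().title(),
            "churned": r["churned"],
        })
    return cleaned


def balance(records, key, rng):
    """Undersample so every category of `key` has the same number of records."""
    groups = defaultdict(list)
    for r in records:
        groups[r[key]].append(r)
    n = min(len(g) for g in groups.values())
    balanced = []
    for g in groups.values():
        balanced.extend(rng.sample(g, n))
    return balanced


rng = random.Random(0)  # fixed seed so the sketch is repeatable
column_types = {k: label_type([r[k] for r in raw]) for k in raw[0]}
clean = cleanse(raw)
balanced = balance(clean, "churned", rng)
rng.shuffle(balanced)  # shuffling removes any pattern left by the original order
print(column_types)
print(len(clean), len(balanced))
```

Dropping incomplete records and undersampling are only two of several possible choices; filling in missing values or oversampling the smaller category are common alternatives.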

Given the table above, your task is to fill out the table and provide at least one example for each data science field.
You should also provide a brief definition, in your own words, of all the techniques and processes that are part of your answer.
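As an example of one such technique, linear regression (described in the table as fitting a linear equation to observed data) can be sketched in a few lines of Python. This is only an illustrative sketch: the spend and sales figures and the helper name fit_line are assumptions made for the example, and it uses the ordinary least-squares formulas for a single input variable.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least-squares line
    y = intercept + slope * x for paired observations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of cross-deviations of x and y
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: advertising spend vs. sales
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

intercept, slope = fit_line(spend, sales)
forecast = intercept + slope * 6.0  # predicted sales at a spend of 6.0
print(round(intercept, 2), round(slope, 2), round(forecast, 2))
```

The fitted line can then be used to forecast a value outside the observed data, which is the sense in which the table lists sales forecasting under traditional methods.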

C. Explore
1. Based on the data science infographic discussed, fill out the table below by listing at least two (2) software tools or programming languages, not given in the infographic, that can be used in each of the following data science processes or techniques: (20 points)

Columns: Data Science Process | Software/Programming Languages | Screenshot (Logo)

Data Acquisition (ETL)
Software tools:
-Informatica PowerCenter
-Talend Studio
Programming languages:
-PL/SQL
-Perl
Screenshots (logos): Informatica software, Talend Studio software

Data Cleansing
Software tools:
-OpenRefine
-Trifacta Wrangler
Programming languages:
-Ruby
-Lua
Screenshots (logos): OpenRefine software, Trifacta Wrangler software

Data Warehousing
Software tools:
-Google BigQuery
-Amazon Redshift
Programming languages:
-Ruby
Screenshots (logos): Google BigQuery, Amazon Redshift

Data Analysis
Software tools:
-KNIME
-Sisense
Programming languages:
-Julia
-GNU Octave
Screenshots (logos): KNIME software, Sisense software

Data Visualization
Software tools:
-Domo
-Highcharts
Programming languages:
-C#
-Rust
Screenshots (logos): Domo software, Highcharts software


V. References
1. Udemy. 2020. “Complete Data Science Training: Mathematics, Statistics, Python, Advanced Statistics in Python, Machine & Deep Learning”. Retrieved from: https://www.udemy.com/course/the-data-science-course-complete-data-science-bootcamp/learn/lecture/
2. 365 Data Science. “Defining Data Science: The What, Where and How of Data Science”. Retrieved from: https://365datascience.com/defining-data-science/
