Unit-1 Data Science
[Figure: data science lifecycle phases, including data preparation, model
planning, model building, and communicating results]
[Figure: tools (SQL Analysis Services, R, SAS/ACCESS)]
Model building: In this phase, develop datasets for training and testing
purposes. Here, consider whether the existing tools will suffice for running the
model or whether it will need a more robust environment. Analyze various learning
techniques like classification, association and clustering to build the model.
[Figure: model building tools (SAS Enterprise Miner, WEKA, SPSS Modeler, Matlab,
Alpine Miner, Statistica)]
Operationalize: In this phase, final reports, briefings, code and
technical documents are delivered. In addition, sometimes a pilot project is also
implemented in a real-time production environment. This will provide a clear
picture of the performance and other related constraints on a small scale before full
deployment.
Communicate results: Now it is important to evaluate whether the goal
planned in the first phase has been achieved. So, in the last phase, identify
all the key findings, communicate to the stakeholders and determine if the results of
the project are a success or a failure based on the criteria developed in Phase 1.
[Figure: data science and related terms: machine learning, k-NN, text mining,
data preparation, statistics, process mining, visualization, processing
paradigms, experimentation]
Data science starts with data, which can range from a simple array of a few
numeric observations to a complex matrix of millions of observations with
thousands of variables. Data science utilizes certain specialized computational
methods in order to discover meaningful and useful structures within a dataset.
[Figure: data science models: input (X), representative model, predicted
output (y)]
Data science involves extracting, building, combining, and learning from data,
as classified here:
Extracting Meaningful Patterns: Knowledge discovery in databases is the
nontrivial process of identifying valid, novel, potentially useful, and ultimately
understandable patterns or relationships within a dataset in order to make important
decisions. Data science involves inference and iteration of many different
hypotheses. One of the key aspects of data science is the process of generalization
of patterns from a dataset. The generalization should be valid, not just for the
dataset used to observe the pattern, but also for new unseen data. The ultimate
objective of data science is to find potentially useful conclusions that can be acted
upon by the users of the analysis.
Building Representative Models: In statistics, a model is the
representation of a relationship between variables in a dataset. It describes how one
or more variables in the data are related to other variables. Modeling is a process in
which a representative abstraction is built from the observed dataset. For example,
based on credit score, income level, and requested loan amount, a model can be
developed to determine the interest rate of a loan. For this task, previously known
observational data including credit score, income level, loan amount, and interest
rate are needed. Once the representative model is created, it can be used to predict
the value of the interest rate, based on all the input variables.
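The loan example above can be sketched as a small linear model. This is only an
illustrative sketch: the records are made up (generated from an assumed linear
rule), the units (income and loan amount in thousands) are an assumption, and
the function names fit_linear_model and predict are hypothetical. A real model
would be fitted on observed data and need not be linear.

```python
# Made-up training records: (credit_score, income_k, loan_k) -> interest_rate.
# They follow an assumed rule: rate = 20 - 0.02*score - 0.01*income + 0.02*loan.
rows = [(620, 40, 10), (680, 55, 20), (720, 70, 15),
        (750, 90, 30), (800, 120, 25), (650, 45, 35)]
rates = [7.4, 6.25, 5.2, 4.7, 3.3, 7.25]


def solve(a, b):
    """Solve the square system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        # Partial pivoting: move the largest remaining entry onto the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_linear_model(rows, targets):
    """Least-squares fit with an intercept, via (X^T X) w = X^T y."""
    x = [[1.0] + list(r) for r in rows]
    k = len(x[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(x, targets)) for i in range(k)]
    return solve(xtx, xty)


def predict(weights, row):
    """Apply the fitted model to one new record."""
    return weights[0] + sum(w * v for w, v in zip(weights[1:], row))


weights = fit_linear_model(rows, rates)
new_applicant = (700, 60, 20)  # hypothetical unseen record
print(round(predict(weights, new_applicant), 2))
```

The fitted weights play the role of the representative model: once learned from
the known records, they are reused to predict the interest rate for a record
that was not in the training data.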
Combination of Statistics, Machine Learning, and Computing: In the
pursuit of extracting useful and relevant information from large datasets, data
science borrows computational techniques from the disciplines of statistics,
machine learning, experimentation, and database theories. The algorithms used in
data science originate from these disciplines but have since evolved to adopt more
diverse techniques such as parallel computing, evolutionary computing, linguistics,
and behavioral studies.
...data analysis techniques. Many of these algorithms were developed in the past few
decades and are a part of machine learning and artificial intelligence. Some
algorithms are based on the foundations of Bayesian probabilistic theories and
regression analysis, originating from hundreds of years ago.
These iterative algorithms automate the process of searching for an optimal
solution for a given data problem. Based on the problem, data science is classified
into tasks such as classification, association analysis, clustering, and regression.
Each data science task uses specific learning algorithms like decision trees, neural
networks, k-nearest neighbors (k-NN), and k-means clustering, among others.
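As one illustration of a learning algorithm named above, the following is a
minimal sketch of k-nearest neighbors (k-NN) for a classification task. The
data points, the labels "low" and "high", and the function name knn_predict are
made up for illustration; Euclidean distance and majority voting are assumed
choices.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label a query point by majority vote among its k nearest training points.

    train is a list of (features, label) pairs; distance is Euclidean.
    """
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Made-up two-dimensional records with a known class label.
train = [((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.5), "low"),
         ((6.0, 6.5), "high"), ((7.0, 6.0), "high"), ((6.5, 7.5), "high")]

print(knn_predict(train, (1.2, 1.4)))
print(knn_predict(train, (6.8, 6.8)))
```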
Associated Fields: While data science covers a wide set of techniques,
applications, and disciplines, there are a few associated fields that data science
heavily relies on. The techniques used in the steps of a data science process and in
conjunction with the term "data science" are:
• Dimensional Slicing
• Hypothesis Testing
• Data Engineering
• Business Intelligence
1.5 Cases for Data Science:
Traditional analysis techniques like dimensional slicing, hypothesis testing,
and descriptive statistics can only go so far in information discovery. A paradigm is
needed to manage the massive volume of data, explore the inter-relationships of
thousands of variables, and deploy machine learning algorithms to deduce optimal
insights from datasets.
A set of frameworks, tools, and techniques are needed to intelligently assist
humans to process all these data and extract valuable information. Data science is
one such paradigm that can handle large volumes with multiple attributes and
deploy complex algorithms to search for patterns from data. Each key motivation
for using data science techniques is explored.
Volume: The sheer volume of data captured by organizations is
exponentially increasing. The rapid decline in storage costs and advancements in
capturing every transaction and event, combined with the business need to extract
as much leverage as possible using data, creates a strong motivation to store more
data than ever. As data become more granular, the need to use large volume data to
extract information increases. A rapid increase in the volume of data exposes the
limitations of current analysis methodologies. In a few implementations, the time to
create generalization models is critical and data volume plays a major part in
determining the time frame of development and deployment.
Dimensions: The three characteristics of the Big Data phenomenon are
high volume, high velocity, and high variety. The variety of data relates to the
multiple types of values (numerical, categorical), formats of data (audio files, video
files), and the application of the data (location coordinates, graph data). Every
single record or data point contains multiple attributes or variables to provide
context for the record. For example, every user record of an ecommerce site can
contain attributes such as products viewed, products purchased, and user
demographics.
Supervised data science infers a function or relationship based on labeled
training data and uses this function to map new unlabeled data. Supervised
techniques predict the value of the output variables based on a set of input
variables.
To do this, a model is developed from a training dataset where the values of
input and output are previously known. The model generalizes the relationship
between the input and output variables and uses it to predict for a dataset where
only input variables are known. The output variable that is being predicted is also
called a class label or target variable. Supervised data science needs a sufficient
number of labeled records to learn the model from the data. Unsupervised or
undirected data science uncovers hidden patterns in unlabeled data. In unsupervised
data science, there are no output variables to predict. The objective of this class of
data science techniques is to find patterns in data based on the relationship between
data points themselves. An application can employ both supervised and
unsupervised learners.
Data science problems can also be classified into tasks such as:
classification, regression, association analysis, clustering, anomaly detection,
recommendation engines, feature selection, time series forecasting, deep learning,
and text mining. Classification and regression techniques predict a target variable
based on input variables. The prediction is based on a generalized model built from
a previously known dataset. In regression tasks, the output variable is numeric.
Classification tasks predict output variables that are categorical or polynominal.
The tasks are illustrated in Fig 1.5. Deep learning is a more sophisticated artificial
neural network that is increasingly used for classification and regression problems.
Clustering is the process of identifying the natural groupings in a dataset.
For example, clustering is helpful in finding natural clusters in customer
datasets, which can be used for market segmentation. Since this is unsupervised
data science, it is up to the end user to investigate why these clusters are formed in
the data and generalize the uniqueness of each cluster. In retail analytics, it is
common to identify pairs of items that are purchased together, so that specific items
can be bundled or placed next to each other. This task is called market basket
analysis or association analysis, which is commonly used in cross selling.
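The idea of finding pairs of items purchased together can be sketched by
counting how often each pair appears in the same transaction. This is a minimal
sketch, not a full association-rule algorithm: the transactions, the function
name frequent_pairs, and the 40% support threshold are assumptions made for
illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(transactions, min_support):
    """Return item pairs whose support (share of baskets containing both
    items) is at least min_support."""
    counts = Counter()
    for basket in transactions:
        # Sort and de-duplicate so each pair is counted once per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    n = len(transactions)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


# Made-up retail transactions.
transactions = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["beer", "chips"],
    ["bread", "butter"],
    ["beer", "chips", "milk"],
]

pairs = frequent_pairs(transactions, min_support=0.4)
for pair, support in sorted(pairs.items()):
    print(pair, support)
```

Pairs that pass the threshold are candidates for bundling or shelf placement;
pairs seen in only one basket are filtered out.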
[Fig 1.5: Data science tasks, including regression, association analysis,
feature selection, anomaly detection, and text mining]
Recommendation engines are the systems that recommend items to the
users based on individual user preference. Anomaly or outlier detection identifies
the data points that are significantly different from other data points in a dataset.
Credit card transaction fraud detection is one of the most prolific applications of
anomaly detection. Time series forecasting is the process of predicting the future
value of a variable. Text mining is a data science application where the input data is
text, which can be in the form of documents, messages, emails, or web pages.
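Anomaly detection, as described above, can be sketched with a simple z-score
rule: flag values that lie far from the mean in units of standard deviation.
This is only one assumed approach among many; the transaction amounts, the
function name find_anomalies, and the threshold of 2.5 are made up for
illustration.

```python
import statistics


def find_anomalies(values, threshold=2.5):
    """Return values whose distance from the mean exceeds
    threshold population standard deviations."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mean) / spread > threshold]


# Made-up card transaction amounts with one unusually large purchase.
amounts = [20, 25, 22, 30, 27, 24, 26, 500, 23, 28]
anomalies = find_anomalies(amounts)
print(anomalies)
```

A single extreme value inflates both the mean and the standard deviation, so
in practice more robust measures (for example, ones based on the median) are
often preferred.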
To aid the data science on text data, the text files are first converted into
document vectors where each unique word is an attribute. Once the text file is
converted to document vectors, standard data science tasks such as classification,
clustering, etc., can be applied. Feature selection is a process in which attributes in
a dataset are reduced to a few attributes that really matter. Unsupervised techniques
provide an increased understanding of the dataset and hence, are sometimes called
descriptive data science.
As an example of how both unsupervised and supervised data science can
be combined in an application, consider the following scenario. In marketing
analytics, clustering can be used to find the natural clusters in customer records.
Each customer is assigned a cluster label at the end of the clustering process. A
labeled customer dataset can now be used to develop a model that assigns a cluster
label for any new customer record with a supervised classification technique.
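The scenario above can be sketched in two steps: an unsupervised step that
assigns cluster labels, then a supervised step that learns from those labels.
This is a minimal sketch under several assumptions: a single made-up attribute
(annual spend), a tiny one-dimensional k-means with k of at least 2, and a
nearest-centroid classifier as the supervised technique. All function names are
hypothetical.

```python
def kmeans_1d(values, k=2, iterations=20):
    """Tiny 1-D k-means (k >= 2); centroids start evenly spaced
    between the minimum and maximum value."""
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    labels = [0] * len(values)
    for _ in range(iterations):
        # Assignment step: each value joins its closest centroid.
        labels = [min(range(k), key=lambda c: abs(v - centroids[c]))
                  for v in values]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return labels


def train_nearest_centroid(values, labels):
    """Supervised step: learn one mean per class from labeled records."""
    groups = {}
    for v, lab in zip(values, labels):
        groups.setdefault(lab, []).append(v)
    return {lab: sum(vs) / len(vs) for lab, vs in groups.items()}


def classify(model, value):
    """Assign the label whose learned mean is closest to the new value."""
    return min(model, key=lambda lab: abs(value - model[lab]))


# Made-up annual spend per customer.
spend = [120, 150, 130, 900, 950, 1000, 140, 980]
labels = kmeans_1d(spend)                      # unsupervised: find clusters
model = train_nearest_centroid(spend, labels)  # supervised: learn the labels
print(labels, classify(model, 160), classify(model, 870))
```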
1.7 Data Science Algorithms:
An algorithm is a logical step-by-step procedure for solving a problem. In
data science, it is the blueprint for how a particular data problem is solved. Many of
the learning algorithms are recursive, where a set of steps are repeated many times
until a limiting condition is met. Some algorithms also contain a random variable as
an input and are aptly called randomized algorithms. A classification task can be
solved using many different learning algorithms such as decision trees, artificial
neural networks, k-NN, and even some regression algorithms. The choice of which
algorithm to use depends on the type of dataset, objective, structure of the data,
presence of outliers, available computational power, number of records, number of
attributes, and so on. It is up to the data science practitioner to decide which
algorithm to use by evaluating the performance of multiple algorithms. There have
been hundreds of algorithms developed in the last few decades to solve data science
problems.
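Choosing an algorithm by evaluating the performance of several candidates can
be sketched as follows. This is an illustrative sketch only: the two candidate
classifiers (a majority-class baseline and a 1-nearest-neighbor rule), the
made-up training and test records, and accuracy as the performance measure are
all assumptions.

```python
from collections import Counter


def majority_classifier(train):
    """Baseline: always predict the most common training label."""
    label = Counter(lab for _, lab in train).most_common(1)[0][0]
    return lambda x: label


def one_nn_classifier(train):
    """Predict the label of the single closest training value."""
    return lambda x: min(train, key=lambda item: abs(item[0] - x))[1]


def accuracy(classifier, test):
    """Share of held-out records the classifier labels correctly."""
    return sum(classifier(x) == y for x, y in test) / len(test)


# Made-up one-attribute records, split into training and held-out test data.
train = [(1, "a"), (2, "a"), (3, "a"), (8, "b"), (9, "b")]
test = [(2.5, "a"), (7.5, "b"), (8.5, "b"), (1.5, "a")]

candidates = {"majority": majority_classifier, "1-NN": one_nn_classifier}
scores = {name: accuracy(build(train), test)
          for name, build in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Scoring every candidate on records that were not used for training is what
makes the comparison a test of generalization rather than memorization.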
Data science algorithms can be implemented by custom-developed
computer programs in almost any computer language. This obviously is a
time-consuming task. In order to focus the appropriate amount of time on data and
algorithms, data science tools or statistical programming tools, like R, RapidMiner,
Python, SAS Enterprise Miner, etc., which can implement these algorithms with
ease, can be leveraged. These data science tools offer a library of algorithms as
functions, which can be interfaced through programming code or configured
through graphical user interfaces.