
Data Science Module 1-2

The document discusses various aspects of data science, including the importance of big data, statistical analysis, and machine learning. It highlights the skills required for data scientists, such as data visualization, communication, and statistical thinking, while also addressing the challenges faced in the industry. Additionally, it emphasizes the need for effective sampling methods and exploratory data analysis to derive meaningful insights from data.

Uploaded by

kanti chandrakar
Copyright
© All Rights Reserved

Module 1

Big Data and Data Science Hype

1. Lack of definitions around basic terminology
   - Big Data: how big? "Big" is relative.
   - Data Science: how is it a science, and what is its relationship to Big Data?
2. Lack of respect for researchers in academia and industry
   - Statisticians, mathematicians, computer scientists, engineers, and machine learning (ML) researchers have worked on these problems for years.
3. The hype is crazy
   - Talk of "Masters of the Universe" only increases the noise-to-signal ratio.
4. Statisticians feel they are already studying and working on the "science of data"
   - To them, the hype can feel like identity theft.
   - Data science is not a rebranding of ML and statistics; it is a field in itself.
5. "Anything that has to call itself a science isn't."
   - Data science may represent nothing but a craft.

Getting Past the Hype
- Difference between industry and academia.

Why now?
1. Abundance of computing power.
2. Datafication: taking all aspects of life and turning them into data.
   - Shopping, communication, searching, finance, medical and pharmaceutical records, social welfare, education, retail.
   - Recommendation systems.
   - Google, Twitter, LinkedIn.
   - Walking and other offline behavior captured by sensors, cameras, Google Glass.

Role of the social scientist in data science
- Social products: context.
- Human/user behavior: social problems.
- Substantive features: friend recommendation, "people you know".
- Question asker, investigative: community, history.

Data Science Jobs

Skill areas:
1. Computer science
2. Statistics
3. Communication and presentation skills
4. Data visualization
5. Domain expertise
6. Maths
7. Machine learning

Data scientist types:
1. Data Businessperson
2. Data Creative
3. Data Developer
4. Data Researcher

Skill groups:
1. Business
2. ML/Big Data
3. Math
4. Programming
5. Statistics

The current landscape (Venn diagram)

Data Scientist

In academia:
- No one calls themselves a data scientist; it is a secondary title.
- The title is used when applying for funds.
- Students from many fields, from social science to biology, become data scientists.

In industry:
- The role depends on the level of seniority.
- A chief data scientist sets the data strategy with the CEO and CTO, covering infrastructure and decision making.
- Day-to-day work includes collecting data, debugging, and visualization.

Statistical thinking in the age of Big Data
- Big Data is more than a bundle of technologies and more than a revolution in measurement.
- It is a point of view, or philosophy, about how decisions will be made in the future.

Skills to develop: statistics, linear algebra, ML, data preparation, munging, modeling, coding, visualization, communication.

Statistical Inference
- The world is complex, random, and uncertain; at the same time it is one big data-generating machine.
- Shopping, email, DNA, the Internet, and the stock market all generate data.
- Data represents the traces of real-world processes. Exactly which traces are collected is decided by our data collection or sampling method.
- We analyze the collected data to understand the nature of the world and its processes.
- The overall process of going from the world to the data, and then from the data back to the world, is called statistical inference.
- It deals with the development of procedures, methods, and theorems that allow us to extract meaning and information from data generated by stochastic (random) processes.
- Mathematical models or functions of the data, developed to simplify it, are called statistical estimators.

Population: a set of objects or units, such as tweets, photographs, stars, or users.
- If the characteristics of all objects are extracted, the result is called the set of observations.
- N represents the total number of observations in the population.
- Ex: in a population of emails, the characteristics could be the sender's name, list of recipients, date sent, text, number of characters, number of sentences, number of verbs, and length of time until the first reply.

Sample: a subset of the units, of size n, that is examined in order to draw conclusions and make inferences about the population.
- The sampling method should not have bias, which can distort the inference.
- Ex: 1/10 of the employees, or of the emails, selected at random.

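The population and sample definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the population is a made-up list of email lengths, and `simple_random_sample` is a hypothetical helper name, not something from the notes.

```python
import random
import statistics


def simple_random_sample(population, n, seed=0):
    """Draw n units uniformly at random, without replacement."""
    rng = random.Random(seed)
    return rng.sample(population, n)


# Hypothetical population: number of characters in each of N = 10,000 emails.
rng = random.Random(42)
population = [rng.randint(20, 2000) for _ in range(10_000)]

# Examine roughly 1 in 10 units instead of the whole population.
sample = simple_random_sample(population, n=1_000)

population_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
print(f"N = {len(population)}, population mean = {population_mean:.1f}")
print(f"n = {len(sample)}, sample mean = {sample_mean:.1f}")
```

Because every unit has the same chance of being selected, the sample mean should land close to the population mean; a biased selection rule (for example, only the longest emails) would not have that property.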
Need for Sampling
- Sampling solves some engineering challenges.
1. How much data is needed? Hadoop can handle large data.
2. Bias
3. Sampling distribution

New kinds of data:
1. Traditional: numerical, categorical, binary
2. Text: tweets, emails
3. Records: user-level data, timestamped events, JSON logs
4. Geo-based location data
5. Network
6. Sensor data
7. Images

Big Data
- "Big" is a moving target (size of data).
- Data is "big" when you can't fit it on one machine.
- Big Data is a cultural phenomenon.
- 4 V's: volume, variety, velocity, value.

Big Data, Big Assumptions
1. Collecting and using a lot of data rather than small samples.
2. Accepting messiness in your data.
3. Giving up on knowing the causes.

Can N = ALL?
- Data is not objective: it is wrong to believe that data speaks for itself.
- n = 1: a sample size of one.

Modeling
- Build a model from the data collected, often a massive amount of data.
- Representing: data scientists and statisticians capture the uncertainty and randomness of the data-generating process with mathematical functions that express the shape and structure of the data set.
- A model is an attempt to understand and represent the nature of reality through a particular lens, be it architectural, biological, or mathematical.
- A model is an artificial construction in which all extraneous details are removed.
- Check whether relevant variables are included and whether key variables are excluded.

Statistical Modeling (write some details about statistics)
- Mathematical expressions that include parameters, but the values of the parameters are not known.
- Greek letters are used for parameters; Latin letters are used for data.
- Ex: data columns x and y with a linear relationship: $y = \beta_0 + \beta_1 x$, where $\beta_0$ and $\beta_1$ are parameters.
- Some people draw a diagram of the data flow, with arrows showing how things affect one another or what happens over time; this gives an abstract structure for the model.
- The assumptions behind the model should be explained, including why each one is made.

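Choices
- Model building is trial and error, with iteration.
- Do EDA: make plots, histograms, scatter plots.
- Trade-off between a simple model and an accurate one: a simple model may get about 90% of the way there, while a more complex model requires more time.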

Probability Distributions
- Real-world processes generate measurements that can be approximated by mathematical shapes.
- The parameters of these mathematical functions can be estimated from the data.
- Probability distributions are the building blocks of models.
- The outcome is to be interpreted as a probability.
- Normal (Gaussian) distribution: $N(x \mid \mu, \sigma) \sim \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$

Some distributions are:
- Normal
- Exponential
- Double exponential
- Weibull
- Power normal
- t
- Power lognormal
- Gamma
- Beta

- A random variable x (or y) is assumed to have a corresponding probability distribution p(x), which maps x to a positive real number.
- For a continuous variable, p(x) is a probability density function: if we integrate p(x) to get the area under the curve, that area can be interpreted as a probability, and the total area is 1.
- Select the appropriate distribution function for the given problem.
- Joint distribution: used for multivariate problems with more than one random variable, written p(x, y). The double integral of the function over the whole plane must be 1.
- Conditional probability p(x | y): the density of x given a particular value of y. Ex: p(x | y > 5).
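The statement that a density integrates to 1, and that areas under it are probabilities, can be checked numerically. The sketch below assumes a standard normal density and uses a simple trapezoidal rule; the function names and the integration limits are illustrative choices, not part of the notes.

```python
import math


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Density of the normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def integrate(f, a, b, steps=10_000):
    """Approximate the integral of f over [a, b] with the trapezoidal rule."""
    h = (b - a) / steps
    total = 0.5 * (f(a) + f(b))
    for i in range(1, steps):
        total += f(a + i * h)
    return total * h


# Area under the whole curve (the tails beyond +/-10 are negligible).
total_area = integrate(normal_pdf, -10, 10)

# Probability that x falls within one standard deviation of the mean.
prob_within_one_sigma = integrate(normal_pdf, -1, 1)

print(f"total area = {total_area:.6f}")
print(f"P(-1 < x < 1) = {prob_within_one_sigma:.4f}")
```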

Fitting a Model
- Fitting a model means estimating the parameters of the model using the observed data.
- It involves optimization methods and algorithms, such as maximum likelihood estimation, to get the parameters.
- The estimated parameters are estimators, that is, functions of the data. Ex: y = 7.2 + 4.5x.
- Coding is done in Python or R to find the values of the parameters.

Overfitting
- Overfitting occurs when the data set used for estimating the parameters yields a model that does not perform well beyond the sampled data.
- It is evaluated by accuracy: the model predicts the sampled data used in fitting well, but not other data.

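As one concrete case of maximum likelihood estimation, the sketch below fits a normal distribution, for which the maximum likelihood estimates have a closed form (the sample mean, and the root of the average squared deviation). The data are simulated and `fit_normal_mle` is a hypothetical helper, so this is an illustration rather than the method used in the notes.

```python
import math
import random


def fit_normal_mle(data):
    """Maximum likelihood estimates (mu_hat, sigma_hat) for a normal model."""
    n = len(data)
    mu_hat = sum(data) / n
    # The MLE of the variance divides by n, not n - 1.
    sigma_hat = math.sqrt(sum((x - mu_hat) ** 2 for x in data) / n)
    return mu_hat, sigma_hat


# Simulated observations from a normal distribution with assumed
# "true" parameters mu = 5.0 and sigma = 2.0.
rng = random.Random(0)
data = [rng.gauss(5.0, 2.0) for _ in range(5000)]

mu_hat, sigma_hat = fit_normal_mle(data)
print(f"mu_hat = {mu_hat:.3f}, sigma_hat = {sigma_hat:.3f}")
```

The estimates are functions of the data: a different sample gives slightly different values, which is why a fitted model should also be checked on data that was not used for fitting.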
Module 2

Exploratory Data Analysis (EDA)
- EDA is the first step towards building a model in data science.
- It is presented as a bunch of histograms and scatter plots.
- EDA has no hypothesis and no model.
- EDA is a process of understanding the data and the problem that we are solving.
- The basic tools of EDA are plots, graphs, and summary statistics.
- It means systematically going through all the data and:
  - plotting the distributions of all variables,
  - computing pairwise relationships,
  - computing summary statistics: mean, minimum, maximum, upper and lower quartiles, outliers, variance, SD.
- EDA provides insight into the shape of the data and intuition about the data-generating process.
- EDA is between the data and the data scientist.

Philosophy of EDA
- Do EDA even with log data, as at Google: data generated from logs, not just smaller data sets.

EDA is done:
- to gain intuition about the data,
- to make comparisons between distributions,
- for sanity checking (checking the data's scale and format),
- to find missing data,
- to find outliers,
- to summarize the data.

- EDA helps in debugging the logging process when using logs.
- EDA helps in making sure that the product performs as intended.
- EDA is done at the beginning of the analysis; visualization is done at the end to communicate findings.
- EDA helps in informing and improving the development of algorithms. Ex: in ranking algorithms, popularity can be linked to the number of clicks or comments.
- Doing EDA on the data set is better than running an algorithm immediately.
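The summary statistics listed for EDA can be computed with Python's standard `statistics` module. The sketch below is a minimal example: the `clicks` values are invented, and the 1.5 x IQR rule for flagging outliers is a common convention assumed here, not one stated in the notes.

```python
import statistics


def summarize(values):
    """Summary statistics plus a simple 1.5 * IQR outlier check."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "n": len(values),
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "median": median,
        "lower_quartile": q1,
        "upper_quartile": q3,
        "variance": statistics.variance(values),
        "sd": statistics.stdev(values),
        "outliers": [v for v in values if v < low or v > high],
    }


# Hypothetical data: clicks per user, with one suspicious value.
clicks = [3, 5, 4, 6, 5, 4, 7, 5, 6, 120]
summary = summarize(clicks)
for name, value in summary.items():
    print(f"{name}: {value}")
```

A value such as 120 clicks may be a genuine heavy user or a logging bug; spotting it at this stage is the sanity check that EDA is meant to provide.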

The Data Science Process
- Diagram
- Explain each part

Data Scientist's Role
- Diagram
- Explain each part

Data science as the scientific method:
1. Ask a question.
2. Do background research.
3. Construct a hypothesis.
4. Test your hypothesis by doing an experiment.
5. Analyze your data and draw a conclusion.
6. Communicate your results.

Case Study: How RealDirect Makes Money

RealDirect: a data startup
- Uses all the data to improve home buying and selling.
- People sell homes about every 7 years.
- RealDirect addresses both the broker system and data quality.
- RealDirect hires licensed real estate agents.
- Agents work together, pooling their knowledge.
- It has an interface for sellers, acts as a data broker, and provides recommendations.
- Agents become information tools, keeping data new and relevant; much of it is publicly available.
- RealDirect works on real-time feeds of when buyers and sellers start searching for a home.

How RealDirect makes money
1. It provides a subscription to sellers, about $395 a month, to access selling tools.
2. It allows sellers to use its agents at a reduced commission, around 2% of the sale rather than the usual 2.5% or 3%.
3. RealDirect works on pooling and optimizes for volume, so it can take a smaller commission.

- It has a platform for both buyers and sellers to manage the sale.
- It maintains statuses on the site such as active, offer made, offer rejected, showing, in contract, etc.
- Law requires registration for housing sale listings; houses from multiple brokers can be listed.
- Buyers using the RealDirect service can consider nearby parks, subways, schools, and shopping malls.

RealDirect Data Strategy
1. Explore the existing website: how buyers and sellers use it.
   - Think about monitoring usage.
   - Advise about logs, datasets, and reporting.
2. EDA of the data
   - Think about loading, cleaning, and shaping the data.
   - Summarize findings in a report for the CEO.
3. About the data
   - Speak to people to get information, including domain experts.
   - Think about a set of best practices for data strategy.

3 classes of algorithms
1. Data engineering: data munging, preparing, and processing, e.g. sorting, MapReduce, Pregel.
2. Optimization algorithms: parameter estimation, e.g. stochastic gradient descent, Newton's method, least squares.
3. Machine learning.

Machine Learning Algorithms
- Used to predict, classify, or cluster.
- Broad generalizations to be considered by data scientists, based on machine learning versus statistics:
  1. Interpreting parameters
  2. Confidence intervals
  3. Role of assumptions

Three basic algorithms: linear regression, k-NN, k-means.

Linear Regression
1. Fitting the model
2. Adding more assumptions about errors
3. Evaluation metrics
4. Adding more predictors
5. Transforming the predictors

- Basic method for expressing the mathematical relationship between two variables.
- Used when there is a linear relationship between an outcome variable and a predictor, or between one variable and several other variables.
- Changes in one variable correlate linearly with changes in the other variable.
- Ex: the more umbrellas are sold, the more money you make.
- Ex: deterministic lines with slope and intercept: $y = f(x) = \beta_0 + \beta_1 x$.
- Ex: subscription revenue of a social networking site as a function of the number of members: $y = 25x$.

Fitting the model
- Assuming a linear relationship: $y = \beta_0 + \beta_1 x$; in matrix notation, $y = x \cdot \beta$.
- Observed data: $(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)$.
- The best choice for $\beta_0$ and $\beta_1$ is the line that minimizes the distance between all the points and the line.

Calculating $\beta$
- The residual sum of squares, $RSS(\beta)$, is the sum of squares of the differences between the predicted and observed values:
  $RSS(\beta) = \sum_i (y_i - \beta x_i)^2$, summed over all data points.
- To minimize $RSS(\beta) = (y - \beta x)^T (y - \beta x)$, differentiate with respect to $\beta$, set the derivative to 0, and solve for $\beta$.
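For a single predictor with an intercept, setting the derivatives of the RSS to zero gives the well-known closed form: slope = S_xy / S_xx and intercept = mean(y) - slope * mean(x). The sketch below applies it to a small data set assumed for illustration (a $25 fee per member); the function names are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Least-squares estimates (beta0, beta1) for y = beta0 + beta1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = s_xy / s_xx
    beta0 = y_bar - beta1 * x_bar
    return beta0, beta1


def residual_sum_of_squares(xs, ys, beta0, beta1):
    """Sum of squared differences between observed and predicted values."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))


# Assumed example: number of members (x) and monthly revenue (y) at $25 each.
members = [1, 10, 100, 200]
revenue = [25, 250, 2500, 5000]

beta0, beta1 = fit_simple_linear_regression(members, revenue)
rss = residual_sum_of_squares(members, revenue, beta0, beta1)
print(f"beta0 = {beta0:.3f}, beta1 = {beta1:.3f}, RSS = {rss:.3f}")
```

Because this data lies exactly on a line, the fitted slope recovers the fee and the RSS is zero; real data would leave a positive RSS.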

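Adding modeling assumptions about errors
- Ex: x = number of friends, y = time spent.
- Capture the variability in the model: $y = \beta_0 + \beta_1 x + \epsilon$.
- $\epsilon$ is the noise or error term: the difference between an observation and the regression line.
- $\epsilon \sim N(0, \sigma^2)$.
- Conditional distribution of y given x: $p(y \mid x) \sim N(\beta_0 + \beta_1 x, \sigma^2)$.
- Actual errors (residuals): $e_i = y_i - \hat{y}_i$.
- Estimated variance of $\epsilon$: $\frac{\sum_i e_i^2}{n - 2}$, the mean squared error.
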
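The noise model above can be illustrated by simulating data from a line with normal errors, fitting the line, and estimating the error variance from the residuals. All numeric values here (true intercept, slope, and sigma) are assumptions chosen for the sketch.

```python
import random


def fit_line(xs, ys):
    """Least-squares estimates (beta0, beta1) for y = beta0 + beta1 * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = s_xy / s_xx
    return y_bar - beta1 * x_bar, beta1


def estimated_error_variance(xs, ys, beta0, beta1):
    """Sum of squared residuals divided by n - 2 (two parameters were estimated)."""
    residuals = [y - (beta0 + beta1 * x) for x, y in zip(xs, ys)]
    return sum(e * e for e in residuals) / (len(xs) - 2)


# Simulate y = 2.0 + 0.5 * x + epsilon with epsilon ~ N(0, 1.0^2).
rng = random.Random(1)
xs = [rng.uniform(0, 50) for _ in range(2000)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1.0) for x in xs]

beta0, beta1 = fit_line(xs, ys)
sigma2_hat = estimated_error_variance(xs, ys, beta0, beta1)
print(f"beta0 = {beta0:.3f}, beta1 = {beta1:.3f}, sigma^2 estimate = {sigma2_hat:.3f}")
```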
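Evaluation metrics
1. R-squared: the proportion of the variance of the actual values captured by our model.
2. p-values: the probability of observing data at least as extreme as the observed data under the null hypothesis. A low p-value indicates that such data is less likely to be observed if the null hypothesis holds; a high p-value indicates the data is consistent with the null hypothesis.
3. Cross-validation: divide the data into an 80% training set and a 20% test set; fit the model on the training set, then compute the mean squared error on the test set and compare it with that on the training set.
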
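The 80/20 split and the R-squared metric can be sketched as follows. The data are simulated, the helper names are hypothetical, and the split is a single random hold-out, which is the simplest form of the cross-validation idea described above.

```python
import random


def fit_line(xs, ys):
    """Least-squares estimates (beta0, beta1) for y = beta0 + beta1 * x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    beta1 = s_xy / s_xx
    return y_bar - beta1 * x_bar, beta1


def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out a fraction of them as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def mean_squared_error(rows, beta0, beta1):
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in rows) / len(rows)


def r_squared(rows, beta0, beta1):
    """1 - (residual sum of squares / total sum of squares)."""
    y_bar = sum(y for _, y in rows) / len(rows)
    ss_res = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in rows)
    ss_tot = sum((y - y_bar) ** 2 for _, y in rows)
    return 1 - ss_res / ss_tot


# Simulated data: y = 3 + 2x + noise.
rng = random.Random(7)
rows = []
for _ in range(100):
    x = rng.uniform(0, 10)
    rows.append((x, 3.0 + 2.0 * x + rng.gauss(0, 1.0)))

train, test = train_test_split(rows)
beta0, beta1 = fit_line([x for x, _ in train], [y for _, y in train])

train_mse = mean_squared_error(train, beta0, beta1)
test_mse = mean_squared_error(test, beta0, beta1)
test_r2 = r_squared(test, beta0, beta1)
print(f"train MSE = {train_mse:.3f}, test MSE = {test_mse:.3f}, test R^2 = {test_r2:.3f}")
```

A test error far above the training error would be a sign of overfitting; for a two-parameter line on this much data the two should be close.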
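Adding predictors
- Multiple linear regression: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \epsilon$.
- Draw scatter plots and histograms.

Transformation
- A polynomial relationship can be transformed into a linear one by treating a power of x as a new variable, e.g. $z = x^2$.
- Build the linear regression model based on z.
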
Assumptions
1. Linearity.
2. Error terms are normally distributed with mean 0.
3. Error terms are independent of each other.
4. Error terms have constant variance across values of x.
5. The predictors used are the right predictors.

k-NN
- k-NN is an algorithm that is used to classify or label a bunch of objects.
- It uses already classified (labeled) similar objects to label unknown objects.
- Ex: classifying people as high credit or low credit risk.
- Linear regression outputs a continuous variable, but here the wanted label is categorical.
- k-NN considers the most similar other items, defined based on their attributes, looks at their labels, and gives the unassigned item the majority vote.
- k-NN has to decide how to define similarity and how many neighbors to consider.

Example data: age | income | credit (high or low)

k-NN process
1. Decide on a similarity or distance metric.
2. Split the labeled data set into training and test data.
3. Pick an evaluation metric (misclassification rate, etc.).
4. Run k-NN a few times, changing k, and measure the evaluation metric each time.
5. Optimize k by picking the one with the best evaluation measure.
6. Use the same training set to create labels for new data with no labels.

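The six steps above can be sketched with a small k-NN classifier. The (age, income) rows and their credit labels are invented for illustration, Euclidean distance is assumed as the metric, and the misclassification rate is used as the evaluation metric.

```python
import math
from collections import Counter


def knn_predict(train, query, k):
    """Label a query point by majority vote among its k nearest training points."""
    neighbors = sorted(train, key=lambda row: math.dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


def misclassification_rate(train, test, k):
    """Fraction of test points whose predicted label differs from the true label."""
    wrong = sum(knn_predict(train, features, k) != label for features, label in test)
    return wrong / len(test)


# Hypothetical labeled data: ((age, income in thousands), credit label).
train = [
    ((25, 20), "low"), ((30, 25), "low"), ((35, 30), "low"), ((28, 22), "low"),
    ((45, 80), "high"), ((50, 90), "high"), ((55, 85), "high"), ((48, 95), "high"),
]
test = [((27, 24), "low"), ((52, 88), "high"), ((33, 28), "low"), ((47, 82), "high")]

# Steps 4 and 5: try several values of k and keep the one with the lowest error.
rates = {k: misclassification_rate(train, test, k) for k in (1, 3, 5)}
best_k = min(rates, key=rates.get)
print(f"error rates by k: {rates}, best k = {best_k}")

# Step 6: label a new, unlabeled point with the chosen k.
print(knn_predict(train, (40, 70), best_k))
```

Odd values of k avoid tied votes when there are two classes. In practice the features would usually be rescaled first, since an unscaled attribute with a large range dominates the distance.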
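Similarity or distance metrics
1. Euclidean distance
2. Cosine similarity
   - Between two real-valued vectors x and y.
   - The value lies between -1 and 1: 1 means exactly the same, -1 means exactly opposite, 0 means independent.
   - $\cos(x, y) = \frac{x \cdot y}{\|x\| \|y\|}$
3. Jaccard distance
   - Distance between sets of objects.
   - Ex: sets of friends A = {Kahn, Mark, Laura, ...}, B = {Mladen, Mark, Kahn, ...}.
   - $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$. This ratio measures similarity; the corresponding distance is $1 - J(A, B)$.
4. Mahalanobis distance
   - Distance between two real-valued vectors.
   - $d(x, y) = \sqrt{(x - y)^T S^{-1} (x - y)}$, where S is the covariance matrix.
5. Hamming distance
   - Distance between two strings or DNA sequences of the same length.
   - Go through each position and check whether the letters are the same.
   - Ex: the distance between "olive" and "ocean" is 4 (all letters differ except "o"); the distance between "shoe" and "hose" is 3 (all letters differ except "e").
6. Manhattan distance
   - Distance between two real-valued k-dimensional vectors, measured in the grid-like fashion of Manhattan city streets.
   - $d(x, y) = \sum_{i} |x_i - y_i|$, where i indexes the i-th element of each vector.
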
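Most of the metrics listed above take only a few lines each. The sketch below implements five of them with the standard library; Mahalanobis distance is left out because it needs a matrix inverse of the covariance matrix. The function names are illustrative.

```python
import math


def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def cosine_similarity(x, y):
    """Dot product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def jaccard_similarity(a, b):
    """Size of the intersection over size of the union; the distance is 1 minus this."""
    return len(a & b) / len(a | b)


def hamming_distance(s, t):
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(t):
        raise ValueError("sequences must have the same length")
    return sum(c1 != c2 for c1, c2 in zip(s, t))


def manhattan_distance(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))


print(euclidean_distance((0, 0), (3, 4)))
print(cosine_similarity((1, 0), (0, 1)))
print(jaccard_similarity({"Kahn", "Mark", "Laura"}, {"Mladen", "Mark", "Kahn"}))
print(hamming_distance("olive", "ocean"), hamming_distance("shoe", "hose"))
print(manhattan_distance((1, 2), (4, 6)))
```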
Training and testing sets
- In the training phase, create a model and train it.
- In the testing phase, use new data to test the model as if we don't know the labels.
- The 20% of the data used for testing is selected randomly from the cleaned data.

Pick an evaluation metric
- Sensitivity
- Specificity
- Precision
- Recall
- Accuracy
- Misclassification rate (1 - accuracy)

Choosing k
- k is a parameter that we have control over.
- Run k-NN with different values of k, check the evaluation metric, and select the best one.
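The listed evaluation metrics all come from the four counts of a binary confusion matrix. The sketch below computes them for a pair of hypothetical label lists, treating "high" as the positive class; it assumes both classes occur, so none of the denominators is zero.

```python
def classification_metrics(actual, predicted, positive):
    """Compute standard binary classification metrics from two label lists."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    accuracy = (tp + tn) / len(pairs)
    return {
        "sensitivity": tp / (tp + fn),   # also called recall
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),
        "accuracy": accuracy,
        "misclassification": 1 - accuracy,
    }


# Hypothetical true and predicted credit labels for eight people.
actual = ["high", "high", "high", "low", "low", "low", "low", "low"]
predicted = ["high", "high", "low", "low", "low", "low", "high", "low"]

metrics = classification_metrics(actual, predicted, positive="high")
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```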

Modeling assumptions of k-NN
1. The data is in some feature space where the notion of distance makes sense.
2. The training data has been labelled with two or more classes.
3. Pick the number of neighbors to use, k.
4. Assume the observed features and the labels are somehow associated; use the evaluation metric to check.

(Add more details.)
veuy eusstia:
k-means
- This is an unsupervised learning technique, used to find the clusters of the data.
- Data: e.g. survey data, medical data, or SAT scores.
- Ex: age | gender | income | state | household size
- The goal is to segment (cluster) the users, finding similar types of users and bunching them together.
- Different services or experiences can be given to different groups of users.
- The model can be used for different groups.
- Age can be used to create bins: 20-24, 25-30, and so on.
- If we have 10 age bins, 2 gender bins, 50 state bins, 10 income bins, and 3 household-size bins, the possible bins result in 10 x 2 x 50 x 10 x 3 = 30,000.
- In k-means, k is the number of bins (clusters) we need.

Algorithm:
1. Initially, pick k centroids (points) in d-space at random. They should be different from one another and near the data points.
2. Assign each data point to the closest centroid.
3. Move each centroid to the average location of the data points assigned to it.
4. Repeat the preceding two steps until the assignments don't change, or change very little.

Issues
1. Choosing k is more of an art than a science; it is bounded by 1 <= k <= n, the number of data points.
2. Convergence: a solution may fail to exist, and the algorithm can loop between solutions.
3. Interpretability can be a problem: the answer may not be useful.
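The four algorithm steps can be sketched directly in Python. The points below form two obvious, invented groups; the initial centroids are drawn from the data points themselves, which is one common way to keep them near the data, and the iteration cap is an added safeguard against looping.

```python
import math
import random


def kmeans(points, k, max_iters=100, seed=0):
    """Basic k-means: returns (centroids, assignments)."""
    rng = random.Random(seed)
    # Step 1: pick k distinct data points as the initial centroids.
    centroids = rng.sample(points, k)
    assignments = None
    for _ in range(max_iters):
        # Step 2: assign each point to its closest centroid.
        new_assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, assignments


# Hypothetical 2-D data with two well-separated groups.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, assignments = kmeans(points, k=2)
print(sorted(centroids))
print(assignments)
```

On data this cleanly separated the result does not depend on the starting centroids; on real data different seeds can give different clusterings, which is part of the convergence issue noted above.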
