12/10/23(Thursday)
Big Data
L-1
1. What is big data?
[Venn diagram: the matured domain of statistics overlapping the young domain of computer science]
Big Data is the combination of the very mature domain of statistics with the relatively
young domain of computer science. As such, it builds upon the collective knowledge of
mathematics, statistics and data analysis techniques in general.
L-2
2. Data operations
Statistical operations:
- mean,
- min,
- max,
- probability distribution, and
- regression.
Machine learning operations:
- linear regression,
- logistic regression,
- classification, and
- clustering.
Universal data processing operations are as follows:
* Data cleaning: This operation cleans massive datasets.
* Data exploration: This operation explores all the possible values of the dataset.
* Data analysis: This operation performs descriptive and predictive analytics on the data.
* Data visualization: This is the visualization of the analysis output.
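The four operations above can be sketched in R on a toy dataset (the data frame and its column names here are made up for illustration, not taken from the notes):

```r
# Toy dataset with a missing value (hypothetical data)
d <- data.frame(height = c(151, 174, NA, 186, 128),
                weight = c(63, 81, 56, 91, 47))

# Data cleaning: drop rows with missing values
clean <- na.omit(d)

# Data exploration: inspect structure and summary statistics
str(clean)
summary(clean)

# Data analysis: a descriptive statistic and a simple predictive model
mean(clean$weight)
fit <- lm(weight ~ height, data = clean)

# Data visualization: plot the data with the fitted regression line
plot(clean$height, clean$weight)
abline(fit)
```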
3. Data analytics project cycle
1. Identifying the problem
2. Designing data requirement
3. Preprocessing data
4. Performing analytics over data

4. Exploring web pages categorization
Web pages: Home Page, Services, Support, Products, Contact Us, About Us

Popularity categorization:
High   - Home Page
Medium - Services, Products
Low    - Support, Contact Us, About Us
L-4
1. Statistics
Population
In statistics, population is the entire set of items from which you draw data for
a statistical study. It can be a group of individuals, a set of items, etc. It
makes up the data pool for a study.
Sample
A sample is defined as a smaller and more manageable representation of a
larger group: a subset of a larger population that contains the characteristics of
that population. A sample is used in statistical testing when the population size is too large for all members or observations to be included in the test.
The sample is an unbiased subset of the population that best represents the
whole data.
Parameter
A population parameter is a numerical value that describes a characteristic of a
population, such as the mean or standard deviation. It is usually unknown and is
estimated from sample data. For example, the population mean height of all students
in a school is a population parameter.
Statistic
A sample statistic, on the other hand, is a numerical value that describes a
characteristic of a sample, such as the sample mean or sample standard deviation. It
is calculated from sample data and used to make inferences about the population.
For example, the sample mean height of a group of randomly selected students is a
sample statistic.
Difference between parameter and statistic
The key difference between a population parameter and a sample statistic is that the
former describes the entire population, while the latter describes only a sample from
the population. In general, population parameters are more precise and accurate, as
they are calculated using all available data. However, they are usually unknown and
can only be estimated from sample data, which is where sample statistics come into
play.
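The parameter/statistic distinction can be illustrated in R. The population here is simulated, so its true mean is known by construction; the numbers (population size, mean, sd, sample size) are arbitrary choices for the sketch:

```r
set.seed(1)
# Simulated population of 100,000 heights; its mean is a population parameter
population <- rnorm(100000, mean = 165, sd = 10)
mean(population)   # the parameter (normally unknown in practice)

# A random sample of 50; its mean is a sample statistic
s <- sample(population, 50)
mean(s)            # the statistic, which varies from sample to sample
```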
Random Variable
A random variable is a variable whose value is unknown or a function that assigns
values to each of an experiment's outcomes. The use of random variables is most
common in probability and statistics, where they are used to quantify outcomes of
random occurrences.
Probability Distribution
A probability distribution is an idealized frequency distribution. A frequency
distribution describes a specific sample or dataset. It's the number of times each
possible value of a variable occurs in the dataset. The number of times a value occurs
in a sample is determined by its probability of occurrence.
2. Statistics (function)

Example - Normal distribution
Normal distribution, also known as the Gaussian distribution, is a probability
distribution that is symmetric about the mean, showing that data near the
mean are more frequent in occurrence than data far from the mean. In
graphical form, the normal distribution appears as a "bell curve".
f(x) = (1 / (σ√(2π))) e^(−(1/2)((x − μ)/σ)²)

f(x) = probability density function
σ = standard deviation
μ = mean
DRAWING SAMPLES FROM PROBABILITY
DISTRIBUTION
Uniform random numbers
Syntax: runif(n, min=0, max=1)
> runif(1)
> runif(10)
Normally distributed random numbers
Syntax: rnorm(n, mean = 0, sd = 1)
> rnorm(1)
> rnorm(15, mean = 5.4, sd = 0.5)
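Drawing a large sample with rnorm and plotting its histogram shows the bell curve described above. The sample size, mean, and sd below are arbitrary (the mean and sd match the rnorm example in the notes):

```r
set.seed(42)
# Draw 1,000 normally distributed values with mean 5.4 and sd 0.5
x <- rnorm(1000, mean = 5.4, sd = 0.5)

# Histogram of the sample approximates the bell-shaped density
hist(x, probability = TRUE, main = "Normal sample")

# Overlay the theoretical density f(x) for comparison
curve(dnorm(x, mean = 5.4, sd = 0.5), add = TRUE)
```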
L-5
1. Machine learning (types of machine learning)

Machine learning is a branch of artificial intelligence (AI) and computer
science which focuses on the use of data and algorithms to imitate the way
that humans learn, gradually improving its accuracy.
Types of Machine Learning
- Supervised Learning
- Unsupervised Learning
- Semi-supervised Learning
- Active Learning
- Reinforcement Learning
2. Supervised learning regression (lm function)

- Supervised learning, also known as supervised machine learning, is a subcategory of machine learning and artificial intelligence.
- It is defined by its use of labeled datasets to train algorithms that classify data or predict outcomes accurately.
- As input data is fed into the model, it adjusts its weights until the model has been fitted appropriately, which occurs as part of the cross-validation process.
# Values of height
x <- c(151, 174, 138, 186, 128, 136, 179, 163, 152, 131)
# Values of weight
y <- c(63, 81, 56, 91, 47, 57, 76, 72, 62, 48)
lm() Function
This function creates the relationship model between the predictor and the
response variable.
relation <- lm(y ~ x)
# Find the weight of a person with height 170.
a <- data.frame(x = 170)
result <- predict(relation, a)
print(result)
3. Active learning
- Active learning is a special case of machine learning in which a learning algorithm can interactively query a user (or some other information source) to label new data points with the desired outputs.
- In the statistics literature, it is sometimes also called optimal experimental design.
- The information source is called the teacher or oracle.
L-6
1. Multiple linear regression
- Multiple independent variables have an impact on a single dependent variable:

Y = β0 + β1X1 + β2X2 + ... + βnXn

Example
Independent variables:
X1 - transaction date
X2 - house age
X3 - distance to the nearest MRT station
X4 - number of convenience stores
X5 - latitude
X6 - longitude
Dependent variable:
Y - house price of unit area
EXAMPLE WITH R
library(ggplot2)
library(caret)
houseData = read.csv(file="RealEstate.csv", header=TRUE, sep=",")
str(houseData)
colnames(houseData)[1] <- "SL"
colnames(houseData)[2] <- "date"
colnames(houseData)[3] <- "age"
colnames(houseData)[4] <- "dist"
colnames(houseData)[5] <- "store"
colnames(houseData)[6] <- "lat"
colnames(houseData)[7] <- "long"
colnames(houseData)[8] <- "price"
hD = subset(houseData, select = -c(SL, date, lat, long))
print(colnames(hD))
fit0 = lm(price~age, data=hD)
fit1 = lm(price~dist, data=hD)
fit2 = lm(price~store, data=hD)
fit = lm(price~., data=hD)
fit$coeff
residuals(fit)
hist(residuals(fit))
PREDICTION
d <- data.frame(age = 0, dist=3, store=15)
result <- predict(fit,d)
print(result)
print(colnames(hD))
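Since caret is loaded in the example above, the fitted model can also be evaluated on held-out data. This is a sketch, not part of the original notes: the data frame below is a hypothetical stand-in for hD (the coefficients and split proportion are arbitrary), and createDataPartition is caret's stratified-split helper.

```r
library(caret)

set.seed(7)
# Hypothetical stand-in for hD, with the same column names as above
hD <- data.frame(age = runif(100, 0, 40),
                 dist = runif(100, 20, 6500),
                 store = sample(0:10, 100, replace = TRUE))
hD$price <- 45 - 0.2 * hD$age - 0.005 * hD$dist + 1.2 * hD$store + rnorm(100)

# Hold out 20% of the rows as a test set
idx <- createDataPartition(hD$price, p = 0.8, list = FALSE)
train <- hD[idx, ]
test <- hD[-idx, ]

fit <- lm(price ~ ., data = train)
pred <- predict(fit, test)

# Root-mean-squared error on the held-out data
sqrt(mean((test$price - pred)^2))
```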
2. Decision tree
- A decision tree is a machine learning method for classification and prediction and for facilitating decision-making in sequential decision problems.
- Supervised learning
- Three types of Decision Trees:
1. Decision Trees to recommend a course of action based on a sequence of information nodes,
2. Classification and Regression Trees, and
3. Survival Trees.
3. Example of decision tree
Suppose we want to build a decision tree to predict whether a person is likely to buy a new car based on their demographic and behavior data.
- The decision tree starts with the root node, which represents the entire dataset.
- The root node splits the dataset based on the "income" attribute. If the person's income is less than or equal to $50,000, the decision tree follows the left branch, and if the income is greater than $50,000, the decision tree follows the right branch.
- The left branch leads to a node that represents the "age" attribute. If the person's age is less than or equal to 30, the decision tree follows the left branch, and if the age is greater than 30, the decision tree follows the right branch.
- The right branch leads to a leaf node that predicts that the person is unlikely to buy a new car.
EXAMPLE DECISION TREE
[Decision tree diagram: root split on Income <= $50,000]

Let us learn from an example - Dataset: Golf playing decision
Day  Outlook   Temp  Humidity  Wind    Decision
1    Sunny     Hot   High      Weak    No
2    Sunny     Hot   High      Strong  No
3    Overcast  Hot   High      Weak    Yes
4    Rain      Mild  High      Weak    Yes
5    Rain      Cool  Normal    Weak    Yes
6    Rain      Cool  Normal    Strong  No
7    Overcast  Cool  Normal    Strong  Yes
8    Sunny     Mild  High      Weak    No
9    Sunny     Cool  Normal    Weak    Yes
10   Rain      Mild  Normal    Weak    Yes
11   Sunny     Mild  Normal    Strong  Yes
12   Overcast  Mild  High      Strong  Yes
13   Overcast  Hot   Normal    Weak    Yes
14   Rain      Mild  High      Strong  No
EXAMPLE
Outlook: Sunny = 2, Overcast = 1, Rain = 0
Temp: Hot = 2, Mild = 1, Cool = 0
Humidity: High = 2, Normal = 0
Wind: Strong = 2, Weak = 0
Decision: Yes = 1, No = 0
golfData =read.csv(file="golfCSV.csv", header=TRUE, sep=",")
gD = subset(golfData, select = -c(Day))
4. Decision tree in R (syntax)
Package 'rpart' - Recursive Partitioning and Regression Trees
rpart is a powerful machine learning library in R that is used for building
classification and regression trees. This library implements recursive partitioning and is
very easy to use.
The R function rpart is an implementation of the CART (Classification and Regression
Tree) supervised machine learning algorithm used to generate a decision tree.
Function rpart
Syntax:
rpart(formula, data, weights, subset, na.action = na.rpart, method, model = FALSE,
x = FALSE, y = TRUE, parms, control, cost, ...)
5. Rpart example
library(rpart)
golfData =read.csv(file="golfCSV.csv", header=TRUE, sep=",")
gD = subset(golfData, select = -c(Day))
fit = rpart(Decision~Outlook+Temp+Humidity+Wind, gD, method="class",
minsplit = 2, minbucket = 1)
d <- data.frame(Outlook=1, Temp=0 , Humidity=0, Wind = 0)
result <- predict(fit, d)
print(result)
minsplit is the minimum number of observations that must exist in a node in
order for a split to be attempted
minbucket is the minimum number of observations in any terminal node.
library(rpart.plot)
rpart.plot(fit)
L-9
1. DFS (Distributed File System)
This is an area of active research interest today.
Clients should view a DFS the same way they would a centralized
FS; the distribution is hidden at a lower level.
A DFS provides high-throughput data access and fault tolerance.
2. Hadoop stack
Hadoop Stack
Hadoop Internal Software Architecture
[Diagram: Hadoop ecosystem layers - languages and tools on top, row/column data and file data stores, distributed data processing underneath]