UNIT-1 – CHAPTER -1
Introduction to Data Science
Contents
Definition of Data Science
Big Data and Data Science hype
Getting past the hype
Datafication
Current landscape of perspectives
Statistical Inference
Populations and samples
Statistical modelling
Probability distributions
Fitting a model
Overfitting
Exploratory Data Analysis
Definition of Data Science
Data Science is the area of study that extracts, manages, manipulates, and interprets knowledge from vast amounts of data using various scientific methods, algorithms, and processes.
Data Science is a multidisciplinary field that allows
you to extract knowledge from structured or
unstructured data.
Data science enables you to translate a business
problem into a research project and then translate it
back into a practical solution.
Definition of Data Science
Data science refers to a set of theories and techniques from many fields and disciplines that are used to investigate and analyze large amounts of data to help decision makers in many industries such as science, engineering, e-commerce, economics, politics, finance, and education.
Data Science Process or Life cycle
1. Discovery: Discovery step
involves acquiring data from
all the identified internal &
external sources, which helps
you answer the business
question.
2. Preparation: Data can have many inconsistencies such as missing values, blank columns, and incorrect data formats, which need to be cleaned.
Data Science Process or Life cycle
3. Model Planning: In this stage, you determine the methods and techniques for drawing the relationship between input and output variables.
4. Model Building: The actual model building process starts here. The data scientist splits the dataset into training and testing sets (a minimal split is sketched after these steps).
Data Science Process or Life cycle
5.Operationalize: You deliver
the final baselined model
with reports, code, and
technical documents in this
stage.
6.Communicate Results: In
this stage, the key findings
are communicated to all
stakeholders.
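As a hedged illustration of the split mentioned in step 4 above, the following minimal R sketch divides a dataset into training and testing sets; the built-in mtcars data and the 70/30 ratio are assumptions chosen for illustration only.

    # Minimal sketch: split a data frame into training and testing sets
    set.seed(42)                                            # reproducible split
    data(mtcars)                                            # built-in example dataset
    n <- nrow(mtcars)
    train_idx <- sample(seq_len(n), size = floor(0.7 * n))  # 70% of rows
    train <- mtcars[train_idx, ]                            # training set
    test  <- mtcars[-train_idx, ]                           # testing set
    nrow(train); nrow(test)                                 # check the split sizes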
Applications of Data Science
Internet Search: Google Search uses data science technology to return a specific result within a fraction of a second.
Recommendation Systems: Data science is used to build recommendation systems, for example, "suggested friends" on Facebook or "suggested videos" on YouTube.
Image & Speech Recognition: Speech recognition systems like Siri, Google Assistant, and Alexa run on data science techniques. Moreover, Facebook recognizes your friends when you upload a photo with them.
Gaming world: EA Sports, Sony, and Nintendo use data science technology to enhance your gaming experience.
Online Price Comparison: PriceRunner, Junglee, and Shopzilla work on data science mechanisms.
Why Data Science is important?
To process large volumes of data: According to IDC, global data will grow to 175 zettabytes by 2025; data science provides the means to process such volumes.
Data Science enables companies to efficiently understand
complex structured data from multiple sources and derive
valuable insights to make smarter data-driven decisions.
Data Science is widely used in various industry domains,
including marketing, healthcare, finance, banking, policy
work, and more.
Big Data and Data Science hype
and
Getting past the hype
What is Big Data?
Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It is data of such large size and complexity that no traditional data management tool can store or process it efficiently.
Example of Big Data:
Social Media: Statistics show that 500+ terabytes of new data are inserted into the databases of the social media site Facebook every day. This data is mainly generated from photo and video uploads, message exchanges, comments, etc.
The New York Stock Exchange is an example of Big Data
that generates about one terabyte of new trade data per day.
A single jet engine can generate 10+ terabytes of data in 30 minutes of flight time. With many thousands of flights per day, data generation reaches many petabytes.
Types of Big Data
Structured data: Any data that can be stored, accessed, and processed in a fixed format is termed 'structured' data (e.g., tables).
Nowadays, we foresee issues as the size of such data grows to a huge extent; typical sizes are in the range of multiple zettabytes.
Employee_ID | Employee_Name   | Gender | Department | Salary_In_lacs
2365        | Rajesh Kulkarni | Male   | Finance    | 650000
3398        | Pratibha Joshi  | Female | Admin      | 650000
7465        | Shushil Roy     | Male   | Admin      | 500000
7500        | Shubhojit Das   | Male   | Finance    | 500000
7699        | Priya Sane      | Female | Finance    | 550000
Types of Big Data
Unstructured: Any data with unknown form or structure is classified as unstructured data. In addition to being huge in size, unstructured data poses multiple challenges when it comes to processing it to derive value.
A typical example of unstructured data is a heterogeneous
data source containing a combination of simple text files,
images, videos etc. (The output of a Google search)
Types of Big Data
Semi-structured: Semi-structured data contains elements of both forms. It appears structured in form, but the structure is not formally defined (there is no fixed schema).
An example of semi-structured data is data represented in an XML file:
<rec><name>Prashant Rao</name><gen>Male</gen><age>35</age></rec>
<rec><name>Seema R.</name><gen>Female</gen><age>41</age></rec>
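As a brief, hedged illustration of how such semi-structured records can be converted into structured form, the R sketch below parses the XML above; it assumes the xml2 package is installed (install.packages("xml2")).

    # Parse the semi-structured XML records into a structured data frame
    library(xml2)
    doc <- read_xml("<people>
      <rec><name>Prashant Rao</name><gen>Male</gen><age>35</age></rec>
      <rec><name>Seema R.</name><gen>Female</gen><age>41</age></rec>
    </people>")
    names <- xml_text(xml_find_all(doc, ".//name"))
    ages  <- as.integer(xml_text(xml_find_all(doc, ".//age")))
    data.frame(name = names, age = ages)   # now in tabular (structured) form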
Characteristics of Big Data
Volume – Refers to the amount of data that exists. If the volume of
data is large enough, it can be considered big data.
Variety – Variety refers to heterogeneous sources and the nature
of data, both structured and unstructured.
Velocity – Refers to the speed at which data is generated. How fast the data is generated and processed to meet demands determines the real potential of the data.
Variability – It refers to the inconsistency which can be shown by
the data at times, thus hampering the process of being able to
handle and manage the data effectively.
Value - It refers to the value that big data can provide, and it
relates directly to what organizations can do with that collected
data.
Applications of Big Data
Banking and Insurance sectors
Communications, Media and Entertainment
Healthcare Providers
Education
Manufacturing and Natural Resources
Government
Retail and Wholesale trade
Transportation
Energy and Utilities
Limitations of Big Data
Storage: Datasets can require considerable resources to
store
Formatting and Data cleaning: Advanced formatting and
cleaning methods may be required before data analysis.
Quality control: Can be difficult and often has to be done
through small representative samples
Security and Privacy concerns: Often more complex than
for traditional data sets.
Accuracy and consistency of methods: Many approaches are relatively new and imperfect, although they may continue to improve over time.
Limitations of Data Science
Data Science is a Blurry Term: Data Science is a very general term without a precise definition. While it has become a buzzword, it is very hard to pin down the exact meaning of 'Data Scientist'.
Mastering Data Science is nearly impossible: Being a mixture of many fields, Data Science draws on Statistics, Computer Science, and Mathematics. It is hardly possible to master every field and be equally expert in all of them.
Large Amount of Domain Knowledge Required: Another disadvantage of Data Science is its dependency on domain knowledge. A person with a considerable background in Statistics and Computer Science will still find it difficult to solve a Data Science problem without the relevant domain knowledge.
Arbitrary Data May Yield Unexpected Results: A Data Scientist analyzes the data
and makes careful predictions in order to facilitate the decision-making process. Many
times, the data provided is arbitrary and does not yield expected results.
Problem of Data Privacy: For many industries, data is their fuel. Data Scientists help
companies make data-driven decisions. However, the data utilized in the process may
breach the privacy of customers.
Big Data and Data Science Hype
Given the hype around data
science, the reality is that most
companies still fail to use much of
the data they collect and store
during business activities.
Why Now: Technology makes this
possible
Infrastructure for large data
processing
Increased memory and
bandwidth
Datafication
Datafication: It is the process of “taking all aspects of life and
turning them into data”.
(or)
Datafication aims to transform most aspects of a business into
quantifiable data that can be tracked, monitored, and
analyzed.
It refers to the use of tools and processes to turn an
organization into a data-driven enterprise.
Example:
Twitter “datafies” stray thoughts
Linkedin “datafies” professional networks
Google’s augmented reality glasses “datify” gaze (looks)
Current landscape of perspectives
Data science is not merely Statistics or
Hacking or Mathematics. Data science
is the civil engineering of data. It
includes
Statistics (traditional mathematical
analysis)
Data munging (parsing, scraping, and formatting data)
Visualization (graphs, tools, etc.)
It is practical knowledge of tools and materials, coupled with a theoretical understanding of what's possible.
Current landscape of perspectives
Math and Statistics knowledge: Mathematics is the critical part of
data science. Mathematics involves the study of quantity, structure,
space, and changes. For a data scientist, a good knowledge of mathematics is essential. Statistics is one of the most important components of data science. Statistics is a way to collect and analyze numerical data in large quantities and to find meaningful insights from it.
Substantive (Domain) Expertise: The Substantive Knowledge is the
knowledge specific to the area where data science is applied. It is
often referred to as “domain knowledge”. For example, if you are
applying data science to genome problems, you should have
“substantive knowledge” on that topic.
Current landscape of perspectives
Hacking Skills: The hacking skills refer to the computer science skills.
Data is digital. In order to efficiently manipulate the data, you need to
have some programming skills. You need to be comfortable at the
command line, be able to manipulate files of different formats,
program algorithms that will modify the data, etc.
Machine Learning: Machine learning is the backbone of data science. Machine learning is about training a machine so that it can act like a human brain. In data science, we use supervised, unsupervised, and reinforcement learning algorithms to solve problems. Algorithms broadly used in data science include regression, decision trees, clustering, principal component analysis, support vector machines, Naive Bayes, artificial neural networks, and the Apriori algorithm (a clustering/PCA sketch follows below).
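As a small, hedged illustration of two of the algorithms named above, here is a base-R sketch of k-means clustering and principal component analysis on the built-in iris dataset; the dataset and the choice of three clusters are assumptions for illustration.

    # Clustering and PCA on the iris measurements (base R only)
    data(iris)
    x <- iris[, 1:4]                    # numeric columns only
    set.seed(1)
    km <- kmeans(x, centers = 3)        # k-means with 3 clusters
    table(km$cluster, iris$Species)     # compare clusters with the true species
    pca <- prcomp(x, scale. = TRUE)     # principal component analysis
    summary(pca)                        # variance explained per component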
Statistical Inference
Statistics is a branch of Mathematics, that deals with the collection,
analysis, interpretation, and the presentation of the numerical data.
The main purpose of Statistics is to make an accurate conclusion
using a limited sample about a greater population.
Types of Statistics:
Descriptive Statistics: Describes the data.
Inferential Statistics: Helps make predictions from the data.
Statistical inference is, informally, an educated guess: making inferences about something from data.
Statistical Inference
Statistical inference is the discipline concerned with the development of procedures, methods, and theorems that allow us to extract meaning and information from data that has been generated by stochastic (random) processes.
The overall process, starting from the activities or processes in the world to the data, manipulating the data, and then going from the data back to the world, is the field of statistical inference.
Example:
Process or Activity – Employees sending and receiving emails
Data – Number of emails sent and received every day for the last 3 months
Inference – Estimate how many emails will be sent or received in the next 3 months
Statistical Inference – Process and Data
Process:
The activities or functions happening in and around the world are called processes.
One should know ways to describe, understand, and make sense of these processes, because understanding them is part of the solution to many problems.
Data:
Data represents the traces of real-world processes, and exactly which traces we gather is decided by our data collection or sampling method.
Once we have the data, to derive new ideas and to simplify the captured traces (data) into something more comprehensible, we build mathematical models or functions of the data, known as statistical models or estimators.
Note that both the process and the data are random and uncertain in nature.
Statistical Inference – Example
Example: From a shuffled pack of cards, a card is drawn. This trial is repeated 400 times, and the suits drawn are given below:

Suit               | Spade | Clubs | Hearts | Diamonds
No. of times drawn |  90   |  100  |  120   |   90

Question: When a card is drawn at random, what is the probability of getting a Diamond card?
Solution:
Total number of events = 400
Number of trials in which diamond card is drawn = 90
Therefore, P(diamond card) = 90/400 = 0.225
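The same calculation can be transcribed directly into R:

    # Empirical probability from the observed counts
    draws <- c(Spade = 90, Clubs = 100, Hearts = 120, Diamonds = 90)
    p_diamond <- draws["Diamonds"] / sum(draws)
    p_diamond                           # 0.225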
Populations and Samples
Population refers to the entire group of individuals about whom
you wish to draw conclusions.
Sample refers to the subset of people (from the population) from which you will be collecting data.
Populations and Samples
In Statistical Inference, the term Population denotes the set of objects or units, such as tweets, photographs, or stars.
The set of characteristics measured or extracted from the objects is called the observations; N denotes the number of observations in the population.
Example:
Population: The emails sent last year by employees
Observation: The sender’s name, The list of recipients,
Date sent, Text of email,
No. of characters and sentences in the email,
No. of verbs in the email, and
The length of time until first reply.
Populations and Samples
Sample refers to a subset of units of size n from the population, considered in order to examine the observations, draw conclusions, and make inferences about the population.
There are different ways that can be followed for getting this subset
of data, which are called sampling mechanisms.
Note that some sampling mechanisms may introduce biases into the data and distort it. Once that happens, any conclusions you draw will simply be wrong and distorted.
Populations and Samples
Example: Employee Emails
Sample 1: 1/10 of employees, together with their emails, chosen at random
Sample 2: 1/10 of emails, together with their senders, chosen at random
If we counted how many email messages each person sent and used that to estimate the underlying distribution of emails sent by all employees, these two samples could give entirely different answers.
Notice that even something as basic as counting can get distorted by the choice of sampling method (a simulation sketch follows below).
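The distortion can be made concrete with a small, hedged simulation in R; all numbers below (1,000 employees, the skewed email counts) are invented purely for illustration.

    # Simulate a skewed population: a few employees send most of the email
    set.seed(7)
    n_emp <- 1000
    sent  <- rpois(n_emp, lambda = 5) + rbinom(n_emp, 1, 0.05) * 200
    mean(sent)                                    # true mean emails per employee

    # Sample 1: 1/10 of EMPLOYEES at random -> roughly unbiased estimate
    mean(sample(sent, n_emp / 10))

    # Sample 2: 1/10 of EMAILS at random -> heavy senders dominate the sample,
    # so per-person counts computed from it are biased upward
    sender  <- rep(seq_len(n_emp), times = sent)  # one entry per email
    sampled <- sample(sender, floor(length(sender) / 10))
    mean(table(sampled)) * 10                     # distorted per-person estimate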
Populations Vs Samples
Basis for Comparison | Population | Sample
Meaning | Population refers to the collection of all elements possessing common characteristics that comprises the universe. | Sample means a subgroup of the members of the population chosen for participation in the study.
Includes | Each and every unit of the group. | Only a handful of units of the population.
Characteristic | Parameter | Statistic
Data collection | Complete enumeration or census | Sample survey or sampling
Focus on | Identifying the characteristics. | Making inferences about the population.
Big Data – Population and Samples
This Big Data world is defined by the enormous amount of ever-expanding, diverse data being generated, collected, and analyzed by researchers and practitioners alike.
While large data sets allow us to gain useful insights about general
trends, smaller segments contained within the larger data set are still
useful.
For example, consider how the concept of personalization works (personalized medicine). Here, from the large data set, we create smaller, homogeneous data sets to make predictions within smaller groups.
Big Data – Population and Samples
In this context, one can apply the concepts of population and samples to derive useful insights from smaller data sets (samples) drawn from larger data sets (populations).
Issues that need to be addressed:
Sampling solves some engineering challenges
Hidden biases of big data
Sampling method
Underlying assumptions
Sampling distribution
Modelling
Modelling is describing mathematically a situation in reality for the purpose of solving a problem or finding an answer to a question in that situation (from data).
The modelling process is an iterative process that requires creativity and inventiveness, in which mathematical, scientific, and technical knowledge is applied to describe new situations (data).
The modelling process consists of activities related to
determining a strategy to design the model,
analyzing or getting to the bottom of the problem,
choosing variables, setting up relations between the variables, and
deploying mathematical and computational tools.
Modelling - Examples
Architects capture attributes of buildings through blueprints and three-dimensional, scaled-down versions.
Molecular biologists capture protein structure with three-dimensional visualizations of the connections between amino acids.
Statisticians and data scientists capture the uncertainty and randomness of data-generating processes with mathematical functions.
Note that, a model is an artificial construction where
all external detail has been removed or abstracted.
Modelling - Activities
On the left-hand side are
activities related to research,
such as collecting data that are
used in the model and/or can
be used to assess the modelling
results.
On the right-hand side are
conceptual activities that must
lead to the development of a
model, including creative
thinking and formulating
hypotheses to be tested.
How to Build a Model
The key steps involved in Data Science Modelling are:
Step 1: Understanding the Problem:
The first step involved in Data Science Modelling is understanding the
problem. A Data Scientist listens for keywords and phrases when
interviewing a line-of-business expert about a business challenge. The
Data Scientist breaks down the problem into a procedural flow that
always involves a holistic understanding of the business challenge.
Step 2: Data Extraction:
Not just any data, but the unstructured data pieces you collect that are relevant to the business problem you are trying to address. Data extraction is done from various sources: online, surveys, and existing databases.
How to Build a Model
Step 3: Data Cleaning: Data Cleaning is needed because you must sanitize data while gathering it. The following are some of the most typical causes of data inconsistencies and errors:
Duplicate items drawn from a variety of databases.
Errors in the precision of the input data.
Changes, updates, and deletions made to the data entries.
Variables with missing values across multiple databases.
Step 4: Exploratory Data Analysis: Exploratory Data Analysis (EDA) is a robust technique for familiarizing yourself with data and extracting useful insights. Data scientists use statistics and visualization tools to summarize central measurements and variability when performing EDA.
Step 5: Feature Selection: Feature Selection is the process of identifying and
selecting the features that contribute the most to the prediction variable or
output that you are interested in, either automatically or manually.
How to Build a Model
Step 6: Incorporating Machine Learning Algorithms
This is one of the most crucial processes in Data Science Modelling, as the Machine Learning Algorithm aids in creating a usable Data Model. There are many algorithms to pick from, and the model is selected based on the problem. Three types of Machine Learning methods are incorporated:
1) Supervised Learning
Linear Regression
Random Forest
Support Vector Machines
KNN (k-Nearest Neighbors)
2) Unsupervised Learning
K-means Clustering
Hierarchical Clustering
Anomaly Detection
3) Reinforcement Learning
Q-Learning
State-Action-Reward-State-Action (SARSA)
Deep Q Network
How to Build a Model
Step 7: Testing the Models:
The Data Model is applied to the Test Data to check if it’s
accurate and houses all desirable features. You can further test
your Data Model to identify any adjustments that might be
required to enhance the performance and achieve the desired
results.
Step 8: Deploying the Model:
The model that provides the best results based on the test findings is finalized and deployed in the production environment once the desired result has been achieved through proper testing as per the business needs.
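To make steps 6 and 7 concrete, here is a hedged end-to-end sketch in R; the built-in mtcars data, the logistic model (glm), the predictors wt and hp, and the 70/30 split are all illustrative assumptions rather than part of the process description above.

    # Fit a model on training data, then test it on held-out data
    set.seed(123)
    n     <- nrow(mtcars)
    idx   <- sample(seq_len(n), size = floor(0.7 * n))
    train <- mtcars[idx, ]
    test  <- mtcars[-idx, ]
    fit  <- glm(am ~ wt + hp, data = train, family = binomial)  # logistic model
    prob <- predict(fit, newdata = test, type = "response")     # predicted P(am = 1)
    pred <- ifelse(prob > 0.5, 1, 0)                            # classify at 0.5
    mean(pred == test$am)                                       # test-set accuracy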
How to Build a Statistical Model - Issues
The major issues involved in building a model are:
The underlying process behind the problem
Assumptions about the problem
Simple vs. complex models
Mathematical expressions vs. visualization methods
Probability Distributions - Variables
A variable is a quantity whose value changes.
A discrete variable is a variable whose value is obtained by counting.
Example: number of students present
A continuous variable is a variable whose value is obtained by measuring.
Example: heights of all the students in class
A random variable is a variable whose value is a numerical outcome of a
random phenomenon.
The probability distribution of a random variable X tells what the possible values of X are and how probabilities are assigned to them.
A random variable can be discrete or continuous (a small sketch follows below).
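As a small, hedged illustration, the R sketch below draws values of a discrete and a continuous random variable; the distributions and their parameters are assumptions chosen for illustration.

    # Discrete vs. continuous random variables
    set.seed(1)
    flips   <- rbinom(10, size = 1, prob = 0.5)   # discrete: 10 coin flips (0/1)
    heights <- rnorm(10, mean = 165, sd = 8)      # continuous: heights in cm
    table(flips)                                  # counted outcomes
    summary(heights)                              # measured, real-valued outcomes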
Probability Distributions
A statistical model is a non-deterministic model, in which variables are stochastic (random) in nature, i.e., they have probability distributions. Probability distributions are therefore the foundation of statistical models.
A probability distribution is a mathematical function that describes the
probability of different possible values of a variable. Probability
distributions are often depicted using graphs or probability tables.
Example: a single coin flip

Outcome     | Heads | Tails
Probability |  0.5  |  0.5

(figure: a second example showed the distribution of test scores)
Probability Distributions - Types
Three cases are considered here:
1. Probability distribution of one random variable
2. Probability distribution of multiple random variables (joint probability distribution)
Joint Probability: Probability of events A and B.
Conditional Probability: Probability of event A given event B.
3. Independence and exclusivity of events
Probability Distributions - Types
Probability distribution of One Random Variable:
It quantifies how likely a specific outcome is for a random variable, such as the flip of a coin, the roll of a die, or drawing a playing card from a deck.
For a random variable x, P(x) is a function that assigns a probability
to all values of x.
Probability Distribution of x = P(x)
Probability is calculated as the number of desired outcomes divided
by the total possible outcomes.
Probability = (number of desired outcomes) / (total number of possible outcomes)
For example, the probability of rolling a 5 with a die is calculated as the one outcome of rolling a 5 (1) divided by the total number of discrete outcomes (6): 1/6, or about 0.1667 (16.67%). A quick check in R follows below.
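The die example can be checked in R, both exactly and by simulation:

    # Probability of rolling a 5 with a fair die
    p_exact <- 1 / 6                          # one desired outcome out of six
    rolls   <- sample(1:6, 10000, replace = TRUE)
    p_sim   <- mean(rolls == 5)               # relative frequency of a 5
    c(exact = p_exact, simulated = p_sim)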
Probability Distributions - Types
Probability distribution of One Random Variable: Example
Let x be a random variable: the amount of time until the next bus arrives.
Let p(x) be the corresponding probability distribution (density), which maps x to a positive real number, and assume the density of the next bus's arrival time is given.
Then the probability (likelihood) of the next bus arriving between 12 and 13 minutes is the area under p(x) on that interval:

P(12 <= x <= 13) = integral of p(x) dx, from x = 12 to x = 13

A numerical version of this calculation, under an assumed density, follows below.
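Since the slide does not specify p(x), the R sketch below assumes an exponential density with a mean of 10 minutes purely for illustration, and evaluates the integral numerically with base R's integrate().

    # P(12 <= x <= 13) under an ASSUMED exponential waiting-time density
    p <- function(x) dexp(x, rate = 1 / 10)       # assumed density, mean 10 min
    integrate(p, lower = 12, upper = 13)$value    # area under p(x) on [12, 13]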
Probability Distributions - Types
Probability distribution of 2 Random Variable: (Joint Probability)
The probability of two (or more) events is called the joint probability. The
joint probability of two or more random variables is referred to as the joint
probability distribution.
For the random variables x and y, P(x, y) is a joint probability, represented as
Probability Distribution P(x, y) = P(x and y) = P(x | y) * P(y)
which reduces to P(x) * P(y) when x and y are independent.
The calculation of the joint probability is sometimes called the fundamental rule of probability, the "product rule" of probability, or the "chain rule" of probability.
Probability Distributions - Types
Probability distribution of 2 Random Variable: (Joint Probability) - Example
Example: What is the joint probability of drawing a king that is black?
Event "A" = the probability of drawing a king = 4/52 = 0.0769
Event "B" = the probability of drawing a black card = 26/52 = 0.50
Since a card's rank and colour are independent, the joint probability of events "A" and "B" is
P(A) * P(B) = (4/52) * (26/52) = 0.0385, or about 3.9% (matching the direct count of 2 black kings out of 52 cards).
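The same example can be computed, and then checked by simulation, in R; the rank/suit encoding of the deck is an illustrative assumption (rank 13 stands for the king).

    # Exact joint probability
    p_king  <- 4 / 52
    p_black <- 26 / 52
    p_king * p_black                               # 0.0385, about 3.9%

    # Simulated check: draw random cards from a 52-card deck
    set.seed(9)
    deck  <- expand.grid(rank = 1:13, suit = c("S", "H", "D", "C"))
    cards <- deck[sample(nrow(deck), 100000, replace = TRUE), ]
    mean(cards$rank == 13 & cards$suit %in% c("S", "C"))   # black kings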
Probability Distributions - Types
Probability distribution of two Random Variables: (Conditional Probability)
The probability of one event given the occurrence of another event is called the conditional probability.
The conditional probability of one variable given one or more other random variables is referred to as the conditional probability distribution.
The conditional probability for events A given event B is calculated
as follows:
P(A | B) = P(A given B) = P(A and B) / P(B)
Note:
This notation assumes that the probability of event B is not zero.
The notion of event A given event B does not mean that event B has
occurred, instead, it is the probability of event A occurring after or in
the presence of event B for a given trial.
Probability Distributions - Types
Probability distribution of 2 Random Variable: (Conditional Probability) - Example
Example:
Susanth took two tests. The probability of passing both tests is 0.6. The probability of passing the first test is 0.8. What is the probability of passing the second test given that she has passed the first test?
Solution: P(Second | First) = P(First and Second) / P(First) = 0.6 / 0.8 = 0.75.
Different Probability Distribution Functions
(figure: common probability distribution functions)
Fitting a model
Model fitting is the measure of how well a mathematical model generalizes to data similar to that with which it was trained.
A good model fit refers to a model that accurately approximates the
output when it is provided with test inputs.
Fitting a model means estimating the parameters of the model using the observed data: we use the data as evidence to approximate the real-world mathematical process that generated it.
When coding your model, the code reads the data and you specify the functional form of the model.
R or Python will then use built-in optimization methods to give you the most likely values of the parameters given the data (a minimal sketch follows below).
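A minimal sketch of this in R, assuming the built-in mtcars data and a simple linear functional form; lm() estimates the parameters (here by least squares):

    # Specify the functional form; R estimates the parameters from the data
    data(mtcars)
    fit <- lm(mpg ~ wt, data = mtcars)   # mpg as a linear function of weight
    coef(fit)                            # estimated intercept and slope
    summary(fit)$r.squared               # how well the model fits the data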
Fitting a model
Fitting a model refers to adjusting the parameters in the model to
improve accuracy. The process involves
Running an algorithm on data for which the target variable is
known to produce a mathematical model.
Then, the model’s outcomes are compared to the real,
observed values of the target variable to determine the
accuracy.
The next step involves adjusting the algorithm’s standard
parameters in order to reduce the level of error and make the
model more accurate.
This process is repeated several times until the model finds the
optimal parameters to make predictions with substantial
accuracy.
Overfitting and Underfitting
When random fluctuations or noise in the training data are picked up and learned as concepts by the model, the model "overfits".
Overfitting negatively impacts the performance of the model on
new data.
It will perform well on the training set, but very poorly on the test
set. This negatively impacts the model’s ability to generalize and
make accurate predictions for new data.
Overfitting and Underfitting
Underfitting happens when the model can neither sufficiently model the training data nor generalize to new data.
An underfit model is not a suitable model; this will be obvious, as it will perform poorly even on the training data (a demonstration follows below).
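Both failure modes can be demonstrated with a hedged R sketch: polynomial models of increasing degree are fitted to noisy data invented for illustration. The low-degree model underfits (poor even on training data), while the high-degree model overfits (low training error, high test error).

    # Under- and overfitting with polynomials of increasing degree
    set.seed(42)
    x <- runif(60, 0, 1)
    y <- sin(2 * pi * x) + rnorm(60, sd = 0.3)    # noisy nonlinear data
    train <- 1:40; test <- 41:60
    for (deg in c(1, 3, 12)) {                    # underfit, good fit, overfit
      fit  <- lm(y ~ poly(x, deg), subset = train)
      pred <- predict(fit, newdata = data.frame(x = x[test]))
      cat("degree", deg,
          "train RMSE:", sqrt(mean(resid(fit)^2)),
          "test RMSE:",  sqrt(mean((y[test] - pred)^2)), "\n")
    }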
Data Science Process
The complete picture of data science process can be
depicted as shown below.
Data Science Process
Inside the Real World are lots of raw data: logs, Olympics records, employee emails, or
recorded genetic material.
We want to process this to make it clean for analysis. So we build and use pipelines of
data munging: joining, scraping, wrangling, or whatever you want to call it. To do this we
use tools such as Python, shell scripts, R, SQL, or all of them.
Once we have this clean dataset, we should do some kind of EDA. In the course of doing EDA, we may realize that the data isn't actually clean, because of duplicates, missing values, absurd outliers, or data that wasn't logged or was logged incorrectly. If that's the case, we may have to go back to collect more data or spend more time cleaning the dataset.
Next, we design the model using some algorithm like k-nearest neighbors (k-NN), linear regression, Naive Bayes, or something else. The model we choose depends on the type of problem we're trying to solve.
We then can interpret, visualize, report, or communicate our results. This could take the
form of reporting the results up to business to make decisions.
Alternatively, the goal may be to build or prototype a “data product”; e.g., a spam
classifier, or a search ranking algorithm, or a recommendation system.
Exploratory Data Analysis
Exploratory Data Analysis (EDA) is an approach to analyzing data using visual techniques.
It is used to discover trends, patterns, or to check assumptions
with the help of statistical summary and graphical
representations.
Exploratory data analysis is a significant step to take before
diving into statistical modeling or machine learning, to ensure
the data is really what it is claimed to be and that there are no
obvious errors.
EDA should be part of data science projects in every
organization.
Objectives of Exploratory Data Analysis
The goal of EDA is to allow data scientists to get deep insight into a
data set and at the same time provide specific outcomes that a
data scientist would want to extract from the data set. It includes:
List of outliers
Estimates for parameters
Uncertainties for those estimates
List of all important factors
Conclusions or assumptions as to whether certain
individual factors are statistically essential
Optimal settings
A good predictive model
Exploratory Data Analysis Tools
The basic tools of EDA are plots, graphs and summary statistics.
EDA is a method of systematically going through the data to do the following (a minimal R sketch follows this list):
Plotting distributions of all variables (using box plots),
plotting time series of data,
transforming variables,
looking at all pairwise relationships between variables using
scatterplot matrices,
Generating summary statistics.
Computing the mean, minimum, maximum, the upper and lower
quartiles, and identifying outliers.
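A minimal sketch of these tools in base R, using the built-in airquality dataset as an assumed example:

    # Basic EDA: summary statistics, distributions, and pairwise relationships
    data(airquality)
    summary(airquality)                          # summary statistics per variable
    boxplot(airquality$Ozone, main = "Ozone")    # distribution of one variable
    plot(airquality$Temp, type = "l",
         main = "Temperature over time")         # simple time-series plot
    pairs(airquality[, 1:4])                     # scatterplot matrix of pairs
    quantile(airquality$Ozone, na.rm = TRUE)     # quartiles, useful for outliers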
UNIT-1 – CHAPTER -2
Basics of R Language
Contents
Introduction
R- Environment Setup
Programming with R
Basic Data Types.
Introduction to R language
Introduction:
R is an open-source programming language and environment
used for statistical analysis, data visualization, and data science.
Being open-source, R has a massive community that
continuously works to improve the environment as well as helps
members worldwide to improve and innovate.
It has over 10,000 different libraries and packages to enhance
and add on to its already significant capabilities.
Introduction to R language
History:
R is an extension of the S programming language, which was created by John Chambers at Bell Laboratories (then part of AT&T) in 1976. S was a premier tool for statistical research.
In 1992, Ross Ihaka and Robert Gentleman created R at the
University of Auckland, New Zealand, as a tool that their students
could learn and use easily.
Ihaka and Gentleman released the initial version in 1995, and a
stable beta version was released in 2000.
Introduction to R language
Advantages/Features:
Open source: R is an open-source environment. It is cost-effective for projects of any size and is widely available.
Advanced graphics: R has various libraries and packages available for plotting attractive and elegant graphs. These can also be used to create highly interactive graphics for data-driven storytelling.
R has a massive community that works tirelessly to improve and add to R's abilities. CRAN, the Comprehensive R Archive Network, has over 10,000 packages or extensions, covering everything from producing high-definition graphics to creating interactive web apps.
Introduction to R language
Advantages/Features:
R can perform complex mathematical and statistical operations on vectors, matrices, data frames, arrays, and other data objects of varying sizes (a small sketch follows after this list).
R is an interpreted language and does not need a compiler. It generates machine-independent code that is easy to debug and is highly portable.
R is a comprehensive programming language that supports
object-oriented as well as procedural programming with
generic and first-class functions.
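A small sketch of the vectorized operations mentioned above (the values are arbitrary examples):

    # Vectors, matrices, and data frames in base R
    v <- c(1, 2, 3, 4)             # numeric vector
    v * 2                          # elementwise arithmetic
    m <- matrix(1:6, nrow = 2)     # 2 x 3 matrix
    m %*% t(m)                     # matrix multiplication (2 x 2 result)
    df <- data.frame(name = c("A", "B"), score = c(90, 85))
    mean(df$score)                 # statistics on a data frame column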
Introduction to R language
Advantages/Features:
R supports both a Command Line Interface and a Graphical User Interface, allowing users to program at the console level as well as work with scripts.
R supports a wide variety of packages to handle problems in the areas of finance, healthcare, high-performance computing, distributed computing, statistics, and many more.
Compatible with various other technologies: R can integrate
with a number of different technologies and programming
languages.
Introduction to R language
Disadvantages:
R seems relatively easy to learn at the beginning, but it is hard to master.
Command-based R can be highly inconvenient for statisticians and non-computing professionals to use.
R commands do not concern themselves with memory management, and R can therefore consume a large amount of memory.
Due to the large number of packages available and the redundancy among them, some packages can be of poor quality.
R- Environment Setup – Install R on windows
Step – 1: Go to CRAN R project website.
Step – 2: Click on the Download R for Windows link.
R- Environment Setup – Install R on windows
Step – 3: Click on the base subdirectory link or install R for the first
time link.
Step – 4: Click Download R 3.3.4 for Windows and save the executable
.exe file.
R- Environment Setup – Install R on windows
Step – 5: Run the .exe file and follow the installation instructions.
5.a. Select the desired language and then click Next.
R- Environment Setup – Install R on windows
Step – 5: Run the .exe file and follow the installation instructions.
5.b. Read the license agreement and click Next.
R- Environment Setup – Install R on windows
Step – 5: Run the .exe file and follow the installation instructions.
5.c. Select the components you wish to install (it is recommended to
install all the components). Click Next.
R- Environment Setup – Install R on windows
Step – 5: Run the .exe file and follow the installation instructions.
5.d. Enter/browse the folder/path you wish to install R into and
then confirm by clicking Next.
R- Environment Setup – Install R on windows
Step – 5: Run the .exe file and follow the installation instructions.
5.e. Select additional tasks like creating desktop shortcuts etc. then
click Next.
R- Environment Setup – Install R on windows
Step – 5: Run the .exe file and follow the installation instructions.
5.f. Wait for the installation process to complete.
R- Environment Setup – Install R on windows
Step – 5: Run the .exe file and follow the installation instructions.
5.g. Click on Finish to complete the installation.
R- Environment Setup – Install R Studio on windows
Step – 1: To begin, go to the RStudio download page and click the download button for RStudio Desktop.
Step – 2: Click on the link for the windows version of RStudio and save the .exe
file.
Step – 3: Run the .exe and follow
the installation instructions.
3.a. Click Next on the welcome
window.
R- Environment Setup – Install R Studio on windows
3.b. Enter/browse the path to the installation folder and
click Next to proceed.
R- Environment Setup – Install R Studio on windows
3.c. Select the folder for the start menu shortcut or click on do
not create shortcuts and then click Next.
R- Environment Setup – Install R Studio on windows
3.d. Wait for the installation process to complete.
R- Environment Setup – Install R Studio on windows
3.e. Click Finish to end the installation.
R- Environment Setup – Install R on Linux
R- Environment Setup – Install R Studio on Linux