UNIT – 1 INTRODUCTION TO DATA EXPLORATION
PART - A
1. What are the two basic organizing concepts of data analysis ?
Ans : The two organizing concepts that have become the basis of the language of data
analysis are,
Cases and
Variables.
2. Define the concept cases and variables.
Ans : Cases :
The cases are the basic units of analysis, i.e., the things about which information is collected.
Variables :
The word variable expresses the fact that this feature varies across the different cases.
3. What is the easier way of visualizing how any variable is distributed across cases ?
Ans : Bar chart :
One simple device is the bar chart, a visual display in which bars are drawn to represent each
category of a variable such that the length of the bar is proportional to the number of cases
in the category.
Pie chart :
A pie chart can also be used to display the same information. It is largely a matter of taste
whether data from a categorical variable are displayed in a bar chart or a pie chart. In
general, pie charts are to be preferred when there are only a few categories and when the
sizes of the categories are very different.
4. Explain Histogram.
Ans : Charts that are somewhat similar to bar charts can be used to display interval level
variables grouped into categories and these are called histograms. They are constructed in
exactly the same way as bar charts except, that the ordering of the categories is fixed, and
care has to be taken to show exactly how the data were grouped.
5. What are the features visible in histogram ?
Ans :
What are typical values in the distribution? (Level)
How widely dispersed are the values? Do they differ very much from one another?
(Spread)
Is the distribution flat or peaked? Symmetrical or skewed? (Shape)
Are there any particularly unusual values? (Outliers)
6. What is SPSS?
Ans : SPSS is a very useful computer package which includes hundreds of different
procedures for displaying and analyzing data.
7. List the three main windows of SPSS.
Ans :
The Data Editor window
The Output window
The Syntax window.
8. Explain Residuals.
Ans : A residual can be defined as the difference between a data point and the observed
typical, or average, value.
Expressed as,
DATA = FIT + RESIDUAL
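The identity above can be checked with a small sketch. The data values are hypothetical, and the median is assumed as the "typical value" used for the fit; a mean would work the same way.

```python
from statistics import median

# Hypothetical data values, chosen only for illustration.
data = [12, 15, 11, 30, 14]

# FIT: one typical value for the whole batch (assumed here to be the median).
fit = median(data)

# RESIDUAL: what is left of each data point once the fit is removed.
residuals = [x - fit for x in data]

# DATA = FIT + RESIDUAL holds for every case.
reconstructed = [fit + r for r in residuals]

print(fit)        # 14
print(residuals)  # [-2, 1, -3, 16, 0]
```

The large residual (16) flags the value 30 as a possible outlier.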
9. Define Twyman’s law.
Ans : The more unusual or interesting the data, the more likely they are to have been the
result of an error of one kind or another.
10. Explain standardized variables.
Ans : Subtracting a constant from every data value alters the level of the distribution, and
dividing by a constant scales the values by a factor. Combining the two produces a very
powerful tool which can render any variable into a form where it can be compared with any
other. The result is called a standardized variable.
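As a minimal sketch of this idea, the example below subtracts a measure of level and divides by a measure of spread. The mean and standard deviation are assumed here (a median and midspread could be used instead), and the two variables are made-up values.

```python
from statistics import mean, pstdev

def standardize(values):
    """Subtract the level, then divide by the spread."""
    centre = mean(values)    # assumed measure of level
    scale = pstdev(values)   # assumed measure of spread
    return [(v - centre) / scale for v in values]

# Hypothetical variables measured in different units.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

z_heights = standardize(heights_cm)
z_weights = standardize(weights_kg)

# After standardizing, both variables are on the same unit-free scale.
print([round(z, 2) for z in z_heights])
print([round(z, 2) for z in z_weights])
```

Each standardized variable has level 0 and spread 1, so values from the two variables can be compared directly.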
11. What are the four major methodological problems encountered when studying the
distribution of income ?
Ans : The four major methodological problems encountered when studying the distribution
of income are,
How should income be defined?
What should be the unit of measurement?
What should be the time period considered?
What sources of data are available?
12. List the sources of income.
Ans :
Earned income, from either employment or self-employment.
Unearned income, which accrues from ownership of investments, property, rent and
so on.
Transfer income, that is, benefits and pensions transferred on the basis of entitlement,
not on the basis of work or ownership, mainly by the government but occasionally by
individuals (e.g. alimony).
13. Explain Lorenz curves.
Ans : Lorenz curves have visual appeal because they portray how close a particular
distribution is to total equality or total inequality. The degree of inequality in two
distributions can be compared by superimposing their Lorenz curves.
14. Define the Gini Coefficient.
Ans : A measure that summarizes what is happening across all the distribution is the Gini
coefficient. The Gini coefficient expresses the ratio between the area between the Lorenz
curve and the line of total equality and the total area in the triangle formed between the
perfect equality and perfect inequality lines. It therefore varies between 0 and 1 although it is
sometimes multiplied by 100 to express the coefficient in percentage form.
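The area ratio described above can be sketched numerically. The function below is illustrative only: it assumes individual incomes are available, builds the Lorenz curve from cumulative income shares, and approximates the area under it with trapezoids.

```python
def gini(incomes):
    """Gini coefficient from the area under the Lorenz curve."""
    values = sorted(incomes)
    n = len(values)
    total = sum(values)
    cumulative = 0.0
    prev_share = 0.0
    area_under_lorenz = 0.0
    for v in values:
        cumulative += v
        share = cumulative / total  # cumulative share of total income
        # Trapezoid between two successive points of the Lorenz curve.
        area_under_lorenz += (prev_share + share) / 2 / n
        prev_share = share
    # The triangle under the line of total equality has area 0.5, so
    # (0.5 - area under Lorenz) / 0.5 simplifies to the line below.
    return 1 - 2 * area_under_lorenz

print(gini([10, 10, 10, 10]))  # everyone equal
print(gini([0, 0, 0, 100]))    # one person holds everything
```

With only four cases the second result is 0.75 rather than 1; the coefficient approaches 1 as the number of cases grows.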
15. Explain smoothing?
Ans : Smoothing a time series decomposes the data into a smooth underlying pattern and the
rough variation around it. In other words, what we might understand in engineering as
Message = Signal + Noise
becomes
Data = Smooth + Rough
16. List the different smoothing process in refinement ?
Ans :
Endpoint Smoothing
Breaking the smooth
17. Explain smoothing in time series.
Ans : Smoothing is a technique applied to time series to remove the fine-grained variation
between time steps. The hope of smoothing is to remove noise and better expose the signal of
the underlying causal processes. Moving averages are a simple and common type of
smoothing used in time series analysis and time series forecasting.
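A minimal sketch of a moving average follows, assuming a centred window of odd width and a made-up series. It also shows the decomposition Data = Smooth + Rough for the points that have a full window.

```python
def moving_average(series, window=3):
    """Centred moving average; window is assumed to be odd."""
    half = window // 2
    smooth = []
    for i in range(half, len(series) - half):
        chunk = series[i - half : i + half + 1]
        smooth.append(sum(chunk) / window)
    return smooth

# Hypothetical time series.
series = [3, 5, 4, 9, 6, 7, 8]

smooth = moving_average(series, window=3)

# The end points have no full window, so they are left out of the comparison.
rough = [d - s for d, s in zip(series[1:-1], smooth)]

print([round(s, 2) for s in smooth])
print([round(r, 2) for r in rough])
```

The two end values are dropped here; handling them is the job of the endpoint smoothing refinement listed in question 16.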
18. Why is data exploration important ?
Ans : Data visualization in data exploration leverages familiar visual cues such as shapes,
dimensions, colors, lines, points, and angles so that data analysts can effectively visualize
and define the metadata, and then perform data cleansing.
19. How is data exploration made easy in python?
Ans : Python data exploration is made easier with Pandas, the open source Python data
analysis library. With a profiling add-on such as pandas-profiling (now ydata-profiling),
any dataframe can be profiled and a complete HTML report on the dataset generated.
20. What are statistical analysis techniques of data exploration?
Ans :
Regression analysis,
Cohort analysis,
Predictive and prescriptive analysis,
Conjoint analysis and
Cluster analysis.
PART – B
11. Explain in detail numerical summaries of level and spread.
12. Explain in detail the concepts of Scaling and Standardizing.
13. Write in detail about Inequalities.
14. Write a detailed explanation about time series smoothing.
15. Explain various smoothing Techniques.
UNIT – 2 INTRODUCING TWO VARIABLE AND THIRD VARIABLE.
PART – A
1. Why is data cleansing important for data visualization?
Ans : Data cleansing is used for identifying and removing errors and inconsistencies from
data in order to enhance the quality of data. This process is crucial and emphasized because
wrong data can lead to poor analysis. This step ensures the quality of the data is met to
prepare data for visualization.
2. What are some important features of a good data visualization?
Ans : The data visualization should be light and must highlight essential aspects of the data;
looking at important variables, what is relatively important, what are the trends and changes.
Besides, data visualization must be visually appealing but should not have unnecessary
information in it.
3. What is a scatter plot? For what type of data is scatter plot usually used for?
Ans : A scatter plot is a chart used to plot a correlation between two or more variables at the
same time. It’s usually used for numeric data.
4. What features might be visible in scatterplots?
Ans :
Correlation: the two variables might have a relationship, for example, one might
depend on another. But this is not the same as causation.
Associations: the variables may be associated with one another.
Outliers: there could be cases where the data in two dimensions does not follow the
general pattern.
Clusters: sometimes there could be groups of data that form a cluster on the plot.
Gaps: some combinations of values might not exist in a particular case.
Barriers: boundaries.
Conditional relationships: some relationship between the variables rely on a condition
to be met.
5. What type of data are box-plots usually used for? Why? What information could you gain
from a box-plot?
Ans : Boxplots are usually used for continuous variables. The plot is generally not
informative when used for discrete data. A boxplot shows:
Minimum/maximum score
Lower/upper quartile
Median
The Interquartile Range
Skewness
Dispersion
Outliers
6. When analyzing a histogram, what are some of the features to look for? What type of
data is histograms usually used for?
Ans :
Asymmetry
Outliers
Multimodality
Gaps
Heaping/Rounding: Heaping example: temperature data can consist of common values due to
conversion from Fahrenheit to Celsius. Rounding example: weight data that are all multiples
of [Link]/Errors.
Histogram is usually for continuous data.
7. List 3 libraries in R that can be used for data visualization
Ans :
ggplot2,
lattice,
leaflet,
highcharter,
RColorBrewer,
plotly,
sunburstR,
rgl,
dygraphs.
8. List the good table manners.
Ans :
Reproducibility versus clarity
Labelling
Sources
Sample data
Definitions
Opinion data
Ensuring frequencies can be reconstructed
Showing which way the percentages run and
Layout.
9. What is the relationship between two variables in research?
Ans : Correlation is a statistical method used to determine whether a relationship between
variables exists. Regression is a statistical method used to describe the nature of the
relationship between variables --- i.e., a positive or negative, linear or nonlinear relationship
10. What is a 3 variable system?
Ans : A system of three equations in three variables can be solved by using a series of steps
that forces a variable to be eliminated. The steps include interchanging the order of
equations, multiplying both sides of an equation by a nonzero constant, and adding a nonzero
multiple of one equation to another equation.
11. What is causal explanation?
Ans : The causal explanation is referring not so much to the logic of a theory but rather to the
explanation of the internal physical mechanism of phenomenon. “Explaining the world and
what is going on in it means, accordingly, laying bare its inner working, its underlying causal
mechanisms.”
12. How do you make multiple plots to a single page layout in R?
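Ans : attach(mtcars) # assumes the built-in mtcars data set, which provides wt, mpg and disp
par(mfrow=c(2,2)) # arrange the next four plots in a 2 x 2 grid
plot(wt,mpg, main="Scatterplot of wt vs. mpg")
plot(wt,disp, main="Scatterplot of wt vs disp")
hist(wt, main="Histogram of wt")
boxplot(wt, main="Boxplot of wt")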
13. What is transformation in data exploration?
Ans : Data transformation is the process of converting raw data into a format or structure that
would be more suitable for model building and also data discovery in general. It is an
imperative step in feature engineering that facilitates discovering insights.
14. What is nominal data and ordinal data? Explain with examples.
Ans : Nominal data is data with no fixed categorical order. For example, the continents of the
world (Europe, Asia, North America, Africa, South America, Antarctica, Oceania).
Ordinal data is data with fixed categorical order. For example, a customer satisfaction
rating (very dissatisfied, dissatisfied, neutral, satisfied, very satisfied).
15. Why are contingency tables important?
Ans : A contingency table provides a way of portraying data that can facilitate calculating
probabilities. The table helps in determining conditional probabilities quite easily. The table
displays sample values in relation to two different variables that may be dependent or
contingent on one another.
16. What is the function of a contingency table in bivariate analysis?
Ans : The joint distribution of two variables is called a bivariate distribution. A contingency table
shows the frequency distribution of the values of the dependent variable, given the
occurrence of the values of the independent variable.
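As an illustrative sketch, a two-variable contingency table can be built by counting how often each pair of values occurs. The cases and category names below are hypothetical.

```python
from collections import Counter

# Hypothetical cases: (independent variable, dependent variable).
cases = [
    ("male", "yes"), ("male", "no"), ("male", "yes"),
    ("female", "no"), ("female", "no"), ("female", "yes"),
    ("female", "no"),
]

counts = Counter(cases)
rows = sorted({group for group, _ in cases})
cols = sorted({answer for _, answer in cases})

# Frequencies of the dependent variable within each category
# of the independent variable.
table = {r: {c: counts[(r, c)] for c in cols} for r in rows}

# Row percentages: the distribution of answers given the group.
row_pct = {
    r: {c: 100 * table[r][c] / sum(table[r].values()) for c in cols}
    for r in rows
}

for r in rows:
    print(r, table[r], row_pct[r])
```

Percentages are run within each row here because the row variable is treated as the independent one; this is what "showing which way the percentages run" refers to.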
17. What is Rmarkdown? What is the use of it?
Ans : RMarkdown is a tool provided by R to create dynamic documents and reports that
contain shiny widgets and outputs from R. An R Markdown document is written in
markdown (an easy-to-write plain text format) and contains chunks of embedded R code.
18. What causes resistance lines?
Ans : Resistance in technical analysis is a price level that a rising stock can’t seem to
overcome. Once a stock reaches its resistance level, it often stalls and reverses. Resistance is
caused by heavy selling that overpowers buying, and typically occurs at specific resistance
price levels.
19. Why is scatterplot used?
Ans : Scatter plots’ primary uses are to observe and show relationships between two numeric
variables. The dots in a scatter plot not only report the values of individual data points, but
also patterns when the data are taken as a whole. Identification of correlational relationships
are common with scatter plots.
20. What are the steps involved in visualizing longitudinal data.
Ans :
Import libraries
Prepare data
Violin plots
Plot longitudinal data.
PART – B
1. Explain in detail the various data visualization techniques.
2. Explain the seven stages of visualization?
3. Differentiate between two variable and three variable contingency table
4. Explain in detail resistance lines analysis techniques.
5. Explain in detail how two variable to three variable transformation occurs with relevant
example.
UNIT - III
PART - A
1. What is data visualization?
Data visualization is the field of mapping data to a visual form. Visualization adds
building blocks for interacting with and representing various kinds of abstract data, but
typically these methods undervalue the aesthetic principles of visual design rather than
embracing their strength as a necessary aid to effective communication.
2. List out the seven stages for visualizing data?
a. Acquire
b. Parse
c. Filter
d. Mine
e. Represent
f. Refine
g. Interact
3. What is meant by acquire?
The acquisition step involves obtaining the data. This can be either extremely
complicated (e.g., trying to glean useful data from a large system) or very simple
(reading a readily available text file).
4. What is meant by parse?
After the data is acquired, it needs to be parsed, that is, changed into a format that
tags each part of the data with its intended use. Each line of the file must be broken
along its individual parts; in this case, it must be delimited at each tab character.
Then, each piece of data needs to be converted to a useful format. Each field is
formatted as a data type that we’ll handle in a conversion program.
5. What is meant by Filter?
Filtering the data means removing the portions that are not relevant to our use.
6. What is meant by Mine?
This step involves math, statistics, and data mining. The data in this case receives
only a simple treatment: the program must figure out the minimum and maximum
values for latitude and longitude by running through the data, so that it can be
presented on a screen at a proper scale.
7. What is meant by Represent?
This step determines the basic form that a set of data will take. Some data sets are
shown as lists, others are structured like trees, and so forth. In this case, each zip
code has a latitude and longitude, so the codes can be mapped as a two-dimensional
plot, with the minimum and maximum values for the latitude and longitude used for
the start and end of the scale in each dimension.
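The Mine and Represent steps can be sketched together: find the minimum and maximum, then rescale each value linearly onto the screen. The helper name `remap`, the sample longitudes, and the 640-pixel width are assumptions for illustration; the helper mirrors the role that map( ) plays in Processing.

```python
def remap(value, in_min, in_max, out_min, out_max):
    """Linearly rescale value from [in_min, in_max] to [out_min, out_max]."""
    return out_min + (out_max - out_min) * (value - in_min) / (in_max - in_min)

# Hypothetical longitudes for three locations.
longitudes = [-120.0, -100.0, -80.0]

# Mine: run through the data to find the extremes.
lo, hi = min(longitudes), max(longitudes)

# Represent: the extremes become the start and end of the screen scale.
xs = [remap(v, lo, hi, 0, 640) for v in longitudes]

print(xs)  # [0.0, 320.0, 640.0]
```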
8. What is meant by Refine?
In this step, graphic design methods are used to further clarify the representation by
calling more attention to particular data (establishing hierarchy) or by changing
attributes (such as color) that contribute to readability.
9. What is meant by Processing Development Environment(PDE)?
The Processing Development Environment (PDE) is the software that runs
when you double-click the Processing icon. The PDE is an Integrated Development
Environment with a minimalist set of features designed as a simple introduction to
programming or for testing one-off ideas.
10. What is meant by sketch?
A Processing program is called a sketch. The idea is to make Java-style programming
feel more like scripting, and to adopt the process of scripting to quickly write code.
Sketches are stored in the sketchbook, a folder that’s used as the default location for
saving all of your projects. When you run Processing, the sketch that was used
last will automatically open. If this is the first time Processing is used (or if the sketch
is no longer available), a new sketch will open.
Sketches that are stored in the sketchbook can be accessed from File ➝ Sketchbook.
Alternatively, File ➝ Open... can be used to open a sketch from elsewhere on the
system.
11. Explain size() method?
The size( ) command also sets the global variables width and height. For objects
whose size is dependent on the screen, always use the width and height variables
instead of a number.
Example : size(400, 400);
12. What is meant by renderer?
A renderer handles how the Processing API is implemented for a particular output
method. Several renderers are included with Processing, and each has a unique
function.
Example: Java2D renderer, Processing 2D renderer, Processing 3D renderer,
OpenGL renderer, PDF renderer.
13. How to load and display data in processing?
One of the unique aspects of the Processing API is the way files are handled. The
loadImage( ) and loadStrings( ) functions each expect to find a file inside a folder
named data, which is a subdirectory of the sketch folder.
File handling functions include loadStrings( ), which reads a text file into an
array of String objects, and loadImage( ), which reads an image into a PImage object,
the container for image data in Processing.
Eg: String[] lines = loadStrings("[Link]");
PImage image = loadImage("[Link]");
14. Write the functions used in processing API?
Acquire
loadStrings( ), loadBytes( )
Parse
split( )
Filter
for( ), if (item[i].startsWith( ))
Mine
min( ), max( ), abs( )
Represent
map( ), beginShape( ), endShape( )
Refine
fill( ), strokeWeight( ), smooth( )
Interact
mouseMoved( ), mouseDragged( ), keyPressed( )
15. What is meant by Library?
A library is a collection of code in a specified format that makes it easy to use within
Processing. Libraries have been important to the growth of the project because they
let developers make new features accessible to users without making them part of the
core Processing API.
example : XML import library
To use the XML library in a project, choose Sketch ➝ Import Library ➝ xml. This
will add the following line to the top of the sketch:
import [Link].*;
16. Write the code for drawing a map?
PImage mapImage;
void setup( ) {
size(640, 400);
mapImage = loadImage("[Link]");
}
void draw( )
{
background(255);
image(mapImage, 0, 0);
}
17. Write the basic methods used in processing API?
loadImage()
setup( )
draw( )
setup( ) - load images, fonts, and set initial values for variables
draw( ) - runs at 60 frames per second; it can be used to update the screen to show
animation or respond to mouse movement and other types of input.
loadImage() - function reads an image from the data folder (URLs or absolute paths
also work). The PImage class is a container for image data, and the image( )
command draws it to the screen at a specific location.
18. How to display a map in Processing?
Displaying a map in Processing is a two-step process:
1. Load the data.
2. Display the data in the desired format.
19. How to display centers of states?
1. Create locationTable and use the [Link]( ) function to read
each location’s coordinates (x and y values).
2. Draw a circle using those values. Because a circle, geometrically speaking, is
just an ellipse whose width and height are the same, graphics libraries provide
an ellipse-drawing function that covers circle drawing as well.
20. What is screen scraping?
Extracting data from the HTML of a web page with a program is often called screen scraping.
Most HTML is machine-generated and is therefore in a fairly regular format. The task is
to locate the table you want among the JavaScript and other HTML tags used to display the
web page.
21. What steps are required for screen scraping?
1. Navigate to the page that contains the data.
2. Choose View Source from the browser’s menu and take a look at the code. Use
the Find command to look for your identifiers (e.g., “American League” or
“East”) to see where the data begins.
3. If the web page contains the data directly in its HTML, write a program that
reads lines from the page and parses them to extract the data.
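The last step can be sketched with Python's standard `html.parser` module. The HTML fragment and the class name `CellCollector` are hypothetical (the second team's figures are invented); a real page would first be downloaded and would contain far more markup around the table.

```python
from html.parser import HTMLParser

class CellCollector(HTMLParser):
    """Collects the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.in_cell = False
        self.rows = []

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])   # start a new row
        elif tag == "td":
            self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_cell = False

    def handle_data(self, data):
        # Keep only text that sits inside a table cell.
        if self.in_cell and self.rows:
            self.rows[-1].append(data.strip())

# Hypothetical fragment standing in for the page source.
html = (
    "<table>"
    "<tr><td>Red Sox</td><td>40</td><td>21</td></tr>"
    "<tr><td>Yankees</td><td>35</td><td>26</td></tr>"
    "</table>"
)

parser = CellCollector()
parser.feed(html)
print(parser.rows)
```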
22. What is meant by Typography?
The Elements of Typographic Style (Hartley and Marks Publishers) defines the en
dash as suitable for use when separating values that can be broken with the word
“to.” In this case, the 40–21 next to the Red Sox can be stated as “the Red Sox have a
record of 40 to 21.”
23. What is meant by Unicode escape?
The en dash is specified by "\u2013", a Unicode escape sequence. A
Unicode escape is a \u followed by four hex digits representing the character’s
number in the Unicode character set. Other types of dashes can be used, such as the
em dash, "\u2014", or the minus sign, "\u2212".
Example: change the line
title[index] = wins + "-" + losses;
to read as follows:
title[index] = wins + "\u2013" + losses;
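Python string literals use the same \u escape form, so the idea can be sketched outside Processing as well. The win and loss figures are taken from the example record in question 22.

```python
wins, losses = 40, 21

# "\u2013" is a single character: the en dash (code point U+2013).
title = str(wins) + "\u2013" + str(losses)

print(title)
print(len("\u2013"))        # 1
print(hex(ord("\u2013")))   # 0x2013
```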
24. What is tree structure?
Tree structures store data for which each element might have several subelements.
Elements in a tree are typically referred to as nodes, usually with multiple child
nodes. Files and directories are straightforward examples of a tree structure. Each
directory can contain several items, which can be files or additional directories.
Additional directories may have more files inside, and so on.
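A minimal sketch of such a structure follows, using nested dictionaries as nodes and recursion to visit them. The directory and file names are hypothetical.

```python
# A hypothetical directory tree: a dict is a directory, None marks a file.
tree = {
    "sketchbook": {
        "map": {"map.pde": None, "data": {"locations.tsv": None}},
        "notes.txt": None,
    }
}

def count_files(node):
    """Count the files beneath a node by recursing into each child."""
    if node is None:   # a file is a leaf node
        return 1
    return sum(count_files(child) for child in node.values())

def walk(node, prefix=""):
    """Return the full path of every file in the tree."""
    paths = []
    for name, child in node.items():
        path = prefix + "/" + name if prefix else name
        if child is None:
            paths.append(path)
        else:
            paths.extend(walk(child, path))   # descend into the directory
    return paths

print(count_files(tree))   # 3
print(walk(tree))
```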
25. Special challenges for recursion?
1. It’s common to show one or two levels of the tree and let the user delve in or
move out.
2. This in turn requires ways to signal to the application that parts of the data are
omitted or hidden.
3. When animation is available, it’s convenient to load recursive data incrementally
so that you don’t have to make the viewer wait for the whole data set to
load.
26. What is meant by graph?
A graph is a collection of elements, usually called nodes, linked together by edges
(sometimes called branches). It is a common structure for mapping connections of
many related elements.
27. Function of relax() method?
The relax( ) methods calculate the placement of each node and the length of each
edge. This is handled by a kind of toy physics simulation, known as a force-directed
layout, which approaches a balanced layout through a series of calculations.
28. What is meant by force directed layout?
In a force-directed layout, edges act like springs that have a target length
(also called their rest length). At each step, each Edge tries to get its length a little
closer to its target length. Because several edges are interconnected, the elements
push and pull on one another.
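The spring behaviour can be sketched for a single edge between two nodes. The function name `relax_edge`, the `rate` parameter, and the starting positions are assumptions for illustration; this is not the code of the relax( ) methods themselves.

```python
import math

def relax_edge(a, b, rest_length, rate=0.1):
    """Move points a and b a little so their distance nears rest_length."""
    dx, dy = b[0] - a[0], b[1] - a[1]
    dist = math.hypot(dx, dy)
    if dist == 0:
        return a, b   # no direction to push along
    # Positive when the edge is too long, negative when it is too short.
    stretch = (dist - rest_length) / dist
    shift_x, shift_y = dx * stretch * rate, dy * stretch * rate
    # Each end moves toward (or away from) the other by the same amount.
    return (a[0] + shift_x, a[1] + shift_y), (b[0] - shift_x, b[1] - shift_y)

# Two hypothetical nodes 10 units apart, joined by an edge of rest length 4.
a, b = (0.0, 0.0), (10.0, 0.0)
for _ in range(50):
    a, b = relax_edge(a, b, rest_length=4.0)

distance = math.hypot(b[0] - a[0], b[1] - a[1])
print(round(distance, 3))
```

In a full graph the same step is applied to every edge on every frame, so nodes shared by several edges settle where the competing pulls balance.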
PART – B
1. Explain the seven stages of visualization?
2. Explain the concept of mapping? and explain the functions with examples?
3. Explain in detail about scatterplot maps?
4. Explain tree concepts?
5. Explain the concept of network and graphs?
UNIT – IV
PART – A
1. Define R?
The R Project for Statistical Computing, or simply named R, is a free software environment for
statistical computing and graphics. It is also a programming language that is widely used among
statisticians and data miners for developing statistical software and data analysis.
2. Define R studio?
RStudio is a free, open source IDE for R. It includes a console, syntax-highlighting
editor that supports direct code execution, as well as tools for plotting, history, debugging and
workspace management.
RStudio is available in open source and commercial editions and runs on the desktop
(Windows, Mac, and Linux) or in a browser connected to RStudio Server or RStudio Server Pro
(Debian/Ubuntu, RedHat/CentOS, and SUSE Linux).
3. What is meant by Files – Pane?
The Files pane functions like a file explorer similar to Windows Explorer on a Windows
operating system or Finder on a Mac. This tab provides the following functionality:
1. Delete files and folders
2. Create new folders
3. Rename folders
4. Folder navigation
5. Copy or move files
6. Set working directory or go to working directory
7. View files
8. Import datasets
4. What is meant by Plots – pane?
The Plots pane is used to view output visualizations produced when typing
code into the Console window or running a script. Plots can be created using a variety of
different packages, but primarily by using the ggplot2 package. Once produced, you can zoom
in, export as an image or PDF, copy to the clipboard, and remove plots. You can also navigate
to previous and next plots.
5. What is meant by package – pane?
The Packages pane displays all currently installed packages along with a brief
description and version number for the package. Packages can also be removed using the x icon
to the right of the version number for the package. Clicking on the package name will display
the help file for the package in the Help tab. Clicking on the checkbox to the left of the package
name loads the library so that it can be used when writing code in the Console window.
6. What is meant by help – pane?
The Help pane displays linked help documentation for any packages that you have
installed.
7. What is meant by viewer pane?
RStudio includes a Viewer pane that can be used to view local web content. This pane
can only be used for local web content in the form of static HTML pages written in the
session’s temporary directory or a locally run web application. The Viewer pane can’t be used
to view online content.
8. What is meant by Environment pane?
The Environment pane contains a listing of variables that you have created for the
current session. Each variable is listed in the tab and can be expanded to view the contents of
the variable.
9. What is meant by history pane?
The History pane displays a list of all commands that have been executed in the
current session. This tab includes a number of useful functions including the ability to save
these commands to a file or load historical commands from an existing file. You can also select
specific commands from the History tab and send them directly to the console or an open script.
You can also remove items from the History pane.
10. What is meant by connection pane?
The Connections tab can be used to access existing or create new connections to ODBC
and Spark data sources.
11. What is meant by source pane?
The Source pane in RStudio is used to create scripts and display datasets. An R script
is simply a text file containing a series of commands that are executed together. Commands
can also be written line by line from the Console pane. When written from the
Console pane, each line of code is executed when you press the Enter (Return) key. However,
scripts are executed as a group.
Multiple scripts can be open at the same time with each script occupying a separate tab.
RStudio provides the ability to execute the entire script, only the current line, or a highlighted
group of lines. This gives you a lot of control over the execution of the code in a script.
12. What is meant by console pane?
The Console pane in RStudio is used to interactively write and run lines of code.
Each time you enter a line of code and press Enter (Return) it will execute that line of code.
Any warning or error messages will be displayed in the Console window as well as
output from print() statements.
13. What is meant by terminal pane?
The RStudio Terminal pane provides access to the system shell from within the
RStudio IDE. It supports xterm emulation, enabling use of full-screen terminal applications
as well as regular command-line operations with line editing and shell history.
14. What is meant by pane layout?
The Pane Layout tab is used to change the locations of console, source editor, and
tab panes, and set which tabs are included in each pane.
15. Write the functions of menu option?
1. Create new files and projects
2. Import datasets
3. Hide, show, and zoom in and out of panes
4. Work with plots (save, zoom, clear)
5. Set the working directory
6. Save and load workspace
7. Start a new session
8. Debugging tools
9. Profiling tools
10. Install packages
11. Access help system
16. How to assign data to variable?
There are two ways that variables can be assigned in R.
1. A less-than sign immediately followed by a dash (<-), placed after the variable name.
This is the operator used to assign data to a variable in R.
E. g. x <- 10
2. The other way of creating and assigning data to a variable is to use the equal sign.
E. g. y = 10
17. What is the use of vector variable?
Vector is a sequence of data elements that have the same data type. Vectors are used primarily
as container style variables used to hold multiple values that can then be manipulated or
extracted as needed. The key though is that all the values must be of the same type. For
example, all the values must be numeric, character, or Boolean. You can’t include any sort of
combination of data types.
To create a vector in R you call the c() function and pass in a list of values of the same
type.
E.g. layers <- c('Parcels', 'Streets', 'Railroads', 'Streams')
17. What is meant by factor?
A Factor is basically a vector but with categories, so it will look familiar to you. Add the
following code block. Note that you can easily use line continuation in R simply by selecting the
Enter (Return) key on your keyboard. It will automatically add the “+” at the beginning of the
line indicating that it is simply a continuation of the last line.
E.g. [Link] <- factor(c("Residential", "Commercial", "Agricultural", "Commercial",
"Commercial", "Residential"), levels=c("Residential", "Commercial"))
table([Link])
[Link]
Residential  Commercial
          2           3
18. What is meant by list?
A list is an ordered collection of elements, in many ways very similar to vectors.
However, there are some important differences between a list and a vector. With lists you can
include any combination of data types. Lists are highly versatile and useful data types. A list in
R acts as a container style object in that it can hold many values that you store temporarily and
pull out as needed.
Lists can be created through the use of the list() function. Each value that you intend
to place inside the list should be separated by a comma.
E.g. [Link] <- list("Streets", 2000, "Parcels", 5000, TRUE, FALSE)
19. What is meant by Matrix data type?
A matrix is created using the matrix() function. The number of columns and rows
can be passed in as arguments to the function to define the attributes and data values of the
matrix. A matrix might be created from the values found in the attribute table of a feature
class. All the values in the matrix must be of the same data type. The c() function is used to
define the data for the object.
E.g. matrx <- matrix(c(2,4,3,1,5,7), nrow=2, ncol=3, byrow=TRUE)
2 4 3
1 5 7
19. What is meant by vectorization in R?
Vectorization is a built-in feature that automatically applies an operation to every element
of a data structure without the need to write looping code. For loops, by contrast, are written
explicitly and are used when you know exactly how many times to repeat a block of code. This
includes the use of data frame objects that have a specific number of rows. For loops are
typically used with vector and data frame structures.
Example (an explicit for loop):
for (fire in 1:nrow(StudyArea))
{ print(StudyArea[fire, "TOTALACRES"])}
20. What is meant by decision support statement?
Decision support statements enable you to write code that branches based upon
specific conditions. The basic if | else statement in R is used for decision support. Basically,
if statements are used to branch code based on a test expression. If the test expression
evaluates to TRUE, then a block of code is executed. If the test evaluates to FALSE then the
processing skips down to the first else if statement or an else statement.
21. What is meant by function?
Functions are a group of statements that execute as a group and are action-oriented
structures in that they accomplish some sort of task. Input variables can be passed into
functions through what are known as parameters. Another name for parameters is
arguments. These parameters become variables inside the function to which they are
passed. R packages include many pre-built functions that you can use to accomplish
specific tasks.
Syntax:
function_name <- function(arg1, arg2) {
statements
return(value)
}
22. Define tidyverse package?
The third-party tidyverse package supports a comprehensive data science
workflow as illustrated in the diagram below. The tidyverse ecosystem includes many
sub-packages designed to address specific components of the workflow. The tidyverse is
about the connections between the tools that make the workflow possible.
23. What is readr?
Readr is used to facilitate the import of file-based data into a structured data
format. The readr package includes seven functions for importing file-based datasets
including csv, tsv, delimited, fixed width, white space separated, and web log files.
24. What is tibble?
Data is imported into a data structure called a tibble. Tibbles are the tidyverse
implementation of a data frame. They are quite similar to data frames, but are basically a
newer, more advanced version. However, there are some important differences between
tibbles and data frames. Tibbles never convert data types of variables.
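The point about data types can be illustrated with a small sketch. It assumes the tibble package is installed; the column names and values are invented.

```r
library(tibble)  # assumes the tibble package (part of tidyverse) is installed

# A small invented table built directly as a tibble
tb <- tibble(name = c("a", "b"), value = c(1.5, 2.5))

# Column types are kept as given: the character column stays character
print(class(tb$name))

# An existing data frame can be converted with as_tibble()
df  <- data.frame(name = c("a", "b"), value = c(1.5, 2.5))
tb2 <- as_tibble(df)
print(is_tibble(tb2))
```

A tibble is still a data frame underneath, so most functions that accept a data frame also accept a tibble.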
25. What is meant by tidyr package?
Data tidying is a consistent way of organizing data in R, and can be facilitated
through the tidyr package. There are three rules that we can follow to make a dataset tidy.
First, each variable must have its own column. Second, each observation must have its
own row, and finally, each value must have its own cell.
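As a sketch of the three rules in practice, the example below reshapes an untidy table with pivot_longer() from tidyr. It assumes tidyr 1.0.0 or later is installed; the table and its column names are invented.

```r
library(tidyr)  # assumes tidyr >= 1.0.0 is installed

# Invented "wide" table: the year columns hold values of a single variable
wide <- data.frame(
  neighborhood = c("North", "South"),
  `2019` = c(10, 20),
  `2020` = c(15, 25),
  check.names = FALSE
)

# Move the year columns into one "year" column and one "incidents" column
tidy <- pivot_longer(
  wide,
  cols      = c(`2019`, `2020`),
  names_to  = "year",
  values_to = "incidents"
)

print(tidy)
```

After reshaping, each variable (neighborhood, year, incidents) has its own column and each observation has its own row.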
26. What is dplyr package?
The dplyr package is a very important part of tidyverse. It includes five key
functions for transforming your data in various ways. These functions include filter(),
arrange(), select(), mutate(), and summarize(). In addition, these functions all work very
closely with the group_by() function.
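The five functions can be sketched on a small invented table. This assumes dplyr is installed; the Incidents column and all values are hypothetical.

```r
library(dplyr)  # assumes the dplyr package is installed

# Invented crime records; the Incidents column and all values are hypothetical
dfCrime <- data.frame(
  Neighborhood = c("QUEEN ANNE", "QUEEN ANNE", "BALLARD", "BALLARD"),
  Beat         = c("Q1", "Q2", "B1", "B1"),
  Incidents    = c(5, 3, 8, 2)
)

busy    <- filter(dfCrime, Incidents > 2)                    # keep rows matching a condition
ordered <- arrange(busy, desc(Incidents))                    # sort rows, largest first
slim    <- select(ordered, Neighborhood, Incidents)          # keep only some columns
shares  <- mutate(slim, Share = Incidents / sum(Incidents))  # add a derived column
overall <- summarize(dfCrime, Total = sum(Incidents))        # collapse rows to one summary row

print(shares)
print(overall)
```

Each function takes a data frame as its first argument and returns a new data frame, which is why the calls can be chained one after another.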
27. What is ggplot2 package?
The ggplot2 package is a data visualization package for R. The building blocks used
in ggplot2 to implement the Grammar of Graphics include data, aesthetic mapping,
geometric objects, statistical transformations, scales, coordinate systems, position
adjustments, and faceting. Using ggplot2 you can create many different kinds of charts and
graphs including bar charts, box plots, violin plots, scatterplots, regression lines, and more.
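A minimal sketch of how three of these building blocks (data, aesthetic mapping, and a geometric object) combine into a plot is given below. It assumes ggplot2 is installed; the data frame is invented.

```r
library(ggplot2)  # assumes the ggplot2 package is installed

# Invented data: one count per category
df <- data.frame(category = c("a", "b", "c"), count = c(3, 7, 5))

p <- ggplot(data = df, mapping = aes(x = category, y = count)) +  # data + aesthetic mapping
  geom_col() +                                                    # geometric object: bars
  labs(title = "Counts by category")                              # plot title

print(p)  # draws the chart on the active graphics device
```

Further layers, such as facets or a different coordinate system, are added to the plot object with the same + operator.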
28. How to filter a data set?
Filtering the dataset enables you to focus on a subset of the rows instead of the
entire dataset. The dplyr package includes a filter() function that supports this capability.
Example:
dfCrime2 = filter(dfCrime, Neighborhood == "QUEEN ANNE")
29. How to group dataset?
The group_by() function, found in the dplyr package, is commonly used to group
data by one or more variables. Once grouped, summary statistics can be generated for
each group, or the grouped data can be visualized in various ways: bar charts,
scatterplots, or other graphs could be produced for the grouped dataset.
Example:
dfCrime2 = group_by(dfCrime2, Beat)
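Grouping is usually followed by a summary step. The sketch below assumes dplyr is installed and uses an invented dfCrime2 table; the Incidents column and the summary column names are hypothetical.

```r
library(dplyr)  # assumes the dplyr package is installed

# Invented records; the Incidents column and all values are hypothetical
dfCrime2 <- data.frame(
  Beat      = c("Q1", "Q2", "Q1", "Q3"),
  Incidents = c(5, 3, 2, 4)
)

# Group the rows by Beat, then compute one summary row per group
byBeat      <- group_by(dfCrime2, Beat)
beatSummary <- summarize(byBeat, Reports = n(), Total = sum(Incidents))

print(beatSummary)
```

The result has one row per Beat, with the number of records and the total of Incidents in each group.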
PART – B
1. Explain the concept of a variable. How do you create a variable and assign values to it?
2. Briefly explain looping statements and decision support statements with examples.
3. Explain the various types of packages and sub-packages used in R.
4. How do you load the tidyverse package? Explain the functions used in it.
5. How do you load and transform data in R?
UNIT – V
PART – A
1. What is Exploratory Data Analysis?
Exploratory Data Analysis (EDA) is a workflow designed to gain a better understanding of
your data. The workflow consists of three steps. The first is to generate questions about your data.
Next, search for answers to these questions by visualizing, transforming, and modeling the data.
Finally, refine your questions and/or generate new questions. In R there are two primary tools that
support the data exploration process: plots and summary statistics.
2. What are the types of data?
Data can generally be divided into two types:
1. Categorical types
2. Continuous types
Categorical variables consist of a small set of values, while continuous variables have a
potentially infinite set of ordered values. Categorical variables are often visualized with bar charts,
and continuous variables with histograms. Both categorical and continuous data can be represented
through various charts created with R.
3. What is meant by variation?
Variation is the tendency of the values of a variable to change from measurement to
measurement, even though the variable being measured is the same.
4. What is meant by covariation?
Covariation is the tendency of the values of two or more variables to vary together in a related
way. The geom_count() function can be used with ggplot() to measure covariation between variables
using different symbol sizes.
5. What is bar chart?
A bar chart is a great way to visualize categorical data. It separates each category into a separate bar
and then the height of each bar is defined by the number of occurrences in that category.
6. What is box plot?
Box plots provide a visual representation of the spread of data for a variable. These plots display the
range of values for a variable along with the median and quartiles.
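The numbers behind a box plot can be computed without drawing anything. The sketch below uses base R's boxplot() with plot = FALSE on an invented vector of values.

```r
# Invented sample with one unusually large value
values <- c(2, 4, 4, 5, 7, 9, 30)

# boxplot(..., plot = FALSE) returns the statistics a box plot would draw
bp <- boxplot(values, plot = FALSE)

print(bp$stats)  # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bp$out)    # points drawn individually as outliers

print(median(values))
print(quantile(values, c(0.25, 0.75)))  # quartiles
```

Calling boxplot(values) without plot = FALSE draws the chart from the same statistics.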
7. List out the advantages of ggplot2 in R?
1. Consistent style for defining the graphics,
2. A high level of abstraction for specifying plots,
3. Flexibility,
4. A built-in theming system for plot appearance,
5. Mature and complete graphics system, and
6. Access to many other ggplot2 users for support.
8. What is meant by violin plot?
Violin plots, which are similar to box plots, also show the probability density at various values.
Thicker areas of the violin plot indicate a higher probability at that value. Typically, violin plots also
include a marker for the median along with the Inter-Quartile Range (IQR). The geom_violin()
function is used to create violin plots in ggplot2.
9. What is meant by density plots?
Density plots, created with geom_density(), compute a density estimate, which is a smoothed version
of a histogram and is used with continuous data. ggplot2 can also compute 2D versions of density,
including contour and polygon-styled density plots.
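To show what a density estimate looks like as data, the sketch below uses base R's density() function rather than geom_density(); it is a comparable kernel density estimate, not the ggplot2 call itself. The sample is randomly generated for illustration.

```r
# Invented continuous sample (random values, fixed seed for repeatability)
set.seed(1)
x <- rnorm(200)

# Kernel density estimate: a smoothed version of the histogram of x
d <- density(x)

# d$x holds the grid of values, d$y the estimated density at each grid point
print(length(d$x))

# The area under the estimated curve should be close to 1
area <- sum(d$y) * (d$x[2] - d$x[1])
print(area)

plot(d, main = "Density estimate of x")  # draws the smoothed curve
```

In ggplot2, the equivalent chart would be built by adding geom_density() to a plot of a continuous variable.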
10. What is R Markdown framework?
R Markdown is an authoring framework for data science that combines code, results, and
commentary. Output formats include PDF, Word, HTML, slideshows, and more. An R Markdown
document essentially serves three purposes: communication, collaboration, and as a modern-day lab
environment. R Markdown uses the rmarkdown package, but you don’t have to explicitly load the
package in RStudio.
11. What is chunk?
Chunks define a single task, sort of like a function. They should be self-contained and tightly defined
pieces of code. There are three ways to insert chunks into an R Markdown file: Cmd/Ctrl-Alt-I, the
Insert button on the editor toolbar, and by manually typing the chunk delimiters.
12. What is meant by Knit?
The Knit functionality built into RStudio can be used to export an R Markdown file to various
formats including HTML, PDF, and Word. Knit can be accessed from the dropdown menu
PART – B
1. Explain in detail about basic plots in R?
2. Explain basic data exploration techniques?
3. Explain the concept of map and how to draw map in R?
4. Explain in detail about R Markdown framework?