
R Data Mining

Implement data mining techniques through practical use cases and real-world datasets

Andrea Cirillo

BIRMINGHAM - MUMBAI
R Data Mining
Copyright © 2017 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or
transmitted in any form or by any means, without the prior written permission of the
publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the
information presented. However, the information contained in this book is sold without
warranty, either express or implied. Neither the author, nor Packt Publishing, and its
dealers and distributors will be held liable for any damages caused or alleged to be caused
directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.

First published: November 2017

Production reference: 1271117

Published by Packt Publishing Ltd.


Livery Place
35 Livery Street
Birmingham
B3 2PB, UK.

ISBN 978-1-78712-446-2

www.packtpub.com
Credits
Author: Andrea Cirillo
Reviewers: Enrico Pegoraro, Doug Ortiz, Radovan Kavicky, Oleg Okun
Commissioning Editor: Amey Varangaonkar
Acquisition Editor: Varsha Shetty
Content Development Editor: Mayur Pawanikar
Technical Editor: Karan Thakkar
Copy Editors: Safis Editing, Vikrant Phadkay
Project Coordinator: Nidhi Joshi
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Graphics: Tania Dutta
Production Coordinator: Aparna Bhagat

About the Author
Andrea Cirillo is currently working as an audit quantitative analyst at Intesa Sanpaolo
Banking Group. He gained financial and external audit experience at Deloitte Touche
Tohmatsu and internal audit experience at FNM, a listed Italian company. His main
responsibilities involve the evaluation of credit risk management models and their
enhancement, mainly within the field of the Basel III capital agreement. He is married to
Francesca and is the father of Tommaso, Gianna, Zaccaria, and Filippo. Andrea has written
and contributed to a few useful R packages such as updateR, ramazon, and paletteR, and
regularly shares insightful advice and tutorials on R programming. His research and work
mainly focus on the use of R in the fields of risk management and fraud detection, largely
by modeling custom algorithms and developing interactive applications.

Andrea has previously authored RStudio for R Statistical Computing Cookbook for Packt
Publishing.

To Cesca, Tommaso, Gianna, Zaccaria and Filippo.


About the Reviewers
Enrico Pegoraro graduated in statistics from the Italian University of Padua more than 20
years ago. He says that "he has experienced in himself the fast-growing computer science and
statistics worlds". He has worked on projects involving databases, software development,
programming languages, data integration, Linux, Windows, and cloud computing. He is
currently working as a freelance statistician and data scientist.

Enrico has gained more than 10 years of experience with R and other statistical software
training and consulting activities, with a special focus on Six Sigma, industrial statistical
analysis, and corporate training courses. He is also a partner of the main company
supporting the MilanoR Italian community. In this company, he works as a freelance
principal data scientist, as well as teacher of statistical models and data mining with R
training courses.

In his first job, Enrico collaborated with Italian medical institutions, contributing to some
regional projects/publications on nosocomial infections. His main expertise is in consulting
and teaching statistical modeling, data mining, data science, medical statistics, predictive
models, SPC, and industrial statistics. Enrico is planning to develop an Italian-language
website dedicated to R (www.r-project.it).

Enrico can be contacted at [email protected].

I would like to thank all the people who support me and my activities, particularly my
partner, Sonja, and her son, Gianluca.
Doug Ortiz is an enterprise cloud, big data, data analytics, and solutions architect who has
been architecting, designing, developing, and integrating enterprise solutions throughout
his career. Organizations that leverage his skillset have been able to rediscover and reuse
their underutilized data via existing and emerging technologies such as Amazon Web
Services, Microsoft Azure, Google Cloud, Microsoft BI Stack, Hadoop, Spark, NoSQL
databases, and SharePoint along with related toolsets and technologies.

He is also the founder of Illustris, LLC and can be reached at [email protected].

Some interesting aspects of his profession are:

Experience in integrating multiple platforms and products
Big data, data science, R, and Python certifications
He helps organizations gain a deeper understanding of the value of their current
investments in data and existing resources, turning them into useful sources of
information
He has improved, salvaged, and architected projects by utilizing unique and
innovative techniques
He regularly reviews books on Amazon Web Services, data science, machine
learning, R, and cloud technologies

His hobbies are yoga and scuba diving.

I would like to thank my wonderful wife, Mila, for all her help and support, as well as
Maria, Nikolay, and our wonderful children.
Radovan Kavicky is the principal data scientist and president at GapData Institute, based in
Bratislava, Slovakia, where he harnesses the power of data and wisdom of economics for
public good. He is a macroeconomist by education, and consultant and analyst by
profession (8+ years of experience in consulting for clients from the public and private
sector), with strong mathematical and analytical skills. He is able to deliver top-level
research and analytical work. From MATLAB, SAS, and Stata, he switched to Python, R and
Tableau.

Radovan is an evangelist of open data and a member of the Slovak Economic Association
(SEA), Open Budget Initiative, Open Government Partnership, and the global Tableau
#DataLeader network (2017). He is the founder of PyData Bratislava, R <- Slovakia, and the
SK/CZ Tableau User Group (skczTUG). He has been a speaker at @TechSummit (Bratislava,
2017) and @PyData (Berlin, 2017).

You can follow him on Twitter at @radovankavicky, @GapDataInst or @PyDataBA. His
full profile and experience are available at https://www.linkedin.com/in/radovankavicky/
and https://github.com/radovankavicky.

GapData Institute: https://www.gapdata.org.

Oleg Okun is a machine learning expert and author/editor of four books, numerous journal
articles, and many conference papers. His career spans more than a quarter of a century. He
was employed in both academia and industry in his mother country, Belarus, and abroad
(Finland, Sweden, and Germany). His work experience includes document image analysis,
fingerprint biometrics, bioinformatics, online/offline marketing analytics, credit scoring
analytics, and text analytics.

He is interested in all aspects of distributed machine learning and the Internet of Things.
Oleg currently lives and works in Hamburg, Germany.

I would like to express my deepest gratitude to my parents for everything that they have
done for me.
www.PacktPub.com
For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and
ePub files available? You can upgrade to the eBook version at www.PacktPub.com and as a
print book customer, you are entitled to a discount on the eBook copy. Get in touch with us
at [email protected] for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a
range of free newsletters and receive exclusive discounts and offers on Packt books and
eBooks.

https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt
books and video courses, as well as industry-leading tools to help you plan your personal
development and advance your career.

Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Customer Feedback
Thanks for purchasing this Packt book. At Packt, quality is at the heart of our editorial
process. To help us improve, please leave us an honest review on this book's Amazon page
at https://www.amazon.com/dp/1787124460.

If you'd like to join our team of regular reviewers, you can e-mail us at
[email protected]. We award our regular reviewers with free eBooks and
videos in exchange for their valuable feedback. Help us be relentless in improving our
products!
Table of Contents
Preface 1
Chapter 1: Why to Choose R for Your Data Mining and Where to Start 6
What is R? 7
A bit of history 8
R's points of strength 8
Open source inside 8
Plugin ready 9
Data visualization friendly 10
Installing R and writing R code 12
Downloading R 12
R installation for Windows and macOS 13
R installation for Linux OS 14
Main components of a base R installation 14
Possible alternatives to write and run R code 15
RStudio (all OSs) 16
The Jupyter Notebook (all OSs) 18
Visual Studio (Windows users only) 18
R foundational notions 19
A preliminary R session 20
Executing R interactively through the R console 21
Creating an R script 22
Executing an R script 23
Vectors 24
Lists 26
Creating lists 26
Subsetting lists 27
Data frames 27
Functions 29
R's weaknesses and how to overcome them 31
Learning R effectively and minimizing the effort 33
The tidyverse 34
Leveraging the R community to learn R 34
Where to find the R community 35
Engaging with the community to learn R 35
Handling large datasets with R 37
Further references 38
Summary 38
Chapter 2: A First Primer on Data Mining Analysing Your Bank Account
Data 39
Acquiring and preparing your banking data 40
Data model 40
Summarizing your data with pivot-like tables 41
A gentle introduction to the pipe operator 43
An even more gentle introduction to the dplyr package 45
Installing the necessary packages and loading your data into R 45
Installing and loading the necessary packages 46
Importing your data into R 46
Defining the monthly and daily sum of expenses 47
Visualizing your data with ggplot2 51
Basic data visualization principles 51
Less but better 52
Not every chart is good for your message 54
Scatter plot 54
Line chart 55
Bar plot 55
Other advanced charts 56
Colors have to be chosen carefully 57
A bit of theory - chromatic circle, hue, and luminosity 57
Visualizing your data with ggplot 60
One more gentle introduction – the grammar of graphics 61
A layered grammar of graphics – ggplot2 61
Visualizing your banking movements with ggplot2 62
Visualizing the number of movements per day of the week 62
Further references 66
Summary 66
Chapter 3: The Data Mining Process - CRISP-DM Methodology 68
The Crisp-DM methodology data mining cycle 69
Business understanding 71
Data understanding 71
Data collection 71
How to perform data collection with R 72
Data import from TXT and CSV files 72
Data import from different types of format already structured as tables 72
Data import from unstructured sources 72
Data description 73
How to perform data description with R 73
Data exploration 73
What to use in R to perform this task 74

The summary() function 74


Box plot 75
Histograms 77
Data preparation 78
Modelling 79
Defining a data modeling strategy 79
How similar problems were solved in the past 80
Emerging techniques 80
Classification of modeling problems 80
How to perform data modeling with R 81
Evaluation 81
Clustering evaluation 81
Classification evaluation 82
Regression evaluation 83
How to judge the adequacy of a model's performance 84
What to use in R to perform this task 85
Deployment 85
Deployment plan development 86
Maintenance plan development 86
Summary 88
Chapter 4: Keeping the House Clean – The Data Mining Architecture 89
A general overview 90
Data sources 91
Types of data sources 92
Unstructured data sources 93
Structured data sources 93
Key issues of data sources 94
Databases and data warehouses 94
The third wheel – the data mart 95
One-level database 96
Two-level database 96
Three-level database 97
Technologies 97
SQL 98
MongoDB 98
Hadoop 99
The data mining engine 99
The interpreter 100
The interface between the engine and the data warehouse 100
The data mining algorithms 101
User interface 102

Clarity 102
Clarity and mystery 103
Clarity and simplicity 103
Efficiency 103
Consistency 104
Syntax highlight 105
Auto-completion 106
How to build a data mining architecture in R 107
Data sources 107
The data warehouse 108
The data mining engine 108
The interface between the engine and the data warehouse 109
The data mining algorithms 109
The user interface 109
Further references 110
Summary 111
Chapter 5: How to Address a Data Mining Problem – Data Cleaning and
Validation 112
On a quiet day 113
Data cleaning 115
Tidy data 115
Analysing the structure of our data 117
The str function 117
The describe function 118
head, tail, and View functions 119
Evaluating your data tidiness 121
Every row is a record 121
Every column shows an attribute 122
Every table represents an observational unit 123
Tidying our data 124
The tidyr package 124
Long versus wide data 124
The spread function 127
The gather function 128
The separate function 129
Applying tidyr to our dataset 130
Validating our data 132
Fitness for use 132
Conformance to standards 133
Data quality controls 133
Consistency checks 134
Data type checks 134
Logical checks 134
Domain checks 135

Uniqueness checks 135


Performing data validation on our data 135
Data type checks with str() 135
Domain checks 138
The final touch — data merging 143
left_join function 144
moving beyond left_join 146
Further references 146
Summary 147
Chapter 6: Looking into Your Data Eyes – Exploratory Data Analysis 148
Introducing summary EDA 149
Describing the population distribution 149
Quartiles and Median 149
Mean 151
The mean and phenomenon going on within sub populations 152
The mean being biased by outlier values 154
Computing the mean of our population 155
Variance 155
Standard deviation 156
Skewness 157
Measuring the relationship between variables 160
Correlation 161
The Pearson correlation coefficient 162
Distance correlation 167
Weaknesses of summary EDA - the Anscombe quartet 168
Graphical EDA 169
Visualizing a variable distribution 169
Histogram 170
Reporting date histogram 170
Geographical area histogram 171
Cash flow histogram 171
Boxplot 173
Checking for outliers 174
Visualizing relationships between variables 176
Scatterplots 176
Adding title, subtitle, and caption to the plot 178
Setting axis and legend 179
Adding explicative text to the plot 180
Final touches on colors 181
Further references 182
Summary 182
Chapter 7: Our First Guess – a Linear Regression 184
Defining a data modelling strategy 185
Data modelling notions 190

Supervised learning 190


Unsupervised learning 191
The modeling strategy 191
Applying linear regression to our data 192
The intuition behind linear regression 192
The math behind the linear regression 194
Ordinary least squares technique 195
Model requirements – what to look for before applying the model 196
Residuals' uncorrelation 196
Residuals' homoscedasticity 197
How to apply linear regression in R 197
Fitting the linear regression model 198
Validating model assumption 198
Visualizing fitted values 200
Preparing the data for visualization 204
Developing the data visualization 205
Further references 207
Summary 207
Chapter 8: A Gentle Introduction to Model Performance Evaluation 209
Defining model performance 210
Fitting versus interpretability 210
Making predictions with models 212
Measuring performance in regression models 214
Mean squared error 215
R-squared 220
R-squared meaning and interpretation 222
R-squared computation in R 223
Adjusted R-squared 224
R-squared misconceptions 224
The R-squared doesn't measure the goodness of fit 224
A low R-squared doesn't mean your model is not statistically significant 226
Measuring the performance in classification problems 227
The confusion matrix 228
Confusion matrix in R 229
Accuracy 231
How to compute accuracy in R 231
Sensitivity 233
How to compute sensitivity in R 233
Specificity 233
How to compute specificity in R 234
How to choose the right performance statistics 234
A final general warning – training versus test datasets 235
Further references 236

Summary 236
Chapter 9: Don't Give up – Power up Your Regression Including
Multiple Variables 238
Moving from simple to multiple linear regression 239
Notation 239
Assumptions 239
Variables' collinearity 240
Tolerance 241
Variance inflation factors 243
Addressing collinearity 243
Dimensionality reduction 243
Stepwise regression 244
Backward stepwise regression 245
From the full model to the n-1 model 246
Forward stepwise regression 248
Double direction stepwise regression 248
Principal component regression 249
Fitting a multiple linear model with R 251
Model fitting 251
Variable assumptions validation 254
Residual assumptions validation 256
Dimensionality reduction 257
Principal component regression 257
Stepwise regression 260
Linear model cheat sheet 265
Further references 266
Summary 266
Chapter 10: A Different Outlook to Problems with Classification Models 267
What is classification and why do we need it? 268
Linear regression limitations for categorical variables 268
Common classification algorithms and models 270
Logistic regression 272
The intuition behind logistic regression 272
The logistic function estimates a response variable enclosed within an upper and
lower bound 273
The logistic function estimates the probability of an observation pertaining to one of
the two available categories 274
The math behind logistic regression 274
Maximum likelihood estimator 276
Model assumptions 277
Absence of multicollinearity between variables 277
Linear relationship between explanatory variables and log odds 278

Large enough sample size 279


How to apply logistic regression in R 279
Fitting the model 281
Reading the glm() estimation output 281
The level of statistical significance of the association between the explanatory
variable and the response variable 282
The AIC performance metric 284
Validating model assumptions 285
Fitting quadratic and cubic models to test for linearity of log odds 285
Visualizing and interpreting logistic regression results 287
Visualizing results 287
Interpreting results 290
Logistic regression cheat sheet 292
Support vector machines 292
The intuition behind support vector machines 293
The hyperplane 293
Maximal margin classifier 296
Support vector and support vector machines 296
Model assumptions 297
Independent and identically distributed random variables 297
Independent variables 298
Identically distributed 298
Applying support vector machines in R 301
The svm() function 301
Applying the svm function to our data 302
Interpreting support vector machine results 303
Understanding the meaning of hyperplane weights 303
Support Vector Machine cheat sheet 305
References 306
Summary 306
Chapter 11: The Final Clash – Random Forests and Ensemble Learning 308
Random forest 309
Random forest building blocks – decision trees introduction 309
The intuition behind random forests 313
How to apply random forests in R 314
Evaluating the results of the model 315
Performance of the model 316
OOB estimate error rate 316
Confusion matrix 318
Importance of predictors 319
Mean decrease in accuracy 319
Gini index 319
Plotting relative importance of predictors 320
Random forest cheat sheet 321

Ensemble learning 322


Basic ensemble learning techniques 322
Applying ensemble learning to our data in R 323
The R caret package 323
Computing a confusion matrix with the caret package 324
Interpreting confusion matrix results 327
Applying a weighted majority vote to our data 329
Applying estimated models on new data 331
predict.glm() for prediction from the logistic model 333
predict.randomForest() for prediction from random forests 333
predict.svm() for prediction from support vector machines 333
A more structured approach to predictive analytics 334
Applying the majority vote ensemble technique on predicted data 335
Further references 337
Summary 337
Chapter 12: Looking for the Culprit – Text Data Mining with R 339
Extracting data from a PDF file in R 340
Getting a list of documents in a folder 341
Reading PDF files into R via pdf_text() 342
Iteratively extracting text from a set of documents with a for loop 343
Sentiment analysis 348
Developing wordclouds from text 351
Looking for context in text – analyzing document n-grams 353
Performing network analysis on textual data 355
Obtaining an edge list from a data frame 359
Visualizing a network with the ggraph package 360
Tuning the appearance of nodes and edges 362
Computing the degree parameter in a network to highlight relevant nodes 363
Further references 365
Summary 365
Chapter 13: Sharing Your Stories with Your Stakeholders through R
Markdown 366
Principles of a good data mining report 366
Clearly state the objectives 367
Clearly state assumptions 367
Make the data treatments clear 368
Show consistent data 368
Provide data lineage 369
Set up an rmarkdown report 369

Develop an R markdown report in RStudio 372


A brief introduction to markdown 372
Inserting a chunk of code 374
How to show readable tables in rmarkdown reports 376
Reproducing R code output within text through inline code 377
Introduction to Shiny and the reactivity framework 378
Employing input and output to deal with changes in Shiny app parameters 379
Adding an interactive data lineage module 383
Adding an input panel to an R markdown report 384
Adding a data table to your report 385
Expanding Shiny beyond the basics 387
Rendering and sharing an R markdown report 387
Rendering an R markdown report 387
Sharing an R Markdown report 389
Render a static markdown report into different file formats 390
Render interactive Shiny apps on dedicated servers 390
Sharing a Shiny app through shinyapps.io 391
Further references 392
Summary 392
Chapter 14: Epilogue 393
Chapter 15: Dealing with Dates, Relative Paths and Functions 397
Dealing with dates in R 397
Working directories and relative paths in R 397
Conditional statements 399
Index 400

Preface
You have probably heard that R is a fabulous tool that is gaining in popularity every day
among data analysts and data scientists, and that it is renowned for its ability to deliver
highly flexible and professional results, paired with astonishing data visualizations. All this
sounds great, but how can you learn to use R as a data mining tool? This book will guide
you from the very beginning of this journey; you will not need to bring anything with you
except your curiosity, since we will discover everything we need along the way.

The book will help you develop these powerful skills through immersion in a crime case
that requires the use of data mining skills to solve, where you will be asked to help resolve a
real fraud case affecting a commercial company using both basic and advanced data mining
techniques.

At the end of our trip into the R world, you will be able to identify data mining problems,
analyze them, and correctly address them with the main data mining techniques (and some
advanced ones), producing astonishing final reports to convey messages and narrate the
stories you found within your data.

What this book covers


Chapter 1, Why to Choose R for Your Data Mining and Where to Start, gives you some relevant
facts about R's history, its main strengths and weaknesses, and how to install the language
on your computer and write basic code.

Chapter 2, A First Primer on Data Mining - Analyzing Your Bank Account Data, applies R to
our data.

Chapter 3, The Data Mining Process - the CRISP-DM Methodology, teaches you to organize
and conduct a data mining project through the CRISP-DM methodology.

Chapter 4, Keeping the House Clean – The Data Mining Architecture, defines the static part of
our data mining projects, the data mining architecture.

Chapter 5, How to Address a Data Mining Problem – Data Cleaning and Validation, covers data
quality and data validation, where you will find out which metrics define the level of
quality of our data and discover a set of checks that can be employed to assess this quality.

Chapter 6, Looking into Your Data Eyes – Exploratory Data Analysis, teaches you about the
concept of exploratory data analysis and how it can be included within the data analysis
process.

Chapter 7, Our First Guess – A Linear Regression, lets us estimate a simple linear regression
model and check whether its assumptions have been satisfied.

Chapter 8, A Gentle Introduction to Model Performance Evaluation, covers the tools used to
define and measure the performance of data mining models.

Chapter 9, Don't Give Up – Power Up Your Regression Including Multiple Variables, shows how to
predict our response variable when more than one explanatory variable is involved.

Chapter 10, A Different Outlook to Problems with Classification Models, looks into classification
models, why we need them, and how they are used.

Chapter 11, The Final Clash – Random Forests and Ensemble Learning, shows how to apply
ensemble learning to estimated classification models.

Chapter 12, Looking for the Culprit – Text Data Mining with R, shows how to prepare the data
frame for text mining activities, removing irrelevant words and transforming it from a list
of sentences to a list of words. You will also learn to perform sentiment analysis, wordcloud
development, and n-gram analysis on it.

Chapter 13, Sharing Your Stories with Your Stakeholders through R Markdown, employs R
Markdown and Shiny, two powerful instruments made available within the RStudio
ecosystem.

Chapter 14, Epilogue, closes the background story that is used throughout the book to present
the topics in an engaging manner.

Appendix, Dealing with Dates, Relative Paths, and Functions, includes additional information
to get things running in R.

What you need for this book


You will easily be able to sail through the chapters by employing R and UNIX or Windows.
The version used is R 3.4.0.
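
As a quick check before starting, the snippet below compares the running R version against the 3.4.0 mentioned above. It is only a sketch: treating 3.4.0 as a minimum is an assumption made here, not a requirement stated by the book, and the variable names are illustrative.

```r
# Version the book reports using; treating it as a minimum is an assumption.
required_version <- "3.4.0"

# getRversion() returns the running R version as a comparable version object.
current_version <- getRversion()
meets_requirement <- current_version >= required_version

if (meets_requirement) {
  message("Running R ", as.character(current_version), ": OK")
} else {
  message("Running R ", as.character(current_version),
          ": older than ", required_version, ", consider upgrading")
}
```

Later releases of R will generally run the same code, although package behavior may have changed since the book was written.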


Who this book is for


If you are a budding data scientist or a data analyst with basic knowledge of R, and you
want to get into the intricacies of data mining in a practical manner, this is the book for you.
No previous experience of data mining is required.

Conventions
In this book, you will find a number of text styles that distinguish between different kinds
of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, path
names, dummy URLs, user input, and Twitter handles are shown as follows: "Finally,
ggplot2 gives you the ability to highly customize your plot, adding every kind of graphical
or textual annotation to it."

A block of code is set as follows:


install.packages("ggplot2")
library(ggplot2)

New terms and important words are shown in bold.

Words that you see on the screen, for example, in menus or dialog boxes, appear in the text
like this: "In order to download new modules, we will go to Files | Settings | Project Name
| Project Interpreter."

Warnings or important notes appear like this.

Tips and tricks appear like this.


Reader feedback
Feedback from our readers is always welcome. Let us know what you think about this
book: what you liked or disliked. Reader feedback is important for us as it helps us develop
titles that you will really get the most out of. To send us general feedback, simply email
[email protected], and mention the book's title in the subject of your message. If
there is a topic that you have expertise in and you are interested in either writing or
contributing to a book, see our author guide at www.packtpub.com/authors.

Customer support
Now that you are the proud owner of a Packt book, we have a number of things to help you
to get the most from your purchase.

Downloading the example code


You can download the example code files for this book from your account at
http://www.packtpub.com. If you purchased this book elsewhere, you can visit
http://www.packtpub.com/support and register to have the files emailed directly to you.
You can download the code files by following these steps:

1. Log in to or register on our website using your email address and password.
2. Hover the mouse pointer on the SUPPORT tab at the top.
3. Click on Code Downloads & Errata.
4. Enter the name of the book in the Search box.
5. Select the book for which you're looking to download the code files.
6. Choose from the drop-down menu where you purchased this book.
7. Click on Code Download.

Once the file is downloaded, please make sure that you unzip or extract the folder using the
latest version of:

WinRAR / 7-Zip for Windows


Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux


The code bundle for the book is also hosted on GitHub at
https://github.com/PacktPublishing/R-Data-Mining. We also have other code bundles
from our rich catalog of books and videos available at
https://github.com/PacktPublishing/. Check them out!

Downloading the color images of this book


We also provide you with a PDF file that has color images of the screenshots/diagrams used
in this book. The color images will help you better understand the changes in the output.
You can download this file from
https://www.packtpub.com/sites/default/files/downloads/RDataMining_ColorImages.pdf.

Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do
happen. If you find a mistake in one of our books, whether in the text or the code,
we would be grateful if you could report this to us. By doing so, you can save other readers
from frustration and help us improve subsequent versions of this book. If you find any
errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting
your book, clicking on the Errata Submission Form link, and entering the details of your
errata. Once your errata are verified, your submission will be accepted and the errata will
be uploaded to our website or added to any list of existing errata under the Errata section of
that title. To view the previously submitted errata, go to
https://www.packtpub.com/books/content/support and enter the name of the book in the
search field. The required information will appear under the Errata section.

Piracy
Piracy of copyrighted material on the internet is an ongoing problem across all media. At
Packt, we take the protection of our copyright and licenses very seriously. If you come
across any illegal copies of our works in any form on the internet, please provide us with
the location address or website name immediately so that we can pursue a remedy. Please
contact us at [email protected] with a link to the suspected pirated material. We
appreciate your help in protecting our authors and our ability to bring you valuable
content.

Questions
If you have a problem with any aspect of this book, you can contact us at
[email protected], and we will do our best to address the problem.

1
Why to Choose R for Your Data Mining and Where to Start
Since this is our first step on the journey to R knowledge, we have to be sure to acquire all
the tools and notions we will use on our trip. You are probably already an R enthusiast and
would like to discover more about it, but maybe you are not so sure why you should invest
time in learning it. Perhaps you lack confidence in defining its points of strength and
weakness, and therefore you are not sure it is the right language to bet on. Most importantly,
you may not know where and how to practically begin your journey to R mastery. The good
news is that you will not have to wait long to resolve all of these doubts, since this first
chapter is all about them.

In particular, within this chapter we will:

Look at the history of R to understand where everything came from


Analyze R's points of strength, understanding why it is a savvy idea to learn this
programming language
Learn how to install the R language on your computer and how to write and run
R code
Gain an understanding of the R language and the foundation notions needed to
start writing R scripts
Understand R's points of weakness and how to work around them

By the end of the chapter, we will have all the weapons needed to face our first real data
mining problem.

What is R?
Let's start from the very beginning: what exactly is R? You will have read a lot about it on
data analysis and data science blogs and websites, but perhaps you are still not able to fix
the concept in your mind. R is a high-level programming language. This means that by
running the kind of R scripts you are going to learn to write in this book, you will be able
to instruct your computer to execute the desired computations and operations, producing
the output you asked for.

A programming language is a set of predefined instructions that the computer is able to
understand and react to, and R is one such language. You may have noticed that I referred to R as
a high-level programming language. What does high-level mean? One way to understand it
is by comparing it to typical industrial company structures. Within such companies, there is
usually a CEO, senior managers, heads of departments, and so on, level by level until we
reach the final group of workers.

What is the difference between those levels of a company hierarchy? The CEO makes the main
strategic decisions, developing a strategic plan without taking care of tactical and
operational details. From there, the lower you go in the hierarchy described, the more
tactical and operational decisions become, until you reach the base worker, whose main
duty is to execute basic operations, such as screwing and hammering.

It is the same for programming languages:

High-level programming languages are like the CEO; they abstract from
operational details, stating high-level sentences which will then be translated into
lower-level instructions the computer is able to understand
Low-level programming languages are like the heads of departments and
workers; they take sentences from higher-level languages and translate them into
chunks of instructions needed to make the computer actually produce the output
the CEO is looking for

To be precise, we should specify that it is also possible to directly write code using low-level
programming languages. Nevertheless, since they tend to be more complex and wordy,
their popularity has declined over time.
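
To make the idea of abstraction a little more concrete, here is a minimal sketch in R. The
vector heights and its values are invented purely for illustration. Both halves compute the
same average: the first spells out each step by hand, while the second states the goal and
leaves the details to the language. Even the loop is still very far from what the processor
actually executes, so treat this as an analogy rather than a real low-level program.

```r
# Hypothetical sample data, made up for this illustration
heights <- c(172, 165, 180, 158)

# Spelled-out version: accumulate the sum one element at a time
total <- 0
for (h in heights) {
  total <- total + h
}
manual_mean <- total / length(heights)

# High-level version: state what we want and let R handle the details
builtin_mean <- mean(heights)

# Both approaches give the same result
print(manual_mean)
print(builtin_mean)
```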

Now that we have a clear idea of what R is, let's move on and acquire a bit of knowledge
about where R came from and when.


A bit of history
When Ross Ihaka and Robert Gentleman published R: A Language for Data Analysis and
Graphics in 1996, they probably didn't imagine the success the language would achieve
between then and now. R was born at the University of Auckland in the early 1990s. In the
beginning, it was supposed to be a user-friendly data analysis tool, employed by students to
perform their research activities. Nevertheless, the points of strength we will look at in the
following paragraphs quickly made it very popular among the wider community of
researchers and data analysts, finally reaching the business realm in recent years and being
used within major financial institutions.

R language development is currently led by the R core team, which releases updates of the
base R language on a regular basis. You can discover more about the bureaucratic side of R
by visiting the official R website at https://www.r-project.org/about.html.

R's points of strength


You know that R is really popular, but why? R is not the only data analysis language out
there, and neither is it the oldest one; so why is it so popular?

Looking at the root causes of R's popularity, we definitely have to mention these three:

Open source inside


Plugin ready
Data visualization friendly

Open source inside


One of the main reasons the adoption of R is spreading is its open source nature. R source
code is available for everyone to download, modify, and share back again (only in an open
source way). Technically, R is released under the GNU General Public License (GPL), meaning
that you can take it and use it for whatever purpose; but any derivative work you distribute
has to be released under the GNU General Public License as well.


These attributes fit well for almost every target user of a statistical analysis language:

Academic user: Knowledge sharing is a must for an academic environment, and
having the ability to share work without the worry of copyright and license
questions makes R very practical for academic research purposes
Business user: Companies are always worried about budget constraints; having
professional statistical analysis software at their disposal for free sounds like a
dream come true
Private user: This user merges together both of the benefits already mentioned,
because they will find it great to have a free instrument with which to learn and
share their own statistical analyses

Plugin ready
You could imagine the R language as an expandable board game. You know, games like 7
Wonders or Carcassonne, with a base set of characters and places and further optional places
and characters, increasing the choices at your disposal and maximizing the fun. The R
language can be compared to this kind of game.

There is a base version of R, containing a group of default packages that are delivered along
with the standard version of the software (you can skip to the Installing R and writing R code
section for more on how to obtain and install it). The functionalities available through the
base version are mainly related to filesystem manipulation, statistical analysis, and data
visualization.
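
As a rough illustration of those three areas, the following sketch uses only functions that
ship with a standard R installation. The data frame sample_data and its values are made up
for the example, and the temporary file location is chosen by R; nothing here comes from the
book's own code bundle.

```r
# List the packages that make up the base distribution
rownames(installed.packages(priority = "base"))

# Filesystem manipulation: write a small file to a temporary location
tmp_file <- tempfile(fileext = ".csv")
sample_data <- data.frame(x = 1:10, y = (1:10)^2)  # made-up values
write.csv(sample_data, tmp_file, row.names = FALSE)
file.exists(tmp_file)

# Statistical analysis: read the data back and summarise it
loaded <- read.csv(tmp_file)
mean_y <- mean(loaded$y)
print(mean_y)
print(cor(loaded$x, loaded$y))

# Data visualization: a basic scatter plot from the graphics package
plot(loaded$x, loaded$y, main = "y against x")

# Clean up the temporary file
unlink(tmp_file)
```

When run as a script rather than interactively, the plot is typically written to a PDF file
in the working directory instead of being shown on screen.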

While this base version is regularly maintained and updated by the R core team, virtually
every R user can add further new functionalities to those available within the base version,
developing and sharing custom packages.

This is basically how the package development and sharing flow works:

1. The R user develops a new package, for example a package introducing a new
machine learning algorithm described in a freshly published academic paper.
2. The user submits the package to the CRAN repository or a similar repository.
The Comprehensive R Archive Network (CRAN) is the official repository for
R-related documents and packages.
