Data Analytics
Course Writer
Editor
Acknowledgement
Every attempt has been made to trace the copyright holders of the materials reproduced in this
book. Should any infringement have occurred, SSPU apologises for the same and will be pleased
to make necessary corrections in future editions of this book.
Unit 1
Overview of Business Data Analytics and Decision Making
Topics:
Objectives:
1. To understand the fundamental concepts of data analytics and business
analytics, their importance, and their applications in organisations
2. To understand the types of business data analytics
3. To understand the scope of business data analytics in decision making
4. To introduce the concepts, techniques, and applications of data warehousing
5. To understand the terms Data science and how it is used in decision making
Outcomes:
1. Understand the role of data analytics in business and the importance of
business data analytics, its applications in organizations.
2. Understand types of business data analytics
3. Understand how business analytics helps organisation in decision making.
4. Understand Data warehouse concepts, architecture, and models.
5. Understand the concept of Data Science and data driven decision making
Introduction to the Unit
This unit aims to give a complete overview of Data Analytics and Business Data Analytics.
It covers the types of Business Analytics and their applications in organizations. It
also deals with the architecture and models of a data warehouse. Finally, it explains the terms
data science and data-driven decision making.
What is Data?
• Data is nothing but raw facts and figures. Numbers, characters, symbols, special
symbols, etc. form data. Generally, any computer user enters data in the form of
numbers, text, clicks, or in any other format. Computers perform operations on data,
which are stored and transmitted in the form of electrical signals and recorded on
magnetic, optical, or mechanical recording media like hard disks, pen drives, CDs,
etc. But data alone does not help any organization. Data has to be processed so that
we are able to generate meaningful information from it.
Nowadays, organizations generate huge amounts of data, which are processed and converted into
information. Even this huge amount of information is not very useful for top-level
management to take decisions. The information should be such that managers are able to take
decisions without spending much time, and for that, analytics can be useful.
What is Analytics?
• Analytics is the systematic computational analysis of data or statistics, or the information
resulting from the systematic analysis of data or statistics.
• Analytics focuses on the implications of data and on the decisions, implementations, or
actions that should be taken as an outcome.
• Analytics is a field of computer science that uses mathematics, statistics, and machine
learning to find significant patterns in data. Data Analytics involves going through huge
data sets to discover, interpret, and share new insights and knowledge.
What is Data Analytics?
• In today's digital world, data gets generated in huge amounts, e.g., sensor data, CCTV
data, weather data, IoT-generated data, etc. But if it is in an unstructured or semi-structured
format, then it may not be of any use. To make it useful, it must be converted into an
appropriate format, and we should be able to extract the required and meaningful
information from it. This process can be called data analysis. The purpose of data
analytics is to extract useful information from data and take decisions based upon
the data analysis.
• The process of reviewing, cleansing, and manipulating data with the objective of
identifying usable information, informing conclusions, and assisting decision-making
is known as data analysis. Data analysis is important in today's business environment
since it helps businesses make more scientific decisions and run more efficiently.
• Data analytics is the analysis of data, whether huge or small, to understand it and see
how to use the knowledge hidden within it.
• Data analytics is defined as “a process of cleaning, transforming, and modeling data to
discover useful information for business decision making”.
• Data analytics converts raw data into actionable insights. It uses statistical methods,
algorithms, analytics tools, technologies, and processes to find trends and solve
problems using data.
• Data analytics can figure out and shape business processes, improve decision-making,
and nurture business growth.
• E.g., you enter a grocery store and find that your regular monthly purchases are
already selected and kept aside for you. You can take all of them, remove some
from the list, or simply add some items to it. Isn't it a pleasant surprise?
This is done with the help of data analytics used by the owner of the grocery store: he must
have had all the past details of your purchases; this history is analyzed using
analytics technology and algorithms, your purchase patterns are identified,
and you get the list.
• E.g., you visit an online store and search for T-shirts of a specific brand. You
may forget, but your search data is used to give you more and more suggestions in the
browser whenever you visit any other site. When you play an online game, there too
you find your searched products; this is also a part of data analytics. The use of search
data to explore particular interactions among web searchers, the search engine (e.g.,
Google, Bing, etc.), or the content during searching is called Search Analytics. This is
further used in Search Engine Marketing (SEM).
• Collecting historical data and records.
• Analyzing the data collected to find out appropriate patterns and trends.
• Using these trends to design improved strategies and efficient business decisions.
• Business analytics is the application of data analytics to business.
• It is a set of disciplines and technologies to solve business problems using data analysis,
statistical models, and other quantitative methods.
• Gartner says, “Leading organizations in every industry are wielding data and analytics
as competitive weapons”.
Some examples of Business Analytics:
• E.g., offering specific discounts to different classes of travelers based on the amount of
business they offer or have the potential to offer.
• Reduce risks: One main advantage of business analytics is its ability to mitigate
risks. It helps in tracking the mistakes made by the organization in the past and
understanding the factors that led to their occurrence.
• With this knowledge, analysis is done to predict the probability of the recurrence
of similar risks in the future, and therefore, the corresponding measures
can be taken to prevent the same.
• Enhance customer experience: All successful businesses have figured out the secret
to success – making their customers happy! Organizations today identify their
customer base, understand their needs and behaviours, and correspondingly cater
to them.
• This is possible because of the statistical tools and models used in business
analytics.
How does Business Analytics Work?
• Business analytics is the analysis of an organization’s raw data and the conversion
of that analysis into information that is relevant to the organization’s vision and
objectives.
• The data is processed using various tools and procedures, and various patterns and
correlations are mapped out before predictions are made based on the data acquired.
Organizations create plans to boost their sales and profits based on these anticipated
outcomes.
a. Descriptive analytics explains the patterns hidden in data. These patterns could be
the number of market segments, or sales numbers based on regions, or groups of
products based on reviews, software bug patterns in a defect database, behavioral
patterns in an online gaming user database, and more. These patterns are purely
based on historical data.
By developing key performance indicators (KPIs), descriptive analytics
strategies can help track successes or failures. Metrics such as return on investment
(ROI) are used in many industries. Specialized metrics are developed to track
performance in specific industries.
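To make this concrete, here is a minimal, hedged R sketch of descriptive analytics; the sales data
frame, its columns, and its values are hypothetical and used only for demonstration.
# Hypothetical historical sales data (illustrative values only)
sales <- data.frame(
  region = c("North", "South", "North", "West", "South", "West"),
  amount = c(1200, 950, 1100, 700, 1020, 680)
)
# Descriptive analytics: summarize what has already happened
aggregate(amount ~ region, data = sales, FUN = sum)   # total sales per region
summary(sales$amount)                                 # overall distribution of sales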
b. In the Diagnostic Analytics strategy, the performance indicators are further
investigated to discover why they got better or worse. This generally occurs in three
steps:
– Identify anomalies in the data. These may be unexpected changes in a metric or a
particular market.
– Data that is related to these anomalies is collected.
– Statistical techniques are used to find relationships and trends that explain these
anomalies.
e.g., Did the weather affect ice cream sales? Did that latest marketing campaign
impact sales?
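As an illustration of the third step, the following hedged R sketch checks whether a weather
variable is related to ice cream sales; the numbers are invented for demonstration.
# Hypothetical daily observations (illustrative values only)
temperature <- c(21, 24, 27, 30, 33, 35, 29, 26)    # degrees Celsius
icecream_sales <- c(180, 210, 260, 320, 400, 430, 300, 250)
cor(temperature, icecream_sales)             # strength of the relationship
summary(lm(icecream_sales ~ temperature))    # trend that may explain the change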
c. Predictive analytics moves to what is likely going to happen in the near term. It is
the use of statistics to forecast future outcomes. These techniques use historical
data to identify trends and determine if they are likely to recur. Predictive analytical
tools provide valuable insight into what may happen in the future, and their
techniques include a variety of statistical and machine learning methods, such as
neural networks, decision trees, and regression.
e.g., What happened to sales the last time we had a hot summer? How many weather
models predict a hot summer this year?
d. Prescriptive analytics is the application of testing and other techniques to determine
which outcome will yield the best result in each scenario.
• Forecasting and Prediction: Businesses may benefit from the use of business
analytics to estimate future results and make predictions based on previous data.
This might entail forecasting future sales, looking for expansion prospects, and
identifying market trends.
information may be used to increase consumer engagement and loyalty, optimize
marketing initiatives, and uncover new sources of income.
Data Warehouse:
Data Warehousing
• A data warehouse refers to a data repository that is maintained separately from an
organization’s operational databases.
• Data warehousing provides architectures and tools for business executives to
systematically organize, understand, and use their data to make strategic decisions.
• It is designed for query and analysis rather than for transaction processing, and usually
contains historical data derived from transaction data, but can include data from other
sources.
• A data warehouse is also a collection of information as well as a supporting system.
• Data warehouses have the distinguishing characteristic that they are mainly intended for
decision support applications.
• Typically, data warehousing architecture is based on a Relational Database Management
System, i.e., an RDBMS.
• Data Warehousing (DW) is a process for collecting and managing data from various
sources to provide meaningful business insights. It is typically used to connect and analyze
business data from heterogeneous sources. The data warehouse is the core of the Business
Intelligence system which is built for data analysis and reporting.
• It is a combination of technologies and components which facilitates the strategic use of
data. It is electronic storage of a large amount of information by a business which is
designed for query and analysis instead of transaction processing. It is a process of
transforming data into information and making it available to users in a timely manner to
make a difference.
• Data warehousing is revolutionising how people conduct business analysis and make
strategic decisions across all industries, from government agencies to manufacturing
companies, retail chains to financial institutions, and utility companies to airlines.
• Data Warehousing helps in:
o analyzing the data to gain a better understanding of the business and to improve the
business.
o providing a comprehensive and integrated perspective of the business.
o making historical and current information about the company conveniently accessible for
decision-making.
o allowing for decision-supporting transactions without interfering with operational
systems.
o making the information of the organisation consistent.
o offering a versatile and interactive source of strategic information.
Definition of Data Warehouse:
Defined in many ways, but not rigorously.
• A decision support database that is maintained separately from the organization’s
operational database.
• A database that supports information processing by providing a solid platform of consolidated,
historical data for analysis.
Data warehousing:
It is the process of constructing and using data warehouses.
There is no universal definition of a data warehouse. W. H. Inmon defined a data warehouse as
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of
data in support of management’s decision-making process.”
Let’s see all the terms used in this definition.
• Subject-Oriented: Data warehouses are designed to help you analyze data. For example,
to learn more about your company's sales data, you can build a data warehouse that
concentrates on sales. Using this data warehouse, you can answer questions such as "Who
was our best customer for this item last year?" or "Who is likely to be our best customer
next year?" This ability to define a data warehouse by subject matter, sales in this case,
makes the data warehouse subject oriented.
• It is organized around major subjects, such as customer, product, sales, bank-account,
purchase, etc.
• It is focused on the modeling and analysis of data for decision makers, not on daily
operations or transaction processing.
• It provides a simple and concise view around particular subject issues by excluding
data that are not useful in the decision support process.
• Integrated: Integration is closely related to subject orientation. Data warehouses must put
data from different sources into a consistent format. They must resolve such problems as
naming conflicts and inconsistencies among units of measure. When they achieve this, they
are said to be integrated.
• A data warehouse is constructed by integrating multiple, heterogeneous data sources,
such as relational databases, flat files, online transaction records, XML, Excel sheets, etc.
• Then data cleaning and data integration techniques are applied.
• It ensures consistency in naming conventions, encoding structures, attribute measures,
etc. among different data sources.
• E.g., Hotel price: currency, tax, breakfast covered, etc.
• When data is moved to the warehouse, it is converted.
• Nonvolatile: Nonvolatile means that, once entered into the data warehouse, data should not
change. This is logical because the purpose of a data warehouse is to enable you to analyze
what has occurred. Operational update of data does not occur in the data warehouse
environment.
• It is a physically separate store of data transformed from the operational
environment.
• It means it does not require transaction processing, recovery, and concurrency control
mechanisms.
• It requires only two operations in data accessing: initial loading of data and access of
data.
• Time Variant: The time horizon for the data warehouse is significantly longer than that
of operational systems.
• Operational database: current value data
• Data warehouse data: provide information from a historical perspective (e.g., past 5-
10 years)
• A data warehouse's focus on change over time is what is meant by the term time
variant. To discover trends and identify hidden patterns and relationships in business,
analysts need large amounts of data.
• Every key structure in the data warehouse contains an element of time, explicitly or
implicitly.
• But the key to operational data may or may not contain “time element.”
• Data warehouses are optimized for data retrieval and not routine transaction processing.
OLTP – Online Transaction Processing - An OLTP system is a common data processing system
in today's enterprises. Classic examples of OLTP systems are order entry, retail sales, and financial
transaction systems. OLTP systems are primarily characterized by a specific data usage pattern that
is different from data warehouse environments. The term refers to a class of systems and processes
designed to handle and manage the real-time transactional activities of an organization. These
transactions typically involve interactions with a database where data is read, updated, inserted, or
deleted based on user actions or requests.
Transactional Integrity: Maintaining data accuracy and consistency is crucial in OLTP systems,
as they often involve financial transactions, order processing, and other critical business activities.
Concurrent Users: OLTP systems need to handle a high number of concurrent users performing
various transactions simultaneously without compromising performance or data integrity.
Normalized Data Structure: The data in OLTP systems is usually organized in a normalized
structure to minimize redundancy and ensure efficient storage.
Small Transactions: Transactions in OLTP systems are typically small-scale operations that
involve updating or retrieving a limited amount of data.
High Availability: OLTP systems require high availability to ensure continuous access to data
and support uninterrupted business operations.
ACID Properties: OLTP systems adhere to the ACID (Atomicity, Consistency, Isolation,
Durability) properties to guarantee reliable transactions and data integrity.
Examples of OLTP transactions are:
• Recording customer orders and updating inventory levels in an e-commerce system.
• Processing bank transactions, such as deposits, withdrawals, and transfers.
• Booking reservations in a hotel or airline reservation system.
• Managing patient information and appointments in a healthcare system.
• In contrast to OLTP, there's another type of database system called "Online Analytical
Processing" (OLAP), which is designed for complex querying and reporting tasks, often
involving large volumes of data. OLAP systems are used for data analysis and business
intelligence purposes rather than real-time transaction processing.
Here are some reasons why organizations often choose to keep their data warehouse separate from
their operational systems:
The DBMS is tuned for OLTP: access methods, indexing, concurrency control, and recovery.
The warehouse is tuned for OLAP: complex OLAP queries, multidimensional views, and consolidation.
Performance Isolation: Operational databases are optimized for quick transactional processing,
while data warehouses are optimized for complex analytical queries. Separating the two ensures
that resource-intensive analytical queries don't impact the performance of operational transactions.
Data Transformation and Aggregation: Data warehouses often involve data transformation,
cleansing, and aggregation processes to create a unified and consistent view of the data. These
processes can be resource-intensive and can impact the performance of operational systems if
performed directly on OLTP databases.
Historical Data Storage: Data warehouses are designed to store historical data over time,
allowing for trend analysis, comparisons, and long-term insights. Operational databases may not
be optimized for retaining large volumes of historical data.
Data Volume and Structure: Data warehouses are designed to handle large volumes of data from
various sources. Operational databases, while handling high-frequency transactions, might not be
suitable for handling the vast amounts of data required for in-depth analysis.
Schema Design: Data warehouses often use a different schema design, such as star or snowflake
schemas, which are optimized for analytical queries. These schemas are different from the
normalized structures used in OLTP systems.
Data Integration: Data warehouses can consolidate data from multiple sources, including various
operational databases, third-party systems, and external data sources. This integration process can
be complex and is better managed in a separate environment.
Data Quality and Consistency: The data stored in operational systems might not always be clean
and ready for analysis. Data warehouses provide an opportunity to cleanse and standardize data
before it's used for reporting and analysis.
Query Performance: Data warehouses use optimized indexing, partitioning, and columnar
storage to enhance query performance for analytical tasks. This can be different from the indexing
strategies used in OLTP databases.
Business Intelligence and Reporting: Keeping a separate data warehouse allows business
analysts and reporting tools to access data without impacting operational systems. It also provides
a central location for business intelligence activities.
Scalability: Data warehouses can be scaled independently based on the analytical workload.
Separating the data warehouse from operational systems allows for tailored scaling strategies for
each environment.
Therefore, separating a data warehouse from operational systems allows organizations to create an
environment specifically optimized for complex data analysis, reporting, and business intelligence
activities. It ensures that analytical tasks can be performed efficiently without affecting the
performance and integrity of operational transactional systems.
Difference between OLTP and Data Warehouse (OLAP- Online Analytical Processing)
Components:
• External Source: In the data source layer, the external source is where data is collected from
various sources like day-to-day transactional data from the operational database system, SAP
(ERP system), flat files, Excel sheets, etc., irrespective of the type of data. Data can be in a
structured, semi-structured, or unstructured format.
• Staging Area: Since the data extracted from the external sources does not follow any
specific format, it has to be validated before going into the data warehouse. For this purpose,
it is recommended to use an ETL tool.
• Data-warehouse: After cleansing of data, it is stored in the data warehouse as central
repository. It stores the metadata, and the actual data gets stored in the data
marts. Note that data warehouse stores the data in its purest form in this top-down
approach.
• Data Mart: The difference between a data warehouse and a data mart is that a data
warehouse is used across the organization, while data marts are used for individual,
customized reporting. Data marts are small in size.
For example, there are multiple departments in a company e.g., the finance department,
which is very different from the marketing department. They all generate data from
different sources, where they need customized reporting. The finance department is
concerned mainly with the statistics while the marketing department is concerned with the
promotions. The marketing department doesn’t require any information on finance.
Data marts are subsets of the data warehouse; they are required for customized reporting. This
subset of data is valuable to specific groups of an organization. There are two approaches
to loading: load the data warehouse first and then load the marts, or vice versa.
• In the reporting scenario, which is the data access layer, the user accesses the data
warehouse and generates the report. These reporting tools are meant to make the front
interface extremely easy for the consumer since people at the decision-making level are
not concerned with technical information. They are primarily concerned with a neat usable
report.
• Data Mining: The practice of analyzing the big data present in the data warehouse is data
mining. It is used to find the hidden patterns that are present in the database or in the data
warehouse with the help of data mining algorithms.
This top-down approach is defined by Inmon as follows: the data warehouse is a central repository
for the complete organization, and data marts are created from it after the complete data warehouse
has been created.
2. Microsoft Azure: Azure is a cloud computing platform that was launched by Microsoft in
2010.
3. Google BigQuery: BigQuery is a serverless data warehouse that allows scalable analysis
over petabytes of data.
4. Snowflake: Snowflake is a cloud computing-based data warehousing built on top of the
Amazon Web Services or Microsoft Azure cloud infrastructure.
5. Micro Focus Vertica: Micro Focus Vertica is developed for use in
data warehouses and other big data workloads where speed, scalability, simplicity, and
openness are crucial to the success of analytics.
8. Amazon S3: Amazon S3 is object storage engineered to store and retrieve any quantity of
data from any place.
9. Teradata: Teradata is one of the most admired relational database management systems. It is
appropriate for building big data warehousing applications.
10. Amazon RDS: Amazon Relational Database Service is a cloud data storage service to
operate and scale a relational database within the AWS Cloud. Its cost-effective and
resizable hardware capability helps us to build an industry-standard relational database and
manages all usual database administration tasks.
12. MariaDB: MariaDB Server is one of the most popular open-source relational
databases.
• Investment and Insurance sector
A data warehouse is primarily used to analyze customer and market trends and other data
patterns in the investment and insurance sector. Forex and stock markets are two major
sub-sectors where data warehouses play a crucial role because a single point difference can
lead to massive losses across the board. DWHs are usually shared in these sectors and focus
on real-time data streaming.
• Retail chains
DWHs are primarily used for distribution and marketing in the retail sector to track items,
examine pricing policies, keep track of promotional deals, and analyze customer buying
trends. Retail chains usually incorporate EDW systems for business intelligence and
forecasting needs.
• Healthcare
DWH is used to forecast outcomes, generate treatment reports, and share data with
insurance providers, research labs, and other medical units in the healthcare sector. EDWs
are the backbone of healthcare systems because the latest, up-to-date treatment information
is crucial for saving lives.
Data-Driven Decision Making: Data-driven decision-making (DDDM) is defined as
using facts, metrics, and data to guide strategic business decisions that align with your
goals, objectives, and initiatives. Organizations must improve their decision-making
processes. To do that, they can follow the steps below:
Step 1: Strategy
Data-driven decision making starts with the all-important strategy. This helps focus your
attention by weeding out all the data that’s not helpful for your business.
First, identify your goals — what can data do for you? Perhaps you’re looking for new
leads, or you want to know which processes are working and which aren’t.
Look at your business objectives, then build a strategy around them — that way you won’t
be dazzled by all the possibilities big data has to offer.
Data is flowing into your organization from all directions, from customer interactions to
the machines used by your workforce. It’s essential to manage the multiple sources of data
and identify which areas will bring the most benefit. Which area is key to achieving your
overarching business strategy? This could be finance or operations, for example.
Now that you’ve identified which areas of your business will benefit the most from
analytics and what issues you want to address, it’s time to target which datasets will answer
all those burning questions.
This involves looking at the data that you already have and finding out which data sources
provide the most valuable information. This will help streamline data. Remember that
when different departments use separate systems, it can lead to inaccurate data reporting.
The best systems can analyze data across different sources.
Targeting data according to your business objectives will help keep the costs of data storage
down, not to mention ensuring that you’re gaining the most useful insights.
Keep an eye on costs, and keep the board happy, by focusing only on the data you really
need.
Identify the key players who will be managing the data. This will usually be heads of
departments. That said, the most useful data will be collected at all levels and will come
from both external and internal sources, so you have a well-rounded view of what’s going
on across the business.
To analyze the data effectively, you may need integrated systems to connect all the
different data sources. The level of skills you need will vary according to what you need to
analyze. The more complex the query, the more specialized skills you’ll need.
On the other hand, simple analytics may require no more than a working knowledge of
Excel, for example. Some analytics platforms offer accessibility so that everyone can
access data, which can help connect the entire workforce and make for a more joined-up
organization.
The more accessible the data, the more potential there is for people to spot insights from
it.
The way you present the insights you’ve gleaned from the data will determine how much
you stand to gain from them.
There are multiple business intelligence tools that can pull together even complex sets of
data and present it in a way that makes your insights more digestible for decision makers.
Of course, it’s not about presenting pretty pictures but about visualizing the insights in a
way that’s relatable, making it easier to see what actions need to be taken and ultimately
how this information can be used in business.
SUMMARY
• Data Analytics is a process of inspecting, cleaning, transforming, and modelling data with
the goal of discovering useful information, suggesting conclusions, and supporting
decision-making.
• Business analytics is the process of using quantitative methods to derive meaning from
data to make informed business decisions.
• Descriptive Analytics, Diagnostic Analytics, Predictive Analytics and Prescriptive
Analytics are the types of Data analytics.
• Data analytics can provide tremendous aid to decision-making, as they allow us to analyse
large amounts of structured and unstructured data to identify trends, forecast future
outcomes and make informed decisions.
• A data warehouse is a database designed to enable business intelligence activities. It exists
to help users understand and enhance their organizations’ performance.
• It helps to maintain historical records and analyze the data to understand and improve the
business.
• W.H. Inmon defines data warehouse as a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision-making process.
References
1. Data Science from Scratch by Steven Cooper
2. Business Analytics using R – A Practical Approach by Dr. Umesh R. Hodeghatta, Umesha
Nayak
3. Data Warehousing Fundamentals by Paulraj Ponniah
4. Data Warehousing, OLAP and Data Mining by S Nagabhushana
5. Building the Data Warehouse by William H Inmon
MCQs
Q1. Data Analytics uses ___ to get insights from data.
A. Statistical figures
B. Numerical aspects
C. Statistical methods
D. None of the mentioned above
Q.2 Data Analysis is a process of ________
A. inspecting data
B. cleaning data
C. transforming data
D. All of the above
Q.3 ____________ focuses more on why something happened.
A. Prescriptive Analytics
B. Diagnostic Analytics
C. Predictive Analytics
D. Cognitive Analytics
B. Diagnostic Analytics
C. Predictive Analytics
D. Prescriptive Analytics
Q5. ____ suggests course of action.
A. Prescriptive Analytics
B. Diagnostic Analytics
C. Predictive Analytics
D. Descriptive Analytics
Q.8 ______ means that, once entered into the data warehouse, data should not change.
A. Nonvolatile
B. Time variant
C. Consistent
D. Redundant
Q.9 In _______ step Analysts find which processes are working and which aren’t.
A. Strategy
B. Identify Key Areas
C. Collect Data
D. Analyze data.
Q.10 A ____________ is a database designed to enable business intelligence activities
A. data mining
B. data warehouse
C. data cube
D. metadata
**********************************************************
Questions:
a. Define Data Analytics.
b. Explore some more examples from day-to-day life where data analytics can be used.
c. Explain 4 types of Data Analytics.
d. Define Data Warehousing and explain in detail with diagram.
e. Explain architecture of Data Warehouse.
f. What is OLTP and differentiate between OLTP and OLAP.
g. What is data mart, explain in brief.
h. What are the applications of data warehouse.
i. Name some tools used to build data warehouse.
Extra Reading
1. Business Analytics Principles, Concepts, and Applications, What, Why, and How, Marc J.
Schniederjans, Dara G. Schniederjans, Christopher M. Starkey
2. Data Analytics made Accessible by Dr. Anil Maheshwari
3. The Data Warehouse Toolkit by Ralph Kimball, Margy Ross
Unit II: Data Mining Concepts & Techniques I
Topics:
• Definitions,
• Data preparation,
• Data modelling,
• Visualization of Decision Tree using R
• Regression using Linear Regression
Objectives:
6. To understand the fundamental concepts of data mining
7. To understand steps of data mining like data preparation and data modelling
8. To fully understand standard data mining methods and techniques such as
Linear Regression, Decision Tree.
9. To understand the concept of visualization, visualization of Decision tree
using R
Outcomes:
6. Understand the fundamental concepts of data mining.
7. Understand steps of data mining like data preparation and data modelling
8. Comprehend standard data mining methods and techniques such as Linear
Regression, Decision Tree.
9. Understand the concept of visualization, visualization of Decision tree using
R programming language
Introduction to the Unit
This unit introduces the concepts of data mining, giving a complete description of the
principles used, the architecture, applications, design and implementation, and data mining
techniques such as Decision Trees, Regression, and Linear Regression. It provides both theoretical
and practical coverage of data mining topics with extensive examples. This unit also covers data
visualization using the R programming language.
Introduction
a. There is a huge amount of data available in the Information Industry. This data is of no use
until it is converted into useful information. It is necessary to analyze this huge amount of
data and extract useful information from it.
b. Definition: Data Mining is defined as extracting information from huge sets of data. It uses
techniques from machine learning, statistics, neural networks, and genetic algorithms. Data
mining looks for hidden, valid, and potentially useful patterns in huge data sets. Data
Mining is all about discovering unsuspected/ previously unknown relationships amongst
the data. It is a multi-disciplinary skill that uses machine learning, statistics, Artificial
Intelligence, and database technology.
c. The insights derived via Data Mining can be used for marketing, fraud detection, and
scientific discovery, etc.
d. Data mining is an essential process in which the intelligent methods are applied to extract
data patterns.
e. It can be referred to as the procedure of mining knowledge from data.
• However, the development of powerful computers and the expansion of the internet in the
1990s led to the emergence of data mining as a separate subject. The first data mining tool,
the Intelligent Miner, was created by a team of IBM researchers in the early 1990s and was
used to examine large datasets from the financial sector.
• Data mining and other related topics, such as machine learning and artificial intelligence,
started to converge in the early 2000s, which sped up the development of more advanced
algorithms and methods.
• Data mining has many applications in a wide range of sectors today and is a mature
discipline that is expanding quickly. Data mining will probably continue to be a crucial
technique for comprehending and making sense of massive datasets in the future as big
data and artificial intelligence continue to advance.
The process of data mining is to extract information from a large volume of datasets. It is
done in four phases:
1. Data acquisition: It is the process of collecting, filtering, and cleaning the data before it is
added to the warehouse.
2. Data cleaning, preparation, and transformation: It is the process in which data is
cleaned, pre-processed, and transformed after adding it to the warehouse.
3. Data analysis, modelling, classification, and forecasting: In this step, data is analyzed,
with the help of various models, and classification is done.
4. Final report: In this step, the final report, insights, and analysis are finalized.
Data mining requires domain knowledge, technical skills, and creativity to extract essential
insights from giant data sets. The steps and techniques used in the process can vary depending on
the data and the specific problem being solved.
1. Data collection: The first step in data mining is collecting relevant data from various
sources. This can include data from databases, spreadsheets, websites, social media,
sensors, and other sources.
2. Data preprocessing: When the data is collected, it must be pre-processed. This involves
cleaning the data, removing duplicates, and dealing with missing data.
3. Data exploration: The next step is to identify patterns, trends, and relationships. This can
be done using various visualization techniques such as scatter plots, histograms, and box
plots.
4. Data transformation: In this step, the data is transformed into a format suitable for
analysis. This can involve normalization, discretization, or other techniques.
5. Data modelling: In this step, various algorithms and methods are used to build a model
that can be used to extract insights from the data. This can include clustering, classification,
regression, or other procedures. Data modeling in data mining involves creating a
mathematical or statistical representation of data to understand patterns, relationships, and
make predictions or classifications. Data modeling is the process of creating a visual
representation of either a whole information system or parts of it to communicate
connections between data points and structures. The objective is to explain the types of
data used and stored within the system, the relationships among these data types, the ways
the data can be grouped and categorized and its formats and attributes. Data models are
built around business needs. Rules and requirements are defined through feedback from
business stakeholders so they can be incorporated into the design of a new system or
adapted in the iteration of an existing one.
6. Model evaluation: Once the model is built, it must be evaluated to determine its efficiency.
This involves testing the model on a subset of the data and comparing the results to known
outcomes.
7. Deployment: Finally, the model is deployed for a real-world problem, and the insights
gained from the data mining process are used to make informed decisions.
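A minimal, hedged R sketch of steps 2–4 above (preprocessing, exploration, and transformation)
is given below; the data frame and its columns are assumed purely for illustration.
# Hypothetical raw customer data (illustrative values only)
raw <- data.frame(
  customer = c("A", "B", "B", "C", "D"),
  age      = c(34, 45, 45, NA, 29),
  spend    = c(250, 410, 410, 180, 95)
)
clean <- raw[!duplicated(raw), ]     # step 2: remove duplicate rows
clean <- na.omit(clean)              # step 2: deal with missing values
hist(clean$spend)                    # step 3: explore the distribution of spend
clean$spend_z <- scale(clean$spend)  # step 4: transform (standardize) the values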
Data Preparation
Data preparation is a crucial step in the data mining process. It involves transforming raw data
into a format that is suitable for analysis and modeling. In this phase, data is made production
ready. The data preparation process consumes about 90% of the time of the project. The data
from different sources should be selected, cleaned, transformed, formatted, anonymized, and
constructed (if required).
data mining. In order to stay competitive, this can help organizations create more successful
marketing and sales strategies.
5. Predictive analytics: Data mining is a technique used in predictive analytics to find
patterns and trends in data that may be used to forecast upcoming events or results.
Businesses may be able to forecast future trends and developments and make better
decisions as a result.
6. Financial Analysis, Finance Planning and Asset Evaluation: To analyze financial data
and identify patterns and trends in the financial markets and investment performance,
financial analysts employ data mining. Investors may be able to manage risk more skillfully
and make better investing decisions as a result.
7. Sports Analysis: To analyze player and team data and find patterns and trends in player
performance and team dynamics, sports analysis uses data mining. This can assist
managers and coaches in developing more sensible judgements and winning game plans.
8. Apart from these, data mining can also be used in the areas of production control, science
exploration, astrology, Internet Web Surf-Aid, Corporate Analysis & Risk Management,
etc.
3. The collected metadata is then given to the data mining engine for proper processing,
which occasionally collaborates with modules for pattern assessment to get the desired
outcome.
4. An appropriate interface is then used to send this result to the front end in a
comprehensible format.
1. Data sources include databases, the World Wide Web (WWW), and data
warehouses. These sources may provide data in the form of spreadsheets, plain text,
or other types of media like pictures or videos. One of the largest sources of data is
the WWW.
2. Database Server: The database server houses the real, processed data. According
to the user's request, it does data retrieval tasks.
3. Data Warehouse: A data warehouse is a central repository of information that
can be analyzed to make more informed decisions. Data regularly flows into a data
warehouse from transactional systems, relational databases, and other sources. A data
warehouse works on various schemas. A schema refers to a logical structure of the
database that stores the data. A data warehouse can use multiple schemas:
i. Star Schema: In this schema, a multi-dimensional model is used to organize
data in the database.
ii. Snowflake Schema: It is an extension of the star schema in which the dimensions
of the multidimensional model are further divided into sub-dimensions.
iii. Fact Constellation Schema: This schema collects multiple fact tables that
have common dimensions.
4. Data Mining Engine: One of the key elements of the data mining architecture is
the data mining engine, which executes various data mining operations including
association, classification, characterisation, clustering, prediction, etc.
5. Modules for Pattern Evaluation: These modules are in charge of spotting
interesting patterns in data, and occasionally they also work with database servers
to fulfil user requests.
6. Graphic User Interface: Because the user cannot completely comprehend the
intricacy of the data mining process, a graphical user interface enables efficient
user-data mining system communication.
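To make the schema idea from component 3 above concrete, here is a small, hypothetical R sketch
that represents a star schema as data frames: one fact table holding measures and two dimension
tables holding descriptive attributes. The tables and columns are invented for illustration only.
# Dimension tables (descriptive attributes)
dim_product <- data.frame(product_id = c(1, 2), product = c("Laptop", "Phone"))
dim_store   <- data.frame(store_id = c(1, 2), city = c("Pune", "Mumbai"))
# Fact table (measures plus foreign keys to the dimension tables)
fact_sales  <- data.frame(product_id = c(1, 2, 1), store_id = c(1, 1, 2),
                          units = c(3, 5, 2), revenue = c(150000, 100000, 50000))
# Joining facts with dimensions reproduces the multidimensional view
merge(merge(fact_sales, dim_product, by = "product_id"), dim_store, by = "store_id")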
Alternative names for data mining
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis,
data archaeology, data dredging, information harvesting, business intelligence, etc.
data. Utilizing migration tools like Microsoft SQL and Oracle Data Service Integrator, data
integration is carried out.
• Data Selection − In this step, data relevant to the analysis task are retrieved from the
database. It consists of selecting the relevant data from a larger dataset based on specific
criteria, such as the relevance of the data to the problem at hand.
• Preprocessing: This step involves cleaning and transforming the selected data to ensure it
is in a suitable format for analysis. This may also include removing duplicates, dealing
with missing values, and transforming data into a standard format.
• Data Transformation − Here, data is transformed or consolidated into forms appropriate
for mining by performing summary or aggregation operations. This step transforms the
pre-processed data into a form that is suitable for analysis. This may include aggregating
data, reducing data dimensionality, or creating new variables from existing ones. Noise is
also removed from the data using methods like clustering and regression techniques; this
process is called Smoothing. Other strategies used in this step include scaling data to
come within a smaller range (Normalisation) and replacing raw values of numeric data
with intervals (Discretization); a small sketch of these two strategies follows this list.
• Data Mining − In this step, intelligent methods and various data mining techniques are
applied to the transformed data to extract data patterns and discover relationships. This
may involve using techniques such as clustering, classification, and regression analysis.
• Interpretation: This step consists of interpreting the results of the data mining process to
identify helpful knowledge and insights. This may include visualizing the data or creating
models to explain the patterns and relationships identified.
• Pattern Evaluation − In this step, data patterns are evaluated. The final step involves
evaluating the usefulness of the knowledge discovered in the previous steps. This may
include testing the knowledge against new data or measuring the impact of the insights on
a particular problem or decision-making process. Identifying interesting patterns that
represent the information based on some measurements is the process of pattern evaluation.
Methods for data summary and visualization help the user comprehend the data.
• Knowledge Presentation − In this step, the mined knowledge is represented to the user
using data visualization and knowledge representation tools. Data is visualized in the
form of reports, tables, graphs, plots, etc.
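As a small illustration of the Normalisation and Discretization strategies mentioned in the Data
Transformation step above, here is a hedged R sketch; the vector of values is invented purely for
demonstration.
x <- c(12, 45, 7, 88, 53, 30)                       # illustrative raw numeric values
x_norm <- (x - min(x)) / (max(x) - min(x))          # normalisation to the range [0, 1]
x_disc <- cut(x, breaks = 3,
              labels = c("low", "medium", "high"))  # discretization into intervals
x_norm
x_disc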
C. Knowledge discovery data
D. Knowledge discovery database
3. What are the chief functions of the data mining process?
A. Prediction and characterization
B. Cluster analysis and evolution analysis
C. Association and correction analysis classification
D. All of the above
2. Healthcare: Healthcare providers use data mining to analyze patient data to discover
patterns and relationships that can help diagnose diseases and build treatment plans.
For example, a healthcare provider might use data mining to find that patients with
specific symptoms are more likely to develop a particular disease, allowing the
provider to create a screening program to catch the disease early.
Advantages of Data Mining
Data mining offers several advantages to help businesses and organizations make better
decisions and gain valuable insights. Here are some of the main advantages of data mining:
• Predictive analysis: Data mining allows businesses to predict future trends and
behaviors based on historical data. This enables organizations to make better
decisions about future strategies, products, and services.
• Improved marketing: Data mining helps businesses identify customer behavior
and preference patterns. This can help organizations create targeted marketing
campaigns and personalized offers that are more likely to resonate with customers.
• Improved customer experience: Data mining can help businesses understand
customer preferences and behaviors, enabling organizations to tailor products and
services to meet their needs. This can result in higher customer satisfaction and
loyalty.
• Competitive advantage: Data mining enables businesses to gain insights into their
competitors' strategies and performance. This can help organizations identify areas
where they can earn a competitive advantage and differentiate themselves in the
marketplace.
• Increased efficiency: Data mining can help businesses streamline processes and
operations by identifying inefficiencies and bottlenecks. This can help
organizations optimize workflows and reduce costs.
• Fraud detection: Data mining can help detect fraudulent activities and patterns in
financial transactions. This can help organizations prevent financial losses and
maintain the integrity of their operations.
• By correctly identifying future trends, it aids in averting potential threats.
• Helps in the process of making crucial decisions.
• Transforms raw data into useful information.
• Presents fresh patterns and surprising tendencies.
• Big data analysis is made easier.
• Helps businesses find, attract, and retain customers.
• Aids in strengthening the company's interaction with its clients.
• Helps businesses save costs by assisting them in optimizing their production in
accordance with the appeal of a certain product.
• Extreme workloads call for high-performance teams and staff training.
• Since the data may include sensitive client information, a lack of security might
potentially greatly increase the risk to the data.
• The incorrect outcome might result from inaccurate data.
• Large databases may be quite challenging to handle.
Decision Trees: Decision Trees are a supervised machine learning algorithm (this topic will be
explained in detail in the next unit) that can be used for Classification (predicting categorical
values) and Regression (predicting continuous values) problems. Decision trees are special in
machine learning due to their simplicity, interpretability, and versatility. Moreover, they serve as
the foundation for more advanced techniques, such as bagging, boosting, and random forests.
A classification problem identifies the set of categories or groups to which an observation belongs.
A Decision Tree uses a tree or graph-like model of decision. Each internal node represents a "test"
on attributes, and each branch represents the outcome of the test. Each leaf node represents a class
label (decision taken after computing all features).
Fig. Decision Tree - B and C are children (leaf nodes) of A (sub root), A is Parent
A decision tree starts with a root node that signifies the whole population or sample, which then
separates into two or more uniform groups via a method called splitting. When sub-nodes undergo
further division, they are identified as decision nodes, while the ones that don't divide are called
terminal nodes or leaves. A segment of a complete tree is referred to as a branch.
It has been shown that decision trees can be used for both regression and classification tasks.
The following decision tree example shows whether a customer at a company is likely to buy
a computer or not. Each internal node represents a test on an attribute. Each leaf node represents a
class.
To visualize a decision tree using the R language, you can use the “rpart.plot” package, which
provides an easy way to plot decision trees created with the “rpart” package. Here's an example
of how to do it:
Install and load the required packages:
install.packages("rpart.plot")
library(rpart)
library(rpart.plot)
# Example
data(iris)
# Create decision tree
tree <- rpart(Species ~ ., data = iris)
# Plot the decision tree
rpart.plot(tree)
Note that the example above uses the built-in iris dataset for demonstration purposes. Make
sure to replace Species with the target variable in your own dataset and adjust the formula
and data accordingly.
The rpart.plot function offers various customization options, such as controlling the colors,
node shapes, and labels.
Decision Trees are very useful algorithms as they are not only used to choose alternatives based
on expected values but are also used for the classification of priorities and making predictions. It
is up to us to determine the accuracy of using such models in the appropriate applications.
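Since decision trees are also used for making predictions, here is a minimal follow-on sketch,
assuming the iris tree fitted above; the new observation is invented for illustration.
# Classify a new, hypothetical flower with the fitted tree
new_flower <- data.frame(Sepal.Length = 5.9, Sepal.Width = 3.0,
                         Petal.Length = 5.1, Petal.Width = 1.8)
predict(tree, new_flower, type = "class")   # predicted species label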
2. Does not require Data normalization.
3. Doesn’t facilitate the need for scaling of data.
4. The pre-processing stage requires less effort compared to other major algorithms, hence in
a way optimizes the given problem.
Disadvantages of Decision Trees
1. Requires higher time to train the model.
2. It has considerable high complexity and takes more time to process the data.
3. When the decrease in user input parameter is very small it leads to the termination of the
tree
4. Calculations can get very complex at times.
Regression
• Regression is defined as a statistical method that helps us to understand, summarize, and
analyze the relationship between two or more variables of interest. The process that is
designed to perform regression analysis helps to understand which factors are important,
which factors can be ignored, and how they are affecting each other. Francis Galton
introduced the term regression.
• Regression refers to a type of supervised machine learning technique that is used to predict
any continuous-valued attribute. Regression helps any business organization to analyze
the target variable and predictor variable relationships. It is a most significant tool to
analyze the data that can be used for financial forecasting and time series modeling.
• For example, let's say we want to predict the price of a house based on its size, number
of bedrooms, and location. In this case, price is the dependent variable, while size,
number of bedrooms, and location are the independent variables. By analyzing the
historical data of houses with similar characteristics, we can build a regression model
that predicts the price of a new house based on its size, number of bedrooms, and
location.
• As another example, let's say the price of a car depends on its horsepower, number of seats,
and its top speed. In this example the car becomes the dependent variable whereas the
horsepower, number of seats and the top speed are all independent variables. If we have
a data record containing previous records of the price of cars with their features, we
can build a regression model to predict the price of a car depending on its horsepower,
number of seats and the top speed.
• There are several types of regression models, including linear regression, logistic
regression, and polynomial regression. Linear regression in data mining is the most
commonly used type, which assumes a linear relationship between the independent and
dependent variables. However, nonlinear relationships may exist between the variables
in some cases, which can be captured using nonlinear regression models.
• In regression, we generally have one dependent variable and one or more independent
variables. Here we try to “regress” the value of the dependent variable “Y” with the
help of the independent variables. That means, we are trying to understand, how the
value of ‘Y’ changes with respect to change in ‘X’.
Regression Analysis
• Regression analysis is used for prediction and forecasting. This has a significant
overlap with the field of machine learning. Regression modelling provides the
prediction mechanism by analysing the relationship between two variables. The main
use of regression analysis is to determine the strength of predictors, forecast an effect,
a trend, etc. The independent variable is used to explain the dependent variable in
Linear Regression Analysis. Regression modelling is a statistical tool for building a
mathematical equation depicting how there is a link between one response variable and
one or many explanatory variables.
• Let’s see an example of regression analysis.
• Imagine you're a sales manager attempting to forecast the sales for the upcoming
month. You are aware that the results can be influenced by dozens, or even hundreds
of variables, such as the weather, a competitor's promotion, or rumours of a new and
improved model. There may even be a notion within your company about what will
affect sales the most: "Believe me, we sell more when there is more rain." "Sales
increase six weeks after the competitor's promotion."
• It is possible to determine statistically which of those factors actually has an effect by
using regression analysis. It responds to the question: What elements are most crucial?
Which may we ignore? What connections do the elements have with one another? The
most crucial question is how confident we are in each of these characteristics.
• Those variables are referred to as "factors" in regression analysis. The fundamental
element you are attempting to comprehend or anticipate is your dependent variable.
Monthly sales are the dependent variable in the example above. Then there
are the independent variables, which are the elements you believe have an effect on
your dependent variable.
Linear Regression
Linear Regression is a predictive model used for finding the linear relationship between a
dependent variable and one or more independent variables.
Here, ‘Y’ is our dependent variable, which is a continuous numerical and we are trying to
understand how ‘Y’ changes with ‘X’.
Examples of Independent & Dependent Variables:
• x is Rainfall and y is Crop Yield
• x is Advertising Expense and y is Sales
• x is sales of goods and y is GDP
If the relationship with the dependent variable involves a single independent variable, it is
known as Simple Linear Regression. In regression, the equation that describes how the
response variable (y) is related to the explanatory variable (x) is known as the Regression
Model.
In simple linear regression, the data are modeled to fit a straight line. For example, a
random variable, Y (called a response variable), can be modeled as a linear function of
another random variable, X (called a predictor variable), with the equation
Y = B0 + B1X,
where the variance of Y is assumed to be constant.
In the context of data mining, X and Y are numeric database attributes.
B0 and B1 are called regression coefficients.
The graph presents the linear relationship between the output(Y) variable and predictor(X)
variables. The blue line is referred to as the best fit straight line. Based on the given data
points, we attempt to plot a line that fits the points the best. A regression line is also known
as the line of average relationship. It is also known as the estimating equation or prediction
equation. The slope of the regression line of Y on X is also referred to as the Regression
coefficient of Y on X.
To calculate the best-fit line, linear regression uses the traditional slope-intercept form:
Y = B0 + B1X
Linear Regression is the process of finding a line that best fits the data points available
on the plot, so that we can use it to predict output values for given inputs.
“Best fitting line”
A Line of best fit is a straight line that represents the best approximation of a scatter plot
of data points. It is used to study the nature of the relationship between those points.
The equation of the best fitting line is:
Y′ = bX + A
where Y′ denotes the predicted value,
b denotes the slope of the line,
X denotes the independent variable, and
A is the Y intercept.
A given collection of data points might show up on a chart as a scatter plot, which may or
may not look like it is organized along any lines. One of the most significant results of
regression analysis is the identification of the line of best fit, which minimizes the deviation
of the data points from that line. Many straight lines may be drawn through the data points
in the graph.
E = Y − Y′
where E denotes the prediction error (residual error),
Y′ denotes the predicted value, and
Y denotes the actual value.
A line that fits the data "best" will be one for which the prediction errors (one for each data
point) are as small as possible.
The above diagram depicts a simple representation with all the above discussed values.
Regression analysis uses the "least squares method" to generate the best fitting line. This
method builds the line that minimizes the sum of the squared vertical distances of the data
points from the line of best fit.
So, the Line of Best Fit is used to express a relationship in a scatter plot of different data
points. It is an output of regression analysis and can be used as a prediction tool.
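As an illustration of the least squares fit, here is a minimal R sketch using a small, hypothetical
data set (the variable names and values are made up for the example):
# Hypothetical data: advertising expense (x) and sales (y)
x <- c(10, 20, 30, 40, 50)
y <- c(25, 41, 59, 78, 102)
model <- lm(y ~ x)                    # fits Y = B0 + B1*X by least squares
coef(model)                           # estimated intercept B0 and slope B1
predict(model, data.frame(x = 60))    # predicted Y for a new value of X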
Prediction Techniques
Linear Regression Techniques
• Linear regression is used to model the relationship between a dependent variable
and one or more independent variables. It is most commonly applied as a supervised
learning technique, although closely related least-squares ideas also appear in
unsupervised settings.
• In supervised learning, linear regression is used for regression tasks where the
goal is to predict a continuous numerical value. For example, predicting the price
of a house based on its size, location, number of bedrooms, etc. In this case, linear
regression can be used to find the best-fitting line or hyperplane that predicts the
house price based on the given features.
• In unsupervised settings, the same least-squares ideas appear in dimensionality
reduction. For example, principal component analysis (PCA) finds the principal
components that capture the most variance in the data, which is equivalent to
minimizing the squared reconstruction error of the points. Clustering, by contrast,
groups similar data points together based on their features and is handled by
dedicated clustering algorithms rather than by linear regression itself.
• There are several techniques used to estimate the parameters of a linear regression
model, including ordinary least squares (OLS), maximum likelihood estimation
(MLE), and gradient descent. These techniques are used to find the best-fitting line
or hyperplane that minimizes the sum of squared errors between the predicted
values and the actual values of the dependent variable.
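For intuition, the ordinary least squares estimates in simple linear regression can also be computed
directly from the textbook formulas B1 = cov(x, y) / var(x) and B0 = mean(y) − B1 · mean(x); the
short R sketch below reuses the hypothetical x and y from the earlier example:
# Closed-form ordinary least squares estimates for simple linear regression
b1 <- cov(x, y) / var(x)         # slope
b0 <- mean(y) - b1 * mean(x)     # intercept
c(intercept = b0, slope = b1)    # matches coef(lm(y ~ x))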
Applications of Regression
Regression is a highly common approach with several uses in commerce and industry.
Predictor and response variables are used in the regression process. The main applications
of regression are given here.
Healthcare - The field of medical research relies heavily on regression. It is used to
evaluate medications, forecast the future prevalence of a specific condition, and for other
research purposes.
Medicine - Forecasting the different combinations of medicines needed to prepare generic
medicines for diseases.
• Data Cleaning, Data Integration, Data Selection, Data Transformation, Data
Mining, Pattern Evaluation, and Knowledge presentation are the steps of KDD
Process.
• Data preparation means transforming raw data into a format that is suitable for
analysis and modeling.
• Data modeling is the process of creating a visual representation of either a whole
information system or parts of it to communicate connections between data points
and structures.
• Linear regression analysis is used to predict the value of a variable based on the
value of another variable.
• Using the formula Y = B0 + B1X, the best-fit line is drawn or plotted to predict the
dependent variable Y, where X is the independent variable and B0 and B1 are constants
(the regression coefficients).
• Regression analysis can be used to evaluate trends and sales estimates, analyze
pricing elasticity, assess risks in insurance companies, perform sports analysis, and so on.
References
1. Data Warehousing, OLAP and Data Mining by S Nagabhushana
2. Data Mining Concepts and Techniques by Jiawei Han, Micheline Kamber, Jian Pei
3. Data Mining and Analysis - Fundamental Concepts and Algorithms by Mohammed J. Zaki
and Wagner Meira Jr.
4. Business Analytics using R – A Practical Approach by Dr. Umesh R. Hodeghatta, Umesha
Nayak
5. Regression and Other Stories by Andrew Gelman, Jennifer Hill, Aki Vehtari
6. An Introduction To Regression Analysis by Anusha Illukkumbura
7. Regression Analysis by Example by Samprit Chatterjee and Ali S. Hadi
Q1. Data Mining is an integral part of ______
A. Data Warehousing
B. KDD
C. RDBMS
D. DBMS
Q4. To visualize a decision tree using the R language, you can use the “_______” package.
A. rpart
B. rpart.plot
C. rplot
D. ggplot
Q5. __________means transforming raw data into a format that is suitable for analysis and
modeling.
A. Data Cleaning
B. Data modelling
C. Data preparation
D. data visualization
Q.6 Which of the following are not correct about Data mining?
A. Extracting useful information from large datasets
B. The process of discovering meaningful correlations, patterns, and trends through large
amounts of data stored in repositories.
C. The practice of examining large pre-existing databases
D. Data mining has applications only in science and research.
Q.10 Out of the following, which one is a proper application of data mining?
A. Fraud Detection
B. Market Management and Analysis
C. Health Care and Medicine
D. All of the above
**********************************************************
Extra Reading:
1. Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman,
Jennifer Hill
2. Applied Regression Analysis (Wiley Series in Probability and Statistics) Third Edition by
Norman R. Draper, Harry Smith
3. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression,
and Survival Analysis (Springer Series in Statistics): Frank E. Harrell
4. Regression Analysis by Example (Wiley Series in Probability and Statistics) eBook:
Samprit Chatterjee, Ali S. Hadi
5. Data Mining for the Masses by Dr. Matthew North
6. An Introduction to Statistical Learning: with Applications in R by Gareth James, Daniela
Witten, Trevor Hastie, Robert Tibshirani
Subjective Questions
1. Define data mining, explain data mining process.
2. Explain data mining architecture with suitable diagram in detail.
3. What is KDD? Explain KDD Process in detail with proper diagram.
4. Write Data mining applications.
5. What is a decision tree, and how can a decision tree be visualized using the R language?
6. What is regression? What are its types? Explain Linear regression in detail.
7. Explain how linear regression is used to predict in data analysis.
Unit III: Data Mining Concepts & Techniques II
Objectives:
10. To understand the fundamental concepts of Association rules in data mining
11. To understand the concept of market basket analysis
12. To enable students to understand and implement association rule algorithms
in data mining using R and Tableau
13. To introduce the concept of machine learning
14. To understand difference between machine learning and data mining
Outcomes:
10. Learn and understand techniques of preprocessing various kinds of data.
11. Understand concept of Association Rule.
12. Apply association Mining Techniques on large Data Sets using R and
Tableau
13. Understand the concept of machine learning.
14. Comprehend the difference between machine learning and data mining.
In the previous unit, we saw the concepts of data mining and the algorithms associated with it.
We will continue with data mining and machine learning concepts, taking up one more algorithm:
association rule mining.
What is Association?
• Association rule mining is a methodology that is used to discover unknown relationships
hidden in large databases.
• Association rule learning is a rule-based machine learning method for discovering
interesting relations between variables in big data.
• An example of an unsupervised learning approach is association rule learning, which
looks for relationships between data items and maps them appropriately for increased
profitability. It looks for any interesting relationships or correlations between the
dataset's variables.
• For example, the rule {onions, potatoes} => {burger} found in the sales data of a
supermarket would indicate that if a customer buys onions and potatoes together, they
are likely to also buy hamburger meat.
• Association rule mining is also called:
• Affinity Analysis
• Market Basket Analysis
o due to its origin in the study of customer purchase transaction databases
o it formulates probabilistic association rules for such transactions
o in short, it answers "what goes with what"
Although the ideas underlying association rules have been around for a while, association rule
mining was created by computer scientists Rakesh Agrawal, Tomasz Imieliński, and Arun
Swami in the 1990s as a method for exploiting point-of-sale (POS) systems to uncover
associations between different commodities. By applying the algorithms to supermarkets, the
researchers were able to identify associations—also known as association rules—between
variously purchased items. They then used this knowledge to forecast the possibility that certain
products would be bought in combination.
Association rule mining provided businesses with a tool to comprehend the buying patterns of
consumers. Because of its retail origins, association rule mining is frequently referred to as
market basket analysis.
Q.2 In one of the frequent item-set examples, it is observed that if tea and milk are bought
then sugar is also purchased by the customers. After generating an association rule among
the given set of items, it is inferred:
A. {Tea} is antecedent and {sugar} is consequent.
B. {Tea} is antecedent and the item set {milk, sugar} is consequent.
C. The item set {Tea, milk} is consequent and {sugar} is antecedent.
D. The item set {Tea, milk} is antecedent and {sugar} is consequent.
Q.3 In one of the frequent item-set examples, it is observed that if milk and bread are
bought, then eggs are also purchased by the customers. After generating an
association rule among the given set of items, it is inferred:
A. {Milk} is antecedent and {eggs} is consequent.
B. {Milk} is antecedent and the item set {bread, eggs} is consequent.
C. The item set {milk, bread} is consequent and {eggs} is antecedent.
D. The item set {milk, bread} is antecedent and {eggs} is consequent.
Market basket analysis involves analyzing large data sets, such as purchase history, to reveal
product groupings, as well as products that are likely to be purchased together.
The primary goal of market basket analysis is to discover patterns or associations between
items that co-occur in transactions or customer purchases. These associations are often
expressed in the form of rules, commonly known as "association rules."
Apriori Algorithm
The most widely used algorithm for market basket analysis is the Apriori algorithm, a popular
technique for mining association rules from transaction data.
The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent
itemsets in a dataset for Boolean association rules. The algorithm is named Apriori because it
uses prior knowledge of frequent itemset properties. It applies an iterative, level-wise search in
which frequent k-itemsets are used to find frequent (k+1)-itemsets.
A frequent itemset can be defined as
– an itemset having a higher support than a user-specified minimum support.
The Apriori algorithm scales reasonably well even for a large number of unique items, since
each level of the search requires only a single pass through the dataset.
To improve the efficiency of level-wise generation of frequent itemsets, an important property
called the Apriori property is used, which reduces the search space: every subset of a frequent
itemset must itself be frequent, so if, say, {tea, milk} is infrequent, any candidate containing it
can be pruned without counting. The algorithm therefore generates candidate itemsets of a given
size and then scans the dataset to check whether their counts meet the minimum support; this is
repeated level by level. Confidence is then commonly used to measure the strength of
association between items in market basket analysis.
Steps:
Here are the steps involved in implementing the Apriori algorithm in data mining -
1. Define minimum support threshold - This is the minimum number of times an item
set must appear in the dataset to be considered as frequent. The support threshold is
usually set by the user based on the size of the dataset and the domain knowledge.
2. Generate a list of frequent 1-item sets - Scan the entire dataset to identify the items
that meet the minimum support threshold. These item sets are known as frequent 1-
item sets.
3. Generate candidate item sets - In this step, the algorithm generates a list of candidate
item sets of length k+1 from the frequent k-item sets identified in the previous step.
4. Count the support of each candidate item set - Scan the dataset again to count the
number of times each candidate item set appears in the dataset.
5. Prune the candidate item sets - Remove the item sets that do not meet the minimum
support threshold.
6. Repeat steps 3-5 until no more frequent item sets can be generated.
7. Generate association rules - Once the frequent item sets have been identified, the
algorithm generates association rules from them. Association rules are rules of the
form A -> B, where A and B are item sets. Such a rule indicates that if a transaction
contains A, it is also likely to contain B.
8. Evaluate the association rules - Finally, the association rules are evaluated based
on metrics such as confidence and lift.
The Apriori algorithm is also thought to be more precise than the AIS and SETM
algorithms. It aids in the discovery of common transaction item sets and the identification
of association rules between them. The objective of association rule mining is to discover
interesting relationships or associations among items. The significance of support in
association rule mining measures the importance of an itemset in a dataset.
Problem Statement: When we go grocery shopping, we often have a standard list of things
to buy. Each customer/consumer has a distinctive list, depending on one’s needs and
preferences. A housewife might buy healthy ingredients for a family dinner, while a
bachelor might buy beer and chips. Understanding these buying patterns can help to
increase sales in several ways. If there is a pair of items, X and Y, that are frequently bought
together:
• Both X and Y can be placed on the same shelf, so that buyers of one item would be
prompted to buy the other.
• Promotional discounts could be applied to just one out of the two items.
• Advertisements on X could be targeted at buyers who purchase Y.
• X and Y could be combined into a new product, such as having Y in flavors of X.
We may be aware that particular things are commonly purchased together, but how can we
find these connections?
Definition: Association rules analysis is a technique to uncover how items are associated
to each other. There are 3 common ways to measure association.
Measure 1: Support. This says how popular an itemset is, as measured by the proportion
of transactions in which an itemset appears. In Table 1 below, the support of {apple} is 4
out of 8, or 50%. Itemsets can also contain multiple items. For instance, the support of
{apple, beer, rice} is 2 out of 8, or 25%.
Fig. Transaction Examples
If you discover that sales of items beyond a certain proportion tend to have a significant
impact on your profits, you might consider using that proportion as your support threshold.
You may then identify itemsets with support values above this threshold as significant
itemsets.
Measure 2: Confidence. This says how likely item Y is purchased when item X is
purchased, expressed as {X -> Y}. This is measured by the proportion of transactions with
item X, in which item Y also appears. In Table 1, the confidence of {apple -> beer} is 3
out of 4, or 75%.
One drawback of the confidence measure is that it might misrepresent the importance of
an association. This is because it only accounts for how popular apples are, but not beers.
If beers are also very popular in general, there will be a higher chance that a transaction
containing apples will also contain beers, thus inflating the confidence measure. To account
for the base popularity of both constituent items, we use a third measure called lift.
Measure 3: Lift. This says how likely item Y is purchased when item X is purchased,
while controlling for how popular item Y is. In Table 1, the lift of {apple -> beer} is 1,
which implies no association between items. A lift value greater than 1 means that item Y
is likely to be bought if item X is bought, while a value less than 1 means that item Y is
unlikely to be bought if item X is bought.
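To see how the three measures fit together, note that lift(X -> Y) = confidence(X -> Y) / support(Y).
With the figures quoted above, confidence({apple} -> {beer}) is 75%, and since the lift is reported
as 1, the support of {beer} must also be about 75% (6 out of 8 transactions); the rule therefore
tells us nothing beyond how popular beer already is.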
We use a dataset on grocery transactions from the arules R library. It contains actual
transactions at a grocery outlet over 30 days. The network graph below shows associations
between selected items. Larger circles imply higher support, while red circles imply higher
lift:
Observe the following diagram, which shows the associations between selected items,
visualized using the arulesViz R library.
Advantages
1. Easy to understand algorithm.
2. Join and Prune steps are easy to implement on large itemsets in large databases.
Disadvantages
1. It requires high computation if the itemsets are very large and the minimum support
is kept very low.
2. The entire database needs to be scanned.
Q.1 In the Apriori algorithm, for generating, e.g., 5-item sets, we use:
A. Frequent 4 item sets
B. Frequent 6 item sets
C. Frequent 5 item sets
D. None of the above
• An online store examining customer purchase data to see which goods are often
purchased together.
• The study may indicate that customers who buy laptops also buy mouse pads, extra
hard drives, and extended warranties.
• With this information, the online merchant might build targeted product bundles or
upsell opportunities, such as giving a package deal for a laptop, mouse pad, external
hard drive, and extended warranty.
• A healthcare organization uses market basket analysis to determine that patients
who are diagnosed with diabetes frequently also have high blood pressure and high
cholesterol.
• Based on this information, the organization can create a care plan that addresses
all three conditions, leading to improved patient outcomes and reduced
healthcare costs.
• A grocery store evaluating customer purchase data to discover which goods are
usually purchased together is a real-world example of market basket analysis.
Customers who buy bread may also buy peanut butter, jelly, and bananas. With this
knowledge, the retailer may make modifications to improve sales of these, such as
positioning them near each other on the shelf or providing discounts when
consumers purchase all four items together.
To perform market basket analysis, the following steps are basically followed:
2. Data Transformation: Convert the transactional data into a suitable format, such
as a binary matrix or a transaction-item matrix, where each row represents a
transaction and each column represents an item. The matrix is populated with
values indicating the presence or absence of an item in a transaction (a small R
sketch of this step appears after this list).
4. Rule Generation: Use the frequent itemsets to generate association rules. These
rules are derived based on metrics such as support, confidence, and lift. Support
measures the frequency of occurrence of an itemset, confidence quantifies the
reliability of the rule, and lift indicates the strength of the association between the
antecedent and consequent.
5. Rule Evaluation and Selection: Evaluate the generated rules based on predefined
thresholds or criteria, such as minimum support and minimum confidence. Select
the rules that meet the criteria and are considered interesting or actionable.
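As a minimal sketch of the data-transformation step mentioned above, the following R code
converts a small, hypothetical data frame of purchase records into the transactions format used
by the arules package (the object and column names are illustrative):
library(arules)
# Hypothetical raw purchase records: one row per (transaction, item) pair
purchases <- data.frame(
  transaction_id = c(1, 1, 2, 2, 2, 3),
  item = c("bread", "peanut butter", "bread", "jelly", "bananas", "bread")
)
# Group the items by transaction and coerce the list to the transactions class
trans <- as(split(purchases$item, purchases$transaction_id), "transactions")
summary(trans)   # number of transactions, items, and density
inspect(trans)   # list the items in each transaction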
Market basket analysis provides valuable insights into customer purchasing behavior,
allowing businesses to understand which items are frequently bought together. By
leveraging these associations, companies can make informed decisions to enhance
customer experiences, optimize inventory management, and increase sales.
Fig.: Association rules between products.
In the above diagram, you can see the associations between products such as {brown
bread, whole milk} and {whole milk, newspapers}, and how these products are related to each
other. These relations extend further to other products as well.
In R, we can visualize associations obtained from market basket analysis using various
packages. We will use the popular "arules" package for association rule mining, together with
its companion visualization package "arulesViz".
Install and load the arules package:
install.packages("arules")
library(arules)
Generate association rules from transaction data. Assuming we have our transaction data
in a format suitable for arules (e.g., a binary matrix or a transaction object), we can use the
apriori() function to mine association rules:
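A minimal sketch of this call is shown below, assuming the transactions are stored in an object
named trans (a hypothetical name):
# Mine association rules with minimum support 0.1 and minimum confidence 0.8
rules <- apriori(trans, parameter = list(supp = 0.1, conf = 0.8))
inspect(head(sort(rules, by = "lift"), 5))   # show the top five rules by lift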
In the above example, the apriori() function is used to generate association rules with a
minimum support of 0.1 and a minimum confidence of 0.8. Adjust these parameters
according to our data and requirements.
Visualize the associations using a scatter plot. The plot() function in arules allows us to
visualize the associations in a scatter plot:
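A minimal sketch, assuming the rules object created above and that the arulesViz package
(which provides the plot method for rules) has been installed:
library(arulesViz)
# Scatter plot of the rules: support on the x-axis, confidence on the y-axis,
# with the points shaded by lift
plot(rules, method = "scatterplot", measure = c("support", "confidence"), shading = "lift")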
Customize the plot. The scatter plot can be customized to display additional information.
Here are a few customization options:
# Use different interest measures on the axes and for shading
plot(rules, method = "scatterplot", measure = c("support", "lift"), shading = "confidence")
# Other useful views of the same rule set
plot(rules, method = "grouped")   # grouped matrix of antecedents and consequents
plot(rules, method = "graph")     # network graph linking items and rules
By visualizing the associations, we can gain a better understanding of the relationships
between items and identify patterns in our data. This can help us make informed business
decisions and develop effective strategies based on the discovered associations.
To explore association analysis results in Tableau, the following steps can be followed:
1. Prepare data: The data should be in a format suitable for association analysis,
where each row represents a transaction and each column represents an item, with
a binary indicator or count of each item's presence in each transaction.
2. Import data into Tableau: Open Tableau and connect to data source by selecting
the appropriate file type or database connection.
3. Create a new worksheet: Click on the "Sheet" tab at the bottom of the Tableau
interface to create a new worksheet.
4. Drag and drop data fields: From the Dimensions or Measures pane, drag the fields
which are to be analyzed onto the appropriate shelves. Typically, we would place
the transaction ID or unique identifier in the "Rows" shelf and the item fields in the
"Columns" or "Marks" shelf.
5. Adjust the visualization type: In the "Show Me" pane, choose a suitable
visualization type for association analysis. Some common choices include scatter
plots, heat maps, or treemaps. Select the visualization type that best represents the
associations we want to explore.
6. Add filters and highlight specific associations: Use Tableau's filtering
capabilities to focus on specific subsets of data or adjust the criteria for association
rules. We can also highlight specific associations by applying conditional
formatting or color-coding.
9. Save and share the visualization: Save Tableau workbook and share it with others
in organization or export it in various formats, such as PDF or image files, for
further presentation or reporting purposes.
10. Tableau offers a wide range of visualization options and customization features,
allowing us to create interactive and insightful representations of association
analysis results.
Automatic cluster detection, also known as cluster analysis or clustering, is a data analysis
technique that aims to identify natural groupings or clusters within a dataset without prior
knowledge of the cluster assignments. It is commonly used in various fields, including
machine learning, data mining, pattern recognition, and exploratory data analysis.
The goal of automatic cluster detection is to partition the data into clusters such that objects
within the same cluster are more similar to each other than to objects in other clusters.
Cluster analysis can help uncover hidden patterns, relationships, and structures in data,
providing valuable insights and facilitating decision-making processes.
There are several algorithms and methods available for automatic cluster detection, and the
choice depends on the characteristics of the data and the specific requirements of the
analysis. Here are a few commonly used techniques:
• K-means Clustering
• Hierarchical Clustering
• Gaussian Mixture Models (GMM)
• Self-Organizing Maps (SOM)
• Mean Shift
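As a brief illustration of one of these techniques, the following R sketch applies k-means
clustering to the numeric columns of the built-in iris data set; the choice of three clusters is an
assumption made for the example:
# K-means clustering on the numeric measurements of the iris data set
data(iris)
set.seed(42)                              # for reproducible cluster assignments
km <- kmeans(iris[, 1:4], centers = 3)    # partition the observations into 3 clusters
table(km$cluster, iris$Species)           # compare the clusters with the known species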
▪ Machine learning behaves similarly to the growth of a child. As a child grows, her
experience E in performing a task T increases, which results in a higher performance
measure P. (This echoes Tom Mitchell's classic definition: a program is said to learn from
experience E with respect to a task T and performance measure P if its performance at T,
as measured by P, improves with experience E.)
▪ Machine learning is a branch of artificial intelligence (AI) and computer science which
focuses on the use of data and algorithms to imitate the way that humans learn, gradually
improving its accuracy.
▪ Machine learning (ML) is a type of artificial intelligence (AI) that allows software
applications to become more accurate at predicting outcomes without being explicitly
programmed to do so.
▪ Machine learning is a field of study and application in artificial intelligence (AI) that
focuses on the development of algorithms and models that enable computers to learn and
make predictions or decisions without being explicitly programmed. It involves using
statistical techniques to enable computers to learn from data, identify patterns, and make
informed decisions or predictions.
▪ The main goal of machine learning is to develop algorithms and models that can learn
from data and improve their performance on specific tasks without being explicitly
programmed.
▪ Machine learning has a wide range of applications across various domains, including
natural language processing, computer vision, recommendation systems, fraud detection,
healthcare, finance, and many others. Its capabilities have significantly advanced fields
such as autonomous driving, speech recognition, image classification, and language
translation.
▪ It's important to note that machine learning is not a magical solution to all problems. It
requires careful data preprocessing, feature engineering, model selection, and evaluation
to build accurate and reliable models. Additionally, ethical considerations and responsible
use of machine learning systems are crucial to mitigate potential biases and ensure fairness
and accountability in decision-making processes.
▪ Companies use machine learning for purposes like self-driving cars, credit card fraud
detection, online customer service, e-mail spam interception, business intelligence (e.g.,
managing transactions, gathering sales results, business initiative selection), and
personalized marketing.
▪ Companies that rely on machine learning include heavy hitters such as Yelp, Twitter,
Facebook, Pinterest, Salesforce, and Google.
1. Regression (Prediction)
We use regression algorithms for predicting continuous values.
Regression algorithms:
• Linear Regression
• Polynomial Regression
• Exponential Regression
• Logistic Regression
• Logarithmic Regression
2. Classification
We use classification algorithms for predicting a set of items’ classes or categories.
Classification algorithms:
• K-Nearest Neighbors
• Decision Trees
• Random Forest
• Support Vector Machine
• Naive Bayes
3. Clustering
We use clustering algorithms for summarization or to structure data.
Clustering algorithms:
• K-means
• DBSCAN
• Mean Shift
• Hierarchical
4. Association
We use association algorithms for associating co-occurring items or events.
Association algorithms:
• Apriori
5. Anomaly Detection
We use anomaly detection for discovering abnormal activities and unusual cases like
fraud detection.
7. Dimensionality Reduction
We use dimensionality reduction to reduce the size of data to extract only useful features
from a dataset.
8. Recommendation Systems
We use recommender algorithms to build recommendation engines.
Examples:
In a regression problem, we are trying to map input variables to a continuous
function in order to predict outcomes within a continuous output. In a classification
problem, instead, we are attempting to forecast outcomes in a discrete output; to put
it another way, we are attempting to map input variables into discrete categories.
Suppose we are given data about the sizes of houses on the real estate market and we
try to predict their prices. Price as a function of size is a continuous output, so this is a
regression problem.
We could turn this example into a classification problem by instead making our
output about whether the house “sells for more or less than the asking price.”
Here we are classifying the houses based on price into two discrete categories.
Second example:
(a) Regression — Given a picture of a person, we have to predict their age on the
basis of the given picture
(b) Classification — Given a patient with a tumor, we have to predict whether the
tumor is malignant or benign.
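A minimal R sketch contrasting the two settings, using hypothetical house-size, sale-price, and
asking-price figures (all values are invented for illustration):
# Hypothetical data: house size (square metres), sale price, and asking price
size   <- c(50, 65, 80, 100, 120, 150)
price  <- c(110, 140, 165, 210, 255, 320)
asking <- c(100, 150, 160, 200, 260, 300)

# Regression: predict the continuous sale price from size
reg_model <- lm(price ~ size)

# Classification: predict whether a house sells above its asking price (a discrete label)
above_asking <- as.integer(price > asking)
clf_model <- glm(above_asking ~ size, family = binomial)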
▪ Unsupervised learning can also be used to detect abnormal or unusual patterns in the
data. Unsupervised learning does not require labelled training data or perform
classification tasks.
▪ In Unsupervised learning, there is no prior information about the data. And since
the data is not labelled, the machine should learn to categorize the data on the
similarity and patterns it finds in the data. There is no feedback based on the
outcomes of the predictions when learning is done unsupervised. This helps us to
obtain interesting relations between the features present in the data.
▪ Unsupervised learning is used for tasks like customer segmentation, where patterns
or clusters are identified in the data, and anomaly detection, where abnormal data
points are identified.
▪ Unsupervised learning is a type of machine learning where the algorithm learns
from unlabeled data without any specific target variable or labels. The goal is to
discover patterns, structures, or relationships within the data.
▪ Clustering is a technique in unsupervised learning that groups similar data points
together, while dimensionality reduction aims to reduce the number of features
while retaining important information.
▪ Reinforcement learning is an area of machine learning inspired by behaviorist
psychology, concerned with how software agents ought to take actions in an
environment so as to maximize some notion of cumulative reward.
▪ The reinforcement learning algorithm (called the agent) continuously learns from
the environment in an iterative fashion. The agent learns from its environment’s
experiences until it has explored the whole spectrum of conceivable states.
▪ Reinforcement Learning is a discipline of Artificial Intelligence that is a form of
Machine Learning. It enables machines and software agents to automatically select
the best behavior in a given situation in order to improve their efficiency. For the
agent to learn its behavior, it needs only simple reward feedback, which is known
as the reinforcement signal.
Example 1:
▪ Consider teaching a dog a new trick. You cannot tell it what to do, but you can
reward/punish it if it does the right/wrong thing. It has to figure out what it did that
made it get the reward/punishment. We can use a similar method to train computers
to do many tasks, such as playing backgammon or chess, scheduling jobs, and
controlling robot limbs.
Example 2:
▪ Teaching a game bot to perform better and better at a game by learning and adapting
to the new situation of the game.
Advantages:
• Handling a variety of data
• Wide applications
Disadvantages:
• Requires a large amount of data.
• Utilization of time and resources
• Result interpretation
• Susceptibility to errors in critical domains
▪ Machine learning and data mining are closely related fields that overlap in several
areas. Both fields deal with extracting knowledge and insights from data, but they
have different approaches and objectives. Here's an overview of machine learning
and data mining:
Data Mining:
▪ Data mining, on the other hand, is a broader field that encompasses various techniques
for discovering patterns, relationships, and insights from large datasets. It involves the
extraction of useful and previously unknown information from data, typically with the
aim of solving specific business or research problems.
▪ Data mining techniques often involve the application of statistical and machine learning
algorithms to large datasets. The focus is on finding patterns, trends, anomalies, or
meaningful associations in the data that can be used for decision-making or gaining
valuable insights. Data mining can involve exploratory data analysis, data
preprocessing, feature selection, and applying algorithms such as clustering, association
rules, classification, regression, and more.
▪ Data mining is designed to extract the rules from large quantities of data, while machine
learning teaches a computer how to learn and comprehend the given parameters.
▪ Data mining is simply a method of researching to determine a particular outcome based
on the total of the gathered data. On the other side of the coin, we have machine learning,
which trains a system to perform complex tasks and uses harvested data and experience
to become smarter.
▪ Data Mining is a process of separating the data to identify a particular pattern, trends,
and helpful information to make a fruitful decision from a large collection of data.
▪ Overall, data mining is a broad field that includes various techniques for extracting
knowledge from data, while machine learning is a closely related field that focuses on
the development of algorithms and models for learning from data and making
predictions or decisions. Machine learning techniques are often employed within the
data mining process to automate the discovery of patterns and relationships in large
datasets.
Relationship between Machine Learning and Data Mining:
▪ Machine learning techniques are often utilized within the broader process of data mining.
Data mining can involve using machine learning algorithms as one of the tools to extract
knowledge from data. Machine learning algorithms, such as decision trees, support vector
machines, neural networks, and ensemble methods, are commonly applied in data mining
tasks to discover patterns or make predictions. Data Preprocessing is common to both
machine learning and data mining.
▪ Machine Learning: Machine learning often involves an iterative process of training models,
evaluating their performance, and fine-tuning them based on feedback. It focuses on
building models that generalize well to unseen data.
▪ Data Mining: Data mining also follows an iterative process, involving data preprocessing,
pattern discovery, evaluation, and interpretation. It aims to extract meaningful insights
from data and communicate them effectively.
▪ In practice, machine learning techniques are often utilized within data mining workflows
to build predictive models or automate certain aspects of the data mining process. Data
mining can help identify relevant features or variables for machine learning models, and
machine learning can enhance the accuracy and predictive power of data mining
algorithms.
▪ From all the above points we can conclude that machine learning and data mining are
complementary fields that share common goals of extracting knowledge from data but
employ different techniques and approaches to achieve those goals. They contribute to each
other's progress and are essential components of the broader field of data science.
SUMMARY
• Market basket analysis is a data mining technique used by retailers to increase sales by
better understanding customer purchasing patterns.
• Identifying items that buyers desire to buy is the major goal of market basket analysis.
It may help sales and marketing teams develop more effective product placement,
pricing, cross-sell, and up-sell tactics.
• The Apriori Algorithm is a popular market basket analysis algorithm that uses the
Association Rule algorithm.
• Association is intended to identify strong rules discovered in databases using some
measures of interest.
• Association rules are visualized using two different types of vertices to represent
the set of items and the set of rules R, respectively. The edges indicate the
relationship in rules.
• An R-extension package arules is used to find the rules among data and arulesViz
package from R is used to visualize association rules.
• Automatic cluster detection provides a valuable tool for exploratory data analysis,
pattern recognition, and data-driven decision-making, allowing for the
identification of inherent structures and groupings within datasets without prior
knowledge or assumptions.
• By leveraging Tableau's capabilities, we can visually explore and communicate the
associations in data effectively.
• Data Mining is a process of separating the data to identify a particular pattern,
trends, and helpful information to make a fruitful decision from a large collection
of data.
• Machine learning is the autonomous acquisition of knowledge using computer
programs.
• Data mining is designed to extract the rules from large quantities of data, while
machine learning teaches a computer how to learn and comprehend the given
parameters.
• Supervised learning, Unsupervised learning, Reinforcement learning are some of
the machine learning algorithms.
• A challenge commonly associated with both data mining and machine learning is
the lack of skilled professionals.
References
1. Business Analytics using R – A Practical Approach by Dr. Umesh R. Hodeghatta, Umesha
Nayak
2. Data Warehousing, OLAP and Data Mining by S Nagabhushana
3. Data Mining Concepts and Techniques by Jiawei Han, Micheline Kamber, Jian Pei
4. Data Mining and Analysis - Fundamental Concepts and Algorithms by Mohammed J. Zaki
and Wagner Meira Jr.
5. arulesViz: Interactive Visualization of Association Rules with R by Michael Hahsler
6. Visualizing Association Rules: Introduction to the R-extension Package arulesViz by
Michael Hahsler and Sudheer Chelluboina
7. A Course in Machine Learning by Hal Daumé III
8. Introduction to Machine Learning Alex Smola and S.V.N. Vishwanathan
9. Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz
and Shai Ben-David
Q4. Which of the following statement(s) is/are true about unsupervised learning in machine
learning?
A. Unsupervised learning algorithms require labelled training data.
B. Unsupervised learning algorithms discover patterns and structures in unlabeled data.
C. Clustering and dimensionality reduction are examples of unsupervised learning
techniques.
D. Unsupervised learning is used for classification tasks.
E. Anomaly detection is a common application of unsupervised learning.
Q5. Which of the following statement(s) is/are true about supervised learning in machine
learning?
A. Supervised learning requires labelled training data.
B. The goal of supervised learning is to discover hidden patterns in unlabelled data.
C. Classification and regression are examples of supervised learning tasks.
D. Supervised learning algorithms can make predictions on new, unseen data.
E. K-means and Hierarchical Clustering are supervised learning algorithms.
Review Questions:
1. What is association rule and explain it in details.
2. What is Market Basket Analysis and write all the steps of Market Basket Analysis.
3. What is Machine Learning? Explain its advantages and disadvantages.
4. Explain all the types of machine learning.
5. Differentiate between data mining and machine learning.
Further Readings:
1. Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and
Presenting Data by EMC Education Services (2015)
2. Data Mining for Business Intelligence: Concepts, Techniques, and Applications in
Microsoft Office Excel with XLMiner by Shmueli, G., Patel, N. R., & Bruce, P. C.
(2010)
Unit IV: Business Forecasting
Topics:
• Business Forecasting
• Qualitative Techniques
• Quantitative Techniques
• Time Series Forecasting
Objectives:
15. To understand the concept of business forecasting
16. To understand the concept of qualitative and quantitative forecasting
techniques
17. To understand the applications of forecasting techniques
18. To introduce the concept of time series forecasting
19. To understand use of forecasting techniques
Outcomes:
15. Understand the concept of business forecasting.
16. Understand the concept of qualitative and quantitative forecasting
techniques.
17. Understand the applications of forecasting techniques.
18. Understand the concept of time series forecasting.
19. Understand use of forecasting techniques.
Introduction to the Unit
This unit introduces the concept of business forecasting and the terms quantitative and qualitative
data. Quantitative and qualitative forecasting techniques are explained along with their applications
and uses, and time series forecasting techniques are briefly covered at the end. The unit defines
quantitative and qualitative predictions, explores the value of quantitative and qualitative forecasts,
and offers some examples of quantitative and qualitative forecasting techniques.
Introduction
Forecasting
Forecasting involves making predictions about the future. It is a technique that uses historical data
as inputs to make informed estimates that are predictive in determining the direction of future
trends. Businesses utilize forecasting to determine how to allocate their budgets or plan for
anticipated expenses for an upcoming period. This is typically based on the projected demand for
the goods and services offered.
e. g., In finance, forecasting is used by companies to estimate earnings or other data for subsequent
periods. Traders and analysts use forecasts in valuation models, to time trades, and to identify
trends. Forecasts are often predicated on historical data. Because the future is uncertain, forecasts
must often be revised, and actual results can vary greatly.
Management teams use forecasts when deciding whether to make acquisitions, expand, or divest.
They also make forward-looking projections for public dissemination, such as earnings guidance.
Key factors of forecasting:
• Financial and operational decisions are made based on economic conditions and how the
future looks, albeit uncertain.
• Forecasting is valuable to businesses so that they can make informed business decisions.
• Financial forecasts are fundamentally informed guesses, and there are risks involved in
relying on past data and methods that cannot include certain variables.
• Forecasting approaches include qualitative models and quantitative models.
While there might be large variations on a practical level when it comes to business forecasting,
on a conceptual level, most forecasts follow the same process:
1. Selecting the critical problem or data point: This can be something like "will people buy
a high-end coffee maker?" or "what will our sales be in March next year?"
2. Identifying theoretical variables and an ideal data set: This is where the forecaster
identifies the relevant variables that need to be considered and decides how to collect the
data.
3. Assumption time: To cut down the time and data needed to make a forecast, the forecaster
makes some explicit assumptions to simplify the process.
4. Selecting a model: The forecaster picks the model that fits the dataset, selected variables,
and assumptions.
5. Analysis: Using the model, the data is analyzed, and a forecast is made from the analysis.
6. Verification: The forecast is compared to what happens to identify problems, tweak some
variables, or, in the rare case of an accurate forecast, pat themselves on the back.
7. Interpretation and decision making: Once the analysis has been verified, it must be
condensed into an appropriate format to easily convey the results to stakeholders or
decision-makers. Data visualization and presentation skills are helpful here.
Quantitative data
• Quantitative data is number-based, countable, or measurable. Quantitative data tells us
how many, how much, or how often in calculations.
• e. g. no. of employees in organization, no. of defects in product assembly, no. of
specific brand’s products purchased.
• Quantitative analysis involves looking at the hard data, the actual numbers.
Qualitative data
• Qualitative data is interpretation-based, descriptive, and relating to language. Qualitative
data can help us to understand why, how, or what happened behind certain behaviors.
• e. g. customer remarks, product reviews, etc. Qualitative analysis is less tangible.
• Qualitative data concerns subjective characteristics and opinions – things that cannot be
expressed as a number.
Qualitative and quantitative forecasting techniques are two distinct approaches used to
make predictions or estimates about future events or trends. Let's study each of these
techniques:
Qualitative Forecasting:
• Qualitative forecasting techniques rely on subjective judgments, expert opinions, and
qualitative data to make predictions. These techniques are typically used when historical
data is limited, unreliable, or unavailable, and when the focus is on understanding and
incorporating qualitative factors that may influence the future. The Qualitative forecasting
method is primarily based on fresh data like surveys and interviews, industry benchmarks,
and competitive analysis. This technique is useful for newly launched products, or verticals
wherein historical data doesn’t exist yet.
• Qualitative forecasting can help a company make predictions about their financial standing
based on opinions in the company. If you work as a manager or other high-level employee,
you can use forecasts to assess and edit your company's budget. Knowing how to use
qualitative forecasting can benefit your company by allowing input from external and
internal sources.
• Qualitative models have typically been successful with short-term predictions, where the
scope of the forecast was limited. Qualitative forecasts can be thought of as expert-driven,
in that they depend on market mavens or the market as a whole to weigh in with an
informed consensus.
• Qualitative forecasting models are useful in developing forecasts with a limited scope.
These models are highly reliant on expert opinions and are most beneficial in the short
term. Examples of qualitative forecasting models include interviews, on-site visits, market
research, polls, and surveys that may apply the Delphi method (which relies on aggregated
expert opinions).
• Gathering data for qualitative analysis can sometimes be difficult or time-consuming. The
CEOs of large companies are often too busy to take a phone call from a retail investor or
show them around a facility. However, we can still sift through news reports and the text
included in companies’ filings to get a sense of managers’ records, strategies, and
philosophies.
• Benefits of qualitative forecasting include the flexibility to use sources other
than numerical data, the ability to predict future trends and phenomena in business,
and the use of information from experts within a company's industry.
▪ Expert Opinion: Gathering insights and predictions from subject matter experts or
individuals with domain knowledge and expertise.
▪ Executive opinions: Upper management uses intuition to make decisions.
▪ Internal polling or Panel Consensus: Customer-facing employees share insights
about customers.
▪ Panel approach: this can be a panel of experts or employees from across a business
such as sales and marketing executives who get together and act like a focus group,
reviewing data and making recommendations. Although the outcome is likely to be
more balanced than one person’s opinion, even experts can get it wrong.
▪ Delphi Method: A structured approach that involves obtaining anonymous input
from a panel of experts through a series of questionnaires, followed by iterative
feedback and consensus building. Experts share their projections in a panel
discussion. The Delphi method is commonly used for technological forecasting and, more
generally, to forecast trends based on information provided by a panel of experts. The
method is named after the Oracle of Delphi and assumes that a group's answers are more
useful and unbiased than answers provided by one individual. The total number of rounds
involved may differ depending on the goal of the company or the group's researchers.
▪ These experts answer a series of questions in continuous rounds that ultimately lead
to the "correct answer" a company is looking for. The quality of information
improves with each round as the experts revise their previous assumptions
following additional insight from other members of the panel. The method ends
upon completion of a predetermined metric.
▪ Market Research: Conducting surveys, focus groups, or interviews to gather
qualitative information from customers, stakeholders, or target audience.
Customers report their preferences and answer questions.
▪ Historical Analysis: This kind of forecasting is used to forecast sales on the
presumption that a new product will have a similar sales pattern to that of an
existing product.
▪ Scenario Analysis and Scenario Planning: Developing multiple scenarios or
hypothetical situations based on different assumptions and exploring the potential
outcomes for each scenario. This can be used to deal with situations with greater
uncertainty or longer-range forecasts. A panel of experts is asked to devise a range
of future scenarios, likely outcomes, and plans to ensure the most desirable one is
achieved. For example, predicting the impact of a new sales promotion, estimating
the effect a new technology may have on the marketplace, or considering the
influence of social trends on future buying habits.
▪ Qualitative forecasting techniques are subjective in nature and rely on human
judgment, making them useful when dealing with complex or uncertain situations,
emerging trends, or when there is limited historical data available.
Companies in almost any industry can use qualitative forecasting to make predictions about
their future operations. Here's how a few industries might use qualitative forecasting:
• Sales: Qualitative forecasting can help companies in sales make decisions such as
how much of a product to produce and when to order more inventory.
• Healthcare: Healthcare employees can use qualitative forecasting to identify
trends in public health and decide which healthcare operations might be in high
demand in the near future.
• Higher Education: Colleges or universities can use qualitative forecasting to
predict the number of students who might enroll for the next term or year.
• Construction and manufacturing: Qualitative forecasting can show construction
and manufacturing companies the quantities of different materials they use, helping
them determine which materials or equipment they might need for their next project.
• Agriculture: Farmers can use qualitative forecasting to assess their sales and
decide which crops to plant for the next season based on which products consumers
purchase most often.
• Pharmaceutical: Qualitative forecasting can help pharmaceutical companies identify
which medications are popular among consumers and which needs people are using
pharmaceuticals to meet, in order to predict which kinds of pharmaceuticals they
might benefit from developing.
Q1. The choice of a forecasting method should be based on an assessment of the costs and
benefits of each method in a specific application.
A. True
B. False
Q.2 Surveys and opinion polls are qualitative techniques.
A. True
B. False
Q.3 The Delphi method generates forecasts by surveying consumers to determine their
opinions.
A. True
B. False
Quantitative Forecasting:
• Quantitative forecasting techniques, on the other hand, rely on historical data and
mathematical models to make predictions. Quantitative forecasting methods use
past data to determine future outcomes. These techniques are data-driven and
involve analyzing patterns, trends, and statistical relationships in the available data
to project future outcomes.
• The formulas used to arrive at a value are entirely based on the assumption that the
future will majorly imitate history.
• Understanding how your company's past might affect its future is essential for
managing a business or working in sales. One method for evaluating your company
is quantitative forecasting, which uses gathered data to draw conclusions about
prospective future outcomes. Understanding quantitative forecasting may help you
see future sales estimates and make smarter business decisions, whether you're
running your own company or attempting to predict the future of a certain product.
• Quantitative models discount the expert factor and try to remove the human element
from the analysis. These approaches are concerned solely with data and avoid the
fickleness of the people underlying the numbers. These approaches also try to
predict where variables such as sales, gross domestic product, housing prices, and
so on, will be in the long term, measured in months or years.
1. Objectivity: Numbers are neutral and free from any subjective judgment.
Examining empirical data provides a standard of objectivity that is useful for
making important business decisions. This makes realistic projections easier to
calculate and guarantees that the information is trustworthy.
2. Reliability: As analysts record and use accurate data in quantitative forecasting,
the inferences made become more reliable. Quantitative forecasting takes
advantage of the available information to provide reliable and accurate predictions
based on an established history. This makes it easier for business owners or
salespeople to pinpoint areas for growth.
3. Transparency: Because data reflects exactly how a business is performing, it
provides a level of transparency that can be very useful for quantitative forecasts.
The collected records present all of the information accurately and openly, which
provides an added level of clarity for making future business decisions.
4. Predictability: When businesses monitor their history and record their data for
quantitative forecasts, it makes trends easier to identify and predict. Using this
information, businesses can set realistic expectations and adjust their goals to
measure growth.
▪ Straight-line method: Businesses evaluate recent growth and predict how growth
might continue influencing data. The straight-line method is one of the simplest and
easy-to-follow forecasting methods. A financial analyst uses historical figures and
trends to predict future revenue growth.
▪ Machine Learning: Utilizing various machine learning algorithms such as
decision trees, random forests, support vector machines, or neural networks to learn
patterns from historical data and make predictions or classifications.
▪ Naive method: Businesses review historical data and assume future behavior will
reflect past behavior. The naive method bases future predictions by anticipating
similar results to the data gathered from the past. This method does not account for
seasonal trends or any other patterns that might arise in the collected data. It is the
most straightforward forecast method and is often used to test the accuracy of other
methods.
▪ Trend projection: Trend projection uses your past sales data to project your future
sales. It is the simplest and most straightforward demand forecasting method.
It’s important to adjust future projections to account for historical anomalies. For
example, perhaps you had a sudden spike in demand last year. However, it
happened after your product was featured on a popular television show, so it is
unlikely to repeat. Or your eCommerce site got hacked, causing your sales to
plunge. Be sure to note unusual factors in your historical data when you use the
trend projection method.
▪ Seasonal index: Businesses analyze historical data to find seasonal patterns.
▪ Moving average method: Businesses compute averages over successive time periods.
Moving averages are a smoothing technique that looks at the underlying pattern of a set
of data to establish an estimate of future values. The most common types are the
3-month and 5-month moving averages.
▪ Exponential smoothing: It is similar to the moving average method except that recent
data points are given more weight. It needs a single parameter called alpha, also known
as the smoothing factor, which controls the rate at which the influence of past
observations decreases exponentially. Alpha is set to a value between 0 and 1, and the
next forecast is computed as F(t+1) = alpha * y(t) + (1 - alpha) * F(t), where y(t) is the
latest observation and F(t) the latest forecast. (A short R sketch of several of these
methods appears after this list.)
▪ The indicator approach: The indicator approach depends on the relationship
between certain indicators, for example, GDP and the unemployment rate
remaining relatively unchanged over time. By following the relationships and then
following leading indicators, you can estimate the performance of the lagging
indicators by using the leading indicator data.
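To make a few of these techniques concrete, here is a minimal sketch in base R; the monthly sales figures and the alpha value are illustrative assumptions, not data from this unit.
# Hypothetical monthly sales figures (12 months)
sales <- c(120, 132, 128, 141, 150, 147, 158, 163, 170, 168, 177, 185)
sales_ts <- ts(sales, frequency = 12, start = c(2022, 1))
# Naive method: the forecast for future periods is simply the last observed value
naive_forecast <- rep(tail(sales, 1), 3)
# Straight-line / trend projection: fit a linear trend and extrapolate it
t <- 1:length(sales)
trend_fit <- lm(sales ~ t)
trend_forecast <- predict(trend_fit, newdata = data.frame(t = 13:15))
# 3-month moving average: each smoothed value is the mean of the last three observations
ma3 <- stats::filter(sales_ts, rep(1/3, 3), sides = 1)
# Simple exponential smoothing with alpha = 0.3 (no trend or seasonal component)
ses_fit <- HoltWinters(sales_ts, alpha = 0.3, beta = FALSE, gamma = FALSE)
ses_forecast <- predict(ses_fit, n.ahead = 3)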
Quantitative forecasting techniques require accurate and reliable historical data and assume
that past patterns and relationships will continue in the future. These techniques are most
effective when there is a significant amount of historical data available and when the
underlying factors influencing the forecast are stable and measurable.
C. Regression
D. Extrapolation
Q.2 ________are a smoothing technique that looks at the underlying pattern of a set of data
to establish an estimate of future values.
A. Moving averages
B. Seasonal index
C. Exponential smoothing
D. Extrapolation
3. External Factors: Quantitative methods of forecasting may not account for
external factors that can affect future trends.
It's worth noting that a combination of qualitative and quantitative techniques can be
employed in some forecasting scenarios. For example, qualitative inputs can be used to
inform or adjust quantitative models, or qualitative insights can be used to interpret and
validate quantitative forecasts. The choice of forecasting technique depends on the specific
situation, available data, and the level of accuracy and precision required for the forecast.
While time-series data is information gathered over time, data in general can also be classified
by how and when it was gathered. For example:
1. Time series data: It is a collection of observations on the values that a variable takes at
various points in time.
2. Cross-sectional data: Data from one or more variables that were collected
simultaneously.
3. Pooled data: It is a combination of cross-sectional and time-series data.
Time series analysis has a range of applications in statistics, sales, economics, and many more
areas. The common point is the technique used to model the data over a given period of time.
1. Features: Time series analysis can be used to track features like trend, seasonality, and
variability.
2. Forecasting: Time series analysis can aid in the prediction of stock prices. It is used if you
would like to know if the price will rise or fall and how much it will rise or fall.
3. Inferences: You can predict the value and draw inferences from data using Time series
analysis.
• Time-series data is a sequence of data points collected over time intervals, allowing
us to track changes over time.
• Time-series data can track changes over milliseconds, days, or even years.
Businesses are often very interested in forecasting time series variables.
• In time series analysis, we analyze the past behavior of a variable to predict its
future behavior.
• A time series is a set of observations on a variable's outcomes in different time
periods: the quarterly sales for a particular company during the past five years, for
example, or the daily returns on a traded security.
• A time series data example can be any information sequence that was taken at
specific time intervals (whether regular or irregular).
• Time-series analysis is a method of analyzing a collection of data points over a
period of time. Instead of recording data points intermittently or randomly, time
series analysts record data points at consistent intervals over a set period of time.
• A time series contains observations in numerical form, arranged in chronological
order. Analysing this observed data and applying it as input to derive possible future
developments was popularized in the late 20th century. It was
primarily due to the textbook on time series analysis written by George E.P. Box
and Gwilym M. Jenkins. They introduced the procedure to develop forecasts using
the input based on the data points in the order of time, famously known as Box-
Jenkins Analysis.
• Predicting the weather based on past weather observations is a time series problem.
Time series analysis is ideal for forecasting weather changes, helping meteorologists
predict everything from tomorrow's weather report to future years of climate change.
• Common data examples could be anything from heart rate to the unit price of store
goods.
Time-series forecasting
Time-series forecasting is a type of statistical or machine learning approach that tries to
model historical time-series data to make predictions about future time points. Time series
analysis is perhaps the most common statistical demand forecasting model.
Time series forecasting is a method of predicting future values based on historical patterns
and trends in a sequence of data points ordered over time. It is widely used in various fields,
including finance, economics, weather forecasting, sales forecasting, and demand
forecasting, among others.
The goal of time series forecasting is to understand and capture the underlying patterns and
dependencies within the data and use them to make accurate predictions about future
values. These predictions can help in making informed decisions, planning resources, and
identifying potential risks or opportunities.
Components of a Time Series:
1. Secular Trend: The general tendency of a time series to increase, decrease or
stagnate over a long period of time. Trend shows a common tendency of data. It
may move upward or increase or go downward or decrease over a certain, long
period of time. The trend is a stable and long-term general tendency of movement
of the data. Examples of trends include agricultural production and population growth.
In summer, the temperature may rise or fall on a given day, but the overall trend across
the first two months shows how the heat has been rising from the beginning. A trend
can be either linear or non-linear.
2. Seasonal Trend or Variations: Fluctuations within a year during the season.
Seasonal variations are changes in time series that occur in the short term. These
variations are often recorded as hourly, daily, weekly, quarterly, and monthly
schedules. Festivals, customs, fashions, habits, and occasions such as weddings impact
seasonal variations. For example, the sale of umbrellas increases during the rainy
season, and the sale of air conditioners increases during summer. Apart from natural
occurrences, man-made conventions like fashion, the marriage season, festivals, etc.,
play a key role in contributing to seasonal trends.
3. Cyclical Variations: These are medium- to long-term oscillations around the trend,
typically associated with the business cycle phases of prosperity, recession, depression,
and recovery. They may be regular or non-periodic in nature depending on the
situation. Normally, cyclical variations occur due to a combination of two or more
economic forces and their interactions.
4. Irregular or Random Variations: These are unpredictable fluctuations caused by
unforeseen events. Earthquakes, war, famine, and floods are some examples of random
time series components. (The short R sketch below shows how these components can
be separated from an observed series.)
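As a quick illustration, the following minimal R sketch uses the built-in AirPassengers dataset (monthly airline passenger counts, 1949-1960) to separate a series into these components with the decompose() function.
components <- decompose(AirPassengers) # split the series into trend, seasonal and random parts
plot(components)                       # plot the observed series and each component
components$seasonal[1:12]              # the estimated seasonal effect for each month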
Q.2 Increase in the number of patients in the hospital due to heat stroke is:
A. Secular trend
B. Irregular variation
C. Seasonal variation
D. Cyclical variation
Q.3 The sales of departmental store on Dussehra and Diwali are associated with the component
of a time series _________ variation.
A. Trend
B. Seasonal
C. Irregular
D. Cyclical
The time series forecasting process typically involves the following steps:
1. Data collection: Collect historical data points ordered by time. This data can
include measurements, observations, or any relevant information related to the
phenomenon being studied.
2. Data exploration and preprocessing: Clean the data (handle missing values and
outliers) and plot the series to inspect its trend and seasonality.
3. Stationarity: Check if the time series is stationary, meaning its statistical properties
remain constant over time. Stationarity is often assumed for many forecasting
models. If the series is non-stationary, transformations or differencing techniques
can be applied to make it stationary.
4. Model selection: Choose a forecasting model suited to the data. Commonly used
models include:
• Exponential Smoothing (ES) models
• Seasonal ARIMA (SARIMA) models
• Long Short-Term Memory (LSTM) networks (a type of deep learning model)
The selection of the model depends on factors such as the presence of seasonality,
trend, and the complexity of the data.
5. Model fitting and evaluation: Fit the selected model to the training data and tune
its parameters if necessary. Evaluate the model's performance using appropriate
metrics such as mean absolute error (MAE), mean squared error (MSE), or root
mean squared error (RMSE). Use validation data or cross-validation techniques to
assess the model's accuracy.
6. Forecasting: Once the model is trained and validated, use it to make predictions
on unseen future data points. Generate forecasts for the desired time horizon,
considering the level of uncertainty associated with the predictions.
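The following is a minimal R sketch of the model-fitting, evaluation, and forecasting steps (5 and 6) above; the simulated monthly demand series and the choice of a Holt-Winters model are illustrative assumptions.
set.seed(42)
# Simulated four years of monthly demand with an upward trend and random noise
demand <- ts(100 + 0.5 * (1:48) + rnorm(48, sd = 5), start = c(2020, 1), frequency = 12)
# Split into training data (2020-2022) and a hold-out test set (2023)
train <- window(demand, end = c(2022, 12))
test <- window(demand, start = c(2023, 1))
# Fit a Holt-Winters exponential smoothing model to the training data
fit <- HoltWinters(train)
# Forecast the 12 months of the test period and evaluate accuracy on the hold-out data
forecasts <- predict(fit, n.ahead = 12)
mae <- mean(abs(test - forecasts))       # mean absolute error
rmse <- sqrt(mean((test - forecasts)^2)) # root mean squared error
mae
rmse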
Time series forecasting can be a complex task, and the choice of models and
techniques depends on the specific characteristics of the data and the goals of the
forecasting task. It's often beneficial to explore multiple models and compare their
performance to select the most appropriate one for a given application.
A. True
B. False
Q.3 The fundamental assumption of time-series analysis is that past patterns in time-series
data will continue unchanged in the future.
A. True
B. False
Q.4 Time-series forecasting tends to be more accurate than "naive" forecasting.
A. True
B. False
Q.5 The long-run increase or decrease in time-series data is referred to as a cyclical
fluctuation.
A. True
B. False
• For quicker analyses that can encompass a larger scope, quantitative methods are often
more useful. Looking at big data sets, statistical software packages today can crunch the
numbers in a matter of minutes or seconds. However, the larger the data set and the more
complex the analysis, the pricier it can be.
• Thus, forecasters often make a sort of cost-benefit analysis to determine which method
maximizes the chances of an accurate forecast in the most efficient way. Furthermore,
combining techniques can be synergistic and improve the forecast’s reliability.
Limitations of forecasting:
• Forecasting can be dangerous. Forecasts can become a fixation for companies and governments,
mentally limiting their range of actions by presenting the short- to long-term future as pre-
determined. Moreover, forecasts can easily break down due to random elements that cannot
be incorporated into a model, or they can simply be wrong from the start.
• But business forecasting is vital for businesses because it allows them to plan production,
financing, and other strategies. However, there are three problems with relying on
forecasts:
• The data is always going to be old. Historical data is all we have to go on, and there is no
guarantee that the conditions in the past will continue in the future.
• It is impossible to factor in unique or unexpected events, or externalities. Assumptions are
dangerous, such as the assumption that banks were properly screening borrowers prior to
the subprime meltdown. Black swan events have become more common as our reliance on
forecasts has grown.
• Forecasts cannot integrate their own impact. By having forecasts, accurate or inaccurate,
the actions of businesses are influenced by a factor that cannot be included as a variable.
This is a conceptual knot. In a worst-case scenario, management becomes a slave to
historical data and trends rather than worrying about what the business is doing now.
• Negatives aside, business forecasting is here to stay. Appropriately used, forecasting allows
businesses to plan ahead for their needs, raising their chances of staying competitive in the
markets. That's one function of business forecasting that all investors can appreciate.
SUMMARY
▪ Quantitative forecasting methods use past data to determine future outcomes.
▪ The qualitative forecasting method is primarily based on fresh data like surveys and
interviews, industry benchmarks, and competitive analysis.
▪ Quantitative methods of forecasting are an essential tool used by businesses to make
informed decisions about the future.
▪ Regression Analysis, the Naïve Method, etc. are quantitative techniques, whereas the Delphi
Method, Market Research, etc. are qualitative techniques.
▪ A time series is a set of observations on a quantitative variable collected over time.
▪ Time Series Analysis consists of two steps: build a model that represents the time series,
then validate the proposed model and use it for forecasting.
▪ Trend, Cyclical Variation, Seasonal Variation, and Irregular Variation are the
components of Time Series.
References
1. Data Warehousing, OLAP and Data Mining by S Nagabhushana
2. Data Mining Concepts and Techniques by Jiawei Han, Micheline Kamber, Jian Pei
3. Data Mining and Analysis - Fundamental Concepts and Algorithms by Mohammed
J. Zaki and Wagner Meira Jr.
4. Data Analytics made Accessible by Dr. Anil Maheshwari
5. Data Science and Big Data Analytics by EMC Education Services.
6. Business Analytics using R – A Practical Approach by Dr. Umesh R. Hodeghatta,
Umesha Nayak
Review Questions:
Q1. What is forecasting? Explain business forecasting and its need.
Q.2 Discuss Qualitative and Quantitative forecasting.
Q.3 What are qualitative forecasting techniques?
Q.4 Explain quantitative forecasting techniques.
Q.5 What are the advantages and disadvantages of quantitative forecasting?
Q.6 What are the limitations of forecasting methods?
Q.7 Differentiate between qualitative and quantitative forecasting methods.
Q.8 Explain time series forecasting.
Q.9 Explain time series forecasting components with examples.
Q.10 Explain the time series forecasting process.
Further Readings:
1. Introductory Time Series with R (Use R!) 2009th Edition by Paul S.P. Cowpertwait,
Andrew V. Metcalfe
2. Forecasting principles and practice by Rob J Hyndman and George Athanasopoulos
3. Time Series Analysis and Its Applications: With R Examples 4th ed. 2017 Edition by
Robert H. Shumway, David S. Stoffer
4. Advances in Business and management forecasting by Kenneth D Lawrence
5. Forecasting methods and applications by Spyros G Makridakis, Steven C,
Wheelwright, Rob J Hyndman
Unit V: Social Network Analytics
Topics
• Social Network
• Social Network Analytics
• Text Mining
• Text Analytics
• Difference between Text Mining and Text analytics
• R Snippet to do Text Analytics
Objectives:
20. To understand the concept of social network
21. To understand the social network analytics and its process
22. To understand the concept of text mining
23. To comprehend the applications of text mining and text analytics
24. To apply the concept of Text Analytics using R Programming Language
Outcomes:
20. Understand the concept of social network.
21. Understand the social network analytics and its process.
22. Understand the concept of text mining.
23. Comprehend the applications of text mining and text analytics.
24. Apply the concept of Text Analytics using R Programming Language
Introduction to the Unit
This unit introduces social networks, social network analytics and the process of social network
analytics (SNA). The theoretical basics of social network analysis are briefly reviewed, and the
main methods needed to carry out this kind of analysis are covered. It addresses concerns with
data gathering, and social network structural metrics. The Unit also covers the advantages and
disadvantages of SNA. Also, the concept of text mining and applications of text mining are
discussed in this unit.
Social networks
• Social networks are websites and mobile apps that allow users and organizations to
connect, communicate, share information, and form relationships. Social networks have
become very popular in recent years because of the increasing proliferation and
affordability of internet enabled devices such as personal computers, mobile devices, and
other more recent hardware innovations such as internet tablets, etc. People can connect
with others in the same area, families, friends, and those with the same interests, e.g.,
Facebook, WhatsApp, Twitter, Instagram, etc.
• Social media and social networks are not the same thing, despite the terms frequently being
used interchangeably. A social network's main emphasis is an individual's connections and
interactions with others, whereas in social media the emphasis is on an individual sharing
content with a large audience; 'media' is used here in the same sense as in mass media. The
majority of social networks also function as social media websites.
• Social network analysis is a method of studying how people interact with each other as
individuals and as groups. It enables one to understand the networks of relationships
between people in society and to analyze the different cultural and relational paths societies
take.
• Social network analysis (SNA) is the method of investigating social structures with the
help of networks and graph theory.
• Social network analysis (SNA) has been used in many disciplines such as sociology,
anthropology, political science, psychology, philosophy, and many others.
• SNA is one of the most significant tools used in psychology research to investigate people's
thoughts and feelings in social contexts. The technique can be used for many purposes,
such as understanding the spread of disease, predicting crime rates, or understanding social
movements.
• The focus of a social network will be user-generated content. Users mostly engage with,
and view material created by other users. They are urged to provide text, status updates, or
images for public access.
• Users and organisations can build profiles on social networks. The profile includes the
person's bio and a core page containing the material they have uploaded. Their profile could
correspond to their legal name.
• A social network can help members establish enduring connections with one another.
Friending or following the other user are popular terms used to describe these ties. They
enable people to connect with one another and build social networks. Frequently, an
algorithm may suggest additional persons and businesses that they would like to connect
with.
• Social network analytics refers to the process of analyzing and interpreting data from social
networks to gain insights and understand the patterns, dynamics, and characteristics of
social relationships. It involves using various techniques and methodologies to extract
valuable information from social network data.
People and organisations use social networks for several purposes:
Sharing: Geographically separated friends or family members may communicate remotely and
exchange information, updates, images, and videos. People can grow their existing social networks
or meet new people through social networking who share their interests.
Learning: Social media sites are excellent forums for education. Customers may quickly acquire
breaking news, updates on friends and family, or information on what's going on in their
neighbourhood.
Interacting: Social networking improves user interactions by removing time and distance
restrictions. People can have face-to-face conversations with anyone around the globe using
cloud-based video communication services like WhatsApp video calls or Instagram Live.
Marketing: Companies may utilise social networking platforms to boost brand and voice
identification, increase customer retention and conversion rates, and increase brand awareness
among platform users.
Types of social networks:
Social Connections: In this kind of social network, users may connect with friends, family,
acquaintances, brands, and more through online profiles and updates. Users can also meet new
people through shared interests. Instagram, Myspace, and Facebook are a few examples.
Professional connections: These social networks are made for business interactions and are
geared towards professionals who want to build relationships with colleagues. They may be
utilised, for instance, to research career prospects, strengthen current business relationships,
and develop new contacts in one's industry. They could offer an exclusive platform focusing on
certain professions or interests, or they might feature a generic forum for professionals to
communicate with colleagues. Examples are Microsoft Viva, LinkedIn, and Yammer.
Sharing of Multimedia: YouTube and Flickr, among other social networks, offer facilities for
sharing videos and photos.
News or informational: Users can submit news articles, educational materials, or how-to guides
to this sort of social networking site, which can be both all-purpose and topic specific. These social
networks, which share a lot with web forums, host groups of users seeking for solutions to common
issues. Members answer inquiries, host forums, or instruct others on how to do various activities
and projects to foster a sense of assisting others. Examples in use now include Digg, Stack
Overflow, and Reddit.
Communication: In this case, social networks emphasise enabling direct one-on-one or group
discussions between users. They resemble instant chat applications and place less emphasis on
postings or updates. WeChat, Snapchat, and WhatsApp are examples.
Educational: Remote learning is made possible by educational social networks, allowing students
and professors to work together on assignments, do research, and communicate through blogs and
forums. Popular examples include ePals, LinkedIn Learning, and Google Classroom, MS Teams,
Zoom, etc.
Advantages of Social Networking:
Brand Recognition: Social networking helps businesses to connect with both potential and
current customers. This increases brand awareness and helps make brands more relevant.
Instant reachability: Social networking websites may offer immediate reachability by removing
the geographic and physical distances between individuals.
Builds a following: Social networking may help businesses and organisations grow their clientele
and reach throughout the globe.
Business Achievement: On social networking sites, customers' favourable evaluations and
comments may boost sales and profitability for businesses.
Increased Usage of Websites: Social networking profiles may be used by businesses to increase
and steer inbound traffic to their websites. They can do this, for instance, by including motivating
images, employing plugins and social media sharing buttons, or promoting inbound links.
Disadvantages of Social Networking:
Rumours and misinformation: Social networking sites have gaps through which inaccurate
information can spread, confusing and upsetting customers. People frequently believe everything
they read on social networking platforms without checking the sources.
Negative Comments and Reviews: An established company may suffer from a single
unfavourable review, particularly if it is published on a website with a sizable audience. A
damaged corporate reputation frequently results in permanent harm.
Data security and privacy concerns: Social media platforms may unintentionally expose user
data to risk. For instance, when a social networking site suffers a data breach, all of its users are
immediately put on alert. In April 2021, for example, Business Insider reported that the personal
information of more than 500 million Facebook users had been exposed online.
Time Consuming Process: Social media marketing for a company necessitates ongoing care and
maintenance. Regular post creation, updating, planning, and scheduling may take a lot of time.
Small companies that do not have the extra personnel or resources to devote to social media
marketing may find this to be particularly burdensome.
Social network analysis is the mapping and measuring of relationships and flows between people,
groups, organizations, computers, or other information/knowledge-processing entities. The nodes
in the network are the people and groups, while the links show the relationships or flows between
the nodes. SNA provides both a visual and a mathematical analysis of human relationships.
Fig. Network Graph
Here are some key aspects and methods used in social network analytics:
Data collection: Social network analytics begins with collecting relevant data from social
networks, which may include information about individuals, their connections, and their
activities within the network. This data can be obtained from online platforms, such as
Facebook, Twitter, LinkedIn, or specialized datasets.
Network visualization: Visualizing social networks is an essential step in understanding
the structure and organization of the network. Network graphs and visual representations
help in identifying patterns, clusters, and influential nodes within the network.
Centrality measures: Centrality measures quantify the importance or prominence of
nodes within a social network. Degree centrality measures the number of connections a
node has, while betweenness centrality identifies nodes that act as bridges between
different parts of the network. Other measures include closeness centrality and eigenvector
centrality. (An R sketch of these measures appears after this list.)
Community detection: Community detection algorithms help identify groups or
communities of nodes that have stronger connections with each other compared to the rest
of the network. It helps in understanding the formation of subgroups or clusters within the
network.
Sentiment analysis: Sentiment analysis involves analysing textual data from social
network posts, comments, or messages to determine the sentiment expressed by
individuals. It helps in understanding the attitudes, opinions, and emotions within the
network.
Influence identification: Social network analytics can be used to identify influential nodes
or individuals who have a significant impact on information flow or decision-making
within the network. Influence identification methods may consider factors such as node
centrality, activity level, or community involvement.
Network dynamics: Analysing changes and dynamics in a social network over time
provides insights into the evolution of relationships, communities, and information
diffusion. It helps in understanding how networks grow, adapt, and transform.
Network diffusion: Network diffusion models study the spread of information, ideas, or
behaviours within a social network. These models help in understanding the mechanisms
of influence, viral marketing, or the propagation of trends and innovations.
Social network theory: Social network analytics draws upon theories and concepts from
social network analysis, sociology, graph theory, and other related disciplines. These
theories provide a framework for understanding social structures, relationships, and
interactions within networks.
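As a concrete illustration of the centrality measures and community detection described above, here is a minimal sketch using the igraph package (assumed to be installed) on a small, hypothetical friendship network.
library(igraph)
# Hypothetical friendship network: each row is a connection between two people
edges <- data.frame(
  from = c("Asha", "Asha", "Ravi", "Ravi", "Meera", "John", "John", "Sara"),
  to   = c("Ravi", "Meera", "Meera", "John", "Sara", "Sara", "Asha", "Tina")
)
g <- graph_from_data_frame(edges, directed = FALSE)
# Centrality measures
degree(g)                  # number of connections of each person
betweenness(g)             # how often a person lies on shortest paths between others
closeness(g)               # how close a person is to everyone else in the network
eigen_centrality(g)$vector # influence based on the importance of one's connections
# Community detection and visualization
comm <- cluster_louvain(g)
membership(comm)           # the community assigned to each person
plot(comm, g)              # network graph with the detected communities highlighted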
• SNA can be used to identify people who are linked, but who may not be part of a formal
community. These people can be invited to join a community relevant to them.
• Improving effectiveness of functions or business units.
• Forming strategic partnerships or assessing client connectivity.
• Overall, social network analytics enables researchers, organizations, and individuals to gain
valuable insights into social relationships, behaviors, and communication patterns, aiding
decision-making, marketing strategies, community management, and various research
endeavors.
Social network analysis (SNA) has a wide range of applications across various domains due to
its ability to reveal hidden patterns, connections, and insights within social networks. Here are
some key applications of social network analysis:
• Identifying key influencers and opinion leaders for targeted marketing campaigns.
• Analyzing customer connections to improve segmentation and personalized marketing
strategies.
• Monitoring brand reputation and sentiment on social media platforms.
3. Healthcare and Public Health:
• Identifying individuals at high risk in epidemiological studies.
• Understanding patient referral patterns and healthcare provider collaborations.
4. Social Media Analysis:
• Identifying trending topics, hashtags, and influential users on social media platforms.
• Analyzing sentiment and emotions in user-generated content.
• Studying information diffusion and viral content propagation.
5. Counterterrorism and Security:
• Detecting and preventing radicalization by identifying key individuals or groups.
• Analyzing communication networks to track potential threats and activities.
• Understanding the structure of criminal networks and organized crime.
6. Academic Research:
• Studying collaboration and knowledge flow among researchers and academics.
• Analyzing citation networks to identify influential papers and researchers.
• Understanding scientific collaboration patterns and interdisciplinary research.
7. Political Science and Policy Analysis:
• Analyzing political connections and alliances among politicians.
• Studying lobbying activities and interest group networks.
• Tracking the flow of information and influence within political systems.
8. Supply Chain Management:
• Analyzing supplier-customer relationships to optimize supply chain efficiency.
• Identifying potential disruptions and vulnerabilities within supply networks.
• Improving coordination and collaboration among supply chain partners.
9. Education and Learning:
• Analyzing student interactions to enhance classroom dynamics and group projects.
• Identifying peer influence and social learning patterns.
• Studying the spread of educational innovations and best practices.
10. Urban Planning and Community Development:
• Identifying commuting patterns and social interactions within cities.
• Studying social networks within neighborhoods for community development.
These are some examples of the various applications of social network analysis. SNA continues
to find new applications as technology evolves and more data becomes available for analysis.
Text mining
• Text mining is a subset of data mining, which is the process of finding patterns in large
volumes of data. Text mining identifies facts, relationships and assertions that would
otherwise remain buried in the mass of textual big data. Manually scanning and classifying
these documents can be extremely time-consuming, so automating text mining can save
businesses considerable time and effort. Managers can then use the discoveries to make
better informed decisions and quickly take action.
• Text mining is the process of exploring and analyzing large amounts of unstructured text
data aided by software that can identify concepts, patterns, topics, keywords, and other
attributes in the data.
• Text mining has become more practical for data scientists and other users due to the
development of big data platforms and deep learning algorithms that can analyze massive
sets of unstructured data.
• Mining and analyzing text help organizations find potentially valuable business insights in
corporate documents, customer emails, call center logs, verbatim survey comments, social
media posts, medical records, and other sources of text-based data. Increasingly, text
mining capabilities are also being incorporated into AI chatbots and virtual agents that
companies deploy to provide automated responses to customers as part of their marketing,
sales, and customer service operations.
• Once extracted, this information is converted into a structured form that can be further
analyzed, or presented directly using clustered HTML tables, mind maps, charts, etc. The
structured data created by text mining can be integrated into databases, data warehouses or
business intelligence dashboards and used for descriptive, prescriptive, or predictive
analytics.
Typical text mining tasks include:
• Topic Modelling: Identifying the underlying topics or themes within a collection of
documents.
• Relationship Extraction: Discovering relationships and connections between entities
mentioned in the text.
• Summarization: Creating concise and coherent summaries of lengthy documents or texts.
• Clustering: Grouping similar documents or texts together based on content similarity.
Text Analytics:
Text analytics is a broader term that encompasses a range of techniques used to process, analyse,
and interpret unstructured text data. It includes text mining as one of its components but also
includes other methods that focus on deriving insights from text. Text analytics often has a
stronger emphasis on business intelligence and decision-making. Text analytics refers to the
application that uses text mining techniques to sort through data sets. In order to extract insights
and patterns from massive amounts of unstructured text—text that does not follow a
predetermined format—text analytics integrates a variety of machine learning, statistical, and
linguistic approaches. It makes it possible for organizations, governments, scholars, and the
media to use the vast material at their disposal to make important choices. Sentiment analysis,
topic modelling, named entity identification, phrase frequency, and event extraction are just a
few of the techniques used in text analytics.
In addition to the tasks mentioned under text mining, text analytics may involve:
• Descriptive Analytics: Summarizing and describing text data to provide an overview of
its content.
• Predictive Analytics: Using text data to make predictions or forecasts about future events
or trends.
• Prescriptive Analytics: Offering recommendations or suggesting actions based on text
data analysis.
• Contextual Analysis: Understanding the context and context-dependent meanings of
words and phrases.
• Text Visualization: Creating visual representations of text data to aid in understanding
and exploration. Data visualization techniques can then be harnessed to communicate
findings to wider audiences. By transforming the data into a more structured format
through text mining and text analysis, more quantitative insights can be found through text
analytics.
Overall, while text mining is a specific subset of text analytics focused on extracting patterns and
insights from text data, text analytics encompasses a broader set of techniques that includes mining
as well as other analytical and interpretative approaches. Both text mining and text analytics play
important roles in turning unstructured text data into actionable information for various
applications and industries.
Text analytics is a sophisticated, multi-step approach: several preliminary steps are used to collect
and clean the unstructured material before analysis. Text analytics may be carried out in a variety
of ways; the following is one model workflow.
1. Data gathering: The unstructured text to be analysed is collected from the relevant sources.
2. Data preparation: The raw text is cleaned and prepared for analysis through several sub-steps:
a. Tokenization: The text is broken down into smaller units, called tokens, at the
tokenization stage. For instance, character tokens may represent each letter in the word
"Fish" individually. Alternatively, you may segment by sub-word tokens, like "fish" and
"ing" in "fishing". Tokens are the foundation of all natural language processing.
Additionally, all of the text's undesirable elements, including extra white space, are
removed in this stage.
b. Part of Speech Tagging: Each token in the data is given a grammatical category, such
as a noun, verb, adjective, or adverb, at this stage.
c. Parsing: Understanding the syntactical structure of a document is the process of
parsing. Two common methods for determining syntactical structure are constituency
parsing and dependency parsing.
d. Lemmatization and stemming: These two procedures reduce tokens to a base form
during data preparation. Stemming strips off prefixes and suffixes (affixes), while
lemmatization reduces each token to its dictionary form, or lemma.
e. Stopword removal: All the tokens that appear frequently but add little value to text
analytics are removed at this stage. 'And', 'the', and 'a' are examples of words that fall
under this category.
3. Text analytics: Unstructured text data must first be prepared before text analytics
techniques may be used to provide insights. Text analytics uses a variety of methodologies.
Text categorization and text extraction stand out among them.
4. Text classification: This method is sometimes referred to as text tagging or text
categorization. The text is given specific tags in this phase based on its significance. For
instance, labels like "positive" or "negative" are applied while analyzing customer
feedback. Rule-based or machine learning-based systems are frequently used for text
categorization. Humans specify the connection between a linguistic pattern and a tag in
rule-based systems. "Good" may denote a favorable review, while "bad" may denote a
negative review.
To tag a new set of data, machine learning algorithms employ training data or examples
from the past. Larger amounts of data enable the machine learning algorithms to produce
correct tagging results, therefore the training data and its volume are essential. Support
Vector Machines (SVM), the Naive Bayes family of algorithms (NB), and deep learning
algorithms are the primary algorithms utilized in text categorization.
5. Text extraction: is the process of taking recognizable and structured data out of the input
text's unstructured form. Keywords, names of individuals, places, and events are included
in this data. Regular expressions are one of the straightforward techniques for text
extraction. When input data complexity rises, it becomes difficult to sustain this strategy.
Conditional Random Fields (CRF) is a statistical method used in text extraction. CRF is a
more modern and effective way of extracting vital information from unstructured text.
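Here is a minimal sketch of the simple, regular-expression end of text extraction in base R, applied to hypothetical input; a production system would typically rely on NER models or CRFs instead.
notes <- c("Meeting with Dr. Rao on 2023-08-15, contact: rao@example.com",
           "Invoice 4521 issued on 2023-09-01 to accounts@example.org")
# Extract dates in YYYY-MM-DD format (first match in each note)
regmatches(notes, regexpr("[0-9]{4}-[0-9]{2}-[0-9]{2}", notes))
# Extract all e-mail addresses
unlist(regmatches(notes, gregexpr("[A-Za-z0-9._]+@[A-Za-z0-9._]+", notes)))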
Know your progress:
• Naive Bayes: Naive Bayes is a probabilistic algorithm that is often used for text
classification tasks. It assumes that the presence of a particular feature in a class is
independent of the presence of other features, hence the "naive" assumption. Naive Bayes
is computationally efficient and works well with large text datasets.
• Support Vector Machines (SVM): SVM is a supervised learning algorithm that can be
used for both classification and regression tasks. In text mining, SVM is often employed
for text classification problems where the goal is to assign a document to one of the
predefined categories. SVM works by finding an optimal hyperplane that separates the data
points representing different classes.
• Decision Trees: Decision trees are hierarchical structures that recursively split the data
based on different features. In text mining, decision trees can be used for tasks like text
classification, topic modeling, and sentiment analysis. Decision trees are easy to interpret
and can handle both categorical and numerical features.
• Random Forests: Random forests are an ensemble learning method that combines
multiple decision trees to make predictions. In text mining, random forests can be used for
tasks such as text classification, sentiment analysis, and feature selection. Random forests
reduce overfitting and provide robust predictions by aggregating the results from multiple
decision trees.
• Hidden Markov Models (HMM): HMM is a statistical model that is widely used for
sequence analysis and natural language processing tasks. HMMs are particularly useful for
tasks such as part-of-speech tagging, named entity recognition, and speech recognition.
They model the underlying probabilistic transitions between different states and the
observed output based on those states.
• Text mining and text analysis identifies textual patterns and trends within unstructured data
using machine learning, statistics, and linguistics.
• Natural Language Understanding helps machines “read” text (or speech) by simulating the
human ability to understand a natural language such as English, Spanish, or Chinese.
• Text mining employs a variety of methodologies to process the text, one of the most
important of these being Natural Language Processing (NLP).
• NLP includes both Natural Language Understanding and Natural Language Generation,
which simulates the human ability to create natural language text e. g. to summarize
information or take part in a dialogue.
These are just a few examples of the algorithms used in text mining. The choice of algorithm
depends on the specific task, dataset, and goals of the analysis. It is common to use a
combination of multiple algorithms and techniques to extract meaningful insights from text
data. A short R sketch of one of them (Naive Bayes) is given below.
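The sketch is a minimal illustration of Naive Bayes text classification using the e1071 package (assumed to be installed) and a tiny, hypothetical set of labelled reviews encoded as word-presence features; a real pipeline would build such features automatically from a document-term matrix.
library(e1071)
# Tiny labelled training set: does each review contain the word? ("yes"/"no")
train <- data.frame(
  good  = factor(c("yes", "yes", "no", "no")),
  bad   = factor(c("no", "no", "yes", "yes")),
  price = factor(c("yes", "no", "yes", "yes")),
  label = factor(c("positive", "positive", "negative", "negative"))
)
# Train a Naive Bayes classifier with Laplace smoothing
model <- naiveBayes(label ~ ., data = train, laplace = 1)
# Classify a new review that mentions "good" and "price" but not "bad"
new_review <- data.frame(
  good  = factor("yes", levels = c("no", "yes")),
  bad   = factor("no", levels = c("no", "yes")),
  price = factor("yes", levels = c("no", "yes"))
)
predict(model, new_review)               # predicted class
predict(model, new_review, type = "raw") # class probabilities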
• Entity Recognition: Entity recognition involves identifying and classifying named entities
such as people, organizations, locations, dates, or other specific terms mentioned in the
text.
• Topic Modeling: Topic modeling is a technique that identifies the main topics or themes
present in a collection of documents. It can help uncover hidden patterns and structures
within the text data and is often used for document clustering, content recommendation,
and information retrieval. (A small R sketch of topic modeling appears after this list.)
• Text Classification: Text classification involves categorizing text documents into
predefined categories or classes. It is used for tasks like spam filtering, sentiment analysis,
news categorization, and document routing.
• Text mining involves the application of various algorithms and techniques from fields such
as natural language processing (NLP), machine learning, and statistics.
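As a small illustration of the topic modeling task mentioned above, the following sketch uses the tm and topicmodels packages (both assumed to be installed) on four tiny, hypothetical documents; real corpora would of course be far larger.
library(tm)
library(topicmodels)
docs <- c("cricket match score team wins",
          "election vote parliament policy",
          "cricket team wins the final match",
          "government policy vote and parliament debate")
# Build a document-term matrix and fit a two-topic LDA model
dtm <- DocumentTermMatrix(Corpus(VectorSource(docs)))
lda <- LDA(dtm, k = 2, control = list(seed = 123))
terms(lda, 3) # the three most probable terms for each topic
topics(lda)   # the dominant topic assigned to each document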
Text mining involves extracting meaningful information and insights from large volumes
of unstructured text data. It has numerous applications across various industries and
domains. Here are some key applications of text mining:
1. Sentiment Analysis:
• Analyzing social media posts, reviews, and comments to determine public
sentiment about products, services, or brands.
• Monitoring customer feedback to assess satisfaction levels and identify areas for
improvement.
• Identifying emerging trends and consumer preferences from online discussions and
forums.
3. Content Categorization and Classification:
• Automatically categorizing news articles, blog posts, or documents into relevant
topics or themes.
• Labeling emails or documents for better organization and retrieval.
4. Information Retrieval and Search Enhancement:
• Improving search engines by understanding user intent and returning more relevant
results.
• Extracting key information from documents to provide snippets or summaries in
search results.
5. Document Summarization:
• Generating concise and coherent summaries of lengthy documents for quick
understanding.
• Creating executive summaries of reports and research papers.
6. Named Entity Recognition (NER):
• Identifying and classifying entities such as names of people, organizations,
locations, and dates in text.
• Enhancing database entries and information retrieval by linking entities.
7. Topic Modeling:
• Discovering latent topics within a collection of documents.
• Understanding the main themes in a large corpus of text data.
8. Fraud Detection:
• Identifying patterns of fraudulent activities by analyzing textual descriptions and
transaction details.
• Detecting unusual or suspicious language in insurance claims or financial
documents.
9. Medical and Healthcare Applications:
• Mining electronic health records to identify patterns and trends in patient data.
• Extracting medical insights from research papers and clinical notes.
10. Legal and Compliance Analysis:
• Automating the review and analysis of legal contracts and documents.
• Identifying potential compliance violations in textual data.
11. Human Resources and Employee Feedback:
• Analyzing employee surveys and feedback to assess engagement levels and identify
workplace issues.
• Extracting skills, qualifications, and experience from resumes for recruitment
purposes.
12. Academic Research:
• Analyzing research papers to identify relevant citations and build citation networks.
• Extracting data for systematic reviews and literature surveys.
13. Social Media Analytics:
• Tracking trending topics and viral content on social media platforms.
• Understanding public opinions and reactions to current events.
14. Language Translation and Cross-Language Analysis:
• Enabling automatic translation of text between languages.
• Comparing sentiment, themes, and trends across different languages.
15. Competitive Intelligence:
• Analyzing competitor websites, press releases, and documents to gain insights into
their strategies and offerings.
• Identifying emerging competitors in the market.
While text mining offers several benefits, it also comes with a few disadvantages and challenges.
Here are some of the key disadvantages of text mining:
8. Subjectivity and Tone Analysis: Determining the emotional tone, sentiment, or
intention behind text can be challenging due to the subjectivity of human language.
9. Semantic Understanding: Extracting accurate meaning and context from text requires
deep semantic understanding, which current algorithms might struggle with.
10. Unstructured Data: Text data is unstructured, making it harder to process and analyze
compared to structured data.
11. Constantly Evolving Language: Languages and vocabularies change over time,
leading to challenges in keeping text mining algorithms up to date.
12. Interdisciplinary Expertise Required: Effective text mining often requires expertise
in linguistics, data science, and domain knowledge, making it a multidisciplinary
endeavor.
13. Overfitting and Generalization: Models trained on specific text datasets might overfit
and struggle to generalize to new, unseen data.
14. Lack of Visual Context: Text lacks the visual context present in images or videos,
making it difficult to capture certain types of information.
15. Time-Consuming Annotation: Preparing and annotating text data for training
machine learning models can be time-consuming and labor-intensive.
16. Ethical and Legal Concerns: Analyzing text data might raise ethical and legal
concerns, such as copyright infringement or unintended consequences of data use.
Despite these disadvantages, text mining remains a valuable tool for extracting insights from
unstructured data, and advancements in natural language processing and machine learning
continue to address many of these challenges.
Let’s see how to do Text Analytics using R programming Language. We need multiple packages
to be installed in RStudio to do Text Analytics.
1. gutenbergr (to install this library, the supporting packages 'lazyeval', 'urltools', and
'triebeard' are also required)
2. tidytext
3. dplyr
4. ggplot2
Once all these libraries are installed and loaded, the following R snippet can be run in RStudio
to produce the expected result.
To do text analytics, we are going to use Mark Twain's books from the gutenbergr library.
Write the following code in RStudio and run it; you will be able to plot a graph of the most frequent words.
library(gutenbergr) # load gutenbergr to download books from Project Gutenberg
library(dplyr)      # load dplyr for the %>% pipe and data manipulation
# In the gutenbergr library, each book is tagged with an ID number, which is needed to
# identify its location.
mark_twain <- gutenberg_download(c(76, 74, 3176, 245))
# pull the books using the gutenberg_download function and save them to a mark_twain data
# frame.
# Output of the above command: a data frame with one row per line of text from the books
# When you analyse any text, there will always be redundant words that can skew the results
# depending on what patterns or trends you are trying to identify. These are called stop words.
# It is up to you if you want to remove stop words, but for this example, let's go ahead and
# remove them.
# First, we need to load the tidytext library.
install.packages("tidytext") #install tidytext library
library(tidytext) # load tidytext library
data(stop_words) # load data from tidytext library
View(stop_words) # view dataframe
# output of the previous command: the stop_words data frame contains 1,149 stop words
tidy_mark_twain <- mark_twain %>%
  unnest_tokens(word, text) %>% # split the text column into one word per row
  anti_join(stop_words)         # remove stop words
tidy_mark_twain %>%
  count(word, sort = TRUE)      # list the most frequent words
library(ggplot2) # load library ggplot2 to plot the graph
freq_hist <- tidy_mark_twain %>%
  count(word, sort = TRUE) %>%        # count word frequencies
  filter(n > 400) %>%                 # keep only words that appear more than 400 times
  mutate(word = reorder(word, n)) %>% # order the bars by frequency
  ggplot(aes(word, n)) +
  geom_col(fill = 'blue') +
  xlab(NULL) +
  coord_flip()                        # horizontal bars for readability
print(freq_hist)
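As a small, optional extension of the example above, the same tidy_mark_twain data frame can be used for a quick sentiment analysis. This sketch assumes the Bing sentiment lexicon that ships with the tidytext package.
library(dplyr)
library(tidytext)
# Count positive and negative words in the Mark Twain corpus using the Bing lexicon
twain_sentiment <- tidy_mark_twain %>%
  inner_join(get_sentiments("bing"), by = "word") %>% # keep only sentiment-bearing words
  count(sentiment, sort = TRUE)
print(twain_sentiment) # totals of positive vs. negative words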
SUMMARY
• Social networking involves using online social media platforms to connect with new
and existing friends, family, colleagues, and businesses.
• Individuals can use social networking to announce and discuss their interests and
concerns with others who may support or interact with them.
• Social network analysis is a method of studying relationships between objects and
events in a social structure.
• Social network analysis (SNA) can be used to improve communities, identify missing
links, and improve connections between groups.
• Text mining uses artificial intelligence (AI) techniques to automatically discover
patterns, trends, and other valuable information in text documents.
• Text mining employs a variety of methodologies to process the text, one of the most
important of these being Natural Language Processing (NLP).
• Fraud detection, risk management, online advertising, customer churn and web
content management are other functions that can benefit from the use of text mining
tools.
References
1. Data Mining Concepts and Techniques by Jiawei Han, Micheline Kamber, Jian Pei
2. Data Mining and Analysis - Fundamental Concepts and Algorithms by Mohammed
J. Zaki and Wagner Meira Jr.
3. Business Analytics using R – A Practical Approach by Dr. Umesh R. Hodeghatta,
Umesha Nayak
4. What is Social Network Analysis by John Scott.
5. The Text Mining Handbook - Advanced Approaches in Analyzing Unstructured
Data by Ronen Feldman, James Sanger
6. Natural Language Processing and Text Mining by Anne Kao and Stephen R.
Poteet
7. Theory and Applications for Advanced Text Mining by Shigeaki Sakurai
8. Text Mining Predictive Methods for Analyzing Unstructured Information by
Sholom M. Weiss, Nitin Indurkhya, Tong Zhang, Fred J. Damerau
Review Questions:
1. Explain what is social networking?
2. What are the advantages and disadvantages of social networking?
3. What is social network analytics?
4. What are the steps involved in social network analytics?
5. What are the applications of social network analytics?
6. What is text mining? What is text analytics? How are they different from each other?
7. Discuss the steps involved in text mining.
8. Explain the steps involved in text analytics.
9. Write the applications of text mining.
10. Write the advantages of text mining.
11. Write at least 6 disadvantages and limitations of text mining.
12. Write algorithms used in text mining.
13. Write an R snippet to demonstrate Text analytics.
Further Readings:
1. Scott J. Social network analysis: a handbook. Newbury Park: Sage, 2000.
2. Carrington PJ, Scott J, Wasserman S. Models, and methods in social network analysis
Cambridge: Cambridge University Press, 2005.
3. Wasserman S, Faust K. Social network Analysis: methods and applications. Cambridge:
Cambridge University Press, 1994.
4. M.E.J Newman. Networks. An Introduction. 1st edition Oxford University Press, 2010
5. Models and Methods in Social Network Analysis by Peter J. Carrington, John Scott,
Stanley Wasserman, Cambridge University Press
Unit VI: Introduction to Big Data Analytics
Topics:
• Introduction to Big Data Analytics
• Applications of Big Data Analytics
• Data Analysis Project Life Cycle
• Overview of Analytics Tools
• Achieving Competitive Advantage with Data Analytics
Objectives:
25. To understand the fundamental concepts of Big Data Platform and its Use
cases.
26. To understand the Big Data challenges & opportunities.
27. To understand Big Data Analytics and its applications
28. To study Data Analytics Project Life Cycle.
29. To study Data Analytics Tools
30. To study data analytics help to achieve competitive advantage
Outcomes:
Students will be able to
25. Understand the fundamental concepts of Big Data Platform and its Use
cases.
26. Understand the Big Data challenges & opportunities.
27. Understand the aspects of big data Analytics with the help of different Big
Data Applications.
28. Study Data Analytics Project Life Cycle.
29. Learn Data Analytics Tools.
30. Study data analytics can be utilized to achieve competitive advantage.
Introduction to the Unit
This unit explains several key concepts to clarify what is meant by Big Data, what big data
analytics is, why advanced analytics are needed. This unit also discusses the applications of Big
Data Analytics. It takes a close look at the Data Analytics Project Life Cycle. Then, it outlines the
challenges organizations contend with and where they have an opportunity to leverage advanced
analytics to create competitive advantage. The unit covers the analytics tools which are used for Data
Analytics.
What is Data?
• Data is a collection of raw facts and figures. The quantities, characters, or symbols on
which operations are performed by a computer, which may be stored and transmitted in the
form of electrical signals and recorded on magnetic, optical, or mechanical recording
media.
• Every day, we create 2.5 quintillion (1 quintillion is 10^18) bytes of data.
• So much that 90% of the data in the world today has been created in the last two years
alone.
• This data comes from everywhere: sensors used to gather climate information, posts to
social media sites, digital pictures and videos, purchase transaction records, and cell phone
GPS signals etc. This data is nothing but big data.
Typical sources of big data include:
• Social Media
• Black box data
The statistics show that 500+ terabytes of new data get ingested into the databases of
social media site Facebook, every day. This data is mainly generated in terms of photo
and video uploads, message exchanges, putting comments etc.
Big data is high-volume, high-velocity, and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and decision making.
Big data refers to datasets whose size and complexity are typically beyond the capacity of
traditional database software tools to store, manage, and process.
Tools such as Apache Storm, Hadoop, MongoDB, Qubole, Cassandra, CouchDB, HPCC, and Statwing are used to work with big data.
Types of Big Data:
Structured:
Any data that can be stored, accessed, and processed in a fixed format is termed
"structured" data. Over time, computer science has achieved great success in developing
techniques for working with this kind of data (where the format is well known in
advance) and deriving value out of it. However, nowadays, we foresee issues when the
size of such data grows to a huge extent, with typical sizes in the range of multiple zettabytes.
An example of structured data is an employee table stored in a database:
3398  Pratibha Bhagwat   Female  Admin    6.7
7465  Pranav Roy         Male    Admin    5
7500  Chittaranjan Das   Male    Finance  5.2
7699  Beena Sane         Female  Finance  4.5
Unstructured:
Any data with unknown form or structure is classified as unstructured data. In addition to
the size being huge, un-structured data poses multiple challenges in terms of its processing
for deriving value out of it. A typical example of unstructured data is a heterogeneous data
source containing a combination of simple text files, images, videos, etc. Nowadays,
organizations have a wealth of data available with them but, unfortunately, they do not know
how to derive value out of it since this data is in its raw or unstructured format.
Other examples of unstructured (often machine-generated) data include:
• Scientific data: Oil and gas exploration, space exploration, seismic imagery,
atmospheric data.
• Digital surveillance: Surveillance photos and video.
• Sensor data: Traffic, weather, oceanographic sensors.
Semi-structured:
Semi-structured data can contain elements of both forms of data. Semi-structured data appears
structured in form, but it is not defined by a rigid schema such as a table definition in a relational
DBMS. An example of semi-structured data is data represented in an XML file.
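For instance, JSON is another common semi-structured format: each record has tagged fields, but records need not share exactly the same fields. The following is a small, hypothetical sketch using the jsonlite package (assumed to be installed).
library(jsonlite)
emp_json <- '[
  {"id": 3398, "name": "Pratibha Bhagwat", "dept": "Admin"},
  {"id": 7465, "name": "Pranav Roy", "dept": "Admin", "city": "Pune"}
]'
# jsonlite flattens the records into a data frame; the missing "city" value becomes NA
fromJSON(emp_json)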
Q.1 Which types of Big Data involve structured, well-organized data that fits neatly into
traditional databases and spreadsheets?
A. Structured Data
B. Semi-structured Data
C. Unstructured Data
D. Metadata
Q.2 Which type of Big Data includes data with a defined structure, but not as rigid as
structured data, often containing tags or elements that provide context?
A. Structured Data
B. Semi-structured Data
C. Unstructured Data
D. Metadata
Q.3 Which type of Big Data refers to data that lacks a specific structure and is often in the
form of text, images, audio, or video?
A. Structured Data
B. Semi-structured Data
C. Unstructured Data
D. Metadata
Volume: The first of the 5 V's of big data, volume represents the amount of data that exists. Volume is the base of big data, as it is the initial size and amount of data that is collected. If the volume of data is large enough, it can be considered big data. What counts as big data is relative, though, and will change depending on the computing power available on the market; it is the size and amount of data that companies manage and analyze.
Value: The most important "V" from the perspective of the business; the value of big data usually comes from insight discovery and pattern recognition that lead to more effective operations, stronger customer relationships, and other clear and quantifiable business benefits.
Variety: Variety refers to the combination or mixture of data types. An organization might
obtain data from a number of different data sources, which may vary in value. Data can
come from sources in and outside an organization as well. The challenge in variety
concerns the standardization and distribution of all data being collected. Variety means the
diversity and range of different data types, including unstructured data, semi-structured
data, and raw data.
Velocity: The speed at which companies receive, store, and manage data; in other words, the speed at which data is coming into the organization. The ability to access and process data arriving at varying velocities is critical. Velocity can be measured, for example, as the number of social media posts or search queries received within a day, an hour, or another unit of time.
Veracity: It refers to the quality and accuracy of data. Gathered data could have missing pieces, may be inaccurate, or may not be able to provide real, valuable insight. Veracity, overall, refers to the level of trust there is in the collected data.
Data can sometimes become messy and difficult to use. A large amount of data can cause
more confusion than insights if it's incomplete. For example, concerning the medical field,
if data about what drugs a patient is taking is incomplete, then the patient's life may be
endangered.
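As a small, hedged illustration of this kind of veracity check (the column names and values are invented for the example), the Python sketch below uses pandas to flag missing and implausible entries before any analysis is attempted.

    import pandas as pd

    # Hypothetical patient-medication records with quality problems.
    records = pd.DataFrame({
        "patient_id": [101, 102, 103, 104],
        "drug":       ["Metformin", None, "Aspirin", "Aspirin"],
        "dose_mg":    [500, 250, -75, None],   # a negative dose is implausible
    })

    # Count missing values per column.
    print(records.isna().sum())

    # Flag rows that are incomplete or contain implausible values.
    suspect = records[records["drug"].isna() | records["dose_mg"].isna() | (records["dose_mg"] < 0)]
    print(suspect)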
Q.1 Which of the following is NOT one of the "5 V's" of Big Data?
A. Volume
B. Variety
C. Viscosity
D. Veracity
Q.2 Which "V" of Big Data refers to the sheer size of data generated and collected?
A. Velocity
B. Value
C. Volume
D. Veracity
Q.3 The "V" of Big Data that deals with the trustworthiness and reliability of data is:
A. Velocity
B. Viscosity
C. Veracity
D. Variety
Challenges of Big Data
• Making big data accessible: As data volume increases, collecting and analysing it becomes increasingly challenging. Organisations must make data accessible and easy to use for users of all skill levels.
• Maintaining data quality: Organisations are spending more time than ever before looking for duplicates, errors, missing values, conflicts, and inconsistencies because there is so much data to maintain.
• Protecting Data and keeping it secure: Concerns about privacy and security increase as
data volume increases. Before utilising big data, organisations will need to work towards
compliance and set up strict data protocols.
• Selecting the appropriate platforms and tools: big data processing and analysis
technologies are always evolving. To function within their current ecosystems and meet
their specific demands, organisations must find the appropriate technology. A flexible
system that can adapt to future infrastructure changes is frequently the best option.
Big Data analytics is the process of collecting, examining, and analysing large amounts of data to
discover market trends, insights, and patterns that can help companies make better business
decisions. It is a process used to extract meaningful insights, such as hidden patterns, unknown
correlations, and customer preferences. Big data analytics is a process of identifying patterns,
trends, and correlations in vast quantities of unprocessed data to support data-driven decision-
making. These procedures apply well-known statistical analysis methods, such as clustering and regression, to larger datasets with the aid of newer tools.
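As a minimal, hedged sketch of one such method (the customer data here is synthetic and the feature names are invented), k-means clustering with scikit-learn might look like this:

    import numpy as np
    from sklearn.cluster import KMeans

    # Synthetic customer data: annual spend and number of visits (invented for illustration).
    rng = np.random.default_rng(seed=42)
    spend = np.concatenate([rng.normal(200, 30, 50), rng.normal(800, 80, 50)])
    visits = np.concatenate([rng.normal(5, 2, 50), rng.normal(25, 5, 50)])
    X = np.column_stack([spend, visits])

    # Group customers into two segments.
    model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
    print(model.cluster_centers_)   # approximate centre of each customer segment
    print(model.labels_[:10])       # segment assigned to the first 10 customers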
Big Data analytics provides various advantages: it can be used for better decision making and for preventing fraudulent activities, among other things.
• Data analytics aids businesses in cost-cutting and the creation of superior, client-focused goods and services.
• Data analytics assists in generating insights that enhance how our society operates. Big
data analytics in the healthcare industry is essential for tracking and analysing individual
patient records as well as for monitoring results on a global level. Big data helped health
ministries in each country's government decide how to handle vaccines during the COVID-
19 pandemic and come up with strategies for preventing outbreaks in the future.
customers will result, and this creates an adverse overall effect on business success.
The use of big data allows businesses to observe various customer-related patterns and trends. Observing customer behavior is important for fostering loyalty.
6. Solve Advertisers' Problems and Offer Marketing Insights: Big data analytics can help change all business operations. This includes the ability to match customer expectations, change the organization's product line, and ensure that marketing campaigns are powerful.
7. Driver of Innovations and Product Development: Another huge advantage of big
data is the ability to help organizations innovate and redevelop their products.
8. Product development: Using information gathered from client requirements and
wants makes developing and marketing new goods, services, or brands much simpler.
Businesses may better analyse product viability and stay current on trends with the use
of big data analytics.
9. Strategic business decisions: The capacity to continuously examine data aids firms in
reaching quicker and more accurate conclusions about issues like cost and supply chain
efficiency.
10. Customer experience: Data-driven algorithms provide a better customer experience,
which aids marketing efforts (targeted advertisements, for instance) and boosts
consumer happiness.
11. Risk management: Companies may identify risks by examining data trends and then devise ways to mitigate them.
• Education: Based on student requirements and demand, big data enables educational technology businesses and institutions to create new curricula and enhance existing ones.
• Medical care: Keeping track of individuals' medical history enables clinicians to identify
and stop illnesses.
• Government: To better manage the public sector, big data may be utilised to gather
information from CCTV and traffic cameras, satellites, body cameras and sensors, emails,
calls, and more.
• Banking: Data analytics can assist in detecting and monitoring money laundering and other unauthorised activity.
3. Customer Acquisition and Retention: Customer information plays a significant role in marketing strategies that aim to improve customer happiness through data-driven initiatives. For Netflix, Amazon, and Spotify, personalization engines help create better consumer experiences and foster client loyalty.
4. Targeted Ads: To construct targeted ad campaigns for customers on a bigger scale and
at the individual level, personalised data about interaction patterns, order histories, and
product page viewing history may be quite helpful.
5. Product Development: It can produce insights on product viability, performance metrics,
development choices, etc., and direct changes that benefit the customers.
6. Price optimization: With the use of various data sources, pricing models may be built and utilised by merchants to increase profits.
7. Supply Chain and Channel Analytics: Predictive analytical models support B2B supplier networks, proactive replenishment, route optimisation, inventory management, and delivery delay alerts.
8. Risk management: It assists in identifying new hazards using data trends in order to
create efficient risk management solutions.
9. Better Decision-Making: Enterprises may improve their decision-making by using the
insights that can be gleaned from the data.
4. Introduction
Data Analytics Project Lifecycle defines the roadmap of how data is generated, collected,
processed, used, and analysed to achieve business goals.
It offers a systematic way to manage data for converting it into information that can be
used to fulfil organizational and project goals.
Data analytics mainly involves six important phases that are carried out in a cycle - Data
discovery, Data preparation, Planning of data models, the building of data models,
communication of results, and operationalization. The six phases of the data analytics
lifecycle are followed one phase after another to complete one cycle. It is interesting to
note that these six phases of data analytics can follow both forward and backward
movement between each phase and are iterative.
The lifecycle of data analytics provides a framework for the best performances of each
phase from the creation of the project until its completion. This framework was built by a
large team of data scientists with much care and experiments. The key stakeholders in data
science projects are business analysts, data engineers, database administrators, project
managers, executive project sponsors, and data scientists.
The Data Analytics Lifecycle is designed for Big Data problems and data science projects. The cycle is iterative to reflect a real project. To address the distinct requirements for performing analysis on Big Data, a step-by-step methodology is needed to organize the activities and tasks involved in acquiring, processing, analyzing, and repurposing data.
[Figure: The Data Analytics Lifecycle, showing its six phases: Discovery, Data Preparation, Model Planning, Model Building, Communicate Results, and Operationalize.]
• Discovery: In this phase, stakeholder teams identify, investigate, and understand the problem.
➢ Find out the data sources and data sets required for the project.
➢ The data science team learns about and investigates the problem.
➢ Examine business trends and case studies of similar data analytics projects.
➢ Study the domain of the business and its industry.
➢ Develop context and understanding.
➢ Assess the in-house resources, the in-house infrastructure, the total time involved, and the technology requirements.
➢ Come to know which data sources are needed and available for the project.
➢ The team formulates an initial hypothesis for resolving the business challenges in terms of the current market scenario, which can later be tested with data.
• Data Preparation: After the data discovery phase, data is prepared by transforming it from a legacy system into an analytics-ready form using a sandbox platform. A sandbox is a scalable platform commonly used by data scientists for data preprocessing; the analytic sandbox is where data is extracted, loaded, and transformed.
➢ Steps to explore, preprocess, and condition data prior to modeling and analysis.
➢ It requires the presence of an analytic sandbox, in which the team extracts, loads, and transforms data to get it into the sandbox.
➢ Data preparation tasks are likely to be performed multiple times and not in a predefined order.
➢ Several tools commonly used are Hadoop, Alpine Miner, OpenRefine, etc.
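As a minimal, hedged sketch of the kind of exploration and conditioning done in this phase (the file name and column names below are invented for illustration), the steps might look like this in Python with pandas:

    import pandas as pd

    # Hypothetical raw extract loaded into the analytic sandbox (assumed file name).
    raw = pd.read_csv("sales_extract.csv")

    # Explore: understand shape, types, and obvious quality issues.
    print(raw.shape)
    print(raw.dtypes)
    print(raw.isna().sum())

    # Condition: drop duplicates, fix types, and handle missing values.
    clean = (
        raw.drop_duplicates()
           .assign(order_date=lambda d: pd.to_datetime(d["order_date"], errors="coerce"))
           .dropna(subset=["customer_id", "amount"])
    )

    # Transform: derive features the models will use later, then save the prepared data.
    clean["month"] = clean["order_date"].dt.month
    clean.to_csv("sales_prepared.csv", index=False)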
• Model Planning: the team determines the methods, techniques, and workflow it intends to
follow for the subsequent model building phase.
➢ Team explores data to learn about relationships between variables and subsequently,
selects key variables and the most suitable models.
➢ In this phase, the data science team develops data sets for training, testing, and
production purposes.
➢ Team builds and executes models based on the work done in the model planning
phase.
➢ Several tools commonly used for this phase are MATLAB and STATISTICA.
• Model Building: Developing datasets for training, testing, and production purposes. The team also considers whether its existing tools will be sufficient for running the models or whether it needs a more robust environment for executing them.
➢ Team develops datasets for testing, training, and production purposes.
➢ Team also considers whether its existing tools will suffice for running the models or
if they need a more robust environment for executing models and workflows.
➢ Free or open-source tools: R and PL/R, Octave, WEKA.
➢ Commercial tools: MATLAB and STATISTICA.
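As a minimal, hedged sketch of this phase (the data is synthetic and the feature names are invented), building and evaluating a simple model on separate training and test sets could look like this:

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score

    # Synthetic dataset: two features and a binary target (invented for illustration).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 2))
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=500) > 0).astype(int)

    # Split into training and test sets, as the lifecycle prescribes.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    # Fit the model on training data and evaluate on held-out test data.
    model = LogisticRegression().fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))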
• Communicate results: In this phase, the team, in collaboration with major stakeholders,
determines if the results of the project are a success or a failure based on the criteria
developed in Phase 1.
➢ The result is scrutinized by the entire team along with its stakeholders to draw
inferences on the key findings and summarize the entire work done.
➢ After executing the model team needs to compare outcomes of modeling to criteria
established for success and failure.
➢ The team considers how best to articulate findings and outcomes to various team members and stakeholders, taking into account caveats and assumptions.
➢ Team should identify key findings, quantify business value, and develop narrative
to summarize and convey findings to stakeholders.
• Operationalize: Here, the team delivers final reports, briefings, code, and technical
documents. In addition, the team may run a pilot project to implement the models in a
production environment.
➢ The team communicates benefits of project more broadly and sets up pilot project
to deploy work in controlled way before broadening the work to full enterprise of
users.
➢ This approach enables the team to learn about performance and related constraints
of the model in the production environment on small scale and make adjustments
before full deployment.
➢ Free or open-source tools – Octave, WEKA, SQL, MADlib.
While proceeding through these six phases, the various stakeholders that can be involved in the
planning, implementation, and decision-making are data analysts, business intelligence analysts,
database administrators, data engineers, executive project sponsors, project managers, and data
scientists. All these stakeholders are rigorously involved in the proper planning and completion of
the project, keeping in note the various crucial factors to be considered for the success of the
project.
The Data Analytics Project Life Cycle goes through six phases, and each phase uses specific tools to process the data.
2. Alpine Miner provides a graphical user interface (GUI) for creating analytic workflows,
including data manipulations and a series of analytic events such as staged data-mining
techniques (for example, first select the top 100 customers, and then run descriptive
statistics and clustering) on PostgreSQL and other Big Data sources.
3. OpenRefine (formerly called Google Refine) is “a free, open source, powerful tool for
working with messy data.” It is a popular GUI-based tool for performing data
transformations, and it’s one of the most robust free tools currently available.
4. Like OpenRefine, Data Wrangler is an interactive tool for data cleaning and
transformation. Wrangler was developed at Stanford University and can be used to perform
many transformations on a given dataset. In addition, data transformation outputs can be
put into Java or Python. The advantage of this feature is that a subset of the data can be
manipulated in Wrangler via its GUI, and then the same operations can be written out as
Java or Python code to be executed against the full, larger dataset offline in a local analytic
sandbox.
1. R has a complete set of modelling capabilities and provides a good environment for
building interpretive models with high-quality code. In addition, it can interface with
databases via an ODBC connection and execute statistical tests and analyses against Big
Data via an open-source connection. These two factors make R well suited to performing
statistical tests and analytics on Big Data. As of this writing, R contains nearly 5,000
packages for data analysis and graphical representation. New packages are posted
frequently, and many companies are providing value-add services for R (such as training,
instruction, and best practices), as well as packaging it in ways to make it easier to use and
more robust. This phenomenon is similar to what happened with Linux in the late 1980s
and early 1990s, when companies appeared to package and make Linux easier for
companies to consume and deploy. Use R with file extracts for offline analysis and optimal
performance and use RODBC connections for dynamic queries and faster development.
2. SQL Analysis services can perform in-database analytics of common data mining
functions, involved aggregations, and basic predictive models.
3. SAS/ACCESS provides integration between SAS and the analytics sandbox via multiple data connectors such as ODBC, JDBC, and OLE DB. SAS itself is generally used on file extracts, but with SAS/ACCESS, users can connect to relational databases (such as Oracle or Teradata), data warehouse appliances (such as Greenplum or Aster), files, and enterprise applications (such as SAP and Salesforce.com).
1. SAS Enterprise Miner: allows users to run predictive and descriptive models based on
large volumes of data from across the enterprise. It interoperates with other large data
stores, has many partnerships, and is built for enterprise-level computing and analytics.
2. SPSS Modeler: offers methods to explore and analyze data through a GUI developed by
IBM.
3. Alpine Miner: provides a GUI front end for users to develop analytic workflows and
interact with Big Data tools and platforms on the back end.
4. MATLAB: provides a high-level language for performing a variety of data analytics,
algorithms, and data exploration.
5. STATISTICA and Mathematica are also popular and well-regarded data mining and
analytics tools.
4. Python is a programming language that provides toolkits for machine learning and analysis, such as scikit-learn, NumPy, SciPy, pandas, and related data visualization using matplotlib (a brief sketch follows this list).
5. SQL in-database implementations, such as MADlib, provide an alternative to in-memory desktop analytical tools. MADlib provides an open-source machine learning library of algorithms that can be executed in-database, for PostgreSQL or Greenplum.
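Following up on the Python entry above, here is a minimal, hedged sketch (synthetic data only) that combines NumPy, pandas, SciPy, and matplotlib to fit and visualize a simple regression:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from scipy import stats

    # Synthetic data: a noisy linear relationship (invented for illustration).
    rng = np.random.default_rng(1)
    df = pd.DataFrame({"x": rng.uniform(0, 10, 100)})
    df["y"] = 3.0 * df["x"] + rng.normal(scale=2.0, size=100)

    # SciPy: fit a simple linear regression.
    result = stats.linregress(df["x"], df["y"])
    print("estimated slope:", result.slope)

    # matplotlib: visualize the data and the fitted line.
    xs = np.linspace(0, 10, 50)
    plt.scatter(df["x"], df["y"], s=10, label="data")
    plt.plot(xs, result.intercept + result.slope * xs, color="red", label="fit")
    plt.legend()
    plt.savefig("regression_fit.png")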
Tableau is a powerful data visualization tool that enables users to create interactive
and visually appealing dashboards and reports. It supports various data sources and
simplifies the process of data exploration, making it suitable for both beginners and
advanced users.
3. Mahout: Provides analytical tools.
4. HBase: Provides real-time reads and writes.
Once Hadoop processes a dataset, Mahout provides several tools that can analyse the data
in a Hadoop environment. For example, a k-means clustering analysis can be conducted
using Mahout.
Differentiating itself from Pig and Hive batch processing, HBase provides the ability to
perform real-time reads and writes of data stored in a Hadoop environment.
NoSQL
NoSQL (Not only Structured Query Language): is a term used to describe those data stores that
are applied to unstructured data. As described earlier, HBase is such a tool that is ideal for storing
key/values in column families. In general, the power of NoSQL data stores is that as the size of
the data grows, the implemented solution can scale by simply adding additional machines to the
distributed system.
1. Key/value stores: contain data (the value) that can be simply accessed by a given identifier
(the key). E.g., Redis, Voldemort.
2. Document stores are useful when the value of the key/value pair is a file and the file itself is self-describing (for example, JSON or XML), e.g., CouchDB, MongoDB (a short sketch of these first two categories follows this list).
3. Column family stores are useful for sparse datasets, records with thousands of columns
but only a few columns have entries. E.g., Cassandra, HBase
4. Graph databases are intended for use cases such as networks, where there are items
(people or web page links) and relationships between these items. E.g., FlockDB, Neo4j
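As a small, hedged sketch of the first two categories (it assumes Redis and MongoDB servers are running locally on their default ports, and the key, database, and collection names are invented), storing a key/value pair and a self-describing document from Python might look like this:

    import redis                      # key/value store client
    from pymongo import MongoClient   # document store client

    # Key/value store: a value accessed by a simple key.
    kv = redis.Redis(host="localhost", port=6379)
    kv.set("session:42", "active")
    print(kv.get("session:42"))

    # Document store: the value is a self-describing JSON-like document.
    client = MongoClient("localhost", 27017)
    orders = client["shop"]["orders"]          # assumed database and collection names
    orders.insert_one({"order_id": 42, "items": ["book", "pen"], "total": 12.5})
    print(orders.find_one({"order_id": 42}))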
3. Apache Spark: Spark is a fast and general-purpose distributed computing engine that also supports big data processing. It enables data processing in real-time and batch modes and integrates well with Hadoop and other data sources (a brief PySpark sketch follows this list).
4. KNIME: KNIME (Konstanz Information Miner) is an open-source data analytics platform
that offers a graphical user interface for building data workflows. It supports integration
with various data sources and analysis tools.
5. SAS: SAS (Statistical Analysis System) is a software suite used for advanced analytics,
business intelligence, and data management. It has been widely used in industries like
finance, healthcare, and government.
6. IBM SPSS: SPSS is a statistical software package that allows users to analyze data using
various statistical methods. It is commonly used in social sciences, market research, and
other fields.
7. QlikView: QlikView is a data visualization and business intelligence tool that allows users
to create interactive dashboards and reports for data analysis.
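As a minimal, hedged sketch of the Spark entry above (the file name and column names are invented, and the pyspark package is assumed to be installed), batch processing of a CSV file might look like this:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    # Start a local Spark session.
    spark = SparkSession.builder.appName("sales-summary").getOrCreate()

    # Read a hypothetical CSV extract into a distributed DataFrame.
    sales = spark.read.csv("sales_extract.csv", header=True, inferSchema=True)

    # Aggregate total revenue per region (assumed column names).
    summary = sales.groupBy("region").agg(F.sum("amount").alias("total_amount"))
    summary.show()

    spark.stop()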
2. Stream analytics tools: Systems that filter, aggregate, and analyse data that might be
stored in different platforms and formats, such as Kafka.
3. Distributed storage: Databases that can split data across multiple servers and can identify
lost or corrupt data, such as Cassandra.
4. Predictive analytics hardware and software: Systems that process large amounts of
complex data, using machine learning and algorithms to predict future outcomes, such as
fraud detection, marketing, and risk assessments.
5. Data mining tools: Programs that allow users to search within structured and unstructured
big data.
6. Data warehouses: Storage for large amounts of data collected from many different
sources, typically using predefined schemas.
These tools cater to different levels of data analysis complexity and user requirements. The choice
of the right data analytics tool depends on factors such as the size of the dataset, the level of
technical expertise, the specific analysis needs, and the budget available for the tool. Always
consider the features, scalability, ease of use, and integration capabilities before selecting a data
analytics tool for your needs.
Business Analytics
• Business Analytics refers to the practice of analyzing and interpreting data to gain
insights and make informed business decisions.
• Business analytics is the process of gathering data, measuring business
performance, and producing valuable conclusions that can help companies make
informed strategic decisions on the future of the business, through the use of various
statistical methods and techniques.
• Business Analytics assumes that, if a sufficient set of analytics capabilities exists within an organization, these capabilities will result in the generation of organizational value and competitive advantage. For example, customer intelligence is one of the sources of competitive advantage organizations can derive from their customer relationship management (CRM) system using business analytics.
Competitive Advantage
• The competitive advantage is what distinguishes a company's goods or services from
all other options available to a customer.
• It enables a company to manufacture goods or services more efficiently or at a lower
cost than its competitors.
• This may result in the company gaining a larger market share, higher sales, and a larger customer base than its competitors.
• It distinguishes a company and its business model from its competitors.
• It might be anything from their products to their service to their reputation to their
location.
• Positive business outcomes of having a competitive advantage include implementing
stronger business strategies, warding off competitors, and capturing a larger market
share within their consumer markets.
• Improved decision-making, enhanced customer experience, and increased revenue and profitability are some benefits of competitive advantage.
4. For a business to gain a competitive advantage, it must articulate the benefits it brings to its target market in ways that other businesses cannot.
▪ Analytics gives companies insight into their customers’ behavior and needs,
employee retention, etc.
▪ Business Analytics help businesses stay ahead of their competitors by providing
real-time data analysis.
▪ It also makes it possible for a company to understand its brand's public opinion,
follow the results of various marketing campaigns, and strategize how to create a
better marketing strategy to nurture long and fruitful relationships with its
customers.
▪ Business analytics helps organizations to know where they stand in the industry or
a particular niche and provides the company with the needed clarity to develop
effective strategies to position itself better in the future.
▪ For a company to remain competitive in the modern marketplace that requires
constant change and growth, it must stay informed on the latest industry trends and
best practices.
▪ If the management team is analytics-impaired, then that business is at risk. Predictive business analytics is arguably the next wave for organizations seeking to compete successfully, and it offers them a clear advantage.
SUMMARY
• Big data is any data that goes beyond the human and technical infrastructure needed to support its storage, processing, and analysis.
• Big data refers to large amounts of data that can inform analysts of trends and
patterns.
• Velocity, Volume, Value, Veracity, and Variety are the characteristics of Big Data.
• Big Data analytics integrates structured and unstructured data with real-time feeds and queries, opening new paths to innovation and insight.
• Big Data Analytics has many applications such as Detecting Fraud in Banking
applications, Customer relationship management, Marketing and Advertising etc.
• Data Analytics Lifecycle defines the roadmap of how data is generated, collected,
processed, used, and analyzed to achieve business goals.
• Data Analytics Project Lifecycle consists of 6 phases.
• Discovery, Data Preparation, Model Planning, Model Building, Communicate
Results, operationalize are the phases of Data Analytics Project Lifecycle.
• Data analytics tools are software applications that collect and analyze data about a
business, to improve processes and identify hidden patterns to make data-driven
decisions.
• Several free and commercial tools are available for exploring, conditioning,
modelling, and presenting data.
• OpenRefine, Data Wrangler, R and PL/R, Octave, and so on are free, open-source Data Analytics tools.
• SAS, MATLAB, STATISTICA, SPSS Modeler, and Alpine Miner are examples of commercial software used for Data Analytics.
• Data visualization tools are available to visualize the results of data analysis, such as Tableau, R's plotting libraries, and Python libraries.
• To be competitive, businesses must find new ways to minimize expenses, better
allocate resources, and devise ways to reach out to each of their clients personally.
• Business Analytics can help companies make better use of their resources, decrease
expenses, and personalize their offerings to gain a competitive advantage.
• This will result not only from being able to predict outcomes but also to reach
higher to optimize the use of their resources, assets, and trading partners.
References
1. Big Data Imperatives - Enterprise Big Data Warehouse, BI Implementations and Analytics
by Soumendra Mohanty, Madhu Jagadeesh, Harsha Srivatsa
2. Big Data Analytics made easy by Y Lakshmi Prasad
3. Introduction to Big Data Analytics by EMC Education
4. Big Data Analytics by Dr. Anil Kumar K.M
5. Data Science from Scratch by Steven Cooper
6. Business Analytics using R – A Practical Approach by Dr. Umesh R. Hodeghatta, Umesha
Nayak
7. Business Analytics Principles, Concepts, and Applications, What, Why, and How, Marc J.
Schniederjans, Dara G. Schniederjans, Christopher M. Starkey
8. Data Analytics made Accessible by Dr. Anil Maheshwari
9. Data Science and Big Data Analytics by EMC Education Services.
10. Data Mining Concepts and Techniques by Jiawei Han, Micheline Kamber, Jian Pei
11. Data Mining and Analysis - Fundamental Concepts and Algorithms by Mohammed J. Zaki
and Wagner Meira Jr.
Review Questions
1. What is big data? Give examples of big data.
2. What are the applications of big data?
3. What is the importance of big data?
4. Which tools are used for handling big data?
5. Explain data analytics project life cycle with diagram.
6. What are the challenges of big data?
7. Explain at least two tools used in each phase of the data analytics project life cycle.
8. Explain applications of Big Data Analytics.
9. Explain advantages of big data analytics.
10. Explain 5 V’s of big data.
11. What are the types of big data? Explain with the examples.
Further Readings:
1. Madhu Jagadeesh, Soumendra Mohanty, Harsha Srivatsa, “Big Data Imperatives:
Enterprise Big Data Warehouse, BI Implementations and Analytics”, 1st Edition, Apress
2. Frank J. Ohlhorst, “Big Data Analytics: Turning Big Data into Big Money”, Wiley
Publishers
3. Cristian Molaro, Surekha Parekh, Terry Purcell, “DB2 11: The Database for Big Data &
Analytics”, MC Press
4. Tom White, “Hadoop –The Definitive Guide, Storage and analysis at internet scale”, SPD,
O’Reilly.
5. DT Editorial Services, “Big Data, Black Book-Covers Hadoop2, MapReduce, Hive,
YARN, Pig, R and Data Visualization” Dreamtech Press
6. Chris Eaton, Dirk Deroos et. al., “Understanding Big data”, Indian Edition, McGraw Hill.
7. Chen, H., Chiang, R.H. and Storey, V.C., 2012. Business intelligence and analytics: from
big data to big impact. MIS quarterly, pp.1165-1188.