
DATA ANALYTICS

(FOR PRIVATE CIRCULATION ONLY)


2023
Programme Coordinator

Dr. Chandan Ambatkar

Course Writer

Dr. Kavita Divekar

Editor

Mr. Yogesh Bhosle

Published by Symbiosis Skills and Professional University (SSPU), Pune


July, 2023

Copyright © 2023 Symbiosis Open Education Society


All rights reserved. No part of this book may be reproduced, transmitted or utilised in any form or
by any means, electronic or mechanical, including photocopying, recording, or by any
information storage or retrieval system without written permission from the publisher.

Acknowledgement
Every attempt has been made to trace the copyright holders of the materials reproduced in this
book. Should any infringement have occurred, SSPU apologises for the same and will be pleased
to make the necessary corrections in future editions of this book.
Unit 1
Overview of Business Data Analytics and Decision Making

Topics:

• Introduction to Data Analytics.


• Concept of Business Data Analytics.
• Importance of Business Data Analytics
• Types of Business Data Analytics
• Scope of Business Data Analytics in Decision Making
• Database Management System & Data Warehouse
• Data Science & Data-driven decision making

Objectives:
1. To understand the fundamental concepts of data analytics and business
analytics, their importance, and their applications in organisations
2. To understand the types of business data analytics
3. To understand the scope of business data analytics in decision making
4. To introduce the concepts, techniques, and applications of data warehousing
5. To understand the term Data Science and how it is used in decision making

Outcomes:
1. Understand the role of data analytics in business and the importance of
business data analytics, its applications in organizations.
2. Understand types of business data analytics
3. Understand how business analytics helps organisations in decision making.
4. Understand Data warehouse concepts, architecture, and models.
5. Understand the concept of Data Science and data driven decision making

Introduction to the Unit

This unit aims to give a complete overview of Data Analytics and Business Data Analytics. It
covers the types of Business Analytics and their applications in organizations. It also deals with
the architecture and models of a data warehouse. Finally, it explains the terms data science and
data-driven decision making.

Introduction to Data Analytics

What is Data?
• Data is nothing but raw facts and figures. Numbers, characters, symbols, special
symbols, etc. form data. Generally, a computer user enters data in the form of
numbers, text, clicks, or in any other format. Computers perform operations on data,
which are stored and transmitted in the form of electrical signals and recorded on
magnetic, optical, or mechanical recording media such as hard disks, pen drives,
CDs, etc. But data by itself does not help any organization; it has to be processed so
that meaningful information can be generated from it.

Nowadays, organizations generate huge amounts of data, which is processed and converted into
information. Even this large volume of information is not directly useful to top-level
management for taking decisions. The information should be presented in a form that lets
managers take decisions quickly, and for that analytics is useful.

What is Analytics?
• Analytics is the systematic computational analysis of data or statistics, or the information
resulting from such systematic analysis.
• Analytics focuses on the implications of data: the decisions, implementations, or
actions that should be taken as an outcome.
• Analytics is a field of computer science that uses mathematics, statistics, and machine
learning to find significant patterns in data. Data analytics involves going through huge
data sets to discover, interpret, and share new insights and knowledge.

What is Data Analytics?
• In today's digital world, data gets generated in huge amounts, e.g., sensor data, CCTV
data, weather data, IoT-generated data, etc., but if it is in an unstructured or semi-structured
format, it may not be of much use. To make it useful, it must be converted into an
appropriate format so that the required and meaningful information can be extracted
from it. This process can be called data analysis. The purpose of data
analytics is to extract useful information from data and take decisions based upon
that analysis.
• The process of reviewing, cleansing, and manipulating data with the objective of
identifying usable information, informing conclusions, and assisting decision-making
is known as data analysis. Data analysis is important in today's business environment
since it helps businesses make more scientific decisions and run more efficiently.
• Data analytics is the analysis of data, whether huge or small, to understand it and see
how to use the knowledge hidden within it.
• Data analytics is defined as “a process of cleaning, transforming, and modeling data to
discover useful information for business decision making”.
• Data analytics converts raw data into actionable insights using statistical methods,
algorithms, Analytics tools, technologies, and processes used to find trends and solve
problems by using data.
• Data analytics can figure out and shape business processes, improve decision-making,
and nurture business growth.
• E.g., you enter a grocery store and find that your regular monthly purchases are
already selected and kept aside for you; you can take all of them, remove some
from the list, or simply add a few more items. Isn't that a pleasant surprise?
This is done with the help of data analytics used by the owner of the grocery store: he
has the past details of the purchases made by you, that history is analyzed using
analytics techniques and algorithms, your purchase patterns are identified, and you get
the list. (A minimal R sketch of this idea appears after these examples.)
• E.g., you visit an online store and search for T-shirts of a specific brand. You
may forget about it, but your search data is used to show you more and more suggestions in
the browser whenever you visit other sites too. Even when you play an online game,
you find your searched products; this is also a part of data analytics. The use of search
data to explore interactions among web searchers, the search engine (e.g.,
Google, Bing, etc.) and the content during searching is called search analytics, which is
further used in Search Engine Marketing (SEM).
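
As a rough illustration of the grocery-store example above, the short R sketch below (toy data and hypothetical column names, not a prescribed method) summarizes a customer's purchase history to find the regularly bought items:

# Toy purchase history for one customer (hypothetical data)
purchases <- data.frame(
  customer = "C001",
  item     = c("milk", "bread", "rice", "milk", "bread", "milk"),
  month    = c(1, 1, 1, 2, 2, 3)
)

# Descriptive step: how often has each item been bought?
item_counts <- table(purchases$item)

# A simple "regular purchase" rule: items bought at least twice in the period
regular_items <- names(item_counts[item_counts >= 2])
print(regular_items)   # "bread" "milk"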

Check your progress:


1. Data analysis is a process of:
A. Inspecting data
B. Cleaning data
C. Transforming data
D. All of the above
2. Data analytics plays a vital role in decision making.
A. True
B. False
3. Data analysis is a process to draw insights from numerous data sets.
A. True
B. False
4. Which of the following is / are the significance of data analytics?
A. To collect the data, store it, and analyse it.
B. To analyse the data to get fruitful insights and hidden information.
C. Organizations can utilize data analytics to gain control over their data.
D. All of the above

Concept of Business Analytics:


• Business analytics is the analysis of an organization’s raw data and the conversion of that
analysis into information that is relevant to the organization’s vision, mission, and goals.
• The data is processed using various tools and procedures, and various patterns and
correlations are mapped out before predictions are made based on the data collected from
various sources. Organizations create plans to increase their sales and profits based on these
predicted outcomes.
In simple terms, business analytics can be explained as: -

• Collecting historical data and records.
• Analyzing the data collected to find out appropriate patterns and trends.
• Using these trends to design improved strategies and efficient business decisions.
• Business analytics is the application of data analytics to business.
• It is a set of disciplines and technologies to solve business problems using data analysis,
statistical models, and other quantitative methods.
• Gartner says, “Leading organizations in every industry are wielding data and analytics
as competitive weapons”.
Some examples of Business Analytics:
• E.g., offering specific discounts to different classes of travellers based on the amount of
business they offer or have the potential to offer.

Benefits of Business Data Analytics:


• Data analytics is significant since it aids in performance optimization for firms. By
incorporating analytics into their business strategy, firms can find more cost-effective
ways of doing business and cut expenses.
• An organization can also use data analytics to make better business decisions and to
analyze customer trends and satisfaction, which can point towards new and better
products and services.
• Business analysts understand the organization’s goals and use analytics to guide data-
driven business decisions.
• The past data of the organization, the current market situation, and product performance are
leveraged to predict future trends and accordingly design strategies.
• Business analytics can be used to track the progress of the organization over the years.
It can also identify the performance of a product or a strategy in the market.
• Based on these reports, it can be deduced what is working well for the organization and
what isn’t. As a result, updated decisions and methodologies can be implemented to
improve performance.

• Reduce risks: One main advantage of business analytics is its ability to mitigate
risks. It helps in tracking the mistakes made by the organization in the past and
understanding the factors that led to their occurrence.
• With this knowledge, analysis is done to predict the probability of similar risks
recurring in the future, so that the corresponding measures
can be taken to prevent them.
• Enhance customer experience: All successful businesses have figured out the secret
to success – making their customers happy! Organizations today, identify their
customer base, understand their needs and behaviours, and correspondingly cater
to them.
• This is possible because of the statistical tools and models used in business
analytics.

Some of the disadvantages of Business Analytics are:

• Lack of Commitment: The business analytics process can be extremely costly as
well as time-consuming. Although the solutions can be achieved, the time
and cost factors leave people feeling disinterested and therefore less trusting.
This can eventually lead to the complete failure of the business.
• Low-Quality Data: Organizations have a lot of data. But the real question is how
much of this data is correct and accessible. Having poorly constructed, heavily
complicated, or insufficient data is a huge limitation and can hinder the business
analytics processes.
• Privacy Concerns: Companies collect customer data to analyse it and make better
business decisions. But this can lead to a breach of the customer’s privacy.
There have been instances when one company shares its collected user data with
another company for mutual benefit. This data can be used against a particular user
in any way possible. Therefore, it is essential for organizations to collect only vital
information and work on maintaining the security and confidentiality of the data
collected.

How does Business Analytics Work?
• Business analytics is the analysis of an organization’s raw data and the conversion
of that analysis into information that is relevant to the organization’s vision and
objectives.

• The data is processed using various tools and procedures, and various patterns and
correlations are mapped out before predictions are made based on the data acquired.
Organizations create plans to boost their sales and profits based on these anticipated
outcomes.

• Business analysts focus on driving practical business changes in an organization.

Types of Business Data Analytics:


a. Descriptive analytics describes what has happened over a given period. It is the
interpretation of historical data to identify trends and patterns. These techniques
summarize large datasets to describe outcomes to stakeholders. This process
requires the collection of relevant data, processing of the data, data analysis, and
data visualization, and it provides essential insight into past performance.
Have the number of views gone up? Are sales stronger this month than last?

Descriptive analytics explains the patterns hidden in data. These patterns could be
the number of market segments, or sales numbers based on regions, or groups of
products based on reviews, software bug patterns in a defect database, behavioral
patterns in an online gaming user database, and more. These patterns are purely
based on historical data.

b. Diagnostic analytics focuses more on why something happened. It is the
interpretation of historical data to determine why something has happened. These
techniques supplement more basic descriptive analytics: they take the findings
from descriptive analytics and dig deeper to find the cause.

By developing key performance indicators (KPIs), descriptive analytics
strategies can help track successes or failures. Metrics such as return on investment
(ROI) are used in many industries, and specialized metrics are developed to track
performance in specific industries.
In a diagnostic analytics strategy, the performance indicators are further
investigated to discover why they got better or worse. This generally occurs in three
steps:
– Identify anomalies in the data. These may be unexpected changes in a metric or a particular market.
– Collect data related to these anomalies.
– Apply statistical techniques to find relationships and trends that explain these anomalies.

e. g., Did the weather affect ice cream sales? Did that latest marketing campaign
impact sales?

c. Predictive analytics moves to what is likely going to happen in the near term.
It is the use of statistics to forecast future outcomes. These techniques use historical
data to identify trends and determine whether they are likely to recur. Predictive analytical
tools provide valuable insight into what may happen in the future and draw on a variety
of statistical and machine learning techniques, such as
neural networks, decision trees, and regression. (A small R sketch contrasting
descriptive and predictive analytics follows this list.)

e.g., What happened to sales the last time we had a hot summer? How many weather
models predict a hot summer this year?

d. Prescriptive analytics suggests a course of action. Prescriptive analytics helps
answer questions about what should be done. By using insights from predictive
analytics, data-driven decisions can be made. This allows businesses to make
informed decisions in the face of uncertainty. Prescriptive analytics techniques rely
on machine learning strategies that can find patterns in large datasets. By analyzing
past decisions and events, the likelihood of different outcomes can be estimated.
It also involves the application of testing and other techniques to determine which option will
yield the best result in each scenario.
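
As referenced above, the short R sketch below (toy monthly sales figures, purely illustrative) contrasts a descriptive summary of past sales with a simple predictive step using linear regression; it is a sketch of the idea, not a prescribed method.

# Toy monthly sales data (hypothetical values)
sales <- data.frame(
  month = 1:12,
  units = c(120, 130, 125, 140, 150, 160, 158, 170, 165, 180, 190, 195)
)

# Descriptive analytics: what has happened?
summary(sales$units)        # minimum, median, mean and maximum of past sales
mean(diff(sales$units))     # average month-over-month change

# Predictive analytics: what is likely to happen next?
model <- lm(units ~ month, data = sales)           # fit a simple linear trend
predict(model, newdata = data.frame(month = 13))   # forecast for month 13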

Know your progress:


1. _________ techniques rely on machine learning strategies that can find patterns
in large datasets.
A. Prescriptive analytics
B. Descriptive analytics
C. Predictive analytics
D. Diagnostic analytics
2. Descriptive analytics answers the question, "What happened?"
A. True
B. False

Scope of business analytics in decision making process


• Businesses make predictions about the future with the help of data analysis and
visualization. These insights aid in making plans and decisions. Business analytics
spurs growth, measures performance, finds hidden patterns, produces leads, and helps
expand the business appropriately.
• Business analytics is the process of analyzing data and drawing conclusions from
it in order to enhance business operations, improve decision-making, lower costs,
and boost profitability. To make useful inferences about the data, it entails
gathering, cleaning, organizing, visualizing, and interpreting the data.

• Forecasting and Prediction: Businesses may benefit from the use of business
analytics to estimate future results and make predictions based on previous data.
This might entail forecasting future sales, looking for expansion prospects, and
identifying market trends.

• Customer Analytics: Businesses may gain a better understanding of their
consumers' needs, behaviors, and preferences by researching customer data. This
information may be used to increase consumer engagement and loyalty, optimize
marketing initiatives, and uncover new sources of income.

• Operational Analytics: Operational data may be examined using business
analytics to identify potential for process improvement and efficiency
augmentation. This can help businesses cut costs, improve quality, and boost
productivity.

• Financial Analytics: Organizations may use business analytics to help them
analyze financial data to find possible risk and opportunity areas. This might entail
looking at financial performance indicators, making income and spending
projections, and finding areas where costs can be decreased.

• Competitive Intelligence: Organizations may benefit from using business
analytics to learn more about the tactics, advantages, and disadvantages of their
rivals. This information may be used to inform business choices and develop
cutting-edge, competitive strategies.

• At Management Levels in organization:


• At the strategic level, executives and senior managers may use business
analytics to help them make decisions about long-term planning, setting goals
and objectives, and determining the organization's future course.
• At the tactical level, tactical decision-making may be aided by business analytics,
which can offer insights into sales, manufacturing, and supply chain data.
• At the operational level, front-line managers and employees can benefit from
business analytics while making operational decisions. To help with operational
decision-making, it can offer insights regarding customer interactions, inventory
levels, and product quality.

Data Warehouse:

Database Management System: A database is a collection of related data. A database
management system (DBMS) is the database together with database software that is used to store
data and manipulate it: insertion of data, deletion of data, modification of data,
retrieval of data, and so on. Traditional databases are transactional, and they can follow any
one of the models such as the relational, object-oriented, network, hierarchical, or
object-relational model.
Business Intelligence at Data Warehouse


The data warehouse comprises crucial measures of the business operations, kept
along business dimensions. Units of sales by
product, day, customer group, sales district, sales area, and promotion, for instance, could
be included in a data warehouse. The product, day, customer group, sales district, sales area,
and promotion are the business dimensions in this instance.
The information comes from the operational systems that underpin the organization's
fundamental business operations and is stored in the data warehouse. While the
data is moved into the data warehouse, it goes through a process called data
transformation, which we will see later.

Data Warehousing
• A data warehouse refers to a data repository that is maintained separately from an
organization’s operational databases.
• Data warehousing provides architectures and tools for business executives to
systematically organize, understand, and use their data to make strategic decisions.

• It is designed for query and analysis rather than for transaction processing, and usually
contains historical data derived from transaction data, but can include data from other
sources.
• A data warehouse is also a collection of information as well as supporting system.
• Data warehouses have the distinguishing characteristic that they are mainly intended for
decision support applications.
• Typically, Data Warehousing architecture is based on a Relational Database Management
System, i.e., RDBMS.
• Data Warehousing (DW) is a process for collecting and managing data from various
sources to provide meaningful business insights. It is typically used to connect and analyze
business data from heterogeneous sources. The data warehouse is the core of the Business
Intelligence system which is built for data analysis and reporting.
• It is a combination of technologies and components which facilitates the strategic use of
data. It is electronic storage of a large amount of information by a business which is
designed for query and analysis instead of transaction processing. It is a process of
transforming data into information and making it available to users in a timely manner to
make a difference.
• Data warehousing is revolutionising how people conduct business analysis and make
strategic decisions across all industries, from government agencies to manufacturing
companies, retail chains to financial institutions, and utility companies to airlines.
• Data Warehousing helps in:
o analyzing the data to gain a better understanding of the business and to improve it;
o providing a comprehensive and integrated perspective of the business;
o making historical and current information about the company conveniently accessible for
decision-making;
o allowing decision-supporting transactions without interfering with operational
systems;
o making the information of the organisation consistent;
o offering a versatile and interactive source of strategic information.

Why we need Data Warehousing?


• Data collected from various sources cannot be visualized directly in its raw form.
• It must be integrated and processed before visualization can take place.
• Business Intelligence is an activity which contributes to the growth of any organization. It
is the act of transforming raw operational data into useful information for business
analysis.

Definition of Data Warehouse:
Defined in many ways, but not rigorously.
• A decision support database that is maintained separately from the organization’s
operational database.
• Support information processing by providing a solid platform of consolidated, historical
data for analysis.

Data warehousing:
It is the process of constructing and using data warehouses.

There is no universal definition of a data warehouse. W. H. Inmon defined a data warehouse as:
“A data warehouse is a subject-oriented, integrated, time-variant, and nonvolatile collection of
data in support of management’s decision-making process.”
Let’s see all the terms used in this definition.
• Subject-Oriented: Data warehouses are designed to help you analyze data. For example,
to learn more about your company's sales data, you can build a data warehouse that
concentrates on sales. Using this data warehouse, you can answer questions such as "Who
was our best customer for this item last year?" or "Who is likely to be our best customer
next year?" This ability to define a data warehouse by subject matter, sales in this case,
makes the data warehouse subject oriented.
• It is organized around major subjects, such as customer, product, sales, bank-account,
purchase, etc.
• It is focused on the modeling and analysis of data for decision makers, not on daily
operations or transaction processing.
• It provides a simple and concise view around particular subject issues by excluding
data that are not useful in the decision support process.

• Integrated: Integration is closely related to subject orientation. Data warehouses must put
data from different sources into a consistent format. They must resolve such problems as
naming conflicts and inconsistencies among units of measure. When they achieve this, they
are said to be integrated.

• Data warehouse is constructed by integrating multiple, heterogeneous data sources.
Like relational databases, flat files, on-line transaction records, XML, excel sheets etc.
• Then data cleaning and data integration techniques are applied.
• It ensures consistency in naming conventions, encoding structures, attribute measures,
etc. among different data sources.
• E.g., Hotel price: currency, tax, breakfast covered, etc.
• When data is moved to the warehouse, it is converted.

• Nonvolatile: Nonvolatile means that, once entered into the data warehouse, data should not
change. This is logical because the purpose of a data warehouse is to enable you to analyze
what has occurred. Operational update of data does not occur in the data warehouse
environment.
• The data warehouse is a physically separate store of data transformed from the operational
environment.
• This means it does not require transaction processing, recovery, and concurrency control
mechanisms.
• It requires only two operations in data accessing: the initial loading of data and the access of
data.

• Time Variant: The time horizon for the data warehouse is significantly longer than that
of operational systems.
• Operational database: current value data
• Data warehouse data: provide information from a historical perspective (e.g., past 5-
10 years)
• A data warehouse's focus on change over time is what is meant by the term time
variant. To discover trends and identify hidden patterns and relationships in business,
analysts need large amounts of data.
• Every key structure in the data warehouse contains an element of time, explicitly or
implicitly.
• But the key to operational data may or may not contain “time element.”

• Data warehouses are optimized for data retrieval and not routine transaction processing.

OLTP – Online Transaction Processing – An OLTP system is a common data processing system
in today's enterprises. Classic examples of OLTP systems are order entry, retail sales, and financial
transaction systems. OLTP systems are primarily characterized by a specific data usage pattern that
is different from data warehouse environments. OLTP refers to a class of systems and processes
designed to handle and manage the real-time transactional activities of an organization. These
transactions typically involve interactions with a database where data is read, updated, inserted, or
deleted based on user actions or requests.

Characteristics of OLTP systems are:


Real-time Processing: OLTP systems are optimized for quick response times, allowing users to
perform transactions and receive immediate feedback.

Transactional Integrity: Maintaining data accuracy and consistency is crucial in OLTP systems,
as they often involve financial transactions, order processing, and other critical business activities.

Concurrent Users: OLTP systems need to handle a high number of concurrent users performing
various transactions simultaneously without compromising performance or data integrity.

Normalized Data Structure: The data in OLTP systems is usually organized in a normalized
structure to minimize redundancy and ensure efficient storage.

Small Transactions: Transactions in OLTP systems are typically small-scale operations that
involve updating or retrieving a limited amount of data.

High Availability: OLTP systems require high availability to ensure continuous access to data
and support uninterrupted business operations.

ACID Properties: OLTP systems adhere to the ACID (Atomicity, Consistency, Isolation,
Durability) properties to guarantee reliable transactions and data integrity.
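
To make the transactional idea concrete, here is a hedged R sketch of a tiny OLTP-style transaction using the DBI and RSQLite packages (an in-memory database with made-up table and column names; it assumes those packages are installed):

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
DBI::dbWriteTable(con, "accounts", data.frame(id = c(1, 2), balance = c(500, 300)))

# Transfer 100 from account 1 to account 2 as one atomic unit of work
DBI::dbBegin(con)
DBI::dbExecute(con, "UPDATE accounts SET balance = balance - 100 WHERE id = 1")
DBI::dbExecute(con, "UPDATE accounts SET balance = balance + 100 WHERE id = 2")
DBI::dbCommit(con)    # on failure, DBI::dbRollback(con) would undo both updates

DBI::dbReadTable(con, "accounts")   # balances are now 400 and 400
DBI::dbDisconnect(con)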

Examples of OLTP transactions are:
• Recording customer orders and updating inventory levels in an e-commerce system.
• Processing bank transactions, such as deposits, withdrawals, and transfers.
• Booking reservations in a hotel or airline reservation system.
• Managing patient information and appointments in a healthcare system.
• In contrast to OLTP, there's another type of database system called "Online Analytical
Processing" (OLAP), which is designed for complex querying and reporting tasks, often
involving large volumes of data. OLAP systems are used for data analysis and business
intelligence purposes rather than real-time transaction processing.

Why there is a need for a separate Data Warehouse?


Separating a data warehouse from the operational databases (OLTP systems) offers numerous
advantages when it comes to managing and analysing data for business intelligence and reporting
purposes.

Here are some reasons why organizations often choose to keep their data warehouse separate from
their operational systems:
A DBMS is tuned for OLTP: access methods, indexing, concurrency control, and recovery.
A warehouse is tuned for OLAP: complex analytical queries, multidimensional views, and consolidation.
Performance Isolation: Operational databases are optimized for quick transactional processing,
while data warehouses are optimized for complex analytical queries. Separating the two ensures
that resource-intensive analytical queries don't impact the performance of operational transactions.

Data Transformation and Aggregation: Data warehouses often involve data transformation,
cleansing, and aggregation processes to create a unified and consistent view of the data. These
processes can be resource-intensive and can impact the performance of operational systems if
performed directly on OLTP databases.

Historical Data Storage: Data warehouses are designed to store historical data over time,
allowing for trend analysis, comparisons, and long-term insights. Operational databases may not
be optimized for retaining large volumes of historical data.

Data Volume and Structure: Data warehouses are designed to handle large volumes of data from
various sources. Operational databases, while handling high-frequency transactions, might not be
suitable for handling the vast amounts of data required for in-depth analysis.

Schema Design: Data warehouses often use a different schema design, such as star or snowflake
schemas, which are optimized for analytical queries. These schemas are different from the
normalized structures used in OLTP systems.
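
As a rough illustration of the star-schema idea mentioned above, the R sketch below builds one toy fact table and two toy dimension tables (all names are hypothetical), resolves the dimension keys, and aggregates a measure along one business dimension:

# Dimension tables (hypothetical)
dim_product <- data.frame(product_id = 1:2, product = c("Soap", "Shampoo"))
dim_store   <- data.frame(store_id = 1:2, city = c("Pune", "Mumbai"))

# Fact table holding the measures, keyed by the dimension identifiers
fact_sales <- data.frame(
  product_id = c(1, 1, 2, 2),
  store_id   = c(1, 2, 1, 2),
  units      = c(100, 80, 40, 60)
)

# Join the facts to the dimensions, then summarize units sold per city
sales <- merge(merge(fact_sales, dim_product, by = "product_id"),
               dim_store, by = "store_id")
aggregate(units ~ city, data = sales, FUN = sum)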

Data Integration: Data warehouses can consolidate data from multiple sources, including various
operational databases, third-party systems, and external data sources. This integration process can
be complex and is better managed in a separate environment.

Data Quality and Consistency: The data stored in operational systems might not always be clean
and ready for analysis. Data warehouses provide an opportunity to cleanse and standardize data
before it's used for reporting and analysis.

Query Performance: Data warehouses use optimized indexing, partitioning, and columnar
storage to enhance query performance for analytical tasks. This can be different from the indexing
strategies used in OLTP databases.

Business Intelligence and Reporting: Keeping a separate data warehouse allows business
analysts and reporting tools to access data without impacting operational systems. It also provides
a central location for business intelligence activities.

Scalability: Data warehouses can be scaled independently based on the analytical workload.
Separating the data warehouse from operational systems allows for tailored scaling strategies for
each environment.

Therefore, separating a data warehouse from operational systems allows organizations to create an
environment specifically optimized for complex data analysis, reporting, and business intelligence
activities. It ensures that analytical tasks can be performed efficiently without affecting the
performance and integrity of operational transactional systems.

Difference between OLTP and Data Warehouse (OLAP- Online Analytical Processing)

Feature              OLTP                                     Data Warehouse (OLAP)
Users                clerk, IT professional                   knowledge worker
Function             day-to-day operations                    decision support
DB design            application-oriented                     subject-oriented
Data                 current, up to date, detailed,           historical, summarized, multidimensional,
                     flat relational, isolated                integrated, consolidated
Usage                repetitive                               ad hoc
Access               read/write, index/hash on primary key    lots of scans
Unit of work         short, simple transaction                complex query
Records accessed     tens                                     millions
Number of users      thousands                                hundreds
DB size              100 MB to GB                             100 GB to TB
Metric               transaction throughput                   query throughput, response time

Characteristics of Data Warehouse


• Multidimensional conceptual view
• Unlimited dimensions and aggregation levels
• Unrestricted cross dimensional operations
• Client server architecture and multiuser support
• Accessibility
• Transparency
• Consistent reporting performance
• Flexible reporting

Components:

• External Source: In the data source layer, the external source is where data is collected from
various sources such as day-to-day transactional data from operational database systems, SAP
(ERP systems), flat files, Excel sheets, etc., irrespective of the type of data. Data can be in a
structured, semi-structured or unstructured format.

• Staging Area: Since the data extracted from the external sources does not follow any
specific format, it has to be validated before going into the data warehouse. For this purpose,
it is recommended to use an ETL tool.

E (Extract): Data is extracted from the external data source.
T (Transform): Data is transformed into the standard format.
L (Load): Data is loaded into the data warehouse after it has been transformed into the standard
format.
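
A minimal R sketch of this extract-transform-load flow is given below (toy data and hypothetical column names; the load step is shown as comments because it assumes the DBI and RSQLite packages):

# Extract: raw rows as they might arrive from an operational export
raw <- data.frame(
  order_id   = c(1, 2, 2, 3),
  order_date = c("01-07-2023", "02-07-2023", "02-07-2023", "03-07-2023"),
  amount     = c("1,200", "850", "850", "2,400"),
  stringsAsFactors = FALSE
)

# Transform: convert values to a standard format and drop duplicates
clean <- raw
clean$order_date <- as.Date(clean$order_date, format = "%d-%m-%Y")   # unify the date format
clean$amount     <- as.numeric(gsub(",", "", clean$amount))          # strip thousands separators
clean <- clean[!duplicated(clean), ]                                 # remove the duplicate order

# Load: append the standardized rows to a warehouse staging table, e.g.
# con <- DBI::dbConnect(RSQLite::SQLite(), "warehouse.sqlite")
# DBI::dbWriteTable(con, "sales_fact", clean, append = TRUE)
# DBI::dbDisconnect(con)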

• Data warehouse: After cleansing, the data is stored in the data warehouse as the central
repository. It stores the metadata, while the actual data gets stored in the data
marts. Note that the data warehouse stores the data in its purest form in this top-down
approach.

• Data Mart: The difference between a data warehouse and a data mart is that the data
warehouse is used across the whole organization, while data marts are used by individual
departments for customized reporting. Data marts are small in size.

For example, there are multiple departments in a company e.g., the finance department,
which is very different from the marketing department. They all generate data from
different sources, where they need customized reporting. The finance department is
concerned mainly with the statistics while the marketing department is concerned with the
promotions. The marketing department doesn’t require any information on finance.

Data marts are subsets of data warehouse, they are required for customized reporting. This
subset of data is valuable to specific groups of an organization. There are two approaches
to loading it. First, load the data warehouse and then load the marts or vice versa.

• In the reporting scenario, which is the data access layer, the user accesses the data
warehouse and generates the report. These reporting tools are meant to make the front
interface extremely easy for the consumer since people at the decision-making level are
not concerned with technical information. They are primarily concerned with a neat usable
report.

• Data Mining: The practice of analyzing the big data present in the data warehouse is data
mining. It is used to find the hidden patterns that are present in the database or the data
warehouse with the help of data mining algorithms.

This top-down approach is defined by Inmon: the data warehouse is a central repository for the
complete organization, and data marts are created from it after the complete data warehouse
has been built.

Tools to develop Data Warehouse:


1. Amazon Redshift: Amazon Redshift is a cloud-based, fully managed, petabyte-scale data
warehouse service by Amazon.

2. Microsoft Azure: Azure is a cloud computing platform that was launched by Microsoft in
2010.

3. Google BigQuery: BigQuery is a serverless data warehouse that allows scalable analysis
over petabytes of data.
4. Snowflake: Snowflake is a cloud computing-based data warehousing built on top of the
Amazon Web Services or Microsoft Azure cloud infrastructure.

5. Micro Focus Vertica: Micro Focus Vertica is developed for use in
data warehouses and other big data workloads where speed, scalability, simplicity, and
openness are crucial to the success of analytics.

6. Amazon DynamoDB: Amazon DynamoDB is a fully managed proprietary NoSQL database
service that supports key-value and document data structures and is offered
by Amazon.com as a part of the Amazon Web Services portfolio.

7. PostgreSQL: It is an extremely stable database management system, backed by over
twenty years of community development that has contributed to its high levels of resilience,
integrity, and correctness.

8. Amazon S3: Amazon S3 is object storage engineered to store and retrieve any quantity of
data from any place.

9. Teradata: Teradata is one of the admired Relational Database Management systems. It is
appropriate for building big data warehousing applications.

10. Amazon RDS: Amazon Relational Database Service is a cloud data storage service to
operate and scale a relational database within the AWS Cloud. Its cost-effective and
resizable hardware capability helps us to build an industry-standard relational database and
manages all usual database administration tasks.

11. Oracle Autonomous Data Warehouse: Autonomous Data Warehouse is a cloud-based data
warehousing service provided by Oracle that removes much of the complexity of building
a data warehouse and securing data, and it helps in developing data-driven applications.

12. MariaDB: MariaDB Server is one of the most popular open-source relational
databases.

Applications of Data Warehouse (DWH):

• Investment and Insurance sector
A data warehouse is primarily used to analyze customer and market trends and other data
patterns in the investment and insurance sector. Forex and stock markets are two major
sub-sectors where data warehouses play a crucial role because a single point difference can
lead to massive losses across the board. DWHs are usually shared in these sectors and focus
on real-time data streaming.

• Retail chains
DWHs are primarily used for distribution and marketing in the retail sector to track items,
examine pricing policies, keep track of promotional deals, and analyze customer buying
trends. Retail chains usually incorporate EDW systems for business intelligence and
forecasting needs.

• Healthcare
DWH is used to forecast outcomes, generate treatment reports, and share data with
insurance providers, research labs, and other medical units in the healthcare sector. EDWs
are the backbone of healthcare systems because the latest, up-to-date treatment information
is crucial for saving lives.

Data Science & Data-driven decision making


Data Science: Data science is the study of data to extract meaningful insights for business.
It is a multidisciplinary approach that combines principles and practices from the fields of
mathematics, statistics, artificial intelligence, and computer engineering to analyze large
amounts of data. Data science combines math and statistics, specialized programming,
advanced analytics, artificial intelligence (AI), and machine learning with specific subject
matter expertise to uncover actionable insights hidden in an organization’s data. These
insights can be used to guide decision making and strategic planning.

Data-Driven Decision Making: Data-driven decision-making (DDDM) is defined as
using facts, metrics, and data to guide strategic business decisions that align with your
goals, objectives, and initiatives. To improve their decision-making
processes, organizations can follow the steps below:

Steps involved in the data-driven process:

Step 1: Strategy
Step 2: Identify key areas
Step 3: Data targeting
Step 4: Collecting and analyzing data
Step 5: Turning insights into action

Step 1: Strategy
Data-driven decision making starts with the all-important strategy. This helps focus your
attention by weeding out all the data that’s not helpful for your business.

First, identify your goals — what can data do for you? Perhaps you’re looking for new
leads, or you want to know which processes are working and which aren’t.

Look at your business objectives, then build a strategy around them — that way you won’t
be dazzled by all the possibilities big data has to offer.

Step 2: Identify key areas.

Data is flowing into your organization from all directions, from customer interactions to
the machines used by your workforce. It’s essential to manage the multiple sources of data
and identify which areas will bring the most benefit. Which area is key to achieving your
overarching business strategy? This could be finance or operations, for example.

Step 3: Data targeting

Now that you’ve identified which areas of your business will benefit the most from
analytics and what issues you want to address, it’s time to target which datasets will answer
all those burning questions.

This involves looking at the data that you already have and finding out which data sources
provide the most valuable information. This will help streamline data. Remember that
when different departments use separate systems, it can lead to inaccurate data reporting.
The best systems can analyze data across different sources.

Targeting data according to your business objectives will help keep the costs of data storage
down, not to mention ensuring that you’re gaining the most useful insights.

Keep an eye on costs, and keep the board happy, by focusing only on the data you really
need.

Step 4: Collecting and analyzing data.

Identify the key players who will be managing the data. This will usually be heads of
departments. That said, the most useful data will be collected at all levels and will come
from both external and internal sources, so you have a well-rounded view of what’s going
on across the business.

To analyze the data effectively, you may need integrated systems to connect all the
different data sources. The level of skills you need will vary according to what you need to
analyze. The more complex the query, the more specialized skills you’ll need.

On the other hand, simple analytics may require no more than a working knowledge of
Excel, for example. Some analytics platforms offer accessibility so that everyone can
access data, which can help connect the entire workforce and make for a more joined-up
organization.

The more accessible the data, the more potential there is for people to spot insights from
it.

Step 5: Turning insights into action.

The way you present the insights you’ve gleaned from the data will determine how much
you stand to gain from them.

There are multiple business intelligence tools that can pull together even complex sets of
data and present it in a way that makes your insights more digestible for decision makers.

Of course, it’s not about presenting pretty pictures but about visualizing the insights in a
way that’s relatable, making it easier to see what actions need to be taken and ultimately
how this information can be used in the business.

SUMMARY
• Data Analytics is a process of inspecting, cleaning, transforming, and modelling data with
the goal of discovering useful information, suggesting conclusions, and supporting
decision-making.
• Business analytics is the process of using quantitative methods to derive meaning from
data to make informed business decisions.
• Descriptive Analytics, Diagnostic Analytics, Predictive Analytics and Prescriptive
Analytics are the types of Data analytics.
• Data analytics can provide tremendous aid to decision-making, as they allow us to analyse
large amounts of structured and unstructured data to identify trends, forecast future
outcomes and make informed decisions.
• A data warehouse is a database designed to enable business intelligence activities. It exists
to help users understand and enhance their organizations’ performance.
• It helps to maintain historical records and analyze the data to understand and improve the
business.

• W.H. Inmon defines data warehouse as a subject-oriented, integrated, time-variant, and
nonvolatile collection of data in support of management’s decision-making process.

References
1. Data Science from Scratch by Steven Cooper
2. Business Analytics using R – A Practical Approach by Dr. Umesh R. Hodeghatta, Umesha
Nayak
3. Data Warehousing Fundamentals by Paulraj Ponniah
4. Data Warehousing, OLAP and Data Mining by S Nagabhushana
5. Building the Data Warehouse by William H Inmon

MCQs
Q1. Data Analytics uses ___ to get insights from data.
A. Statistical figures
B. Numerical aspects
C. Statistical methods
D. None of the mentioned above
Q.2 Data Analysis is a process of ________
A. inspecting data
B. cleaning data
C. transforming data
D. All of the above
Q.3 ____________ focuses more on why something happened.
A. Prescriptive Analytics
B. Diagnostic Analytics
C. Predictive Analytics
D. Cognitive Analytics

Q4. ______ describes what has happened.


A. Descriptive Analytics

B. Diagnostic Analytics
C. Predictive Analytics
D. Prescriptive Analytics
Q5. ____ suggests course of action.
A. Prescriptive Analytics
B. Diagnostic Analytics
C. Predictive Analytics
D. Descriptive Analytics

Q6. Data Warehousing architecture is based on _______.


A. DBMS
B. Hierarchical Model
C. RDBMS
D. Network Model

Q.7 ____ is a subject-oriented, integrated, time-variant, and nonvolatile collection of data.


A. Data Mining
B. Data Warehousing
C. Web Mining
D. Text Mining

Q.8 ______ means that, once entered into the data warehouse, data should not change.
A. Nonvolatile
B. Time variant
C. Consistent
D. Redundant

Q.9 In the _______ step, analysts find which processes are working and which aren’t.

A. Strategy
B. Identify Key Areas
C. Collect Data
D. Analyze data.
Q.10 A ____________ is a database designed to enable business intelligence activities
A. data mining
B. data warehouse
C. data cube
D. metadata
**********************************************************

Questions:
a. Define Data Analytics.
b. Explore some more examples from day-to-day life where data analytics can be used.
c. Explain 4 types of Data Analytics.
d. Define Data Warehousing and explain in detail with diagram.
e. Explain architecture of Data Warehouse.
f. What is OLTP and differentiate between OLTP and OLAP.
g. What is data mart, explain in brief.
h. What are the applications of data warehouse.
i. Name some tools used to build data warehouse.

Extra Reading

1. Business Analytics Principles, Concepts, and Applications, What, Why, and How, Marc J.
Schniederjans, Dara G. Schniederjans, Christopher M. Starkey
2. Data Analytics made Accessible by Dr. Anil Maheshwari
3. The Data Warehouse Toolkit by Ralph Kimball, Margy Ross

Unit II: Data Mining Concepts & Techniques I

Topics:
• Definitions,
• Data preparation,
• Data modelling,
• Visualization of Decision Tree using R
• Regression using Linear Regression

Objectives:
6. To understand the fundamental concepts of data mining
7. To understand steps of data mining like data preparation and data modelling
8. To fully understand standard data mining methods and techniques such as
Linear Regression, Decision Tree.
9. To understand the concept of visualization, visualization of Decision tree
using R

Outcomes:
6. Understand the fundamental concepts of data mining.
7. Understand steps of data mining like data preparation and data modelling
8. Comprehend standard data mining methods and techniques such as Linear
Regression, Decision Tree.
9. Understand the concept of visualization, visualization of Decision tree using
R programming language

Introduction to the Unit
This unit introduces the concepts of data mining and gives a complete description of the
principles, architecture, applications, design and implementation of data mining, and of
techniques such as the Decision Tree and Linear Regression. It provides both theoretical and
practical coverage of data mining topics with extensive examples. This unit also covers data
visualization using the R programming language.

Introduction
a. There is a huge amount of data available in the Information Industry. This data is of no use
until it is converted into useful information. It is necessary to analyze this huge amount of
data and extract useful information from it.
b. Definition: Data Mining is defined as extracting information from huge sets of data. It uses
techniques from machine learning, statistics, neural networks, and genetic algorithms. Data
mining looks for hidden, valid, and potentially useful patterns in huge data sets. Data
Mining is all about discovering unsuspected/ previously unknown relationships amongst
the data. It is a multi-disciplinary skill that uses machine learning, statistics, Artificial
Intelligence, and database technology.
c. The insights derived via Data Mining can be used for marketing, fraud detection, and
scientific discovery, etc.
d. Data mining is an essential process in which the intelligent methods are applied to extract
data patterns.
e. It can be referred to as the procedure of mining knowledge from data.

History of Data Mining


• The beginnings of data mining may be traced to the 1960s and 1970s, when statisticians
and mathematicians began to create algorithms and strategies for extracting valuable
information from large databases or datasets.

• However, the development of powerful computers and the expansion of the internet in the
1990s led to the emergence of data mining as a separate subject. The first data mining tool,
the Intelligent Miner, was created by a team of IBM researchers in the early 1990s and was
used to examine large datasets from the financial sector.
• Data mining and other related topics, such as machine learning and artificial intelligence,
started to converge in the early 2000s, which sped up the development of more advanced
algorithms and methods.
• Data mining has many applications in a wide range of sectors today and is a mature
discipline that is expanding quickly. Data mining will probably continue to be a crucial
technique for comprehending and making sense of massive datasets in the future as big
data and artificial intelligence continue to advance.

The process of data mining is to extract information from a large volume of datasets. It is
done in four phases:

1. Data acquisition: It is the process of collecting, filtering, and cleaning the data before it is
added to the warehouse.
2. Data cleaning, preparation, and transformation: It is the process in which data is
cleaned, pre-processed, and transformed after adding it to the warehouse.
3. Data analysis, modelling, classification, and forecasting: In this step, data is analyzed,
with the help of various models, and classification is done.
4. Final report: In this step, the final report, insights, and analysis are finalized.

The Data Mining Process


Data mining is the practice of analysing large databases to generate new information from them.
It also uncovers patterns or hidden data in the given databases and analyses them. It
combines statistics and artificial intelligence to analyse these large data sets, which are obtained
using various techniques, and then discovers useful information from them. Data warehousing,
the growth of big data, and other such techniques have helped data mining transform
raw data into useful knowledge. This unit discusses various data mining techniques and
architectures.

Data mining requires domain knowledge, technical skills, and creativity to extract essential
insights from giant data sets. The steps and techniques used in the process can vary depending on
the data and the specific problem being solved.

There are multiple steps involved in the process of data mining:

1. Data collection: The first step in data mining is collecting relevant data from various
sources. This can include data from databases, spreadsheets, websites, social media,
sensors, and other sources.
2. Data preprocessing: When the data is collected, it must be pre-processed. This involves
cleaning the data, removing duplicates, and dealing with missing data.
3. Data exploration: The next step is to identify patterns, trends, and relationships. This can
be done using various visualization techniques such as scatter plots, histograms, and box
plots.
4. Data transformation: In this step, the data is transformed into a format suitable for
analysis. This can involve normalization, discretization, or other techniques.
5. Data modelling: In this step, various algorithms and methods are used to build a model
that can be used to extract insights from the data. This can include clustering, classification,
regression, or other procedures. Data modeling in data mining involves creating a
mathematical or statistical representation of data to understand patterns, relationships, and
make predictions or classifications. Data modeling is the process of creating a visual
representation of either a whole information system or parts of it to communicate
connections between data points and structures. The objective is to explain the types of
data used and stored within the system, the relationships among these data types, the ways
the data can be grouped and categorized and its formats and attributes. Data models are
built around business needs. Rules and requirements are defined through feedback from
business stakeholders so they can be incorporated into the design of a new system or
adapted in the iteration of an existing one.

6. Model evaluation: Once the model is built, it must be evaluated to determine its effectiveness.
This involves testing the model on a subset of the data and comparing the results to known
outcomes (see the short R sketch after this list).
7. Deployment: Finally, the model is deployed for a real-world problem, and the insights
gained from the data mining process are used to make informed decisions.
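
As a hedged illustration of steps 2 to 6, the R sketch below fits and evaluates a small decision tree on R's built-in iris data set (it assumes the rpart package is installed; a toy walk-through, not a prescribed workflow):

library(rpart)
set.seed(42)
data(iris)

# Preprocess / transform: iris is already clean, so just split it into training and test sets
idx   <- sample(nrow(iris), size = 0.7 * nrow(iris))
train <- iris[idx, ]
test  <- iris[-idx, ]

# Modelling: fit a classification tree that predicts the species from the measurements
model <- rpart(Species ~ ., data = train, method = "class")

# Evaluation: test the model on held-out data and report its accuracy
pred     <- predict(model, newdata = test, type = "class")
accuracy <- mean(pred == test$Species)
print(accuracy)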

Data Preparation
Data preparation is a crucial step in the data mining process. It involves transforming raw data
into a format that is suitable for analysis and modeling. In this phase, data is made production
ready. The data preparation process consumes about 90% of the time of the project. The data
from different sources should be selected, cleaned, transformed, formatted, anonymized, and
constructed (if required).
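
A minimal data-preparation sketch in R is shown below (toy records with hypothetical column names), covering de-duplication, simple imputation of missing values, and standardization of text values:

# Toy customer records (hypothetical)
customers <- data.frame(
  id   = c(1, 2, 2, 3, 4),
  age  = c(34, NA, NA, 29, 41),
  city = c(" Pune", "Mumbai", "Mumbai", "pune", "Nagpur"),
  stringsAsFactors = FALSE
)

# Clean: drop exact duplicate records
customers <- customers[!duplicated(customers), ]

# Impute: replace missing ages with the median of the known ages
customers$age[is.na(customers$age)] <- median(customers$age, na.rm = TRUE)

# Standardize: trim whitespace and unify the capitalization of city names
customers$city <- tools::toTitleCase(trimws(tolower(customers$city)))

print(customers)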

Data mining is highly useful in many domains. Numerous businesses use data mining, which has
many varied uses. Here are a few of the most widespread applications of data mining:

1. Customer Relationship Management: Customer relationship management (CRM) uses
data mining to examine customer data and extract patterns and trends in consumer
behaviour. This can assist companies in creating more efficient marketing and sales
strategies by helping them better understand their clients. It is also used for Customer
Profiling, Customer Retention, etc.
2. Fraud Detection: To find unexpected patterns and abnormalities in data that can point to
fraudulent behaviour, fraud detection uses data mining. This can aid companies in
identifying and stopping illicit activity like credit card fraud and identity theft.
3. Healthcare: Data mining is used in the healthcare industry to analyze patient data, spot
patterns and trends in patient behavior, and assess the effectiveness of treatments. This can
aid medical professionals in creating better treatment strategies and enhancing patient
outcomes.
4. Market Research, Market Analysis and Management: To analyze customer data and
spot patterns and trends in consumer behavior and preferences, market researchers employ
data mining. In order to stay competitive, this can help organizations create more successful
marketing and sales strategies.
5. Predictive analytics: Data mining is a technique used in predictive analytics to find
patterns and trends in data that may be used to forecast upcoming events or results.
Businesses may be able to forecast future trends and developments and make better
decisions as a result.
6. Financial Analysis, Finance Planning and Asset Evaluation: To analyze financial data
and identify patterns and trends in the financial markets and investment performance,
financial analysts employ data mining. Investors may be able to manage risk more skillfully
and make better investing decisions as a result.
7. Sports Analysis: To analyze player and team data and find patterns and trends in player
performance and team dynamics, sports analysis uses data mining. This can assist
managers and coaches in developing more sensible judgements and winning game plans.
8. Apart from these, data mining can also be used in the areas of production control, science
exploration, astronomy, Internet Web Surf-Aid, Corporate Analysis & Risk Management,
etc.

Architecture of Data Mining


The architecture of Data Mining includes the hardware and software infrastructure needed to
support data mining, the data storage and retrieval mechanisms, and the data processing algorithms
used to analyze the data. In essence, the data mining architecture provides the foundation upon
which the data mining process can be executed.
Data mining is the process of identifying and separating new patterns from previously gathered
data. Data mining combines the fields of statistics and computer science with the goal of finding
patterns in very huge datasets and then structuring them for subsequent use.
Basic working:
1. The user submits specific data mining requests, which are subsequently forwarded to
data mining engines for pattern analysis.
2. These programmes use the existing database to try to locate the answer to the question.

3. The collected metadata is then given to the data mining engine for proper processing,
which occasionally collaborates with modules for pattern assessment to get the desired
outcome.
4. An appropriate interface is then used to send this result to the front end in a comprehensible format.

Following is a detailed explanation of the components of the data mining architecture:

1. Data sources include databases, the World Wide Web (WWW), and data
warehouses. These sources may provide data in the form of spreadsheets, plain text,
or other types of media like pictures or videos. One of the largest sources of data is
the WWW.

Fig. Data Mining Architecture

2. Database Server: The database server houses the real, processed data. According
to the user's request, it does data retrieval tasks.
3. Data Warehouse: A data warehouse is a central repository of information that can be analyzed to make more informed decisions. Data regularly flows into a data
warehouse from transactional systems, relational databases, and other sources. Data
warehouse works on various schemas. A schema refers to a logical structure of the
database that stores the data. Data mining consists of multiple schemas:
i. Star Schema: In this schema, a multi-dimensional model is used to organize
data in the database.
ii. Snowflake Schema: It is an extension of the star schema in which the dimension tables of the multidimensional model are further divided into sub-dimension tables.
iii. Fact Constellation Schema: This schema collects multiple fact tables that
have common dimensions.
4. Data Mining Engine: One of the key elements of the data mining architecture is
the data mining engine, which executes various data mining operations including
association, classification, characterisation, clustering, prediction, etc.
5. Modules for Pattern Evaluation: These modules are in charge of spotting
interesting patterns in data, and occasionally they also work with database servers
to fulfil user requests.
6. Graphical User Interface: Because the user cannot completely comprehend the intricacy of the data mining process, a graphical user interface enables efficient communication between the user and the data mining system.

7. Knowledge Base: A crucial component of the data mining engine, Knowledge Base is very helpful in directing the search for outcome patterns. The knowledge
base may occasionally provide input to data mining tools. The information in this
knowledge base could come from user experiences. The knowledge base's goal is
to improve the reliability and accuracy of the outcome.

Alternative names for data mining
Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis,
data archaeology, data dredging, information harvesting, business intelligence, etc.

Knowledge discovery (mining) in databases (KDD)


KDD is an integral part of Data Mining. KDD stands for Knowledge Discovery in Databases. It is
an approach that involves discovering valuable knowledge from large datasets. The KDD process
involves several steps designed to transform raw data into actionable knowledge. Here are the
steps involved in the KDD process:

Steps involved in Knowledge discovery (mining) in databases (KDD)


• Data Cleaning – Data cleaning is the first and most important stage in this process. It is crucial because unclean data, if used in mining, can confuse the process and yield false results. This step helps remove erroneous or missing data from the data collection. Some techniques can clean data automatically; however, they are not always reliable. Data cleaning involves the following tasks:
i. Filling in missing data: There are several ways to handle missing values, including filling them in manually, using a measure of central tendency such as the mean or median, deleting or ignoring the record, or filling in the most probable value.
ii. Removing noisy, inconsistent, and irrelevant data: Noisy data contains random errors; binning is a technique that may be used to reduce this noise. Outliers are identified and contradictions are resolved. This also includes identifying and removing duplicate or irrelevant records, thereby removing deficiencies and loopholes in the data.
• Data Integration − In this step, multiple data sources are combined. Data integration is
the process of combining several data sources, such as databases, data cubes, or files, for
analysis. This improves the mining process' accuracy and speed. Redundancies result from
differing variable name patterns used by various databases. Additional data cleaning can
eliminate these duplicates and inconsistencies without compromising the accuracy of the data. Utilizing migration tools like Microsoft SQL and Oracle Data Service Integrator, data
integration is carried out.

• Data Selection − In this step, data relevant to the analysis task are retrieved from the
database. It consists of selecting the relevant data from a larger dataset based on specific
criteria, such as the relevance of the data to the problem at hand.
• Preprocessing: This step involves cleaning and transforming the selected data to ensure it
is in a suitable format for analysis. This may also include removing duplicates, dealing
with missing values, and transforming data into a standard format.
• Data Transformation − Here, data is transformed or consolidated into forms appropriate for mining by performing summary or aggregation operations. This step transforms the pre-processed data into a form suitable for analysis, which may include aggregating data, reducing data dimensionality, or creating new variables from existing ones. Common strategies are Smoothing (removing noise from the data using methods such as clustering or regression), Normalization (scaling the data so that it falls within a smaller range), and Discretization (replacing raw values of numeric attributes with interval labels). A short R sketch of these cleaning and transformation steps appears after the KDD figure below.

• Data Mining − In this step, intelligent methods are applied to extract data patterns. In this
step, various data mining techniques are applied to the transformed data to discover
patterns and relationships. This may involve using techniques such as clustering,
classification, and regression analysis.
• Interpretation: This step consists of interpreting the results of the data mining process to
identify helpful knowledge and insights. This may include visualizing the data or creating
models to explain the patterns and relationships identified.
• Pattern Evaluation − In this step, data patterns are evaluated. The final step involves
evaluating the usefulness of the knowledge discovered in the previous steps. This may
include testing the knowledge against new data or measuring the impact of the insights on
a particular problem or decision-making process. Identifying interesting patterns that
represent the information based on some measurements is the process of pattern evaluation.
Methods for data summary and visualization help the user comprehend the data.

• Knowledge Presentation − In this step, knowledge is represented using Data visualization
and knowledge representation tools which represent the mined data in this step. Data is
visualized in the form of reports, tables, graph, plots etc.

Fig. KDD Process
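As referenced in the Data Transformation step above, the following is a minimal R sketch of the cleaning and transformation stages of KDD. The numeric attribute and the bin boundaries are hypothetical, used only to illustrate missing-value handling, min-max normalization, and discretization.

# Hypothetical numeric attribute with a missing value and an outlier
age <- c(23, 25, NA, 31, 29, 120, 35, 27)

# Data cleaning: fill the missing value with the median and cap the outlier
age[is.na(age)] <- median(age, na.rm = TRUE)
age[age > 100] <- median(age)

# Data transformation: min-max normalization scales the values into [0, 1]
age_norm <- (age - min(age)) / (max(age) - min(age))

# Discretization: replace raw numeric values with interval labels
age_bins <- cut(age, breaks = c(0, 25, 30, 35, Inf),
                labels = c("<=25", "26-30", "31-35", ">35"))

data.frame(age, age_norm, age_bins)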

Know your progress:


1. Which of the following processes uses intelligent methods to extract data patterns?
A. Data mining
B. Text mining
C. Warehousing
D. Data selection
2. What is the full form of KDD in the data mining process?
A. Knowledge data house
B. Knowledge data definition

C. Knowledge discovery data
D. Knowledge discovery database
3. What are the chief functions of the data mining process?
A. Prediction and characterization
B. Cluster analysis and evolution analysis
C. Association and correlation analysis, classification
D. All of the above

Examples of Data Mining


Some of the common examples of Data mining are:

1. Retail: Retailers use data mining to analyze customer transactions to discover


patterns and trends that can be used to improve customer loyalty and increase sales.
For example, a retailer might use data mining to discover that customers who buy a
specific product also tend to buy another product, allowing the retailer to make
recommendations to customers based on their buying habits.

2. Healthcare: Healthcare providers use data mining to analyze patient data to discover
patterns and relationships that can help diagnose diseases and build treatment plans.
For example, a healthcare provider might use data mining to find that patients with
specific symptoms are more likely to develop a particular disease, allowing the
provider to create a screening program to catch the disease early.

3. Finance: Financial institutions utilize data mining to analyze customer data to


discover patterns and trends that can be used to detect fraud and improve risk
management. For example, a financial institution might use data mining to learn that
customers who make large cash withdrawals are more likely to be victims of fraud,
allowing the institution to implement additional security measures to protect these
customers.

Advantages of Data Mining

Data mining offers several advantages to help businesses and organizations make better
decisions and gain valuable insights. Here are some of the main advantages of data mining:

• Predictive analysis: Data mining allows businesses to predict future trends and
behaviors based on historical data. This enables organizations to make better
decisions about future strategies, products, and services.
• Improved marketing: Data mining helps businesses identify customer behavior
and preference patterns. This can help organizations create targeted marketing
campaigns and personalized offers that are more likely to resonate with customers.
• Improved customer experience: Data mining can help businesses understand
customer preferences and behaviors, enabling organizations to tailor products and
services to meet their needs. This can result in higher customer satisfaction and
loyalty.
• Competitive advantage: Data mining enables businesses to gain insights into their
competitors' strategies and performance. This can help organizations identify areas
where they can earn a competitive advantage and differentiate themselves in the
marketplace.
• Increased efficiency: Data mining can help businesses streamline processes and
operations by identifying inefficiencies and bottlenecks. This can help
organizations optimize workflows and reduce costs.
• Fraud detection: Data mining can help detect fraudulent activities and patterns in
financial transactions. This can help organizations prevent financial losses and
maintain the integrity of their operations.
• By correctly identifying future trends, it aids in averting potential risks.
• Helps in the process of making crucial decisions.
• Transforms compressed data into useful information.
• Presents fresh patterns and surprising tendencies.
• Big data analysis is made easier.
• Helps businesses locate, draw, and keep consumers.

• Aids in strengthening the company's interaction with its clients.
• Helps businesses save costs by assisting them in optimizing their production in
accordance with the appeal of a certain product.

Disadvantages of Data Mining


Data mining has several disadvantages, which can impact its effectiveness and reliability. Here
are some of the main disadvantages of data mining:

• Cost: Data mining can be expensive, as it requires significant computing power


and resources to analyze large datasets. Small businesses or organizations with
limited budgets may need help implementing data mining. Large investment
requirements can also be viewed as a disadvantage since sometimes gathering data
requires a lot of resources, which implies a high price.
• Complexity: Data mining is a complex process that requires specialized knowledge
and expertise. Analyzing large and complex datasets can be challenging, and it may
require a team of data scientists to develop and implement effective data mining
techniques.
• Bias and inaccuracies: Data mining can uncover biased or discriminatory patterns
if the data used in the analysis is biased or incomplete. This can lead to incorrect or
unfair conclusions and decisions.
• Ethical concerns: Data mining raises ethical concerns about the use of personal
data and privacy issues. Organizations must be transparent about their data
collection and use practices and ensure that they comply with relevant laws and
regulations.
• Over-reliance on technology: Data mining can lead to an over-reliance on
technology, which may result in a lack of human judgement and intuition. Human
interpretation and analysis are essential to ensure that data mining results are
accurate and meaningful.
• Data quality: Data mining requires high-quality data to produce reliable and
valuable results. If the data is incomplete, inconsistent, or of poor quality, the data
mining results may not be accurate or valid.

• Extreme workloads call for high-performance teams and staff training.
• Since the data may include sensitive client information, a lack of security might
potentially greatly increase the risk to the data.
• The incorrect outcome might result from inaccurate data.
• Large databases may be quite challenging to handle.

Decision Trees: Decision Trees are the Supervised Machine learning algorithm (this topic will be
explained in detail in the next unit) that can be used for Classification (predicting categorical
values) and Regression (predicting continuous values) problems. Decision trees are special in
machine learning due to their simplicity, interpretability, and versatility. Moreover, they serve as
the foundation for more advanced techniques, such as bagging, boosting, and random forests.

A classification problem identifies the set of categories or groups to which an observation belongs.
A Decision Tree uses a tree or graph-like model of decision. Each internal node represents a "test"
on attributes, and each branch represents the outcome of the test. Each leaf node represents a class
label (decision taken after computing all features).

Fig. Decision Tree - B and C are children (leaf nodes) of A (sub root), A is Parent
A decision tree starts with a root node that signifies the whole population or sample, which then
separates into two or more uniform groups via a method called splitting. When sub-nodes undergo
further division, they are identified as decision nodes, while the ones that don't divide are called
terminal nodes or leaves. A segment of a complete tree is referred to as a branch.

As shown above, decision trees can be used for both regression and classification tasks.

The following decision tree example shows whether a customer at a company is likely to buy a computer or not. Each internal node represents a test on an attribute. Each leaf node represents a class.

Fig. Decision Tree

Visualization of Decision Tree using R


Data visualization is the representation of data through use of common graphics, such as charts,
plots, infographics, and even animations. These visual displays of information communicate
complex data relationships and data-driven insights in a way that is easy to understand.

To visualize a decision tree using the R language, you can use the “rpart.plot” package, which
provides an easy way to plot decision trees created with the “rpart” package. Here's an example
of how to do it:
Install and load the required packages by writing the following code in the R IDE:

install.packages("rpart.plot")
library(rpart)
library(rpart.plot)

#Create a decision tree using the rpart function:

# Example
data(iris)
# Create decision tree
tree <- rpart(Species ~ ., data = iris)

Visualize the decision tree using rpart.plot:


# Plot the decision tree
rpart.plot(tree)

This will display the decision tree in a graphical format.

Note that the example above uses the built-in iris dataset for demonstration purposes. Make
sure to replace Species with the target variable in your own dataset and adjust the formula
and data accordingly.

The rpart.plot function offers various customization options, such as controlling the colors,
node shapes, and labels.
Decision Trees are very useful algorithms as they are not only used to choose alternatives based
on expected values but are also used for the classification of priorities and making predictions. It
is up to us to determine the accuracy of using such models in the appropriate applications.
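Building on the iris example above, the short sketch below shows how a fitted rpart tree can also be used for prediction; the two new flower measurements are made up for illustration.

library(rpart)

data(iris)
tree <- rpart(Species ~ ., data = iris)   # same tree as in the example above

# Classify two new (hypothetical) flower measurements
new_flowers <- data.frame(
  Sepal.Length = c(5.0, 6.7), Sepal.Width = c(3.4, 3.1),
  Petal.Length = c(1.5, 5.6), Petal.Width  = c(0.2, 2.4)
)
predict(tree, new_flowers, type = "class")   # predicted class labels

# Confusion matrix on the training data as a rough accuracy check
table(predicted = predict(tree, iris, type = "class"), actual = iris$Species)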

Advantages of Decision Trees


1. Easy to understand and interpret.

2. Does not require data normalization.
3. Does not require scaling of the data.
4. The pre-processing stage requires less effort compared to other major algorithms, which in a way streamlines the given problem.
Disadvantages of Decision Trees
1. Requires higher time to train the model.
2. It has considerable high complexity and takes more time to process the data.
3. When the decrease in a user-specified input parameter (such as the splitting criterion) is very small, it leads to early termination of tree growth.
4. Calculations can get very complex at times.

Regression
• Regression is defined as a statistical method that helps us to understand, summarize and
analyze the relationship between two or more variables of interest. The process that is
designed to perform regression analysis helps to understand which factors are important,
which factors can be ignored, and how they are affecting each other. Francis Galton
introduced the term Regression.
• Regression refers to a type of supervised machine learning technique that is used to predict
any continuous-valued attribute. Regression helps any business organization to analyze
the target variable and predictor variable relationships. It is a most significant tool to
analyze the data that can be used for financial forecasting and time series modeling.
• For example, let's say we want to predict the price of a house based on its size, number
of bedrooms, and location. In this case, price is the dependent variable, while size,
number of bedrooms, and location are the independent variables. By analyzing the
historical data of houses with similar characteristics, we can build a regression model
that predicts the price of a new house based on its size, number of bedrooms, and
location.
• As another example, suppose the price of a car depends on its horsepower, number of seats, and top speed. Here, the price of the car is the dependent variable, whereas the horsepower, number of seats, and top speed are the independent variables. If we have a data set containing previous records of car prices along with these features, we can build a regression model to predict the price of a car from its horsepower, number of seats, and top speed.
• There are several types of regression models, including linear regression, logistic
regression, and polynomial regression. Linear regression in data mining is the most
commonly used type, which assumes a linear relationship between the independent and
dependent variables. However, nonlinear relationships may exist between the variables
in some cases, which can be captured using nonlinear regression models.
• In regression, we generally have one dependent variable and one or more independent
variables. Here we try to “regress” the value of the dependent variable “Y” with the
help of the independent variables. That means, we are trying to understand, how the
value of ‘Y’ changes with respect to change in ‘X’.

Regression Analysis
• Regression analysis is used for prediction and forecasting. This has a significant
overlap with the field of machine learning. Regression modelling provides the
prediction mechanism by analysing the relationship between two variables. The main
use of regression analysis is to determine the strength of predictors, forecast an effect,
a trend, etc. The independent variable is used to explain the dependent variable in
Linear Regression Analysis. Regression modelling is a statistical tool for building a
mathematical equation depicting how there is a link between one response variable and
one or many explanatory variables.
• Let’s see an example of regression analysis.
• Imagine you're a sales manager attempting to forecast the sales for the upcoming
month. You are aware that the results can be influenced by dozens, or even hundreds
of variables, such as the weather, a competitor's promotion, or rumours of a new and
improved model. There may even be a notion within your company about what will
affect sales the most: "Believe me, we sell more when there is more rain." "Sales increase six weeks after the competitor's promotion."
• It is possible to determine statistically which of those factors actually has an effect by
using regression analysis. It responds to the question: What elements are most crucial?

Which may we ignore? What connections do the elements have with one another? The
most crucial question is how confident we are in each of these characteristics.
• Those variables are referred to as "factors" in regression analysis. The fundamental
element you are attempting to comprehend or anticipate is your dependent variable.
Monthly sales are the dependent variable in Redman's example from above. Then there
are the independent variables, which are the elements you believe have an effect on
your dependent variable.

Basically, two variables are involved in this process:


Dependent Variable: This is the variable that we are trying to forecast or predict.
Independent Variable: It is a predictor, or explanatory variable. These are factors that
influence the analysis or target variable and provide us with information regarding the
relationship of the variables with the target variable. It is a variable which is used to predict
the other variable's value.

Regression can be broadly classified into five major types.


i. Linear Regression
ii. Logistic Regression
iii. Lasso Regression
iv. Ridge Regression
v. Polynomial Regression

Linear Regression

Linear Regression is a predictive model used for finding the linear relationship between a
dependent variable and one or more independent variables.

Here, ‘Y’ is our dependent variable, which is a continuous numerical variable, and we are trying to
understand how ‘Y’ changes with ‘X’.

Examples of Independent & Dependent Variables:
• x is Rainfall and y is Crop Yield
• x is Advertising Expense and y is Sales
• x is sales of goods and y is GDP

If the dependent variable is modelled using a single independent variable, then it is known as Simple Linear Regression. In regression, the equation that describes how the
response variable (y) is related to the explanatory variable (x) is known as Regression
Model.

Simple Linear Regression


X —–> Y

In simple linear regression, the data are modeled to fit a straight line. For example, a
random variable, Y (called a response variable), can be modeled as a linear function of
another random variable, X (called a predictor variable), with the equation.
Y= B0 + B1 X.
where the variance of Y is assumed to be constant.
In the context of data mining, X and Y are numeric database attributes.
B0 and B1 are called regression coefficients.
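As a small illustration, the sketch below estimates B0 and B1 with R's built-in lm() function on simulated data; the true coefficient values and the noise level are assumptions chosen only for the example.

# Simulated data: Y depends roughly linearly on X plus random noise
set.seed(42)
X <- 1:50
Y <- 3 + 2 * X + rnorm(50, sd = 5)   # true B0 = 3, true B1 = 2

# Fit the simple linear regression model Y = B0 + B1 X
model <- lm(Y ~ X)
coef(model)   # estimated B0 (intercept) and B1 (slope)

# Plot the data and overlay the fitted regression line
plot(X, Y, main = "Simple linear regression")
abline(model, col = "blue")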

A scatter plot of such data presents the linear relationship between the output (Y) variable and the predictor (X) variable, with the fitted line referred to as the best-fit straight line. Based on the given data
points, we attempt to plot a line that fits the points the best. A regression line is also known
as the line of average relationship. It is also known as the estimating equation or prediction
equation. The slope of the regression line of Y on X is also referred to as the Regression
coefficient of Y on X.

To calculate the best-fit line, linear regression uses the traditional slope-intercept form:
Y = B0 + B1 X.

Linear Regression is the process of finding a line that best fits the data points available
on the plot, so that we can use it to predict output values for given inputs.
“Best fitting line”
A Line of best fit is a straight line that represents the best approximation of a scatter plot
of data points. It is used to study the nature of the relationship between those points.
The equation to find the best fitting line is:
Y` = bX + A
where, Y` denotes the predicted value
b denotes the slope of the line
X denotes the independent variable
A is the Y intercept

A given collection of data points might show up on a chart as a scatter plot, which may or
may not look like it is organized along any lines. One of the most significant results of
regression analysis is the identification of the line of best fit, which minimizes the deviation
of the data points from that line. Many straight lines may be drawn through the data points
in the graph.

So, how do we find a line of best fit using regression analysis?


Usually, the apparent predicted line of best fit may not be perfectly correct, meaning it will
have “prediction errors” or “residual errors”.
Prediction or Residual error is nothing but the difference between the actual value and the
predicted value for any data point. In general, when we use Y` = bX +A to predict the
actual response Y, we make a prediction error (or residual error) of size:

E = Y – Y`
where, E denotes the prediction error or residual error.
Y` denotes the predicted value.
Y denotes the actual value

A line that fits the data "best" will be one for which the prediction errors (one for each data
point) are as small as possible.

The above diagram depicts a simple representation with all the above discussed values.
Regression analysis uses “least squares method” to generate the best fitting line. This
method builds the line which minimizes the squared distance of each point from the line of
best fit.
So, the Line of Best Fit is used to express a relationship in a scatter plot of different data
points. It is an output of regression analysis and can be used as a prediction tool.
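A short R sketch of these ideas on simulated data is shown below: it computes the residual errors E = Y - Y` from a fitted model and the sum of squared errors that the least squares method minimizes. The data are simulated and serve only as an illustration.

set.seed(1)
X <- runif(30, 0, 10)
Y <- 5 + 1.5 * X + rnorm(30)

fit <- lm(Y ~ X)
Y_hat <- predict(fit)    # predicted values Y`
E <- Y - Y_hat           # residual (prediction) errors
sum(E^2)                 # sum of squared errors minimized by least squares

# residuals() returns the same quantities directly
all.equal(E, residuals(fit), check.attributes = FALSE)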

Prediction Techniques
Linear Regression Techniques
• Linear regression is used to model the relationship between a dependent variable and one or more independent variables. It is primarily a supervised learning technique, although closely related least-squares ideas also appear in unsupervised settings.

• In supervised learning, linear regression is used for regression tasks where the
goal is to predict a continuous numerical value. For example, predicting the price
of a house based on its size, location, number of bedrooms, etc. In this case, linear
regression can be used to find the best-fitting line or hyperplane that predicts the
house price based on the given features.
• In unsupervised settings, least-squares ideas closely related to linear regression are used for dimensionality reduction. For example, principal component analysis (PCA) finds the principal components that capture the most variance in the data by minimizing squared reconstruction error, and the reduced features it produces can then support clustering tasks, where the goal is to group similar data points together based on their features.
• There are several techniques used to estimate the parameters of a linear regression
model, including ordinary least squares (OLS), maximum likelihood estimation
(MLE), and gradient descent. These techniques are used to find the best-fitting line
or hyperplane that minimizes the sum of squared errors between the predicted
values and the actual values of the dependent variable.
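As a small illustration of the ordinary least squares (OLS) technique mentioned above, the following sketch computes the slope and intercept by hand from the usual closed-form formulas and compares them with lm(); the simulated data are an assumption made only for the example.

# OLS estimates for simple linear regression:
#   slope     b1 = cov(x, y) / var(x)
#   intercept b0 = mean(y) - b1 * mean(x)
set.seed(7)
x <- rnorm(100, mean = 50, sd = 10)
y <- 10 + 0.8 * x + rnorm(100, sd = 4)

b1 <- cov(x, y) / var(x)
b0 <- mean(y) - b1 * mean(x)
c(intercept = b0, slope = b1)

# The same estimates via lm(), which also uses least squares
coef(lm(y ~ x))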

This statistical method is used across different industries such as,


Manufacturing - Evaluate the relationships among design variables in order to define a better engine and provide better performance.
Finance - To analyze financial measures, regression models are frequently employed in
the financial sector. It may be used to research and make predictions about how future
events will be impacted by variables like GDP. Understand the trend in the stock prices,
forecast the prices, and evaluate risks in the insurance domain.
Marketing- Understand the effectiveness of market campaigns and forecast pricing and
sales of the product. Regression is also utilized in the marketing sector to comprehend
customer behavior and assist businesses in predicting and identifying their target
objectives.
Future Projection: We may project future trends and develop data-driven forecasts using
regression and the most recent data statistics.

Healthcare - The field of medical research relies heavily on regression. It is used for
research reasons to evaluate medications, forecast the future prevalence of a specific
condition, and other things.
Medicine- Forecast the different combinations of medicines to prepare generic medicines
for diseases.

Applications of Regression
Regression is a highly common approach with several uses in commerce and industry. Predictor and response variables are used in the regression process. The main applications of regression are listed below:

• Modelling of the environment
• Analyzing behavior in business and marketing
• Financial forecasting or predictions
• Examining the latest patterns and trends.

Importance of Linear Regression

• Linear-regression models are relatively simple and provide an easy-to-interpret mathematical formula that can generate predictions.
• It is used everywhere from biological, behavioral, environmental, and social sciences to
business.
• Linear-regression models have become a proven way to predict the future scientifically and
reliably.
• E. g., performing an analysis of sales and purchase data can help you uncover specific
purchasing patterns on particular days or at certain times.
SUMMARY
• Data mining is an extensive field, and it uses techniques from machine learning,
statistics, neural networks, genetic algorithms and so on.
• Alternative name for Data mining is Knowledge Discovery Process

• Data Cleaning, Data Integration, Data Selection, Data Transformation, Data
Mining, Pattern Evaluation, and Knowledge presentation are the steps of KDD
Process.
• Data preparation means transforming raw data into a format that is suitable for
analysis and modeling.
• Data modeling is the process of creating a visual representation of either a whole
information system or parts of it to communicate connections between data points
and structures.
• Linear regression analysis is used to predict the value of a variable based on the
value of another variable.
• Using the formula Y = B0 + B1 X, the best-fit line is drawn or plotted to predict the dependent variable Y, where X is the independent variable and B0 and B1 are constants (the regression coefficients).
• Regression analysis can be used to evaluate trends and sales estimates, analyze
pricing elasticity, assess risks in insurance companies, support sports analysis, and so on.

References
1. Data Warehousing, OLAP and Data Mining by S Nagabhushana
2. Data Mining Concepts and Techniques by Jiawei Han, Micheline Kamber, Jian Pei
3. Data Mining and Analysis - Fundamental Concepts and Algorithms by Mohammed J. Zaki
and Wagner Meira Jr.
4. Business Analytics using R – A Practical Approach by Dr. Umesh R. Hodeghatta, Umesha
Nayak
5. Regression and Other Stories by Andrew Gelman, Jennifer Hill, Aki Vehtari
6. An Introduction To Regression Analysis by Anusha Illukkumbura
7. Regression Analysis by Example by Samprit Chatterjee Ali S. Hadi

QUESTION FOR SELF ASSESSMENT

Q1. Data Mining is an integral part of ______
A. Data Warehousing
B. KDD
C. RDBMS
D. DBMS

Q.2 KDD describes the ____________.


A. Whole process of extraction of knowledge from data
B. Extraction of data
C. Extraction of information
D. Extraction of rules.

Q.3 ________in data mining involves creating a mathematical or statistical representation of


data
A. Data preparation
B. Data cleaning
C. Data modeling
D. Data transfer

Q4. To visualize a decision tree using the R language, you can use the “_______” package.
A. rpart
B. rpart.plot
C. rplot
D. ggplot

Q5. __________means transforming raw data into a format that is suitable for analysis and
modeling.
A. Data Cleaning
B. Data modelling
C. Data preparation
D. data visualization

Q.6 Which of the following are not correct about Data mining?
A. Extracting useful information from large datasets
B. The process of discovering meaningful correlations, patterns, and trends through large
amounts of data stored in repositories.
C. The practice of examining large pre-existing databases
D. Data mining has applications only in science and research.

Q.7 The primary use of data cleaning is:

A. Removing the noisy data


B. Correction of the data inconsistencies
C. Transformations for correcting the wrong data
D. All of the above

Q.8 Which of these is correct about data mining?

A. It is a procedure in which knowledge is mined from data.


B. It involves processes like Data Transformation, Data Integration, Data Cleaning.
C. It is a procedure using which one can extract information from huge sets of data.
D. All of the above


Q.9 Out of the following, which one is the proper application of data mining?

A. Fraud Detection

B. Market Management and Analysis
C. Health Care and Medicine
D. All of the above

**********************************************************

Extra Reading:
1. Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman,
Jennifer Hill
2. Applied Regression Analysis (Wiley Series in Probability and Statistics) Third Edition by
Norman R. Draper, Harry Smith
3. Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression,
and Survival Analysis (Springer Series in Statistics): Frank E. Harrell
4. Regression Analysis by Example (Wiley Series in Probability and Statistics) eBook:
Samprit Chatterjee, Ali S. Hadi
5. Data Mining for the Masses by Dr. Matthew North
6. An Introduction to Statistical Learning: with Applications in R by Gareth James, Daniela
Witten, Trevor Hastie, Robert Tibshirani

Subjective Questions
1. Define data mining, explain data mining process.
2. Explain data mining architecture with suitable diagram in detail.
3. What is KDD? Explain KDD Process in detail with proper diagram.
4. Write Data mining applications.
5. What is decision tree and how decision tree can be visualized using R language?
6. What is regression? What are its types? Explain Linear regression in detail.
7. Explain how linear regression is used to predict in data analysis.

Unit III: Data Mining Concepts & Techniques II

• Association (Market Basket Analysis)


o Market Basket Analysis.
o Algorithm used for Market Basket Analysis
o Examples of Market Basket analysis
o Steps involved in Market Basket Analysis
• Visualization of Association using R
• Visualization of Association using Tableau
• Automatic Cluster Detection
• Machine Learning
• Machine Learning & Data Mining

Objectives:
10. To understand the fundamental concepts of Association rules in data mining
11. To understand the concept of market basket analysis
12. To enable students to understand and implement association rule algorithms
in data mining using R and Tableau
13. To introduce the concept of machine learning
14. To understand difference between machine learning and data mining

Outcomes:
10. Learn and understand techniques of preprocessing various kinds of data.
11. Understand concept of Association Rule.
12. Apply association Mining Techniques on large Data Sets using R and
Tableau
13. Understand the concept of machine learning.

14. Comprehend the difference between machine learning and data mining.

Introduction to the Unit


This unit covers the concepts of association rule in data mining. Association rule mining is
explained using market basket analysis as its application. Association rule mining is demonstrated
using R programming language and Tableau. Next, we will see what machine learning is and how
data mining and machine learning are two important concepts in the field of data analysis. It is
also discussed here how both play a crucial role in uncovering patterns, relationships, and insights
from large datasets, enabling informed decision-making and predictive modelling.

In the previous unit we have seen the concepts of data mining and algorithms associated with it.
We will continue with data mining and machine learning concepts with one more algorithm i. e.
association rule mining.

What is Association?
• Association rule mining is a methodology that is used to discover unknown relationships
hidden in large databases.
• Association rule learning is a rule-based machine learning method for discovering
interesting relations between variables in big data.
• An example of an unsupervised learning approach is association rule learning, which
looks for relationships between data items and maps them appropriately for increased
profitability. It looks for any interesting relationships or correlations between the
dataset's variables.
• For example, the rule {onions, potatoes} => {burger} found in the sales data of a
supermarket would indicate that if a customer buys onions and potatoes together, they
are likely to also buy hamburger meat.
• Association rule is also called as
• Affinity Analysis
• Market Basket Analysis

o Due to its origin from the studies of customer purchase transactions
databases
o Formulate probabilistic association rules for the same.
o “What goes with what”

• Understanding customer purchasing behavior by using association rule mining enables many different applications. Rules help to identify new opportunities and ways for cross-selling products to customers. If we can find item groups which are consistently purchased together, such information could be used for:
• Designing Store layouts, cross selling, promotions, catalog design, and customer
segmentation, online recommendation systems or recommender systems in online
shopping websites of e-commerce companies like Amazon, Flipkart, and Snapdeal
• It is used for personalized marketing promotions, smarter inventory management,
product placement strategies in stores, and better customer relationship management.
• An extensive toolbox is available in the R-extension package – “arules” for association
rules.
• Mining association rules often results in a very large number of found rules, leaving the
analyst with the task of going through all the rules and discovering interesting ones.
Selecting manually through large sets of rules is time consuming and strenuous.
• Association rules consist of an antecedent (premise) and a consequent (conclusion), and
they indicate that if the antecedent items are present in a transaction, the consequent item
is likely to be present as well. The IF component of an association rule is known as the
antecedent. The THEN component is known as the consequent. The antecedent and the
consequent are disjoint; they have no items in common.
• For example, the rule "If a customer buys bread, then they are also likely to buy milk" is
an association rule that could be mined from this data set.
• In this example (if) bread is an antecedent (premise), (then) milk is
consequent(conclusion)

History of Association Rule

Although the ideas underlying association rules have been around for a while, association rule
mining was created by computer scientists Rakesh Agrawal, Tomasz Imieliński, and Arun
Swami in the 1990s as a method for exploiting point-of-sale (POS) systems to uncover
associations between different commodities. By applying the algorithms to supermarkets, the
researchers were able to identify associations—also known as association rules—between
variously purchased items. They then used this knowledge to forecast the possibility that certain
products would be bought in combination.

Association rule mining provided businesses with a tool to comprehend the buying patterns of
consumers. Association rule mining is frequently referred to as market basket analysis because of its retail origins.

Know your progress:


Q.1 Which of the following is not an advantage of association rules?
A. The rules are transparent and easy to understand.
B. Generates clear and simple rules.
C. Generates too many rules.
D. None of the above

Q.2 In one of the frequent item-set examples, it is observed that if tea and milk are bought
then sugar is also purchased by the customers. After generating an association rule among
the given set of items, it is inferred:
A. {Tea} is antecedent and {sugar} is consequent.
B. {Tea} is antecedent and the item set {milk, sugar} is consequent.
C. The item set {Tea, milk} is consequent and {sugar} is antecedent.
D. The item set {Tea, milk} is antecedent and {sugar} is consequent.

Q.3 In one of the frequent item-set examples, it is observed that if milk and bread are
bought, then eggs are also purchased by the customers. After generating an
association rule among the given set of items, it is inferred:
A. {Milk} is antecedent and {eggs} is consequent.

B. {Milk} is antecedent and the item set {bread, eggs} is consequent.
C. The item set {milk, bread} is consequent and {eggs} is antecedent.
D. The item set {milk, bread} is antecedent and {eggs} is consequent.

Q.4 Association rule mining can find application in:


A. Market basket analysis
B. Cross-marketing
C. Catalog designing
D. All of the above
Q.5 Online recommender systems is an example of:
A. Cluster Analysis
B. Affinity analysis
C. Decision analysis
D. Both a and b

Introduction to Market Basket Analysis


Market basket analysis is a data mining technique used by retailers to increase sales by
better understanding customer purchasing patterns. Market basket analysis is a technique
to analyze consumer behavior and preferences.

It involves analyzing large data sets, such as purchase history, to reveal product groupings,
as well as products that are likely to be purchased together.

Market basket analysis, also known as association analysis or affinity analysis, is a technique used in data mining and analytics to identify relationships or associations among
items frequently purchased or used together. It is widely used in retail and e-commerce
industries to understand customer behavior and optimize product placement, promotions,
and cross-selling strategies.

The primary goal of market basket analysis is to discover patterns or associations between
items that co-occur in transactions or customer purchases. These associations are often
expressed in the form of rules, commonly known as "association rules."

Identifying associations between purchased items is an application of market basket analysis. The purpose of association rules in market basket analysis is to identify frequent item sets.

Apriori Algorithm
The most widely used algorithm for market basket analysis is the Apriori algorithm, a popular association rule mining algorithm.
The Apriori algorithm was proposed by R. Agrawal and R. Srikant in 1994 for finding frequent itemsets in a dataset for Boolean association rules. The algorithm is named Apriori because it uses prior knowledge of frequent itemset properties. It applies an iterative, level-wise search in which frequent k-itemsets are used to find frequent (k+1)-itemsets.
A frequent itemset can be defined as an itemset whose support is higher than the user-specified minimum support.

The Apriori algorithm is quite fast even for a large number of unique items, since each step requires a single pass through the dataset.
To improve the efficiency of level-wise generation of frequent itemsets, an important
property is used called Apriori property which helps by reducing the search space.

Support and confidence metrics are commonly used to measure the strength of association between items in market basket analysis. The core idea is to generate candidate itemsets of a given size and then scan the dataset to check whether their counts meet the minimum support; this is repeated iteratively for larger itemsets.

Steps:
Here are the steps involved in implementing the Apriori algorithm in data mining -

1. Define minimum support threshold - This is the minimum number of times an item
set must appear in the dataset to be considered as frequent. The support threshold is
usually set by the user based on the size of the dataset and the domain knowledge.
2. Generate a list of frequent 1-item sets - Scan the entire dataset to identify the items
that meet the minimum support threshold. These item sets are known as frequent 1-
item sets.
3. Generate candidate item sets - In this step, the algorithm generates a list of candidate
item sets of length k+1 from the frequent k-item sets identified in the previous step.
4. Count the support of each candidate item set - Scan the dataset again to count the
number of times each candidate item set appears in the dataset.
5. Prune the candidate item sets - Remove the item sets that do not meet the minimum
support threshold.
6. Repeat steps 3-5 until no more frequent item sets can be generated.
7. Generate association rules - Once the frequent item sets have been identified, the
algorithm generates association rules from them. Association rules are rules of form
A -> B, where A and B are item sets. The rule indicates that if a transaction contains
A, it is also likely to contain B.
8. Evaluate the association rules - Finally, the association rules are evaluated based
on metrics such as confidence and lift.

The Apriori algorithm is also thought to be more precise than the AIS and SETM
algorithms. It aids in the discovery of common transaction item sets and the identification
of association rules between them. The objective of association rule mining is to discover
interesting relationships or associations among items. The significance of support in
association rule mining measures the importance of an itemset in a dataset.
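A minimal end-to-end sketch of these steps in R is shown below, using the arules package and its built-in Groceries dataset (the same grocery transaction data referred to later in this unit); the support and confidence thresholds are illustrative choices, not fixed recommendations.

# install.packages("arules")   # run once if the package is not installed
library(arules)

data("Groceries")
summary(Groceries)   # 9835 transactions across 169 items

# Run Apriori with minimum support and minimum confidence thresholds
rules <- apriori(Groceries,
                 parameter = list(supp = 0.001, conf = 0.8))

# Inspect the five rules with the highest lift
inspect(head(sort(rules, by = "lift"), 5))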

Problem Statement: When we go grocery shopping, we often have a standard list of things
to buy. Each customer/consumer has a distinctive list, depending on one's needs and preferences. A housewife might buy healthy ingredients for a family dinner, while a
bachelor might buy beer and chips. Understanding these buying patterns can help to
increase sales in several ways. If there is a pair of items, X and Y, that are frequently bought
together:

• Both X and Y can be placed on the same shelf, so that buyers of one item would be
prompted to buy the other.
• Promotional discounts could be applied to just one out of the two items.
• Advertisements on X could be targeted at buyers who purchase Y.
• X and Y could be combined into a new product, such as having Y in flavors of X.

We may be aware that particular things are commonly purchased together, but how can we
find these connections?

Association guidelines are applicable in a variety of industries in addition to boosting sales


earnings. For example, knowing which symptoms are frequently associated might aid
patient treatment and medication selection.

Definition: Association rules analysis is a technique to uncover how items are associated
to each other. There are 3 common ways to measure association.

Measure 1: Support. This says how popular an itemset is, as measured by the proportion
of transactions in which an itemset appears. In Table 1 below, the support of {apple} is 4
out of 8, or 50%. Itemsets can also contain multiple items. For instance, the support of
{apple, beer, rice} is 2 out of 8, or 25%.

Fig. Transaction Examples

If you discover that sales of items beyond a certain proportion tend to have a significant
impact on your profits, you might consider using that proportion as your support threshold.
You may then identify itemsets with support values above this threshold as significant
itemsets.
Measure 2: Confidence. This says how likely item Y is purchased when item X is
purchased, expressed as {X -> Y}. This is measured by the proportion of transactions with
item X, in which item Y also appears. In Table 1, the confidence of {apple -> beer} is 3
out of 4, or 75%.

One drawback of the confidence measure is that it might misrepresent the importance of
an association. This is because it only accounts for how popular apples are, but not beers.
If beers are also very popular in general, there will be a higher chance that a transaction
containing apples will also contain beers, thus inflating the confidence measure. To account
for the base popularity of both constituent items, we use a third measure called lift.

Measure 3: Lift. This says how likely item Y is purchased when item X is purchased,
while controlling for how popular item Y is. In Table 1, the lift of {apple -> beer} is 1,
which implies no association between items. A lift value greater than 1 means that item Y
is likely to be bought if item X is bought, while a value less than 1 means that item Y is
unlikely to be bought if item X is bought.
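The following R sketch computes the three measures by hand for a small, hypothetical set of baskets (these are not the exact transactions of Table 1, only a made-up example of the same calculation).

# Hypothetical transactions, one character vector per basket
baskets <- list(
  c("apple", "beer", "rice"), c("apple", "beer"), c("apple", "rice"),
  c("apple"), c("beer", "rice"), c("beer"), c("rice"), c("milk")
)
n <- length(baskets)

# Helper: in how many baskets do all the given items appear together?
contains <- function(items) sum(sapply(baskets, function(b) all(items %in% b)))

support_apple      <- contains("apple") / n                # P(apple)
support_apple_beer <- contains(c("apple", "beer")) / n     # P(apple and beer)
confidence         <- support_apple_beer / support_apple   # P(beer | apple)
lift               <- confidence / (contains("beer") / n)  # confidence / P(beer)

c(support = support_apple_beer, confidence = confidence, lift = lift)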

We use a dataset on grocery transactions from the arules R library. It contains actual transactions at a grocery outlet over 30 days. The network graph below, visualized using the arulesViz R library, shows associations between selected items: larger circles imply higher support, while red circles imply higher lift.

Several purchase patterns can be observed. For example:


• The most popular transaction was of pip and tropical fruits.
• Another popular transaction was of onions and other vegetables.
• If someone buys meat spreads, he is likely to have bought yogurt as well.
• Relatively many people buy sausage along with sliced cheese.
• If someone buys tea, he is likely to have bought fruit as well, possibly inspiring the
production of fruit-flavored tea.

Advantages
1. Easy to understand algorithm.
2. Join and Prune steps are easy to implement on large itemsets in large databases.
Disadvantages
1. It requires high computation if the itemsets are very large and the minimum support
is kept very low.
2. The entire database needs to be scanned.

Know your progress

Q.1 In Apriori algorithm, for generating e.g., 5 item sets, we use:
A. Frequent 4 item sets
B. Frequent 6 item sets
C. Frequent 5 item sets
D. None of the above

Q.2 Support is:


A. No. of transactions with both antecedent and consequent item sets.
B. Measures the degree of support the data provides for the validity of the rule.
C. Expressed as a percentage of total records.
D. All the above

Examples of Market Basket Analysis

• An online store examining customer purchase data to see which goods are often
purchased together.
• The study may indicate that customers who buy laptops also buy mouse pads, extra
hard drives, and extended warranties.
• With this information, the online merchant might build targeted product bundles or
upsell opportunities, such as giving a package deal for a laptop, mouse pad, external
hard drive, and extended warranty.
• A healthcare organization uses market basket analysis to determine that patients
who are diagnosed with diabetes frequently also have high blood pressure and high
cholesterol.
• Depending on this information, the organizations create a care plan that addresses
all three conditions, which leads to improved patient outcomes and reduced
healthcare costs.

• A grocery store evaluating customer purchase data to discover which goods are
usually purchased together is a real-world example of market basket analysis.

Customers who buy bread may also buy peanut butter, jelly, and bananas. With this
knowledge, the retailer may make modifications to improve sales of these, such as
positioning them near each other on the shelf or providing discounts when
consumers purchase all four items together.

• For example, a market basket analysis conducted on a supermarket dataset may


reveal the following association rule: {Bread, Milk} => {Butter}. This rule
indicates that customers who buy bread and milk together are also likely to purchase
butter.

To perform market basket analysis, the following steps are basically followed:

1. Data Preparation: Gather transactional data, which consists of records of


individual purchases or events. Each transaction contains a list of items purchased.

2. Data Transformation: Convert the transactional data into a suitable format, such as a binary matrix or a transaction-item matrix, where each row represents a transaction and each column represents an item. The matrix is populated with values indicating the presence or absence of an item in a transaction (a short R sketch of this conversion appears after this list).

3. Itemset Generation: Identify frequently occurring combinations of items, known


as itemsets. Itemsets can range from single items (itemsets of size 1) to sets of
multiple items. The support of an itemset is defined as the proportion of transactions
in which the itemset occurs.

4. Rule Generation: Use the frequent itemsets to generate association rules. These
rules are derived based on metrics such as support, confidence, and lift. Support
measures the frequency of occurrence of an itemset, confidence quantifies the
reliability of the rule, and lift indicates the strength of the association between the
antecedent and consequent.

5. Rule Evaluation and Selection: Evaluate the generated rules based on predefined
thresholds or criteria, such as minimum support and minimum confidence. Select
the rules that meet the criteria and are considered interesting or actionable.

6. Interpretation and Application: Interpret the discovered rules, analyze the


associations between items, and apply the insights to business decisions. This may
involve product placement optimization, targeted promotions, recommendations,
and cross-selling strategies.
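As referenced in step 2 above, the following is a minimal R sketch of converting raw purchase records into the transactions format used by the arules package; the data frame and its contents are hypothetical.

library(arules)

# Hypothetical purchase records: one row per (transaction, item) pair
purchases <- data.frame(
  transaction_id = c(1, 1, 1, 2, 2, 3, 3, 3),
  item = c("bread", "milk", "butter",
           "bread", "milk",
           "milk", "butter", "eggs")
)

# Split the items by transaction and coerce to a 'transactions' object
trans <- as(split(purchases$item, purchases$transaction_id), "transactions")

inspect(trans)         # view the itemset in each transaction
itemFrequency(trans)   # support of each individual item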

Market basket analysis provides valuable insights into customer purchasing behavior,
allowing businesses to understand which items are frequently bought together. By
leveraging these associations, companies can make informed decisions to enhance
customer experiences, optimize inventory management, and increase sales.

Visualization of association using R

In addition to our calculations of associations, we can use the “arulesViz” package in R to visualize our results as:
• Scatterplots,
• Interactive scatterplots, and
• Individual rule representations.

Fig. Association rule visualization

The diagram above shows the associations between products such as {brown bread, whole milk} and {whole milk, newspapers}, and how these products are related to each other. These relations extend further to other products as well.

In R, we can visualize associations obtained from market basket analysis using various
packages. We will use a popular package for association rule mining and visualization i.e.,
“arules”.

Here are the steps to visualize associations using R:

Install and load the arules and arulesViz packages (arulesViz provides the plot() methods for rules used below):
install.packages(c("arules", "arulesViz"))
library(arules)
library(arulesViz)
Generate association rules from transaction data. Assuming we have our transaction data
in a format suitable for arules (e.g., a binary matrix or a transaction object), we can use the
apriori() function to mine association rules:

# Assuming 'transactions' is our transaction data


rules <- apriori(transactions, parameter = list(support = 0.1, confidence = 0.8))

In the above example, the apriori() function is used to generate association rules with a
minimum support of 0.1 and a minimum confidence of 0.8. Adjust these parameters
according to our data and requirements.

Visualize the associations using a scatter plot. The plot() function in arules allows us to
visualize the associations in a scatter plot:

plot(rules, method = "scatterplot")


This will generate a scatter plot of the associations, where the x-axis represents the
confidence of the rules, and the y-axis represents the support of the rules. The size and
color of the points can be used to represent other metrics such as lift or conviction.

Customize the plot. The scatter plot can be customized to display additional information.
Here are a few customization options:

# Change the color scheme


plot(rules, method = "scatterplot", control = list(col = "red"))

# Add labels to the points


plot(rules, method = "scatterplot", control = list(labels = TRUE))

# Highlight specific rules
plot(rules, method = "scatterplot", control = list(highlight = c("antecedent item",
"consequent item")))
By visualizing the associations, we can gain a better understanding of the relationships
between items and identify patterns in our data. This can help us make informed business
decisions and develop effective strategies based on the discovered associations.

To visualize associations using Tableau, let’s follow these steps:

1. Prepare the data: the data should be in a format suitable for association analysis, where each row represents a transaction and each column represents an item, with a binary indicator or count of each item's presence in a transaction.

2. Import data into Tableau: Open Tableau and connect to data source by selecting
the appropriate file type or database connection.

3. Create a new worksheet: Click on the "Sheet" tab at the bottom of the Tableau
interface to create a new worksheet.

4. Drag and drop data fields: From the Dimensions or Measures pane, drag the fields
which are to be analyzed onto the appropriate shelves. Typically, we would place
the transaction ID or unique identifier in the "Rows" shelf and the item fields in the
"Columns" or "Marks" shelf.

5. Adjust the visualization type: In the "Show Me" pane, choose a suitable
visualization type for association analysis. Some common choices include scatter
plots, heat maps, or treemaps. Select the visualization type that best represents the
associations we want to explore.

6. Add filters and highlight specific associations: Use Tableau's filtering
capabilities to focus on specific subsets of data or adjust the criteria for association
rules. We can also highlight specific associations by applying conditional
formatting or color-coding.

7. Customize the visualization: Modify the appearance of the visualization to enhance readability and clarity. We can adjust axis labels, colors, font sizes, and other visual elements to create an informative and visually appealing representation of the associations.

8. Add additional visual elements: Consider adding supplementary visual elements, such as tooltips, legends, or annotations, to provide additional context or insights into the association analysis.

9. Save and share the visualization: Save the Tableau workbook and share it with others in the organization, or export it in various formats, such as PDF or image files, for further presentation or reporting purposes.

Tableau offers a wide range of visualization options and customization features, allowing us to create interactive and insightful representations of association analysis results.

Automatic cluster detection:

Automatic cluster detection, also known as cluster analysis or clustering, is a data analysis technique that aims to identify natural groupings or clusters within a dataset without prior knowledge of the cluster assignments. It is commonly used in various fields, including machine learning, data mining, pattern recognition, and exploratory data analysis.

The goal of automatic cluster detection is to partition the data into clusters, where objects within the same cluster are more similar to each other than to objects in other clusters. Cluster analysis can help uncover hidden patterns, relationships, and structures in data, providing valuable insights and facilitating decision-making processes.

There are several algorithms and methods available for automatic cluster detection, and the
choice depends on the characteristics of the data and the specific requirements of the
analysis. Here are a few commonly used techniques:
• K-means Clustering
• Hierarchical Clustering
• Gaussian Mixture Models (GMM)
• Self-Organizing Maps (SOM)
• Mean Shift

When applying automatic cluster detection techniques, it is important to consider the choice of distance or similarity measures, as well as the selection of appropriate parameters such as the number of clusters (K). It is also crucial to evaluate and interpret the results of clustering to assess the quality of the clusters and their meaningfulness in the given context.
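As a minimal sketch of one of these techniques, the example below runs k-means clustering (from base R's stats package) on the numeric columns of the built-in iris dataset; the choice of K = 3 and the random seed are assumptions made purely for illustration:

# Minimal k-means sketch on the built-in iris measurements
data(iris)
features <- scale(iris[, 1:4])        # standardize the four numeric columns

set.seed(42)                          # for reproducible cluster assignments
km <- kmeans(features, centers = 3, nstart = 25)

km$centers                            # cluster centres in standardized units
table(km$cluster, iris$Species)       # compare clusters with the known species

Comparing the cluster labels with the known species is only possible here because iris happens to carry labels; in a genuine unsupervised setting no such ground truth is available.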

Machine Learning & Data Mining


Introduction
▪ Definition of Machine Learning: Arthur Samuel described it as: “the field of study that
gives computers the ability to learn without being explicitly programmed.” This is an
older, informal definition.
▪ Tom Mitchell gives a more modern definition: “A computer program is said to learn from
experience E with respect to some class of tasks T and performance measure P, if its
performance at tasks in T, as measured by P, improves with experience E.”
▪ An algorithm can be thought of as a set of rules/instructions that a computer programmer specifies, which a computer can process. Simply put, machine learning algorithms learn from experience, much as humans do. For example, after having seen multiple examples of an object, a computer employing a machine learning algorithm can learn to recognize that object in new, previously unseen scenarios.

▪ Machine learning behaves similarly to the growth of a child. As a child grows, her
experience E in performing task T increases, which results in higher performance measure
(P).
▪ Machine learning is a branch of artificial intelligence (AI) and computer science which
focuses on the use of data and algorithms to imitate the way that humans learn, gradually
improving its accuracy.
▪ Machine learning (ML) is a type of artificial intelligence (AI) that allows software
applications to become more accurate at predicting outcomes without being explicitly
programmed to do so.
▪ Machine learning is a field of study and application in artificial intelligence (AI) that
focuses on the development of algorithms and models that enable computers to learn and
make predictions or decisions without being explicitly programmed. It involves using
statistical techniques to enable computers to learn from data, identify patterns, and make
informed decisions or predictions.
▪ The main goal of machine learning is to develop algorithms and models that can learn
from data and improve their performance on specific tasks without being explicitly
programmed.
▪ Machine learning has a wide range of applications across various domains, including
natural language processing, computer vision, recommendation systems, fraud detection,
healthcare, finance, and many others. Its capabilities have significantly advanced fields
such as autonomous driving, speech recognition, image classification, and language
translation.
▪ It's important to note that machine learning is not a magical solution to all problems. It
requires careful data preprocessing, feature engineering, model selection, and evaluation
to build accurate and reliable models. Additionally, ethical considerations and responsible
use of machine learning systems are crucial to mitigate potential biases and ensure fairness
and accountability in decision-making processes.
▪ Companies use machine learning for purposes like self-driving cars, credit card fraud
detection, online customer service, e-mail spam interception, business intelligence (e.g.,
managing transactions, gathering sales results, business initiative selection), and
personalized marketing.

▪ Companies that rely on machine learning include heavy hitters such as Yelp, Twitter, Facebook, Pinterest, Salesforce, and Google.

▪ In traditional programming, a programmer writes explicit instructions for a computer to follow. However, in machine learning, instead of explicitly programming rules, the
computer learns from examples and experiences. The learning process involves training a
model on a dataset, which consists of input data and corresponding output labels or targets.
The model then learns patterns and relationships in the data and uses this knowledge to
make predictions or decisions when presented with new, unseen data. Machine learning
algorithms use historical data as input to predict new output values. Machine learning is
the autonomous acquisition of knowledge using computer programs.
Major Machine Learning Algorithms:
The following are some major machine learning algorithms being used today.

1. Regression (Prediction)
We use regression algorithms for predicting continuous values.

Regression algorithms:
• Linear Regression
• Polynomial Regression
• Exponential Regression
• Logistic Regression
• Logarithmic Regression
2. Classification
We use classification algorithms for predicting a set of items’ classes or categories.

Classification algorithms:

• K-Nearest Neighbors
• Decision Trees

• Random Forest
• Support Vector Machine
• Naive Bayes

3. Clustering
We use clustering algorithms for summarization or to structure data.

Clustering algorithms:
• K-means
• DBSCAN
• Mean Shift
• Hierarchical

4. Association
We use association algorithms for associating co-occurring items or events.

Association algorithms:
• Apriori
5. Anomaly Detection
We use anomaly detection for discovering abnormal activities and unusual cases like
fraud detection.

6. Sequence Pattern Mining
We use sequential pattern mining to discover ordered patterns in sequence data, for example to predict the next event in a sequence based on the events that came before it.

7. Dimensionality Reduction
We use dimensionality reduction to reduce the size of data and extract only the useful features from a dataset (a short sketch follows this list).

8. Recommendation Systems
We use recommender algorithms to build recommendation engines.

Examples:

• Netflix recommendation system.


• A book recommendation system.
• A product recommendation system on Amazon.
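To make one of these algorithm families concrete (dimensionality reduction, item 7 above), the short sketch below applies principal component analysis, available in base R as prcomp(), to the built-in USArrests dataset; the dataset and settings are chosen purely for illustration:

# Dimensionality reduction example: PCA on the built-in USArrests data
pca <- prcomp(USArrests, scale. = TRUE)   # standardize variables before PCA

summary(pca)          # proportion of variance explained by each component
head(pca$x[, 1:2])    # first two principal components as reduced features

If the first two components explain most of the variance, they can serve as a compact replacement for the original four variables in later analysis.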
There are different types of machine learning algorithms, including:
▪ Supervised learning: Supervised learning is defined as learning from labeled
examples. In supervised learning, the training dataset is labeled, meaning each
example has input data along with corresponding output labels. The training set is
used to teach the machine learning model by providing labeled examples. The
model learns patterns and relationships in the data to make accurate predictions on
new, unseen data. The model learns to map inputs to outputs by generalizing
from the labeled examples. Common algorithms include decision trees, random
forests, support vector machines (SVM), and neural networks. Decision tree
classification is an example of supervised learning. In supervised learning, we are
given a data set and are already aware of what the ideal form of our output should
be, working on the assumption that there is a connection between the input and the
output. Given labeled reviews, the model can learn from them and predict new
examples.
▪ The primary goal of supervised learning is to learn from labelled examples, where
the input data is paired with corresponding output labels. The model is trained to
make predictions or decisions on unseen data based on the patterns learned from
the labelled examples.
▪ Classification is a supervised learning task where the goal is to predict the class or
category of a given input, while regression is a supervised learning task where the
goal is to predict a continuous value. K-means and Hierarchical Clustering are
examples of unsupervised learning algorithms, not supervised learning.
▪ "Regression" and "classification" problems are two categories of problems of
supervised learning. In a regression problem, we are attempting to map input variables to a continuous function in order to predict outcomes within a continuous
output. Instead, in a classification problem, we are attempting to forecast outcomes
in a discrete output. We are attempting to map input variables into discrete
categories, to put it another way.

Let’s see an example:

We have been given data about the size of houses on the real estate market, and we try to predict their price. Price as a function of size is a continuous output, so this is a regression problem.
We could turn this example into a classification problem by instead making our
output about whether the house “sells for more or less than the asking price.”

Here we are classifying the houses based on price into two discrete categories.
Second example:
(a) Regression — Given a picture of a person, we have to predict their age on the
basis of the given picture
(b) Classification — Given a patient with a tumor, we have to predict whether the
tumor is malignant or benign.
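To connect these two problem types to code, here is a minimal R sketch using a small made-up data frame (the numbers are purely illustrative): a linear regression for the continuous house-price output and a logistic regression for a binary output.

# Tiny made-up dataset: house size (sq. ft.), price (in thousands), and a binary label
houses <- data.frame(
  size  = c(800, 950, 1100, 1300, 1500, 1700),
  price = c(120, 140, 165, 195, 220, 250),
  above_asking = c(0, 0, 1, 0, 1, 1)   # 1 = sold above asking price
)

# Regression: predict a continuous output (price) from size
reg_model <- lm(price ~ size, data = houses)
predict(reg_model, newdata = data.frame(size = 1200))

# Classification: predict a discrete output (above asking price or not)
clf_model <- glm(above_asking ~ size, data = houses, family = binomial)
predict(clf_model, newdata = data.frame(size = 1200), type = "response")

The regression model returns a predicted price, while the classification model returns a probability that can be thresholded (for example at 0.5) to assign a class.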

▪ Unsupervised learning: Unsupervised learning involves training a model on an unlabeled dataset. The model learns patterns and structures in the data without
explicit output labels. Clustering and dimensionality reduction techniques, such as
k-means clustering and principal component analysis (PCA), are examples of
unsupervised learning algorithms. Unsupervised learning enables us to address
problems without having an idea of the final product. Even if we may not be aware
of the exact impact of the variables, we may nevertheless infer structure from the
data.
▪ By grouping the data depending on how the variables in the data relate to one
another, we may deduce this structure. Anomaly detection involves identifying abnormal or unusual patterns in the data. Unsupervised learning does not require
labelled training data or perform classification tasks.
▪ In Unsupervised learning, there is no prior information about the data. And since
the data is not labelled, the machine should learn to categorize the data on the
similarity and patterns it finds in the data. There is no feedback based on the
outcomes of the predictions when learning is done unsupervised. This helps us to
obtain interesting relations between the features present in the data.
▪ Unsupervised learning is used for tasks like customer segmentation, where patterns
or clusters are identified in the data, and anomaly detection, where abnormal data
points are identified.
▪ Unsupervised learning is a type of machine learning where the algorithm learns
from unlabeled data without any specific target variable or labels. The goal is to
discover patterns, structures, or relationships within the data.
▪ Clustering is a technique in unsupervised learning that groups similar data points
together, while dimensionality reduction aims to reduce the number of features
while retaining important information.

Let’s see an example of unsupervised machine learning:


▪ Consider a case where a supermarket wants to increase its revenue. It decides to apply a machine learning algorithm to its sales data. It is observed that customers who buy cereals often tend to buy milk, and those who buy eggs tend to buy bacon. Thus, redesigning the store and placing related products side by side can help the supermarket understand consumer behaviour and increase revenue.

▪ Reinforcement learning: Reinforcement learning involves training an agent to interact with an environment and learn optimal actions through a trial-and-error
process. The agent receives rewards or penalties based on its actions, and its goal
is to maximize the cumulative reward. This type of learning is commonly used in
robotics, game playing, and autonomous systems.

▪ Reinforcement learning is an area of machine learning inspired by behaviorist
psychology, concerned with how software agents ought to take actions in an
environment so as to maximize some notion of cumulative reward.

▪ The reinforcement learning algorithm (called the agent) continuously learns from
the environment in an iterative fashion. The agent learns from its environment’s
experiences until it has explored the whole spectrum of conceivable states.
▪ Reinforcement Learning is a discipline of Artificial Intelligence that is a form of
Machine Learning. It enables machines and software agents to automatically select
the best behavior in a given situation in order to improve their efficiency. For the
agent to learn its behavior, it needs only simple reward feedback, which is known
as the reinforcement signal.

Example 1:
▪ Consider teaching a dog a new trick. You cannot tell it what to do, but you can
reward/punish it if it does the right/wrong thing. It has to figure out what it did that
made it get the reward/punishment. We can use a similar method to train computers
to do many tasks, such as playing backgammon or chess, scheduling jobs, and
controlling robot limbs.

Example 2:
▪ Teaching a game bot to perform better and better at a game by learning and adapting
to the new situation of the game.

Advantages and Disadvantages of Machine Learning:


Advantages:
• Trends and patterns are easily discernible.
• No human intervention
• Continuous improvement with low cost
• Handling multidimensional data

• Handling a variety of data
• Wide applications
Disadvantages:
• Requires a large amount of data.
• Utilization of time and resources
• Result interpretation
• Susceptibility to errors in critical domains

Know your progress:

Q.1 What is the main goal of machine learning?


A. To write algorithms that can perform specific tasks.
B. To develop intelligent machines that can think and reason.
C. To enable computers to learn from data and improve performance over time.
D. To create systems that can mimic human behavior.
Q.2 What is the purpose of the training set in machine learning?
A. To test the performance of the model on new data
B. To fine-tune the hyperparameters of the model
C. To evaluate the performance of the model
D. To teach the model to make predictions based on labeled examples.
Q.3 You are given reviews of a few movies marked as positive, negative, or neutral.
Classifying reviews of a new movie is an example of
A. Reinforcement learning
B. Semi-Supervised learning
C. Unsupervised learning
D. Supervised learning

Machine Learning and Data Mining

▪ Machine learning and data mining are closely related fields that overlap in several
areas. Both fields deal with extracting knowledge and insights from data, but they have different approaches and objectives. Here's an overview of machine learning
and data mining:

Data Mining:
▪ Data mining, on the other hand, is a broader field that encompasses various techniques
for discovering patterns, relationships, and insights from large datasets. It involves the
extraction of useful and previously unknown information from data, typically with the
aim of solving specific business or research problems.
▪ Data mining techniques often involve the application of statistical and machine learning
algorithms to large datasets. The focus is on finding patterns, trends, anomalies, or
meaningful associations in the data that can be used for decision-making or gaining
valuable insights. Data mining can involve exploratory data analysis, data
preprocessing, feature selection, and applying algorithms such as clustering, association
rules, classification, regression, and more.
▪ Data mining is designed to extract the rules from large quantities of data, while machine
learning teaches a computer how to learn and comprehend the given parameters.
▪ Data mining is simply a method of researching to determine a particular outcome based
on the total of the gathered data. On the other side of the coin, we have machine learning,
which trains a system to perform complex tasks and uses harvested data and experience
to become smarter.
▪ Data Mining is a process of separating the data to identify a particular pattern, trends,
and helpful information to make a fruitful decision from a large collection of data.
▪ Overall, data mining is a broader field that includes various techniques for extracting
knowledge from data, while machine learning is a subset of data mining that focuses on
the development of algorithms and models for learning from data and making
predictions or decisions. Machine learning techniques are often employed within the
data mining process to automate the discovery of patterns and relationships in large
datasets.

Relationship between Machine Learning and Data Mining:
▪ Machine learning techniques are often utilized within the broader process of data mining.
Data mining can involve using machine learning algorithms as one of the tools to extract
knowledge from data. Machine learning algorithms, such as decision trees, support vector
machines, neural networks, and ensemble methods, are commonly applied in data mining
tasks to discover patterns or make predictions. Data Preprocessing is common to both
machine learning and data mining.
▪ Machine Learning: Machine learning often involves an iterative process of training models,
evaluating their performance, and fine-tuning them based on feedback. It focuses on
building models that generalize well to unseen data.
▪ Data Mining: Data mining also follows an iterative process, involving data preprocessing,
pattern discovery, evaluation, and interpretation. It aims to extract meaningful insights
from data and communicate them effectively.
▪ In practice, machine learning techniques are often utilized within data mining workflows
to build predictive models or automate certain aspects of the data mining process. Data
mining can help identify relevant features or variables for machine learning models, and
machine learning can enhance the accuracy and predictive power of data mining
algorithms.
▪ From all the above points we can conclude that machine learning and data mining are
complementary fields that share common goals of extracting knowledge from data but
employ different techniques and approaches to achieve those goals. They contribute to each
other's progress and are essential components of the broader field of data science.

SUMMARY
• Market basket analysis is a data mining technique used by retailers to increase sales by
better understanding customer purchasing patterns.
• Identifying items that buyers desire to buy is the major goal of market basket analysis.
It may help sales and marketing teams develop more effective product placement,
pricing, cross-sell, and up-sell tactics.

• The Apriori algorithm is a popular market basket analysis algorithm based on association rule mining.
• Association is intended to identify strong rules discovered in databases using some
measures of interest.
• Association rules are visualized using two different types of vertices to represent
the set of items and the set of rules R, respectively. The edges indicate the
relationship in rules.
• An R-extension package arules is used to find the rules among data and arulesViz
package from R is used to visualize association rules.
• Automatic cluster detection provides a valuable tool for exploratory data analysis,
pattern recognition, and data-driven decision-making, allowing for the
identification of inherent structures and groupings within datasets without prior
knowledge or assumptions.
• By leveraging Tableau's capabilities, we can visually explore and communicate the
associations in data effectively.
• Data Mining is a process of separating the data to identify a particular pattern,
trends, and helpful information to make a fruitful decision from a large collection
of data.
• Machine learning is the autonomous acquisition of knowledge using computer
programs.
• Data mining is designed to extract the rules from large quantities of data, while
machine learning teaches a computer how to learn and comprehend the given
parameters.
• Supervised learning, Unsupervised learning, Reinforcement learning are some of
the machine learning algorithms.
• A challenge commonly associated with data mining and machine learning is the lack of skilled professionals.

References
1. Business Analytics using R – A Practical Approach by Dr. Umesh R. Hodeghatta, Umesha
Nayak

2. Data Warehousing, OLAP and Data Mining by S Nagabhushana
3. Data Mining Concepts and Techniques by Jiawei Han, Micheline Kamber, Jian Pei
4. Data Mining and Analysis - Fundamental Concepts and Algorithms by Mohammed J. Zaki
and Wagner Meira Jr.
5. arulesViz: Interactive Visualization of Association Rules with R by Michael Hahsler
6. Visualizing Association Rules: Introduction to the R-extension Package arulesViz by
Michael Hahsler and Sudheer Chelluboina
7. A Course in Machine Learning by Hal Daumé III
8. Introduction to Machine Learning Alex Smola and S.V.N. Vishwanathan
9. Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz
and Shai Ben-David

Multiple Choice Questions


Q1. What is the primary goal of supervised learning?
A. To cluster similar data points together
B. To uncover hidden patterns in the data
C. To learn from labeled examples and make predictions or decisions on unseen data.
D. To learn from rewards and punishments to maximize performance.
Q2. Which of the following task(s) can be solved using unsupervised learning?
A. Image classification
B. Customer segmentation
C. Anomaly detection
D. Spam email detection

Q3. Which of the following is a classification task?


A. Predict the number of copies of a book that will be sold this month.
B. Predict the price of a house based on floor area, number of rooms, etc.
C. Predict the age of a person.
D. Predict whether there will be abnormally heavy rainfall next year.

Q4. Which of the following statement(s) is/are true about unsupervised learning in machine learning?
A. Unsupervised learning algorithms require labelled training data.
B. Unsupervised learning algorithms discover patterns and structures in unlabeled data.
C. Clustering and dimensionality reduction are examples of unsupervised learning
techniques.
D. Unsupervised learning is used for classification tasks.
E. Anomaly detection is a common application of unsupervised learning.
Q5. Which of the following statement(s) is/are true about supervised learning in machine
learning?
A. Supervised learning requires labelled training data.
B. The goal of supervised learning is to discover hidden patterns in unlabelled data.
C. Classification and regression are examples of supervised learning tasks.
D. Supervised learning algorithms can make predictions on new, unseen data.
E. K-means and Hierarchical Clustering are supervised learning algorithms.

Q6. Which of the following is NOT a type of machine learning algorithm?


A. Supervised learning
B. Unsupervised learning
C. Reinforcement learning
D. Data preprocessing

Review Questions:
1. What is an association rule? Explain it in detail.
2. What is Market Basket Analysis and write all the steps of Market Basket Analysis.
3. What is Machine Learning? Explain its advantages and disadvantages.
4. Explain all the types of machine learning.
5. Differentiate between data mining and machine learning.

Further Readings:

1. Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and
Presenting Data by EMC Education Services (2015)
2. Data Mining for Business Intelligence: Concepts, Techniques, and Applications in
Microsoft Office Excel with XLMiner by Shmueli, G., Patel, N. R., & Bruce, P. C.
(2010)

Unit IV: Business Forecasting

Topics:
• Business Forecasting
• Qualitative Techniques
• Quantitative Techniques
• Time Series Forecasting

Objectives:
15. To understand the concept of business forecasting
16. To understand the concept of qualitative and quantitative forecasting
techniques
17. To understand the applications of forecasting techniques
18. To introduce the concept of time forecasting
19. To understand use of forecasting techniques

Outcomes:
15. Understand the concept of business forecasting.
16. Understand the concept of qualitative and quantitative forecasting
techniques.
17. Understand the applications of forecasting techniques.
18. Introduce the concept of time forecasting.
19. Understand use of forecasting techniques.

Introduction to the Unit
This unit introduces the concept of business forecasting and the terms quantitative and qualitative data. Quantitative and qualitative forecasting techniques are explained along with their applications and uses, and time series forecasting is briefly described at the end. The unit defines quantitative and qualitative predictions, explores the value of quantitative and qualitative forecasts, and offers some examples of quantitative and qualitative forecasting techniques.

Introduction
Forecasting

Forecasting involves making predictions about the future. It is a technique that uses historical data
as inputs to make informed estimates that are predictive in determining the direction of future
trends. Businesses utilize forecasting to determine how to allocate their budgets or plan for
anticipated expenses for an upcoming period. This is typically based on the projected demand for
the goods and services offered.
e. g., In finance, forecasting is used by companies to estimate earnings or other data for subsequent
periods. Traders and analysts use forecasts in valuation models, to time trades, and to identify
trends. Forecasts are often predicated on historical data. Because the future is uncertain, forecasts
must often be revised, and actual results can vary greatly.

What is business Forecasting?


Business forecasting involves making informed guesses about certain business metrics, regardless
of whether they reflect the specifics of a business, such as sales growth, or predictions for the
economy. Business forecasting tries to make predictions about the future state of certain business
metrics such as gross domestic product (GDP) growth in the next quarter. Business forecasting
relies on both quantitative and qualitative techniques to improve accuracy. Managers use
forecasting for internal purposes to make capital allocation decisions and determine whether to make acquisitions, expand, or divest. They also make forward-looking projections for public
dissemination such as earnings guidance.
Key factors of forecasting:

• Financial and operational decisions are made based on economic conditions and how the
future looks, albeit uncertain.
• Forecasting is valuable to businesses so that they can make informed business decisions.
• Financial forecasts are fundamentally informed guesses, and there are risks involved in
relying on past data and methods that cannot include certain variables.
• Forecasting approaches include qualitative models and quantitative models.

Understanding Business Forecasting


Companies use forecasting to help them develop business strategies. Historical data is collected and analysed so that patterns can be found. Today, big data and artificial intelligence have transformed business forecasting methods. There are several different methods by which a
business forecast is made. All the methods fall into one of two predominant approaches: qualitative
and quantitative.

While there might be large variations on a practical level when it comes to business forecasting,
on a conceptual level, most forecasts follow the same process:

1. Selecting the critical problem or data point: This can be something like "will people buy
a high-end coffee maker?" or "what will our sales be in March next year?"
2. Identifying theoretical variables and an ideal data set: This is where the forecaster
identifies the relevant variables that need to be considered and decides how to collect the
data.
3. Assumption time: To cut down the time and data needed to make a forecast, the forecaster
makes some explicit assumptions to simplify the process.
4. Selecting a model: The forecaster picks the model that fits the dataset, selected variables,
and assumptions.
5. Analysis: Using the model, the data is analyzed, and a forecast is made from the analysis.

6. Verification: The forecast is compared with what actually happens, to identify problems, tweak some variables or, in the rare case of a fully accurate forecast, confirm the approach.
7. Interpretation and decision making: Once the analysis has been verified, it must be
condensed into an appropriate format to easily convey the results to stakeholders or
decision-makers. Data visualization and presentation skills are helpful here.

Quantitative data
• Quantitative data is number-based, countable, or measurable. Quantitative data tells us how many, how much, or how often something occurs.
• e. g. no. of employees in organization, no. of defects in product assembly, no. of
specific brand’s products purchased.
• Quantitative analysis involves looking at the hard data, the actual numbers.

Qualitative data
• Qualitative data is interpretation-based, descriptive, and relating to language. Qualitative
data can help us to understand why, how, or what happened behind certain behaviors.
• e. g. customer remarks, product reviews, etc. Qualitative analysis is less tangible.
• Qualitative data concerns subjective characteristics and opinions – things that cannot be
expressed as a number.

Qualitative and quantitative forecasting techniques are two distinct approaches used to
make predictions or estimates about future events or trends. Let's study each of these
techniques:

Qualitative Forecasting:
• Qualitative forecasting techniques rely on subjective judgments, expert opinions, and
qualitative data to make predictions. These techniques are typically used when historical
data is limited, unreliable, or unavailable, and when the focus is on understanding and
incorporating qualitative factors that may influence the future. The Qualitative forecasting
method is primarily based on fresh data like surveys and interviews, industry benchmarks, and competitive analysis. This technique is useful for newly launched products, or verticals
wherein historical data doesn’t exist yet.

• Qualitative forecasting can help a company make predictions about their financial standing
based on opinions in the company. If you work as a manager or other high-level employee,
you can use forecasts to assess and edit your company's budget. Knowing how to use
qualitative forecasting can benefit your company by allowing input from external and
internal sources.

• Qualitative forecasting is a method of making predictions about a company's finances that uses judgment from experts. Expert employees perform qualitative forecasting by
identifying and analyzing the relationship between existing knowledge of past operations
and potential future operations. This allows the experts to make estimates about how a
company might perform in the future based on the opinions they offer and the information
they collect from other sources, like staff polls or market research.

• Qualitative models have typically been successful with short-term predictions, where the
scope of the forecast was limited. Qualitative forecasts can be thought of as expert-driven,
in that they depend on market mavens or the market as a whole to weigh in with an
informed consensus.

• Qualitative models can be useful in predicting the short-term success of companies, products, and services, but they have limitations due to their reliance on opinion over
measurable data.

• Qualitative forecasting models are useful in developing forecasts with a limited scope.
These models are highly reliant on expert opinions and are most beneficial in the short
term. Examples of qualitative forecasting models include interviews, on-site visits, market
research, polls, and surveys that may apply the Delphi method (which relies on aggregated
expert opinions).

• Gathering data for qualitative analysis can sometimes be difficult or time-consuming. The
CEOs of large companies are often too busy to take a phone call from a retail investor or
show them around a facility. However, we can still sift through news reports and the text
included in companies’ filings to get a sense of managers’ records, strategies, and
philosophies.

Importance of qualitative forecasting:


1. Qualitative forecasting is important for helping executives make decisions for a
company. Performing qualitative forecasting can inform decisions like how much
inventory to keep, whether a company should hire new staff members and how they
can adjust their sales operations.
2. Qualitative forecasting is also crucial for developing projects like marketing
campaigns, as it can provide information about a company's service that can
highlight which elements of the business to feature in advertisements.

3. Some benefits of qualitative forecasting include the flexibility to use sources other
than numerical data, the ability to predict future trends and phenomena in business
and the use of information from experts within a company's industry.

Qualitative Forecasting Techniques:

Some commonly used qualitative forecasting techniques include:

▪ Expert Opinion: Gathering insights and predictions from subject matter experts or
individuals with domain knowledge and expertise.
▪ Executive opinions: Upper management uses intuition to make decisions.
▪ Internal polling or Panel Consensus: Customer-facing employees share insights
about customers.
▪ Panel approach: This can be a panel of experts or employees from across a business, such as sales and marketing executives, who get together and act like a focus group, reviewing data and making recommendations. Although the outcome is likely to be more balanced than one person’s opinion, even experts can get it wrong.
▪ Delphi Method: A structured approach that involves obtaining anonymous input
from a panel of experts through a series of questionnaires, followed by iterative
feedback and consensus building. Experts share their projections in a panel
discussion. The Delphi method is commonly used for technological forecasting.
This method is commonly used to forecast trends based on the information given by a panel of experts. The method takes its name from the Oracle of Delphi. It assumes that a group's answers are more useful and unbiased than answers provided by one individual. The total number of rounds involved may differ depending on the goal of the company or group's researchers.
▪ These experts answer a series of questions in continuous rounds that ultimately lead
to the "correct answer" a company is looking for. The quality of information
improves with each round as the experts revise their previous assumptions
following additional insight from other members of the panel. The method ends
upon completion of a predetermined metric.
▪ Market Research: Conducting surveys, focus groups, or interviews to gather
qualitative information from customers, stakeholders, or target audience.
Customers report their preferences and answer questions.
▪ Historical Analysis: This kind of forecasting is used to forecast sales on the
presumption that a new product will have a similar sales pattern to that of an
existing product.
▪ Scenario Analysis and Scenario planning: Developing multiple scenarios or
hypothetical situations based on different assumptions and exploring the potential
outcomes for each scenario. This can be used to deal with situations with greater
uncertainty or longer-range forecasts. A panel of experts is asked to devise a range
of future scenarios, likely outcomes and plans to ensure the most desirable one is
achieved. For example, predicting the impact of a new sales promotion, estimating
the effect a new technology may have on the marketplace or considering the
influence of social trends on future buying habits.

▪ Qualitative forecasting techniques are subjective in nature and rely on human
judgment, making them useful when dealing with complex or uncertain situations,
emerging trends, or when there is limited historical data available.

Industries that use qualitative forecasting

Companies in almost any industry can use qualitative forecasting to make predictions about
their future operations. Here's how a few industries might use qualitative forecasting:

• Sales: Qualitative forecasting can help companies in sales make decisions like
how much of a product to produce and when they should order more inventory.
• Healthcare: Healthcare employees can use qualitative forecasting to identify
trends in public health and decide which healthcare operations might be in high
demand in the near future.
• Higher Education: Colleges or universities can use qualitative forecasting to
predict the number of students who might enroll for the next term or year.
• Construction and manufacturing: Qualitative forecasting can show construction
and manufacturing companies the quantity of different materials they use to help
determine which materials or equipment they might need for their next project.
• Agriculture: Farmers can use qualitative forecasting to assess their sales and
decide which crops to plant for the next season based on which products consumers
purchase most often.
• Pharmaceutical: Qualitative forecasting can help pharmaceutical companies identify which medications are popular among consumers and which needs people are using them for, and thus predict which kinds of pharmaceuticals they might benefit from developing.

Know your progress:

Q1. The choice of a forecasting method should be based on an assessment of the costs and
benefits of each method in a specific application.

A. True
B. False
Q.2 Surveys and opinion polls are qualitative techniques.
A. True
B. False
Q.3 The Delphi method generates forecasts by surveying consumers to determine their
opinions.
A. True
B. False

Quantitative Forecasting:

• Quantitative forecasting techniques, on the other hand, rely on historical data and
mathematical models to make predictions. Quantitative forecasting methods use
past data to determine future outcomes. These techniques are data-driven and
involve analyzing patterns, trends, and statistical relationships in the available data
to project future outcomes.

• The formulas used to arrive at a value are entirely based on the assumption that the
future will majorly imitate history.

• Quantitative forecasting takes historical demand data and combines it with mathematical formulas to determine future performance. For this reason, it is also
often called statistical demand forecasting. Data sets can go back decades, can be
run for the last calendar year, or can be based on the previous few weeks’
consumption.

• Understanding how your company's past might affect its future is essential for
managing a business or working in sales. One method for evaluating your company
successfully is quantitative forecasting, which uses gathered data to draw conclusions about future prospects. Understanding quantitative
forecasting may help you see future sales estimates and make smarter business
decisions, whether you're running your own company or attempting to predict the
future of a certain product.

• Quantitative models discount the expert factor and try to remove the human element
from the analysis. These approaches are concerned solely with data and avoid the
fickleness of the people underlying the numbers. These approaches also try to
predict where variables such as sales, gross domestic product, housing prices, and
so on, will be in the long term, measured in months or years.

Importance of quantitative forecasting:


Examining data and creating inferences using quantitative forecasting is important because
it provides:

1. Objectivity: Numbers are neutral and free from any subjective judgment.
Examining empirical data provides a standard of objectivity that is useful for
making important business decisions. This makes realistic projections easier to
calculate and guarantees that the information is trustworthy.
2. Reliability: As analysts record and use accurate data in quantitative forecasting,
the inferences made become more reliable. Quantitative forecasting takes
advantage of the available information to provide reliable and accurate predictions
based on an established history. This makes it easier for business owners or
salespeople to pinpoint areas for growth.
3. Transparency: Because data reflects exactly how a business is performing, it
provides a level of transparency that can be very useful for quantitative forecasts.
The collected records present all of the information accurately and openly, which
provides an added level of clarity for making future business decisions.
4. Predictability: When businesses monitor their history and record their data for
quantitative forecasts, it makes trends easier to identify and predict. Using this information, businesses can set realistic expectations and adjust their goals to
measure growth.

Quantitative Forecasting Techniques:

Some commonly used quantitative forecasting techniques include:

▪ Time Series Analysis: Analyzing historical data to identify patterns, trends, seasonality, and other time-dependent factors to forecast future values. Techniques
such as moving averages, exponential smoothing, and ARIMA models are
commonly used in time series analysis. Time series use past data to predict future
events. The difference between the time series methodologies lies in the fine details,
for example, giving more recent data more weight or discounting certain outlier
points. By tracking what happened in the past, the forecaster hopes to get at least a
better than average view of the future. This is one of the most common types of
business forecasting because it is inexpensive and no better or worse than other
methods.

▪ Regression Analysis: Building mathematical models that capture the relationship between a dependent variable and one or more independent variables. These models
can be used to make predictions by applying them to new data. Regression analysis
is a method of forecasting that uses historical data to predict future trends. This
method is used when there is a relationship between two or more variables.
Regression analysis involves analyzing historical data, identifying the relationship
between variables, and using this information to make predictions about the future.
This method is useful when the data is affected by external factors.

▪ Straight-line method: Businesses evaluate recent growth and predict how growth
might continue influencing data. The straight-line method is one of the simplest and
easy-to-follow forecasting methods. A financial analyst uses historical figures and
trends to predict future revenue growth.

▪ Machine Learning: Utilizing various machine learning algorithms such as
decision trees, random forests, support vector machines, or neural networks to learn
patterns from historical data and make predictions or classifications.

▪ Naive method: Businesses review historical data and assume future behavior will
reflect past behavior. The naive method bases future predictions by anticipating
similar results to the data gathered from the past. This method does not account for
seasonal trends or any other patterns that might arise in the collected data. It is the
most straightforward forecast method and is often used to test the accuracy of other
methods.

▪ Trend projection: Trend projection uses your past sales data to project your future
sales. It is the simplest and most straightforward demand forecasting method.
It’s important to adjust future projections to account for historical anomalies. For
example, perhaps you had a sudden spike in demand last year. However, it
happened after your product was featured on a popular television show, so it is
unlikely to repeat. Or your eCommerce site got hacked, causing your sales to
plunge. Be sure to note unusual factors in your historical data when you use the
trend projection method.
▪ Seasonal index: Businesses analyze historical data to find seasonal patterns.
▪ Moving average method: Businesses determine averages over a large time duration. Moving averages are a smoothing technique that looks at the underlying pattern of a set of data to establish an estimate of future values. The most common types are the 3-month and 5-month moving averages.
▪ Exponential smoothing: It is similar to the moving average method except that recent data points are given more weight. It needs a single parameter called alpha, also known as the smoothing factor. Alpha controls the rate at which the influence of past observations decreases exponentially. The parameter is set to a value between 0 and 1. (A short R sketch of both methods appears after this list.)

▪ The indicator approach: The indicator approach depends on the relationship
between certain indicators, for example, GDP and the unemployment rate
remaining relatively unchanged over time. By following the relationships and then
following leading indicators, you can estimate the performance of the lagging
indicators by using the leading indicator data.

▪ Econometric modeling: This is a more mathematically rigorous version of the indicator approach. Instead of assuming that relationships stay the same,
econometric modeling tests the internal consistency of datasets over time and the
significance or strength of the relationship between datasets. Econometric modeling
is applied to create custom indicators for a more targeted approach. However,
econometric models are more often used in academic fields to evaluate economic
policies. Econometric modeling is a method of forecasting that uses economic data
to predict future trends. This method involves analyzing economic data, identifying
patterns and trends, and using this information to make predictions about the future.
This method is useful when there are economic factors that can affect the data.
▪ Extrapolation: a method of prediction that assumes the patterns that existed in the
past will continue on into the future and that those patterns are regular and can be
measured. In other words, the past is a good indicator of the future. It is one of the easiest and simplest methods of forecasting and is used where there is enough past data available that is not expected to change drastically in the near future.

Quantitative forecasting techniques require accurate and reliable historical data and assume
that past patterns and relationships will continue in the future. These techniques are most
effective when there is a significant amount of historical data available and when the
underlying factors influencing the forecast are stable and measurable.
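As a small illustration of the moving average and exponential smoothing methods described in the list above, here is a minimal R sketch; the monthly sales figures are made up purely for demonstration:

# Illustrative monthly sales figures (made-up numbers)
sales <- c(120, 135, 128, 150, 162, 158, 170, 182, 176, 190, 205, 198)

# 3-period moving average (one-sided, using only current and past values)
ma3 <- stats::filter(sales, rep(1/3, 3), sides = 1)

# Simple exponential smoothing: S_t = alpha * y_t + (1 - alpha) * S_(t-1)
alpha <- 0.3                      # smoothing factor between 0 and 1
ses <- numeric(length(sales))
ses[1] <- sales[1]                # initialize with the first observation
for (t in 2:length(sales)) {
  ses[t] <- alpha * sales[t] + (1 - alpha) * ses[t - 1]
}

cbind(sales, moving_avg_3 = as.numeric(ma3), exp_smooth = ses)

Increasing alpha makes the smoothed series react more quickly to recent changes, while a smaller alpha produces a smoother, slower-reacting estimate.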

Know your progress:


Q.1 which one of the following is the simplest and easiest method of forecasting?
A. exponential smoothing
B. Moving average method

C. Regression
D. Extrapolation
Q.2 ________are a smoothing technique that looks at the underlying pattern of a set of data
to establish an estimate of future values.
A. Moving averages
B. Seasonal index
C. Exponential smoothing
D. Extrapolation

Advantages of Quantitative Methods of Forecasting


There are various advantages to using quantitative methods of forecasting in business:
Accuracy: Quantitative methods of forecasting rely on data analysis and mathematical
models, which make predictions more accurate.
Objectivity: Quantitative methods of forecasting are objective and based on historical
data, which reduces the impact of personal bias.
Consistency: Quantitative methods of forecasting are consistent and can be used
repeatedly, making them reliable.
Ease of Use: Quantitative methods of forecasting are relatively easy to use and require
minimal expertise in statistics and mathematics.

Disadvantages of Quantitative Methods of Forecasting


While there are several advantages to using quantitative methods of forecasting, there are
also some limitations:

1. Data Availability: Quantitative methods of forecasting rely on historical data, which may not always be available or reliable.

2. Assumptions: Quantitative methods of forecasting make assumptions about future patterns based on past data, which may not always hold true.

3. External Factors: Quantitative methods of forecasting may not account for
external factors that can affect future trends.

It's worth noting that a combination of qualitative and quantitative techniques can be
employed in some forecasting scenarios. For example, qualitative inputs can be used to
inform or adjust quantitative models, or qualitative insights can be used to interpret and
validate quantitative forecasts. The choice of forecasting technique depends on the specific
situation, available data, and the level of accuracy and precision required for the forecast.

Qualitative vs quantitative forecasting


Quantitative forecasting is different from qualitative forecasting because quantitative
forecasting relies on numerical values and calculations to make predictions and inform
decision-making. While qualitative forecasting works through analyzing judgments and opinions, quantitative forecasting operates based on objective data from past operations to inform a company's decisions. Quantitative data also breaks into two categories, which are
historical data forecasts and associative data forecasts. These forecasts involve
mathematical calculations and can help a company identify trends in areas like sales or
investments.

Time Series Forecasting

While time-series data is information gathered over time, various types of information describe
how and when that information was gathered. For example:

1. Time series data: It is a collection of observations on the values that a variable takes at
various points in time.

2. Cross-sectional data: Data from one or more variables that were collected
simultaneously.
3. Pooled data: It is a combination of cross-sectional and time-series data.

Time series analysis has a range of applications in statistics, sales, economics, and many more
areas. The common point is the technique used to model the data over a given period of time.

The reasons for doing time series analysis are as follows:

1. Features: Time series analysis can be used to track features like trend, seasonality, and
variability.
2. Forecasting: Time series analysis can aid in the prediction of stock prices. It is used if you
would like to know if the price will rise or fall and how much it will rise or fall.
3. Inferences: You can predict the value and draw inferences from data using Time series
analysis.

What is a Time Series?


A time series is a sequence of data points that occur in successive order over some period.
This can be contrasted with cross-sectional data, which captures a point in time.

• A time series is a data set that tracks a sample over time.


• In particular, a time series allows one to see what factors influence certain variables
from period to period.
• Time series analysis can be useful to see how a given asset, security, or economic
variable changes over time.
• Forecasting methods using time series are used in both fundamental and technical
analysis.
• Although cross-sectional data is seen as the opposite of time series, the two are
often used together in practice.

• Time-series data is a sequence of data points collected over time intervals, allowing
us to track changes over time.
• Time-series data can track changes over milliseconds, days, or even years.
Businesses are often very interested in forecasting time series variables.
• In time series analysis, we analyze the past behavior of a variable to predict its
future behavior.
• A time series is a set of observations on a variable's outcomes in different time
periods: the quarterly sales for a particular company during the past five years, for
example, or the daily returns on a traded security.
• A time series data example can be any information sequence that was taken at
specific time intervals (whether regular or irregular).
• Time-series analysis is a method of analyzing a collection of data points over a
period of time. Instead of recording data points intermittently or randomly, time
series analysts record data points at consistent intervals over a set period of time.
• Time series contains observation in the numerical form represented in
chronological order. Analysis of this observed data and applying it as input to
derive possible future developments was popularized in the late 20th century. It was
primarily due to the textbook on time series analysis written by George E.P. Box
and Gwilym M. Jenkins. They introduced the procedure to develop forecasts using
the input based on the data points in the order of time, famously known as Box-
Jenkins Analysis.

Examples of time series


• Stock Data: Stock exchange data. Stock market analysis is an excellent example of
time series analysis in action, especially with automated trading algorithms.
• Marketing: Sales, inventory, customer counts, etc.
• Economics: Interest rates, GDP, employment etc.
• Energy: (Electricity, Gas, Oil and Solar) demands, prices etc.
• Sensors: IOT data
• Weather: Local and global temperature data. The weather today is usually more
similar to the weather tomorrow than to the weather a month from now, so
predicting the weather based on past weather observations is a time series
problem. Time series analysis is ideal for forecasting weather changes, helping
meteorologists predict everything from tomorrow's weather report to future years
of climate change.
• Common data examples could be anything from heart rate to the unit price of store
goods.

Time-series forecasting
Time-series forecasting is a type of statistical or machine learning approach that tries to
model historical time-series data to make predictions about future time points. Time series
analysis is perhaps the most common statistical demand forecasting model.

Time series forecasting is a method of predicting future values based on historical patterns
and trends in a sequence of data points ordered over time. It is widely used in various fields,
including finance, economics, weather forecasting, sales forecasting, and demand
forecasting, among others.

The goal of time series forecasting is to understand and capture the underlying patterns and
dependencies within the data and use them to make accurate predictions about future
values. These predictions can help in making informed decisions, planning resources, and
identifying potential risks or opportunities.

Time-series regression is a statistical method of forecasting future values based on
historical data. The forecast variable is also called the regressand, dependent, or explained
variable.

Components of Time-series forecasting

1. Secular Trend: The general tendency of a time series to increase, decrease, or
stagnate over a long period of time. The trend shows the common tendency of the data:
it may move upward (increase) or downward (decrease) over a certain, long period of
time. The trend is a stable, long-term general tendency of movement of the data.
Examples of trends include agricultural production and population growth. In summer,
the temperature may rise or fall within a single day, but the overall trend during the
first two months will show how the heat has been rising from the beginning. A trend
can be either linear or non-linear.

2. Seasonal Trend or Variations: Fluctuations that occur within a year, season by
season. Seasonal variations are changes in a time series that occur in the short term.
These variations are often recorded on hourly, daily, weekly, monthly, or quarterly
schedules. Festivals, customs, fashions, habits, and various occasions, such as
weddings, influence seasonal variations. For example, the sale of umbrellas increases
during the rainy season, and the sale of air conditioners increases during summer.
Apart from natural occurrences, man-made conventions like fashion, the marriage
season, festivals, etc., play a key role in contributing to seasonal trends.

3. Cyclical Variation: Changes in the series caused by circumstances that repeat in
cycles. Such oscillatory movements of a time series often have a duration of more than
a year. One complete period of oscillation is called a cycle or a 'Business Cycle'.
Cyclical variations for any business may contain four phases - prosperity, recession,
depression, and recovery. They may be regular or non-periodic in nature depending on
the situation. Normally, cyclical variations occur due to a combination of two or more
economic forces and their interactions.

4. Irregular or random variations: Variations caused by unpredictable influences that
are not regular and do not repeat in a particular pattern. These are purely irregular and
random movements. As the name suggests, no hypothesis or trend can be used to
explain irregular or random movements in a time series. These outcomes are
unforeseen, erratic, unpredictable, and uncontrollable in nature.

Earthquakes, war, famine, and floods are some examples of random time series
components.
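
These components can be seen directly by decomposing a series in R. The sketch below is only one possible illustration: it uses the AirPassengers dataset built into R and the base decompose() function (stl() is a common alternative) to split the series into trend, seasonal, and irregular parts:

# A minimal sketch: splitting a time series into trend, seasonal and
# irregular (random) components using base R.
data("AirPassengers")              # monthly airline passenger totals, 1949-1960

parts <- decompose(AirPassengers)  # moving-average based decomposition

plot(parts)                        # panels: observed, trend, seasonal, random
head(parts$seasonal, 12)           # the repeating within-year seasonal pattern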

Know your progress:


Q.1 A time series consists of:
A. Short-term variations
B. Long-term variations
C. Irregular variations
D. All of the above

Q.2 Increase in the number of patients in the hospital due to heat stroke is:
A. Secular trend
B. Irregular variation
C. Seasonal variation
D. Cyclical variation

Q.3 The sales of a departmental store on Dussehra and Diwali are associated with the _________
variation component of a time series.
A. Trend
B. Seasonal

C. Irregular
D. Cyclical

Q.4 Wheat crops badly damaged on account of rains is:


A. Cyclical movement
B. Random movement
C. Secular trend
D. Seasonal movement

Time series forecasting process:

1. Data collection: Collect historical data points ordered by time. This data can
include measurements, observations, or any relevant information related to the
phenomenon being studied.

2. Data exploration and preprocessing: Analyze the data to understand its
characteristics, such as trends, seasonality, and outliers. Clean the data by handling
missing values, outliers, and any other inconsistencies.

3. Stationarity: Check if the time series is stationary, meaning its statistical properties
remain constant over time. Stationarity is often assumed for many forecasting
models. If the series is non-stationary, transformations or differencing techniques
can be applied to make it stationary.

4. Model selection: Choose a suitable forecasting model based on the characteristics
of the data. Following are some common models:
• Moving Average (MA) models
• Autoregressive (AR) models
• Autoregressive Moving Average (ARMA) models
• Autoregressive Integrated Moving Average (ARIMA) models

• Exponential Smoothing (ES) models
• Seasonal ARIMA (SARIMA) models
• Long Short-Term Memory (LSTM) networks (a type of deep learning model)
The selection of the model depends on factors such as the presence of seasonality,
trend, and the complexity of the data.

5. Model fitting and evaluation: Fit the selected model to the training data and tune
its parameters if necessary. Evaluate the model's performance using appropriate
metrics such as mean absolute error (MAE), mean squared error (MSE), or root
mean squared error (RMSE). Use validation data or cross-validation techniques to
assess the model's accuracy.

6. Forecasting: Once the model is trained and validated, use it to make predictions
on unseen future data points. Generate forecasts for the desired time horizon,
considering the level of uncertainty associated with the predictions.

7. Model refinement: Periodically re-evaluate and refine the forecasting model as
new data becomes available. This helps to ensure the model's accuracy and
adaptability to changing patterns in the time series.

Time series forecasting can be a complex task, and the choice of models and
techniques depends on the specific characteristics of the data and the goals of the
forecasting task. It's often beneficial to explore multiple models and compare their
performance to select the most appropriate one for a given application.
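
To make the process above concrete, here is a minimal fit-evaluate-forecast sketch in R. It assumes the third-party forecast package is installed; auto.arima() selects and fits an ARIMA model automatically, accuracy() reports error metrics such as MAE and RMSE, and forecast() produces future values with prediction intervals. This is an illustrative sketch, not a complete forecasting study:

# A minimal sketch of the forecasting process, assuming the 'forecast' package.
library(forecast)

data("AirPassengers")                               # built-in monthly series
train <- window(AirPassengers, end = c(1958, 12))   # hold out the last two years
test  <- window(AirPassengers, start = c(1959, 1))

fit <- auto.arima(train)       # automated model selection and fitting
summary(fit)                   # inspect the chosen ARIMA model

fc <- forecast(fit, h = length(test))   # forecast over the test horizon
accuracy(fc, test)                      # evaluation: MAE, RMSE, MAPE, etc.

plot(fc)                                # forecasts with prediction intervals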

Know your progress:


Q.1 Time-series analysis generates forecasts by identifying cause and effect relationships
between variables.
A. True
B. False
Q.2 Time-series data are observations on a variable at different points in time.

A. True
B. False
Q.3 The fundamental assumption of time-series analysis is that past patterns in time-series
data will continue unchanged in the future.
A. True
B. False
Q.4 Time-series forecasting tends to be more accurate than "naive" forecasting.
A. True
B. False
Q.5 The long-run increase or decrease in time-series data is referred to as a cyclical
fluctuation.
A. True
B. False

Importance of time series forecasting


• To forecast future events.
• It shows why trends exist in past data and how they may be explained by underlying
patterns or processes.
• It is a basic tool for the analysis of natural systems. Climate cycles and fluctuations in the
economy, or volcanic eruptions and earthquakes, are examples of natural systems, whose
behavior can best be studied using time series analysis.
• Time series analysis is helpful in financial planning as it offers insight into the future data
depending on the present and past data of performance. It can lead to the estimation of an
expected time’s data by checking the current and past data. That means, time series is used
to determine the future by using the trends and valuations of the past and present.

Choosing the Right Forecasting Method


• The right forecasting method will depend on the type and scope of the forecast. Qualitative
methods are more time-consuming and costly but can make very accurate forecasts given
a limited scope. For instance, they might be used to predict how well a company’s new
product launch might be received by the public.

• For quicker analyses that can encompass a larger scope, quantitative methods are often
more useful. Looking at big data sets, statistical software packages today can crunch the
numbers in a matter of minutes or seconds. However, the larger the data set and the more
complex the analysis, the pricier it can be.

• Thus, forecasters often make a sort of cost-benefit analysis to determine which method
maximizes the chances of an accurate forecast in the most efficient way. Furthermore,
combining techniques can be synergistic and improve the forecast’s reliability.

Limitations of forecasting:

• Forecasting can be dangerous. Forecasts can become a fixation for companies and
governments, mentally limiting their range of actions by presenting the short- to long-term
future as pre-determined. Moreover, forecasts can easily break down due to random elements
that cannot be incorporated into a model, or they can simply be wrong from the start.
• But business forecasting is vital for businesses because it allows them to plan production,
financing, and other strategies. However, there are three problems with relying on
forecasts:
• The data is always going to be old. Historical data is all we have to go on, and there is no
guarantee that the conditions in the past will continue in the future.
• It is impossible to factor in unique or unexpected events, or externalities. Assumptions are
dangerous, such as the assumption that banks were properly screening borrowers prior to
the subprime meltdown. Black swan events have become more common as our reliance on
forecasts has grown.
• Forecasts cannot integrate their own impact. By having forecasts, accurate or inaccurate,
the actions of businesses are influenced by a factor that cannot be included as a variable.
This is a conceptual knot. In a worst-case scenario, management becomes a slave to
historical data and trends rather than worrying about what the business is doing now.

• Negatives aside, business forecasting is here to stay. Appropriately used, forecasting allows
businesses to plan ahead for their needs, raising their chances of staying competitive in the
markets. That's one function of business forecasting that all investors can appreciate.

SUMMARY
▪ Quantitative forecasting methods use past data to determine future outcomes.
▪ The qualitative forecasting method is primarily based on fresh data like surveys and
interviews, industry benchmarks, and competitive analysis.
▪ Quantitative methods of forecasting are an essential tool used by businesses to make
informed decisions about the future.
▪ Regression Analysis, Naïve Method, etc. are Quantitative Techniques, whereas Delphi
Method, Market Research, etc. are Qualitative Techniques
▪ Time-series is a set of observations on a quantitative variable collected over time.
▪ Time series analysis consists of two steps: build a model that represents the time series,
then validate the proposed model and use it.
▪ Trend, Cyclical Variation, Seasonal Variation, and Irregular Variation are the
components of Time Series.

References
1. Data Warehousing, OLAP and Data Mining by S Nagabhushana
2. Data Mining Concepts and Techniques by Jiawei Han, Micheline Kamber, Jian Pei
3. Data Mining and Analysis - Fundamental Concepts and Algorithms by Mohammed
J. Zaki and Wagner Meira Jr.
4. Data Analytics made Accessible by Dr. Anil Maheshwari
5. Data Science and Big Data Analytics by EMC Education Services.
6. Business Analytics using R – A Practical Approach by Dr. Umesh R. Hodeghatta,
Umesha Nayak

Review Questions:
Q1. What is forecasting? Explain business forecasting and its need.
Q.2 Discuss Qualitative and Quantitative forecasting.
Q.3 What are qualitative forecasting techniques?
Q.4 Explain quantitative forecasting techniques.
Q.5 What are the advantages and disadvantages of quantitative forecasting?
Q.6 What are the limitations of forecasting methods?
Q.7 Differentiate between qualitative and quantitative forecasting methods.
Q.8 Explain time series forecasting.
Q.9 Explain time series forecasting components with examples.
Q.10 Explain the time series forecasting process.

Further Readings:
1. Introductory Time Series with R (Use R!) 2009th Edition by Paul S.P. Cowpertwait,
Andrew V. Metcalfe
2. Forecasting principles and practice by Rob J Hyndman and George Athanasopoulos
3. Time Series Analysis and Its Applications: With R Examples 4th ed. 2017 Edition by
Robert H. Shumway, David S. Stoffer
4. Advances in Business and management forecasting by Kenneth D Lawrence
5. Forecasting methods and applications by Spyros G Makridakis, Steven C,
Wheelwright, Rob J Hyndman

Unit V: Social Network Analytics

Topics
• Social Network
• Social Network Analytics
• Text Mining
• Text Analytics
• Difference between Text Mining and Text analytics
• R Snippet to do Text Analytics

Objectives:
20. To understand the concept of social network
21. To understand the social network analytics and its process
22. To understand the concept of text mining
23. To comprehend the applications of text mining and text analytics
24. To apply the concept of Text Analytics using R Programming Language

Outcomes:
20. Understand the concept of social network.
21. Understand the social network analytics and its process.
22. Understand the concept of text mining.
23. Comprehend the applications of text mining and text analytics.
24. Apply the concept of Text Analytics using R Programming Language

Introduction to the Unit
This unit introduces social networks, social network analytics and the process of social network
analytics (SNA). The theoretical basics of social network analysis are briefly reviewed, and the
main methods needed to carry out this kind of analysis are covered. It addresses concerns with
data gathering, and social network structural metrics. The Unit also covers the advantages and
disadvantages of SNA. Also, the concept of text mining and applications of text mining are
discussed in this unit.

Social networks
• Social networks are websites and mobile apps that allow users and organizations to
connect, communicate, share information, and form relationships. Social networks have
become very popular in recent years because of the increasing proliferation and
affordability of internet enabled devices such as personal computers, mobile devices, and
other more recent hardware innovations such as internet tablets, etc. People can connect
with others in the same area, families, friends, and those with the same interests, e.g.,
Facebook, WhatsApp, Twitter, Instagram, etc.

• Social media and social networks are not the same thing, despite the terms frequently being
used interchangeably. An individual's connections and interactions with others are the main
emphasis of a social network. In social media, an individual sharing content with a large
audience is more important; media is used here in the same manner as it is in mass media.
The majority of social networks also function as social media websites.

• Social network analysis is a psychological study that looks at how people interact with each
other as individuals and groups. It enables one to understand the networks of relationships
between people in society and analyze the different cultural and relational paths societies
take.

• Social network analysis (SNA) is the method of investigating social structures with the
help of networks and graph theory.

• It characterizes networked structures in terms of nodes (individual actors, people, or things
within the network) and the ties, edges, or links (relationships or interactions) that connect
them.

• Social network analysis (SNA) has been used in many disciplines such as sociology,
anthropology, political science, psychology, philosophy, and many others.

• SNA is one of the most significant tools used in psychology research to investigate people's
thoughts and feelings in social contexts. The technique can be used for many purposes,
such as understanding the spread of disease, predicting crime rates, or understanding social
movements.

• The focus of a social network will be user-generated content. Users mostly engage with,
and view material created by other users. They are urged to provide text, status updates, or
images for public access.

• Users and organisations can build profiles on social networks. The profile includes the
person's bio and a core page containing the material they have uploaded. Their profile could
correspond to their legal name.

• A social network can help members establish enduring connections with one another.
Friending or following the other user are popular terms used to describe these ties. They
enable people to connect with one another and build social networks. Frequently, an
algorithm may suggest additional persons and businesses that they would like to connect
with.

• Social network analytics refers to the process of analyzing and interpreting data from social
networks to gain insights and understand the patterns, dynamics, and characteristics of

social relationships. It involves using various techniques and methodologies to extract
valuable information from social network data.


Purpose of Social Networks

Sharing: Geographically separated friends or family members may communicate remotely and
exchange information, updates, images, and videos. People can grow their existing social networks
or meet new people through social networking who share their interests.
Learning: Social media sites are excellent forums for education. Customers may quickly acquire
breaking news, updates on friends and family, or information on what's going on in their
neighbourhood.
Interacting: Social networking improves user interactions by removing time and distance
restrictions. People can have face-to-face conversations with anyone in the globe using cloud-
based video communication services like WhatsApp or Instagram Live.
Marketing: Companies may utilise social networking platforms to boost brand and voice
identification, increase customer retention and conversion rates, and increase brand awareness
among platform users.

Different types of social networking:

Social Connections: In this kind of social network, users may connect with friends, family,
acquaintances, brands, and more through online profiles and updates. Users can also meet new
people through shared interests. Instagram, Myspace, and Facebook are a few examples.

Professional connections: These social networks are made for business interactions and are
geared towards professionals. Such websites may be utilised, for instance, to research career
prospects, strengthen current business relationships, and develop new contacts in one's field.
They could give an exclusive platform focusing on certain professions or interests, or they might
feature a generic forum for professionals to communicate with colleagues. Examples are
Microsoft Viva, LinkedIn, and Yammer.

Sharing of Multimedia: YouTube and Flickr, among other social networks, offer facilities for
sharing videos and photos.

News or informational: Users can submit news articles, educational materials, or how-to guides
to this sort of social networking site, which can be either general-purpose or topic specific. These
social networks, which have a lot in common with web forums, host groups of users seeking
solutions to common issues. Members answer inquiries, host forums, or instruct others on how to
carry out various activities and projects, to foster a sense of assisting others. Examples in use
today include Digg, Stack Overflow, and Reddit.

Communication: Here, social networks emphasise enabling direct one-on-one or group
discussions between users. They resemble instant chat applications and place less emphasis on
posts or updates. WeChat, Snapchat, and WhatsApp are examples.

Educational: Remote learning is made possible by educational social networks, allowing students
and professors to work together on assignments, do research, and communicate through blogs and
forums. Popular examples include ePals, LinkedIn Learning, and Google Classroom, MS Teams,
Zoom, etc.

Advantages and Disadvantages of Social Networking

Advantages of Social Networking:

Brand Recognition: Social networking helps businesses to connect with both potential and
current customers. This increases brand awareness and helps make brands more relevant.
Instant reachability: Social networking websites may offer immediate reachability by removing
the geographic and physical distances between individuals.
Builds a following: Social networking may help businesses and organisations grow their clientele
and reach throughout the globe.
Business Achievement: On social networking sites, customers' favourable evaluations and
comments may boost sales and profitability for businesses.
Increased Usage of Websites: Social networking profiles may be used by businesses to increase
and steer inbound traffic to their websites. They can do this, for instance, by including motivating
images, employing plugins and social media sharing buttons, or promoting inbound links.

Disadvantages of Social Networks:

Rumours and misinformation: Social networking sites make it easy for inaccurate information
to spread, confusing and upsetting users. People frequently believe everything they read on social
networking platforms without checking the sources.

Negative Comments and Reviews: An established company may suffer from a single
unfavourable review, particularly if it is published on a website with a sizable audience. A
damaged corporate reputation frequently results in permanent harm.

Data security and privacy concerns: Social media platforms may unintentionally expose user
data to risk. For instance, when a social networking site suffers a data breach, all of its users are
immediately put at risk; Business Insider reported one such incident in April 2021, in which the
personal information of hundreds of millions of users of a major platform was exposed.

Time Consuming Process: Social media marketing for a company necessitates ongoing care and
maintenance. Regular post creation, updating, planning, and scheduling may take a lot of time.
Small companies that do not have the extra personnel or resources to devote to social media
marketing may find this to be particularly burdensome.

Social network analysis: mapping and measuring of relationships and flows between people,
groups, organizations, computers, or other information/knowledge processing entities; the nodes
in the network are the people and groups, while the links show relationships or flows between the
nodes – provides both a visual and a mathematical analysis of human relationships.

Based on the mathematical underpinnings of graph theory and the theoretical assumptions of
sociology, social network analysis is the study of structure and of how that structure affects
outcomes such as behaviour and health. Structure refers to the regular patterns in the interactions
between people, groups, and/or organisations.

Fig. Network Graph

Here are some key aspects and methods used in social network analytics:
Data collection: Social network analytics begins with collecting relevant data from social
networks, which may include information about individuals, their connections, and their
activities within the network. This data can be obtained from online platforms, such as
Facebook, Twitter, LinkedIn, or specialized datasets.
Network visualization: Visualizing social networks is an essential step in understanding
the structure and organization of the network. Network graphs and visual representations
help in identifying patterns, clusters, and influential nodes within the network.
Centrality measures: Centrality measures quantify the importance or prominence of
nodes within a social network. Degree centrality measures the number of connections a
node has, while betweenness centrality identifies nodes that act as bridges between
different parts of the network. Other measures include closeness centrality and eigenvector
centrality.

Community detection: Community detection algorithms help identify groups or
communities of nodes that have stronger connections with each other compared to the rest
of the network. It helps in understanding the formation of subgroups or clusters within the
network.

Link prediction: Link prediction techniques aim to predict future connections or
relationships between nodes in a social network. These predictions are based on existing
network structures, node attributes, and various algorithms.

Sentiment analysis: Sentiment analysis involves analysing textual data from social
network posts, comments, or messages to determine the sentiment expressed by
individuals. It helps in understanding the attitudes, opinions, and emotions within the
network.

Influence identification: Social network analytics can be used to identify influential nodes
or individuals who have a significant impact on information flow or decision-making
within the network. Influence identification methods may consider factors such as node
centrality, activity level, or community involvement.

Network dynamics: Analysing changes and dynamics in a social network over time
provides insights into the evolution of relationships, communities, and information
diffusion. It helps in understanding how networks grow, adapt, and transform.
Network diffusion: Network diffusion models study the spread of information, ideas, or
behaviours within a social network. These models help in understanding the mechanisms
of influence, viral marketing, or the propagation of trends and innovations.
Social network theory: Social network analytics draws upon theories and concepts from
social network analysis, sociology, graph theory, and other related disciplines. These
theories provide a framework for understanding social structures, relationships, and
interactions within networks.
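
To illustrate a few of these ideas in practice, the following R sketch uses the third-party igraph package on a tiny, invented friendship network (the names and ties are made up purely for illustration). It shows network visualization, the centrality measures mentioned above, and a simple community detection step:

# A minimal sketch of social network analysis in R, assuming the 'igraph' package.
# The friendship edges below are invented purely for illustration.
library(igraph)

edges <- data.frame(
  from = c("Asha", "Asha", "Ravi", "Ravi", "Meera", "John", "John"),
  to   = c("Ravi", "Meera", "Meera", "John", "John", "Sara", "Amit")
)

g <- graph_from_data_frame(edges, directed = FALSE)

plot(g)          # network visualization: nodes (actors) and ties (relationships)

degree(g)        # degree centrality: number of connections per node
betweenness(g)   # betweenness centrality: nodes acting as bridges
closeness(g)     # closeness centrality

comm <- cluster_louvain(g)   # community detection
membership(comm)             # which community each node belongs to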

Advantages of Social Network Analytics

• SNA can be used to identify people who are linked, but who may not be part of a formal
community. These people can be invited to join a community relevant to them.
• Improving effectiveness of functions or business units.
• Forming strategic partnerships or assessing client connectivity.
• Social network analytics enables researchers, organizations, and individuals to gain
valuable insights into social relationships, behaviors, and communication patterns, aiding
decision-making, marketing strategies, community management, and various research
endeavors.

Applications of Social Network Analysis:

Social network analysis (SNA) has a wide range of applications across various domains due to
its ability to reveal hidden patterns, connections, and insights within social networks. Here are
some key applications of social network analysis:

1. Organizational Analysis and Management:


• Identifying communication patterns within organizations to improve efficiency and
collaboration.
• Analyzing social networks within companies to identify informal leadership and power
structures.
• Enhancing knowledge sharing and innovation by identifying key knowledge brokers.
2. Marketing and Customer Relationship Management:

• Identifying key influencers and opinion leaders for targeted marketing campaigns.
• Analyzing customer connections to improve segmentation and personalized marketing
strategies.
• Monitoring brand reputation and sentiment on social media platforms.
3. Healthcare and Public Health:

• Tracking the spread of diseases and predicting disease outbreaks.

• Identifying individuals at high risk in epidemiological studies.
• Understanding patient referral patterns and healthcare provider collaborations.
4. Social Media Analysis:
• Identifying trending topics, hashtags, and influential users on social media platforms.
• Analyzing sentiment and emotions in user-generated content.
• Studying information diffusion and viral content propagation.
5. Counterterrorism and Security:
• Detecting and preventing radicalization by identifying key individuals or groups.
• Analyzing communication networks to track potential threats and activities.
• Understanding the structure of criminal networks and organized crime.
6. Academic Research:
• Studying collaboration and knowledge flow among researchers and academics.
• Analyzing citation networks to identify influential papers and researchers.
• Understanding scientific collaboration patterns and interdisciplinary research.
7. Political Science and Policy Analysis:
• Analyzing political connections and alliances among politicians.
• Studying lobbying activities and interest group networks.
• Tracking the flow of information and influence within political systems.
8. Supply Chain Management:
• Analyzing supplier-customer relationships to optimize supply chain efficiency.
• Identifying potential disruptions and vulnerabilities within supply networks.
• Improving coordination and collaboration among supply chain partners.
9. Education and Learning:
• Analyzing student interactions to enhance classroom dynamics and group projects.
• Identifying peer influence and social learning patterns.
• Studying the spread of educational innovations and best practices.

10. Urban Planning and Transportation:


• Analyzing transportation networks to improve traffic flow and infrastructure planning.

• Identifying commuting patterns and social interactions within cities.
• Studying social networks within neighborhoods for community development.

These are some examples of the various applications of social network analysis. SNA continues
to find new applications as technology evolves and more data becomes available for analysis.

Text mining
• Text mining is a subset of data mining, which is the process of finding patterns in large
volumes of data. Text mining identifies facts, relationships and assertions that would
otherwise remain buried in the mass of textual big data. Manually scanning and classifying
these documents can be extremely time-consuming, so automating text mining can save
businesses considerable time and effort. Managers can then use the discoveries to make
better informed decisions and quickly take action.

• Text mining is the process of exploring and analyzing large amounts of unstructured text
data aided by software that can identify concepts, patterns, topics, keywords, and other
attributes in the data.

• Text mining focuses specifically on unstructured data in everyday documents, such as
emails, text messages, survey responses, customer feedback, online reviews, support
tickets, websites, books, and articles.

• Text mining has become more practical for data scientists and other users due to the
development of big data platforms and deep learning algorithms that can analyze massive
sets of unstructured data.

• Mining and analyzing text help organizations find potentially valuable business insights in
corporate documents, customer emails, call center logs, verbatim survey comments, social
media posts, medical records, and other sources of text-based data. Increasingly, text
mining capabilities are also being incorporated into AI chatbots and virtual agents that

companies deploy to provide automated responses to customers as part of their marketing,
sales, and customer service operations.
• Once extracted, this information is converted into a structured form that can be further
analyzed, or presented directly using clustered HTML tables, mind maps, charts, etc. The
structured data created by text mining can be integrated into databases, data warehouses or
business intelligence dashboards and used for descriptive, prescriptive, or predictive
analytics.

Text Mining and Text Analytics


Text mining and text analytics are related concepts, but they are not the same; there is a subtle
difference between them. Both involve the analysis of unstructured text data to extract valuable
insights, but they may emphasize different aspects and employ distinct techniques. The two terms
are frequently used interchangeably. While text analytics produces quantitative results, text
mining is the process of extracting qualitative information from unstructured text. Here's a
breakdown of the differences between text mining and text analytics:
Text Mining:
• Text mining refers to the process of discovering patterns, trends, and valuable information
from large volumes of unstructured text data. It often involves the use of advanced
techniques from natural language processing (NLP), machine learning, and data mining to
extract meaningful insights from text. Text mining goes beyond simple keyword searches
and includes tasks such as:
• Information Retrieval: Extracting relevant documents or pieces of text based on specific
queries or criteria.
• Text Categorization and Classification: Automatically categorizing and labelling text
into predefined categories or classes.
• Named Entity Recognition (NER): Identifying and extracting names of people,
organizations, locations, and other entities from text.
• Sentiment Analysis: Determining the sentiment or emotional tone expressed in text (e.g.,
positive, negative, neutral).

• Topic Modelling: Identifying the underlying topics or themes within a collection of
documents.
• Relationship Extraction: Discovering relationships and connections between entities
mentioned in the text.
• Summarization: Creating concise and coherent summaries of lengthy documents or texts.
• Clustering: Grouping similar documents or texts together based on content similarity.

Text Analytics:

Text analytics is a broader term that encompasses a range of techniques used to process, analyse,
and interpret unstructured text data. It includes text mining as one of its components but also
includes other methods that focus on deriving insights from text. Text analytics often has a
stronger emphasis on business intelligence and decision-making. Text analytics refers to the
application that uses text mining techniques to sort through data sets. In order to extract insights
and patterns from massive amounts of unstructured text—text that does not follow a
predetermined format—text analytics integrates a variety of machine learning, statistical, and
linguistic approaches. It makes it possible for organizations, governments, scholars, and the
media to use the vast material at their disposal to make important choices. Sentiment analysis,
topic modelling, named entity identification, phrase frequency, and event extraction are just a
few of the techniques used in text analytics.

In addition to the tasks mentioned under text mining, text analytics may involve:
• Descriptive Analytics: Summarizing and describing text data to provide an overview of
its content.
• Predictive Analytics: Using text data to make predictions or forecasts about future events
or trends.
• Prescriptive Analytics: Offering recommendations or suggesting actions based on text
data analysis.
• Contextual Analysis: Understanding the context and context-dependent meanings of
words and phrases.

• Text Visualization: Creating visual representations of text data to aid in understanding
and exploration. Data visualization techniques can then be harnessed to communicate
findings to wider audiences. By transforming the data into a more structured format
through text mining and text analysis, more quantitative insights can be found through text
analytics.

Overall, while text mining is a specific subset of text analytics focused on extracting patterns and
insights from text data, text analytics encompasses a broader set of techniques that includes mining
as well as other analytical and interpretative approaches. Both text mining and text analytics play
important roles in turning unstructured text data into actionable information for various
applications and industries.

Steps Involved with Text Analytics

Several preliminary steps are used in the text analytics process to collect and clean the
unstructured material. Text analytics may be carried out in a variety of ways; the following is one
example of a model workflow.

1. Data collection: Text information is often dispersed throughout an organization's internal
databases, such as in customer conversations, emails, product evaluations, service
complaints, and Net Promoter Score surveys. In the form of blog postings, news articles,
product evaluations, social media updates, and web forum conversations, users also
provide external data. The external data must be obtained, even when the internal data is
easily accessible for analytics.
2. Data preparation: Once the unstructured text data has been made accessible, it must be
prepared before machine learning algorithms can analyze it. The majority of text analytics
software performs this step automatically. Natural language processing methods used in
text preparation include the following:
a. Tokenization: In the tokenization stage, the text analysis algorithms divide the
continuous string of text data into tokens, smaller units that make up complete words or
phrases. For instance, character tokens may represent each letter in the word "Fish"
individually; alternatively, you may segment by sub-word tokens, such as "fish" and
"ing" in "fishing". Tokens are the foundation of all natural language processing.
Additionally, all of the text's undesirable elements, including extra white space, are
removed in this stage.
b. Part of Speech Tagging: Each token in the data is given a grammatical category, such
as a noun, verb, adjective, or adverb, at this stage.
c. Parsing: Understanding the syntactical structure of a document is the process of
parsing. Two common methods for determining syntactical structure are constituency
parsing and dependency parsing.
d. Lemmatization and stemming: These two procedures are used in data preparation to
take away the tokens' affixes and suffixes while keeping the tokens' dictionary form, or
lemma.
e. Stopword removal: At this stage, all the tokens that appear frequently but add little
value to text analytics are removed. Words such as 'and', 'the', and 'a' fall under this
category.
3. Text analytics: Unstructured text data must first be prepared before text analytics
techniques may be used to provide insights. Text analytics uses a variety of methodologies.
Text categorization and text extraction stand out among them.
4. Text classification: This method is sometimes referred to as text tagging or text
categorization. In this phase, pieces of text are given specific tags based on their content.
For instance, labels like "positive" or "negative" are applied while analyzing customer
feedback. Rule-based or machine learning-based systems are frequently used for text
categorization. In rule-based systems, humans specify the connection between a linguistic
pattern and a tag: "good" may denote a favorable review, while "bad" may denote a
negative review.
To tag a new set of data, machine learning algorithms employ training data or examples
from the past. Larger amounts of data enable the machine learning algorithms to produce
correct tagging results, therefore the training data and its volume are essential. Support
Vector Machines (SVM), the Naive Bayes family of algorithms (NB), and deep learning
algorithms are the primary algorithms utilized in text categorization.

5. Text extraction: This is the process of taking recognizable, structured data out of the input
text's unstructured form. Keywords, names of individuals, places, and events are included
in this data. Regular expressions are one of the more straightforward techniques for text
extraction, but as the complexity of the input data rises, this strategy becomes difficult to
sustain. Conditional Random Fields (CRF) is a statistical method used in text extraction; it
is a modern and effective way of extracting vital information from unstructured text.
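
The data preparation steps described above (tokenization, stop word removal, stemming) can be sketched in R with the tidytext, dplyr, and SnowballC packages (all third-party, assumed installed). The two example sentences are invented purely for illustration; the pipeline tokenizes them, removes stop words, and stems each remaining word:

# A minimal sketch of text preparation, assuming 'dplyr', 'tidytext' and 'SnowballC'.
# The example sentences are invented for illustration.
library(dplyr)
library(tidytext)
library(SnowballC)

docs <- data.frame(
  id   = 1:2,
  text = c("The delivery was quick and the packaging was excellent",
           "Delivery took too long and the product was damaged"),
  stringsAsFactors = FALSE
)

prepared <- docs %>%
  unnest_tokens(word, text) %>%            # tokenization: one token (word) per row
  anti_join(stop_words, by = "word") %>%   # stop word removal
  mutate(stem = wordStem(word))            # stemming to the root form

print(prepared)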
Know your progress:

Q.1 Text analytics can be used in predicting:


A. Weather patterns
B. Consumer preferences
C. Planetary motion
D. DNA sequences
Q.2 What does sentiment analysis in text analytics involve?
A. Identifying patterns in text data
B. Predicting future events based on text.
C. Determining the emotional tone expressed in text.
D. Categorizing text into different topics
Q.3 What is the primary goal of text analytics?
A. Creating visual representations of text data
B. Extracting valuable insights from unstructured text data
C. Translating text from one language to another
D. Analyzing structured numerical data

Here are some commonly used algorithms in text mining:

• Naive Bayes: Naive Bayes is a probabilistic algorithm that is often used for text
classification tasks. It assumes that the presence of a particular feature in a class is
independent of the presence of other features, hence the "naive" assumption. Naive Bayes
is computationally efficient and works well with large text datasets.

• Support Vector Machines (SVM): SVM is a supervised learning algorithm that can be
used for both classification and regression tasks. In text mining, SVM is often employed
for text classification problems where the goal is to assign a document to one of the
predefined categories. SVM works by finding an optimal hyperplane that separates the data
points representing different classes.

• Decision Trees: Decision trees are hierarchical structures that recursively split the data
based on different features. In text mining, decision trees can be used for tasks like text
classification, topic modeling, and sentiment analysis. Decision trees are easy to interpret
and can handle both categorical and numerical features.

• Random Forests: Random forests are an ensemble learning method that combines
multiple decision trees to make predictions. In text mining, random forests can be used for
tasks such as text classification, sentiment analysis, and feature selection. Random forests
reduce overfitting and provide robust predictions by aggregating the results from multiple
decision trees.

• Hidden Markov Models (HMM): HMM is a statistical model that is widely used for
sequence analysis and natural language processing tasks. HMMs are particularly useful for
tasks such as part-of-speech tagging, named entity recognition, and speech recognition.
They model the underlying probabilistic transitions between different states and the
observed output based on those states.

• Text mining and text analysis identifies textual patterns and trends within unstructured data
using machine learning, statistics, and linguistics.
• Natural Language Understanding helps machines “read” text (or speech) by simulating the
human ability to understand a natural language such as English, Spanish, or Chinese.
• Text mining employs a variety of methodologies to process the text, one of the most
important of these being Natural Language Processing (NLP).

• NLP includes both Natural Language Understanding and Natural Language Generation,
which simulates the human ability to create natural language text e. g. to summarize
information or take part in a dialogue.

These are just a few examples of the algorithms used in text mining. The choice of
algorithm depends on the specific task, dataset, and goals of the analysis. It is common to use
a combination of multiple algorithms and techniques to extract meaningful insights from text
data.
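
As a toy illustration of one of these algorithms, the sketch below trains a Naive Bayes classifier with the third-party e1071 package. Real text classification would first convert documents into features (for example, a document-term matrix); here the keyword-presence features and spam/ham labels are simply invented to keep the example short:

# A toy sketch of Naive Bayes classification, assuming the 'e1071' package.
# The keyword-presence features and labels are invented for illustration.
library(e1071)

train <- data.frame(
  has_offer   = factor(c("yes", "yes", "no", "no", "no")),
  has_meeting = factor(c("no", "no", "yes", "yes", "no")),
  label       = factor(c("spam", "spam", "ham", "ham", "ham"))
)

model <- naiveBayes(label ~ ., data = train)   # fit the classifier

new_doc <- data.frame(
  has_offer   = factor("yes", levels = c("no", "yes")),
  has_meeting = factor("no",  levels = c("no", "yes"))
)

predict(model, new_doc)                 # predicted class for the new document
predict(model, new_doc, type = "raw")   # estimated class probabilities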

Common tasks and techniques used in text mining:


• Text Preprocessing: This step involves cleaning and transforming raw text data by
removing unnecessary characters, converting text to lowercase, tokenizing (splitting text
into words or phrases), removing stop words (common words like "and," "the," etc.),
stemming or lemmatizing (reducing words to their base form), and performing other
necessary transformations.

• Sentiment Analysis: Sentiment analysis aims to determine the sentiment or opinion
expressed in a piece of text, such as positive, negative, or neutral. It is often used to analyze
customer feedback, social media posts, or reviews to understand public opinion about
products, services, or events.

• Entity Recognition: Entity recognition involves identifying and classifying named entities
such as people, organizations, locations, dates, or other specific terms mentioned in the
text.
• Topic Modeling: Topic modeling is a technique that identifies the main topics or themes
present in a collection of documents. It can help uncover hidden patterns and structures
within the text data and is often used for document clustering, content recommendation,
and information retrieval.

• Text Classification: Text classification involves categorizing text documents into
predefined categories or classes. It is used for tasks like spam filtering, sentiment analysis,
news categorization, and document routing.

• Named Entity Recognition: Named Entity Recognition (NER) is the process of
identifying and extracting named entities such as names of persons, organizations,
locations, dates, etc., from text. It helps in information retrieval and understanding the
relationships between different entities.

• Text Summarization: Text summarization aims to generate a concise and coherent
summary of a longer piece of text. It can be done using extractive techniques, which
involve selecting and combining important sentences or phrases from the original text, or
abstractive techniques, which involve generating new sentences to capture the main ideas.

• Text mining involves the application of various algorithms and techniques from fields such
as natural language processing (NLP), machine learning, and statistics.

Text mining involves extracting meaningful information and insights from large volumes
of unstructured text data. It has numerous applications across various industries and
domains. Here are some key applications of text mining:

1. Sentiment Analysis:
• Analyzing social media posts, reviews, and comments to determine public
sentiment about products, services, or brands.
• Monitoring customer feedback to assess satisfaction levels and identify areas for
improvement.

2. Customer Feedback and Market Research:


• Extracting insights from customer surveys, focus group transcripts, and open-ended
responses.

• Identifying emerging trends and consumer preferences from online discussions and
forums.
3. Content Categorization and Classification:
• Automatically categorizing news articles, blog posts, or documents into relevant
topics or themes.
• Labeling emails or documents for better organization and retrieval.
4. Information Retrieval and Search Enhancement:
• Improving search engines by understanding user intent and returning more relevant
results.
• Extracting key information from documents to provide snippets or summaries in
search results.

5. Document Summarization:
• Generating concise and coherent summaries of lengthy documents for quick
understanding.
• Creating executive summaries of reports and research papers.
6. Named Entity Recognition (NER):
• Identifying and classifying entities such as names of people, organizations,
locations, and dates in text.
• Enhancing database entries and information retrieval by linking entities.
7. Topic Modeling:
• Discovering latent topics within a collection of documents.
• Understanding the main themes in a large corpus of text data.
8. Fraud Detection:
• Identifying patterns of fraudulent activities by analyzing textual descriptions and
transaction details.
• Detecting unusual or suspicious language in insurance claims or financial
documents.
9. Medical and Healthcare Applications:
• Mining electronic health records to identify patterns and trends in patient data.
• Extracting medical insights from research papers and clinical notes.
10. Legal and Compliance Analysis:
• Automating the review and analysis of legal contracts and documents.
• Identifying potential compliance violations in textual data.
11. Human Resources and Employee Feedback:
• Analyzing employee surveys and feedback to assess engagement levels and identify
workplace issues.
• Extracting skills, qualifications, and experience from resumes for recruitment
purposes.
12. Academic Research:
• Analyzing research papers to identify relevant citations and build citation networks.
• Extracting data for systematic reviews and literature surveys.
13. Social Media Analytics:
• Tracking trending topics and viral content on social media platforms.
• Understanding public opinions and reactions to current events.
14. Language Translation and Cross-Language Analysis:
• Enabling automatic translation of text between languages.
• Comparing sentiment, themes, and trends across different languages.
15. Competitive Intelligence:
• Analyzing competitor websites, press releases, and documents to gain insights into
their strategies and offerings.
• Identifying emerging competitors in the market.

Text mining's applications are numerous and continually expanding as organizations
recognize the value of extracting insights from unstructured text data to inform decision-
making and strategy development.

Advantages of text mining


• Using text mining and analytics to gain insight into customer sentiment can help
companies detect product and business problems and then address them before they
become big issues that affect sales.
• Mining the text in customer reviews and communications can also identify desired
new features to help strengthen product offerings. In each case, the technology
provides an opportunity to improve the overall customer experience, which will
hopefully result in increased revenue and profits.
• Text mining can also help predict customer churn, enabling companies to take
action to head off potential defections to business rivals, as part of their marketing
and customer relationship management programs.
• In healthcare, text mining may be able to help diagnose illnesses and medical
conditions in patients based on the symptoms they report.

While text mining offers several benefits, it also comes with a few disadvantages and challenges.
Here are some of the key disadvantages of text mining:

1. Ambiguity and Contextual Understanding: Text often contains ambiguity, idiomatic
expressions, sarcasm, and cultural nuances that can be challenging for automated
algorithms to accurately interpret.
2. Lack of Domain Knowledge: Text mining algorithms might struggle to accurately
analyze specialized or technical content without domain-specific knowledge, leading
to misinterpretations.
3. Data Quality and Noise: Text data can be noisy, containing misspellings, grammatical
errors, and inconsistent formatting that can affect the accuracy of analysis.
4. Data Volume and Scalability: Processing large volumes of text data requires
significant computational resources and can lead to performance bottlenecks.
5. Bias and Fairness: Text mining algorithms can inherit and amplify biases present in
the training data, leading to unfair or biased results, especially when dealing with
sensitive topics or underrepresented groups.
6. Privacy Concerns: Extracting insights from text data might unintentionally reveal
sensitive personal information, raising privacy concerns.
7. Complexity of Natural Language: Natural language is complex, with intricate
sentence structures, wordplay, and metaphorical expressions that can challenge
automated analysis.

8. Subjectivity and Tone Analysis: Determining the emotional tone, sentiment, or
intention behind text can be challenging due to the subjectivity of human language.
9. Semantic Understanding: Extracting accurate meaning and context from text requires
deep semantic understanding, which current algorithms might struggle with.
10. Unstructured Data: Text data is unstructured, making it harder to process and analyze
compared to structured data.
11. Constantly Evolving Language: Languages and vocabularies change over time,
leading to challenges in keeping text mining algorithms up to date.
12. Interdisciplinary Expertise Required: Effective text mining often requires expertise
in linguistics, data science, and domain knowledge, making it a multidisciplinary
endeavor.
13. Overfitting and Generalization: Models trained on specific text datasets might overfit
and struggle to generalize to new, unseen data.
14. Lack of Visual Context: Text lacks the visual context present in images or videos,
making it difficult to capture certain types of information.
15. Time-Consuming Annotation: Preparing and annotating text data for training
machine learning models can be time-consuming and labor-intensive.
16. Ethical and Legal Concerns: Analyzing text data might raise ethical and legal
concerns, such as copyright infringement or unintended consequences of data use.

Despite these disadvantages, text mining remains a valuable tool for extracting insights from
unstructured data, and advancements in natural language processing and machine learning
continue to address many of these challenges.

A Practical Demonstration of Text Analytics using R

Let's see how to do text analytics using the R programming language. We need multiple packages
to be installed in RStudio to do text analytics:
1. gutenbergr (to install this library, supporting packages such as 'lazyeval', 'urltools', and
'triebeard' may also be required)
2. tidytext

3. dplyr
4. ggplot2

Once all these libraries are installed and loaded, the following R snippet can be run in RStudio,
which will produce the expected result.
To do text analytics, we are going to use Mark Twain's books from the gutenbergr library.
Write the following code in RStudio and run it; you will be able to plot the graph shown below.

install.packages("gutenbergr") # to install this package, install prerequisite packages first

library(gutenbergr) # load the library

# In the gutenbergr library, each book is tagged with an ID number, which is needed to
# identify its location.
mark_twain <- gutenberg_download(c(76, 74, 3176, 245))
# pull the books using the gutenberg_download function and save them to a mark_twain
# data frame.

View(mark_twain) # to view the data frame

# Output of the above command

#When you analyse any text, there will always be redundant words that can skew the results
#depending on what patterns or trends you are trying to identify. These are called stop words.
# It is up to you if you want to remove stop_words, but for this example, let's go ahead
# and remove them.
# First, we need to load the library tidytext.
install.packages("tidytext") #install tidytext library
library(tidytext) # load tidytext library
data(stop_words) # load data from tidytext library
View(stop_words) # view dataframe

# output of the previous command 1,149 stop words

library(dplyr) # load library dplyr

tidy_mark_twain <- mark_twain %>%
  unnest_tokens(word, text) %>%   # tokenize
  anti_join(stop_words)           # remove stop words

tidy_mark_twain %>%
  count(word, sort = TRUE)        # most frequent words after cleaning

library(ggplot2) # load library ggplot2 to plot the graph

freq_hist <- tidy_mark_twain %>%
  count(word, sort = TRUE) %>%
  filter(n > 400) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col(fill = 'blue') +
  xlab(NULL) +
  coord_flip()
print(freq_hist)
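
As a possible extension of the snippet above (not part of the original example), the tokenized words can also be joined with the Bing sentiment lexicon available through tidytext to count positive and negative words; depending on the tidytext version, downloading some lexicons may additionally require the textdata package:

# A possible extension: simple sentiment analysis on the tokenized Mark Twain text.
bing <- get_sentiments("bing")        # positive/negative word lexicon from tidytext

sentiment_counts <- tidy_mark_twain %>%
  inner_join(bing, by = "word") %>%   # keep only words found in the lexicon
  count(sentiment, sort = TRUE)       # tally positive vs. negative words

print(sentiment_counts)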

SUMMARY

• Social networking involves using online social media platforms to connect with new
and existing friends, family, colleagues, and businesses.
• Individuals can use social networking to announce and discuss their interests and
concerns with others who may support or interact with them.
• Social network analysis is a method of studying relationships between objects and
events in a social structure.

• Social network analysis (SNA) can be used to improve communities, identify missing
links, and improve connections between groups.
• Text mining uses artificial intelligence (AI) techniques to automatically discover
patterns, trends, and other valuable information in text documents.
• Text mining employs a variety of methodologies to process the text, one of the most
important of these being Natural Language Processing (NLP).
• Fraud detection, risk management, online advertising, customer churn and web
content management are other functions that can benefit from the use of text mining
tools.

References
1. Data Mining Concepts and Techniques by Jiawei Han, Micheline Kamber, Jian Pei
2. Data Mining and Analysis - Fundamental Concepts and Algorithms by Mohammed
J. Zaki and Wagner Meira Jr.
3. Business Analytics using R – A Practical Approach by Dr. Umesh R. Hodeghatta,
Umesha Nayak
4. What is Social Network Analysis by John Scott.
5. The Text Mining Handbook - Advanced Approaches in Analyzing Unstructured
Data by Ronen Feldman, James Sanger
6. Natural Language Processing and Text Mining by Anne Kao and Stephen R.
Poteet
7. Theory and Applications for Advanced Text Mining by Shigeaki Sakurai
8. Text Mining Predictive Methods for Analyzing Unstructured Information by
Sholom M. Weiss, Nitin Indurkhya, Tong Zhang, Fred J. Damerau

Review Questions:
1. Explain what is social networking?
2. What are the advantages and disadvantages of social networking?
3. What is social network analytics?
4. What are the steps involved in social network analytics?

5. What are the applications of social network analytics?
6. What is text mining? What is text analytics? How do the two differ from each other?
7. Discuss the steps involved in text mining.
8. Explain the steps involved in text analytics.
9. Write the applications of text mining.
10. Write the advantages of text mining.
11. Write at least 6 disadvantages and limitations of text mining.
12. Write algorithms used in text mining.
13. Write an R snippet to demonstrate Text analytics.

Further Readings:
1. Scott J. Social network analysis: a handbook. Newbury Park: Sage, 2000.
2. Carrington PJ, Scott J, Wasserman S. Models and methods in social network analysis.
Cambridge: Cambridge University Press, 2005.
3. Wasserman S, Faust K. Social network Analysis: methods and applications. Cambridge:
Cambridge University Press, 1994.
4. M.E.J Newman. Networks. An Introduction. 1st edition Oxford University Press, 2010
5. Models and Methods in Social Network Analysis by Peter J. Carrington, John Scott,
Stanley Wasserman, Cambridge University Press

Unit VI: Introduction to Big Data Analytics

Topics:
• Introduction to Big Data Analytics
• Applications of Big Data Analytics
• Data Analysis Project Life Cycle
• Overview of Analytics Tools
• Achieving Competitive Advantage with Data Analytics

Objectives:
1. To understand the fundamental concepts of the Big Data platform and its use cases.
2. To understand the Big Data challenges & opportunities.
3. To understand Big Data Analytics and its applications.
4. To study the Data Analytics Project Life Cycle.
5. To study Data Analytics Tools.
6. To study how data analytics helps to achieve competitive advantage.

Outcomes:
Students will be able to
1. Understand the fundamental concepts of the Big Data platform and its use cases.
2. Understand the Big Data challenges & opportunities.
3. Understand the aspects of Big Data Analytics with the help of different Big Data applications.
4. Study the Data Analytics Project Life Cycle.
5. Learn Data Analytics Tools.
6. Study how data analytics can be utilized to achieve competitive advantage.

Introduction to the Unit
This unit explains several key concepts to clarify what is meant by Big Data, what big data
analytics is, why advanced analytics are needed. This unit also discusses the applications of Big
Data Analytics. It takes a close look at the Data Analytics Project Life Cycle. Then, it outlines the
challenges organizations contend with and where they have an opportunity to leverage advanced
analytics to create competitive advantage. The unit covers the analytics tools which are used for Data
Analytics.

What is Data?
• Data is a collection of raw facts and figures. The quantities, characters, or symbols on
which operations are performed by a computer, which may be stored and transmitted in the
form of electrical signals and recorded on magnetic, optical, or mechanical recording
media.
• Every day, we create about 2.5 quintillion (1 quintillion is 10^18) bytes of data.
• So much that 90% of the data in the world today has been created in the last two years
alone.
• This data comes from everywhere: sensors used to gather climate information, posts to
social media sites, digital pictures and videos, purchase transaction records, and cell phone
GPS signals etc. This data is nothing but big data.

What is Big Data?


Big Data is data with a huge size. Big Data is a term used to describe a collection of data
that is huge in size and yet growing exponentially with time. In short, such data is so large
and complex that none of the traditional data management tools are able to store it or
process it efficiently.
e.g.
• The New York Stock Exchange generates about one terabyte of new trade data per
day.

• Social Media
• Black box data

The statistics show that 500+ terabytes of new data get ingested into the databases of
social media site Facebook, every day. This data is mainly generated in terms of photo
and video uploads, message exchanges, putting comments etc.

Big data is high-volume, high-velocity, and high-variety information assets that demand cost-effective,
innovative forms of information processing for enhanced insight and decision making.
Big data refers to datasets whose size is typically beyond the storage capacity of, and too
complex for, traditional database software tools.
E.g., tools built to handle big data include Apache Storm, Hadoop, MongoDB, Qubole, Cassandra, CouchDB, HPCC, and Statwing.

Types Of Big Data


Big Data can be found in three forms:
1. Structured
2. Unstructured
3. Semi-structured

Structured:
Any data that can be stored, accessed, and processed in a fixed format is termed
"structured" data. Over time, the computer science community has achieved great
success in developing techniques for working with such data (where the format is
well known in advance) and deriving value out of it. However, nowadays we are
foreseeing issues as the size of such data grows to a huge extent, with typical sizes
reaching the range of multiple zettabytes.

Example of structured data:


An 'Employee' table in a database is an example of Structured Data:

Employee_Code | Name_of_Employee | Gender | Department | Salary_In_lacs
2365          | Rajesh Kulkarni  | Male   | Finance    | 6.5
3398          | Pratibha Bhagwat | Female | Admin      | 6.7
7465          | Pranav Roy       | Male   | Admin      | 5.0
7500          | Chittaranjan Das | Male   | Finance    | 5.2
7699          | Beena Sane       | Female | Finance    | 4.5
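
To make the idea of structured data concrete in R (the language used for the demonstrations in this
book), the same table can be held as a data frame with a fixed schema. This is an illustrative sketch
only, not part of the original example.

# A minimal sketch: the 'Employee' table above represented as a structured
# R data frame (fixed columns, each with a known type).
employees <- data.frame(
  Employee_Code    = c(2365, 3398, 7465, 7500, 7699),
  Name_of_Employee = c("Rajesh Kulkarni", "Pratibha Bhagwat", "Pranav Roy",
                       "Chittaranjan Das", "Beena Sane"),
  Gender           = c("Male", "Female", "Male", "Male", "Female"),
  Department       = c("Finance", "Admin", "Admin", "Finance", "Finance"),
  Salary_In_lacs   = c(6.5, 6.7, 5.0, 5.2, 4.5)
)

str(employees)                  # the fixed schema is known in advance
mean(employees$Salary_In_lacs)  # structured data is easy to aggregate
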
Unstructured:
Any data with unknown form or structure is classified as unstructured data. In addition to
its huge size, unstructured data poses multiple challenges in terms of processing it to
derive value. A typical example of unstructured data is a heterogeneous data source
containing a combination of simple text files, images, videos, etc. Nowadays organizations
have a wealth of data available with them, but unfortunately they don't know how to derive
value out of it, since this data is in its raw form or unstructured format.

Examples Of Unstructured Data


A typical example of unstructured data which comes from heterogeneous data sources like
simple text files, images, videos, audios, comments, reviews, etc.

➢ Text files: Word processing, spreadsheets, presentations, email, logs.


➢ Email: Email has some internal structure thanks to its metadata, and we sometimes
refer to it as semi-structured. However, its message field is unstructured and
traditional analytics tools cannot parse it.
➢ Social Media: Data from Facebook, Twitter, LinkedIn.
➢ Website: YouTube, Instagram, photo sharing sites.
➢ Mobile data: Text messages, locations.
➢ Communications: Chat, IM, phone recordings, collaboration software.
➢ Media: MP3, digital photos, audio, and video files.
➢ The output returned by 'Google Search', social media posts, Word, PDF, Text,
Media Logs, etc.
➢ Business applications: MS Office documents, productivity applications.
➢ Machine-generated unstructured data includes:
• Satellite imagery: Weather data, landforms, military movements.

• Scientific data: Oil and gas exploration, space exploration, seismic imagery,
atmospheric data.
• Digital surveillance: Surveillance photos and video.
• Sensor data: Traffic, weather, oceanographic sensors.

Semi-structured
Semi-structured data can contain both forms of data. We can see semi-structured data as
structured in form, but it is not actually defined with, for example, a table definition in a
relational DBMS. An example of semi-structured data is data represented in an XML file.

Examples of Semi-structured Data

Personal data stored in an XML file-


<rec><name>Dinesh Rao</name><gender>Male</gender><age>35</age></rec>
<rec><name>Seema R.</name><gender>Female</gender><age>41</age></rec>
<rec><name>Satish Mane</name><gender>Male</gender><age>29</age></rec>
<rec><name>Subrato Roy</name><gender>Male</gender><age>26</age></rec>
<rec><name>Jeremiah J.</name><gender>Male</gender><age>35</age></rec>
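
As an illustration (not part of the original example), records like those above can be read into a
structured R data frame using the xml2 package. A single <records> root element is added here
because a well-formed XML document needs exactly one root; the xml2 package is assumed to be
installed.

library(xml2)

doc <- read_xml(
  "<records>
     <rec><name>Dinesh Rao</name><gender>Male</gender><age>35</age></rec>
     <rec><name>Seema R.</name><gender>Female</gender><age>41</age></rec>
   </records>")

recs <- xml_find_all(doc, ".//rec")   # locate every <rec> element
people <- data.frame(
  name   = xml_text(xml_find_all(recs, "name")),
  gender = xml_text(xml_find_all(recs, "gender")),
  age    = as.integer(xml_text(xml_find_all(recs, "age")))
)
people   # the semi-structured records are now in a structured table
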

Know your progress:

Q.1 Which types of Big Data involve structured, well-organized data that fits neatly into
traditional databases and spreadsheets?
A. Structured Data
B. Semi-structured Data
C. Unstructured Data
D. Metadata

Q.2 Which types of Big Data includes data with a defined structure, but not as rigid as
structured data, often containing tags or elements that provide context?
A. Structured Data
B. Semi-structured Data
C. Unstructured Data
D. Metadata

Q.3 Which types of Big Data refers to data that lacks a specific structure and is often in the
form of text, images, audio, or video?

A. Structured Data
B. Semi-structured Data
C. Unstructured Data
D. Metadata

5 V’s of Big Data:

Volume: The first of the 5 V's of big data, volume represents the amount of data that exists.
Volume is the base of big data, as it is the initial size and amount of data that is collected.
If the volume of data is large enough, it can be considered big data. What counts as big data
is relative, though, and will change depending on the computing power available on the market.
In short, volume is the size and amount of big data that companies manage and analyze.

Value: The most important “V” from the perspective of the business; the value of big data
usually comes from insight discovery and pattern recognition that leads to more effective
operations, stronger customer relationships and other clear and quantifiable business
benefits.

Variety: Variety refers to the combination or mixture of data types. An organization might
obtain data from a number of different data sources, which may vary in value. Data can
come from sources inside and outside an organization as well. The challenge in variety
concerns the standardization and distribution of all data being collected. Variety means the
diversity and range of different data types, including unstructured data, semi-structured
data, and raw data.

Velocity: The speed at which companies receive, store, and manage data. Velocity is all
about the speed at which data is coming into the organization, and the ability to access and
process varying velocities of data quickly is critical. It can be measured, for example, as the
number of social media posts or search queries received within a day, an hour, or another unit of time.

Veracity: It refers to the quality and accuracy of data. Gathered data could have missing
pieces, may be inaccurate, or may not be able to provide real, valuable insight. Veracity,
overall, refers to the level of trust there is in the collected data.

Data can sometimes become messy and difficult to use. A large amount of data can cause
more confusion than insights if it's incomplete. For example, concerning the medical field,
if data about what drugs a patient is taking is incomplete, then the patient's life may be
endangered.

Know your progress:

Q.1 Which of the following is NOT one of the "5 V's" of Big Data?
A. Volume
B. Variety
C. Viscosity
D. Veracity

Q.2 Which "V" of Big Data refers to the sheer size of data generated and collected?
A. Velocity
B. Value
C. Volume

D. Veracity

Q.3 The "V" of Big Data that deals with the trustworthiness and reliability of data is:

A. Velocity
B. Viscosity
C. Veracity
D. Variety

How big data helps any organization:


The more the data, the more accurate the analysis, and the greater the confidence the
organization can have in its decision making. Eventually, an organization can get benefits
from big data like greater operational efficiency, cost reduction, time reduction, new
product development, optimized offerings, etc.

Challenges of Big Data:


Big data offers enormous advantages, but it also faces big challenges, including new privacy and
security concerns, user accessibility for business users, and selecting the best solutions for your
company's requirements. Organisations must deal with the following issues to benefit from
incoming data:

• Making big data accessible: As data volume increases, collecting and analysing it
becomes increasingly challenging. Data must be made accessible and simple for users of
all skill levels by organisations.
• Maintaining quality data: Organisations are spending more time than ever before looking
for duplication, mistakes, absences, conflicts, and inconsistencies because there is so much
data to keep.
• Protecting Data and keeping it secure: Concerns about privacy and security increase as
data volume increases. Before utilising big data, organisations will need to work towards
compliance and set up strict data protocols.
• Selecting the appropriate platforms and tools: big data processing and analysis
technologies are always evolving. To function within their current ecosystems and meet
their specific demands, organisations must find the appropriate technology. A flexible
system that can adapt to future infrastructure changes is frequently the best option.

Big Data analytics is the process of collecting, examining, and analysing large amounts of data to
discover market trends, insights, and patterns that can help companies make better business
decisions. It is a process used to extract meaningful insights, such as hidden patterns, unknown
correlations, and customer preferences. Big data analytics is a process of identifying patterns,
trends, and correlations in vast quantities of unprocessed data to support data-driven decision-
making. These procedures apply well-known statistical analysis methods, such as clustering and
regression, to larger datasets with the aid of more recent tools (a brief R sketch follows).
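
As a small, purely illustrative example of the two method families mentioned above (regression and
clustering), the following R sketch runs them on R's built-in mtcars data rather than a genuine
big-data platform.

data(mtcars)

# Regression: model fuel efficiency (mpg) from weight and horsepower
fit <- lm(mpg ~ wt + hp, data = mtcars)
summary(fit)

# Clustering: group the cars into 3 clusters on scaled weight and horsepower
set.seed(42)                                           # reproducible cluster assignments
cl <- kmeans(scale(mtcars[, c("wt", "hp")]), centers = 3)
table(cl$cluster)                                      # cluster sizes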

Big Data analytics provides various advantages, it can be used for better decision making,
preventing fraudulent activities, among other things.

Importance of Big Data Analytics:


• Big data analytics are important because they enable businesses to use their data to find
areas for optimisation and development. Across all corporate sectors, improving efficiency
results in more shrewd operations overall, more profits, and happy customers. Big data

analytics aids businesses in cost-cutting and the creation of superior, client-focused goods
and services.

• Data analytics assists in generating insights that enhance how our society operates. Big
data analytics in the healthcare industry is essential for tracking and analysing individual
patient records as well as for monitoring results on a global level. Big data helped health
ministries in each country's government decide how to handle vaccines during the COVID-
19 pandemic and come up with strategies for preventing pandemic breakouts in the future.

Advantages of Big Data Analytics:


The organizations which generate big data or use big data for analysing purpose can take data from
any source and analyse it to find answers which will enable:
1. Cost Savings/ Reduction: Some tools of Big Data like Hadoop and Cloud-Based
Analytics can bring cost advantages to business when large amounts of data are to be
stored and these tools also help in identifying more efficient ways of doing business.
Big data can reduce costs in storing all business data in one place. Tracking analytics
also helps companies find ways to work more efficiently to cut costs wherever possible.
2. Time Reductions: The high speed of tools like Hadoop and in-memory analytics can
easily identify new sources of data which helps businesses analyze data immediately
and make quick decisions based on the learnings.
3. Understand the market conditions: By analyzing big data you can get a better
understanding of current market conditions. For example, by analyzing customers’
purchasing behaviors, a company can find out the products that are sold the most and
produce products according to this trend. By this, it can get ahead of its competitors.
4. Control online reputation: big data tools can do sentiment analysis. Therefore, you
can get feedback about who is saying what about your company. If you want to monitor
and improve the online presence of your business, then, big data tools can help in all
this.
5. Boost Customer Acquisition and Retention: The customer is the most important
asset any business depends on. If a business is slow to learn what customers are looking
for, then it is very easy to begin offering poor quality products. In the end, loss of

customers will result, and this creates an adverse overall effect on business success.
The use of big data allows businesses to observe various customer related patterns and
trends. Observing customer behavior is important to trigger loyalty.
6. Solve Advertisers Problem and Offer Marketing Insights: Big data analytics can
help change all business operations. This includes the ability to match customer
expectations, changing the organizations’ product line and of course ensuring that the
marketing campaigns are powerful.
7. Driver of Innovations and Product Development: Another huge advantage of big
data is the ability to help organizations innovate and redevelop their products.
8. Product development: Using information gathered from client requirements and
wants makes developing and marketing new goods, services, or brands much simpler.
Businesses may better analyse product viability and stay current on trends with the use
of big data analytics.
9. Strategic business decisions: The capacity to continuously examine data aids firms in
reaching quicker and more accurate conclusions about issues like cost and supply chain
efficiency.
10. Customer experience: Data-driven algorithms provide a better customer experience,
which aids marketing efforts (targeted advertisements, for instance) and boosts
consumer happiness.
11. Risk management: Companies may find dangers by examining data trends, then come
up with ways to control those risks.

Big Data Use Cases:


• Entertainment: The provision of tailored movie and music recommendations based on a
customer's tastes has revolutionised the entertainment business (see Spotify and Netflix).

• Education: Based on student requirements and demand, big data enables educational
technology businesses and institutions to create new curricula and enhance already-existing
ones.

• Medical care: Keeping track of individuals' medical history enables clinicians to identify
and stop illnesses.

• Government: To better manage the public sector, big data may be utilised to gather
information from CCTV and traffic cameras, satellites, body cameras and sensors, emails,
calls, and more.

• Marketing: Customer preferences and information may be leveraged to develop very


effective targeted advertising campaigns.

• Banking: Data analytics can assist in detecting and observing unauthorised money
laundering.

Applications of Big Data Analytics:


1. Transportation
• Congestion management and traffic control: Using Big Data analytics, Google Maps can
now suggest the least traffic-prone route to any destination.
• Route planning: Different itineraries can be compared in terms of user needs, fuel
consumption, and other factors to plan for maximum efficiency.
• Traffic safety: Real-time processing and predictive analytics are used to pinpoint
accident-prone areas.
2. Banking and Financial Services
• Fraud detection: Banks monitor credit cardholders’ purchasing patterns and other
activity to flag atypical movements and anomalies that may signal fraudulent
transactions (a simple anomaly-flagging sketch follows this list).
• Risk management: Big Data analytics enables banks to monitor and report on
operational processes, KPIs, and employee activities.
• Customer relationship optimization: Financial institutions analyze data from website
usage and transactions to better understand how to convert prospects to customers and
incentivize greater use of various financial products.
• Personalized marketing

3. Customer Acquisition and Retention: Customer information plays a significant role in
marketing strategies that aim to improve customer happiness through data-driven
initiatives. For Netflix, Amazon, and Spotify, personalization engines aid create better
consumer experiences and foster client loyalty.
4. Targeted Ads: To construct targeted ad campaigns for customers on a bigger scale and
at the individual level, personalised data about interaction patterns, order histories, and
product page viewing history may be quite helpful.
5. Product Development: It can produce insights on product viability, performance metrics,
development choices, etc., and direct changes that benefit the customers.
6. Price optimization: With the use of various data sources, pricing models may be
modelled and utilised by merchants to increase profits.
7. Supply Chain and Channel Analytics: For the supply chain and distribution channels,
predictive analytical models support B2B supplier networks, proactive replenishment, route
optimisation, inventory management, and delivery delay alerts.
8. Risk management: It assists in identifying new hazards using data trends in order to
create efficient risk management solutions.
9. Better Decision-Making: Enterprises may improve their decision-making by using the
insights that can be gleaned from the data.
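
To make the fraud-detection application above a little more concrete, here is a hedged, simplified R
sketch that flags transactions far from a cardholder's usual spend. The synthetic data, column names,
and the 3-standard-deviation threshold are assumptions for illustration only; real fraud systems use
far richer models.

library(dplyr)

set.seed(1)
transactions <- data.frame(              # small synthetic example data
  card_id = rep(c("C1", "C2"), each = 50),
  amount  = c(rnorm(50, mean = 1200, sd = 150),
              rnorm(50, mean = 400,  sd = 60))
)
transactions$amount[25] <- 9000          # inject one atypical transaction

flagged <- transactions %>%
  group_by(card_id) %>%
  mutate(z = (amount - mean(amount)) / sd(amount)) %>%  # distance from the card's usual spend
  ungroup() %>%
  filter(abs(z) > 3)                                    # review unusually large deviations

flagged                                  # only the injected transaction is flagged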

Data Analysis Project Life Cycle

Introduction
Data Analytics Project Lifecycle defines the roadmap of how data is generated, collected,
processed, used, and analysed to achieve business goals.
It offers a systematic way to manage data for converting it into information that can be
used to fulfil organizational and project goals.
Data analytics mainly involves six important phases that are carried out in a cycle - Data
discovery, Data preparation, Planning of data models, the building of data models,
communication of results, and operationalization. The six phases of the data analytics
lifecycle are followed one phase after another to complete one cycle. It is interesting to

note that these six phases of data analytics can follow both forward and backward
movement between each phase and are iterative.

The lifecycle of data analytics provides a framework for the best performances of each
phase from the creation of the project until its completion. This framework was built by a
large team of data scientists with much care and experiments. The key stakeholders in data
science projects are business analysts, data engineers, database administrators, project
managers, executive project sponsors, and data scientists.

Data Analytics Lifecycle:

The Data analytic lifecycle is designed for Big Data problems and data science projects.
The cycle is iterative to represent a real project. To address the distinct requirements for
performing analysis on Big Data, a step-by-step methodology is needed to organize the
activities and tasks involved with acquiring, processing, analyzing, and repurposing data.

[Figure: The Data Analytics Lifecycle shown as an iterative cycle — Discovery → Data Preparation →
Model Planning → Model Building → Communicate Results → Operationalize.]
• Discovery: In this phase stakeholder teams identify and investigate and understand the
problem.
➢ Find out the data sources and data sets required for the project.
➢ The data science team learns and investigates the problem.
➢ Examine the business trends and make case studies of similar data analytics projects.
➢ Study the domain of the business industry.
➢ Develop context and understanding.
➢ Make an assessment of the in-house resources, the in-house infrastructure, total time
involved, and technology requirements.
➢ Come to know about data sources needed and available for the project.
➢ The team formulates an initial hypothesis for resolving all business challenges in terms
of the current market scenario that can be later tested with data.

• Data Preparation: After the data discovery phase, data is prepared by transforming it from
a legacy system into a data analytics form by using the sandbox platform. A sandbox is a
scalable platform commonly used by data scientists for data preprocessing; in the analytic
sandbox, the team extracts, loads, and transforms (ELT) data to work with it.
➢ Steps to explore, preprocess, and condition data prior to modeling and analysis.
➢ It requires the presence of an analytic sandbox; the team extracts, loads, and transforms
data to get it into the sandbox.
➢ Data preparation tasks are likely to be performed multiple times and not in predefined
order.
➢ Several tools commonly used are – Hadoop, Alpine Miner, Open Refine, etc.
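
As a minimal illustration of the kind of conditioning done in this phase, the R sketch below cleans a
hypothetical CSV extract; the file name and the columns (raw_sales.csv, amount, order_date, region)
are invented purely for the example.

library(dplyr)

raw <- read.csv("raw_sales.csv", stringsAsFactors = FALSE)   # hypothetical extract

clean <- raw %>%
  distinct() %>%                                    # drop exact duplicate rows
  filter(!is.na(amount)) %>%                        # remove rows missing the key measure
  mutate(
    order_date = as.Date(order_date, "%Y-%m-%d"),   # fix data types
    region     = trimws(tolower(region))            # standardise inconsistent labels
  )

summary(clean)                                      # sanity check before modelling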

• Model Planning: The team determines the methods, techniques, and workflow it intends to
follow for the subsequent model building phase.
➢ Team explores data to learn about relationships between variables and subsequently,
selects key variables and the most suitable models.
➢ In this phase, the data science team develops data sets for training, testing, and
production purposes.

➢ Team builds and executes models based on the work done in the model planning
phase.
➢ Several tools commonly used for this phase are – MATLAB and STATISTICA.
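
A small R sketch of typical model-planning activities, run on the built-in mtcars data purely for
illustration: exploring relationships between variables and setting aside training and test sets.

data(mtcars)

cor(mtcars[, c("mpg", "wt", "hp", "disp")])   # which variables move together?
pairs(mtcars[, c("mpg", "wt", "hp")])         # quick visual check of relationships

set.seed(123)                                 # reproducible split
train_idx <- sample(nrow(mtcars), size = round(0.7 * nrow(mtcars)))
train <- mtcars[train_idx, ]                  # training set
test  <- mtcars[-train_idx, ]                 # held-out test set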

• Model Building: Developing Data sets for training, testing, and building purposes. Team
also considers whether its existing tools will be sufficient for running the models or if they
need a more robust environment for executing models.
➢ Team develops datasets for testing, training, and production purposes.
➢ Team also considers whether its existing tools will suffice for running the models or
if they need a more robust environment for executing models and workflows.
➢ Free or open-source tools – R and PL/R, Octave, WEKA.
➢ Commercial tools – MATLAB and STATISTICA.
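
Continuing the illustrative mtcars split from the model-planning sketch above, a minimal
model-building step fits a model on the training data and checks it against the held-out test set.

model <- lm(mpg ~ wt + hp, data = train)       # fit on training data only

preds <- predict(model, newdata = test)        # score the held-out test set
rmse  <- sqrt(mean((test$mpg - preds)^2))      # compare predictions with actual values
rmse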

• Communicate results: In this phase, the team, in collaboration with major stakeholders,
determines if the results of the project are a success or a failure based on the criteria
developed in Phase 1.
➢ The result is scrutinized by the entire team along with its stakeholders to draw
inferences on the key findings and summarize the entire work done.
➢ After executing the model team needs to compare outcomes of modeling to criteria
established for success and failure.
➢ Team considers how best to articulate findings and outcomes to various team
members and stakeholders, taking into account caveats and assumptions.
➢ Team should identify key findings, quantify business value, and develop narrative
to summarize and convey findings to stakeholders.

• Operationalize: Here, the team delivers final reports, briefings, code, and technical
documents. In addition, the team may run a pilot project to implement the models in a
production environment.
➢ The team communicates benefits of project more broadly and sets up pilot project
to deploy work in controlled way before broadening the work to full enterprise of
users.

➢ This approach enables the team to learn about performance and related constraints
of the model in the production environment on small scale and make adjustments
before full deployment.
➢ Free or open-source tools – Octave, WEKA, SQL, MADlib.
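
As a small illustration of operationalizing a model (continuing the mtcars example above), the fitted
object can be persisted so a pilot or scoring job can reload it later without retraining; the file name
and the new records below are hypothetical.

saveRDS(model, "mpg_model.rds")                # persisted as part of the delivered code

# ...later, in the pilot/production environment:
scoring_model <- readRDS("mpg_model.rds")
new_cars <- data.frame(wt = c(2.8, 3.4), hp = c(110, 150))   # hypothetical new records
predict(scoring_model, newdata = new_cars)                   # score without retraining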

While proceeding through these six phases, the various stakeholders that can be involved in the
planning, implementation, and decision-making are data analysts, business intelligence analysts,
database administrators, data engineers, executive project sponsors, project managers, and data
scientists. All these stakeholders are rigorously involved in the proper planning and completion of
the project, keeping in note the various crucial factors to be considered for the success of the
project.

Data Analytics Tools


• Data analytics tools are software applications that collect and analyze data about a business,
its customers, and its competition in order to improve processes and help uncover insights
to make data-driven decisions.
• Used to develop and perform the necessary analytical processes that help companies make
better, more informed business decisions while lowering costs and increasing profits within
less time.
• Data analytics tools have been rapidly evolving to meet the increasing demand for efficient
and powerful data analysis capabilities. These tools help organizations and individuals
extract valuable insights from large and complex datasets.

Data Analytics Project Life Cyle goes through 6 phases, each phase use specific tools to
process the data.

• Common Tools for the Data Preparation Phase:


1. Hadoop can perform massively parallel ingest and custom analysis for web traffic parsing,
GPS location analytics, genomic analysis, and combining of massive unstructured data
feeds from multiple sources.

2. Alpine Miner provides a graphical user interface (GUI) for creating analytic workflows,
including data manipulations and a series of analytic events such as staged data-mining
techniques (for example, first select the top 100 customers, and then run descriptive
statistics and clustering) on Postgres SQL and other Big Data sources.
3. OpenRefine (formerly called Google Refine) is “a free, open source, powerful tool for
working with messy data.” It is a popular GUI-based tool for performing data
transformations, and it’s one of the most robust free tools currently available.
4. Like OpenRefine, Data Wrangler is an interactive tool for data cleaning and
transformation. Wrangler was developed at Stanford University and can be used to perform
many transformations on a given dataset. In addition, data transformation outputs can be
put into Java or Python. The advantage of this feature is that a subset of the data can be
manipulated in Wrangler via its GUI, and then the same operations can be written out as
Java or Python code to be executed against the full, larger dataset offline in a local analytic
sandbox.

• Common Tools for the Model Planning Phase

1. R has a complete set of modelling capabilities and provides a good environment for
building interpretive models with high-quality code. In addition, it can interface with
databases via an ODBC connection and execute statistical tests and analyses against Big
Data via an open-source connection. These two factors make R well suited to performing
statistical tests and analytics on Big Data. As of this writing, R contains nearly 5,000
packages for data analysis and graphical representation. New packages are posted
frequently, and many companies are providing value-add services for R (such as training,
instruction, and best practices), as well as packaging it in ways to make it easier to use and
more robust. This phenomenon is similar to what happened with Linux in the late 1980s
and early 1990s, when companies appeared to package and make Linux easier for
companies to consume and deploy. Use R with file extracts for offline analysis and optimal
performance, and use RODBC connections for dynamic queries and faster development (a brief RODBC sketch follows this list).
2. SQL Analysis services can perform in-database analytics of common data mining
functions, involved aggregations, and basic predictive models.

3. SAS/ACCESS provides integration between SAS and the analytics sandbox via multiple
data connectors such as OBDC, JDBC, and OLE DB. SAS itself is generally used on file
extracts, but with SAS/ACCESS, users can connect to relational databases (such as Oracle
or Teradata) and data warehouse appliances (such as Greenplum or Aster), files, and
enterprise applications (such as SAP and Salesforce.com).
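
A hedged sketch of the RODBC usage mentioned in item 1 above. "AnalyticsDSN" and the sales table
are hypothetical names used only for illustration; an ODBC data source must already be configured
on the machine.

library(RODBC)

ch <- odbcConnect("AnalyticsDSN")                   # open the configured ODBC connection
sales <- sqlQuery(ch, "SELECT region, SUM(amount) AS total
                         FROM sales
                        GROUP BY region")           # push the aggregation to the database
odbcClose(ch)

head(sales)                                         # the summarised result arrives in R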

• Common Tools for the Model Building Phase

1. SAS Enterprise Miner: allows users to run predictive and descriptive models based on
large volumes of data from across the enterprise. It interoperates with other large data
stores, has many partnerships, and is built for enterprise-level computing and analytics.
2. SPSS Modeler: offers methods to explore and analyze data through a GUI developed by
IBM.
3. Alpine Miner: provides a GUI front end for users to develop analytic workflows and
interact with Big Data tools and platforms on the back end.
4. MATLAB: provides a high-level language for performing a variety of data analytics,
algorithms, and data exploration.
5. STATISTICA and Mathematica are also popular and well-regarded data mining and
analytics tools.

Free or Open-Source tools:


1. R and PL/R: R was described earlier in the model planning phase, and PL/R is a
procedural language for PostgreSQL with R. Using this approach means that R commands
can be executed in-database. This technique provides higher performance and is more
scalable than running R in memory.
2. Octave: It is a free software programming language for computational modeling that has
some of the functionality of MATLAB. Octave is used in major universities when teaching
machine learning.
3. WEKA: is a free data mining software package with an analytic workbench. The functions
created in WEKA can be executed within Java code.

4. Python is a programming language that provides toolkits for machine learning and
analysis, such as scikit-learn, NumPy, SciPy, pandas, and related data visualization using
matplotlib.
5. SQL in-database implementations, such as MADlib, provide an alternative to in-
memory desktop analytical tools. MADlib provides an open-source machine learning
library of algorithms that can be executed in-database, for PostgreSQL or Greenplum.

Data Visualization Tools:


• As the volume of data continues to increase, more vendors and communities are developing
tools to create clear and impactful graphics for use in presentations and applications.
Although not exhaustive, the following are some examples of data visualization tools:

1. Open-Source Tools: R (Base package, lattice, ggplot2), GGobi/Rggobi, Gnuplot,


Inkscape, Modest Maps, OpenLayers, Processing, D3.js, Weave (a short ggplot2 sketch follows below)

2. Commercial Tools: Spotfire (TIBCO), QlikView, Adobe Illustrator

Tableau is a powerful data visualization tool that enables users to create interactive
and visually appealing dashboards and reports. It supports various data sources and
simplifies the process of data exploration, making it suitable for both beginners and
advanced users.

Power BI: Developed by Microsoft, Power BI is a business analytics service that


provides interactive visualizations and business intelligence capabilities. It allows
users to connect to a wide range of data sources, create insightful reports, and share
them with others.
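
As a brief illustration of the open-source R visualization option listed above, the sketch below builds
a simple chart with ggplot2 on R's built-in mtcars data; the chart itself is purely illustrative.

library(ggplot2)

ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_boxplot(fill = "steelblue") +
  labs(title = "Fuel efficiency by number of cylinders (illustrative)",
       x = "Cylinders", y = "Miles per gallon")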

For Data Analysis Tools:


The Hadoop Ecosystem
1. Pig: Provides a high-level data-flow programming language
2. Hive: Provides SQL-like access
3. Mahout: Provides analytical tools
4. HBase: Provides real-time reads and writes
Once Hadoop processes a dataset, Mahout provides several tools that can analyse the data
in a Hadoop environment. For example, a k-means clustering analysis can be conducted
using Mahout.
Differentiating itself from Pig and Hive batch processing, HBase provides the ability to
perform real-time reads and writes of data stored in a Hadoop environment.

NoSQL
NoSQL (Not only Structured Query Language): is a term used to describe those data stores that
are applied to unstructured data. As described earlier, HBase is such a tool that is ideal for storing
key/values in column families. In general, the power of NoSQL data stores is that as the size of
the data grows, the implemented solution can scale by simply adding additional machines to the
distributed system.
1. Key/value stores: contain data (the value) that can be simply accessed by a given identifier
(the key). E.g., Redis, Voldemort.
2. Document stores are useful when the value of the key/value pair is a file and the file itself
is self-describing (for example, JSON or XML). E.g., CouchDB, MongoDB
3. Column family stores are useful for sparse datasets, records with thousands of columns
but only a few columns have entries. E.g., Cassandra, HBase
4. Graph databases are intended for use cases such as networks, where there are items
(people or web page links) and relationships between these items. E.g., FlockDB, Neo4j

Some more Analytics Tools used for data analytics:


1. Microsoft Excel: One of the most widely used spreadsheet applications, Excel offers basic
data analysis functionalities, such as formulas, pivot tables, and charting. While it may not
be as advanced as dedicated data analytics tools, it remains accessible to a large user base.
2. Google Analytics: Primarily used for web analytics, Google Analytics helps website
owners track and analyze user interactions, website traffic, and other important metrics. It
offers valuable insights into user behavior and helps optimize online experiences.

3. Apache Spark: Spark is a fast and general-purpose distributed computing engine that also
supports big data processing. It enables data processing in real-time and batch modes and
integrates well with Hadoop and other data sources.
4. KNIME: KNIME (Konstanz Information Miner) is an open-source data analytics platform
that offers a graphical user interface for building data workflows. It supports integration
with various data sources and analysis tools.
5. SAS: SAS (Statistical Analysis System) is a software suite used for advanced analytics,
business intelligence, and data management. It has been widely used in industries like
finance, healthcare, and government.
6. IBM SPSS: SPSS is a statistical software package that allows users to analyze data using
various statistical methods. It is commonly used in social sciences, market research, and
other fields.
7. QlikView: QlikView is a data visualization and business intelligence tool that allows users
to create interactive dashboards and reports for data analysis.

Overview of Data Analytics Tools:


1. Data integration software: Programs that allow big data to be streamlined across different
platforms, such as MongoDB, Apache Hadoop, and Amazon EMR.

2. Stream analytics tools: Systems that filter, aggregate, and analyse data that might be
stored in different platforms and formats, such as Kafka.

3. Distributed storage: Databases that can split data across multiple servers and can identify
lost or corrupt data, such as Cassandra.

4. Predictive analytics hardware and software: Systems that process large amounts of
complex data, using machine learning and algorithms to predict future outcomes, such as
fraud detection, marketing, and risk assessments.

5. Data mining tools: Programs that allow users to search within structured and unstructured
big data.

6. Data warehouses: Storage for large amounts of data collected from many different
sources, typically using predefined schemas.

These tools cater to different levels of data analysis complexity and user requirements. The choice
of the right data analytics tool depends on factors such as the size of the dataset, the level of
technical expertise, the specific analysis needs, and the budget available for the tool. Always
consider the features, scalability, ease of use, and integration capabilities before selecting a data
analytics tool for your needs.
Business Analytics
• Business Analytics refers to the practice of analyzing and interpreting data to gain
insights and make informed business decisions.
• Business analytics is the process of gathering data, measuring business
performance, and producing valuable conclusions that can help companies make
informed strategic decisions on the future of the business, through the use of various
statistical methods and techniques.
• Business Analytics assumes that given a sufficient set of analytics capabilities exist
within an organization, the existence of these capabilities will result in the
generation of organizational value and competitive advantage. For example,
customer intelligence is one of the factors of competitive advantage organizations
can derive from their customer relationship management system (CRM) using
business analytics.

Competitive Advantage

• A competitive advantage is a feature that allows a business to outperform its rivals.


This helps a company beat its competitors in terms of profit margins, creating value
for its shareholders.
• A competitive advantage must be difficult to imitate, if not impossible. It is not
considered a competitive advantage if it is easily duplicated or replicated.

• The competitive advantage is what distinguishes a company's goods or services from
all other options available to a customer.
• It enables a company to manufacture goods or services more efficiently or at a lower
cost than its competitors.
• This may result in the company gaining a large market share, higher sales, and a
higher customer base than its competitors.
• It distinguishes a company's products or business model from those of its competitors.
• It might be anything from their products to their service to their reputation to their
location.
• Positive business outcomes of having a competitive advantage include implementing
stronger business strategies, warding off competitors, and capturing a larger market
share within their consumer markets.
• Improved decision-making, Enhanced customer experience, Increased revenue and
profitability are some benefits of Competitive advantage.

Competitive advantage examples

• Natural resources that are not available to competitors


• High-skilled workforce and a specific geographic location
• Ability to create things at the lowest cost due to access to innovative or exclusive
technology.
• Brand image and recognition
Creating a competitive advantage is a process that involves putting together a strategy
that will help organizations in many aspects:
1. Benefit: A business must be clear about the benefits that its product or service
offers. It must provide significant value and pique people’s interests.
2. Target Market: A business must determine who is buying from it and how it can
cater to that market.
3. Competitors: To succeed in the competitive landscape, a company must first
understand its competitors.

4. For a business to gain a competitive advantage, it must articulate the benefit they
bring to their target market in ways that other businesses cannot.

Competitive advantage in business analytics

• Business analytics is becoming a competitive advantage for organizations, and it is
now necessary to apply business analytics, particularly its subset of predictive
business analytics. Data helps in identifying business strengths and weaknesses
relevant to competitive advantage.
• When business analytics initiatives are adopted correctly, businesses are far more likely
to succeed. The use of business analytics is a skill that is gaining mainstream value
due to the increasingly thinner margin for decision error. It is there to provide
insights and predict the future of the business. Hence, businesses should prioritize
analytics efforts to differentiate themselves from their competition to gain a bigger
market share through high-value opportunities.

• Business analytics facilitates differentiation through the various analytics models.


It is primarily about driving change through analytics priorities. Business analytics
drives competitive advantage by generating economies of scale, economies of
scope, and quality improvement. Taking advantage of economies of scale is the first
way organizations achieve comparative cost efficiencies and drive competitive
advantage against their peers. Taking advantage of the economies of scope is the
second-way organizations achieve relative cost efficiencies and drive competitive
advantage against their peers.

• Business analytics enhances the efficiency of business operations providing


business owners with valuable information about the performance of the business.
The efficiencies that accumulate when a firm embraces big data technology
eventually contribute to a ripple effect of increased production and reduced
business costs.

▪ Analytics gives companies insight into their customers’ behavior and needs,
employee retention, etc.
▪ Business Analytics help businesses stay ahead of their competitors by providing
real-time data analysis.
▪ It also makes it possible for a company to understand its brand's public opinion,
follow the results of various marketing campaigns, and strategize how to create a
better marketing strategy to nurture long and fruitful relationships with its
customers.
▪ Business analytics helps organizations to know where they stand in the industry or
a particular niche and provides the company with the needed clarity to develop
effective strategies to position itself better in the future.
▪ For a company to remain competitive in the modern marketplace that requires
constant change and growth, it must stay informed on the latest industry trends and
best practices.
▪ If the management team is analytics-impaired, then that business is at risk.
Predictive business analytics is arguably the next wave for organizations to
successfully compete and it is an advantage for organizations.

For any company to survive, it must have a competitive advantage. To be competitive,


businesses must find new ways to minimize expenses, better allocate resources, and devise
ways to reach out to each of their clients personally. All of this is possible with the help of data
analytics. Data analytics delivers unique insights that are not available through other methods.
These are the kinds of information that can help other companies make better use of their
resources, decrease expenses, and personalize their offerings to gain a competitive advantage.

SUMMARY
• Big data is data whose scale and complexity go beyond what the existing human &
technical infrastructure can support for storage, processing, and analysis.
• Big data refers to large amounts of data that can inform analysts of trends and
patterns.

• Velocity, Volume, Value, Veracity, and Variety are the characteristics of Big Data.
• Big Data analytics integrates structured and unstructured data with real-time feeds
and queries, opening new paths to innovation and insight.
• Big Data Analytics has many applications such as Detecting Fraud in Banking
applications, Customer relationship management, Marketing and Advertising etc.
• Data Analytics Lifecycle defines the roadmap of how data is generated, collected,
processed, used, and analyzed to achieve business goals.
• Data Analytics Project Lifecycle consists of 6 phases.
• Discovery, Data Preparation, Model Planning, Model Building, Communicate
Results, operationalize are the phases of Data Analytics Project Lifecycle.
• Data analytics tools are software applications that collect and analyze data about a
business, to improve processes and identify hidden patterns to make data-driven
decisions.
• Several free and commercial tools are available for exploring, conditioning,
modelling, and presenting data.
• OpenRefine, Data Wrangler, R and PL/R, Octave, and so on are free, open-source
Data Analytics tools.
• SAS, MATLAB, SPSS Modeler, and Alpine Miner are examples of commercial software
used for Data Analytics, while Hadoop, MapReduce, and related tools are open source.
• There are data visualization tools available to visualize data analysis results, like Tableau,
R's libraries, Python libraries, etc.
• To be competitive, businesses must find new ways to minimize expenses, better
allocate resources, and devise ways to reach out to each of their clients personally.
• Business Analytics can help companies make better use of their resources, decrease
expenses, and personalize their offerings to gain a competitive advantage.
• This will result not only from being able to predict outcomes but also to reach
higher to optimize the use of their resources, assets, and trading partners.

References

1. Big Data Imperatives - Enterprise Big Data Warehouse, BI Implementations and Analytics
by Soumendra Mohanty, Madhu Jagadeesh, Harsha Srivatsa
2. Big Data Analytics made easy by Y Lakshmi Prasad
3. Introduction to Big Data Analytics by EMC Education
4. Big Data Analytics by Dr. Anil Kumar K.M
5. Data Science from Scratch by Steven Cooper
6. Business Analytics using R – A Practical Approach by Dr. Umesh R. Hodeghatta, Umesha
Nayak
7. Business Analytics Principles, Concepts, and Applications, What, Why, and How, Marc J.
Schniederjans, Dara G. Schniederjans, Christopher M. Starkey
8. Data Analytics made Accessible by Dr. Anil Maheshwari
9. Data Science and Big Data Analytics by EMC Education Services.
10. Data Mining Concepts and Techniques by Jiawei Han, Micheline Kamber, Jian Pei
11. Data Mining and Analysis - Fundamental Concepts and Algorithms by Mohammed J. Zaki
and Wagner Meira Jr.

Review Questions
1. What is big data? Give examples of big data.
2. What are the applications of big data?
3. What is the importance of big data?
4. Which tools are used for handling big data?
5. Explain data analytics project life cycle with diagram.
6. What are the challenges of big data?
7. Explain at least two tools used phase wise, in data analytics project life cycle.
8. Explain applications of Big Data Analytics.
9. Explain advantages of big data analytics.
10. Explain 5 V’s of big data.
11. What are the types of big data? Explain with the examples.
Further Readings:
1. Madhu Jagadeesh, Soumendra Mohanty, Harsha Srivatsa, “Big Data Imperatives:
Enterprise Big Data Warehouse, BI Implementations and Analytics”, 1st Edition, Apress

2. Frank J. Ohlhorst, “Big Data Analytics: Turning Big Data into Big Money”, Wiley
Publishers
3. Cristian Molaro, Surekha Parekh, Terry Purcell, “DB2 11: The Database for Big Data &
Analytics”, MC Press
4. Tom White, “Hadoop –The Definitive Guide, Storage and analysis at internet scale”, SPD,
O’Reilly.
5. DT Editorial Services, “Big Data, Black Book-Covers Hadoop2, MapReduce, Hive,
YARN, Pig, R and Data Visualization” Dreamtech Press
6. Chris Eaton, Dirk Deroos et. al., “Understanding Big data”, Indian Edition, McGraw Hill.
7. Chen, H., Chiang, R.H. and Storey, V.C., 2012. Business intelligence and analytics: from
big data to big impact. MIS quarterly, pp.1165-1188.
