
Data Analytics: UNIT-II

2.1 Introduction to Analytics


As an enormous amount of data gets generated, the need to extract useful insights
is a must for a business enterprise. Data Analytics has a key role in improving your
business. Here are 4 main factors which signify the need for Data Analytics:
 Gather Hidden Insights – Hidden insights are gathered from data and then
analyzed with respect to business requirements.
 Generate Reports – Reports are generated from the data and passed on to the
respective teams and individuals so they can act on them and grow the business.
 Perform Market Analysis – Market Analysis can be performed to understand
the strengths and weaknesses of competitors.
 Improve Business Requirements – Analysis of data allows a business to improve
on customer requirements and experience.

Data Analytics refers to the set of techniques used to analyze data to enhance
productivity and business gain. Data is extracted from various sources and is
cleaned and categorized to analyze different behavioral patterns. The techniques
and tools used vary according to the organization or individual.

Data analysts translate numbers into plain English. A Data Analyst delivers value to
their companies by taking information about specific topics and then interpreting,
analyzing, and presenting findings in comprehensive reports. So, if you have the
capability to collect data from various sources, analyze the data, gather hidden insights
and generate reports, then you can become a Data Analyst. Refer to the image below:

Fig 2.1 Data Analytics


In general, data analytics also involves a degree of human knowledge, as discussed
below in figure 2.2: each type of analytics requires a different share of human
input in making predictions. Descriptive analytics requires the highest human
input, predictive analytics requires less, and prescriptive analytics requires little
to no human input, since the recommended action is itself derived from the
predicted data.

Fig 2.2 Data and Human Work

Fig 2.3 Data Analytics


Data analytics encompasses various types of analysis techniques aimed at
extracting insights, patterns, and valuable information from large and diverse datasets.
Collecting, processing, and interpreting data to find insights and support decisions
is the core of the field: data analytics is the study of large amounts of raw data to
find patterns, draw conclusions, and obtain useful information. To do this, different
methods and tools are used to process and turn data into actionable information
that can be used in decision-making.

Data analytics includes many different ways to examine data and find useful
information that can improve different parts of a business. By carefully examining
data, businesses can find patterns and metrics that they might not have seen
otherwise, which lets them improve overall efficiency and make processes run more
smoothly.

In manufacturing, for example, companies keep track of machine runtime, breaks, and work
queues to better plan their workloads and make sure machines work at their
best.

Data analytics is used in many areas besides just optimizing output. Companies that make
games use analytics to come up with reward systems that keep players interested. Companies
that make content use analytics to improve the placement and appearance of their content,
which in turn keeps users interested.
There are four types of data analytics:
1. Descriptive Analytics (What happened in the past)
2. Diagnostic Analytics (Why did it happen in the past)
3. Predictive Analytics (What will happen in the future)
4. Prescriptive Analytics (How can we make it happen)
Descriptive Analytics
Descriptive analytics looks at data and events that happened in the past to figure out how to
handle events that will happen in the future. It examines past performance by mining
historical data to find out why things worked or didn't work. This kind of analysis is used
in almost all management reports, like those for sales, marketing, operations, and finance.
A descriptive model quantifies the relationships between pieces of data in a way that is
often used to put customers or leads into groups. While a predictive model tries to predict how
one customer will act, descriptive analytics looks at all the possible relationships between a
customer and a product.
Common examples of descriptive analytics are company reports that provide historic reviews,
such as data queries, reports, descriptive statistics, and data dashboards.
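As a brief illustration, descriptive statistics can be computed in a few lines of Python
with pandas. This is a minimal sketch; the column names and sales figures below are
hypothetical, invented purely for illustration:

import pandas as pd

# Hypothetical monthly sales data for a descriptive summary
sales = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr", "May", "Jun"],
    "revenue": [12000, 13500, 12800, 15100, 14900, 16200],
    "units_sold": [240, 270, 255, 300, 290, 320],
})

# Descriptive statistics: count, mean, std, min, quartiles, max
print(sales[["revenue", "units_sold"]].describe())

# The kind of historic review a management report would contain
print("Total revenue:", sales["revenue"].sum())
print("Average units sold per month:", sales["units_sold"].mean())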
Diagnostic Analytics
For this kind of analysis, we usually rely on historical data to answer a question or
figure out how to fix an issue. We look at the problem's past data to see if there
are any patterns or dependencies.
For instance, businesses choose this type of analysis because it helps them understand a
problem better and keep thorough records of the data they have available. Otherwise,
collecting data for each problem individually would take a lot of time.
Diagnostic analytics goes beyond descriptive analytics to answer the question of why certain
events occurred. It aims to uncover the root causes of specific outcomes or trends observed in
the data. By analyzing historical data in detail, diagnostic analytics helps organizations
understand the factors influencing their performance or business operations.
Common techniques used for diagnostic analytics are data discovery, data mining, and
correlations.
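Correlation analysis, one of the techniques named above, is easy to sketch in Python
with pandas. The operational data below is hypothetical, invented to show the mechanics:

import pandas as pd

# Hypothetical operational data: which factors move with revenue?
df = pd.DataFrame({
    "ad_spend":      [500, 700, 600, 900, 800, 1000],
    "site_downtime": [5, 2, 4, 1, 2, 0],   # hours per month
    "revenue":       [11000, 14200, 12500, 16800, 15500, 18000],
})

# Pairwise correlations help surface candidate root causes,
# which can then be investigated in detail
print(df.corr())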
Predictive Analytics
With predictive analytics, data is turned into useful, actionable information.
Predictive analytics uses data to determine the probable outcome of an event or the
likelihood that something will happen. A lot of different statistical methods, like modeling,
machine learning, data mining, and game theory, are used in predictive analytics to analyze
past and present events and forecast future ones.
Predictive analytics builds on historical data: by identifying patterns, trends, and
correlations within past data, it projects what is likely to happen next, providing a
foundation for further analysis and decision-making.
Techniques that are used for predictive analytics are Linear Regression, Time Series Analysis
and Forecasting, Data Mining
Basic Cornerstones of Predictive Analytics are Predictive modelling, Decision Analysis and
optimization, Transaction profiling
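Linear regression, the first technique listed above, can be sketched in Python with
scikit-learn. The month numbers and revenue figures below are hypothetical:

import numpy as np
from sklearn.linear_model import LinearRegression

# Hypothetical history: months 1..6 and the revenue observed in each
months = np.array([[1], [2], [3], [4], [5], [6]])
revenue = np.array([12000, 13500, 12800, 15100, 14900, 16200])

# Fit a linear trend to the past, then forecast the next month
model = LinearRegression().fit(months, revenue)
forecast = model.predict(np.array([[7]]))
print(f"Forecast for month 7: {forecast[0]:.0f}")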

Prescriptive Analytics
Prescriptive analytics takes predictive analytics a step further by providing
recommendations or actions to optimize outcomes. It uses optimization and simulation
techniques to suggest the best course of action based on predicted future scenarios.
Prescriptive analytics helps organizations make data-driven decisions and take actions that
lead to desired outcomes.
Prescriptive analytics automatically combines large amounts of data, mathematics, business
rules, and machine learning to make a prediction. It then offers a decision that can be made
based on that prediction.

Prescriptive analytics does more than just predict what will happen in the future. It also
suggests actions that will help bring the predictions about and shows the decision-maker
what each choice would mean. This type of analytics not only predicts what will
happen and when, but also tries to figure out why it will happen. Prescriptive analytics
can also offer different decision options for taking advantage of a future opportunity
or reducing a future risk, and it can show the implication of each option.
For example, Prescriptive Analytics can benefit healthcare strategic planning by using
analytics to leverage operational and usage data combined with data of external factors
such as economic data, population demography, etc.
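The optimization techniques that prescriptive analytics relies on can be sketched with a
tiny linear program in Python using scipy. The product-mix numbers below are hypothetical:

from scipy.optimize import linprog

# Hypothetical decision: how many units of two products to make,
# maximizing profit 40*x1 + 30*x2 under resource limits.
# linprog minimizes, so the profits are negated.
c = [-40, -30]          # negated profit per unit of each product
A = [[2, 1],            # labour hours consumed per unit
     [1, 2]]            # material units consumed per unit
b = [100, 80]           # labour hours and material available

res = linprog(c, A_ub=A, b_ub=b, bounds=[(0, None), (0, None)])
print("Recommended production plan:", res.x)
print("Expected profit:", -res.fun)

The recommendation ("make this many units of each product") is the prescriptive step: it
goes beyond forecasting to suggest the action that optimizes the outcome.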
2.2 Introduction to Tools and Environment
In general, data analytics brings together three main ingredients: subject knowledge,
statistics, and a person with computing knowledge who can work with a tool to give
insight into the business. The most widely used tools are R and Python, as shown in
the figure below.

With the increasing demand for Data Analytics in the market, many tools with various
functionalities have emerged for this purpose. Ranging from open-source platforms to
user-friendly commercial products, the top tools in the data analytics market are as
follows.

 R programming – This tool is the leading analytics tool used for statistics and data
modeling. R compiles and runs on various platforms such as UNIX, Windows, and
Mac OS. It also provides tools to automatically install all packages as per user
requirements.
 Python – Python is an open-source, object-oriented programming language which
is easy to read, write and maintain. It provides various machine learning and
visualization libraries such as Scikit-learn, TensorFlow, Matplotlib, Pandas, Keras
etc. It can also connect to almost any data platform, such as an SQL Server
database, a MongoDB database, or JSON data.
 Tableau Public – This is a free software that connects to any data source such as
Excel, corporate Data Warehouse etc. It then creates visualizations, maps,
dashboards etc with real-time updates on the web.
 QlikView – This tool offers in-memory data processing with the results delivered to
the end-users quickly. It also offers data association and data visualization with
data being compressed to almost 10% of its original size.
 SAS – A programming language and environment for data manipulation and
analytics, this tool is easily accessible and can analyze data from different sources.
 Microsoft Excel – This tool is one of the most widely used tools for data
analytics. Mostly used for clients’ internal data, this tool analyzes and
summarizes the data with previews of pivot tables.
 RapidMiner – A powerful, integrated platform that can integrate with many data
source types such as Access, Excel, Microsoft SQL, Teradata, Oracle, Sybase etc.
This tool is mostly used for predictive analytics, such as data mining, text
analytics, and machine learning.
 KNIME – Konstanz Information Miner (KNIME) is an open-source data analytics
platform, which allows you to analyze and model data. With the benefit of visual
programming, KNIME provides a platform for reporting and integration through its
modular data pipeline concept.
 OpenRefine – Also known as GoogleRefine, this data cleaning software will help
you clean up data for analysis. It is used for cleaning messy data, the
transformation of data and parsing data from websites.
 Apache Spark – One of the largest large-scale data processing engines, this tool
executes applications in Hadoop clusters 100 times faster in memory and 10 times
faster on disk. This tool is also popular for data pipelines and machine learning
model development.
Apart from the above-mentioned capabilities, a Data Analyst should also possess
skills such as Statistics, Data Cleaning, Exploratory Data Analysis, and Data
Visualization. Also, if you have knowledge of Machine Learning, then that would
make you stand out from the crowd.

2.3 Application of modelling in business


Modeling is widely used in business to optimize processes, improve decision-making, and enhance
performance. Here are the key points of applying modeling in business:

 A statistical model embodies a set of assumptions concerning the generation of
the observed data, and of similar data from a larger population.
 A model represents, often in considerably idealized form, the data-generating process.
 Signal processing is an enabling technology that encompasses the fundamental
theory, applications, algorithms, and implementations of processing or
transferring information contained in many different physical, symbolic, or
abstract formats broadly designated as signals.
 It uses mathematical, statistical, computational, heuristic, and linguistic
representations, formalisms, and techniques for representation, modeling,
analysis, synthesis, discovery, recovery, sensing, acquisition, extraction,
learning, security, or forensics.
 In manufacturing, statistical models are used to define warranty policies, solve
various conveyor-related issues, carry out Statistical Process Control, etc.
 The application of data modelling in business is often termed business analytics.
 Business analytics involves the collating, sorting, processing, and studying of business-
related data using statistical models and iterative methodologies. The goal of BA is to
narrow down which datasets are useful and which can increase revenue, productivity, and
efficiency.
 Business analytics (BA) is the combination of skills, technologies, and practices used to
examine an organization's data and performance as a way to gain insights and make data-
driven decisions in the future using statistical analysis.
Although business analytics is being leveraged in most commercial sectors and industries,
the following applications are the most common.

1. Credit Card Companies


Credit and debit cards are an everyday part of consumer spending, and they are an ideal
way of gathering information about a purchaser’s spending habits, financial situation,
behavior trends, demographics, and lifestyle preferences.

2. Customer Relationship Management (CRM)


Excellent customer relations are critical for any company that wants to retain customer
loyalty to stay in business for the long haul. CRM systems analyze important performance
indicators such as demographics, buying patterns, socio-economic information, and
lifestyle.

3. Finance
The financial world is a volatile place, and business analytics helps to extract insights that
help organizations maneuver their way through tricky terrain. Corporations turn to
business analysts to optimize budgeting, banking, financial planning, forecasting, and
portfolio management.

4. Human Resources
Business analysts help the process by poring over data that characterizes high-
performing candidates, such as educational background, attrition rate, the average length
of employment, etc. By working with this information, business analysts help HR by
forecasting the best fits between the company and candidates.
5. Manufacturing
Business analysts work with data to help stakeholders understand the things that affect
operations and the bottom line. Identifying things like equipment downtime, inventory
levels, and maintenance costs helps companies streamline inventory management, risks,
and supply-chain management to create maximum efficiency.

6. Marketing
Business analysts help answer key marketing questions, such as which campaigns work
and for whom, by measuring marketing and advertising metrics, identifying consumer
behavior and the target audience, and analyzing market trends.
2.4 Databases
A database is an organized collection of structured information, or data,
typically stored electronically in a computer system. A database is usually
controlled by a database management system (DBMS).
Databases can be divided into various categories such as text databases,
desktop database programs, relational database management systems (RDBMS),
and NoSQL and object-oriented databases.

A text database is a system that maintains a (usually large) text collection and
provides fast and accurate access to it. Eg: text books, magazines, journals,
manuals, etc.

A desktop database is a database system that is made to run on a single computer
or PC. These simpler solutions for data storage are much more limited and
constrained than larger data center or data warehouse systems, where primitive
database software is replaced by sophisticated hardware and networking setups.
Eg: Microsoft Excel, Open Access, etc.

A relational database (RDB) is a collective set of multiple data sets organized by
tables, records and columns. RDBs establish a well-defined relationship between
database tables. Tables communicate and share information, which facilitates data
searchability, organization and reporting. Eg: SQL, Oracle, DB2, DBaaS, etc.
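The relational idea is easy to demonstrate in Python with the built-in sqlite3 module.
This is a minimal sketch; the table names and rows are invented for illustration:

import sqlite3

# An in-memory relational database with two related tables
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("""CREATE TABLE orders (
                   id INTEGER PRIMARY KEY,
                   customer_id INTEGER REFERENCES customers(id),
                   amount REAL)""")
cur.execute("INSERT INTO customers VALUES (1, 'Asha')")
cur.execute("INSERT INTO orders VALUES (1, 1, 250.0), (2, 1, 120.5)")

# The tables share information through the customer_id relationship
for row in cur.execute("""SELECT c.name, SUM(o.amount)
                          FROM customers c JOIN orders o
                            ON c.id = o.customer_id
                          GROUP BY c.name"""):
    print(row)
conn.close()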

NoSQL databases are non-tabular, and store data differently than relational tables.
NoSQL databases come in a variety of types based on their data model. The main
types are document, key-value, wide-column, and graph. Eg: JSON document stores
such as MongoDB and CouchDB.
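A document database can be mimicked in a few lines of Python to show the idea of
non-tabular storage. This is only an illustration of the data model, not a real database
engine:

import json

# Document model: each record is a self-contained document, and
# documents in the same collection may have different fields
customers = [
    {"_id": 1, "name": "Asha", "orders": [250.0, 120.5]},
    {"_id": 2, "name": "Ravi", "city": "Pune"},  # different shape is fine
]

# Documents are typically serialized as JSON, as in MongoDB or CouchDB
print(json.dumps(customers, indent=2))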

Object-oriented databases (OODB) are databases that represent data in the form of
objects and classes. In object-oriented terminology, an object is a real-world entity,
and a class is a collection of objects. Object-oriented databases follow the
fundamental principles of object-oriented programming (OOP), as found in
languages like C++, Java, C#, Smalltalk and LISP. Examples of object-oriented
databases are ObjectDB, db4o, Versant Object Database, etc.

2.5 Types of Data and variables

In any database we will be working with data to perform some kind of analysis and
prediction. In a relational database management system we normally use rows to
represent records and columns to represent attributes.
In terms of big data we refer to a column from an RDBMS as an attribute or a
variable. A variable can be divided into two types: categorical or qualitative data,
and continuous or discrete data, called quantitative data, as shown below in
figure 2.5.

Qualitative data or categorical data is normally represented as a variable that holds
characters, and is divided into two types: nominal data and ordinal data.

In nominal data there is no natural ordering of the values in an attribute of the
dataset. Eg: color, gender, nouns (name, place, animal, thing). These categories
cannot be given a predefined order; for example, there is no specific way to arrange
the gender of 50 students in a class. The first student can be male or female, and
similarly for all 50 students, so no ordering is valid.

In ordinal data there is a natural ordering of the values in an attribute of the
dataset. Eg: size (S, M, L, XL, XXL), rating (excellent, good, better, worst). In
these examples we can quantify the data after ordering it, which gives valuable
insights into the data.

Quantitative data refers to numerical data that can be measured or counted and
expressed in numbers. It is divided into two types: interval data and ratio data.
Interval data is a type of numerical data where the difference between
values is meaningful, but there is no true zero point. This means you can
perform addition and subtraction, but not meaningful multiplication or
division. Examples of Interval Data:
1. Temperature (Celsius or Fahrenheit) Example: 10°C, 20°C, 30°C
2. IQ (Intelligence Quotient) Scores Example: 90, 100, 110, 120
3. SAT or GRE Scores Example: 400, 600, 800
4. Calendar Years Example: 1000 AD, 1500 AD, 2000 AD

Ratio data is a type of numerical data that has equal intervals between values (like
interval data), a true zero point (zero means the total absence of the quantity), and
meaningful multiplication and division (e.g., "twice as much" makes sense).
Examples are height, weight, and age.
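The distinction between these variable types can be sketched in Python with pandas,
whose categorical dtype supports both unordered (nominal) and ordered (ordinal) data.
The values below are hypothetical:

import pandas as pd

# Nominal: categories with no natural ordering
gender = pd.Series(["M", "F", "F", "M"], dtype="category")

# Ordinal: categories with an explicit, meaningful ordering
size = pd.Categorical(["S", "XL", "M", "L"],
                      categories=["S", "M", "L", "XL"], ordered=True)
print(size.min(), size.max())   # ordering makes min/max meaningful

# Quantitative (ratio): a true zero, so means and ratios make sense
weights = pd.Series([60.5, 72.0, 55.2])
print(weights.mean())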

2.6 Data Modelling Techniques

Data modelling is the process by which data is stored in a structured format in a
database. Data modelling is important because it enables organizations to make
data-driven decisions and meet varied business goals.
The entire process of data modelling is not as easy as it seems, though. You are
required to have a deep understanding of the structure of an organization and then
propose a solution that aligns with its end-goals and suffices in achieving the
desired objectives.
Types of Data Models

Data modeling can be achieved in various ways. However, the basic concept of
each of them remains the same. Let’s have a look at the commonly used data
modeling methods:

2.6.1 Hierarchical model

As the name indicates, this data model makes use of hierarchy to structure the
data in a tree-like format. Hierarchical data modeling organizes data in a tree-like
structure with parent-child relationships, where each parent can have multiple
children, but each child has only one parent.
However, retrieving and accessing data is difficult in a hierarchical database. This
is why it is rarely used now.

Fig 2.6: Hierarchical Model Structure


Characteristics of the Hierarchical Model
 Simplicity: The tree-like structure of the hierarchical model is intuitive and easy to
understand, which simplifies the design and navigation of databases. This structure clearly
defines parent-child relationships, making the path to retrieve data straightforward.
 Data Integrity: Due to its rigid structure, the hierarchical model inherently maintains data
integrity. Each child record has only one parent, which helps prevent redundancy and
preserves the consistency of data across the database.
 Efficient Data Retrieval: The model allows for fast and efficient retrieval of data. Since
relationships are predefined, navigating through the database to find a specific record is
quicker compared to more complex models where paths are not as clearly defined.
 Security: Hierarchical databases can provide enhanced security. Access to data can be
controlled by restricting users to certain segments of the tree, which makes it easier to
implement security policies based on data hierarchy.
 Performance: For applications where relationships are fixed and transactions require high
throughput, such as in banking or airline reservation systems, the hierarchical model offers
excellent performance and speed.
Limitations of the Hierarchical Model
 Structural Rigidity: The biggest limitation of the hierarchical model is its structural rigidity.
Once the database is designed, making changes to its structure, such as adding or modifying
the relationships between records, can be complex and labor-intensive.
 Lack of Flexibility: This model does not easily accommodate many-to-many relationships
between records, which are common in more complex business applications. This limitation
can lead to redundancy and inefficiencies when trying to model more intricate relationships.
 Data Redundancy: In cases where a child has more than one parent, the hierarchical model
can lead to data duplication because each child-parent relationship needs to be stored
separately. This redundancy can consume additional storage and complicate data
management.
 Complex Implementation: Implementing a hierarchical database can be more complex than
more modern relational databases, particularly when dealing with large sets of data that have
diverse and intricate relationships.
 Query Limitations: The hierarchical model typically requires a specific type of query
language that might not be as rich or flexible as SQL, used in relational databases. This can
limit the types of queries that can be executed, affecting the ease and depth of data analysis.

Example:
Company (Root)
├── CEO (Level 1)
│   ├── Vice-President of Sales (Level 2)
│   │   ├── Sales Manager 1 (Level 3)
│   │   └── Sales Manager 2 (Level 3)
│   │       ├── Sales Executive 1 (Level 4)
│   │       └── Sales Executive 2 (Level 4)
│   └── Vice-President of Engineering (Level 2)
│       ├── Engineering Manager 1 (Level 3)
│       └── Engineering Manager 2 (Level 3)
│           ├── Software Engineer 1 (Level 4)
│           └── Software Engineer 2 (Level 4)
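A hierarchical structure like this maps naturally onto nested data in code. Below is a
minimal Python sketch; the traversal function and names are illustrative only:

# Parent-child tree: each child has exactly one parent
org = {
    "Company": {
        "CEO": {
            "VP of Sales": ["Sales Manager 1", "Sales Manager 2"],
            "VP of Engineering": ["Engg Manager 1", "Engg Manager 2"],
        }
    }
}

def walk(node, depth=0):
    # Depth-first traversal: retrieval follows one predefined path
    if isinstance(node, dict):
        for name, children in node.items():
            print("  " * depth + name)
            walk(children, depth + 1)
    else:
        for leaf in node:
            print("  " * depth + leaf)

walk(org)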

2.6.2 Relational model


Proposed as an alternative to the hierarchical model by an IBM researcher
(E. F. Codd), here data is represented in the form of tables. It reduces the
complexity and provides a clear overview of the data, as shown below.
Fig 2.7: Relational Model Structure

Characteristics of the Relational Model


 Data Representation: Data is organized in tables (relations), with rows (tuples)
representing records and columns (attributes) representing data fields.
 Atomic Values: Each attribute in a table contains atomic values, meaning no multi-
valued or nested data is allowed in a single cell.
 Unique Keys: Every table has a primary key to uniquely identify each record,
ensuring no duplicate rows.
 Attribute Domain: Each attribute has a defined domain, specifying the valid data
types and constraints for the values it can hold.
 Tuples as Rows: Rows in a table, called tuples, represent individual records or
instances of real-world entities or relationships.
 Relation Schema: A table’s structure is defined by its schema, which specifies the
table name, attributes, and their domains.
 Data Independence: The model ensures logical and physical data independence,
allowing changes in the database schema without affecting the application layer.
 Integrity Constraints: The model enforces rules like:
o Domain constraints: Attribute values must match the specified domain.
o Entity integrity: No primary key can have NULL values.
o Referential integrity: Foreign keys must match primary keys in the
referenced table or be NULL.
 Relational Operations: Supports operations like selection, projection, join, union,
and intersection, enabling powerful data retrieval and manipulation.
 Data Consistency: Ensures data consistency through constraints, reducing
redundancy and anomalies.
 Set-Based Representation: Tables in the relational model are treated as sets, and
operations follow mathematical set theory principles.
Advantages of the Relational Model
 Simple model: The relational model is simple and easy to use in comparison to other models.
 Flexible: The relational model is more flexible than many other data models.
 Secure: The relational model offers better security than many other data models.
 Data Accuracy: Data is more accurate in the relational data model.
 Data Integrity: The integrity of the data is maintained in the relational model.
 Operations can be Applied Easily: It is easier to perform operations in the relational model.
Disadvantages of the Relational Model
 Relational Database Model is not very good for large databases.
 Sometimes, it becomes difficult to find the relation between tables.
 Because of the complex structure, the response time for queries is high.
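The relational operations listed among the characteristics above (selection, projection,
join) can be sketched in Python with pandas, treating DataFrames as relations. The
tables below are hypothetical:

import pandas as pd

# Two relations (tables) that share the dept_id key
employees = pd.DataFrame({"emp_id": [1, 2, 3],
                          "name": ["Asha", "Ravi", "Meena"],
                          "dept_id": [10, 20, 10]})
departments = pd.DataFrame({"dept_id": [10, 20],
                            "dept": ["Sales", "Engineering"]})

selected  = employees[employees["dept_id"] == 10]       # selection (rows)
projected = employees[["name"]]                         # projection (columns)
joined    = employees.merge(departments, on="dept_id")  # join (combine)
print(joined)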
2.6.3 Network model

The network model is inspired by the hierarchical model. However, unlike the
hierarchical model, this model makes it easier to convey complex relationships, as
each record can be linked with multiple parent records, as shown in figure 2.8. In
this model data can be shared easily and the computation becomes easier.

Fig 2.8: Network Model Structure


Advantages of Network Model
 This model is very simple and easy to design like the hierarchical data model.
 This model is capable of handling multiple types of relationships, which can help in
modeling real-life applications, for example 1:1, 1:M, and M:N relationships.
 In this model, we can access the data easily, and an application can access both the
owner’s and the members’ records within a set.
 This network does not allow a member to exist without an owner which leads to the
concept of Data integrity.
 Unlike the hierarchical model, this model follows a database standard (the CODASYL DBTG specification).
 This model allows multi-parent relationships to be represented.
Disadvantages of Network Model
 The schema or the structure of this database is very complex in nature as all the
records are maintained by the use of pointers.
 Operational anomalies can occur because pointers are used for navigation, which
leads to complex implementation.
 The design or the structure of this model is not user-friendly.
 This model does not have any scope of automated query optimization.
 This model fails in achieving structural independence even though the network
database model is capable of achieving data independence.
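The defining feature of the network model, records with multiple parents, can be
sketched in Python as a small owner-member mapping; the project and employee names
below are invented:

# Network model sketch: a member record may have several owners
links = {
    "Project A": ["Employee 1", "Employee 2"],
    "Project B": ["Employee 2", "Employee 3"],  # Employee 2 has two owners
}

# Navigation follows pointer-like links from owners to members
for owner, members in links.items():
    for member in members:
        print(f"{owner} -> {member}")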

2.6.4 Object-oriented model

Object-Oriented Data Modeling (OODM) integrates object-oriented programming (OOP)
principles with database design. It represents data using objects, classes, inheritance, and
encapsulation, similar to how data is handled in object-oriented programming languages like
Java, Python, or C++. This database model consists of a collection of objects, each with its
own features and methods. This type of database model is also called the post-relational
database model, which is shown as follows:

Fig 2.9: Object Oriented Model Structure

Advantages of Object Oriented Data Model :


 Code can be reused due to inheritance.
 Easily understandable.
 Cost of maintenance can be reduced due to the reusability of attributes and
functions through inheritance.
Disadvantages of Object Oriented Data Model :
 It is not yet fully mature, so it is not easily accepted by users.
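The object-and-class idea behind OODM is the same as in ordinary object-oriented code.
A minimal Python sketch follows; the class names are invented for illustration:

# Objects carry attributes and methods; inheritance enables reuse
class Vehicle:
    def __init__(self, make, model):
        self.make, self.model = make, model

    def describe(self):
        return f"{self.make} {self.model}"

class Car(Vehicle):                 # Car inherits Vehicle's behaviour
    def __init__(self, make, model, doors):
        super().__init__(make, model)
        self.doors = doors

# In an OODB, each of these objects would be stored and retrieved directly
garage = [Car("Tata", "Nexon", 5), Vehicle("Hero", "Splendor")]
for v in garage:
    print(v.describe())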

2.6.5 Entity-relationship model

The entity-relationship model, also known as the ER model, represents entities and
their relationships in a graphical format. An entity could be anything – a concept, a
piece of data, or an object.
You will agree with us that the main goal behind data modeling is to equip
your business and contribute to its functioning. As a data modeler, you can
achieve this objective only when you know the needs of your enterprise
correctly.
It is essential to make yourself familiar with the varied needs of your business
so that you can prioritize and discard the data depending on the situation.
Advantages of E-R Modeling

 ER diagrams use simple visual representations, making them easy to interpret
for both technical and non-technical users.
 Clearly defines entities, attributes, and relationships, helping in database
planning.
 ER models serve as a blueprint for relational databases, making it easier to
translate into SQL schemas.
 Proper normalization in ER models minimizes data duplication and improves
data integrity.
 Can model one-to-one (1:1), one-to-many (1:M), and many-to-many (M:M)
relationships effectively.
 ER diagrams can be directly mapped to a relational database schema.
 ER modeling scales well for large and complex database systems.
 Acts as a common framework for database designers, developers, and
stakeholders.

Disadvantages:
 1. Can Become Complex
 2. Lacks Representation of Behavioural Aspects
 3. Does Not Support Procedural Constructs
 4. Not Suitable for All Databases
 5. Limited Support for Unstructured Data
 6. Requires Expert Knowledge

2.7 Missing Imputations:


Imputation is the process of replacing missing data with substituted values.

Types of missing data


Missing data can be classified into one of three categories
Missing completely at random (MCAR)

 Missingness of a value is independent of attributes


 Fill in values based on the attribute as suggested above (e.g. attribute mean)
 Analysis may be unbiased overall

Missing at Random (MAR)

 Missingness is related to other variables


 Fill in values based on other values (e.g., from similar instances)
 Almost always produces a bias in the analysis

Missing Not at Random (MNAR)

 Missingness is related to unobserved measurements


 Informative or non-ignorable missingness
Imputations: (Treatment of Missing Values)
1. Ignore the tuple: This is usually done when the class label is missing (assuming
the mining task involves classification). This method is not very effective, unless
the tuple contains several attributes with missing values. It is especially poor when
the percentage of missing values per attribute varies considerably.
2. Fill in the missing value manually: In general, this approach is time-consuming
and may not be feasible given a large data set with many missing values.
3. Use a global constant to fill in the missing value: Replace all missing attribute
values by the same constant, such as a label like “Unknown” or -∞. If missing
values are replaced by, say, “Unknown,” then the mining program may mistakenly
think that they form an interesting concept, since they all have a value in common:
that of “Unknown.” Hence, although this method is simple, it is not foolproof.
4. Use the attribute mean to fill in the missing value: Take the average value of
that particular attribute and use it to replace the missing values in that attribute
column.
5. Use the attribute mean for all samples belonging to the same class as the given
tuple:
For example, if classifying customers according to credit risk, replace the missing
value with the average income value for customers in the same credit risk category
as that of the given tuple.
6. Use the most probable value to fill in the missing value: This may be
determined with regression, inference-based tools using a Bayesian formalism, or
decision tree induction. For example, using the other customer attributes in your
data set, you may construct a decision tree to predict the missing values for
income.
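Methods 4 and 5 above can be sketched in Python with pandas. The credit-risk data is
hypothetical, mirroring the example in method 5:

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "risk":   ["low", "low", "high", "high", "high"],
    "income": [52000, np.nan, 31000, 29000, np.nan],
})

# Method 4: fill with the overall attribute mean
overall_mean = df["income"].fillna(df["income"].mean())

# Method 5: fill with the mean of samples in the same class (risk category)
class_mean = df.groupby("risk")["income"].transform(
    lambda s: s.fillna(s.mean()))
print(class_mean)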

2.8 Need for Business Modelling:


The main need for business modelling arises because companies that embrace big data
analytics and transform their business models in parallel will create new opportunities
for revenue streams, customers, products and services. This requires a big data strategy
and vision that identifies and capitalizes on new opportunities.
Business modeling is essential for organizations to analyze, design, and optimize
business processes and strategies. It provides a structured representation of how a
business operates, enabling better decision-making and efficiency. Here are some key
reasons why business modeling is needed:
1. Clarity in Business Processes
 Helps visualize workflows, roles, and interactions within an organization.
 Identifies bottlenecks and inefficiencies in operations.
 Ensures consistency and standardization across departments.
2. Improved Decision-Making
 Provides a data-driven approach to strategic planning.
 Helps in risk assessment by simulating different scenarios.
 Supports better allocation of resources and budgeting.
3. Alignment with Business Goals
 Ensures that business processes align with organizational objectives.
 Helps in defining key performance indicators (KPIs) and measuring success.
 Bridges the gap between business strategy and execution.
4. Support for Digital Transformation
 Facilitates automation by defining workflows for Business Process
Management (BPM).
 Helps in integrating AI, machine learning, and cloud computing into
business operations.
 Assists in data modeling for better decision support systems.
5. Risk Management and Compliance
 Identifies potential risks and vulnerabilities in business operations.
 Helps in regulatory compliance by ensuring processes adhere to industry
standards.
 Improves auditability and transparency in operations.
6. Scalability and Growth
 Allows businesses to plan for expansion and scalability.
 Ensures that processes can handle increased workloads and market changes.
 Supports mergers, acquisitions, and business restructuring.
7. Better Communication and Collaboration
 Acts as a common language for stakeholders, reducing misunderstandings.
 Enhances coordination between departments, teams, and partners.
 Facilitates better training and onboarding for employees.
8. Competitive Advantage
 Helps businesses identify market trends and opportunities.
 Enhances customer experience by streamlining service delivery.
 Supports innovation by exploring new business models and revenue streams.
Common Business Modeling Techniques
 Business Process Modeling (BPM) – Flowcharts, BPMN, UML diagrams
 Value Chain Analysis – Understanding key value-creating activities
 SWOT Analysis – Identifying Strengths, Weaknesses, Opportunities, Threats
 Balanced Scorecard – Measuring performance across different perspectives
 Lean and Six Sigma Models – Optimizing processes and reducing waste
