
UNIT – 2:

Data Analytics: Introduction to Analytics, Introduction to Tools and Environment, Application of Modeling in
Business, Databases & Types of Data and Variables, Data Modeling Techniques, Missing imputations etc, Need
for Business Modeling.

Introduction to Analytics: Analytics refers to the techniques used to analyze data in order to enhance productivity and business gain. More precisely, it is the systematic process of analyzing data to uncover patterns, trends, and insights that can inform decision-making and improve outcomes. It involves collecting, processing, and interpreting data to generate actionable insights. Analytics can be applied in a variety of fields, such as business, healthcare, finance, and sports.

As an enormous amount of data gets generated, the need to extract useful insights is a must for a business enterprise. Data analytics has a key role in improving your business. Here are the main factors that signify the need for data analytics:
Gather Hidden Insights – Hidden insights from data are gathered and then analyzed with respect to business requirements.
Generate Reports – Reports are generated from the data and passed on to the respective teams and individuals to take further action and drive business growth.
Perform Market Analysis – Market analysis can be performed to understand the strengths and weaknesses of competitors.
Informed Decision-Making – Analysis of data provides a basis for informed decision-making by offering insights into past performance, current trends, and potential future outcomes.
Improve Business Requirements / Business Intelligence – Analyzed data helps organizations gain a competitive edge by identifying market trends, customer preferences, and areas for improvement. It also allows businesses to better meet customer requirements and improve the customer experience.

In general, data analytics also depends on a degree of human knowledge: under each type of analytics, some human input is required for prediction. Descriptive analytics requires the highest human input, predictive analytics requires less, and prescriptive analytics requires little or no human input, since the recommendations are derived automatically from the predicted data.
Steps Involved in the Data Analytics:
Step 1: Defining Problem or objectives and questions
The first step in the data analysis process is to define the objectives and formulate clear, specific questions that
your analysis aims to answer. This step is crucial as it sets the direction for the entire process. It involves
understanding the problem or situation at hand, identifying the data needed to address it, and defining the
metrics or indicators to measure the outcomes.
Step 2: Data collection
Once the objectives and questions are defined, the next step is to collect the relevant data. This can be done
through various methods such as surveys, interviews, observations, or extracting from existing databases. The
data collected can be quantitative (numerical) or qualitative (non-numerical), depending on the nature of the
problem and the questions being asked.
Step 3: Data cleaning
Data cleaning, also known as data cleansing, is a critical step in the data analysis process. It involves checking the
data for errors and inconsistencies, and correcting or removing them. This step ensures the quality and reliability
of the data, which is crucial for obtaining accurate and meaningful results from the analysis.
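As a minimal illustration of this step, the short Python sketch below cleans a small, invented customer table with pandas (the column names and cleaning rules are assumptions made for the example, not part of any specific dataset):

import pandas as pd
import numpy as np

# Invented raw data showing typical quality problems
raw = pd.DataFrame({
    "customer_id": [101, 102, 102, 103, 104],
    "age": [34, np.nan, np.nan, 29, 250],                         # missing and impossible values
    "city": [" Delhi", "Mumbai", "Mumbai", "delhi", "Chennai"],   # inconsistent text
})

clean = (
    raw.drop_duplicates(subset="customer_id")                     # remove duplicate records
       .assign(city=lambda d: d["city"].str.strip().str.title())  # standardize text values
)
clean.loc[clean["age"] > 120, "age"] = np.nan                     # treat impossible ages as missing
clean["age"] = clean["age"].fillna(clean["age"].median())         # simple median fill
print(clean)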
Step 4: Data analysis
Once the data is cleaned, it's time for the actual analysis. This involves applying statistical or mathematical
techniques to the data to discover patterns, relationships, or trends. There are various tools and software
available for this purpose, such as Python, R, Excel, and specialized software like SPSS and SAS.
Step 5: Data interpretation and visualization
After the data is analyzed, the next step is to interpret the results and visualize them in a way that is easy to
understand. This could involve creating charts, graphs, or other visual representations of the data. Data
visualization helps to make complex data more understandable and provides a clear picture of the findings.
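For instance, the brief matplotlib sketch below (with made-up monthly sales figures, used only for illustration) turns a small table of numbers into a chart that is easier to interpret:

import matplotlib.pyplot as plt

# Made-up monthly sales figures for illustration only
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
sales = [120, 135, 128, 150, 170, 165]

plt.figure(figsize=(6, 3))
plt.bar(months, sales, color="steelblue")   # a bar chart of sales per month
plt.title("Monthly Sales (sample data)")
plt.xlabel("Month")
plt.ylabel("Sales (units)")
plt.tight_layout()
plt.show()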
Step 6: Presenting Data and Data storytelling
The final step in the data analysis process is data storytelling. This involves presenting the findings of the analysis
in a narrative form that is engaging and easy to understand. Data storytelling is crucial for communicating the
results to non-technical audiences and for making data-driven decisions
Introduction to Tools and Environment:
Analytics is nowadays used in all fields, ranging from medical science to aerospace to government activities.
Data science and analytics are used by manufacturing companies as well as real estate firms to develop their business and solve various issues with the help of historical databases.
Tools are the software that can be used for analytics, such as SAS, R, or Python, while techniques are the procedures to be followed to reach the solution.
Various steps involved in Analytics:
1. Access
2. Manage
3. Analyze
4. Report
Different Types of Tools Used in Data Analytics:
I. Python: Python is a high-level, general-purpose programming language. Known for its interpreted nature, Python stands out for its straightforward syntax and flexibility, which appeal greatly to developers. Its simplicity and adaptability make it especially effective for data analysis, machine learning, automation, and creating web applications.
Features
 Extensive support for libraries like Pandas and NumPy for data analysis and manipulation.
 Strong community support and open-source libraries for machine learning and data science.
Real-world Applications
 Web scraping and data extraction.
 Predictive analytics in finance and retail.
 Development of AI and machine learning models.
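A short, hedged example of the kind of analysis Pandas and NumPy make easy (the regional revenue figures below are invented for illustration):

import numpy as np
import pandas as pd

# Invented regional revenue data
df = pd.DataFrame({
    "region": ["North", "South", "North", "East", "South"],
    "revenue": [250.0, 310.0, 190.0, 420.0, 275.0],
})

# Summarize revenue by region with pandas, then compute a NumPy statistic
print(df.groupby("region")["revenue"].agg(["mean", "sum"]))
print("Sample standard deviation:", np.std(df["revenue"].to_numpy(), ddof=1))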
II. R Programming Language: R is a popular programming language and a free software environment dedicated to statistical computing and graphics. It enjoys widespread popularity among statisticians and data miners for creating statistical software and analyzing data.
Features
 Comprehensive statistical analysis toolkit.
 Extensive packages for data manipulation, visualization, and modeling.
Real-world Applications
 Statistical computing for biomedical research.
 Financial modeling and analysis.
 Data visualization for academic research.
III. Tableau: A powerful tool for creating interactive dashboards and data visualizations. It helps simplify raw data into an easily understandable format. Data analysis is very fast with Tableau, and visualizations are created as dashboards and worksheets.
Features
 Allows for easy integration with databases, spreadsheets, and big data queries.
 Offers drag-and-drop functionality for creating interactive and shareable dashboards.
Real-world Applications
 Business intelligence to enhance decision-making.
 Sales and marketing performance tracking.
 Supply chain, inventory, and operations management.
IV. QlikView: It is business intelligence software that converts raw data into useful insights. It analyzes and
visualizes data, highlighting connections between different sources. This helps users explore data thoroughly and
identify patterns, making it easier to make informed decisions and plan effectively.
Features
 Interactive dashboards and associative exploration.
 In-memory data processing for faster responses.
Real-world Applications
 Business performance monitoring in real-time.
 Sales and customer analysis for retail.
 Supply chain and logistics optimization.
V. SAS: The Statistical Analysis System (SAS) is a software suite from the SAS Institute used for advanced
analytics, data management, and predictive analysis. Known for its strong statistical modeling features, SAS is
widely used in various industries for detailed data exploration and generating insights.
Features
 Provides a powerful environment for data analysis and visualization.
 Offers extensive libraries for advanced statistical analysis.
Real-world Applications
 Clinical trial analysis in pharmaceuticals.
 Risk assessment in banking and finance.
 Customer segmentation in retail.
VI. Microsoft Excel: Microsoft Excel, part of Microsoft Office, is a spreadsheet application that offers tools for calculations, graphing, pivot tables, and a programming language called VBA (Visual Basic for Applications). Its wide use for data analysis and visualization highlights its usefulness and flexibility in managing various data tasks.
Features
 Powerful data analysis and visualization tools.
 VBA for custom scripts and automation.
Real-world Applications
 Financial reporting and analysis.
 Inventory tracking and management.
 Project planning and tracking.

VII. Rapid Miner: It is a complete data science platform that provides a single environment for data preparation,
machine learning, deep learning, text mining, and predictive analytics. It is suitable for users of all skill levels,
offering tools for a variety of data science tasks.
Features
 A visual workflow designer for easy model building.
 Extensive data mining functionality for predictive modeling.
Real-world Applications
 Predictive maintenance in manufacturing.
 Customer churn prediction in telecommunications.
 Fraud detection in banking and finance.

VIII. KNIME: It is an open-source data analytics, reporting, and integration platform that allows users to create data flows visually, selectively execute some or all analysis steps, and inspect the results, models, and interactive views.
Features
 Node-based interface allows for easy assembly of workflows.
 Supports integration with various data sources and types.
Real-world Applications
 Pharmaceutical research data analysis.
 Customer data analysis for marketing insights.
 Financial data analysis for risk modeling.
IX. OpenRefine: OpenRefine (formerly Google Refine) is a powerful tool used for cleaning, transforming, and organizing messy data, especially in large datasets.
Features of OpenRefine:
 Data Cleaning and Transformation
 Faceted Browsing (It offers an easy way to explore data by filtering and categorizing it, making it simpler
to review and refine large datasets interactively.)
Real-World Applications:
 Preparing datasets for analysis
 Standardizing metadata for libraries and museums.
X. Apache Spark: Apache Spark is an open-source computing framework that manages clusters with automatic
data handling and error recovery. It supports tasks like batch processing, real-time streaming, interactive
queries, and machine learning.
Features
 Offers high-speed processing for large-scale data operations.
 Supports sophisticated analytics capabilities, including machine learning and graph algorithms.
Real-world Applications
 Real-time data processing and analytics.
 Machine learning model development.
 Large-scale data processing in financial services.
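As a brief sketch of what Spark code can look like in practice, the example below uses PySpark to summarize an assumed local CSV file named sales.csv with region and amount columns (the file name and columns are placeholders for the example):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session (assumes PySpark is installed)
spark = SparkSession.builder.appName("SalesSummary").getOrCreate()

# "sales.csv" is an assumed example file with columns: region, amount
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# Aggregate total and average sales per region; Spark distributes the work across the cluster
summary = df.groupBy("region").agg(
    F.sum("amount").alias("total_sales"),
    F.avg("amount").alias("avg_sales"),
)
summary.show()
spark.stop()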
XI. Apache Hadoop: It is a free, open-source framework for storing and processing large data sets across clusters
of standard hardware. It scales from one server to thousands, with each machine offering storage and
computing power, making it easy to manage huge amounts of data.
Features
 Distributed processing of large data sets across clusters.
 High fault tolerance and scalability.
Real-world Applications
 Big data processing and analysis for insights.
 Data warehousing and storage.
 Log and event data analysis for cyber security.
XII. Microsoft Azure: Developed by Microsoft, Azure offers a cloud computing platform for creating, testing, deploying, and managing applications and services via data centers managed by Microsoft. This service encompasses a range of solutions, including SaaS (Software as a Service), PaaS (Platform as a Service), and IaaS (Infrastructure as a Service). It is designed to support various programming languages, tools, and frameworks, accommodating both Microsoft-specific and third-party technologies and systems.
Features:
 Wide range of cloud services, including AI and machine learning, analytics, and databases.
 Scalability and flexibility with a pay-as-you-go pricing model.
Real-world Applications
 Building and deploying web applications.
 Big data and analytics solutions.
 Development of IoT applications.

The future data analytics environment must expand to incorporate a full spectrum of analytics utilities and
capabilities, including:
1. Predictive analytics: Uses data mining, machine learning, and artificial intelligence techniques to develop models for predicting future behavior.
2. Prescriptive analytics: Provides recommendations for optimal outcomes among selected options based on predictive analytics. It helps automate decision processes.
3. Integrated analytics: Allows developed analytical models to be integrated within the information flow to execute automated decision support and execution.
4. Feature extraction and text analytics: It helps automatically identify and extract features from
semi-structured and unstructured data that can then be used to fuel predictive and prescriptive
analysis.
Categories of Tools in Data Analytics
1. Data Collection Tools: Tools used to gather raw data from various sources such as databases, web APIs,
sensors, or direct user inputs.
Example: API, Web Scraping Tools, SQL
2. Data Processing and Cleaning Tools: Tools designed to clean, transform, and prepare raw data for
analysis.
Example: Python programming language, ETL Tools.
3. Data Analysis Tools: These tools enable users to analyze and model data, perform statistical analysis,
and identify patterns or trends.
Examples: Python, R programming, Excel and SAS
4. Data Visualization Tools: These tools allow users to create visual representations of data, making it
easier to understand and communicate insights.
Examples: Tableau, Power BI, Google Data Studio, Matplotlib/Seaborn (Python)
Data analytics tools are essential for turning raw data into actionable insights. The right combination of tools
enables businesses to handle data at scale, analyze complex patterns, and make data-driven decisions. From
collecting and cleaning data to performing in-depth analysis and visualization, these tools automate and simplify
many processes, helping businesses harness the power of data for growth and innovation.
Application of Modeling in Business:
Data analytics plays a vital role in business, serving various purposes that depend on specific business needs, as
discussed below. Today, most businesses use large amounts of data for predictions. Companies need new
capabilities to make decisions based on big data, but many struggle to access all their data resources.
Different sectors gain valuable insights from structured data collected through enterprise systems and analyzed
by commercial database management systems. For example:
1. Social Media: Facebook and Twitter analyze the immediate impact of campaigns and consumer opinions
about products.
2. E-commerce: Companies like Amazon, eBay, and Google look at performance factors to understand
what drives sales revenue and user engagement.
Utilizing Hadoop: Hadoop is an open-source software platform that enables processing of large data sets in a distributed computing environment.
Work on big data with Hadoop discusses the rules for building, organizing, and analyzing huge data sets in the business environment. Authors in this area have proposed three architecture layers, pointed to graphical tools for exploring and representing unstructured data, and described how well-known companies could improve their business.
Optimize Operations: Companies can streamline operations like inventory management, supply chain, and
customer service by analyzing large amounts of data quickly.
Improve Decision-Making: With Hadoop's data processing power, businesses can make data-driven decisions
faster, based on real-time insights.
Enhance Customer Experience: By analyzing customer data more efficiently, companies can better understand
customer preferences and behavior, leading to more personalized services and products.
Reduce Costs: Hadoop's distributed computing allows companies to use regular, cheaper hardware, making it
more cost-effective to store and process big data.
E.g., Google, Twitter, and Facebook focus their attention on processing big data within cloud environments.
Map-Reduce Architecture:

Data flow in MapReduce:

Input Data
  -> Split data into chunks (input splits)
  -> [Mapper 1] [Mapper 2] ... [Mapper N]      (each emits (Key, Value) pairs)
  -> Shuffle & Sort (group by Key)
  -> [Reducer 1] [Reducer 2] ... [Reducer N]
  -> Final Output Data
The MapReduce architecture consists of two main phases: Map and Reduce. Here's a simple breakdown of how the process works:
1. MapReduce is a software framework and programming model used for processing huge amounts of data.
2. MapReduce programs work in two phases, namely Map and Reduce. Map tasks deal with splitting and mapping the data, while Reduce tasks shuffle and reduce the data.
3. Hadoop is capable of running MapReduce programs written in various languages: Java, Ruby, Python, and C++.
4. MapReduce programs are parallel in nature and are therefore very useful for performing large-scale data analysis using multiple machines in a cluster.
5. The input to each phase is key-value pairs. In addition, every programmer needs to specify two functions: a map function and a reduce function.
The data goes through the following phases of MapReduce in big data:
Input Splits: The raw data is split into smaller chunks or blocks for parallel processing. The input to a MapReduce job is divided into fixed-size pieces called input splits. An input split is the chunk of the input that is consumed by a single map task.
Mapping: The input data is fed into multiple "Mapper" tasks. In this phase, the data in each split is passed to a mapping function to produce output values; each Mapper processes a small chunk of data. In a word count, the job of the mapping phase is to count the number of occurrences of each word in its input split and prepare a list in the form of <word, frequency>.
The Mapper transforms this data into key-value pairs (e.g., (word, 1) for a word count).
Shuffling and Sorting: This phase consumes the output of the mapping phase. The key-value pairs from all Mappers are shuffled and grouped by key. The data is sorted, ensuring that all identical keys are grouped together (e.g., all occurrences of a word), i.e., the same words are clubbed together along with their respective frequencies.
Reducing: In this phase, output values from the shuffling phase are aggregated. The grouped key-value pairs are sent to "Reducer" tasks. The Reducer processes each group of keys, performing operations like summing, counting, or aggregating.
This phase combines the values from the shuffling phase and returns a single output value per key. In short, this phase summarizes the complete dataset.
Output Data: The results from the Reducer tasks are written to the final output, typically stored in HDFS (Hadoop Distributed File System).
Example: Consider you have following input data for your Map Reduce in Big data Program
Welcome to Hadoop Class
Hadoop is good
Hadoop is bad
The "Word Count Problem" uses the Map Reduce framework as follows:
1. Data Splitting: The process begins by dividing the data into smaller chunks, which will be processed by
mappers.
2. Mapping: Each mapper generates key/value pairs from the data.
3. Shuffling: The key/value pairs are then shuffled to group the same keys together on the same worker
node.
4. Reducing: Finally, the reduce functions count the occurrences of each word and produce a final output.
As a result, the output will be a sorted list of word counts based on the original text input.
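To make the flow concrete, here is a minimal pure-Python sketch that simulates the map, shuffle, and reduce phases for the three input lines above (it only illustrates the logic; a real Hadoop job would distribute this work across a cluster):

from collections import defaultdict

lines = ["Welcome to Hadoop Class", "Hadoop is good", "Hadoop is bad"]

# Map phase: each line (split) emits (word, 1) pairs
mapped = []
for line in lines:
    for word in line.split():
        mapped.append((word, 1))

# Shuffle & sort phase: group the emitted values by key (word)
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: sum the counts for each word
word_counts = {word: sum(counts) for word, counts in sorted(grouped.items())}
print(word_counts)
# {'Class': 1, 'Hadoop': 3, 'Welcome': 1, 'bad': 1, 'good': 1, 'is': 2, 'to': 1}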
The Employment of Big Data Analytics at IBM: IBM and Microsoft are leading companies in big data. IBM provides tools for storing, managing, and analyzing data, with a strong focus on business intelligence and healthcare. Microsoft also excels in big data but emphasizes cloud computing. Additionally, Facebook and Twitter collect data from user profiles to increase their revenue.
The Performance of Data-Driven Companies
Big data analytics and business intelligence are important in both business and academia. Companies strive to
gain insights from the increasing variety, volume, and speed of data (the three V's) to enhance their decision-
making.
Data Modeling: Data modeling is the process through which data is structured and stored in a defined format in a database. Data modeling is important because it enables organizations to make data-driven decisions and meet varied business goals.
You are required to have a deep understanding of the structure of the organization and then propose a solution that aligns with its end goals and helps it achieve the desired objectives.
Types of Data Models:
1. Hierarchical model
2. Relational model
3. Network model
4. Object-oriented model
5. Entity-relationship model
Hierarchical model: Data is stored in a tree-like structure with parent and child records that comprise a
collection of data fields. A parent can have one or more children, but a child record can have only one parent.
The hierarchical model is also composed of links, which are the connections between records, and types, which specify the kind of data contained in each field.

Relational model: The relational model represents data as rows and columns in tables and captures the links between tables. It is frequently utilized in database design and is strongly related to relational database management systems (RDBMS).
Network Model: It is a way of organizing data where information is stored in the form of records, and these
records are connected to each other through links (or pointers).
Think of it like a web, where different pieces of information (nodes) are connected through lines (relationships).
Each record can have multiple parent and child records, making it flexible to represent complex relationships
between data.

Object-oriented data model: In this model, data is represented as objects, similar to those used in object-oriented programming; the object-oriented method creates objects with stored values. In addition to allowing data abstraction, inheritance, and encapsulation, the object-oriented architecture facilitates communication.

Entity-relationship model: A high-level relational model called the entity-relationship model (ER model) is used
to specify the data pieces and relationships between the entities in a system. This conceptual design gives us an
easier-to-understand perspective on the facts. An entity-relationship diagram, which is made up of entities,
attributes, and relationships, is used in this model to depict the whole database.
A relationship between entities is called an association. Mapping cardinalities describe how many entities can participate in an association:
 one to one
 one to many
 many to one
 many to many

Databases:
A database is an organized collection of data that enables users to store, manage, and retrieve information
efficiently. It provides a structured way to organize data, making it easy to access, update, and manipulate.
Databases are commonly used in various applications, from small software programs to large enterprise
systems. A database is usually controlled by a database management system (DBMS).
Databases can be primarily divided into two main types:
 Single-file: Used for representing a single piece of information or data, they use individual files and simple
structures.
 Multi-file relational: These databases are relatively a lot more complicated, and they make use of tables in
order to display the relationship between different sets of data.

The database can be divided into various categories such as text databases, desktop database programs, relational database management systems (RDBMS), and NoSQL and object-oriented databases.
Text Databases: A text database is a system that maintains a (usually large) text collection and provides fast and accurate access to it.
E.g., textbooks, magazines, journals, manuals, etc.
Desktop Databases: A desktop database is a database system that is made to run on a single computer or PC.
These simpler solutions for data storage are much more limited and constrained than larger data center or data
warehouse systems, where primitive database software is replaced by sophisticated hardware and networking
setups.
E.g., Microsoft Excel, Microsoft Access, etc.
Cloud Databases: A cloud database is used where data requires a virtual environment for storage and execution over cloud platforms, and there are many cloud computing services for accessing the data from such databases (such as SaaS, PaaS, etc.).
Some well-known cloud platforms are:
 Amazon Web Services (AWS)
 Google Cloud Platform (GCP)
 Microsoft Azure
 ScienceSoft, etc.
Centralized Databases: A centralized database is a type of database that is stored, located, and maintained at a single location, and it is more secure when users fetch data from it. It provides data security, reduced redundancy, and consistency.
However, the size of a centralized database is large, which increases response and retrieval time, and it is not easy to modify, delete, or update.
Personal Databases: A personal database is a small, single-user database designed for individual use on a
personal computer or mobile device. It is often used to manage personal information like contacts, budgets, or
notes. These databases are simple, easy to use, and don’t need advanced management. They are perfect for
individuals or small tasks where only one person needs access. It is easy to handle and occupies less space.
Examples include Microsoft Access and SQLite.
Object-Oriented Databases: Those familiar with Object-Oriented Programming (OOP) can easily understand
this type of database. Information in the database is treated as objects, which can be referenced and used
without hassle. This reduces the workload on the database. Object-oriented databases follow OOP principles,
like those used in languages such as C++, Java, C#, and others.
NoSQL Databases: A NoSQL database stores and manages data in formats other than the traditional tables used in relational (SQL) databases. It is designed to handle large amounts of unstructured or semi-structured data, making it more flexible and scalable for certain applications.
 Flexible data storage: can store data in various formats, such as documents, key-value pairs, or graphs.
 Scalable: can easily grow by adding more machines to handle large amounts of data.
 Faster for certain tasks: optimized for specific operations that can be slower in traditional databases.
E.g., MongoDB (document-based), Redis (key-value store), Neo4j (graph database). NoSQL databases are often used for big data, real-time web apps, and scenarios where the data structure isn't fixed.
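As a small, hedged illustration of the document model, the sketch below uses the pymongo driver against an assumed local MongoDB instance (the database, collection, and field names are invented for the example):

from pymongo import MongoClient

# Connect to an assumed local MongoDB server
client = MongoClient("mongodb://localhost:27017")
db = client["shop_demo"]

# Documents in the same collection need not share a fixed schema
db.orders.insert_many([
    {"order_id": 1, "customer": "Asha", "items": ["pen", "notebook"], "total": 120},
    {"order_id": 2, "customer": "Ravi", "total": 450, "coupon": "NEW10"},
])

# Query by a field value, similar to filtering rows in SQL
for order in db.orders.find({"total": {"$gt": 100}}):
    print(order["order_id"], order["customer"], order["total"])

client.close()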
Types of Data and Variables:
Data refers to information that can be collected, observed, and analyzed. It can be numbers, text, images, or any
other form of information that is processed by a computer. Data is used in various fields like business, science,
and education to help make decisions, understand patterns, and gain insights.
Types of Variables:

Variables are the elements that hold data, and they can be of different types:
1. Categorical / Qualitative Variables
2. Numerical / Quantitative Variable
1. Categorical / Qualitative Variables: These are a type of data that can be grouped into categories based on certain characteristics.
They are typically used in statistical analysis to measure the relationships between different factors in a study. Categorical variables are also known as qualitative variables because they represent values without any numerical significance.
The following are different types of categorical variables:
I. Binary / Boolean variables: Binary variables are commonly used when measuring dichotomous
outcomes and whether someone is classified as belonging to a particular group or not.
Examples could include gender (male / female), current employment status (yes/no), etc.
II. Nominal variables: Nominal variables are generally used when there are multiple categories that need to be identified but cannot be compared against each other due to their qualitative nature.
Examples include eye color, nationality, religious belief, and profession (doctor, nurse, lawyer, engineer, etc.), with no order to their relative importance or priority.
Another example is marital status (single, married, divorced, widowed). Again, there is no ordering between these states; they are all equally important.
III. Ordinal variables: Ordinal categorical variables are variables that represent categories of data in
which the order of the categories has meaning. The categories are ranked, so that one is “greater
than” or “less than” the other. Example: A survey about a customer’s satisfaction with a product or
service on a scale from 1-5, where 1 is extremely dissatisfied and 5 is extremely satisfied. In this
situation, the higher numbers represent a greater amount of satisfaction.
2. Numerical / Quantitative Variables: Numerical variables are a type of variable used in data analysis to
quantify or measure the characteristics of an entity or phenomenon. They are also known
as quantitative variables because they involve counting, measuring, or assigning values to a particular
characteristic. Numerical variables can be divided into two main types: continuous and discrete.
Examples of quantitative data:
 Income of individuals
 Daily temperature
 Test scores
 Price of items
 Number of hours of study
 Weight of a person
I. Discrete data: Discrete data refers to data values that can only attain specific values and not a range of values. In other words, discrete data involves only whole counts; it cannot be divided into fractional parts. For example, families come in all shapes and sizes, and the number of members per family is discrete data: it is counted rather than measured, so you cannot have an answer like 1.5. You can represent discrete data using bar charts, column charts, spider charts, stacked bar charts, and stacked column charts.
Common examples of discrete data:
 The number of participants in an event
 The number of students in a school
 The number of questions in an exam
 The number of employees in a company
 The number of chairs in a room
 The number of biscuits in a packet
II. Continuous data (interval or ratio data): Continuous data is quantitative data whose points are not separated by distinct intervals. It can take any value within a specific range and can be further subdivided. Its values typically lie between the lowest and highest observed values. Continuous data changes over time, may or may not consist of whole numbers, and is often examined using line graphs, distribution shapes (skewness), and other data analysis methods.
Discrete data can only take certain values, while continuous data can take any value within a given range. You can tabulate continuous variables using a frequency distribution table.
Common examples of continuous data:
 Height of a student
 Temperature recordings of a place
 Speed of a car or bike
 Daily wind speed
 Length of customer service calls
 Time required to complete a task
i. Interval Scale –
An interval scale has ordered numbers with meaningful divisions; the magnitude between consecutive intervals is equal. Interval scales do not have a true zero, i.e., in Celsius 0 degrees does not mean the absence of heat.
Interval scales have the properties of:
 Identity
 Magnitude
 Equal distance
For example, temperature on a Fahrenheit/Celsius thermometer: 90° is hotter than 45°, and the difference between 10° and 30° is the same as the difference between 60° and 80°.
ii. Ratio Scale –
The ratio scale of measurement is similar to the interval scale in that it also represents quantity and has
equality of units with one major difference: zero is meaningful (no numbers exist below the zero). The
true zero allows us to know how many times greater one case is than another. Ratio scales have all of
the characteristics of the nominal, ordinal and interval scales. The simplest example of a ratio scale is
the measurement of length. Having zero length or zero money means that there is no length and no
money but zero temperature is not an absolute zero.
Properties of Ratio Scale:
 Identity
 Magnitude
 Equal distance
 Absolute/true zero

Data Modeling Techniques: Data modeling is the process through which data is structured and stored in a defined format in a database. Data modeling is important because it enables organizations to make data-driven decisions and meet varied business goals. The entire process of data modeling is not as easy as it seems, though: you are required to have a deep understanding of the structure of an organization and then propose a solution that aligns with it.

Data modeling techniques are methods used to visually represent and structure how data is organized, stored,
and managed in a database. These techniques help design databases that support efficient data storage,
retrieval, and manipulation.
They allow business and technical personnel to collaborate on how data will be kept, accessed, shared, updated, and utilized within an organization.
The primary goal of using a data model technique is to create a clear, structured representation of data that
ensures efficient storage, retrieval, and management of information in a database. Key objectives include:
1. Organizing Data Clearly: Data modeling helps structure complex data into easily understandable
entities, relationships, and rules.
2. Improving Communication: It provides a visual representation of data that makes it easier for
stakeholders, such as developers, analysts, and business users, to understand how data will be stored
and used.
3. Ensuring Data Integrity: A good data model defines rules and constraints that ensure the accuracy,
consistency, and reliability of data.
4. Reducing Redundancy: By organizing data efficiently, data models minimize duplication, preventing
unnecessary storage of the same information multiple times.
5. Facilitating Database Design: Data modeling techniques provide a blueprint for creating and optimizing
databases, ensuring they meet performance requirements and are scalable.
6. Supporting Decision-Making: Well-structured data models enable businesses to retrieve and analyze
data more effectively, improving the quality of insights and decisions.
The primary goal is to create a logical framework for data that enhances its usability, accuracy, and efficiency in
various applications.
Types of Data Models
There are three main types of data models:

1. Conceptual Data Model: It is a high-level view of data, focusing on key business concepts. It is used early in a project to understand important ideas and needs. This model helps organize business problems, rules, and concepts, allowing businesses to see data like market trends, customer details, and purchases. It acts as a starting point for more detailed models developed later.
 Purpose: Provides a high-level, abstract view of the data, focusing on the big picture without diving into
the details.
 Features:
o Shows entities, relationships, and key attributes.
o Does not include technical details like data types or how the data will be stored.
o Used for communication between business stakeholders and database designers.
 Example: A simple diagram showing "Customer" and "Order" entities and how they are related.
2. Logical Data Model: It builds on the conceptual model by providing a detailed view of the data structure,
including tables, columns, relationships, and rules. While it doesn't depend on a specific database system, it
closely resembles how the data will be organized in a database. The physical design of databases is based on this
model
 Purpose: Expands on the conceptual model by detailing the structure of the data without considering
how it will be physically implemented.
 Features:
o Defines entities, attributes, relationships, primary keys, and foreign keys.
o Includes more detail than the conceptual model but is still technology-independent.
o Helps in designing the database structure in a way that supports business rules.
 Example: A more detailed diagram showing entities like "Customer" with attributes like "Customer ID,"
"Name," and "Address," along with relationships to the "Order" entity.
3. Physical Data Model: It explains how to set up a database using a specific system. It outlines all the necessary
components, such as tables, columns, and constraints like primary and foreign keys. Its main purpose is to guide
database creation and is designed by developers and database administrators (DBAs). This model helps create
the database schema and shows how the data model is put into practice, including constraints and features of
the relational database management system (RDBMS).
 Purpose: Describes how the data will be physically stored in a database system.
 Features:
o Includes table names, column data types, indexes, and constraints.
o Maps the logical model to the actual database structure, considering the database management
system (DBMS) being used.
o Focuses on performance optimization, storage, and implementation.
 Example: A database schema that defines specific tables, column types (e.g., VARCHAR, INT), and
indexing strategies.
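To make the physical level concrete, the minimal sketch below uses Python's built-in sqlite3 module to create hypothetical Customer and Order tables like those in the examples above (the table and column definitions are illustrative, and SQLite's TEXT/INTEGER types stand in for VARCHAR/INT in other RDBMSs):

import sqlite3

conn = sqlite3.connect(":memory:")   # throwaway in-memory database for the example
cur = conn.cursor()

# Physical model: concrete tables, column types, keys, and constraints
cur.execute("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        address     TEXT
    )
""")
cur.execute("""
    CREATE TABLE "order" (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL,
        order_date  TEXT,
        FOREIGN KEY (customer_id) REFERENCES customer(customer_id)
    )
""")

cur.execute("INSERT INTO customer VALUES (1, 'Asha', 'Hyderabad')")
cur.execute('INSERT INTO "order" VALUES (10, 1, \'2024-01-15\')')
conn.commit()

# Join the two tables to show the relationship defined by the foreign key
for row in cur.execute('SELECT c.name, o.order_id FROM customer c JOIN "order" o ON c.customer_id = o.customer_id'):
    print(row)

conn.close()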
In summary, conceptual models focus on understanding the data at a high level, logical models define the
structure in more detail, and physical models show how the data will be stored and implemented in the
database system.
Importance of data modeling:
 A clear representation of data makes it easier to analyze the data properly. It provides a quick overview of the data, which can then be used by developers in varied applications.
 Because data modeling represents the data properly in the model, it rules out any chance of data redundancy and omission. This helps in clear analysis and processing.
 Data modeling improves data quality and enables the concerned stakeholders to make data-driven decisions.
Data modeling is important because it helps organizations structure and manage data effectively. Here are the key reasons why data modeling is essential:

1. Understand Your Business Needs


 The main goal of data modeling is to support your business operations.
 To do this, you must clearly understand the needs of your business.
 Prioritize and discard data based on the situation.
 Key takeaway: Understand your organization’s requirements and organize your data properly.
2. Start Simple and Scale as You Grow
 Keep your data models small and simple at first to avoid complexity.
 As your business grows, you can introduce more data.
 Key takeaway: Start with simple models and use tools that can scale up as needed.
3. Organize Data by Facts, Dimensions, Filters, and Order
 To answer business questions effectively, organize data using these four elements:
o Facts (e.g., total sales data),
o Dimensions (e.g., store locations),
o Filters (e.g., last 12 months),
o Order (e.g., ranking of stores).
 Key takeaway: Organize your data properly to enable quick analysis.
4. Keep Only the Data You Need
 Don't store unnecessary data—it can slow down performance. Keep only the data that’s useful for
business analysis.
 Key takeaway: Keep only the essential datasets to avoid wasting resources.
5. Cross-Check Data Regularly
 Continuously review your data models as you build them.
 Ensure that attributes like product IDs are correctly used to identify records.
 Key takeaway: Use one-to-one or one-to-many relationships to avoid complexity.
6. Allow Data Models to Evolve: Data models need to be updated as your business changes. Store them
in a way that allows easy updates.
Key takeaway: Keep your data models updated to stay relevant.
7. Final Wrap-Up: Data modeling is crucial for making decisions based on facts. By modeling data correctly
and keeping the system simple, businesses can achieve meaningful insights.

This approach helps ensure that your data model adapts as your business grows while staying easy to
manage and effective.
Missing imputation: It is a technique used in data analysis to fill in missing values in a dataset. When data is
incomplete, it can lead to biased results or errors in analysis. Imputation helps to replace these missing values
with estimated ones, allowing for a more complete and accurate dataset.
I. Do nothing about the missing data.
II. Fill in the missing values in the dataset, e.g., using the mean or median.

How It Works:
1. Identify Missing Values: Determine which values in the dataset are missing.
2. Choose an Imputation Method: Decide how to fill in the missing values. Common methods include:
o Mean Imputation: Replace missing values with the average of the available data.
o Median Imputation: Use the median (middle value) of the available data.
o Mode Imputation: Replace missing values with the most frequently occurring value.
o Prediction Models: Use statistical models to predict and fill in missing values based on other
data.
Example:
Imagine you have a dataset of students' test scores in a class:
Student   Math Score   Science Score
A         85           90
B         78           (missing)
C         (missing)    88
D         92           95
In this table:
 Student B is missing a Science score.
 Student C is missing a Math score.
Using Mean Imputation:
 For Student B, calculate the average Science score of the other students (90, 95) = (90 + 95) / 2 = 92.5.
 For Student C, calculate the average Math score of the other students (85, 78, 92) = (85 + 78 + 92) / 3 =
85.
After imputation, the updated dataset would look like this:
Student   Math Score   Science Score
A         85           90
B         78           92.5
C         85           88
D         92           95
Now, the dataset is complete, allowing for better analysis without losing valuable information.
Here’s another example of missing imputation using a dataset related to employee performance in a company:
Dataset:
Imagine you have the following table with employee performance ratings:
Employee   Sales ($)   Customer Satisfaction (%)   Attendance (%)
1          50,000      85                          90
2          60,000      (missing)                   95
3          (missing)   78                          80
4          55,000      92                          (missing)
5          70,000      88                          100
In this table:
 Employee 2 is missing the Customer Satisfaction score.
 Employee 3 is missing the Sales figure.
 Employee 4 is missing the Attendance percentage.
Using Median Imputation:
Let’s use median imputation to fill in the missing values:
1. Calculate the Median for Each Column:
o Sales: The available Sales figures are 50,000, 60,000, 55,000, and 70,000. The median is (55,000
+ 60,000) / 2 = 57,500.
o Customer Satisfaction: The available scores are 85, 78, 92, and 88. The median is (85 + 88) / 2 =
86.5.
o Attendance: The available percentages are 90, 95, 80, and 100. The median is (90 + 95) / 2 =
92.5.
2. Impute the Missing Values:
o For Employee 2, replace the missing Customer Satisfaction score with 86.5.
o For Employee 3, replace the missing Sales figure with 57,500.
o For Employee 4, replace the missing Attendance percentage with 92.5.
Updated Dataset:
After performing the median imputation, the updated dataset would look like this:
Employee   Sales ($)   Customer Satisfaction (%)   Attendance (%)
1          50,000      85                          90
2          60,000      86.5                        95
3          57,500      78                          80
4          55,000      92                          92.5
5          70,000      88                          100
Now the dataset is complete, and you can perform a more thorough analysis of employee performance without
losing valuable information due to missing values. Using median imputation ensures that the filled values are
representative of the existing data distribution.
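A minimal pandas version of the same median imputation (reproducing the employee table above; pandas and NumPy are assumed to be available):

import pandas as pd
import numpy as np

df = pd.DataFrame({
    "employee": [1, 2, 3, 4, 5],
    "sales": [50000, 60000, np.nan, 55000, 70000],
    "satisfaction": [85, np.nan, 78, 92, 88],
    "attendance": [90, 95, 80, np.nan, 100],
})

# Fill each numeric column's missing values with that column's median
cols = ["sales", "satisfaction", "attendance"]
df[cols] = df[cols].fillna(df[cols].median())

print(df)
# sales -> 57,500; satisfaction -> 86.5; attendance -> 92.5, matching the worked example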
Data Imputation: Data imputation is a method for retaining the majority of the dataset's data and information
by substituting missing data with a different value. These methods are employed because it would be
impractical to remove data from a dataset each time. Additionally, doing so would substantially reduce the
dataset's size, raising questions about bias and impairing analysis.

Data Imputation Techniques

 Next or Previous Value


 K Nearest Neighbors
 Maximum or Minimum Value
 Missing Value Prediction
 Most Frequent Value
 Average or Linear Interpolation
 (Rounded) Mean or Moving Average or Median Value
 Fixed Value
Now that we understand data imputation and its importance, let's explore some key techniques with examples:

1. Next or Previous Value: This method fills missing values by using the nearest available value in a time
series.
Example: In a daily temperature record, if the temperature for day 3 is missing, you might use the
temperature from day 2 or day 4.
2. K Nearest Neighbors (KNN): This technique replaces missing values by finding the most common value
among the k nearest neighbors.
Example: If a person's age is missing in a dataset, KNN might use the ages of the three closest entries to
fill in the gap.
3. Maximum or Minimum Value: This method replaces missing values with the minimum or maximum
known values within a specified range.
Example: If the recorded temperatures range from 0°C to 40°C and a value is missing, you might use 0°C
or 40°C as the replacement.
4. Missing Value Prediction: A machine learning model predicts missing values based on other available
data.
Example: If a person's income is missing, the model might use their age, education, and occupation to
estimate it.
5. Most Frequent Value: This technique replaces missing values with the most commonly occurring value
in a column.
Example: In a dataset of favorite colors, if "blue" appears most frequently, missing color entries would
be filled with "blue."
6. Average or Linear Interpolation: This method fills in missing values by averaging the nearest known
values.
Example: If the values are 10, missing, and 30, the missing value could be filled with 20.
7. (Rounded) Mean, Moving Average, or Median Value: This approach uses the mean, rounded mean, or
median of a dataset to fill missing values.
Example: If a set of test scores is 70, 75, and missing, and 80, you might replace the missing score with
75 (the mean).
8. Fixed Value: This method replaces missing values with a predetermined fixed value.
Example: In a survey, if a response is missing, you might fill it with "not answered."
With these techniques in mind, the sketch below shows how a few of them look in practice using common Python libraries.
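This is a minimal, hedged example: the tiny dataset is invented, and scikit-learn's SimpleImputer and KNNImputer are assumed to be installed.

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer, SimpleImputer

df = pd.DataFrame({
    "age": [25, np.nan, 31, 29, np.nan],
    "salary": [40000, 52000, np.nan, 47000, 61000],
    "city": ["Delhi", "Mumbai", np.nan, "Delhi", "Chennai"],
})

# Next/previous value: forward-fill, useful for ordered or time-series data
print(df.ffill())

# Most frequent value for a categorical column
df["city"] = df["city"].fillna(df["city"].mode()[0])

# Mean imputation for the numeric columns
mean_imputer = SimpleImputer(strategy="mean")
df[["age", "salary"]] = mean_imputer.fit_transform(df[["age", "salary"]])
print(df)

# K nearest neighbors imputation on a fresh copy of the numeric columns
knn = KNNImputer(n_neighbors=2)
print(knn.fit_transform(pd.DataFrame({
    "age": [25, np.nan, 31, 29, np.nan],
    "salary": [40000, 52000, np.nan, 47000, 61000],
})))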

Need for Business Modeling: Business modeling is essential for several reasons, as it helps organizations
understand, visualize, and improve their operations and strategies. Here are the key needs for business
modeling:
1. Clarity and Understanding
 Visual Representation: Business modeling provides a clear visual representation of processes, systems,
and relationships within the organization, making it easier for stakeholders to understand how the
business operates.
 Simplifies Complexity: It breaks down complex business processes into manageable components,
facilitating better comprehension.
2. Strategic Planning
 Informed Decision-Making: Business models help in analyzing various scenarios, allowing organizations
to make data-driven decisions regarding strategies, investments, and resource allocation.
 Alignment of Goals: They ensure that all parts of the organization are aligned with the overall business
objectives and strategies.
3. Identifying Opportunities and Risks
 Spotting Opportunities: Business modeling helps identify new market opportunities, potential
partnerships, and areas for growth or expansion.
 Risk Assessment: It allows organizations to assess potential risks and challenges, enabling proactive
planning and mitigation strategies.
4. Improving Processes and Efficiency
 Process Optimization: Business modeling aids in identifying inefficiencies and redundancies in
processes, helping organizations streamline operations for better performance.
 Resource Management: It helps in understanding resource allocation and utilization, ensuring that
resources are used effectively.
5. Facilitating Communication
 Common Language: Business models provide a common framework and language for discussing and
analyzing business processes among different stakeholders, including management, employees, and
investors.
 Stakeholder Engagement: They enhance communication and collaboration among stakeholders,
promoting a shared understanding of business goals and operations.
6. Support for Change Management
 Guide for Transformation: Business models serve as a roadmap for organizational change, helping to
guide the implementation of new processes, technologies, or strategies.
 Measuring Impact: They provide a baseline for measuring the impact of changes and assessing whether
the desired outcomes are achieved.
7. Enhancing Innovation
 Encouraging Creativity: Business modeling encourages innovative thinking by allowing organizations to
experiment with new ideas and approaches in a structured way.
 Prototype Development: It enables the development of prototypes or simulations to test new business
concepts before full-scale implementation.
8. Performance Monitoring and Evaluation
 Key Performance Indicators (KPIs): Business models help in defining KPIs and metrics to monitor
performance and evaluate success over time.
 Continuous Improvement: They provide a framework for ongoing assessment and improvement of
business processes and strategies.
In summary, business modeling is crucial for providing clarity, guiding strategic planning, improving efficiency,
facilitating communication, and supporting innovation and change management within an organization. It plays
a vital role in ensuring that businesses can adapt, grow, and succeed in a dynamic environment.
