Mod 2 Business Analytics

International Journal of Scientific & Engineering Research, Volume 8, Issue 12, December 2017, ISSN 2229-5518

Influence of Structured, Semi-Structured, Unstructured Data on Various Data Models

Shagufta Praveen, Research Scholar, and Umesh Chandra, Assistant Professor, Glocal University

Abstract: The enormous growth of data from diversified sources has changed the complete scenario of the database world. Most surveys say that data is very important for all organizations and that its proper handling will demand attention in the future. The various forms of data available in the digital world need different data models for their storage, processing and analysis. This paper discusses various kinds of data with their characteristics and examples, and shows that the growing data is responsible for numerous emerging data models and for database evolution.

Keywords: Structured, Unstructured, Semi-structured, Data Models

1. Introduction:

Big Data is a term that catches everyone's attention today. This attention can be justified through surveys and facts, which say that every second all of us, as users, are creating new data, adding to the rate of data growth. Web applications like Facebook, Twitter, Instagram and YouTube connect with about one billion people every day, and these people not only browse but also share and create new data every single second [1]. Surveys say that the digital universe will double in size every two years [2]. Most organizations are working on data-driven projects [3]. Organizations no longer consider web data to be dead data; different research centres use this data for analysis and try to utilize it for business intelligence and pattern prediction. Data mining and data extraction deal with various algorithms to extract data so that it can help bring improvements in IT industries.

Fig 1. Kinds of Data (structured, semi-structured and unstructured data, and data growth)

2. Various Kinds of Data:


2.1. Structured Data

Structured data mainly includes text, and such data is easily processed. It is easily entered, stored and analyzed. Structured data is stored in the form of rows and columns, which are easily managed with a language called Structured Query Language (SQL) [4]. The relational model [5] is a data model that supports structured data, manages it in the form of rows and tables, and processes the content of the tables easily. XML also supports structured data. Much of the content of web pages is in XML form, and this content is included in structured data; companies like Google use structured data found on the web to understand the content of a page [6]. In this way most Google searches are done with the help of structured data. Since the beginning of the database revolution [7], the network [8], hierarchical [9], relational and object-relational [10] data models have dealt with structured data.

2.2. Characteristics of Structured Data

1. Structured data has various data types: date, name, number, characters, address.
2. The data is arranged in a defined way.
3. Structured data is handled through SQL.
4. Structured data is dependent on a schema; it is schema-based.
5. The data can easily interact with computers.

2.3. Semi-Structured Data

Semi-structured data includes e-mails, XML and JSON. Semi-structured data does not fit the relational database model; it is instead expressed with the help of edges, labels and tree structures. It is represented with the help of trees and graphs, and it has attributes and labels. It is schema-less data. Data models that are graph-based can store semi-structured data; MongoDB is a NoSQL database that supports JSON (semi-structured data). Data that consists of tags and is self-describing is generally semi-structured data, and it differs from both structured and unstructured data. The Data Object Model [11], the Object Exchange Model (OEM) [11] and DataGuide [11] are well-known data models that express semi-structured data. Concepts for the semi-structured data model include document instance, document schema, element attributes and element relationship sets [11].

Fig. 3 Attributes of Semi-Structured Data (XML, e-mails, DOE, OEM)

2.4. Example of Semi-Structured Data

{
  Row: { Emp_id: "12345", Emp_name: "Ram" },
  Row: { Emp_id: "56786", Emp_name: "Hari" },
  Row: { Emp_id: "67858", Emp_name: "Shyam" },
  Row: { Emp_id: "90890", Emp_name: "John" }
}

2.5. Characteristics of Semi-Structured Data

1. It is not based on a schema.
2. It is represented through labels and edges.
3. It is generated from various web pages.
4. It has multiple attributes.
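The example above can be expressed as valid JSON and handled as schema-less records. The following Python sketch is a supplementary illustration (not part of the original paper); the field names follow the example, while the remark about MongoDB's pymongo driver is only an assumption about one possible document store.

import json

# The example records above, written as valid JSON (a list of objects).
raw = """
[
  {"Emp_id": "12345", "Emp_name": "Ram"},
  {"Emp_id": "56786", "Emp_name": "Hari"},
  {"Emp_id": "67858", "Emp_name": "Shyam"},
  {"Emp_id": "90890", "Emp_name": "John"}
]
"""

rows = json.loads(raw)

# Semi-structured data is schema-less: a record may gain extra attributes
# without altering any table definition.
rows[0]["Emp_dept"] = "Sales"

for row in rows:
    print(row.get("Emp_id"), row.get("Emp_name"), row.get("Emp_dept", "-"))

# A document store such as MongoDB could accept these dictionaries directly,
# e.g. collection.insert_many(rows) via the pymongo driver (assumed, not shown).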


3. Unstructured Data

Unstructured data includes videos, images and audio. Today, about 90% of the data growing in our digital universe is unstructured. This data does not fit the relational database model, and to store it the NoSQL database emerged. Today there are four families of NoSQL databases: key-value, column-oriented, graph-oriented and document-oriented. Most famous organizations today (Amazon, LinkedIn, Facebook, Google, YouTube) deal with NoSQL data [12] and have replaced their conventional databases with NoSQL databases.

3.1. Characteristics of Unstructured Data

1. It is not based on a schema.
2. It is not suitable for a relational database.
3. About 90% of the data growing today is unstructured.
4. It includes digital media files, Word documents and PDF files.
5. It is stored in NoSQL databases.

Fig. 4 Attributes of Unstructured Data (NoSQL; audio, videos, images)

Fig. 5 Example of Unstructured Data

4. Conclusion: This paper emphasizes the idea that growing data directly influences its related data models and database technologies. It shows that the big data concept not only deals with huge and vast data but also opens a new gate for database analysts and researchers to work on various kinds of data and data models so that new kinds of data can survive in present and upcoming scenarios.

References:

1. https://www.forbes.com/sites/bernardmarr/2015/09/30/big-data-20-mind-boggling-facts-everyone-must-read/#7e621bc417b1
2. https://insidebigdata.com/2017/02/16/the-exponential-growth-of-data/
3. https://www.idgenterprise.com/resource/research/2015-big-data-and-analytics-survey/
4. J. R. Groff and P. N. Weinberg, SQL: The Complete Reference, Second Edition, 2002, McGraw-Hill Companies.
5. E. F. Codd, 1970. A Relational Model of Data for Large Shared Data Banks.
6. https://developers.google.com/search/docs/guides/intro-structured-data
7. S. Praveen, U. Chandra and Arif Ali Wani, A Literature Review on Evolving Databases, IJCA, March 2017.
8. https://en.wikipedia.org/wiki/Network_model
9. http://www.edugrabs.com/hierarchical-model/
10. http://www.learn.geekinterview.com/it/data-modeling/object-relational-model.html
11. T. W. Ling and G. Dobbie, Semi-Structured Database Design, 2005, Springer, ISBN 978-0-387-23567-7.
12. S. Praveen and U. Chandra, NoSQL: IT Giant Perspectives, 2017, IJCIR.

UNIT 4 EXTRACT, TRANSFORM AND LOADING
4.0 Introduction
4.1 Objectives
4.2 ETL and its Need
4.2.1 Why do You Need ETL?
4.3 ETL Process
4.3.1 Data Extraction
4.3.2 Data Transformation
4.3.3 Data Loading
4.3.3.1 Types of Incremental Loads
4.3.3.2 Challenges in Incremental Loading
4.4 Working of ETL
4.4.1 Layered Implementation of ETL in a Data Warehouse
4.5 ETL and OLAP Data Warehouses
4.6 ETL Tools and their Benefits
4.7 Improving the Performance of ETL
4.8 ELT and its Need
4.8.1 Why do you Need ELT?
4.8.2 Benefits of ELT
4.8.3 ETL Vs ELT
4.9 Summary
4.10 Solutions / Answers
4.11 Further Readings

4.0 INTRODUCTION
A data warehouse is a digital storage system that connects and harmonizes large
amounts of data from many different sources. Data warehouses store current and
historical data in one place and act as the single source for an organization. A
typical data warehouse has four main components namely:
Central database: A database serves as the foundation of your data warehouse.
Traditionally, these have been standard relational databases running on premise or
in the cloud. But because of Big Data, the need for true, real-time performance, and
a drastic reduction in the cost of RAM, in-memory databases are rapidly gaining in
popularity.
Data integration: Data is pulled from source systems and modified to align the
information for rapid analytical consumption using a variety of data integration
approaches such as ETL (extract, transform, load) and ELT as well as real-time
data replication, bulk-load processing, data transformation, and data quality and
enrichment services.
Metadata: Metadata is data about your data. It specifies the source, usage, values,
and other features of the data sets in your data warehouse. There is business
metadata, which adds context to your data, and technical metadata, which describes
how to access data – including where it resides and how it is structured.
Data warehouse access tools: Access tools allow users to interact with the data in
your data warehouse. Examples of access tools include: query and reporting tools,
application development tools, data mining tools, and OLAP tools.
All these components are engineered for speed so that you can get results quickly
and analyze the data within no time.
In this unit, we will study about Data Integration component approach such as
Extract, Transform and Load (ETL) in detail.

4.1 OBJECTIVES
After going through this unit, you shall be able to:
• understand the purpose of ETL;
• describe the ETL process, its benefits and ETL tools;
• know the complete working of ETL;
• discuss the various layers involved in an ETL implementation;
• summarize the functionality of ELT, its need and benefits; and
• compare and contrast ETL with ELT.

4.2 ETL AND ITS NEED


Extract, Transform, Load (ETL) as shown in Figure 1, is a process of data integration
that encompasses three steps - extraction, transformation, and loading. In a nutshell,
ETL systems take large volumes of raw data from multiple sources, convert it for
analysis, and load that data into your warehouse.

Figure 1: ETL in a Data Warehouse

4.2.1 Why Do You Need ETL?


ETL saves you significant time on data extraction and preparation - time that you can better spend on evaluating your business. Practicing ETL is also part of a healthy data management workflow, ensuring high data quality, availability, and reliability. Each of the three major components in ETL saves time and development effort by running just once in a dedicated data flow:
Extract: In ETL, the first link determines the strength of the chain. The extract stage
determines which data sources to use, the refresh rate (velocity) of each source, and
the priorities (extract order) between them — all of which heavily impact your time
to insight.
Transform: After extraction, the transformation process brings clarity and order
to the initial data swamp. Dates and times combine into a single format and
strings parse down into their true underlying meanings. Location data convert to
coordinates, zip codes, or cities/countries. The transform step also sums up, rounds,
and averages measures, and it deletes useless data and errors or discards them for
later inspection. It can also mask personally identifiable information (PII) to comply
with GDPR, CCPA, and other privacy requirements.
Load: In the last phase, much as in the first, ETL determines targets and refresh
rates. The load phase also determines whether loading will happen incrementally,
or if it will require “upsert” (updating existing data and inserting new data) for the
new batches of data.
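As a minimal, hedged sketch of these three steps (not taken from the unit itself), the following Python code extracts rows from an in-memory CSV, transforms them (date normalization and rounding), and loads them into a SQLite table; the file contents, table name and column names are illustrative assumptions.

import csv, io, sqlite3
from datetime import datetime

# --- Extract: read raw records (an in-memory CSV stands in for a source system) ---
raw_csv = "order_id,order_date,amount\n1,01/03/2024,100.5\n2,02/03/2024,25.0\n"
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# --- Transform: normalize dates to ISO format and round amounts ---
for r in rows:
    r["order_date"] = datetime.strptime(r["order_date"], "%d/%m/%Y").date().isoformat()
    r["amount"] = round(float(r["amount"]), 2)

# --- Load: write the cleaned rows into a warehouse table (SQLite stands in here) ---
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, order_date TEXT, amount REAL)")
con.executemany("INSERT INTO orders VALUES (:order_id, :order_date, :amount)", rows)
con.commit()
print(con.execute("SELECT COUNT(*), SUM(amount) FROM orders").fetchone())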
Let us learn the whole process in the following section.

4.3 ETL PROCESS


ETL collects and processes data from various sources into a single data store (a data
warehouse or data lake), making it much easier to analyze. The three steps in ETL
process are mentioned below:
4.3.1 Data Extraction
Data extraction involves the following four steps:
Identify the data to extract: The first step of data extraction is to identify the data
sources you want to incorporate into your data warehouse. These
sources might be from relational SQL databases like MySQL or non-relational
NoSQL databases like MongoDB or Cassandra. The information could also be
from a SaaS platform like Salesforce or other applications. After identifying the
data sources, you need to determine the specific data fields you want to extract.
Estimate how large the data extraction is: The size of the data extraction matters.
Are you extracting 50 megabytes, 50 gigabytes, or 50 petabytes of data? A larger
quantity of data will require a different ETL strategy. For example, you can make
a larger dataset more manageable by aggregating it to month-level rather than day-
level, which reduces the size of the extraction. Alternatively, you can upgrade your
hardware to handle the larger dataset.
Choose the extraction method: Since data warehouses need to update continually
for the most accurate reports, data extraction is an ongoing process that may need
to happen on a minute-by-minute basis. There are three principal methods for
extracting information:
(a) Update notifications: The preferred method of extraction involves update
notifications. The source system will send a notification when one of its records has changed, and then the data warehouse updates with only the new information.
(b) Incremental extraction: The second method, which you can use when
update notifications aren’t possible, is incremental extraction. This involves
identifying which records have changed and performing extraction of only
those records. A potential setback is that incremental extraction cannot
always identify deleted records.
(c) Full extraction: When the first two methods won't work, a complete update
of all the data through full extraction is necessary. Keep in mind that this
method is likely only feasible for smaller data sets.
Assess your SaaS platforms: Businesses formerly relied on in-house applications for
accounting and other record-keeping. These applications used OLTP transactional
databases that they maintained on an on-site server. Today, more businesses use
SaaS (software as a service) platforms like Google
Analytics, HubSpot, and Salesforce. To pull data from one of these, you’ll need a
solution that integrates with the unique API of the platform. Xplenty is one such
solution.
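The incremental extraction method described above can be sketched as follows, assuming each source row carries a last_modified timestamp column (an assumption made for illustration; the unit does not prescribe a particular mechanism):

import sqlite3
from datetime import datetime, timezone

def extract_incremental(con: sqlite3.Connection, last_run: str):
    """Return only the source rows changed since the previous extraction run."""
    # Assumes the source table has a last_modified ISO-8601 timestamp column.
    cur = con.execute(
        "SELECT id, name, last_modified FROM customers WHERE last_modified > ?",
        (last_run,),
    )
    return cur.fetchall()

# Demo with an in-memory source table.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER, name TEXT, last_modified TEXT)")
con.executemany("INSERT INTO customers VALUES (?, ?, ?)", [
    (1, "Asha", "2024-01-01T10:00:00"),
    (2, "Ravi", "2024-02-15T09:30:00"),
])
changed = extract_incremental(con, last_run="2024-02-01T00:00:00")
print(changed)                                            # only rows modified after the last run
new_checkpoint = datetime.now(timezone.utc).isoformat()   # stored for the next run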
4.3.2 Data Transformation
In traditional ETL strategies, data transformation that occurs in a staging area
(after extraction) is “multistage data transformation”. In ELT, data transformation
that happens after loading data into the data warehouse is “in- warehouse
data transformation”. You may need to perform some of the following data
transformations:
Deduplication (normalizing): Identifies and removes duplicate information.
Key restructuring: Draws key connections from one table to another.
Cleansing: Involves deleting old, incomplete, and duplicate data to maximize data
accuracy - perhaps through parsing to remove syntax errors, typos, and fragments
of records.
Format revision: Converts formats in different datasets - like date/time, male/
female, and units of measurement - into one consistent format.
Derivation: Creates transformation rules that apply to the data. For example, maybe
you need to subtract certain costs or tax liabilities from business revenue figures
before analyzing them.
Aggregation: Gathers and searches data so you can present it in a summarized
report format.
Integration: Reconciles diverse names/values that apply to the same data elements
across the data warehouse so that each element has a standard name and definition.
Filtering: Selects specific columns, rows, and fields within a dataset.
Splitting: Splits one column into more than one column.
Joining: Links data from two or more sources, such as adding spend information
across multiple SaaS platforms.
Summarization: Creates different business metrics by calculating value totals. For example, you might add up all the sales made by a specific salesperson to create total sales metrics for specific periods.

Validation: Sets up automated rules to follow in different circumstances. For


instance, if the first five fields in a row are NULL, then you can flag the row for
investigation or prevent it from being processed with the rest of the information.
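A few of these transformations (deduplication, format revision, derivation, filtering and aggregation) can be sketched with pandas; the column names and values are illustrative assumptions, not part of the unit:

import pandas as pd

# Toy staging data; column names are assumptions made for illustration.
df = pd.DataFrame({
    "customer":  ["Ann", "Ann", "Bob"],
    "sale_date": ["01/03/2024", "01/03/2024", "05/03/2024"],
    "revenue":   [120.0, 120.0, 80.0],
    "tax":       [20.0, 20.0, 10.0],
})

df = df.drop_duplicates()                                              # deduplication
df["sale_date"] = pd.to_datetime(df["sale_date"], format="%d/%m/%Y")   # format revision
df["net_revenue"] = df["revenue"] - df["tax"]                          # derivation
df = df[["customer", "sale_date", "net_revenue"]]                      # filtering (column selection)
print(df.groupby("customer")["net_revenue"].sum())                     # aggregation / summarization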
4.3.3 Data Loading
Data loading is the process of loading the extracted information into your target
data repository. Loading is an ongoing process that could happen through “full
loading” (the first time you load data into the warehouse) or “incremental loading”
(as you update the data warehouse with new information). Because incremental
loads are the most complex, we'll focus on them in this section.
4.3.3.1 Types of Incremental Loads
Incremental loads extract and load information that has appeared since the last
incremental load. This can happen in two ways: (a) Batch incremental loads and
(b) Streaming incremental loads.
(a) Batch incremental loads: The data warehouse ingests information in
packets or batches. If it's a large batch, it's best to carry out a batch load
during off-peak hours - on a daily, weekly, or monthly basis - to prevent
system slowdowns. However, modern data warehouses can also ingest
small batches of information on a minute-by-minute basis with an ETL
platform like Xplenty. This allows them to achieve an approximation of
real-time updates for the end-user.
(b) Streaming incremental loads: The data warehouse ingests new data as
it appears in real-time. This method is particularly valuable when the end-
user requires real-time updates (for example: for up-to-the-minute decision-
making). Further, streaming incremental loads are only possible when the
updates involve a very small amount of data. In most cases, minute-by-
minute batch updates offer a more robust solution than real-time streaming.
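A hedged sketch of an incremental batch load that performs an "upsert" (update existing rows, insert new ones); SQLite's INSERT ... ON CONFLICT syntax stands in here for whatever upsert mechanism the target warehouse provides:

import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE dim_customer (id INTEGER PRIMARY KEY, name TEXT, city TEXT)")
con.execute("INSERT INTO dim_customer VALUES (1, 'Asha', 'Pune')")

# A new incremental batch: row 1 has changed, row 2 is brand new.
batch = [(1, "Asha", "Mumbai"), (2, "Ravi", "Delhi")]

con.executemany(
    """
    INSERT INTO dim_customer (id, name, city) VALUES (?, ?, ?)
    ON CONFLICT(id) DO UPDATE SET name = excluded.name, city = excluded.city
    """,
    batch,
)
con.commit()
print(con.execute("SELECT * FROM dim_customer ORDER BY id").fetchall())
# [(1, 'Asha', 'Mumbai'), (2, 'Ravi', 'Delhi')]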
4.3.3.2 Challenges in Incremental Loading
Incremental loads can disrupt system performance and cause a host of problems,
including:
Data structure changes: Data formats in your data sources or data warehouse
may need to evolve according to the needs of your information system. However,
changing one part of the system could lead to incompatibilities that interfere with
the loading process. To prevent problems relating to inconsistent, corrupt, or
incongruent data, it’s important to zoom out and review how slight changes affect
the total ecosystem before making the appropriate adjustments.
Processing data in the wrong order: Data pipelines can follow complex trajectories
that result in your data warehouse processing, updating, or deleting information in
the wrong order. That can lead to corrupt or inaccurate information. For this reason,
it’s vital to monitor and audit the ordering of data processing.
Failure to detect problems: Quick detection of any problems with your ETL
workflow is crucial: e.g. when an API goes down, when your API access credentials
are out-of-date, when system slowdowns interrupt dataflow from an API or when
the target data warehouse is down. The sooner you detect the problem, the faster you can fix it, and the easier it is to correct the inaccurate/corrupt data that results from it.

4.4 WORKING OF ETL


In this section, we'll dive a little deeper, taking an in-depth look at each of the three
steps in the ETL process.
You can use scripts to implement ETL (i.e., custom do-it-yourself code) or you
can use a dedicated ETL tool. An ETL system performs a number of important
functions, including:
(a) Parsing/Cleansing: Data generated by applications may be in various
formats like JSON, XML, or CSV. The parsing stage maps data into a table
format with headers, columns, and rows, and then extracts specified fields.
(b) Data Enrichment: Preparing data for analytics usually requires certain
data enrichment steps, including injecting expert knowledge, resolving
discrepancies, and correcting bugs.
(c) Setting Velocity: “Velocity” refers to the frequency of data loading, i.e.
inserting new data and updating existing data.
(d) Data Validation: In some cases, data is empty, corrupted, or missing
crucial elements. During data validation, ETL finds these occurrences and
determines whether to stop the entire process, skip the data or set the data
aside for human inspection.
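A minimal sketch of the parsing/cleansing and data validation functions described above, assuming JSON input and illustrative field names (not prescribed by the unit):

import json

raw = ('[{"id": 1, "user": {"name": "Ann"}, "amount": 10.0},'
       ' {"id": 2, "user": {}, "amount": null}]')

parsed, quarantined = [], []
for rec in json.loads(raw):
    # Parsing/cleansing: flatten the nested structure into columns.
    row = {"id": rec.get("id"),
           "name": rec.get("user", {}).get("name"),
           "amount": rec.get("amount")}
    # Validation: rows missing crucial elements are set aside, not loaded.
    if row["name"] is None or row["amount"] is None:
        quarantined.append(row)
    else:
        parsed.append(row)

print("load:", parsed)
print("inspect:", quarantined)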
4.4.1 Layered Implementation of ETL in a Data Warehouse
When an ETL process is used to move data into a data warehouse, a separate layer
represents each phase:
(a) Mirror/Raw layer: This layer is a copy of the source files or tables, with no
logic or enrichment. The process copies and adds source data to the target mirror
tables, which then hold historical raw data that is ready to be transformed.
(b) Staging layer: Once the raw data from the mirror tables is transformed, the results wind up in staging tables. These tables hold the final form of the data for the incremental part of the ETL cycle in progress.
(c) Schema layer: These are the destination tables, which contain all the data
in its final form after cleansing, enrichment, and transformation.
(d) Aggregating layer: In some cases, it's beneficial to aggregate data to a daily
or store level from the full dataset. This can improve report performance,
enable the addition of business logic to calculate measures, and make it
easier for report developers to understand the data.
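One possible way to picture these layers is sketched below with pandas; this is an assumption about a toy implementation, not the unit's prescribed design:

import pandas as pd

# Mirror/raw layer: an untouched copy of the source extract.
mirror = pd.DataFrame({"id": [1, 2],
                       "amount": ["10.5", "7.25"],
                       "dt": ["2024-03-01", "2024-03-01"]})

# Staging layer: transformed form of the current incremental batch.
staging = mirror.assign(amount=mirror["amount"].astype(float),
                        dt=pd.to_datetime(mirror["dt"]))

# Schema layer: final destination table (here simply the cleansed rows).
schema = staging.copy()

# Aggregating layer: daily totals derived from the schema layer for faster reports.
aggregate = schema.groupby("dt", as_index=False)["amount"].sum()
print(aggregate)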

4.5 ETL AND OLAP DATA WAREHOUSES


Data engineers have been using ETL for over two decades to integrate diverse types
of data into online analytical processing (OLAP) data warehouses. The reason for
doing this is simple: to make data analysis easier. Normally, business applications
use online transactional processing (OLTP) database systems. These are optimized
for writing, updating, and editing the information inside them. They are not good at reading and analysis. However, online analytical processing database systems are excellent at high-speed reading and analysis. That’s why ETL is necessary to transform OLTP information, so it can work with an OLAP data warehouse.

During the ETL process, information is:


i. Extracted from various relational database systems (OLTP or RDBMS) and
other sources.
ii. Transformed within a staging area, into a compatible relational format, and
integrated with other data sources.
iii. Loaded into the online analytical processing (OLAP) data warehouse server.
In the past, data engineers hand-coded ETL pipelines in R, Python, and SQL - a
laborious process that could take months to complete. Today, hand-coded ETL
continues to be necessary in many cases. However, modern ETL solutions like
Xplenty allow data teams to skip hand-coding and automatically integrate the most
popular data sources into their data warehouses. This has dramatically increased
the speed of setting up an ETL pipeline, while eliminating the risk of human error.

4.6 ETL TOOLS AND THEIR BENEFITS


ETL tools come in a wide variety, in both open-source and proprietary categories.
There are ETL frameworks and libraries that you can use to build ETL pipelines
in Python. There are tools and frameworks you can leverage for GO and Hadoop.
Really, there is an open-source ETL tool out there for almost any unique ETL need.
The downside, of course, is that you'll need lots of custom coding, setup, and man-
hours getting the ETL operational. Even then, you may find that you need to tweak
your ETL stack whenever you introduce additional tasks. Following are some of
the benefits of the ETL tools:
• Scalability: Trying to scale out hand-coded ETL solutions is difficult. As schema complexity rises and your tasks grow more complex and resource-hungry, establishing solid pipelines and deploying the necessary ETL resources can become impossible. With cloud-based ETL tools like Xplenty, you have unlimited scalability at the click of a button.
• Simplicity: Going from a hand-coded ETL solution using SQLAlchemy and pandas with rpy2 and parse to something as simple as a cloud-based ETL can be life changing. The benefits of having all of your needs layered into one tool save you time, resources, and lots of headaches.
• Out-of-the-box: While open source ETL tools like Apache Airflow require some customization, cloud-based ETL tools like Xplenty work out-of-the-box.
• Compliance: The overwhelming nature of modern data compliance can be frightening. Between GDPR, CCPA, HIPAA, and all of the other compliance and privacy nets, using an ETL tool that bakes compliance into its framework is an easy way to skip difficult and risky compliance setups.
• Long-term costs: Hand-coded solutions may be cheaper up-front, but they will cost you in the long run. The same thing could be said about open source ETL tools. Since you have to spend time and energy on modification, you're forced to onboard early or risk delaying project launches. Cloud-based ETL tools handle maintenance and back-end caretaking for you.
4.7 IMPROVING THE PERFORMANCE OF ETL
Ultimately tuning is very much required for the ETL to perform better. Following
are some of the factors to be considered to improve the ETL performance:
(i) Tackle the Bottlenecks
Before anything else, make sure you log metrics such as time, the number of records
processed, and hardware usage. Check how many resources each part of the process
takes and address the heaviest one. Usually, it will be the second part, building facts and dimensions in the staging environment.
(ii) Load Data Incrementally
Loading only the changes between the previous and the new data saves a lot of
time as compared to a full load. It’s more difficult to implement and maintain, but
difficult doesn’t mean impossible, so do consider it. Loading incrementally can
definitely improve the ETL performance.
(iii) Partition Large Tables
If you use relational databases and you want to improve the data processing window,
you can partition large tables. That is, cut big tables down to physically smaller
ones, probably by date. Each partition has its own indices and the indices tree is
shallower thus allowing for quicker access to the data. It also allows switching data
in and out of a table in a quick metadata operation instead of actual insertion or
deletion of data records.
(iv) Cut Out Extraneous Data
It’s important to collect as much data as possible, but not all of it is worthy enough
to enter the data warehouse. To improve the ETL performance, define exactly
which data should be processed and leave irrelevant rows/columns out. Better to
start small and grow as you go, as opposed to creating a giant dataset that takes much
time to process.
(v) Cache the Data
Caching data can greatly speed things up, since memory access performs faster than hard drives do. Note that caching is limited by the maximum amount of memory
your hardware supports.
(vi) Process in Parallel using Hadoop
Instead of processing serially, optimize resources by processing in parallel.
Sort and aggregate functions (count, sum, etc.) block processing because they must
end before the next task can begin. Even if you can process in parallel, it won’t
help if the machine is running on 100% CPU the entire time. You could scale up
by upgrading the CPU, but it would scale only to a limit. Hadoop is a much better
solution.
Apache Hadoop is designed for the distributed processing of large data over a
cluster of machines. It uses HDFS, a dedicated file system that cuts data into small
chunks and optimally spreads them over the cluster. Duplicate copies are kept and
the system maintains integrity automatically.
MapReduce is used to process tasks (Hadoop 2 or YARN allows more applications).
Each MapReduce job works in 2 stages:
(a) Map - filtering and sorting data - tasks are divided into sub-tasks and processed in parallel by the cluster machines.

(b) Reduce - summary operations - data from the previous stage is combined.
Hadoop is optimized for distributed processing analytics. Sort and aggregate
functions execute in parallel on an entire cluster.
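As a hedged, miniature illustration of the map and reduce stages described above, the sketch below uses Python's multiprocessing pool in place of an actual Hadoop cluster; the data and chunking are illustrative assumptions:

from multiprocessing import Pool
from collections import Counter

def map_stage(chunk):
    """Map: filter and count sales per region within one chunk of records."""
    return Counter(region for region, amount in chunk if amount > 0)

def reduce_stage(partials):
    """Reduce: combine the partial counts produced by the parallel mappers."""
    total = Counter()
    for c in partials:
        total.update(c)
    return total

if __name__ == "__main__":
    data = [("north", 10), ("south", 5), ("north", -1), ("south", 7), ("east", 3)]
    chunks = [data[:2], data[2:4], data[4:]]          # split the work across workers
    with Pool(processes=3) as pool:
        partials = pool.map(map_stage, chunks)        # map tasks run in parallel
    print(dict(reduce_stage(partials)))               # {'north': 1, 'south': 2, 'east': 1}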

4.8 ELT AND ITS NEED


Extract/load/transform (ELT) is the process of extracting data from one or multiple
sources and loading it into a target data warehouse. Instead of transforming the
data before it’s written, ELT takes advantage of the target system to do the data
transformation. This approach requires fewer remote sources than other techniques
because it needs only raw and unprepared data. The process is illustrated in Figure 2.
ELT is an alternative to the traditional extract/transform/load (ETL) process. It
pushes the transformation component of the process to the target database for better
performance. This capability is very useful for processing the massive data sets
needed for business intelligence (BI) and big data analytics.
Because it takes advantage of the processing capability already built into a data storage
infrastructure, ELT reduces the time data spends in transit and boosts efficiency.
It is becoming increasingly common for data to be extracted from its source
locations, then loaded into a target data warehouse to be transformed into actionable
business intelligence. ELT process consists of three steps:
a. Extract - This step works similarly in both ETL and ELT data management
approaches. Raw streams of data from virtual infrastructure, software, and
applications are ingested either in their entirety or according to predefined
rules.
b. Load – ELT differs from ETL here. Rather than deliver this mass of
raw data and load it to an interim processing server for transformation, ELT
delivers it directly to the target storage location. This shortens the cycle
between extraction and delivery.
c. Transform - The database or data warehouse sorts and normalizes the data,
keeping part or all of it on hand and accessible for customized reporting.
The overhead for storing this much data is higher, but it offers more
opportunities to mine it for relevant business intelligence in near real-time.

Figure 2: ELT Process
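A hedged sketch of the extract-load-transform ordering: raw rows are loaded into the target database first, and the transformation then runs inside the database as SQL (SQLite stands in for the warehouse; table and column names are assumptions):

import sqlite3

con = sqlite3.connect(":memory:")

# Extract: raw, untransformed records from a source system (illustrative values).
raw_rows = [("2024-03-01", "ann ", "10.5"), ("2024-03-01", "BOB", "7.25")]

# Load: deliver the raw data straight into the target database, no staging server.
con.execute("CREATE TABLE raw_sales (sale_date TEXT, customer TEXT, amount TEXT)")
con.executemany("INSERT INTO raw_sales VALUES (?, ?, ?)", raw_rows)

# Transform: the warehouse's own engine cleans and reshapes the data on demand.
con.execute("""
    CREATE TABLE sales AS
    SELECT sale_date,
           UPPER(TRIM(customer)) AS customer,
           CAST(amount AS REAL)  AS amount
    FROM raw_sales
""")
print(con.execute("SELECT * FROM sales").fetchall())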


4.8.1 Why Do You Need ELT?
Transforming data after uploading it to modern cloud ecosystems is most effective
for:
• Large enterprises with vast data volumes
• Businesses that collect data from multiple source systems or in dissimilar formats
• Companies that require quick or frequent access to integrated data
• Data scientists who rely on business intelligence
• IT departments and data stewards interested in a low-maintenance solution
The ELT process improves data conversion and manipulation capabilities due to
parallel load and data transformation functionality. This schema allows data to be
accessed and queried in near real time.
However, you might want to stick with ETL if you have dirty data (e.g., duplicate
records, incomplete or inaccurate data) that will require data engineers to clean and
format it prior to data loading.
4.8.2 Benefits of ELT
With traditional ETL, relevant data is transformed before it is uploaded to a
data warehouse, and then it must be pushed out of the warehouse for analysis or
processing. This data pipeline works, but it can take more time to migrate data from
the source to the target system.
The ELT process saves you steps and time. Data is first loaded into the target
ecosystem, such as a data warehouse, and then transformed. Authorized users can
securely access the data without returning it to source systems. No downloading is
necessary for it. There are reasons to continue using ETL tools. For example, some
companies want to keep all their data on-premises. If there is a small amount of data,
and it is relational and structured, traditional ETL is effective for businesses that
favor hands-on data integration. However, the ELT approach has several benefits
for most industries which are listed below:
a) Get better results with more efficient effort
ELT allows you to integrate and process large amounts of data, both structured
and unstructured from multiple servers. And, both raw and cleansed data can
be accessed with artificial intelligence (AI) and machine learning (ML) tools in
addition to SQL and NoSQL processing.
b) Transform your data faster
ELT doesn’t have to wait for the data to be transformed and then loaded. The
transformation process happens where the data resides, so you can access your data
in a few seconds, a huge benefit when processing time-sensitive data.
c) Combine data from different sources and formats
Larger enterprises typically have multiple, disparate data sources like onsite servers,
cloud warehouses and log files. Using ELT means you can combine data from
various data sets regardless of the source or whether it is structured or unstructured,
related or unrelated.
d) Manage data at scale
Technological advances allow organizations to collect petabytes (a million
gigabytes!) of data. ELT streamlines the management of massive amounts of data
by allowing raw and cleansed data to be stored and accessed. If you’re planning to
use cloud-based data
warehousing or high-end data processing engines like Hadoop, ELT can take
advantage of the native processing power for greater scalability.
e) Save time and money
ELT reduces the time data spends in transit and doesn’t require an interim data
system or additional remote resources to transform the data outside the cloud. Plus,
there’s no need to move data in and out of cloud ecosystems for analysis. The more
your data moves around, the more the costs add up. The scalability of ELT makes
it cost-effective for businesses of any size.
4.8.3 ETL Vs ELT
The primary differences between ETL and ELT are how much data is retained in data
warehouses and where data is transformed. With ETL, the transformation of data is
done before it is loaded into a data warehouse. This enables analysts and business
users to get the data they need faster, without building complex transformations
or persistent tables in their business intelligence tools. Using the ELT approach,
data is loaded into the warehouse as is, with no transformation before loading. This
makes jobs easier to configure because it only requires an origin and a destination.
The ETL and ELT approaches to data integration differ in several key ways as
listed below:
• Load time - It takes significantly longer to get data from source systems to the target system with ETL.
• Transformation time - ELT performs data transformation on-demand, using the target system's computing power, reducing wait times for transformation.
• Complexity - ETL tools typically have an easy-to-use GUI that simplifies the process. ELT requires in-depth knowledge of BI tools, masses of raw data, and a database that can transform it effectively.
• Data warehouse support - ETL is a better fit for legacy on-premise data warehouses and structured data. ELT is designed for the scalability of the cloud.
• Maintenance - ETL requires significant maintenance for updating data in the data warehouse. With ELT, data is always available in near real-time.
Both ETL and ELT processes have their place in a data warehouse architecture; understanding the business's unique needs and strategies is key to determining which process will deliver the best results.

Measures of Central Tendency & Dispersion
Measures that indicate the approximate center of a distribution are called measures of central tendency.
Measures that describe the spread of the data are measures of dispersion. These measures include the mean,
median, mode, range, upper and lower quartiles, variance, and standard deviation.

A. Finding the Mean


The mean of a set of data is the sum of all values in a data set divided by the number of values in the set.
It is also often referred to as an arithmetic average. The Greek letter μ (“mu”) is used as the symbol for the population mean and the symbol x̄ (“x-bar”) is used to represent the mean of a sample. To determine the mean of
a data set:

1. Add together all of the data values.


2. Divide the sum from Step 1 by the number of data values in the set.
Formula: Mean = (sum of all data values) ÷ (number of data values), i.e., x̄ = Σxᵢ / n for a sample and μ = Σxᵢ / N for a population.

Example:
Consider the data set: 17, 10, 9, 14, 13, 17, 12, 20, 14

The sum of the values is 126 and there are 9 values, so the mean of this data set is 126 ÷ 9 = 14.

B. Finding the Median


The median of a set of data is the “middle element” when the data is arranged in ascending order. To
determine the median:

1. Put the data in order from smallest to largest.


2. Determine the number in the exact center.
i. If there are an odd number of data points, the median will be the number in the absolute
middle.
ii. If there is an even number of data points, the median is the mean of the two center data
points, meaning the two center values should be added together and divided by 2.

Example:
Consider the data set: 17, 10, 9, 14, 13, 17, 12, 20, 14

Step 1: Put the data in order from smallest to largest. 9, 10, 12, 13, 14, 14, 17, 17, 20

Step 2: Determine the absolute middle of the data. 9, 10, 12, 13, 14, 14, 17, 17, 20

Note: Since the number of data points is odd choose the one in the very middle.

The median of this data set is 14.


C. Finding the Mode
The mode is the most frequently occurring measurement in a data set. There may be one mode; multiple
modes, if more than one number occurs most frequently; or no mode at all, if every number occurs only
once. To determine the mode:

1. Put the data in order from smallest to largest, as you did to find your median.
2. Look for any value that occurs more than once.
3. Determine which of the values from Step 2 occurs most frequently.

Example:
Consider the data set: 17, 10, 9, 14, 13, 17, 12, 20, 14

Step 1: Put the data in order from smallest to largest. 9, 10, 12, 13, 14, 14, 17, 17, 20

Step 2: Look for any number that occurs more than once. 9, 10, 12, 13, 14, 14, 17, 17, 20

Step 3: Determine which of those occur most frequently. 14 and 17 both occur twice.

The modes of this data set are 14 and 17.

D. Finding the Upper and Lower Quartiles


The quartiles of a group of data are the medians of the upper and lower halves of that set. The lower
quartile, Q1, is the median of the lower half, while the upper quartile, Q3, is the median of the upper
half. If your data set has an odd number of data points, you do not consider your median when finding
these values, but if your data set contains an even number of data points, you will consider both middle
values that you used to find your median as parts of the upper and lower halves.

1. Put the data in order from smallest to largest.


2. Identify the upper and lower halves of your data.
3. Using the lower half, find Q1 by finding the median of that half.
4. Using the upper half, find Q3 by finding the median of that half.

Example:
Consider the data set: 17, 10, 9, 14, 13, 17, 12, 20, 14

Step 1: Put the data in order from smallest to largest. 9, 10, 12, 13, 14, 14, 17, 17, 20

Step 2: Identify the lower half of your data. 9, 10, 12, 13, 14, 14, 17, 17, 20

Step 3: Identify the upper half of your data. 9, 10, 12, 13, 14, 14, 17, 17, 20

Step 4: For the lower half, find the median. 9, 10, 12, 13
Since there are an even number of data points in this half, you will find the median by summing the
two in the center and dividing by two. This is Q1.

Step 5: For the upper half, find the median. 14, 17, 17, 20
Since there are an even number of data points in this half, you will find the median by summing the
two in the center and dividing by two. This is Q3.

Q1 of this data set is 11 and Q3 of this data set is 17.


E. Finding the Range
The range is the difference between the lowest and highest values in a data set. To determine the range:

1. Identify the largest value in your data set. This is called the maximum.
2. Identify the lowest value in your data set. This is called the minimum.
3. Subtract the minimum from the maximum.

Example:
Consider the data set: 17, 10, 9, 14, 13, 17, 12, 20, 14

Step 1: Put the data in order from smallest to largest. 9, 10, 12, 13, 14, 14, 17, 17, 20

Step 2: Identify your maximum. 9, 10, 12, 13, 14, 14, 17, 17, 20

Step 3: Identify your minimum. 9, 10, 12, 13, 14, 14, 17, 17, 20

Step 4: Subtract the minimum from the maximum. 20 – 9 = 11

The range of this data set is 11.

F. Finding the Variance and Standard Deviation


The variance and standard deviation are a measure based on the distance each data value is from the
mean.

1. Find the mean of the data (μ if calculating for a population or x̄ if using a sample).
2. Subtract the mean (μ or x̄) from each data value (xᵢ).
3. Square each calculation from Step 2.
4. Add the values of the squares from Step 3.
5. Find the number of data points in your set, called n.
6. Divide the sum from Step 4 by the number n (if calculating for a population) or n – 1(if using a
sample). This will give you the variance.
7. To find the standard deviation, square root this number.

Formulas:

Sample Variance, s²:  s² = Σ(xᵢ − x̄)² / (n − 1)        Population Variance, σ²:  σ² = Σ(xᵢ − μ)² / N

Sample Standard Deviation, s:  s = √[ Σ(xᵢ − x̄)² / (n − 1) ]        Population Standard Deviation, σ:  σ = √[ Σ(xᵢ − μ)² / N ]

Example: Calculate the sample variance and sample standard deviation


Consider the sample data set: 17, 10, 9, 14, 13, 17, 12, 20, 14.

Step 1: The mean of the data is 14, as shown previously in Section A.


Step 2: Subtract the mean from each data value. 17 – 14 = 3; 10 – 14 = -4; 9 – 14 = -5; 14 – 14 = 0

13 – 14 = -1; 17 – 14 = 3; 12 – 14 = -2; 20 – 14 = 6; 14 – 14 = 0

Step 3: Square these values. 3² = 9; (−4)² = 16; (−5)² = 25; 0² = 0; (−1)² = 1; 3² = 9; (−2)² = 4; 6² = 36; 0² = 0

Step 4: Add these values together. 9 + 16 + 25 + 0 + 1 + 9 + 4 + 36 + 0 = 100

Step 5: There are 9 values in our set, so we will divide by 9 – 1 = 8. 100 ÷ 8 = 12.5

Note: This is your variance.

Step 6: Square root this number to find your standard deviation. √12.5 = 3.536

The variance is 12.5 and the standard deviation is 3.536.
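These measures (sections A through F) can be checked quickly with Python's standard statistics module; this is a supplementary sketch, not part of the original handout. The quartile lines mirror the exclude-the-median convention used in section D.

import statistics as st

data = [17, 10, 9, 14, 13, 17, 12, 20, 14]
s = sorted(data)

print("mean:", st.mean(s))                       # 14
print("median:", st.median(s))                   # 14
print("modes:", st.multimode(s))                 # [14, 17]
print("range:", max(s) - min(s))                 # 11
print("Q1:", st.median(s[: len(s) // 2]))        # 11 (lower half, median excluded)
print("Q3:", st.median(s[(len(s) + 1) // 2:]))   # 17 (upper half, median excluded)
print("sample variance:", st.variance(s))        # 12.5
print("sample std dev:", round(st.stdev(s), 3))  # 3.536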

G. Using the TI-84

1. To enter the data values, press the STAT key and select Edit…
2. Enter the data values in the L1 column.
3. Press the STAT key again. Under CALC, select 1-VarStats.
4. Make sure the List is L1, then select Calculate.

The 1-VarStats output screen shows the mean, the sum of all data values, the sample standard deviation, the population standard deviation, the number of data values, the smallest data value, the lower quartile, the median, the upper quartile, and the largest data value. The smallest and largest data values can be subtracted to find the range.


UNIT 14 PROBABILITY DISTRIBUTIONS

STRUCTURE

14.0 Objectives
14.1 Introduction
14.2 Types of Probability Distribution
14.3 Concept of Random Variables
14.4 Discrete Probability Distribution
14.4.1 Binomial Distribution
14.4.2 Poisson Distribution
14.5 Continuous Probability Distribution
14.5.1 Normal Distribution
14.5.2 Characteristics of Normal Distribution
14.5.3 Importance and Application of Normal Distribution
14.6 Let Us Sum Up
14.7 Key Words
14.8 Answers to Self Assessment Exercises
14.9 Terminal Questions/Exercises
14.10 Further Reading

14.0 OBJECTIVES
After studying this unit, you should be able to:

• differentiate between frequency distribution and probability distribution,
• become aware of the concepts of random variable and probability distribution,
• appreciate the usefulness of probability distributions in decision-making,
• identify situations where discrete probability distributions can be applied,
• fit a binomial distribution and a Poisson distribution to the given data,
• identify situations where continuous probability distributions can be applied, and
• appreciate the usefulness of continuous probability distributions in decision-making.

14.1 INTRODUCTION
A probability distribution is essentially an extension of the theory of probability
which we have already discussed in the previous unit. This unit introduces the
concept of a probability distribution and shows how the various basic probability distributions (binomial, Poisson, and normal) are constructed. All these
probability distributions have immensely useful applications and explain a wide
variety of business situations which call for computation of desired probabilities.

By the theory of probability

P(H₁) + P(H₂) + … + P(Hₙ) = 1

This means that the unity probability of a certain event is distributed over a set
of disjointed events making up a complete group. In general, a tabular recording
of the probabilities of all the possible outcomes that could result if a random (chance) experiment is done is called a “Probability Distribution”. It is also termed a theoretical frequency distribution.

Frequency Distribution and Probability Distribution


One gets a better idea about a probability distribution by comparing it with a
frequency distribution. It may be recalled that the frequency distributions are
based on observation and experimentation. For instance, we may study the
profits (during a particular period) of the firms in an industry and classify the
data into two columns with class intervals for profits in the first column, and
corresponding classes’ frequencies (No. of firms) in the second column.

The probability distribution is also a two-column presentation with the values of


the random variable in the first column, and the corresponding probabilities in
the second column. These distributions are obtained by expectations on the
basis of theoretical considerations or past experience. Thus, probability
distributions are related to theoretical or expected frequency distributions.

In the frequency distribution, the class frequencies add up to the total number
of observations (N), whereas in the case of a probability distribution the possible
outcomes (probabilities) add up to ‘one’. Like the former, a probability
distribution is also described by a curve and has its own mean, dispersion, and
skewness.

Let us consider an example of probability distribution. Suppose we toss a fair


coin twice, the possible outcomes are shown in Table 14.1 below.

Table 14.1: Possible Outcomes from Two-toss Experiment of a Fair Coin

No. of possible outcomes | 1st toss | 2nd toss | No. of heads on two tosses | Probability of the possible outcomes

1 | Head | Head | 2 | 0.5 × 0.5 = 0.25
2 | Head | Tail | 1 | 0.5 × 0.5 = 0.25
3 | Tail | Head | 1 | 0.5 × 0.5 = 0.25
4 | Tail | Tail | 0 | 0.5 × 0.5 = 0.25
Total = 1.00

Now we are interested in framing a probability distribution of the possible


outcomes of the number of Heads from the two-toss experiment of a fair coin.
We would begin by recording any result that did not contain a head, i.e., only
the fourth outcome in Table 14.1. Next, those outcomes containing only one
head, i.e., second and third outcomes (Table 14.1), and finally, we would record
that the first outcome contains two heads (Table 14.1). We recorded the same
in Table 14.2 to highlight the number of heads contained in each outcome.

Table 14.2: Probability Distribution of the Possible No. of Heads from Two-toss
Experiment of a Fair Coin
No. of Heads (H) | Tosses (outcomes) | Probability P(H)
0 | (T, T) | 1/4 = 0.25
1 | (H, T) + (T, H) | 1/2 = 0.50
2 | (H, H) | 1/4 = 0.25
3 | — | 0
We must note that the above tables are not the real outcome of tossing a fair coin twice. Rather, it is a theoretical outcome, i.e., it represents the way in which we expect our two-toss experiment of an unbiased coin to behave over time.
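As a supplementary check (not part of the unit), Tables 14.1 and 14.2 can be reproduced in Python by enumerating the sample space of the two-toss experiment:

from itertools import product
from collections import Counter
from fractions import Fraction

outcomes = list(product(["H", "T"], repeat=2))        # (H,H), (H,T), (T,H), (T,T)
p_each = Fraction(1, 2) * Fraction(1, 2)              # each outcome has probability 1/4

dist = Counter(o.count("H") for o in outcomes)        # number of heads per outcome
for heads in sorted(dist):
    print(heads, "heads:", dist[heads] * p_each)      # 0: 1/4, 1: 1/2, 2: 1/4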

14.2 TYPES OF PROBABILITY DISTRIBUTION


Probability distributions are broadly classified under two heads: (i) Discrete
Probability Distribution, and (ii) Continuous Probability Distribution.

i) Discrete Probability Distribution: A discrete random variable is allowed to take


on only a limited number of values. Consider for example that the probability of
having your birthday in a given month is a discrete one, as one can have only 12
possible outcomes representing 12 months of a year.

ii) Continuous Probability Distribution: In a continuous probability distribution,


the variable of interest may take on any values within a given range. Suppose
we are planning to release water for hydropower generation. Depending on
how much water we have in the reservoir viz., whether it is above or below the
normal level, we decide on the amount and time of release. The variable
indicating the difference between the actual reservoir level and the normal
level, can take positive or negative values, integer or otherwise. Moreover, this
value is contingent upon the inflow to the reservoir, which in turn is uncertain.
This type of random variable which can take an infinite number of values is
called a continuous random variable, and the probability distribution of such a
variable is called a continuous probability distribution.

Before we attempt discrete and continuous probability distributions, the concept


of random variable which is central to the theme, needs to be elaborated.

14.3 CONCEPT OF RANDOM VARIABLES


A random variable is a variable (numerical quantity) that can take different
values as a result of the outcomes of a random experiment. When a random
experiment is carried out, the totality of outcomes of the experiment forms a
set which is known as sample space of the experiment. Similar to the
probability distribution function, a random variable may be discrete or continuous.

In the example given in the Introduction, we have seen that the outcomes of the
experiment of two-toss of a fair coin were expressed in terms of the number
of heads. We found in the example, that H (head) can assume values of 0, 1
and 2 and corresponding to each value, a probability is associated. This
uncertain real variable H, which assumes different numerical values depending
on the outcomes of an experiment, and to each of whose value a possibility
assignment can be made, is known as a random variable. The resulting
representation of all the values with their probabilities is termed as the
probability distribution of H.

It is customary to present the distribution as shown in Table 14.3 below.

Table 14.3: Probability Distribution of No. of Heads

H: 0 1 2

P(H): 0.25 0.50 0.25


In this case, as we find that H takes only discrete values, the variable H is called a discrete random variable, and the resulting distribution is a discrete
probability distribution. The function that specifies the probability distribution
of a discrete random variable is called the probability mass function (p.m.f.).

In the above situations, we have seen that the random variable takes a limited
number of values. There are certain situations where the variable under
consideration may have infinite values. Consider for example, that we are
interested in ascertaining the probability distribution of the weight of one kg.
coffee packs. We have reasons to believe that the packing process is such that
a certain percentage of the packs weigh slightly below one kg., and some packs are
above one kg. It is easy to see that it is essentially by chance that the pack
will weigh exactly 1 kg., and there are an infinite number of values that the
random variable ‘weight’ can take. In such cases, it makes sense to talk of the
probability that the weight will be between two values, rather than the
probability of the weight taking any specific value. These types of random
variables which can take an infinitely large number of values are called
continuous random variables, and the resulting distribution is called a
continuous probability distribution. The function that specifies the probability
distribution of a continuous random variable is called the probability density
function (p.d.f.).

Sometimes, for the sake of convenience, a discrete situation with a large


number of outcomes is approximated by a continuous distribution. For example,
if we find that the demand of a product is a random variable taking values of
1, 2, 3, …to 1,000, it may be worthwhile to treat it as a continuous variable.

In a nutshell, if the random variable is restricted to take only a limited number


of values, it is termed as discrete random variable and if it is allowed to take
any value within a given range it is termed as continuous random variable.

It should be clear, from the above discussion, that a probability distribution is


defined only in the context of a random variable or a function of random
variable. Thus in any situation, it is important to identify the relevant random
variable and to find the probability distribution to facilitate decision making.

Expected Value of a Random Variable


Expected value is the fundamental idea in the study of probability distributions.
For finding the expected value of a discrete random variable, we multiply each
value that the random variable can assume by its corresponding probability of
occurrence and then sum up all the products. For example, consider finding the expected value of the discrete random variable (RV) “Daily Visa Cleared”
given in the following table.

Table 14.4

Possible Nos. of the RV Probability Product

100 0.3 30
110 0.6 66
120 0.1 12

Hence the expected value of the RV “Daily Visa Cleared” = 30 + 66 + 12 = 108.
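As a supplementary check of Table 14.4 (not part of the unit), the expected value can be computed in Python as the probability-weighted sum:

# Values and probabilities from Table 14.4.
values = [100, 110, 120]
probs = [0.3, 0.6, 0.1]

expected = sum(v * p for v, p in zip(values, probs))
print(expected)   # 108.0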

Now, we will examine situations involving discrete random variables and discuss
the methods for assessing them.

14.4 DISCRETE PROBABILITY DISTRIBUTION

In the previous sections, we have seen that a representation of all possible values of a discrete random variable together with their probabilities of occurrence is called a discrete probability distribution. There are two important kinds of discrete probability distributions: (i) the Binomial Distribution, and (ii) the Poisson Distribution. Let us discuss these two distributions in detail.

14.4.1 Binomial Distribution

It is the basic and the most common probability distribution. It has been used to
describe a wide variety of processes in business. For example, a quality control
manager wants to know the probability of obtaining defective products in a
random sample of 10 products. If 10 per cent of the products are defective, he/
she can quickly obtain the answer, from tables of the binomial probability
distributions. It is also known as the Bernoulli distribution, as it originated with the Swiss mathematician James Bernoulli (1654-1705).

The binomial distribution describes discrete, not continuous, data resulting from
an experiment known as Bernoulli Process. Binomial distribution is a probability
distribution expressing the probability of one set of dichotomous alternatives, i.e.,
success or failure.

As per this distribution, the probability of getting 0, 1, 2, …n heads (or tails) in


n tosses of an unbiased coin will be given by the successive terms of the
expansion of (q + p)n, where p is the probability of success (heads) and q is
the probability of failure (i.e. = 1– p).

Binomial law of probability distribution is applicable only when:

a) A trial results in either success or failure of an event.

b) The probability of success ‘p’ remains constant in each trial.

c) The trials are mutually independent i.e., the outcome of any trial is neither
affected by others nor affects others.

Assumptions i) Each trial has only two possible outcomes either Yes or No,
success or failure, etc.

ii) Regardless of how many times the experiment is performed, the probability of
the outcome, each time, remains the same.

iii) The trials are statistically independent.

iv) The number of trials is known and is 1, 2, 3, 4, 5, etc.

Binomial Probability Formula:


P(r) = nCr p^r q^(n−r)

where, P (r) = Probability of r successes in n trials; p = Probability of success;


q = Probability of failure = 1–p; r = No. of successes desired; and n = No. of
trials undertaken. 3 3
The determining equation for nCr can easily be written as:

nCr = n! / [r! (n − r)!]
n! can be simplified as follows:

n! = n (n–1)! = n (n–1) (n–2) ! = n (n–1) (n–2) (n–3) ! and so on.

Hence the following form of the equation, for carrying out computations of the
binomial probability, is perhaps more convenient:

P(r) = [n! / (r! (n − r)!)] p^r q^(n–r)

The symbol ‘!’ means ‘factorial’, which is computed as follows: 5! means 5 ×


4 × 3 × 2 × 1 = 120. Mathematicians define 0! as 1.

If n is large in number, say, 50C3, then we can write (with the help of the
above explanation):

50C3 = 50! / [3! (50 − 3)!] = [(50) (49) (48) (47)!] / [3! (47)!]

     = (50 × 49 × 48) / (3 × 2 × 1)

Similarly,

75C5 = 75! / [5! (75 − 5)!] = [(75) (74) (73) (72) (71) (70)!] / [5! (70)!]

     = (75 × 74 × 73 × 72 × 71) / (5 × 4 × 3 × 2 × 1), and so on.
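(Such combination counts can also be checked numerically. The sketch below uses Python's standard library — math.comb is available from Python 3.8 onwards — and is given only as an optional aid; the function name n_c_r is our own.)

import math

# nCr computed from the defining formula n! / (r! (n - r)!)
def n_c_r(n, r):
    return math.factorial(n) // (math.factorial(r) * math.factorial(n - r))

print(n_c_r(50, 3), math.comb(50, 3))    # both give 19600
print(n_c_r(75, 5), math.comb(75, 5))    # both give 17259390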

Characteristics of a Binomial Distribution


i) The form of the distribution depends upon the parameters ‘p’ and ‘n’.
ii) The probability that there are 'r' successes in 'n' trials is given by
    P(r) = nCr p^r q^(n−r) = [n! / (r! (n − r)!)] p^r q^(n−r)

iii) It is mainly applied when the population being sampled is infinite.


iv) It can also be applied to a finite population, if it is not very small or the units
sampled are replaced before the next trial is attempted. The point worth noting
is ‘p’ should remain unchanged.

Let us consider the following illustration to understand the application of the
binomial distribution.

Illustration 1

A fair coin is tossed six times. What is the probability of obtaining four or more
heads?

Solution: When a fair (unbiased) coin is tossed, the probabilities of head and
tail are equal, i.e.,
p = q = ½ or 0.5

∴ The probability of obtaining 4 heads is: P(4) = 6C4 (1/2)^4 (1/2)^(6−4)

Using P(r) = [n! / (r! (n − r)!)] p^r q^(n−r),

P(4) = [6! / (4! (6 − 4)!)] (0.5)^4 (0.5)^2

     = [(6 × 5 × 4 × 3 × 2 × 1) / ((4 × 3 × 2 × 1) (2 × 1))] (0.0625) (0.25)

     = [720 / ((24) (2))] (0.0625) (0.25) = 15 × 0.0625 × 0.25

     = 0.234

The probability of obtaining 5 heads is: P(5) = 6C5 (1/2)^5 (1/2)^(6−5)

P(5) = [6! / (5! (6 − 5)!)] (0.5)^5 (0.5)^1

     = [(6 × 5 × 4 × 3 × 2 × 1) / ((5 × 4 × 3 × 2 × 1) (1))] (0.03125) (0.5)

     = 6 × 0.03125 × 0.5

     = 0.094

The probability of obtaining 6 heads is: P(6) = 6C6 (1/2)^6 (1/2)^(6−6)

P(6) = [6! / (6! (6 − 6)!)] (0.5)^6 (0.5)^0

     = [(6 × 5 × 4 × 3 × 2 × 1) / ((6 × 5 × 4 × 3 × 2 × 1) (1))] (0.015625) (1)

     = 1 × 0.015625 × 1

     = 0.016

∴ The probability of obtaining 4 or more heads is:

0.234 + 0.094 + 0.016 = 0.344
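(The whole of Illustration 1 can be verified with a few lines of Python; this is only an optional check, and the helper name binomial_prob is our own.)

import math

def binomial_prob(r, n, p):
    # P(r) = nCr * p^r * q^(n - r), the binomial probability formula
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

# probability of 4 or more heads in 6 tosses of a fair coin
prob = sum(binomial_prob(r, 6, 0.5) for r in range(4, 7))
print(round(prob, 3))    # 0.344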

Illustration 2

The incidence of a certain disease is such that on an average 20% of workers


suffer from it. If 10 workers are selected at random, find the probability that

i) Exactly 2 workers suffer from the disease


ii) Not more than 2 workers suffer from the disease
iii) At least 2 workers suffer from the disease
Solution: Probability that a worker suffers from the disease = 20/100 = 1/5,

i.e., p = 1/5, and

the probability of a worker not suffering from the disease, i.e.,

q = 1 − (1/5) = 4/5

By the binomial probability law, the probability that out of 10 workers, 'r' workers
suffer from the disease is given by:

P(r) = nCr p^r q^(n–r)

     = 10Cr (1/5)^r (4/5)^(10−r); r = 0, 1, 2, …, 10

i) The required probability that exactly 2 workers will suffer from the disease is
given by :

P(2) = 10C2 (1/5)^2 (4/5)^(10−2)

     = [10! / (2! (10 − 2)!)] (0.2)^2 (0.8)^8 = [(10) (9) (8)! / ((2 × 1) (8)!)] (0.04) (0.16777)

     = 45 (0.04) (0.16777) = 0.302

ii) The required probability that not more than 2 workers will suffer from the
disease is given by :

P (0) + P(1) + P(2)

P(0) = 10C0 (1/5)^0 (4/5)^(10−0) = 0.107

P(1) = 10C1 (1/5)^1 (4/5)^(10−1) = 0.269

P(2) = 10C2 (1/5)^2 (4/5)^(10−2) = 0.302

Probability of not more than 2 workers suffering from the disease


= 0.107 + 0.269 + 0.302 = 0.678
iii) We have to find P (r ≥ 2)
i.e., P (r ≥ 2) = 1–P (0) – P (1)
= 1 – 0.107 – 0.269 = 0.624
Thus, the probability of at least two workers suffering from the disease is 0.624.
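(Again, an optional Python check of the three answers above, using the same binomial formula; the names are illustrative only.)

import math

def binomial_prob(r, n, p):
    # P(r) = nCr * p^r * q^(n - r)
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 10, 0.2                                                        # 10 workers, 20% incidence
exactly_two = binomial_prob(2, n, p)                                  # ≈ 0.302
not_more_than_two = sum(binomial_prob(r, n, p) for r in range(3))     # ≈ 0.678
at_least_two = 1 - binomial_prob(0, n, p) - binomial_prob(1, n, p)    # ≈ 0.624
print(round(exactly_two, 3), round(not_more_than_two, 3), round(at_least_two, 3))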
Measures of Central Tendency and Dispersion for the Binomial
Distribution
As discussed in the Introduction, the binomial distribution has a mean (µ) and a
standard deviation (σ). We now see the computation of both these statistical measures.

We can represent the mean of the binomial distribution as:

Mean (µ) = np

where, n = Number of trials; p = probability of success.

And, we can calculate the standard deviation by:

σ = √(npq)

where, n = Number of trials; p = probability of success; and q = probability of
failure = 1 – p

Illustration 3

If the probability of defective bolts is 0.1, find the mean and standard deviation
for the distribution of defective bolts in a total of 500.

Solution: p = 0.1, n = 500

∴ Mean (µ) = np = 500 × 0.1 = 50

Thus, we can expect 50 bolts to be defective.

Standard Deviation (σ) = √(npq)

n = 500, p = 0.1, q = 1 – p = 1 – 0.1 = 0.9

∴ σ = √(500 × 0.1 × 0.9) = √45 = 6.71
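(A quick numerical check of the mean and standard deviation formulas, in Python; optional.)

import math

n, p = 500, 0.1
q = 1 - p
print(n * p, round(math.sqrt(n * p * q), 2))    # mean = 50.0, standard deviation ≈ 6.71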

Fitting a Binomial Distribution


When a binomial distribution is to be fitted to observed data, the following
procedure is adopted:

i) Determine the values of ‘p’ and ‘q’. If one of these values is known, the other
can be found out by the simple relationship p = 1–q and q = 1–p. If p and q are
equal, we can say, the distribution is symmetrical. On the other hand if ‘p’ and
‘q’ are not equal, the distribution is skewed. The distribution is positively
skewed, in case ‘p’ is less than 0.5, otherwise it is negatively skewed.
ii) Expand the binomial (p + q)^n. The power 'n' is equal to one less than the
number of terms in the expanded binomial. For example, if 3 coins are tossed
(n = 3) there will be four terms, when 5 coins are tossed (n = 5) there will be 6
terms, and so on.
iii) Multiply each term of the expanded binomial by N (the total frequency), in
order to obtain the expected frequency in each category.
Let us consider an illustration for fitting a binomial distribution.

Illustration 4
Eight coins are tossed at a time 256 times. Number of heads observed at each
throw is recorded and the results are given below. Find the expected
frequencies. What are the theoretical values of mean and standard deviation?
Also calculate the mean and standard deviation of the observed frequencies.

No. of heads at a throw:   0    1    2    3    4    5    6    7    8
Frequency (f):             2    6   30   52   67   56   32   10    1
Solution: The chance of getting a head in a single throw of one coin is ½.
Hence, p = ½, q = ½, n = 8, N = 256

By expanding 256 (½ + ½)^8, we shall get the expected frequencies of 0, 1, 2, …,
8 heads (successes).

No. of Heads (X)    Expected Frequency = N × nCr p^r q^(n–r)
                    (Frequencies approximated)

0                   256 × 8C0 (0.5)^0 (0.5)^8 = 1
1                   256 × 8C1 (0.5)^1 (0.5)^7 = 8
2                   256 × 8C2 (0.5)^2 (0.5)^6 = 28
3                   256 × 8C3 (0.5)^3 (0.5)^5 = 56
4                   256 × 8C4 (0.5)^4 (0.5)^4 = 70
5                   256 × 8C5 (0.5)^5 (0.5)^3 = 56
6                   256 × 8C6 (0.5)^6 (0.5)^2 = 28
7                   256 × 8C7 (0.5)^7 (0.5)^1 = 8
8                   256 × 8C8 (0.5)^8 (0.5)^0 = 1
Total = 256

If we compare the above expected frequencies with the observed frequencies
given in the illustration, we find that the two sets of frequencies are in close
agreement. This provides the basis to conclude that the observed distribution
fits the expected distribution.

The mean of the above distribution is:

µ = np = 8 × (1/2) = 4

The Standard Deviation is:

σ = √(npq) = √(8 × 1/2 × 1/2) = √2 = 1.414

If we compute the mean and standard deviation of the observed frequencies,


we will obtain the following values

X = 4.062; S.D. = 1.462

Note: The procedure for computation of mean and standard deviation of the
observed frequencies has been already discussed in Units 8 and 9 of this
course. Check these values by computing on your own.
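(The expected-frequency column and the theoretical mean and standard deviation of Illustration 4 can also be reproduced with the optional Python sketch below; the names are ours.)

import math

N, n, p = 256, 8, 0.5
expected = [round(N * math.comb(n, r) * p**r * (1 - p)**(n - r)) for r in range(n + 1)]
print(expected)                              # [1, 8, 28, 56, 70, 56, 28, 8, 1]
print(n * p, math.sqrt(n * p * (1 - p)))     # mean 4.0 and standard deviation ≈ 1.414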

Remark: To determine binomial probabilities quickly we can use the Binomial


Tables given at the end of this block (Appendix Table 1).

Self Assessment Exercise A

1) State whether the following statements are true or false:

a) By the theory of probability P (H1) + P (H2) + …P (Hn) = 1


b) Frequency distribution is obtained by expectations on the basis of
theoretical considerations.
c) In a continuous probability distribution the variable under consideration can
take on any value within a given range.
d) Binomial distribution is a probability distribution expressing the probability of
one set of dichotomous alternatives.
e) Binomial distribution may not be applied, when the population being sampled
is infinite.
f) Random variable is a numerical quantity whose value is determined by
the outcome of a random experiment.
2) Determine the following by using binomial probability formula.
a) If n = 4 and p = 0.12, then what is P(0)?
b) If n = 10 and p = 0.40, then what is P(9)?
c) If n = 6 and p = 0.83, then what is P(5)?
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................

3) The following data shows the result of the experiment of throwing 5 coins at a
time 3,100 times and the number of heads appearing in each throw. Find the
expected frequencies and comment on the results. Also calculate mean and
standard deviation of the theoretical values.

No. of heads: 0 1 2 3 4 5
frequency: 32 225 710 1,085 820 228
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................

14.4.2 Poisson Distribution

The Poisson distribution, developed by the French mathematician Simeon Poisson, is
named after him. It deals with counting the number of occurrences of a
particular event in a specific time interval or region of space. It is used in
practice where events occur infrequently with respect to time, volume (or
similar units), area, etc. For instance, the number of deaths or accidents
occurring in a specific time, the number of defects in production, the number
of workers absent per day, etc.

The binomial distribution, as discussed above, is determined by two parameters,
'p' and 'n'. In a number of cases 'p' (the probability of success) may happen
to be very small (even less than 0.01) while 'n' (the number of trials) is large
enough (say, more than 50) so that their product 'np' remains constant. In such
a situation the Poisson distribution applies, and it gives an approximation to the
binomial probability distribution formula, i.e., P(r) = nCr p^r q^(n–r).
The Poisson process corresponds to a Bernoulli process with a very
large number of trials (n) and a very low probability of success.

This is comparatively simpler to deal with, and is given by the Poisson
distribution formula as follows:

P(r) = (m^r e^(–m)) / r!

where, P(r) = Probability of r successes (occurrences) desired

r = 0, 1, 2, 3, 4, … ∞ (any non-negative integer)

e = a constant with value: 2.7183 (the base of natural logarithms)

m = The mean of the Poisson Distribution, i.e., np or the average


number of occurrences of an event.
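(The Poisson probability formula translates directly into code. The Python sketch below is optional; the helper name poisson_prob is our own, and the figures anticipate Illustration 5 below.)

import math

def poisson_prob(r, m):
    # P(r) = m^r * e^(-m) / r!, where m = np is the mean
    return (m**r) * math.exp(-m) / math.factorial(r)

print(round(poisson_prob(5, 4), 3))    # ≈ 0.156 when the mean m = 4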

Characteristics of the Poisson Distribution


a) It is also a discrete probability distribution and it is the limiting form of the
binomial distribution.

b) The range of the random variable is 0 ≤ r < ∞

c) It consists of a single parameter “m” only. So, the entire distribution can be
obtained by knowing this value only.

d) It is a positively skewed distribution. The skewness decreases as 'm' increases.

Measures of Central Tendency and Dispersion for Poisson Distribution

In the Poisson distribution, the mean (m) and the variance (σ²) have the same
value, i.e.,
Mean = Variance = np = m

S.D. (σ) = √Variance = √(np) = √m

Let us consider the following illustrations to understand the application of the


poisson distribution.
Illustration 5

2% of the electronic toys produced in a certain manufacturing process turn out
to be defective. What is the probability that a shipment of 200 toys will contain
exactly 5 defectives? Also find the mean and standard deviation.

Solution: In the given illustration n = 200;

Probability of a defective toy (p) = 2/100 = 0.02

Since n is large and p is small, the Poisson distribution is applicable. Apply the
formula:

P(r) = (m^r e^(–m)) / r!

The probability of 5 defective pieces in 200 toys is given by:

P(5) = (m^5 e^(–m)) / 5!, where m = np = 200 × 0.02 = 4;
e = 2.7183 (constant)

∴ P(5) = (4^5 × 2.7183^(–4)) / (5 × 4 × 3 × 2 × 1) = (1024 × 0.0183) / 120

       = 0.156

Mean = np = 200 × 0.02 = 4; σ = √(np) = √4 = 2

Illustration 6

Find the probability of exactly 4 defective tools in a sample of 30 tools chosen


at random by a certain tool producing firm by using i) Binomial distribution and
ii) Poisson distribution. The probability of defects in each tool is given to be
0.02.

Solution: i) When binomial distribution is used, the probability of 4 defectives in


30 tools is given by:

P(4) = 30C4 (0.02)^4 (0.98)^26

     = 27405 × 0.00000016 × 0.59 = 0.00259

ii) When poisson distribution is used, the probability of 4 defectives in 30 tools is


given by:

P(4) = (m^4 e^(–m)) / 4!, where m = np = 30 (0.02) = 0.6

e = 2.7183 (constant)

∴ P(4) = (0.6^4 × 2.7183^(–0.6)) / (4 × 3 × 2 × 1)

       = [Reciprocal of antilog (0.6 × log 2.7183)] × 0.1296 / 24

       = [Rec. of antilog 0.2606] × 0.1296 / 24 = (0.5485 × 0.1296) / 24

       = 0.00296
Remark: In general, the Poisson distribution with parameter m = np can be used as an
approximation to the binomial distribution; the approximation is good if n ≥ 20 and p ≤ 0.05.
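(The closeness of the two answers in Illustration 6, and the remark above, can be seen numerically; the sketch below is an optional Python check with illustrative names.)

import math

n, p, r = 30, 0.02, 4
binomial = math.comb(n, r) * p**r * (1 - p)**(n - r)      # exact binomial probability
m = n * p                                                 # Poisson mean = np = 0.6
poisson = (m**r) * math.exp(-m) / math.factorial(r)       # Poisson approximation
print(round(binomial, 5), round(poisson, 5))              # ≈ 0.00259 and ≈ 0.00296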
Fitting of a Poisson Distribution

To fit a Poisson distribution to given observed data (a frequency distribution),
the procedure is as follows:

1) We must obtain the value of its mean i.e., m = np

2) The probabilities of various values of the random variable (r) are to be
computed by using the p.m.f., i.e., P(r) = (m^r e^(–m)) / r!
3) Each probability so obtained in step 2 is then multiplied by N (the total
frequency) to get expected frequencies.

Let us consider an illustration to understand the fitting of a Poisson distribution.

Illustration 7

The number of defects per unit in a sample of 330 units of manufactured


product was found as follows:

No. of defects No. of units


0 214
1 92
2 20
3 3
4 1

Fit a poisson distribution to the above given data.

Solution: The mean of the given frequency distribution is:

m = [(0 × 214) + (1 × 92) + (2 × 20) + (3 × 3) + (4 × 1)] / (214 + 92 + 20 + 3 + 1) = 145 / 330 = 0.439

We can write P(r) = (0.439^r × e^(–0.439)) / r!. Substituting r = 0, 1, 2, 3, and 4, we get
the probabilities for various values of r, as shown below:

P0 = (m^0 e^(–m)) / 0! = (0.439^0 × 2.7183^(–0.439)) / 0! = (1 × 0.6443) / 1 = 0.6443

N(P0) = P0 × N = 0.6443 × 330 = 212.62

N(P1) = N(P0) × m/1 = 212.62 × 0.439/1 = 93.34

N(P2) = N(P1) × m/2 = 93.34 × 0.439/2 = 20.49

N(P3) = N(P2) × m/3 = 20.49 × 0.439/3 = 3.0

N(P4) = N(P3) × m/4 = 3 × 0.439/4 = 0.33

Thus, the expected frequencies as per the Poisson distribution are:
No. of defects (x):                    0         1        2       3      4
Expected frequencies (No. of units):  212.62   93.34    20.49    3.0    0.33
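(The fitting procedure of Illustration 7 can be reproduced in a few lines of Python; this sketch is optional, and the small differences from the hand-computed figures arise only from rounding m to 0.439.)

import math

counts = {0: 214, 1: 92, 2: 20, 3: 3, 4: 1}                 # observed frequency distribution
N = sum(counts.values())                                    # 330 units
m = sum(x * f for x, f in counts.items()) / N               # mean ≈ 0.439
expected = [N * (m**r) * math.exp(-m) / math.factorial(r) for r in range(5)]
print([round(e, 2) for e in expected])                      # ≈ [212.7, 93.4, 20.5, 3.0, 0.3]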

Note: We can use Appendix Table-2, given at the end of this block, to
determine poisson probabilities quickly.

Self Assessment Exercise B

1) What are the features of binomial and poisson distributions?


.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................

2) Suppose on an average 2% of electric bulbs manufactured by a company


are defective. If they produce 100 bulbs in a day, what is the probability
that 4 bulbs will have defects on that day ?
.............................................................................................................
.............................................................................................................
.............................................................................................................

3) Four hundred car air-conditioners are inspected as they come off the
production line and the number of defects per set is recorded below. Find
the expected frequencies by assuming the poisson model.

No. of defects : 0 1 2 3 4 5

No. of ACs: 142 156 69 27 5 1


.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................
.............................................................................................................

14.5 CONTINUOUS PROBABILITY DISTRIBUTION
In the previous sections, we have examined situations involving discrete random
variables and the resulting probability distributions. Let us now consider a
situation, where the variable of interest may take any value within a given
range. Suppose that we are planning to release water for hydropower
generation and irrigation. Depending on how much water we have in the
reservoir, viz., whether it is above or below the ‘normal’ level, we decide on
the quantity of water and time of its release. The variable indicating the
difference between the actual level and the normal level of water in the
reservoir, can take positive or negative values, integer or otherwise. Moreover,
this value is contingent upon the inflow to the reservoir, which in turn is
uncertain. This type of random variable which can take an infinite number of
values is called a continuous random variable, and the probability distribution
of such a variable is called a continuous probability distribution.

Now we present one important probability density function (p.d.f), viz., the
normal distribution.

14.5.1 Normal Distribution

The normal distribution is the most versatile of all the continuous probability
distributions. It is useful in statistical inferences, in characterising uncertainties
in many real life situations, and in approximating other probability distributions.

As stated earlier, the normal distribution is suitable for dealing with variables
whose magnitudes are continuous. Many statistical data concerning business
problems are displayed in the form of normal distribution. Height, weight and
dimensions of a product are some of the continuous random variables which are
found to be normally distributed. This knowledge helps us in calculating the
probability of different events in varied situations, which in turn is useful for
decision-making.

To define a particular normal probability distribution, we need only two


parameters i.e., the mean (µ) and standard deviation (σ).

Now we turn to examine the characteristics of normal distribution with the help
of the figure 14.1, and explain the methods of calculating the probability of
different events using the distribution.

[Figure 14.1: Frequency Curve for the Normal Probability Distribution — a bell-shaped
curve, symmetrical around a vertical line erected at the mean (where mean = median =
mode); both the left-hand and right-hand tails extend indefinitely but never reach the
horizontal axis.]


14.5.2 Characteristics of Normal Distribution

1) The curve has a single peak; thus it is unimodal, i.e., it has only one mode and
has a bell shape.

2) Because of the symmetry of the normal probability distribution (skewness = 0),


the median and the mode of the distribution are also at the centre. Thus, for a
normal curve, the mean, median and mode are the same value.

3) The two tails of the normal probability distribution extend indefinitely but never
touch the horizontal axis.

Areas Under the Normal Curve


The area under the normal curve (Fig. 14.1) gives us the proportion of the
cases falling between two numbers or the probability of getting a value between
two numbers.

Irrespective of the value of mean (µ) and standard deviation (σ), for a normal
distribution, the total area under the curve is 1.00. The area under the normal
curve is approximately distributed by its standard deviation as follows:

µ±1σ covers 68% area, i.e., 34.13% area will lie on either side of µ.

µ ± 2σ covers 95.5% area, i.e., 47.75% will lie on either side of µ.

µ ± 3σ covers 99.7% area, i.e., 49.85% will lie on either side of µ.

Using the Standard Normal Table


The areas under the normal curve are shown in the Appendix Table-3 at the
end of this block. To use the standard normal table to find normal probability
values, we follow two steps. They are:

Step 1: Convert the normal distribution to a standard normal distribution.


The standard random variable Z can be computed as follows:

Z = (X − µ) / σ
Where,

X = Value of the random variable with which we are concerned.

µ = mean of the distribution of this random variable

σ = standard deviation of this distribution.

Z = Number of standard deviations from X to the mean of this


distribution.

Step 2: Look up the probability of z value from the Appendix Table-3, given at
the end of this block, of normal curve areas. This Table is set up to
provide the area under the curve to any specified value of Z. (The
area under the normal curve is equal to 1. The curve is also called
the standard probability curve).
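(Where a computer is available, the table look-up of Step 2 can also be done with the error function in Python's standard library; the sketch below is optional and the function name is ours. It returns the area between the mean, Z = 0, and the given value of Z, i.e., the quantity tabulated in Appendix Table-3.)

import math

def area_mean_to_z(z):
    # area under the standard normal curve between Z = 0 and z
    return 0.5 * math.erf(abs(z) / math.sqrt(2))

print(round(area_mean_to_z(1.54), 4))     # 0.4382
print(round(area_mean_to_z(-1.46), 4))    # 0.4279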
Let us consider the following illustration to understand how the table
should be consulted in order to find the area under the normal curve.

Illustration 8
(a) Find the area under the normal curve for Z = 1.54.

Solution: Consulting the Appendix Table-3 given at the end of this block, we
find that the entry corresponding to Z = 1.54 is 0.4382, and this measures the
shaded area between Z = 0 and Z = 1.54, as shown in the following figure.

[Figure: standard normal curve with the area of 0.4382 shaded between µ (Z = 0) and Z = 1.54]

(b) Find the area under normal curve for Z = –1.46

Solution: Since the curve is symmetrical, we can obtain the area between z =
–1.46 and Z = 0 by considering the area corresponding to Z = 1.46. Hence,
when we look at Z of 1.46 in Appendix Table-3 given at the end of this block,
we see the probability value of 0.4279. This value is also the probability value
of Z = –1.46 which must be shaded on the left of the µ as shown in the
following figure.
[Figure: standard normal curve with the area of 0.4279 shaded between Z = –1.46 and µ (Z = 0)]

(c) Find the area to the right of Z = 0.25

Solution: If we look up Z = 0.25 in the Appendix Table-3, we find the area of
0.0987. Subtracting 0.0987 (for Z = 0.25) from 0.5 gives 0.4013
(0.5 – 0.0987 = 0.4013).

[Figure: standard normal curve with the area of 0.4013 shaded to the right of Z = 0.25]
d) Find the area to the left of Z = 1.83.
Solution: If we are interested in finding the area to the left of Z (positive
value), we add 0.5000 to the table value given for Z. Here, the table value for
Z (1.83) = 0.4664. Therefore, the total area to the left of Z = 0.9664 (0.5000 +
0.4664) i.e., equal to the shaded area as shown below:

[Figure: standard normal curve with the area of 0.5000 to the left of µ and 0.4664 between µ and Z = 1.83 shaded, totalling 0.9664]

Now let us take up some illustrations to understand the application of normal


probability distribution.
Illustration 9
Assume the mean height of soldiers to be 68.22 inches with a variance of 10.8
inches. How many soldiers in a regiment of 1,000 would you expect to be over
six feet tall?

Solution: Z = (X − µ) / σ

X = 72 inches; µ = 68.22 inches; and σ = √10.8 = 3.286

∴ Z = (72 − 68.22) / 3.286 = 1.15

For Z = 1.15 the area is 0.3749 (Appendix Table-3).

[Figure: normal curve with mean 68.22; the area of 0.3749 lies between µ and 72, and 0.1251 lies to the right of 72]

The area to the right of the ordinate at Z = 1.15, from the normal table, is (0.5 – 0.3749)
= 0.1251. Hence, the probability of a soldier being above six feet is 0.1251 and,
out of 1,000 soldiers, the expectation is 1,000 × 0.1251 = 125.1 or 125. Thus,
the expected number of soldiers over six feet tall is 125.
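(An optional Python check of Illustration 9, using the error function instead of the table; names are illustrative.)

import math

mu, variance, n_soldiers = 68.22, 10.8, 1000
sigma = math.sqrt(variance)                          # ≈ 3.286
z = (72 - mu) / sigma                                # ≈ 1.15
prob_over_six_feet = 0.5 - 0.5 * math.erf(z / math.sqrt(2))
print(round(prob_over_six_feet, 4), round(n_soldiers * prob_over_six_feet))   # ≈ 0.125 and 125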

Illustration 10
(a) 15,000 students appeared for an examination. The mean marks were 49 and the
standard deviation of marks was 6. Assuming the marks to be normally
distributed, what proportion of students scored more than 55 marks?

Solution: Z = (X − µ) / σ

X = 55; µ = 49; σ = 6

∴ Z = (55 − 49) / 6 = 1
For Z = 1, the area is 0.3413 (as per Appendix Table-3).

∴ The proportion of students scoring more than 55 marks is

0.5–0.3413 = 0.1587 or 15.87%

(b) If in the same examination, Grade ‘A’ is to be given to students scoring more
than 70 marks, what proportion of students will receive grade ‘A’?

Solution: Z = (X − µ) / σ

X = 70; µ = 49; σ = 6

∴ Z = (70 − 49) / 6 = 3.5

The table gives the area under the standard normal curve corresponding to
Z = 3.5 as 0.4998.

Therefore, 0.02% (0.5 – 0.4998 = 0.0002, i.e., 0.0002 × 100 = 0.02%) would score more than 70
marks. Since there are 15,000 candidates, 3 candidates (15,000 × 0.02% = 3)
will receive Grade 'A'.

Illustration 11
In a training programme (self-administered) to develop marketing skills of marketing
personnel of a company, the participants indicate that the mean time on the
programme is 500 hours and that this normally distributed random variable has a
standard deviation of 100 hours. Find out the probability that a participant selected
at random will take:
i) fewer than 570 hours to complete the programme, and
ii) between 430 and 580 hours to complete the programme.
Solution: (i) To get the Z value for the probability that a candidate selected at
random will take fewer than 570 hours, we have

Z = (x − µ) / σ = (570 − 500) / 100

  = 70 / 100 = 0.7

Consulting the Appendix Table-3 for a Z value of 0.7, we find a probability of
0.2580 (this is the probability that the variable lies between the mean, 500 hours,
and 570 hours). As explained in Illustration 8(d), we must add to this the probability
of 0.5 that the random variable lies between the left-hand tail and the mean.
Therefore, the probability that the random variable will lie between the
left-hand tail and 570 hours is 0.7580 (0.5 + 0.2580).
This situation is shown below:

[Figure: normal curve with mean 500; the area of 0.2580 lies between µ and 570 (Z = 0.7), so P(less than 570) = 0.7580]

Thus, the probability of a participant taking less than 570 hours to complete the
programme, is marginally higher than 75 per cent.

ii) In order to get the probability, of a participant chosen at random, that he will take
between 430 and 580 hours to complete the programme, we must, first, compute
the Z value for 430 and 580 hours.

Z = (x − µ) / σ

Z for 430 = (430 − 500) / 100 = −70 / 100 = −0.7

Z for 580 = (580 − 500) / 100 = 80 / 100 = 0.8
The table shows that the probability values for Z values of −0.7 and 0.8 are 0.2580
and 0.2881 respectively. This situation is shown in the following figure.

[Figure: normal curve with mean 500; the shaded area between 430 (Z = −0.7) and 580 (Z = 0.8) is P(430 to 580) = 0.5461]

Thus, the probability that the random variable lies between 430 and 580 hours is
0.5461 (0.2580 + 0.2881).
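(Both parts of Illustration 11 can be checked with a small cumulative-distribution helper; the Python sketch is optional, the function name is ours, and the tiny differences from the text come from four-figure table rounding.)

import math

def normal_cdf(x, mu, sigma):
    # P(X <= x) for a normal random variable with mean mu and standard deviation sigma
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

mu, sigma = 500, 100
print(round(normal_cdf(570, mu, sigma), 4))                               # ≈ 0.758
print(round(normal_cdf(580, mu, sigma) - normal_cdf(430, mu, sigma), 4))  # ≈ 0.5462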

14.5.3 Importance and Application of Normal Distribution


This distribution was initially discovered for studying the random errors in
measurements, which are encountered during the calculations of orbits of
heavenly bodies. It happens because of the fact that the normal distribution
follows the basic principle of errors. It is mainly for this quality that the
distribution has a wide range of applications in the theory of statistics. To count
a few :

– Industrial quality control


– Testing of significance
– Sampling distribution of various statistics
– Graduation of non-normal curve
– Length of the leaves observed at particular times of the year.

The main purposes of using a normal distribution are:

(i) To fit a distribution of measurements for some sample data,

(ii) To approximate distributions like the Binomial, Poisson, etc., and

(iii) To fit the sampling distribution of various statistics like the mean or variance, etc.

Self Assessment Exercise C

1) Given a standardized normal distribution (area between the mean and a
positive value of Z, as in Appendix Table-3), what is the probability
that:
a) Z is less than +1.08?
b) Z is greater than – 0.21?
c) Z is between the mean and +1.08?
d) Z is between – 0.27 and +1.06?
e) Z is between – 0.21 and the mean?
2) Given a normal distribution with µ = 100 and σ = 10, what is the
probability that:
a) X > 75?
b) X < 70?
c) X > 112?
d) 75 < X < 85?
e) X < 80 or X > 110?

14.6 LET US SUM UP


In this unit, we have discussed the meaning of frequency distribution and
probability distribution, and the concepts of random variables and probability
distribution. In any uncertain situation, we are often interested in the behaviour
of certain quantities that take different values in different outcomes of
experiments. These quantities are called random variables and a representation
that specifies the possible values a random variable can take, together with the
associated probabilities, is called a probability distribution. The distribution of a
discrete variable is called a discrete probability distribution and the function that
specifies a discrete distribution is termed as a probability mass function (p.m.f.).
In the discrete distribution we have considered the binomial and poisson
distributions and discussed how these distributions are helpful in decision-making.
We have shown the fitting of such distributions to a given observed data.

In the final section, we have examined situations involving continuous random


variables and the resulting probability distributions. The random variable which
can take an infinite number of values is called a continuous random variable
and the probability distribution of such a variable is called a continuous
probability distribution. The function that specifies such distribution is called the
probability density function (p.d.f.). One such important distribution, viz., the
normal distribution has been presented and we have seen how probability
calculations can be done for this distribution.

14.7 KEY WORDS


Binomial Distribution: It is a type of discrete probability distribution function
that includes an event that has only two outcomes (success or failure) and all
the trials are mutually independent.

Continuous Probability Distribution: In this distribution the variable under


consideration can take any value within a given range.

Continuous Random Variable: If the random variable is allowed to take any


value within a given range, it is termed as continuous random variable.

Discrete Probability Distribution: A probability distribution in which the


variable is allowed to take on only a limited number of values.

Discrete Random Variable: A random variable that is allowed to take only a


limited number of values.

Normal Distribution: It is a type of continuous probability distribution with a


single peaked, bell-shaped curve. The curve is symmetrical around a vertical
line erected at the mean. It is also known as the Gaussian distribution.

Poisson Distribution: It is the limiting form of the binomial distribution, in
which the probability of success is very low and the total number of trials is
very high.

Probability: Any numerical value between 0 and 1 both inclusive, telling about
the likelihood of occurrence of an event.

Probability Distribution: A curve that shows all the values that the random
variable can take and the likelihood that each will occur.

Random Variable: It is a variable that can take different values as a result of
the outcomes of a random experiment.

14.8 ANSWERS TO SELF ASSESSMENT EXERCISES
A) 1. a) True; b) False; c) True
d) True; e) False; f) True
2. a) 0.5997; b) 0.0016; c) 0.4018
3. Expected frequencies approximated
97 484 969 969 484 97
B) 2. 0.09
3. Expected frequencies (No. of ACs) approximated
147, 147, 74, 25, 6, 1.
C) 1. a) 0.8599; b) 0.5832; c) 0.3599
d) 0.4618; e) 0.0832
2. a) 0.9938; b) 0.00135; c) 0.1151
d) 0.0606; e) 0.1815.

14.9 TERMINAL QUESTIONS/EXERCISES


1) Distinguish between frequency distribution and probability distribution.

2) Explain the concept of random variable and probability distribution.

3) Define a binomial probability distribution. State the conditions under which the
binomial probability model is appropriate by illustrations.

4) Explain the characteristics of a poisson distribution. Give two examples, the


distribution of which will conform to the poisson form.

5) What do you mean by continuous probability distribution? How does it differ


from binomial distribution?

6) Explain the procedure involved in fitting binomial and poisson distributions.

7) If the average proportion of defective items in the manufacture of certain items is
10%, what is the probability that a) 0, b) 2, c) at most 2, d) at least two items
are found to be defective in a sample of 12 items taken at random?

Ans: a) 0.2824, b) 0.2301, c) 0.8891, d) 0.3410.


Statistics:
Bayes’ Theorem
Bayes’ Theorem (or Bayes’ Rule) is a very famous theorem in statistics. It was originally stated by the
Reverend Thomas Bayes.

If we have two events A and B, and we are given the conditional probability of A given B, denoted
P(A|B), we can use Bayes’ Theorem to find P(B|A), the conditional probability of B given A.

Bayes' Theorem: P(B|A) = P(A|B)P(B) / [P(A|B)P(B) + P(A|B′)P(B′)]

where P(B′) is the probability of B not occurring (the complement of B).

Example:
Q: In a factory there are two machines manufacturing bolts. The first machine manufactures 75%
of the bolts and the second machine manufactures the remaining 25%. From the first machine 5%
of the bolts are defective and from the second machine 8% of the bolts are defective. A bolt is
selected at random; what is the probability that the bolt came from the first machine, given that it is defective?

A:
Let A be the event that a bolt is defective and let B be the event that a bolt came from Machine 1.
Check that you can see where these probabilities come from!
P(B) = 0.75   P(B′) = 0.25   P(A|B) = 0.05   P(A|B′) = 0.08
Now, use Bayes’ Theorem to find the required probability:

P(B|A) = P(A|B)P(B) / [P(A|B)P(B) + P(A|B′)P(B′)]

       = (0.05 × 0.75) / (0.05 × 0.75 + 0.08 × 0.25)

       = 0.0375 / 0.0575

       = 0.6522
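(A quick check of the arithmetic in plain Python; the variable names are ours.)

p_B, p_Bc = 0.75, 0.25                     # prior probabilities: machine 1, machine 2
p_A_given_B, p_A_given_Bc = 0.05, 0.08     # probability of a defective bolt from each machine
posterior = (p_A_given_B * p_B) / (p_A_given_B * p_B + p_A_given_Bc * p_Bc)
print(round(posterior, 4))                 # 0.6522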

Try this:
Exercise: Among a group of male pensioners, 10% are smokers and 90% are nonsmokers. The proba-
bility of a smoker dying in the next year is 0.05 while the probability for a nonsmoker is 0.005. Given
one of these pensioners dies in the next year, what is the probability that he is a smoker?
