0% found this document useful (0 votes)
4 views

randomnumbers-5

The document discusses advanced topics in data analysis, focusing on the generation and application of random numbers in statistical inference. It covers methods for generating both discrete and continuous random variables, including the use of pseudo-random number generators and the inverse transformation method. Additionally, it explores the use of simulations to test complex models and the properties of various distributions, such as exponential and normal distributions.

Uploaded by

Jacob Thomas
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
4 views

randomnumbers-5

The document discusses advanced topics in data analysis, focusing on the generation and application of random numbers in statistical inference. It covers methods for generating both discrete and continuous random variables, including the use of pseudo-random number generators and the inverse transformation method. Additionally, it explores the use of simulations to test complex models and the properties of various distributions, such as exponential and normal distributions.

Uploaded by

Jacob Thomas
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 42

Advanced Topics in Data Analysis

2019

Shai Carmi
Outline

• Introduction
• Basic method
• Discrete random variables
• Continuous random variables
Why do we need the computer to generate random numbers?

• Statistical inference: We are often interested in hypothesis testing or estimation for


problems without known “recipe”

• Using randomness allows us to perform statistical inference even under those


conditions – more later
Testing methods

• Create realistic new datasets for testing new statistical methods

• Example:
o A method estimates gene expression from sequencing reads
o We would like to test the method, but for any real data we don’t know the true levels
o We generate many synthetic datasets, for which we know the true levels, by randomly drawing
an expression levels, reads, and errors
Simulations to test complex models

• Predict the behavior of complex stochastic models

• Example:
o An RNA molecule is traveling in the nucleus
o How long will it take it to exist? Nucleus
o Sometimes the theory is too complicated
Pore
o Instead, we simulate a random walk
RNA
Generating random numbers

• Traditional methods:
o Flip a coin
o Roll a die
o Pick a card from a deck
o Pick a ball from an urn
o Roulette

• But computers are deterministic…


• For a given input, the output is always the same (i.e., non-random!)
Pseudo-random numbers

• We can generate a sequence of apparently random numbers

• Different sequences can be generated with different seeds


Outline

• Introduction
• Basic method
• Discrete random variables
• Continuous random variables
Linear Congruential Generator

• Provide a seed:

• Generate a sequence of numbers:

• mod ( “modulo”) is the remainder after division by


Linear Congruential Generator

2 1 2 5
3 5 10 13
• Example:
4 3 6 9
o The sequence: 1,5,3,9,1,5,3,9,1,5,… 5 9 18 21
o Highly repetitive! 6 1 2 5
7 5 10 13
• The method generates integers in the 8 3 6 9
range 9 9 18 21
• We can’t draw more than numbers 10 1 2 5
without repeating ourselves
Large

• To avoid repetitions, we can use very large and


• ,,,
• The obtained sequence is
1, 1234568, 5551574, 3874565, 2920147, 819239, 1729179, 4857915, 797392, 3971134, 116874, 4832709,
4770776, 3345234, 3645166, 4361914, 10825, 7397631, 3194813, 324240, 5633065, 1578738, 7462933,
7580275, 807622, 3961894, 5310405, 2364000, 333911, 4389762, 4677030, 960772, 860602, 5148609,
1522484, 5785348, 7367118, 7419904, 6384572, 7072797, 6698088, 7530362, 4803322, 6031887, 2206166,
7390051, 6411536, 7295356, 3534104, 2733833, 3703572, 6778296, 5729521, 4701493, 5876948, 1111543,
7638002, 6928321, 2384138, 809549, 2478672, 3934040, 891119, 5861786, 5075855, 5291901, 6084417, …
x = numeric(100)
x[1] = 1
for (i in 2:100)
Code in R {
x[i] = (1234567 * x[i-1] + 1) %%
7654321
cat(sprintf("%d, ",x[i]))
}
How to generate random numbers in [0,1]?
• We can simply divide the resulting numbers by m
• Example: , , ,

n=200000
x = numeric(n)
y = 1
x[1]=y/54321
for (i in 2:n)
{
y = (12345*y)%%54321
x[i] = y/54321
}
hist(x,breaks=25)
Practice

• Practically used generator: Mersenne-Twister


• The seed can be either fixed for reproducibility
• Or equal the time to generate a new sequence in each run

• All common software packages have a method for a random number uniformly
distributed between [0,1]
o This means that will take any value between [0,1] with equal probability
o In R: runif

• Using draws of , we can generate numbers with any distribution


Real story

• Mersenne-Twister in the US lottery draw


• https://www.nytimes.com/interactive/2018/05/03/magazine/money-issue-iowa-
lottery-fraud-mystery.html
Are those numbers random?

• Subsequent numbers should be independent of each other:

• The probability of each number (or a combination of numbers) should be the same
as any other number (or combination)

set.seed(1)
n = 10000
x = runif(n)
y = runif(n)
plot(x,y,pch=".”)
Is it possible to generate true random numbers?

• Physical phenomena can be exploited to create truly random numbers

o Thermal noise (random voltage on a resistor)


o Nuclear decay (number of emitted particles)
o Human behavior (moving a mouse)

• Such methods are typically not used in bioinformatics


Outline

• Introduction
• Basic method
• Discrete random variables
• Continuous random variables
Generating integers using

• In practice, we often only have a method to generate


• What if we need random integer between and ?

• Draw a random number uniformly distributed in (0,1]


• Calculate
• The desired number is the ceiling of

• Ceiling of : The smallest integer that’s larger than or equal to (rounding upwards)
• ceil[1.1]=2, ceil[4]=4, ceil[0.8]=1, ceil[5.23]=6
Generating integers within a range

• Draw a random number , uniformly distributed in (0,1]


• The desired number is the ceiling of

• Example:

o  
o  
o  
o  
o  
Random variables: reminder

• A random variable can take different numbers with certain probabilities based on
the results of an experiment

• Example: roll a die. can be:


o The number drawn
o 1 if the number is even, 0 if odd
o The square of the number drawn
o The sum of the numbers drawn in two rolls
Random variables: reminder

• Discrete random variables can take only specific values:

• The distribution of is the list of possible values along with their probabilities

• To draw discrete random variables with an arbitrary distribution, we use the


inverse transformation method
The inverse transformation method: example

• Suppose we are given the distribution

• To draw numbers:
• First, draw a number uniformly distributed between [0,1]
• Then, set:
The inverse transformation method: why?

• Suppose is a random variable with equal probability to take any value in [0,1]

• The probability of to take value in an interval is equal to the length of the


interval

Possible values of

0 𝑎 𝑏 1

Prob ⁡( 𝑎<𝑈 < 𝑏 )=(𝑏 − 𝑎)


The inverse transformation method: why?

• Back to the example:

• Why does have the desired distribution?

Possible values of

1 3 6
0 0.4 0.9 1

0.4 0.5 0.1


Probabilities
Values of
The inverse transformation method: formal definition

• Denote the values as and the probabilities as

• To generate a sample from :

• Draw a number uniformly distributed between [0,1]


• Set if

• (By convention, for , we define )


The inverse transformation method: why?

• Draw a number uniformly distributed between [0,1]


• Set if

Event Probability Possible values of U

𝒙 𝟏 𝒙 𝟐 𝒙 𝟑𝒙 𝟒
0 𝑝 1 𝑝 1 +𝑝 2 𝑝 1 +𝑝 2 +𝑝 3 1

𝑝1 𝑝2 𝑝3 𝑝4
Example

• To generate a sample from :


1 1 0.4
o Draw a number uniformly distributed between [0,1] 2 3 0.5
o Set if 3 6 0.1

Value of Condition on
1 1
2 3
3 6
Example: tossing coins

• Suppose we would like to draw a random number , equal to the number of “heads”
out of three coin tosses
• is binomial with and

Event Probability
TTT 0 1/8
TTH 1 1/8
THT 1 1/8
THH 2 1/8
HTT 1 1/8
HTH 2 1/8
HHT 2 1/8
HHH 3 1/8
The inverse transformation method

• To generate a sample from :

o Draw a number uniformly distributed between [0,1]


o Set if

Possible values of U

1 0 1/8 0 1 2 3
2 1 3/8 0 1/8 1/2 7/8 1
3 2 3/8
1/8 3/8 3/8 1/8
4 3 1/8
The inverse transformation method

• To generate a sample from :


1 0 1/8
2 1 3/8
o Draw a number uniformly distributed between [0,1]
o Set if 3 2 3/8
4 3 1/8

Value of Condition on Probability


1 0 1/8
2 1 3/8
3 2 3/8
4 3 1/8
“Trick”

• For some special cases, we can draw in a simpler way

• Consider the example of three coin tosses


• To toss a coin, draw a uniformly distributed random number
• If , set the coin as “T”, otherwise set the coin as “H”
• Repeat three times
• Set as the number of coins that came out as “H”

• Limited to specific cases where the problem has some structure, symmetry, etc.
Continuous random variables

• Can take any value within a range

• Examples:
o Standard uniform variables can have any value between [0,1]
o Normal variables can take any real value

• Continuous random variables have probability


density function (PDF)
Examples

Normal distribution with mean and variance


Standard uniform variable
𝑓 ( 𝑥)
1

𝑥
0 1
The cumulative distribution function
Standard uniform variable
𝑓 ( 𝑥) 𝐹 (𝑥)
1 1
= 𝑥
𝑥 𝑦 𝑥
0 1 0 1

Normal distribution with mean and variance


The inverse transformation method for continuous variables

• To draw a continuous random variable with density :

• Draw a standard uniform random number


• Set to satisfy

• Intuition: 𝐹 (𝑥)
o The random number is a position on the -axis
o We find for which 𝑈 𝑓 ( 𝑥)
o The larger in a region, the more values of
are covered by

𝑥
Example: non-standard uniform variable

• Consider a variable that is uniform in the range [0,4]

f(x)
• for
1/4
x
0 4

• To draw a variable , we draw a standard uniform F(x)


1
•  
• An obvious solution, after all x
0 4
Drawing normal variables

• The Box-Muller method:


o Draw two standard uniform variables, and
o Set and
o Set and
o and are standard normal variables (mean 0, variance 1)

• How to generate a normal variable with mean and variance ?


o Generate standard normal variable
o Set
Example: normal variable

• Generate random normal variables with


set.seed(1)
n = 10000
mu = 7
sigma = 5
Z = numeric(2*n)
for (i in 1:n)
{
U1 = runif(1)
U2 = runif(1)
R = sqrt(-2*log(U1))
theta = 2*pi*U2
Z[2*i-1] = R*cos(theta)
Z[2*i] = R*sin(theta)
}
X = Z*sigma+mu
h = hist(X,breaks=50,prob=T)
curve(dnorm(x,mu,sigma),add=T)
An exponential random variable

• An exponential variable has PDF


• It represents the time to an event that happens at rate per unit time, independently
of past events

• Examples:
o The number of generations until a mutation happens
o The time until a protein binds to DNA
o The time until a phone call in a call center
o The time until a lightbulb stops working
o The time to cancer, heart attack, injury
 At least as a null model
Properties of exponential random variables

• The mean of (“mean waiting time”) is

• is memoryless:
o Interpretation: the waiting time is the same regardless of whether we start counting from time
zero or from any other time
How to draw an exponential random variable

• Can also use


set.seed(1)
U = runif(10000)
• Example: X = -log(U)
h =
hist(X,breaks=50,prob=T)
curve(dexp,xlim=c(0,5),add
=T)
Drawing from complex/non-standard distributions

• Rejection sampling
• Markov chain Monte Carlo (MCMC): Metropolis-Hastings algorithm
o For drawing points in high dimension

• Basic idea:
o We need to draw a number from , but there is no direct method
o We draw instead numbers from , from which sampling is easy
 can be uniform, normal, etc.
o We then accept or reject the new sample depending on the relationship between and
o MCMC generates a sequence of correlated samples, which we need to “thin” to obtain
independent draws

You might also like