Sunday, April 4, 2021

Beta distribution - learning statistics with Python

Why? 

So I'm reading this nice book on Bayesian statistics:


I'm in the chapter on the Beta distribution, and things start to get new and difficult for me:


So I follow the advice of Nassim Taleb and start playing around with some Python code. I'm not a rich guy like Taleb, so I don't have Mathematica. And this is too complex to do for free in Wolfram Alpha.

I copy and paste a crappy Python program and things become wonderfully intuitive.

Sampling and the beta function

I make an urn with 1000 good balls and 9000 bad balls.
I draw 5 samples from the urn.
I make a histogram and the corresponding probability distribution.
Then I plot the beta function for each draw. This shows the distributions that were most likely to produce this draw.
With increasing sample size both the histograms and the beta functions converge to 0.1. (Duh!)

Sample size 10:
Sample size 100:
Sample size 400:

Is 400 enough?

Many years ago I read this excellent book:


It says that a sample of 400 is sufficient in most situations. For the hypergeometric distribution that we draw from this seems to be the case. Going to 1000 samples does not seem to add that much certainty anymore.
The crappy Python code

import numpy
import matplotlib.pyplot
from matplotlib.pyplot import hist
import seaborn
from scipy.stats import beta

ngood, nbad, sample, runs = 1000, 9000, 1000, 5
samples = numpy.random.hypergeometric(ngood, nbad, sample, runs)
print(samples)

seaborn.distplot(samples)
matplotlib.pyplot.show()

# Generate the value between 
x = numpy.linspace(0,1, num=400)

# Plot the beta distribution

matplotlib.pyplot.figure(figsize=(7,7))
matplotlib.pyplot.xlim(0, 1)
for s in samples:
    matplotlib.pyplot.plot(x, beta.pdf(x, s, sample-s), 'r-')

matplotlib.pyplot.title("Beta: " + str(sample) +
                        " samples from " + str(ngood) +
                        " good " + str(nbad) + " bad" +
                        "\n" + str(samples))
matplotlib.pyplot.xlabel('Values of Beta Dist.')
matplotlib.pyplot.ylabel('Probability')
matplotlib.pyplot.show()

No comments:

Post a Comment