Probability / Statistics - TP0

The goal of this lab exercise is to learn:

  • the basics of Python
  • how to use the numpy library
  • how to code a few things related to probability :)

This lab exercise is not evaluated, but do it seriously: it will help you with the next ones.

Python and numpy

You need to install the following libraries: numpy and matplotlib. Please refer to the internet for instructions if you don't know how to do that.

Numpy is one of the most popular numerical computation libraries in Python. For this lab exercise we are mainly interested in tensor computation. To start, some reading on Python and numpy: https://cs231n.github.io/python-numpy-tutorial/

Take time to do a few tests: understand the different operations, the difference between in-place and out-of-place operations, etc. The most important resource you must use is the numpy documentation. As we usually say in computer science: Read The F*cking Manual https://numpy.org/doc/stable/reference/index.html
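
As a quick illustration (a sketch of ours, not part of the original notebook), here is the difference in practice; np.shares_memory lets you check whether two arrays point to the same memory:

In [ ]:
# in-place vs out-of-place: a minimal illustration
import numpy as np

t = np.zeros(3)
t2 = t + 1                      # out-of-place: a new array is allocated for the result
print(np.shares_memory(t, t2))  # False: t2 lives in different memory
t += 1                          # in-place: the memory of t is modified directly
print(t)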

In [ ]:
# import libraries
import numpy as np
import sklearn.datasets

import matplotlib.pyplot as plt
%matplotlib inline
In [ ]:
# create a 2D tensor (i.e. a matrix) of shape (2, 5) full of zeros
# by default the tensor will contain elements of type float
t = np.zeros((2, 5))
print("Shape of the tensor: ", t.shape)
print("Content:")
print(t)
In [ ]:
# A vector is a one-dimensional tensor.
# it's usual to think about them as "column vectors",
# i.e. a vector of 5 elements is similar in practice to a matrix of shape (5, 1)

t = np.zeros((5, ))
print("Shape of the tensor: ", t.shape)
# yes, it's displayed as a "row", but mentally you need to think about it as a column :)
print("Content:")
print(t)
In [ ]:
# You can reshape tensors
# When you reshape a tensor, it does not change the data order in the underlying memory.
# By default, this is the "C array order", also called the row-major format.
# If you don't know about this, check the wikipedia page:
# https://en.wikipedia.org/wiki/Row-_and_column-major_order

# for example, the following operation will reshape t as a vector with ten elements
t = np.zeros((2, 5))
t = t.reshape(-1) # -1 means "infer this dimension automatically"
print(t.shape)

# here instead of having a vector we build a tensor with a single row with ten elements
t = t.reshape(1, -1)
# of course we could have done t = t.reshape(1, 10)
print(t.shape)

# here a tensor with a single column with ten elements
t = t.reshape(-1, 1)
# of course we could have done t = t.reshape(10, 1)
print(t.shape)

# reshape into the original shape
t = t.reshape(2, -1)
print(t.shape)
In [ ]:
# this creates a vector with values from 0 to 4 (excluded), i.e. 0, 1, 2, 3
t = np.arange(4)
print(t)

# reshape
t = t.reshape((2, 2))
print(t)

# .T returns the transposed tensor
print(t.T)
In [ ]:
# multiply the content of the tensor by two
# this is an out-of-place operation: it does not modify the tensor's memory but creates a new one
t2 = 2 * t

print("Original tensor:")
print(t)
print("New tensor:")
print(t2)
In [ ]:
# do the same thing but in-place
t *= 2
print("New content:")
print(t)
In [ ]:
# There are two multiplication operators:
# * is the element-wise multiplication operator (also called the Hadamard product)
# @ is the matrix multiplication operator

a = np.arange(4).reshape(2, 2)
# ones_like creates a tensor with the same properties (i.e. type and shape) as the argument,
# but filled with ones
b = 2 * np.ones_like(a) 

print("a:")
print(a)
print("b:")
print(b)
print()

# element-wise multiplication
c = a * b
print("a * b:")
print(c)
print()

# matrix multiplication
c = a @ b
print("a @ b:")
print(c)
In [ ]:
# you can easily retrieve one row or one column of a tensor
print("a:")
print(a)
print()

print("first row of a:")
print(a[0])
print()

print("first column of a:")
print(a[:, 0])
In [ ]:
# the same approach can be used to update the data in-place
print("a:")
print(a)
print()

# set the second column's elements to 10
a[:, 1] = 10.
print("after update:")
print(a)

One of the most important features you have to understand is broadcasting. You can read the following article to understand operation broadcasting: https://numpy.org/devdocs/user/theory.broadcasting.html

It is a very important concept that is really helpful in numpy and other numerical computation libraries. The documentation often explains how broadcasting is implemented for a given operation; check for example the matrix multiplication page: https://numpy.org/doc/stable/reference/generated/numpy.matmul.html
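
To fix ideas, here is a small sketch of ours: shapes are aligned from the right, and dimensions of size one are virtually "stretched" to match.

In [ ]:
# broadcasting illustration: combining shape (2, 3) with shapes (3,) and (2, 1)
a = np.arange(6).reshape(2, 3)   # shape (2, 3)
row = np.array([10, 20, 30])     # shape (3,)   -> broadcast to (2, 3)
col = np.array([[1], [2]])       # shape (2, 1) -> broadcast to (2, 3)
print(a + row)
print(a * col)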

In [ ]:
a = np.arange(6).reshape(2, -1)
print("a: ")
print(a)
print()

# we will multiply the first row by 2 and the second row by 4 using operation broadcasting
# np.array can be used to create a tensor from python data
b = np.array([2, 4]).reshape((2, 1))
c = a * b

print("new tensor:")
print(c)

Of course, you can also read the value at a specific index or write a specific value into a tensor.

In [ ]:
t = np.zeros((2, 2))
print(t)
t[0, 1] = 10
print(t)
t *= 2
print(t)
print(t[0, 1])

Empirical distribution, mean and variance

To sample from a Gaussian you can use np.random.normal(loc=0.0, scale=1.0, size=None). Note that the parameters are the mean and the standard deviation, not the variance (again, read the doc!).

You can compute the empirical mean and variance of the samples using np.mean and np.var.
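
For example (an illustration of ours), to sample with variance 4 you must pass the standard deviation, i.e. scale=np.sqrt(4):

In [ ]:
# samples with mean 1 and variance 4 (so scale = sqrt(4) = 2)
samples = np.random.normal(loc=1.0, scale=np.sqrt(4.0), size=1000)
print(samples.mean(), samples.var())  # should be close to 1 and 4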

In [ ]:
normal_samples = np.random.normal(size=(100))

print(np.mean(normal_samples), np.var(normal_samples))
print(normal_samples.mean(), normal_samples.var())
In [ ]:
# TODO: compute the mean and variance without using mean and var
# instead, use sum(), addition and division

todo todo todo
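
If you want to check your result afterwards, here is one possible solution sketch (try by yourself first!):

In [ ]:
# one possible solution: mean = sum / n, variance = mean of the squared deviations
n = normal_samples.shape[0]
mean = normal_samples.sum() / n
variance = ((normal_samples - mean) ** 2).sum() / n
print(mean, variance)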

Compute the empirical mean and variance using different numbers of samples, for example 2, 5, 10, 20, 50, 100, 500, 1000, etc. (use a loop, with range() or with a list of predefined values). What do you observe? Try again when sampling from a Gaussian with a larger variance.

In [ ]:
todo todo todo
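
A possible skeleton for this experiment (one way among many):

In [ ]:
# the estimates should get closer to the true values (0 and 1) as the number of samples grows
for n in [2, 5, 10, 20, 50, 100, 500, 1000]:
    samples = np.random.normal(size=(n,))
    print(n, samples.mean(), samples.var())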

PDF of a Gaussian

The goal of this section is to visualize the probability density function of a Gaussian distribution.

In [ ]:
def gaussian_pdf(x, mean = 0, variance = 1):
    # ** is the power operator
    return np.exp( -( (x - mean)**2 ) / (2 * variance) ) / np.sqrt(variance * 2 * np.pi)
In [ ]:
# display the PDF of a Gaussian;
# try to play a little bit with the parameters (mean and variance)
mean = 0
variance = 1

# the gaussian_pdf function works with numpy vectors!
# x is a vector with values from -4 to 4, taking steps of 0.1
x = np.arange(-4, 4, 0.1)
y = gaussian_pdf(x, mean, variance)

# print the first five values in vectors x and y
print("x: ", x[:5])
print("y: ", y[:5])

plt.plot(x, y)
In [ ]:
# as a second example, we can plot the log of the PDF and see that it has a nice quadratic form!
plt.plot(x, np.log(y))
plt.show()
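
Indeed, taking the log of the definition above gives log p(x) = -((x - mean)**2) / (2 * variance) - 0.5 * log(2 * pi * variance): a concave quadratic function of x, maximized at the mean.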

We are now going to approximate the Gaussian PDF from samples using bins:

  1. We break the x axis into bins of the same size
  2. We count the number of samples in each bin
  3. We renormalize the counts
  4. We use the bin counts to approximate the Gaussian density
In [ ]:
"""
inputs:
 - values: samples from a distribution
 - lower_bound: left limit on the x axis
 - upper_bound: right limit on the x axis
 - n_bins: number of bins to create
outputs:
 - list: number of elements in each bin
 - list: list of tuples indicating the left and right boundaries of each bin
 - float: normalization constant (used to make the area represented by the bins sum to one)
"""
def count_elements(values, lower_bound, upper_bound, n_bins):
    if upper_bound <= lower_bound or min(values) < lower_bound or max(values) > upper_bound:
        raise RuntimeError("Invalid bounds")

    # given the upper bound, lower bound and n_bins,
    # what is the size of each bin?
    # put this value in delta
    delta = TODO TODO TODO
    
    # vector that will contain the number of values in each bin
    hist = np.zeros((n_bins))
    # bin_spans is a list containing the span of each bin.
    # for example, if you have 10 bins ranging on the x axis from 0 to 10,
    # bin_spans would be the list [(0, 1), (1, 2), ..., (9, 10)]
    bin_spans = TODO TODO TODO
    
    # now you must fill hist, i.e. count the number of samples from values
    # that fall in each bin
    # smart solution
    #for v in values:
    #    idx = int(np.floor((v-lower_bound)/delta))
    #    hist[idx] += 1
    TODO TODO TODO
        
    # normalization constant
    norm = sum(hist) * delta
    return hist, bin_spans, norm

# test the function with uniform samples in [0, 1]
uniform_samples = np.random.random((10000))
h1, h1_spans, norm = count_elements(uniform_samples, 0, 1, 10)

# raw count
print(h1)
# normalized counts: each value must be close to one (the density of the uniform distribution on [0, 1])
print(h1 / norm)
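
As a sanity check (our suggestion, not required by the lab), numpy's built-in np.histogram computes the same thing; density=True normalizes the area under the bins to one:

In [ ]:
# reference computation with numpy's histogram
ref_hist, ref_edges = np.histogram(uniform_samples, bins=10, range=(0, 1), density=True)
print(ref_hist)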
In [ ]:
# test the function with uniform samples in [0, 0.1]
# what do you observe compared to the example in the cell above? why?
uniform_samples = np.random.random((10000)) / 10
h2, h2_spans, norm = count_elements(uniform_samples, 0, 0.1, 10)

print(h2)
print(h2 / norm)
In [ ]:
# we now sample from a Gaussian and print a plot with the "bins"

# samples from a Gaussian
normal_samples = np.random.normal(size=(10000))
# create the bins
h2, h2_spans, norm = count_elements(normal_samples, min(normal_samples) - 0.01, max(normal_samples) + 0.01, 50)

# compute the x and y values
# xs: one value per bin; you can take the "center" of each bin,
#     use the spans computed by the function!
xs = TODO TODO TODO

# ys: values for the y axis. what should they be?
ys = TODO TODO TODO

plt.plot(xs, ys, color="red")
plt.plot(xs, gaussian_pdf(xs), color="blue")
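
One possible way to fill the two TODOs above (a sketch; try your own version before looking):

In [ ]:
# one possible solution: bin centers for xs, normalized counts for ys
xs = np.array([(left + right) / 2 for left, right in h2_spans])
ys = h2 / norm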

Try to print the graph with different numbers of samples (i.e. a small number of values first, then more and more). What do you observe?

In [ ]:
todo todo todo
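
A possible skeleton (assuming you completed count_elements above):

In [ ]:
# with few samples the estimated curve is noisy; with many it gets close to the true PDF
for n in [100, 1000, 10000]:
    samples = np.random.normal(size=(n,))
    h, spans, norm = count_elements(samples, samples.min() - 0.01, samples.max() + 0.01, 50)
    xs = np.array([(left + right) / 2 for left, right in spans])
    plt.plot(xs, h / norm, label=f"n={n}")
plt.plot(xs, gaussian_pdf(xs), color="black", label="true PDF")
plt.legend()
plt.show()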

Conditional distribution

We are now going to manipulate discrete distributions.

In [ ]:
size_x = 10
size_y = 5

logits = np.random.rand(size_x, size_y)
logits

What do the following two cells compute? Why do we need to do that? Try to understand this precisely!

Bonus: try to stabilize the softmax function, see https://timvieira.github.io/blog/post/2014/02/11/exp-normalize-trick/

In [ ]:
def softmax(v):
    if len(v.shape) != 1:
        raise RuntimeError("v must be a vector")
        
    return np.exp(v) / np.exp(v).sum()
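
As a hint for the bonus, the exp-normalize trick from the linked post subtracts the maximum before exponentiating; this does not change the result (it multiplies the numerator and denominator by the same constant) but avoids overflow:

In [ ]:
# a possible stabilized softmax (mathematically equivalent to the version above)
def stable_softmax(v):
    if len(v.shape) != 1:
        raise RuntimeError("v must be a vector")
    w = v - v.max()  # the largest exponent becomes 0, so np.exp cannot overflow
    return np.exp(w) / np.exp(w).sum()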
In [ ]:
joint_distrib = softmax(logits.reshape(size_x * size_y)).reshape(size_x, size_y)
joint_distrib
In [ ]:
# check that it is a valid distribution
print(joint_distrib.sum(), np.all(joint_distrib >= 0))

From the joint distribution parameters, compute:

  • the parameters of the marginal distribution p(y)
  • the parameters of the conditional distribution p(x | y)
  • transform p(y) and p(x | y) to get back the joint distribution parameters, i.e. joint_distrib2
  • check that joint_distrib and joint_distrib2 are (almost) equal (use np.allclose)

(check the documentation of the numpy sum function!!!)

In [ ]:
todo todo todo
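
One possible solution sketch (using the axis argument of sum and broadcasting):

In [ ]:
# p(y): sum over x; p(x | y): divide each column by p(y)
p_y = joint_distrib.sum(axis=0)      # shape (size_y,)
p_x_given_y = joint_distrib / p_y    # broadcasting: each column is divided by p(y)
joint_distrib2 = p_x_given_y * p_y   # chain rule: p(x, y) = p(x | y) p(y)
print(np.allclose(joint_distrib, joint_distrib2))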

Generate a joint distribution where X and Y are statistically independent. How can you do that? Then do the same exercise as above. What do you observe for the distribution p(x | y)?

In [ ]:
todo todo todo
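
One way to build such a distribution (a sketch of ours): take the outer product of two marginals, so that p(x, y) = p(x) p(y) by construction.

In [ ]:
# possible construction: statistically independent X and Y
p_x = softmax(np.random.rand(size_x))
p_y = softmax(np.random.rand(size_y))
joint_indep = np.outer(p_x, p_y)  # shape (size_x, size_y)
# p(x | y) should then be equal to p(x) for every value of y
p_x_given_y = joint_indep / joint_indep.sum(axis=0)
print(np.allclose(p_x_given_y, p_x.reshape(-1, 1)))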