Probability / Statistics - TP1

The problem we will look at is sentiment analysis, i.e. sentences annotated with their expressed sentiment (negative or positive). We will work with the following probabilistic model:

  1. $p(Y)$ is a prior over sentiments, $Y \in \{0, 1\}$. We therefore model $p(Y ; \mu)$ as a Bernoulli distribution, where $\mu \in [0, 1]$ is the single parameter of the prior.
  2. $p(X | Y ; \phi)$ is a conditional distribution over the words present in a sentence, $X \in \{0, 1\}^{|V|}$ where $V$ is the vocabulary, i.e. the words of the language. We assume the observed variables are conditionally independent given the sentiment: $p(x | y ; \phi) = \prod_i p(x_i | y ; \phi)$. The probability $p(X_i = 1 | y ; \phi)$ is interpreted as "the probability that a sentence with sentiment $y$ contains the word $i \in V$". To parameterize this conditional probability distribution, we need $2 \times |V|$ parameters, one for each Bernoulli (a toy illustration follows the list).
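
As a toy illustration of this factorization (the array phi and the values below are mine, not part of the lab; phi[y, i] stores $p(X_i = 1 | Y = y)$, hence the $2 \times |V|$ parameters):

import numpy as np

# toy example with |V| = 3 and arbitrary parameter values
phi = np.array([[0.1, 0.5, 0.2],   # p(X_i = 1 | Y = 0)
                [0.6, 0.5, 0.3]])  # p(X_i = 1 | Y = 1)
x = np.array([1., 0., 1.])         # a sentence containing words 0 and 2

# p(x | y) = prod_i p(x_i | y), computed for y = 0 and y = 1 at once
p_x_given_y = np.prod(phi ** x * (1 - phi) ** (1 - x), axis=1)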

The data for this lab exercise can be downloaded here: https://archive.ics.uci.edu/ml/datasets/Sentiment+Labelled+Sentences

Report

At the end of the lab, you must also hand in a one- or two-page report, either handwritten or in PDF format (I accept ONLY PDF, no Word or other formats). In this report, explain what you did and what you needed to code, give intuitions, use diagrams, etc. And of course, include the different results and a comment on them. The questions throughout the notebook give you ideas of what to write in the report.

WARNING: do not answer the questions directly in the notebook.

I strongly advise you to write it by hand; otherwise it will take too much time...

The report and the code must be handed in by noon, by email or in person.

In [ ]:
# import libraries
import numpy as np
import string
import collections

import matplotlib.pyplot as plt
%matplotlib inline
In [ ]:
# this is the NLTK stopword list,
# see: https://gist.github.com/sebleier/554280#gistcomment-2639949
#
# stopwords are common words that may introduce noise in the probabilistic model

stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]
# remove punctuations from stopwords
stopwords = [w.lower().translate(str.maketrans('', '', string.punctuation)) for w in stopwords]

This class is used to read the data. The constructor takes several parameters. For now, don't change the way I call it, but once you have finished the code, run it with different stopword parameters (see the end of the notebook).

In [ ]:
# a class to read the dataset!
class Dataset:
    # read a dataset:
    # - path: path to the dataset
    # - stopwords: a set of stopwords, if not given, words will not be filtered
    # - stopwords_only: if set to true, then only stopwords will be selected
    #   (it is a weird concept, but we will play with that! :) )
    # - min_occ: minimum number of occurrences of a word to be taken into account
    def __init__(self, path, stopwords=None, stopwords_only=False, min_occ=2):
        self.original_sentences = []
        self.original_labels = []
        
        # not the smartest way to do this, but who cares
        word_count = collections.defaultdict(lambda: 0)
        with open(path) as inf:
            for line in inf:
                line = line.strip()
                line = line.split()

                for word in line[:-1]:
                    word = word.lower().translate(str.maketrans('', '', string.punctuation))
                    if len(word) > 0:
                        word_count[word] += 1

        with open(path) as inf:
            for line in inf:
                line = line.strip()
                line = line.split()

                sentence = list()
                for word in line[:-1]:
                    word = word.lower().translate(str.maketrans('', '', string.punctuation))
                    if len(word) > 0 and word_count[word] >= min_occ:
                        if stopwords is None:
                            sentence.append(word)
                        elif stopwords_only == False:
                            if word not in stopwords:
                                sentence.append(word)
                        else:
                            if word in stopwords:
                                sentence.append(word)
                if len(sentence) > 0:
                    self.original_sentences.append(sentence)
                    self.original_labels.append(int(line[-1]))

        # create dicts
        self.vocabulary = set()
        for sentence in self.original_sentences:
            self.vocabulary.update(sentence)

        self.id_to_word = dict() # this could be a list
        self.word_to_id = dict()
        for i, w in enumerate(self.vocabulary):
            self.id_to_word[i] = w
            self.word_to_id[w] = i
            
        # create numpy array
        # this could be a tensor of ints,
        # but we don't want to mess with operations like division
        self.X_samples = np.zeros((len(self.original_sentences), len(self.vocabulary)), dtype=float)
        self.Y_samples = np.array(self.original_labels, dtype=int)
        
        for i, sentence in enumerate(self.original_sentences):
            for word in sentence:
                j = self.word_to_id[word]
                self.X_samples[i, j] = 1

dataset = Dataset(
    "/Users/corro/corpus/probas/sentiment labelled sentences/imdb_labelled.txt",
    stopwords,
    #stopwords_only=True
)
print(dataset.X_samples.shape, dataset.Y_samples.shape)
print(len(dataset.vocabulary))
In [ ]:
print(dataset.original_sentences[0])
print(dataset.original_labels[0])
print(dataset.X_samples[0])
print(dataset.X_samples[0].sum()) # very sparse vector!
print(dataset.Y_samples[0])

Compute the parameter mu of the prior distribution over sentiments. This should be easy! It is the frequency of the positive class.

In [ ]:
# TODO
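
A minimal sketch of one possible answer (a suggestion only): mu is just the proportion of samples with a positive label.

mu = dataset.Y_samples.mean()  # fraction of samples with Y = 1
print(mu)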

Compute the parameters of the conditional distribution.

How many parameters are there? How do you store them? $p(X_i = 1 | Y = y)$ is the probability that word $i$ appears in a sentence labeled $y$. How do you compute this? (It's a ratio; it's not more complicated than for the prior.)

Parameter smoothing: remember that we don't like null probabilities, i.e. we don't want $p(X_i = 1 | y) = 0$ or $p(X_i = 0 | y) = 0$. A trick to avoid this problem is to assume that, for each word $i$, the data contains an extra "phantom" datapoint containing word $i$ and another one not containing it.

Let's take a concrete example for a Bernoulli RV: let the dataset be {1, 1}. Without parameter smoothing, we would set mu=1 and have p(X = 1) = 1 and p(X = 0) = 0. With parameter smoothing, we artificially assume the data is actually {1, 1, 1, 0} and set mu=3/4 to have p(X = 1) = 3/4 and p(X = 0) = 1/4.

Of course you don't want to add these datapoints directly to the data. How do you update the estimation rule for the parameter?

In [ ]:
# separate the dataset into two,
# i.e. create one tensor with the negative datapoints
# and one with the positive datapoints
data_negative = dataset.X_samples[dataset.Y_samples == 0]
data_positive = dataset.X_samples[dataset.Y_samples == 1]

# create the array that will contain the conditional distribution parameters
conditional_distribution = None  # TODO

# TODO compute the probabilities!
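
For reference, a minimal sketch of the smoothed estimate described above (this (2, |V|) layout and the names are only one possible choice):

# row y holds p(X_i = 1 | Y = y), with the two phantom datapoints included
conditional_distribution = np.zeros((2, len(dataset.vocabulary)))
conditional_distribution[0] = (data_negative.sum(axis=0) + 1) / (data_negative.shape[0] + 2)
conditional_distribution[1] = (data_positive.sum(axis=0) + 1) / (data_positive.shape[0] + 2)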
In [ ]:
# TODO
# check that all your parameters are between 0 and 1
# (0 and 1 not included! we don't want null probabilities!)
# you will need np.all and np.logical_and

TODO
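
One possible check (a sketch, assuming the conditional_distribution array from the previous sketch):

in_open_interval = np.logical_and(conditional_distribution > 0, conditional_distribution < 1)
print(np.all(in_open_interval))  # should print True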

Prior classification

For the moment, we will only look at the prior and we ask ourselves the following questions:

  • What is the entropy of the prior? How do you compute it? What can you say about it?
  • If we assigned to each sample of the dataset the most probable class under the prior distribution, what would be the error rate?

To answer this, code the computation!

In [ ]:
# entropy of the prior
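# one possible sketch (assuming the mu computed above): entropy of a Bernoulli, in nats
prior_entropy = -(mu * np.log(mu) + (1 - mu) * np.log(1 - mu))
print(prior_entropy)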
In [ ]:
# TODO
# compute the classification error rate if we always assign
# to each sentence the most probable class given the prior distribution
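# a possible sketch (assuming the mu computed above): always predict the majority class
majority_class = int(mu >= 0.5)
prior_error_rate = np.mean(dataset.Y_samples != majority_class)
print(prior_error_rate)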

Compute the posterior distribution

We now need to compute the posterior distribution $p(y | x)$. This is a Bernoulli distribution, so it has a single parameter $\nu \in [0, 1]$ for a given x. We first try to compute it for a single datapoint.

To this end, use Bayes' rule! However, if you try to implement Bayes' rule directly, it will fail! Why? Because when you compute $p(x | y) = \prod_i p(x_i | y)$ you multiply many small values together, so the result will eventually underflow to 0.

To avoid this problem, it is common to work in the log domain. That is, instead of computing $p(x | y) = \prod_i p(x_i | y)$, you compute $\log p(x | y) = \sum_i \log p(x_i | y)$. Compute everything in the log domain, and at the end you can do posterior_param = np.exp(log_posterior_param).

This is a bit difficult... take your time. Remember how the Bernoulli distribution is defined, i.e. what $p(X_i = 1 | y)$ and $p(X_i = 0 | y)$ are, and think about how you can compute $\log p(x | y)$ from this.

Tip: the datapoint x is a "boolean" vector (with values one and zero), so you can use it as a mask, and use (1-x) as a mask too. This will help! :) If you don't get this, you can also just use standard Python loops.

In [ ]:
# p(y | x) = p(y) p(x | y) / sum_y p(y) p(x | y)
x = dataset.X_samples[0]

# log p(x | Y = 0)
# TODO
# log p(x | Y = 1)
# TODO
# log p(Y = 1 | x)

# then the posterior param is just exp(log p(Y = 1 | x))
posterior_param = np.exp(log_posterior_param)
posterior_param
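
For reference, here is a minimal sketch of the full computation in the log domain (assuming the mu and conditional_distribution arrays sketched earlier; all intermediate names are mine):

log_prior = np.log(np.array([1 - mu, mu]))                 # log p(Y = 0), log p(Y = 1)
log_phi = np.log(conditional_distribution)                 # log p(X_i = 1 | y)
log_one_minus_phi = np.log(1 - conditional_distribution)   # log p(X_i = 0 | y)

# log p(x | y) = sum_i [ x_i log p(X_i = 1 | y) + (1 - x_i) log p(X_i = 0 | y) ], for y = 0 and y = 1
log_likelihood = (log_phi * x + log_one_minus_phi * (1 - x)).sum(axis=1)

# unnormalized log posterior, then normalization with a log-sum-exp
log_joint = log_prior + log_likelihood
log_posterior_param = log_joint[1] - np.logaddexp(log_joint[0], log_joint[1])
posterior_param = np.exp(log_posterior_param)              # p(Y = 1 | x)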

Posterior classification

For the moment, we will only look at the posterior and we ask ourselves the same questions:

  • What is the conditional entropy of the posterior distribution? How do you compute it? What can you say about it? Remember that we are interested in $H[p(Y | X)]$, which can be seen as an expectation that we estimate with a Monte Carlo method: the samples for the MC method are just the datapoints!
  • If we assigned to each sample of the dataset the most probable class under the posterior distribution, what would be the error rate?

To answer this, code the computation!

In [ ]:
# TODO
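
For reference, one possible sketch, vectorized over the whole dataset (assuming mu and the conditional_distribution array sketched earlier; all names are mine):

log_prior = np.log(np.array([1 - mu, mu]))
log_phi = np.log(conditional_distribution)
log_one_minus_phi = np.log(1 - conditional_distribution)

# log p(x | y) for every sentence and both sentiments: shape (n_samples, 2)
log_likelihood = dataset.X_samples @ log_phi.T + (1 - dataset.X_samples) @ log_one_minus_phi.T
log_joint = log_prior + log_likelihood
log_normalizer = np.logaddexp(log_joint[:, 0], log_joint[:, 1])
log_posterior = log_joint - log_normalizer[:, None]        # log p(y | x)
posterior = np.exp(log_posterior)

# Monte Carlo estimate of H[Y | X]: average entropy of p(Y | x) over the datapoints
conditional_entropy = -np.mean(np.sum(posterior * log_posterior, axis=1))

# classification error rate when predicting the most probable class under the posterior
predictions = np.argmax(posterior, axis=1)
posterior_error_rate = np.mean(predictions != dataset.Y_samples)
print(conditional_entropy, posterior_error_rate)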

Data preprocessing

It's almost over! Now you can compare the results in three different settings:

1. just read the dataset as it is (words occurring a single time are deleted, but that's all)
dataset = Dataset( "data/imdb_labelled.txt" )

2. remove stopwords
dataset = Dataset( "data/imdb_labelled.txt", stopwords )

3. use only stopwords!!!
dataset = Dataset( "data/imdb_labelled.txt", stopwords, stopwords_only=True )