m ai.

blog

Nov 1, 2023·6 min read

Gaussian Naive Bayes

Article Cover

Gaussian Naive Bayes

This article is based on my personal lecture notes from my studies at Machine Learning for Data Science 2023/2024 course: Bayesian Learning. The content reflects my understanding and interpretation of the course material.

Objective

Given a data point x, what is the probability of x belonging to some class c?

We have C class and need to find which group of xx, instead of finding exactly the label, given a data point x, we find the probability of x belonging to some class c or find P(y=cx)P(y=c|x) or P(cx)P(c|x)

c=arg maxcP(cX)c = \argmax_c P(c|X)

We define the class of each point by choosing the class c with the highest probability

Bayes’ theorem

P(AB)=P(BA)P(A)P(B)P(A|B) = \frac{P(B|A)P(A)}{P(B)}

Naive Bayes Classifier formula

P(cx)=p(xc)P(c)p(x)=p(xc)P(c)cp(xc)p(c)P(c|x) = \frac{p(x|c)P(c)}{p(x)} = \frac{p(x|c)P(c)}{\sum_{c}p(x|c)p(c)}

or

Posterior=Likelihood×PriorEvidencePosterior = \frac{Likelihood \times Prior}{Evidence}

where

  • xx consists xijx_{ij} with i=1,ni=\overline{1,n} with nn = observation and j=1,nj=\overline{1,n} with kk is the number of features
  • P(cx)P(c|x)is Posterior Probability
  • p(xc)p(x|c) is Likelihood of feature xjx_j given that their class is T
  • P(c)P(c) is Prior Probability
  • P(x)P(x) is Marginal Probability or called Evidence

We have

c=arg maxcP(cX)=arg maxcp(xc)P(c)p(x)=arg maxcp(xc)P(c)c = \argmax_c P(c|X)=\argmax_c \frac{p(x|c)P(c)}{p(x)} = \argmax_c p(x|c)P(c)

Implementation

We choose Iris as a dataset.

from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

x, y = load_iris( return_X_y=True )

There are

  • 3 classes c1,c2,c3c_1, c_2, c_3
  • 150 observations with 4 features x1,x2,x3,x4x_1, x_2, x_3, x_4
dataset

Likelihood of each feature per class

Computing P(xc)P(x|c) depends on the type of data. There are three commonly used types: Gaussian Naive Bayes, Multinomial Naive Bayes, and Bernoulli Naive Baye

We will consider the Gaussian distribution, with two common distributions: the univariate normal distribution (or Gaussian density) and the multivariate normal distribution (or multivariate density).

Univariate normal distribution or Gaussian density

p(x)=12πσ2e((xμ)22σ2)p(x) = \frac{1}{\sqrt{2\pi \sigma^2}}e^{(-\frac{(x-\mu)^2}{2\sigma^2})}

This distribution is described by two parameters: the mean μ\mu and the variance σ2\sigma^2.

  • The value of μ\mu can be any real number, indicating the location of the peak, where the probability density function reaches its highest value.

    Example: Plot the gaussian distributions (likelihood) with the μ1\mu_1 < μ2\mu_2 and same variance

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Generate the functions' domain
x = np.linspace(0,20,100)

# Define two gaussian distributions

p1 = norm.pdf(x, 2.5, 1.5)
p2 = norm.pdf(x, 5.5, 1.5)

# Plot the gaussian distributions (likelihood)
plt.plot( x, p1, 'b')
plt.plot( x, p2, 'g')
plt.title('Likelihood distributions')
plt.ylabel('$p(x|c)$')
plt.xlabel('$x$')
plt.show()

  • The value of σ2\sigma^2 is a positive value, with σ\sigma representing the spread of this distribution. A large σ\sigma indicates a wide range of output values, and vice versa.

    Example: Plot the gaussian distributions (likelihood) with the σ1\sigma_1 < σ2\sigma_2 and same variance

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

# Generate the functions' domain
x = np.linspace(0,20,100)

# Define two gaussian distributions

p1 = norm.pdf(x, 5.5, 1.5)
p2 = norm.pdf(x, 5.5, 2.5)

# Plot the gaussian distributions (likelihood)
plt.plot( x, p1, 'b')
plt.plot( x, p2, 'g')
plt.title('Likelihood distributions')
plt.ylabel('$p(x|c)$')
plt.xlabel('$x$')
plt.show()

Multivariate normal distribution (or multivariate density)

This is the general case of the normal distribution when the variable is multidimensional, assuming it has dd dimensions.

p(x)=1(2π)d/2Σ1/2exp(12(xμ)TΣ1(xμ))p(x) = \frac{1}{(2\pi)^{d/2}|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu)\right)

where

  • xx is dd-component column vector
  • μRd\mu \in \mathbb{R}^d is dd-component mean vector
  • ΣS++D\Sigma \in \mathcal{S}^D_{++} is d×dd \times d matrix and a positive definite symmetric matrix.

Back to our example with Iris dataset, we compute the likelihood for each sample and each class.

import numpy as np
from scipy.stats import multivariate_normal

# Identify members of each class -> 4 column vector with 150 obs
cl1 = x[ y==0, :]
cl2 = x[ y==1, :]
cl3 = x[ y==2, :]

# Compute the mean (centroid) of each class -> return a 4 row vector
m1 = np.mean(cl1, axis = 0)
m2 = np.mean(cl2, axis = 0)
m3 = np.mean(cl3, axis = 0)

# Compute the covariance matrix for each class -> return a 4x4 matrix (4 features)
C1 = np.cov(cl1.T) # or using C1 = np.cov(cl1, rowvar=False)
C2 = np.cov(cl2.T)
C3 = np.cov(cl3.T)

# compute the likelihood for each sample and each class
lik1 = multivariate_normal.pdf(x, mean=m1, cov=C1)
lik2 = multivariate_normal.pdf(x, mean=m2, cov=C2)
lik3 = multivariate_normal.pdf(x, mean=m3, cov=C3)

Prior Probability

Each value P(c)P(c), where c=1,2,,Cc = 1, 2, \ldots, C, can be determined as the frequency of occurrence of class cc in the training data.

# Define the priors
P_c1 = np.size(cl1)/np.size(x) # 1/3
P_c2 = np.size(cl2)/np.size(x) # 1/3
P_c3 = np.size(cl3)/np.size(x) # 1/3

Marginal distribution or Evidence

We compute p(x)p(x) in this case of 3 classes

p(x)=c=13p(xc)p(c)=p(xc1)p(c1)+p(xc2)p(c2)+p(xc3)p(c3){p(x)} = {\sum_{c=1}^{3}p(x|c)p(c)} = p(x|c_1)p(c_1) + p(x|c_2)p(c_2) + p(x|c_3)p(c_3)
p_x = (lik1 * P_c1) + (lik2 * P_c1) + (lik3 * P_c1)

Posterior Probabilities

P(cx)=p(xc)P(c)p(x)P(c|x) = \frac{p(x|c)P(c)}{p(x)}
# Compute posterior probabilities
post1 = (lik1 * P_c1) / p_x # shape = (150,)
post2 = (lik2 * P_c2) / p_x # shape = (150,)
post3 = (lik3 * P_c3) / p_x # shape = (150,)

post = np.vstack((post1,post2,post3)) # shape = (3,150)

Making Prediction

c=arg maxc{1,,C}P(cx)=arg maxc{1,,C}i=1dp(xic)p(c)c = \argmax_{c \in \{1, \ldots, C\}}P(c|x)=\argmax_{c \in \{1, \ldots, C\}} \prod_{i=1}^{d} p(x_i|c)p(c)

When dd is large and the probabilities are small, the expression on the right-hand side of equation above becomes a very small number, leading to potential numerical inaccuracies during computation. To address this, equation above is often rewritten in an equivalent form by taking the logarithm of the right-hand side

c=arg maxc{1,,C}logP(cx)=arg maxc{1,,C}i=1dlog(p(xic))+log(p(c))c = \argmax_{c \in \{1, \ldots, C\}}logP(c|x)=\argmax_{c \in \{1, \ldots, C\}} \sum_{i=1}^{d} log(p(x_i|c)) + log(p(c))

In this case, with d=3d=3, the results of the logarithm of the posterior probabilities and the posterior probabilities are equal.

logpost = np.log(post)
prediction = np.argmax(logpost,axis=0)
accuracy = np.sum(prediction == y)/len(y)

# Classifier accuracy: 98.00%
print('Classifier accuracy: ' + "{0:.2f}".format(accuracy*100) + '%')

# Visualize confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y,prediction)
from sklearn.metrics import ConfusionMatrixDisplay
cm_display = ConfusionMatrixDisplay(cm).plot()

Reference

  1. Duda, R. O., Hart, P. E., & Stork, D. G. (2001). Chapter 2: Bayesian decision theory. In Pattern Classification. Wiley.
  2. Naive Bayes Classifier Implementation. GitHub.
  3. Chapter 11: Naive Bayes Classifier. Github
  4. Understanding by Implementing: Gaussian Naive Bayes. Towards Data Science.