Do you really know the linear classifier basics?

Let’s talk about a simple linear classifier: $$f(x_i, W, b) = Wx_i + b$$

  1. What does one row of the weight matrix represent?

    Each row of the weight matrix acts as a classifier for one particular class; equivalently, each row is a template for that class. In this sense, a linear classifier can be interpreted as template matching: the score of a class is the dot product between its template (row) and the input.
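
As a quick illustration, here is a minimal NumPy sketch (the shapes and values are made up, not tied to any dataset) showing that each class score is the dot product of the corresponding row of $W$ with the input, plus that class's bias:

```python
import numpy as np

# Toy setup: 3 classes, 4-dimensional inputs (illustrative shapes only).
np.random.seed(0)
W = np.random.randn(3, 4)   # each row is the "template" for one class
b = np.random.randn(3)      # one bias per class
x = np.random.randn(4)      # a single input example

scores = W.dot(x) + b       # shape (3,): one score per class

# The score of class k is just the dot product of row k with x, plus b[k].
for k in range(3):
    assert np.isclose(scores[k], W[k].dot(x) + b[k])
```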

  2. If the (binary) linear classifier is a line in a 2D plane, what do the weight matrix and the bias control?
    Changing the weight matrix rotates the line in the 2D plane; changing the bias translates it. Both the weights and the bias are learnable.
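
A minimal sketch of this with made-up numbers: writing the binary decision boundary as the set of points where $w \cdot x + b = 0$, changing $w$ changes the slope of the line while changing $b$ shifts it:

```python
import numpy as np

# The 2D decision boundary w . x + b = 0 can be written as
# y = -(w[0] * x + b) / w[1].
def boundary_y(w, b, x):
    return -(w[0] * x + b) / w[1]

xs = np.linspace(-2, 2, 5)
w, b = np.array([1.0, 1.0]), 0.0

print(boundary_y(w, b, xs))                      # baseline: a line through the origin
print(boundary_y(np.array([2.0, 1.0]), b, xs))   # changing w changes the slope (rotation)
print(boundary_y(w, 1.0, xs))                    # changing b shifts the line (translation)
```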

  3. Why do we need the bias term?
    Without the bias term, plugging in $x_i = 0$ would always give a score of zero regardless of the weights, so all decision lines would be forced to pass through the origin.

  4. What data preprocessing do you usually do before training the classifier?
    (1) Center the data by subtracting the dataset mean from every feature (e.g. for images, subtract the per-channel mean computed over all training images).
    (2) Normalize (scale) the data so that its values lie roughly in the range $[-1, 1]$.
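
A minimal sketch of these two steps in NumPy, using a fake image batch (the array shape and the scaling constant 128 are illustrative assumptions):

```python
import numpy as np

# Fake batch of 5 RGB images with pixel values in [0, 255].
X = np.random.randint(0, 256, size=(5, 32, 32, 3)).astype(np.float64)

mean = X.mean(axis=(0, 1, 2))   # per-channel mean over all images, shape (3,)
X -= mean                       # (1) zero-center the data
X /= 128.0                      # (2) scale so values lie roughly in [-1, 1]
```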

  5. Why do we want to penalize large weights using, for example, the L2 norm?
    The most appealing property is that penalizing large weights tends to improve generalization, because no single input dimension can have a very large influence on the scores all by itself. Since the L2 penalty prefers smaller, more diffuse weight vectors, the final classifier is encouraged to take all input dimensions into account in small amounts rather than rely on a few input dimensions very strongly.
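
As a sketch, the L2 penalty is typically added to the data loss as a term proportional to the sum of squared weights; the regularization strength `reg` below is a hypothetical hyperparameter:

```python
import numpy as np

np.random.seed(0)
reg = 1e-3                  # hypothetical regularization strength
W = np.random.randn(3, 4)   # toy weight matrix

# The 0.5 factor is a common convention so that the gradient is simply reg * W.
reg_loss = 0.5 * reg * np.sum(W * W)
```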

  6. What is the form of the cross-entropy loss?
    $$L_i = -\log\left( \frac{e^{f_{y_i}}}{ \sum_j e^{f_j}} \right)$$
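
A minimal NumPy sketch of this loss for a single example, using made-up scores and already applying the shift trick discussed in question 7:

```python
import numpy as np

f = np.array([1.0, -2.0, 0.5])   # class scores f_j for one example (made up)
y = 0                            # index of the correct class y_i

f = f - np.max(f)                            # stability shift (see question 7)
log_probs = f - np.log(np.sum(np.exp(f)))    # log of the softmax probabilities
L_i = -log_probs[y]                          # cross-entropy loss for this example
```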

  7. What problem might you encounter when coding a softmax function?
    Overflow or underflow: the intermediate terms can become very large because of the exponentials. Dividing large numbers can be numerically unstable, so it is important to use a normalization trick.

>>> import numpy as np
>>> f = np.array([123, 456, 789])
>>> p = np.exp(f) / np.sum(np.exp(f))
__main__:1: RuntimeWarning: overflow encountered in exp
__main__:1: RuntimeWarning: invalid value encountered in divide
>>> np.exp(f)
array([ 2.61951732e+053, 1.09215367e+198, inf])
>>> np.sum(np.exp(f))
inf

We can solve this problem by multiplying the top and bottom of the fraction by a constant $C$ and pushing it into the sum. A common choice is to set $$\log C = -\max_j f_j$$
This simply states that we should shift the values inside the vector $f$ so that the highest value is zero.

>>> f -= np.max(f)
>>> p = np.exp(f) / np.sum(np.exp(f))
>>> p
array([ 5.75274406e-290, 2.39848787e-145, 1.00000000e+000])
  8. Final challenge: can you write down the backpropagation of a 2-layer neural network?
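
One possible answer, as a minimal sketch: a 2-layer network with a ReLU hidden layer and a softmax/cross-entropy loss (the architecture, shapes, and variable names below are illustrative choices, not fixed by the question):

```python
import numpy as np

# Toy setup: batch size N, input dim D, hidden dim H, C classes.
np.random.seed(0)
N, D, H, C = 5, 4, 10, 3
X = np.random.randn(N, D)
y = np.random.randint(0, C, size=N)

W1, b1 = 0.01 * np.random.randn(D, H), np.zeros(H)
W2, b2 = 0.01 * np.random.randn(H, C), np.zeros(C)

# Forward pass
h = np.maximum(0, X.dot(W1) + b1)        # ReLU hidden layer, shape (N, H)
scores = h.dot(W2) + b2                  # class scores, shape (N, C)

# Softmax + cross-entropy loss (with the max-shift trick from question 7)
shifted = scores - scores.max(axis=1, keepdims=True)
probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
loss = -np.log(probs[np.arange(N), y]).mean()

# Backward pass (backpropagation)
dscores = probs.copy()
dscores[np.arange(N), y] -= 1
dscores /= N                             # gradient of the loss w.r.t. the scores

dW2 = h.T.dot(dscores)                   # gradient w.r.t. W2
db2 = dscores.sum(axis=0)                # gradient w.r.t. b2

dh = dscores.dot(W2.T)                   # backprop into the hidden layer
dh[h <= 0] = 0                           # backprop through the ReLU

dW1 = X.T.dot(dh)                        # gradient w.r.t. W1
db1 = dh.sum(axis=0)                     # gradient w.r.t. b1
```

The backward pass simply applies the chain rule layer by layer, reusing the activations cached during the forward pass.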