Machine Learning (Coursera) Notes – Week 3

01 September 2020


My notes from the Machine Learning Course provided by Andrew Ng on Coursera

Logistic Regression

Example of classification problem:

  • Email: Spam/Not spam?
  • Online transaction: Fraudulent (Yes/No)?
  • Tumour: Malignant/Benign?

$y \in \{0,1\}$, where $0$ is the "negative class" and $1$ the "positive class"

Multiclass classification problem: $y \in \{0,1,2,3,...\}$

How to develop a classification algorithm?

Example of a training set for a classification task: *(figure: ClassificationExampleTask)*

Applying the linear regression algorithm $h_\theta(x) = \theta^T x$ to this data set could produce a fit like the one below:


For prediction, it could be useful to apply a threshold:

Threshold the classifier output $h_\theta(x)$ at $0.5$ (the vertical-axis value):

  • If $h_\theta(x) \geqslant 0.5$, predict "$y=1$"
  • If $h_\theta(x) < 0.5$, predict "$y=0$"


What happens if the problem changes a bit?


Adding one positive example causes linear regression, which fits a straight line, to change its fit to the new data. In this example, that produces a worse hypothesis.

Conclusion: applying linear regression to a classification problem often isn't a good idea.

Another odd consequence of using linear regression on a classification problem: in classification, $y=0$ or $y=1$, but $h_\theta(x)$ can be $>1$ or $<0$. So even when we know the label should be $0$ or $1$, linear regression can come back with output values much larger than $1$ or much smaller than $0$.

Logistic Regression: $0 \leqslant h_\theta(x) \leqslant 1$ – despite the name, it's a classification algorithm.

Hypothesis Representation

Logistic Regression Model

Want $0 \leqslant h_\theta(x) \leqslant 1$

$h_\theta(x) = g(\theta^T x)$

$g(z) = \frac{1}{1 + e^{-z}}$ (two different names for the same function: sigmoid function and logistic function)

$$\left.\begin{matrix} h_\theta(x) = g(\theta^T x) \\ g(z) = \frac{1}{1 + e^{-z}} \end{matrix}\right\} \Rightarrow h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$

$z$ – a real number

Sigmoid function


Parameters: $\theta$
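The sigmoid and the hypothesis can be sketched in a few lines of Python (the function names `sigmoid` and `h` are mine, not from the course code). Note that $g(0) = 0.5$ and the output always stays strictly between $0$ and $1$:

```python
import math

def sigmoid(z):
    """Logistic (sigmoid) function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def h(theta, x):
    """Hypothesis h_theta(x) = g(theta^T x)."""
    return sigmoid(sum(t * xi for t, xi in zip(theta, x)))

print(sigmoid(0))    # 0.5, the decision threshold
print(sigmoid(10))   # close to 1
print(sigmoid(-10))  # close to 0
```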

Interpretation of Hypothesis Output

$h_\theta(x)$ – estimated probability that $y=1$ on input $x$


if $x = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \begin{bmatrix} 1 \\ \text{tumor size} \end{bmatrix}$

and $h_\theta(x) = 0.7$: tell the patient there is a 70% chance of the tumor being malignant.

$h_\theta(x) = P(y=1|x;\theta)$ – the probability that $y=1$, given $x$, parameterized by $\theta$

Because this is a classification problem, $y$ can take only two values, $0$ or $1$, so:

$P(y=0|x;\theta) + P(y=1|x;\theta) = 1$

$P(y=0|x;\theta) = 1 - P(y=1|x;\theta)$

Decision Boundary

Logistic Regression:

$h_\theta(x) = g(\theta^T x)$

$g(z) = \frac{1}{1 + e^{-z}}$

Predict "$y=1$" if $h_\theta(x) \geqslant 0.5$

Predict "$y=0$" if $h_\theta(x) < 0.5$


$g(z) \geqslant 0.5$ when $z \geqslant 0$

so $h_\theta(x) = g(\theta^T x) \geqslant 0.5$ whenever $\theta^T x \geqslant 0$

Training set


$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2)$

Based on the plot: $h_\theta(x) = g(-3 + 1 \cdot x_1 + 1 \cdot x_2)$

so: $\theta = \begin{bmatrix} -3 \\ 1 \\ 1 \end{bmatrix}$

Predict "$y=1$" if $-3 + x_1 + x_2 \geqslant 0$, i.e.:

$x_1 + x_2 \geqslant 3$ – the decision boundary is a straight line



$\theta_0 = 5;\ \theta_1 = -1;\ \theta_2 = 0$

$h_\theta(x) = g(5 - x_1)$

so: predict "$y=1$" if $5 - x_1 \geqslant 0$, i.e. $x_1 \leqslant 5$
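Both linear boundaries can be sanity-checked in a short Python sketch (the `predict` helper is illustrative, not course code), using the rule that we predict $y=1$ exactly when $\theta^T x \geqslant 0$:

```python
def predict(theta, x):
    """Predict y=1 iff theta^T x >= 0 (equivalently, g(theta^T x) >= 0.5)."""
    z = sum(t * xi for t, xi in zip(theta, x))
    return 1 if z >= 0 else 0

# theta = [-3, 1, 1]: the boundary is the line x1 + x2 = 3
print(predict([-3, 1, 1], [1, 3, 1]))  # x1 + x2 = 4 >= 3 -> 1
print(predict([-3, 1, 1], [1, 1, 1]))  # x1 + x2 = 2 <  3 -> 0

# theta = [5, -1, 0]: the boundary is the vertical line x1 = 5
print(predict([5, -1, 0], [1, 2, 7]))  # x1 = 2 <= 5 -> 1
print(predict([5, -1, 0], [1, 8, 0]))  # x1 = 8 >  5 -> 0
```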


Non-linear decision boundaries


$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2)$

$\theta = \begin{bmatrix} -1 \\ 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}$

Predict "$y=1$" if $-1 + x_1^2 + x_2^2 \geqslant 0$

The decision boundary is the circle $x_1^2 + x_2^2 = 1$
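For this $\theta$, the classifier predicts $y=1$ outside the unit circle and $y=0$ inside it; a minimal Python sketch (the helper name is mine):

```python
def predict_circle(x1, x2):
    """theta = [-1, 0, 0, 1, 1] over features [1, x1, x2, x1^2, x2^2]:
    predict y=1 iff -1 + x1^2 + x2^2 >= 0, i.e. on or outside the unit circle."""
    return 1 if -1 + x1**2 + x2**2 >= 0 else 0

print(predict_circle(0, 0))  # inside the circle  -> 0
print(predict_circle(2, 0))  # outside the circle -> 1
print(predict_circle(1, 0))  # on the boundary    -> 1
```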


More complex decision boundaries

$h_\theta(x) = g(\theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_1^2 + \theta_4 x_2^2 + \theta_5 x_1^2 x_2^2 + \theta_6 x_1^3 x_2 + ...)$


Cost Function

Supervised learning problem:

Training set: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), ..., (x^{(m)}, y^{(m)})\}$

$m$ examples:

$x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix} \in \mathbb{R}^{n+1}$, $x_0 = 1$, $y \in \{0,1\}$

$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$

How to choose the parameters $\theta$?

Cost function

Linear regression: $J(\theta) = \frac{1}{m} \displaystyle\sum_{i=1}^m \frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

An alternative way of writing this function:

$J(\theta) = \frac{1}{m} \displaystyle\sum_{i=1}^m \mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right)$

so: $\mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right) = \frac{1}{2}\left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$

This function is fine for linear regression.

For logistic regression, however, this would be a non-convex function of the parameters $\theta$ (because of the non-linear sigmoid inside the square), so gradient descent would not be guaranteed to reach the global minimum.

Example: *(figure: convexVsNonConvex)*

Logistic regression cost function

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$


$\mathrm{Cost} = 0$ if $y = 1$ and $h_\theta(x) = 1$

But as $h_\theta(x) \to 0$, $\mathrm{Cost} \to \infty$

This captures the intuition that if $h_\theta(x) = 0$ (predicting $P(y=1|x;\theta) = 0$) but in fact $y = 1$, we penalize the learning algorithm with a very large cost.


Simplified cost function and gradient descent

Logistic regression cost function

$J(\theta) = \frac{1}{m} \displaystyle\sum_{i=1}^m \mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right)$

$$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_\theta(x)) & \text{if } y = 1 \\ -\log(1 - h_\theta(x)) & \text{if } y = 0 \end{cases}$$

$y$ is always $0$ or $1$, so the two cases can be written compactly in one line:

$\mathrm{Cost}(h_\theta(x), y) = -y \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$

if $y = 1$: $\mathrm{Cost}(h_\theta(x), y) = -\log(h_\theta(x))$

if $y = 0$: $\mathrm{Cost}(h_\theta(x), y) = -\log(1 - h_\theta(x))$
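A quick numeric check (an illustrative sketch, not course code) that the one-line form reproduces both cases of the piecewise definition:

```python
import math

def cost_piecewise(h, y):
    """-log(h) if y = 1, -log(1 - h) if y = 0."""
    return -math.log(h) if y == 1 else -math.log(1 - h)

def cost_compact(h, y):
    """-y*log(h) - (1 - y)*log(1 - h): one line covering both cases."""
    return -y * math.log(h) - (1 - y) * math.log(1 - h)

# The two forms agree for every (h, y) pair
for h in (0.1, 0.5, 0.9):
    for y in (0, 1):
        assert abs(cost_piecewise(h, y) - cost_compact(h, y)) < 1e-12

print(cost_compact(0.9, 1))  # small cost: confident and correct
print(cost_compact(0.9, 0))  # large cost: confident but wrong
```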

Logistic regression cost function

$J(\theta) = \frac{1}{m} \displaystyle\sum_{i=1}^m \mathrm{Cost}\left(h_\theta(x^{(i)}), y^{(i)}\right) = -\frac{1}{m} \left[ \displaystyle\sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$

To fit the parameters $\theta$: $\min_\theta J(\theta)$

To make a prediction given a new $x$:

Output $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$

Gradient Descent

$J(\theta) = -\frac{1}{m} \left[ \displaystyle\sum_{i=1}^m y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$

Want $\min_\theta J(\theta)$:

Repeat {
$\quad \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$
} (simultaneously update every $\theta_j$)

where $\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \displaystyle\sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$


Plugging in the derivative:

Repeat {
$\quad \theta_j := \theta_j - \frac{\alpha}{m} \displaystyle\sum_{i=1}^m \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)}$
} (simultaneously update all $\theta_j$)

$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}$, $h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$

The algorithm looks identical to the one for linear regression, but $h_\theta(x)$ has a different definition.
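The update rule can be sketched as a plain-Python training loop; the tiny 1-D data set and learning rate below are made up for illustration, not from the course:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Tiny illustrative data set: x = [x0, x1] with intercept term x0 = 1.
# Examples with x1 <= 2 are negative, x1 >= 3 are positive.
X = [[1, 1], [1, 2], [1, 3], [1, 4]]
y = [0, 0, 1, 1]
m = len(y)

theta = [0.0, 0.0]
alpha = 0.5

for _ in range(5000):
    # h_theta(x^(i)) for every training example
    h = [sigmoid(sum(t * xi for t, xi in zip(theta, x))) for x in X]
    # Gradient: (1/m) * sum_i (h_i - y_i) * x_j^(i), for each j
    grad = [sum((h[i] - y[i]) * X[i][j] for i in range(m)) / m for j in range(2)]
    # Simultaneous update of all theta_j
    theta = [theta[j] - alpha * grad[j] for j in range(2)]

print(theta)  # theta_1 ends up positive; the boundary sits between x1 = 2 and 3
print([round(sigmoid(theta[0] + theta[1] * x1)) for x1 in (1, 2, 3, 4)])
```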

Advanced Optimization

Optimization algorithms

Cost function $J(\theta)$. Want $\min_\theta J(\theta)$.

Given $\theta$, we already have code that can compute:

  • $J(\theta)$
  • $\frac{\partial}{\partial \theta_j} J(\theta)$ (for $j = 0, 1, ..., n$)

Gradient descent

Repeat {
$\quad \theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)$
}

Other optimization algorithms (besides gradient descent):

  • Conjugate gradient
  • BFGS
  • L-BFGS

Advantages of using those algorithms:

  • No need to manually pick $\alpha$

    • they have a line search algorithm that automatically tries different values of the learning rate $\alpha$ and picks a good one
  • Often faster than gradient descent

Disadvantages of using those algorithms:

  • More complex and advanced

Example: $\theta = \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}$

$\min_\theta J(\theta) \implies \theta_1 = 5,\ \theta_2 = 5$

$J(\theta) = (\theta_1 - 5)^2 + (\theta_2 - 5)^2$ (cost function)

$\frac{\partial}{\partial \theta_1} J(\theta) = 2(\theta_1 - 5)$

$\frac{\partial}{\partial \theta_2} J(\theta) = 2(\theta_2 - 5)$

Check in Octave:

>> options = optimset('GradObj', 'on', 'MaxIter', 100);
>> initialTheta = zeros(2,1);
>> function [jVal, gradient] = costFunction(theta)
     jVal = (theta(1)-5)^2 + (theta(2)-5)^2;
     gradient = zeros(2,1);
     gradient(1) = 2*(theta(1)-5);
     gradient(2) = 2*(theta(2)-5);
   endfunction
>> [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options)
optTheta =

   5.0000
   5.0000

functionVal = 1.5777e-30
exitFlag = 1

`fminunc` – advanced optimization function in Octave (finds the minimum of an unconstrained function)
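The same toy problem can be checked in Python without `fminunc`; this sketch uses plain gradient descent (not a line-search method like the algorithms above), so it only approaches the minimum at $\theta_1 = \theta_2 = 5$:

```python
def cost(theta):
    """J(theta) = (theta1 - 5)^2 + (theta2 - 5)^2."""
    return (theta[0] - 5) ** 2 + (theta[1] - 5) ** 2

def gradient(theta):
    """Partial derivatives: dJ/dtheta_j = 2 * (theta_j - 5)."""
    return [2 * (theta[0] - 5), 2 * (theta[1] - 5)]

theta = [0.0, 0.0]
alpha = 0.1
for _ in range(200):
    g = gradient(theta)
    theta = [theta[0] - alpha * g[0], theta[1] - alpha * g[1]]

print(theta)        # both components approach 5
print(cost(theta))  # approaches 0
```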

Multi-class classification: ONE-VS-ALL

Multiclass classification


  • Email foldering/tagging: Work ($y=1$), Friends ($y=2$), Family ($y=3$), Hobby ($y=4$)
  • Medical diagnosis: Not ill ($y=1$), Cold ($y=2$), Flu ($y=3$)
  • Weather: Sunny ($y=1$), Cloudy ($y=2$), Rain ($y=3$), Snow ($y=4$)

Binary classification


Multi-class classification


Question: How do we get a learning algorithm to work for this setting?

One-vs-all (one-vs-rest)


Step 1: Creating a new training set to fit the classifier $h_\theta^{(1)}(x)$


Step 2: Creating a new training set to fit the classifier $h_\theta^{(2)}(x)$


Step 3: Creating a new training set to fit the classifier $h_\theta^{(3)}(x)$


We fit three classifiers, each estimating the probability that the input belongs to one of the three classes:

$h_\theta^{(i)}(x) = P(y=i|x;\theta)$


Train a logistic regression classifier $h_\theta^{(i)}(x)$ for each class $i$ to predict the probability that $y=i$. On a new input $x$, to make a prediction, pick the class $i$ that maximizes $\max_i h_\theta^{(i)}(x)$.
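The one-vs-all prediction step can be sketched in Python; the three parameter vectors below are made up for illustration (they are not trained on real data):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_one_vs_all(thetas, x):
    """Given one trained parameter vector per class, pick the class i
    whose classifier h^(i)(x) = g(theta^(i)^T x) is largest."""
    probs = [sigmoid(sum(t * xi for t, xi in zip(theta, x)))
             for theta in thetas]
    return max(range(len(probs)), key=lambda i: probs[i])

# Three illustrative classifiers over features [1, x1, x2]
thetas = [
    [1.0, -2.0, -2.0],  # class 0: fires near the origin
    [-3.0, 2.0, 0.0],   # class 1: fires for large x1
    [-3.0, 0.0, 2.0],   # class 2: fires for large x2
]

print(predict_one_vs_all(thetas, [1, 0, 0]))  # -> 0
print(predict_one_vs_all(thetas, [1, 5, 0]))  # -> 1
print(predict_one_vs_all(thetas, [1, 0, 5]))  # -> 2
```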