Machine Learning Coursera Notes – Week 3

01 September 2020

ml

My notes from the Machine Learning course taught by Andrew Ng on Coursera.

Logistic Regression

Examples of classification problems:

  • Email: Spam/Not spam?
  • Online transaction: Fraudulent (Yes/No)?
  • Tumour: Malignant/Benign?

$y \in \{0,1\}$, where $0$ is the "negative class" and $1$ is the "positive class".

Multiclass classification problem: $y \in \{0,1,2,3,...\}$

How to develop a classification algorithm?

Example of a training set for a classification task: (figure: classification example training set)

Applying the linear regression algorithm $h_{\theta}(x)=\theta^Tx$ to this data set could end up like below:

(figure: linear regression fit to the classification training set)

For prediction, it could be useful to threshold the output:

Threshold the classifier output $h_{\theta}(x)$ at $0.5$ (vertical axis value):

  • If $h_{\theta}(x) \geqslant 0.5$, predict "$y=1$"
  • If $h_{\theta}(x) < 0.5$, predict "$y=0$"

(figure: linear regression fit with the 0.5 threshold marked)
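A minimal Octave sketch of this thresholding rule (the values in h are made up for illustration; h stands for the hypothesis outputs $h_{\theta}(x)$):

% hypothetical hypothesis outputs h_theta(x) for four examples
h = [0.2; 0.6; 0.45; 0.9];

% threshold at 0.5: predict y = 1 when h >= 0.5, otherwise y = 0
predictions = (h >= 0.5)    % gives [0; 1; 0; 1]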

What happens if the problem changes a bit?

(figure: linear regression fit and threshold after adding an extra positive example)

Adding one positive example causes linear regression to change its straight-line fit to accommodate the new data. In this example, the result is a worse hypothesis.

Conclusion: applying linear regression to a classification problem often isn't a good idea.

Another odd aspect of using linear regression for a classification problem: in classification $y=0$ or $y=1$, but $h_{\theta}(x)$ can be $>1$ or $<0$. So even if we know the label should be $0$ or $1$, linear regression can come back with output values much larger than $1$ or much smaller than $0$.

Logistic Regression: $0 \leqslant h_{\theta}(x) \leqslant 1$ - it is a classification algorithm.

Hypothesis Representation

Logistic Regression Model

Want $0 \leqslant h_{\theta}(x) \leqslant 1$

$h_{\theta}(x) = g(\theta^Tx)$

$g(z) = \frac{1}{1 + e^{-z}}$ (two different names for the same function: sigmoid function and logistic function)

$\begin{rcases} h_{\theta}(x) = g(\theta^Tx) \\ g(z) = \frac{1}{1 + e^{-z}} \end{rcases} \Rightarrow h_{\theta}(x) = \frac{1}{1 + e^{-\theta^Tx}}$

$z$ - a real number

Sigmoid function

(figure: sigmoid function)
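A minimal Octave sketch of the sigmoid and the resulting hypothesis (the helper names sigmoid and hypothesis are mine, not from the course code):

% sigmoid (logistic) function g(z) = 1 / (1 + e^(-z)), applied element-wise
function g = sigmoid(z)
  g = 1 ./ (1 + exp(-z));
endfunction

% hypothesis h_theta(x) = g(theta' * x) for a single example x (column vector)
function h = hypothesis(theta, x)
  h = sigmoid(theta' * x);
endfunction

% g(0) = 0.5; large positive z gives values close to 1, large negative close to 0
sigmoid([-10 0 10])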

Parameters: $\theta$

Interpretation of Hypothesis Output

$h_{\theta}(x)$ - the estimated probability that $y=1$ on input $x$

Example:

$\text{if } x = \begin{bmatrix} x_0 \\ x_1 \end{bmatrix} = \begin{bmatrix} 1 \\ \text{tumor size} \end{bmatrix}$

$h_{\theta}(x) = 0.7$: tell the patient there is a 70% chance of the tumor being malignant.

$h_{\theta}(x) = P(y=1|x;\theta)$ - the probability that $y=1$, given $x$, parameterized by $\theta$

Because this is a classification problem, $y$ can take only two values: $0$ or $1$.

$P(y=0|x;\theta) + P(y=1|x;\theta) = 1$

$P(y=0|x;\theta) = 1 - P(y=1|x;\theta)$
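For the tumor example above: $P(y=0|x;\theta) = 1 - 0.7 = 0.3$, i.e. a 30% chance of the tumor being benign.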

Decision Boundary

Logistic Regression:

$h_{\theta}(x) = g(\theta^Tx)$

$g(z) = \frac{1}{1 + e^{-z}}$

Predict "y=1""y=1" if hθ(x)0.5h_{\theta}(x) \geqslant 0.5

Predict "y=0""y=0" if hθ(x)<0.5h_{\theta}(x) < 0.5

(figure: decision boundary)

$g(z) \geqslant 0.5$ when $z \geqslant 0$

$h_{\theta}(x) = g(\theta^Tx) \geqslant 0.5$ whenever $\theta^Tx \geqslant 0$

Training set

(figure: training set with features $x_1$ and $x_2$)

$h_{\theta}(x) = g(\theta_0 + \theta_1x_1 + \theta_2x_2)$

Based on the plot: $h_{\theta}(x) = g(-3 + 1 \cdot x_1 + 1 \cdot x_2)$

so: $\theta = \begin{bmatrix} -3 \\ 1 \\ 1 \end{bmatrix}$

Predict: "y=1""y=1" if 3+x1+x20-3 + x_1 + x_2 \geqslant 0

3+x1+x20-3 + x_1 + x_2 \geqslant 0

x1+x23x_1 + x_2 \geqslant 3 - will create a sraight line

(figure: training set with its straight-line decision boundary)
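A small Octave sketch of this prediction rule with $\theta = \begin{bmatrix} -3 \\ 1 \\ 1 \end{bmatrix}$ (the example point is made up for illustration):

theta = [-3; 1; 1];              % parameters read off the plot above
x = [1; 2.5; 1.5];               % made-up example: x0 = 1, x1 = 2.5, x2 = 1.5
h = 1 / (1 + exp(-theta' * x));  % h_theta(x) = g(theta' * x), about 0.73 here
prediction = (theta' * x >= 0)   % predict y = 1 iff -3 + x1 + x2 >= 0, so 1 here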

Exercise:

$\theta_0=5;\ \theta_1=-1;\ \theta_2=0$

$h_{\theta}(x) = g(5-x_1)$

so: predict "$y=1$" if $5 - x_1 \geqslant 0$, i.e. $x_1 \leqslant 5$

(figure: decision boundary for the exercise)

Non-linear decision boundaries

(figure: training set that needs a non-linear decision boundary)

$h_{\theta}(x) = g(\theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_1^2 + \theta_4x_2^2)$

$\theta = \begin{bmatrix} -1 \\ 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}$

Predict "y=1""y=1" if 1+x12+x220-1 + x_1^2 + x_2^2 \geqslant 0

The decision boundary is the circle $x_1^2 + x_2^2 = 1$.

(figure: circular decision boundary $x_1^2 + x_2^2 = 1$)
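A short Octave sketch of this non-linear boundary with $\theta = \begin{bmatrix} -1 \\ 0 \\ 0 \\ 1 \\ 1 \end{bmatrix}$ (the test points are made up for illustration):

theta = [-1; 0; 0; 1; 1];
% feature vector [1; x1; x2; x1^2; x2^2] for a point (x1, x2)
features = @(x1, x2) [1; x1; x2; x1^2; x2^2];

% point inside the unit circle: -1 + 0.25 + 0.25 = -0.5 < 0, so predict y = 0
theta' * features(0.5, 0.5) >= 0

% point outside the unit circle: -1 + 2.25 + 0 = 1.25 >= 0, so predict y = 1
theta' * features(1.5, 0) >= 0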

More complex decision boundaries

$h_{\theta}(x) = g(\theta_0 + \theta_1x_1 + \theta_2x_2 + \theta_3x_1^2 + \theta_4x_2^2 + \theta_5x_1^2x_2^2 + \theta_6x_1^3x_2 + ...)$

(figure: a more complex non-linear decision boundary)

Cost Function

Supervised learning problem:

Training set: $\{(x^{(1)}, y^{(1)}), (x^{(2)}, y^{(2)}), \dots, (x^{(m)}, y^{(m)})\}$

$m$ examples:

$x = \begin{bmatrix} x_0 \\ x_1 \\ \vdots \\ x_n \end{bmatrix}$, $x_0 = 1$, $y \in \{0,1\}$

$h_{\theta}(x) = \frac{1}{1 + e^{-\theta^Tx}}$

How to choose the parameters $\theta$?

Cost function

Linear regression: $J(\theta)=\frac{1}{m} \displaystyle\sum_{i=1}^m \frac{1}{2} (h_\theta (x^{(i)})-y^{(i)})^2$

An alternative way of writing this function:

$J(\theta)=\frac{1}{m} \displaystyle\sum_{i=1}^m \mathrm{Cost}(h_\theta (x^{(i)}), y^{(i)})$

so: $\mathrm{Cost}(h_\theta (x^{(i)}), y^{(i)}) = \frac{1}{2} (h_\theta (x^{(i)})-y^{(i)})^2$

This function is fine for linear regression.

For logistic regression, however, this would be a non-convex function of the parameters $\theta$ (because $h_{\theta}$ is non-linear), so gradient descent would not be guaranteed to find the global minimum.

Example: (figure: convex vs non-convex function)

Logistic regression cost function

$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_{\theta}(x)) & \text{if } y = 1 \\ -\log(1 - h_{\theta}(x)) & \text{if } y = 0 \end{cases}$

(figure: cost as a function of $h_{\theta}(x)$ when $y = 1$)

$\mathrm{Cost} = 0$ if $y = 1$ and $h_{\theta}(x) = 1$

But as $h_{\theta}(x) \to 0$, $\mathrm{Cost} \to \infty$.

This captures the intuition that if $h_{\theta}(x) = 0$ (predicting $P(y=1|x;\theta)=0$) but actually $y = 1$, we penalize the learning algorithm with a very large cost.

(figure: cost as a function of $h_{\theta}(x)$ when $y = 0$)
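To see the penalty numerically, a tiny Octave sketch of the per-example cost (the values of h are chosen only for illustration):

% per-example cost: -log(h) when y = 1, and -log(1 - h) when y = 0
h = 0.99;  -log(h)      % y = 1 with a confident, correct prediction: cost about 0.01
h = 0.01;  -log(h)      % y = 1 but h close to 0: cost about 4.6, a heavy penalty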

Simplified cost function and gradient descent

Logistic regression cost function

$J(\theta)=\frac{1}{m} \displaystyle\sum_{i=1}^m \mathrm{Cost}(h_\theta (x^{(i)}), y^{(i)})$

$\mathrm{Cost}(h_\theta(x), y) = \begin{cases} -\log(h_{\theta}(x)) & \text{if } y = 1 \\ -\log(1 - h_{\theta}(x)) & \text{if } y = 0 \end{cases}$

$y$ is always $0$ or $1$

$\mathrm{Cost}(h_\theta (x),y) = -y\log(h_{\theta}(x)) - (1-y)\log(1-h_{\theta}(x))$ (a compact way to write it in one line)

if $y=1$: $\mathrm{Cost}(h_\theta (x),y) = -\log(h_{\theta}(x))$

if $y=0$: $\mathrm{Cost}(h_\theta (x),y) = -\log(1 - h_{\theta}(x))$

Logistic regression cost function

$J(\theta)=\frac{1}{m} \displaystyle\sum_{i=1}^m \mathrm{Cost}(h_\theta (x^{(i)}), y^{(i)}) = -\frac{1}{m} \left[ \displaystyle\sum_{i=1}^m y^{(i)} \log h_{\theta}(x^{(i)})+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)})) \right]$

To fit the parameters $\theta$: $\min_{\theta} J(\theta)$

To make a prediction given new xx:

Output $h_{\theta}(x) = \frac{1}{1 + e^{-\theta^Tx}}$
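A minimal Octave sketch of this cost in vectorized form (assuming a design matrix X with one example per row and x0 = 1 in the first column; the function name logisticCost is mine):

% J(theta) = -(1/m) * sum( y .* log(h) + (1 - y) .* log(1 - h) )
function J = logisticCost(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));   % h_theta(x^(i)) for all examples at once
  J = -(1 / m) * sum(y .* log(h) + (1 - y) .* log(1 - h));
endfunction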

Gradient Descent

$J(\theta)=-\frac{1}{m} \left[ \displaystyle\sum_{i=1}^m y^{(i)} \log h_{\theta}(x^{(i)})+(1-y^{(i)})\log(1-h_{\theta}(x^{(i)})) \right]$

Want $\min_{\theta}J(\theta)$:

Repeat {
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j} J(\theta)$
}
(simultaneously update every $\theta_j$)

$\frac{\partial}{\partial\theta_j} J(\theta) = \frac{1}{m} \displaystyle\sum_{i=1}^m (h_\theta (x^{(i)}) - y^{(i)})x_j^{(i)}$

so:

Repeat {
$\theta_j := \theta_j - \frac{\alpha}{m} \displaystyle\sum_{i=1}^m (h_\theta (x^{(i)}) - y^{(i)})x_j^{(i)}$
}
(simultaneously update all $\theta_j$)

$\theta = \begin{bmatrix} \theta_0 \\ \theta_1 \\ \theta_2 \\ \vdots \\ \theta_n \end{bmatrix}$, $h_{\theta}(x)=\frac{1}{1 + e^{-\theta^Tx}}$

The algorithm looks identical to linear regression, but it uses a different definition of $h_{\theta}(x)$.
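A rough Octave sketch of this loop (alpha and the iteration count are arbitrary illustrative choices; X is assumed to be the design matrix with x0 = 1 in the first column, y the label vector, and theta the current parameters):

alpha = 0.01;                        % learning rate (illustrative value)
m = length(y);
for iter = 1:1000
  h = 1 ./ (1 + exp(-X * theta));    % logistic hypothesis for all examples
  grad = (1 / m) * X' * (h - y);     % partial derivatives for every theta_j at once
  theta = theta - alpha * grad;      % simultaneous update of all theta_j
end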

Advanced Optimization

Optimization algorithms

Cost function $J(\theta)$. Want $\min_{\theta}J(\theta)$.

Given $\theta$, we already have code that can compute:

  • $J(\theta)$
  • $\frac{\partial}{\partial\theta_j} J(\theta)$ (for $j=0,1,...,n$)

Gradient descent

Repeat {
$\theta_j := \theta_j - \alpha \frac{\partial}{\partial\theta_j} J(\theta)$
}

Other optimization algorithms (besides gradient descent):

  • Conjugate gradient
  • BFGS
  • L-BFGS

Advantages of using those algorithms:

  • No need to manually pick $\alpha$

    • they have a line search algorithm that automatically tries different values of the learning rate $\alpha$ and picks a good one
  • Often faster than gradient descent

Disadvantages of using those algorithms:

  • More complex and advanced

Example: $\theta = \begin{bmatrix} \theta_1 \\ \theta_2 \end{bmatrix}$

$\min_{\theta} J(\theta) \implies \theta_1=5;\ \theta_2=5$

$J(\theta) = (\theta_1 - 5)^2 + (\theta_2 - 5)^2$ (cost function)

$\frac{\partial}{\partial\theta_1}J(\theta) = 2(\theta_1-5)$

$\frac{\partial}{\partial\theta_2}J(\theta) = 2(\theta_2-5)$

Check it in Octave:

>> options = optimset('GradObj', 'on', 'MaxIter', 100);
>> initialTheta = zeros(2,1);
>> function [jVal, gradient] = costFunction(theta)
  % cost J(theta) = (theta_1 - 5)^2 + (theta_2 - 5)^2
  jVal = (theta(1)-5)^2 + (theta(2)-5)^2;
  % gradient of J(theta) with respect to theta_1 and theta_2
  gradient = zeros(2,1);
  gradient(1) = 2*(theta(1)-5);
  gradient(2) = 2*(theta(2)-5);
endfunction
>> [optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options)
optTheta =

   5.0000
   5.0000

functionVal =   1.5777e-30
exitFlag =  1
>>

fminunc - an advanced optimization function in Octave that minimizes an unconstrained function.
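The same interface works for logistic regression; a hedged sketch below (the name logisticCostFunction and the anonymous-function wrapper are my own choices, not the course code):

% returns cost and gradient, as fminunc expects when 'GradObj' is 'on'
function [jVal, gradient] = logisticCostFunction(theta, X, y)
  m = length(y);
  h = 1 ./ (1 + exp(-X * theta));
  jVal = -(1 / m) * sum(y .* log(h) + (1 - y) .* log(1 - h));
  gradient = (1 / m) * X' * (h - y);
endfunction

% fminunc needs a function of theta only, so wrap X and y in an anonymous function:
% [optTheta, cost] = fminunc(@(t) logisticCostFunction(t, X, y), initialTheta, options);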

Multi-class classification: ONE-VS-ALL

Multiclass classification

Examples:

  • Email foldering/tagging: Work $(y=1)$, Friends $(y=2)$, Family $(y=3)$, Hobby $(y=4)$
  • Medical diagnosis: Not ill $(y=1)$, Cold $(y=2)$, Flu $(y=3)$
  • Weather: Sunny $(y=1)$, Cloudy $(y=2)$, Rain $(y=3)$, Snow $(y=4)$

Binary classification

(figure: binary classification data)

Multi-class classification

(figure: multi-class classification data)

Question: How do we get a learning algorithm to work for this setting?

One-vs-all (one-vs-rest)

(figure: one-vs-all overview)

Step 1: Creating a new training set to fit the classifier $h_{\theta}^{(1)}(x)$

(figure: one-vs-all, class 1 vs the rest)

Step 2: Creating a new training set to fit the classifier $h_{\theta}^{(2)}(x)$

(figure: one-vs-all, class 2 vs the rest)

Step 3: Creating a new training set to fit the classifier $h_{\theta}^{(3)}(x)$

(figure: one-vs-all, class 3 vs the rest)

We fit three classifiers, each estimating the probability of one of the three classes:

$h_{\theta}^{(i)}(x) = P(y=i|x;\theta)$ for $i = 1, 2, 3$

One-vs-all

Train a logistic regression classifier $h_{\theta}^{(i)}(x)$ for each class $i$ to predict the probability that $y=i$. On a new input $x$, to make a prediction, pick the class $i$ that maximizes $h_{\theta}^{(i)}(x)$, i.e. $\max_i h_{\theta}^{(i)}(x)$.
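A small Octave sketch of the one-vs-all prediction step (assuming all_theta stores one trained parameter vector per class as rows and X has x0 = 1 in the first column; the names are illustrative):

% probability of each class for every example: one column per classifier h^(i)
probs = 1 ./ (1 + exp(-X * all_theta'));   % m x K matrix

% for each example, pick the class i that maximizes h^(i)(x)
[~, predictions] = max(probs, [], 2);      % predictions(i) is a class in 1..K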