Machine Learning (Coursera) Notes – Week 2

19 July 2020

ml

My notes from the Machine Learning Course provided by Andrew Ng on Coursera

Linear Regression with Multiple Variables

Linear Regression for a single feature

h_θ(x) = θ_0 + θ_1 x

Linear Regression for multiple variables

h_θ(x) = θ_0 x_0 + θ_1 x_1 + θ_2 x_2 + ... + θ_n x_n
where x_0 = 1

Parameters: θ = \{θ_0, θ_1, θ_2, ..., θ_n\} \in \R^{n+1}
Features: x = \{x_0, x_1, x_2, ..., x_n\} \in \R^{n+1}

Notation:

  • n - number of features
  • x^{(i)} - input (features) of the i^{th} training example
  • x_j^{(i)} - value of feature j in the i^{th} training example

Example:

Predict house price:

| Size (feet^2) x_1 | No. of bedrooms x_2 | No. of floors x_3 | Age of house (years) x_4 | Price ($1000) h_θ(x) = y |
| --- | --- | --- | --- | --- |
| 2104 | 5 | 1 | 45 | 460 |
| 1416 | 3 | 2 | 40 | 232 |
| 1534 | 3 | 2 | 30 | 315 |
| 852 | 2 | 1 | 36 | 178 |

n = 4

x^{(i)}:

x^{(2)} = \begin{bmatrix} 1416 \\ 3 \\ 2 \\ 40 \end{bmatrix} \qquad x^{(4)} = \begin{bmatrix} 852 \\ 2 \\ 1 \\ 36 \end{bmatrix}

The vector is in \R^n \Rightarrow \R^4

x_j^{(i)}:
x_1^{(2)} = 1416; \quad x_2^{(2)} = 3; \quad x_3^{(2)} = 2; \quad x_4^{(2)} = 40
x_1^{(4)} = 852; \quad x_2^{(4)} = 2; \quad x_3^{(4)} = 1; \quad x_4^{(4)} = 36

So the hypothesis function applied to these examples gives:

h_θ(x) = θ_0 + 2104 θ_1 + 5 θ_2 + θ_3 + 45 θ_4 = 460
h_θ(x) = θ_0 + 1416 θ_1 + 3 θ_2 + 2 θ_3 + 40 θ_4 = 232
h_θ(x) = θ_0 + 1534 θ_1 + 3 θ_2 + 2 θ_3 + 30 θ_4 = 315
h_θ(x) = θ_0 + 852 θ_1 + 2 θ_2 + θ_3 + 36 θ_4 = 178
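
To make the vectorized form concrete, here is a minimal NumPy sketch (my own addition, not from the course; θ is left as a placeholder) that computes h_θ(x) = θᵀx for every row of the table at once:

```python
import numpy as np

# Training examples from the table above; the first column is x_0 = 1.
X = np.array([
    [1, 2104, 5, 1, 45],
    [1, 1416, 3, 2, 40],
    [1, 1534, 3, 2, 30],
    [1,  852, 2, 1, 36],
], dtype=float)

theta = np.zeros(X.shape[1])   # placeholder values for θ_0 ... θ_4

# Vectorized hypothesis: h_θ(x^{(i)}) = θᵀ x^{(i)}, computed for all rows at once.
predictions = X @ theta
print(predictions)             # one predicted price (in $1000s) per training example
```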

Cost Function

J(θ) = \frac{1}{2m} \displaystyle\sum_{i=1}^m (h_θ(x^{(i)}) - y^{(i)})^2
where
(h_θ(x^{(i)}) - y^{(i)})^2 - squared error for training example i
h_θ(x^{(i)}) - predicted value
y^{(i)} - true value

This cost function is the Mean Squared Error (MSE), halved for convenience (the 1/2 cancels when taking the derivative).
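
A small NumPy sketch of this cost function (my own illustration; it assumes a design matrix X whose first column is all ones):

```python
import numpy as np

def compute_cost(X, y, theta):
    """J(θ) = 1/(2m) · Σ (h_θ(x^{(i)}) - y^{(i)})²  for linear regression."""
    m = len(y)
    errors = X @ theta - y            # predicted value minus true value, per example
    return (errors @ errors) / (2 * m)
```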

Gradient Descent

Repeat {
θ_j := θ_j - \alpha \frac{∂}{∂θ_j} J(θ)
}
(simultaneously update for every j = 0, ..., n)

Pseudocode for gradient descent

  1. n = 1 (only one feature)

repeat until convergence {
θ_0 := θ_0 - \alpha \frac{1}{m} \displaystyle\sum_{i=1}^m (h_θ(x^{(i)}) - y^{(i)})
θ_1 := θ_1 - \alpha \frac{1}{m} \displaystyle\sum_{i=1}^m (h_θ(x^{(i)}) - y^{(i)}) x^{(i)}
} simultaneously update θ_0, θ_1

  2. n \geqslant 1

repeat until convergence {
θ_j := θ_j - \alpha \frac{1}{m} \displaystyle\sum_{i=1}^m (h_θ(x^{(i)}) - y^{(i)}) x_j^{(i)}
} simultaneously update θ_j for j = 0, ..., n

Example, written out for the first few parameters (a vectorized code sketch follows after the list)

  • θ_0 := θ_0 - \alpha \frac{1}{m} \displaystyle\sum_{i=1}^m (h_θ(x^{(i)}) - y^{(i)}) x_0^{(i)}
      and because x_0^{(i)} = 1, the update rule for the θ_0 parameter can be written:
    θ_0 := θ_0 - \alpha \frac{1}{m} \displaystyle\sum_{i=1}^m (h_θ(x^{(i)}) - y^{(i)})
  • θ_1 := θ_1 - \alpha \frac{1}{m} \displaystyle\sum_{i=1}^m (h_θ(x^{(i)}) - y^{(i)}) x_1^{(i)}
  • θ_2 := θ_2 - \alpha \frac{1}{m} \displaystyle\sum_{i=1}^m (h_θ(x^{(i)}) - y^{(i)}) x_2^{(i)}
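
Putting the update rule into code: a minimal vectorized sketch (my own addition; the hyperparameter defaults are arbitrary), where all θ_j are updated simultaneously in one matrix operation:

```python
import numpy as np

def gradient_descent(X, y, alpha=0.01, num_iters=1000):
    """Batch gradient descent for linear regression.

    X: m x (n+1) design matrix whose first column is all ones (x_0 = 1),
    y: m-vector of targets. Every θ_j is updated simultaneously.
    """
    m, n_plus_1 = X.shape
    theta = np.zeros(n_plus_1)
    for _ in range(num_iters):
        errors = X @ theta - y              # h_θ(x^{(i)}) - y^{(i)} for all i
        gradient = (X.T @ errors) / m       # 1/m · Σ (h_θ(x^{(i)}) - y^{(i)}) x_j^{(i)}
        theta = theta - alpha * gradient    # simultaneous update of all θ_j
    return theta
```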

Gradient Descent in Practice

Idea: Make sure the features are on a similar scale, so that gradient descent converges faster.

Example:
x_1 = size (0 - 2000 feet^2)
x_2 = number of bedrooms (1 - 5)

Problem: x_1 has a much larger range of values than x_2, so the contours of J(θ) form very skewed, elongated ellipses, and gradient descent on this cost function can take a long time to converge.

Feature Scaling

Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1.

x_1 = \frac{\text{size}(\text{feet}^2)}{2000}

x_2 = \frac{\text{no. of bedrooms}}{5}

New range for this example:

0 \leqslant x_1 \leqslant 1
0 \leqslant x_2 \leqslant 1

The scale of the features should be kept approximately in the -1 \leqslant x_i \leqslant 1 range.

Mean Normalization

Replace x_i with x_i - \mu_i to make the features have approximately zero mean (do not apply this to x_0 = 1).

Example:

x_1 = \frac{\text{size} - 1000}{2000} (where average = 1000 and max = 2000)
x_2 = \frac{\text{no. of bedrooms} - 2}{5} (where average = 2 and max = 5)

The new range is:

-0.5 \leqslant x_1 \leqslant 0.5
-0.5 \leqslant x_2 \leqslant 0.5

x_i := \frac{x_i - \mu_i}{s_i}, where \mu_i = average of feature x_i in the training set and s_i = the range (max - min) or the standard deviation.
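
A short sketch of mean normalization in NumPy (my own addition; it uses the standard deviation as s_i, the range would work just as well):

```python
import numpy as np

def normalize_features(X):
    """Mean-normalize every feature column: x := (x - mean) / std.

    Assumes X holds only the raw features (no column of ones); the x_0 = 1
    column should be added after normalization.
    """
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)          # the range (max - min) would also work as s_i
    X_norm = (X - mu) / sigma
    return X_norm, mu, sigma       # keep mu/sigma to scale future inputs the same way
```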

Learning Rate α\alpha

"Debugging" - how to make sure gradient descent is working correctly.

Plot the value of the cost function J(\theta) against the number of iterations.

It is difficult to tell in advance how many iterations gradient descent needs to converge.

Gradient descent is working correctly if J(\theta) decreases after every iteration.

When the plotted curve flattens out, gradient descent has (more or less) converged, since the cost function barely changes any more.

[Plot: J(\theta) decreasing and flattening out as the number of iterations grows]

If J(\theta) is increasing, gradient descent is probably overshooting the minimum and a smaller learning rate \alpha should be used.

[Plots: J(\theta) growing or oscillating when the learning rate does not lead to convergence]

Automatic convergence test

Example: Declare convergence if J(\theta) decreases by less than some small value (e.g. 10^{-3}) in one iteration.

Summary

  • for sufficiently small α\alpha, J(θ)J(\theta) should decrease on each iteration
  • if \alpha is too small: slow convergence
  • if \alpha is too large: J(\theta) may not decrease on every iteration; may not converge
  • find the best \alpha for speed + convergence by plotting J(\theta) for different \alpha values, e.g. 0.01, 0.03, 0.1, 0.3, ...
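
A rough sketch (my own addition; names like X_norm are placeholders for scaled data) of how one might compare several \alpha values and apply the 10^{-3} convergence test:

```python
import numpy as np

def run_gd(X, y, alpha, num_iters=400, eps=1e-3):
    """Gradient descent that records J(θ) each iteration and stops once
    J decreases by less than eps (the automatic convergence test above)."""
    m, n = X.shape
    theta = np.zeros(n)
    history = []
    for _ in range(num_iters):
        errors = X @ theta - y
        theta = theta - alpha * (X.T @ errors) / m
        history.append((errors @ errors) / (2 * m))   # J(θ) before this update
        if len(history) > 1 and history[-2] - history[-1] < eps:
            break                                      # declared converged
    return theta, history

# Compare a few learning rates (X_norm / y are placeholders):
# for alpha in (0.01, 0.03, 0.1, 0.3):
#     _, hist = run_gd(X_norm, y, alpha)
#     print(alpha, len(hist), hist[-1])   # plot hist to see how fast J(θ) drops
```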

Features and Polynomial Regression

Understanding features can allow us to create a better model e.g. by combining them:

x_1 = width
x_2 = depth
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2

\rightarrow combine them into a single area feature

area = width \cdot depth
x = x_1 \cdot x_2

so

h_\theta(x) = \theta_0 + \theta_1 x (using only one feature)
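
As a tiny illustration (the values and names are made up by me), the two raw columns can simply be replaced by their product before building the design matrix:

```python
import numpy as np

# Made-up raw columns for illustration only.
width = np.array([20.0, 15.0, 30.0])
depth = np.array([50.0, 40.0, 35.0])

# Replace the two features with their product: x = x_1 * x_2 (the plot area).
area = width * depth
X = np.column_stack([np.ones(len(area)), area])   # design matrix for h_θ(x) = θ_0 + θ_1·x
```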

Polynomial regression

Hypothesis function doesn't need to be a straight line.

It can be made polynomial by choosing appropriate features.

The hypothesis function can be changed into a quadratic, cubic or square root function (or any other form)

e.g.

hypothesis function: h_\theta(x) = \theta_0 + \theta_1 x_1

  • create additional features based on x_1

h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 (quadratic function)
or
h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3 (cubic function)

Feature scaling becomes very important here!

e.g. if x_1 has range 1 - 1000, then the range of x_1^2 becomes 1 - 1 000 000 and x_1^3 becomes 1 - 1 000 000 000
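
A sketch of building polynomial features with scaling applied right away (my own addition; degree 3 matches the cubic example above):

```python
import numpy as np

def polynomial_features(x1, degree=3):
    """Build columns [x_1, x_1^2, ..., x_1^degree] from a single feature and
    mean-normalize each column, since the ranges explode with the power."""
    cols = np.column_stack([x1 ** d for d in range(1, degree + 1)])
    cols = (cols - cols.mean(axis=0)) / cols.std(axis=0)   # feature scaling
    return np.column_stack([np.ones(len(x1)), cols])       # prepend x_0 = 1
```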

Normal Equation

Normal equation - a method to solve for \theta analytically \rightarrow finding the optimal value in just one go

Intuition: the 1D case (\theta \in \R)

Cost function: J(\theta) = a\theta^2 + b\theta + c (a quadratic function)

Solve for \min J(\theta) \Rightarrow (\theta_0, \theta_1, ..., \theta_n)

  • take the derivatives with respect to the \theta_j and set them to 0
  • for \theta \in \R^{n+1}: J(\theta_0, \theta_1, ..., \theta_n) = \frac{1}{2m}\displaystyle\sum_{i=1}^m (h_\theta (x^{(i)})-y^{(i)})^2

\frac{\partial}{\partial \theta_j} J(\theta) = ... = 0 (for every j)
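
Working the 1D case out explicitly (a standard calculus step, added here for clarity rather than taken from the lecture):

\frac{d}{d\theta} J(\theta) = \frac{d}{d\theta}\left(a\theta^2 + b\theta + c\right) = 2a\theta + b = 0 \quad\Rightarrow\quad \theta = -\frac{b}{2a}

(assuming a > 0, so this critical point is the minimum)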

Example: m = 4 (house price prediction)

| Size (feet^2) x_1 | No. of bedrooms x_2 | No. of floors x_3 | Age of house (years) x_4 | Price ($1000) y |
| --- | --- | --- | --- | --- |
| 2104 | 5 | 1 | 45 | 460 |
| 1416 | 3 | 2 | 40 | 232 |
| 1534 | 3 | 2 | 30 | 315 |
| 852 | 2 | 1 | 36 | 178 |

⇓

Add an extra column corresponding to an extra feature x_0 that always takes the value 1

| x_0 | Size (feet^2) x_1 | No. of bedrooms x_2 | No. of floors x_3 | Age of house (years) x_4 | Price ($1000) y |
| --- | --- | --- | --- | --- | --- |
| 1 | 2104 | 5 | 1 | 45 | 460 |
| 1 | 1416 | 3 | 2 | 40 | 232 |
| 1 | 1534 | 3 | 2 | 30 | 315 |
| 1 | 852 | 2 | 1 | 36 | 178 |

so:


\underbrace{X}_{\text{features}} = \underbrace{\begin{bmatrix} 1 & 2104 & 5 & 1 & 45 \\ 1 & 1416 & 3 & 2 & 40 \\ 1 & 1534 & 3 & 2 & 30 \\ 1 & 852 & 2 & 1 & 36 \end{bmatrix}}_{m \times (n+1)}

y = \underbrace{\begin{bmatrix} 460 \\ 232 \\ 315 \\ 178 \end{bmatrix}}_{m\text{-dimensional vector}}

Then:

\theta = (X^TX)^{-1}X^Ty (the normal equation)

In the general case:

  • m examples (x^{(1)}, y^{(1)}), ..., (x^{(m)}, y^{(m)})

x^{(i)} = \underbrace{\begin{bmatrix} x_0^{(i)} \\ x_1^{(i)} \\ x_2^{(i)} \\ \vdots \\ x_n^{(i)} \end{bmatrix}}_{\text{vector of all feature values of example } i} \in \R^{n+1}

Design matrix:

X = \underbrace{\begin{bmatrix} \text{---} & (x^{(1)})^T & \text{---} \\ \text{---} & (x^{(2)})^T & \text{---} \\ & \vdots & \\ \text{---} & (x^{(m)})^T & \text{---} \end{bmatrix}}_{m \times (n+1)}

E.g.

If x^{(i)} = \begin{bmatrix} 1 \\ x_1^{(i)} \end{bmatrix}, then X = \begin{bmatrix} 1 & x_1^{(1)} \\ 1 & x_1^{(2)} \\ \vdots & \vdots \\ 1 & x_1^{(m)} \end{bmatrix} and \vec{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}

\theta = (X^TX)^{-1}X^Ty
where:
(X^TX)^{-1} - the inverse of the matrix X^TX

There is no need to do feature scaling when using the normal equation.
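
In NumPy this one-step solution looks roughly like the sketch below (my own addition; I use the pseudo-inverse so the step also behaves when X^TX is non-invertible, see the last section):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form solution θ = (XᵀX)⁻¹ Xᵀ y for linear regression.

    np.linalg.pinv (pseudo-inverse) is used instead of a plain inverse so the
    computation still works when XᵀX is non-invertible.
    """
    return np.linalg.pinv(X.T @ X) @ X.T @ y

# theta = normal_equation(X, y)   # X: m x (n+1) design matrix, y: m-vector
```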

Gradient Descent vs Normal Equation

for m training examples and n features

| Gradient Descent | Normal Equation |
| --- | --- |
| Need to choose \alpha | No need to choose \alpha |
| Needs many iterations | No need to iterate |
| \mathcal{O}(kn^2) | Needs to compute (X^TX)^{-1} \Rightarrow \mathcal{O}(n^3) |
| Works well even when n is large | Slow if n is very large |

What if X^TX is non-invertible?

Non-invertible X^TX:

  • singular
  • degenerate

If X^TX is non-invertible, there are usually two common causes:

  1. Redundant features (linearly dependent)

E.g. x_1 = size in feet^2
x_2 = size in m^2

  2. Too many features (e.g. m \leqslant n)

\rightarrow delete some features or use regularization
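
A quick sketch (my own addition, reusing numbers from the table above) showing how a redundant feature makes X^TX singular, and that the pseudo-inverse still returns a usable θ:

```python
import numpy as np

# Redundant (linearly dependent) features: size in feet² and the same size in m².
size_ft2 = np.array([2104.0, 1416.0, 1534.0, 852.0])
size_m2 = size_ft2 * 0.092903                    # an exact multiple of the first column
X = np.column_stack([np.ones(4), size_ft2, size_m2])

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))                # 2 < 3, so XᵀX is singular

# A plain inverse of XtX would fail or be numerically meaningless here,
# but the pseudo-inverse still yields a least-squares solution for θ.
y = np.array([460.0, 232.0, 315.0, 178.0])
theta = np.linalg.pinv(X) @ y
print(theta)
```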