Machine Learning Coursera Notes – Week 1

12 July 2020

ml

My notes from the Machine Learning Course provided by Andrew Ng on Coursera

What is Machine Learning?

"The field of study that gives computers the ability to learn without being explicitly programmed"
~Arthur Samuel (1959)

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E"
~Tom Mitchell (1998)

Supervised learning

  • We know what the correct output should look like
  • We know the relationship between the input and output
  • Two types of supervised learning:

    • Regression
    • Classification

Regression

  • Predict values based on labelled data
  • Continuous valued output

Example: House size -> Sale price

Classification

  • Predict discrete valued label

Example: Email -> Spam or not

Binary or Binomial Classification – groups data into two kinds of labels

Multi-class or Multinomial Classification – groups data into more than two kinds of labels

Unsupervised learning

  • Only a dataset is given (no labelled data)
  • Algorithms find structure in the data
  • No feedback based on the prediction results
  • Two types of unsupervised learning:

    • Clustering
    • Non-clustering

Clustering

Example: Take a collection of 1,000,000 different genes and automatically group them into clusters that are somehow similar or related by different variables, such as lifespan, location, roles, etc.

Non-clustering

Example: Cocktail Party Algorithm – finds structure in a chaotic environment (e.g., separating individual voices from overlapping audio recordings)

Supervised Learning vs Unsupervised Learning

  • Supervised Learning will ALWAYS have input-output pairs
  • Unsupervised Learning will ONLY have unlabelled data and will try to find some structure in it

Model Representation

  • x^{(i)} – input variable (input feature)
  • y^{(i)} – output variable (target variable that we are trying to predict)
  • (x^{(i)}, y^{(i)}) – training example
  • (x^{(i)}, y^{(i)}); i = 1, …, m – training set
  • Function h: X → Y – hypothesis, the output of the learning algorithm, where h(x) is a good predictor for y

[Figure: training set → learning algorithm → hypothesis h]

Regression -> if y is continuous

Classification -> if y takes values in a small set of discrete values

Regression Problem (Linear Regression)

Linear Regression with one variable x and two parameters θ_0 and θ_1

h_θ(x) = θ_0 + θ_1 x – linear function

[Figure: linear regression line fitted to the training data]

A regression problem predicts a continuous-valued label

Cost Function

The accuracy of the Hypothesis Function can be measured by using a Cost Function

  • Hypothesis: h_θ(x) = θ_0 + θ_1 x
  • Parameters: θ_0, θ_1
  • Cost function: J(θ_0, θ_1) = \frac{1}{2m} \displaystyle\sum_{i=1}^m (h_θ(x_i) - y_i)^2
  • Goal: \underset{θ_0, θ_1}{\text{minimize}}\ J(θ_0, θ_1)

Other names for this J(θ_0, θ_1) cost function:

  • Squared error function
  • Mean squared error
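
As an illustration, here is a minimal NumPy sketch of this squared-error cost function. The function name, variable names, and toy data are my own choices, not from the course:

```python
import numpy as np

def compute_cost(x, y, theta0, theta1):
    """Squared-error cost J(theta0, theta1) for one-variable linear regression."""
    m = len(y)                             # number of training examples
    predictions = theta0 + theta1 * x      # hypothesis h_theta(x) for every example
    return np.sum((predictions - y) ** 2) / (2 * m)

# toy training set, e.g. house sizes (x) and sale prices (y)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.5, 3.5, 4.5])

print(compute_cost(x, y, 0.0, 1.0))        # cost of the hypothesis h(x) = x
```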

Gradient descent

Finds a local minimum of the cost function J(θ_0, θ_1)

Gradient descent algorithm

repeat until convergence {
θ_j ≔ θ_j - α \frac{∂}{∂θ_j} J(θ_0, θ_1)    (for j = 0 and j = 1)
}

Parameters θ_0 and θ_1 should be updated simultaneously at each iteration

Example:

temp0 ≔ θ_0 - α \frac{∂}{∂θ_0} J(θ_0, θ_1)
temp1 ≔ θ_1 - α \frac{∂}{∂θ_1} J(θ_0, θ_1)
θ_0 ≔ temp0
θ_1 ≔ temp1

Gradient Descent Intuition

  • Finds a local minimum
  • Converges to a local minimum even with a fixed α, because the derivative becomes smaller near the minimum (baby steps)

Learning Rate α

  • If α is too small, Gradient Descent is slow
  • If α is too large, Gradient Descent can overshoot the minimum and fail to converge

[Figure: gradient descent with too small vs. too large learning rate α]

Choose a proper learning rate α at the beginning and stick to it at every iteration, since Gradient Descent will automatically take smaller steps as it approaches a local minimum.

[Figure: with a fixed α, gradient descent takes progressively smaller steps near the minimum]

Gradient Descent for Linear Regression

Linear Regression with one variable

Gradient descent algorithm

repeat until convergence {
θ_j ≔ θ_j - α \frac{∂}{∂θ_j} J(θ_0, θ_1)
(for j = 0 and j = 1)
}

Linear Regression Model

h_θ(x) = θ_0 + θ_1 x
J(θ_0, θ_1) = \frac{1}{2m} \displaystyle\sum_{i=1}^m (h_θ(x_i) - y_i)^2

  • Calculate the partial derivatives \frac{∂}{∂θ_j} J(θ_0, θ_1)
    \frac{∂}{∂θ_0} J(θ_0, θ_1) = \frac{1}{m} \displaystyle\sum_{i=1}^m (h_θ(x_i) - y_i)
    \frac{∂}{∂θ_1} J(θ_0, θ_1) = \frac{1}{m} \displaystyle\sum_{i=1}^m ((h_θ(x_i) - y_i) x_i)
  • So gradient descent becomes (see the sketch after this list)
    repeat until convergence {
    θ_0 := θ_0 - \alpha \frac{1}{m} \displaystyle\sum_{i=1}^m (h_θ(x^{(i)}) - y^{(i)})
    θ_1 := θ_1 - \alpha \frac{1}{m} \displaystyle\sum_{i=1}^m (h_θ(x^{(i)}) - y^{(i)}) x^{(i)}
    }
  • m => size of the training set
  • θ_0, θ_1 => parameters that are updated simultaneously
  • x^{(i)}, y^{(i)} => values of the given training set
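
Below is a minimal NumPy sketch of these update rules (batch gradient descent for one-variable linear regression). A fixed iteration count stands in for "repeat until convergence", and the names and toy data are illustrative choices of mine:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, num_iters=1000):
    """Batch gradient descent for h_theta(x) = theta0 + theta1 * x."""
    m = len(y)
    theta0, theta1 = 0.0, 0.0
    for _ in range(num_iters):
        predictions = theta0 + theta1 * x
        # simultaneous update: compute both gradients before changing either parameter
        grad0 = np.sum(predictions - y) / m
        grad1 = np.sum((predictions - y) * x) / m
        theta0 -= alpha * grad0
        theta1 -= alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 2.5, 3.5, 4.5])
print(gradient_descent(x, y))   # converges towards the least-squares fit
```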

Gradient descent is called batch gradient descent because it looks at every example in the entire training set at each step

Example of gradient descent minimizing the cost function

[Figures: Steps 1–9 of gradient descent gradually reducing the cost; by Step 9 the global minimum has been found – a good time to predict]

Linear Algebra

  • Matrices are two-dimensional arrays

\begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix}

  • Dimension: rows x columns

    • 3x2 matrix => \R^{3x2}
  • Entry: A_{ij} -> i^{th} row, j^{th} column

Ex. A_{12} = b

  • Vector – matrix with one column and many rows

\begin{bmatrix} a \\ b \\ c \\ d \end{bmatrix}

  • Dimension: n-dimensional vector

    • 3-dimensional vector -> \R^3

Convention

  • Uppercase letters – matrices (A, B, C, …)
  • Lowercase letters – vectors (a, b, c, …)

Scalar

Single value – not a matrix or vector

Matrix addition

Add each corresponding element:

\begin{bmatrix} a & b \\ c & d \end{bmatrix} + \begin{bmatrix} w & x \\ y & z \end{bmatrix} = \begin{bmatrix} a + w & b + x \\ c + y & d + z \end{bmatrix}

Example:

\begin{bmatrix} 1 & 2 & 4 \\ 5 & 3 & 2 \end{bmatrix} + \begin{bmatrix} 1 & 3 & 4 \\ 1 & 1 & 3 \end{bmatrix} = \begin{bmatrix} 1 + 1 & 2 + 3 & 4 + 4 \\ 5 + 1 & 3 + 1 & 2 + 3 \end{bmatrix} = \begin{bmatrix} 2 & 5 & 8 \\ 6 & 4 & 5 \end{bmatrix}

Undefined for matrices with different dimensions

Matrix subtraction

Subtract each corresponding element

\begin{bmatrix} a & b \\ c & d \end{bmatrix} - \begin{bmatrix} w & x \\ y & z \end{bmatrix} = \begin{bmatrix} a - w & b - x \\ c - y & d - z \end{bmatrix}

Example:

\begin{bmatrix} 1 & 2 & 4 \\ 5 & 3 & 2 \end{bmatrix} - \begin{bmatrix} 1 & 3 & 4 \\ 1 & 1 & 3 \end{bmatrix} = \begin{bmatrix} 1 - 1 & 2 - 3 & 4 - 4 \\ 5 - 1 & 3 - 1 & 2 - 3 \end{bmatrix} = \begin{bmatrix} 0 & -1 & 0 \\ 4 & 2 & -1 \end{bmatrix}

Undefined for matrices with different dimensions

Scalar multiplication

Multiply each element by the scalar value

\begin{bmatrix} a & b \\ c & d \end{bmatrix} * x = \begin{bmatrix} a * x & b * x \\ c * x & d * x \end{bmatrix}

Example:

\begin{bmatrix} 1 & 2 & 4 \\ 5 & 3 & 2 \end{bmatrix} * 2 = \begin{bmatrix} 1 * 2 & 2 * 2 & 4 * 2 \\ 5 * 2 & 3 * 2 & 2 * 2 \end{bmatrix} = \begin{bmatrix} 2 & 4 & 8 \\ 10 & 6 & 4 \end{bmatrix}

Scalar division

Divide each element by the scalar value

\begin{bmatrix} a & b \\ c & d \end{bmatrix} / x = \begin{bmatrix} a / x & b / x \\ c / x & d / x \end{bmatrix}

Example:

\begin{bmatrix} 1 & 2 & 4 \\ 5 & 3 & 2 \end{bmatrix} / 2 = \begin{bmatrix} 1 / 2 & 2 / 2 & 4 / 2 \\ 5 / 2 & 3 / 2 & 2 / 2 \end{bmatrix} = \begin{bmatrix} 0.5 & 1 & 2 \\ 2.5 & 1.5 & 1 \end{bmatrix}
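
These four element-wise operations map directly onto NumPy array operations; here is a small sketch reproducing the examples above (NumPy is my own choice of tool, not the course's):

```python
import numpy as np

A = np.array([[1, 2, 4],
              [5, 3, 2]])
B = np.array([[1, 3, 4],
              [1, 1, 3]])

print(A + B)   # matrix addition        -> [[2 5 8] [6 4 5]]
print(A - B)   # matrix subtraction     -> [[ 0 -1  0] [ 4  2 -1]]
print(A * 2)   # scalar multiplication  -> [[ 2  4  8] [10  6  4]]
print(A / 2)   # scalar division        -> [[0.5 1.  2. ] [2.5 1.5 1. ]]
```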

Matrix-Vector Multiplication

A * x = y
(mxn matrix, i.e. m rows and n columns) * (nx1 vector, i.e. an n-dimensional vector) = (m-dimensional vector)

To get y_i, multiply the elements of A's i^{th} row with the corresponding elements of vector x, and add them up:

\begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix} * \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} a * x + b * y \\ c * x + d * y \\ e * x + f * y \end{bmatrix}

Example:

\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} * \begin{bmatrix} 3 \\ 2 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 * 3 + 2 * 2 + 3 * 1 \\ 4 * 3 + 5 * 2 + 6 * 1 \\ 7 * 3 + 8 * 2 + 9 * 1 \end{bmatrix} = \begin{bmatrix} 10 \\ 28 \\ 46 \end{bmatrix}
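
The same example in NumPy, using the @ operator for matrix-vector multiplication (an illustration of mine):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
x = np.array([3, 2, 1])

print(A @ x)   # 3x3 matrix times 3-dimensional vector -> [10 28 46]
```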

Matrix-Matrix Multiplication

A * B = C
(mxn matrix) * (nxo matrix) = (mxo matrix)
(m rows, n columns) (n rows, o columns) (m rows, o columns)

The i^{th} column of the matrix C is obtained by multiplying A with the i^{th} column of B (for i = 1, 2, …, o)

\begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix} * \begin{bmatrix} w & x \\ y & z \end{bmatrix} = \begin{bmatrix} a * w + b * y & a * x + b * z \\ c * w + d * y & c * x + d * z \\ e * w + f * y & e * x + f * z \end{bmatrix}
(3x2 matrix) * (2x2 matrix) = (3x2 matrix)

Example:

\begin{bmatrix} 1 & 2 & 1 \\ 3 & 4 & 3 \\ 5 & 6 & 5 \end{bmatrix} * \begin{bmatrix} 1 & 2 \\ 3 & 2 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 1 * 1 + 2 * 3 + 1 * 1 & 1 * 2 + 2 * 2 + 1 * 3 \\ 3 * 1 + 4 * 3 + 3 * 1 & 3 * 2 + 4 * 2 + 3 * 3 \\ 5 * 1 + 6 * 3 + 5 * 1 & 5 * 2 + 6 * 2 + 5 * 3 \end{bmatrix} = \begin{bmatrix} 8 & 9 \\ 18 & 23 \\ 28 & 37 \end{bmatrix}
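
The same example in NumPy (illustrative only):

```python
import numpy as np

A = np.array([[1, 2, 1],
              [3, 4, 3],
              [5, 6, 5]])
B = np.array([[1, 2],
              [3, 2],
              [1, 3]])

print(A @ B)   # 3x3 times 3x2 gives a 3x2 matrix -> [[ 8  9] [18 23] [28 37]]
```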

Matrix Multiplication Properties

  • Matrices are not commutative: A * B != B * A
  • Matrices are associative: (A * B) * C = A * (B * C) (both properties are checked numerically in the sketch below)
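
A small NumPy check of these two properties, on matrices I made up for illustration:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
B = np.array([[0, 1],
              [1, 0]])
C = np.array([[2, 0],
              [0, 3]])

print(np.array_equal(A @ B, B @ A))              # False: not commutative
print(np.array_equal((A @ B) @ C, A @ (B @ C)))  # True: associative
```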

Identity Matrix

Denoted I (or I_{n×n}) – simply has 1's on the diagonal (upper-left to lower-right diagonal) and 0's elsewhere

Example:

\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 & ... \\ 0 & 1 & 0 & 0 & ... \\ 0 & 0 & 1 & 0 & ... \\ 0 & 0 & 0 & 1 & ... \\ ... & ... & ... & ... & 1 \end{bmatrix}

For any matrix A

A * I = I * A = A
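
In NumPy the identity matrix is created with np.eye; a quick check of the property above (my own example values):

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4]])
I = np.eye(2)      # 2x2 identity matrix

print(A @ I)       # -> [[1. 2.] [3. 4.]]  (A * I = A)
print(I @ A)       # -> [[1. 2.] [3. 4.]]  (I * A = A)
```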

Matrix Inverse

If A is an mxm (square) matrix and it has an inverse, then

A * A^{-1} = A^{-1} * A = I

Matrices that don't have an inverse are called:

  • Singular
  • Degenerate
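
A short NumPy sketch using np.linalg.inv; the matrix values are my own example, and np.linalg.inv raises an error for singular matrices:

```python
import numpy as np

A = np.array([[4.0, 3.0],
              [2.0, 2.0]])
A_inv = np.linalg.inv(A)   # fails with LinAlgError if A is singular/degenerate

print(A @ A_inv)           # approximately the 2x2 identity matrix
```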

Matrix Transpose

Let A be an mxn matrix and let B = A^T

Then B is an nxm matrix and

B_{ij} = A_{ji}

Example:

A = \begin{bmatrix} 1 & 2 & 0 \\ 3 & 5 & 9 \end{bmatrix}  B = A^T = \begin{bmatrix} 1 & 3 \\ 2 & 5 \\ 0 & 9 \end{bmatrix}

The first row of A becomes the first column of A^T
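
The same example in NumPy, where the transpose is the .T attribute (illustrative):

```python
import numpy as np

A = np.array([[1, 2, 0],
              [3, 5, 9]])
B = A.T                    # transpose: a 2x3 matrix becomes 3x2

print(B)                   # -> [[1 3] [2 5] [0 9]]
print(B[0, 1] == A[1, 0])  # B_ij == A_ji -> True
```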