
My notes from the Machine Learning Course provided by Andrew Ng on Coursera

What is Machine Learning?

"The field of study that gives computers the ability to learn without being explicitly programmed"

~Arthur Samuel (1959)

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E"

~Tom Mitchell (1998)

Supervised learning

  • We know what the correct output should look like
  • We know the relationship between the input and the output
  • Two types of supervised learning:
    • Regression
    • Classification

Regression

  • Predict values based on labelled data
  • Continuous valued output

Example: House size -> Sale price

Classification

  • Predict discrete valued label

Example: Email -> Spam or not

Binary or Binomial Classification – group data into two types of labels

Multi-class or Multinomial Classification – group data into more than two types of labels

Unsupervised learning

  • Only the dataset is given (no labelled data)
  • Algorithms find structure in the data
  • No feedback based on the prediction results
  • Two types of unsupervised learning:
    • Clustering
    • Non-clustering

Clustering

Example: Take a collection of 1,000,000 different genes and automatically group these genes into groups that are somehow similar or related by different variables, such as lifespan, location, roles, etc.

Non-clustering

Example: Cocktail Party Algorithm – allows finding structure in a chaotic environment (e.g. separating individual voices from overlapping audio)

Supervised Learning vs Unsupervised Learning

  • Supervised Learning will ALWAYS have an input-output pair
  • Unsupervised Learning will have ONLY data with no label and will try to find some structure

Model Representation

  • x^{(i)} – input variable (input feature)
  • y^{(i)} – output variable (target variable that we are trying to predict)
  • (x^{(i)}, y^{(i)}) – training example
  • (x^{(i)}, y^{(i)}); i=1,…,m – training set
  • Function h:X→Y – hypothesis, the output of the learning algorithm, where h(x) is a good predictor for y

Regression – if y is continuous

Classification – if y takes only a small set of discrete values
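
As a minimal sketch (the numbers are made up for illustration), this notation might look like the following in Python with NumPy:

```python
import numpy as np

# Hypothetical training set: house sizes (x) and sale prices (y)
x = np.array([2104, 1416, 1534, 852])   # input features x^(i)
y = np.array([460, 232, 315, 178])      # target variables y^(i)
m = len(x)                              # m = number of training examples

# The first training example (x^(1), y^(1)); code uses 0-based indexing
print(x[0], y[0], m)
```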

Regression Problem (Linear Regression)

Linear Regression with one variable x and two parameters θ_0 and θ_1

h_θ (x)=θ_0+θ_1 x – linear function

A regression problem predicts a continuous-valued output

Cost Function

The accuracy of the hypothesis function can be measured by using a cost function

  • Hypothesis: h_θ (x)=θ_0+θ_1 x
  • Parameters: θ_0,θ_1
  • Cost function: J(θ_0,θ_1 )=\frac{1}{2m} \displaystyle\sum_{i=1}^m (h_θ (x_i )-y_i )^2
  • Goal: \underset{θ_0,θ_1}{\text{minimize}}\; J(θ_0,θ_1)

Other names for this J(θ_0,θ_1) cost function:

  • Squared error function
  • Mean squared error
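
As a rough sketch of how this cost could be computed in Python (the data and parameter values are made up for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])   # hypothetical inputs
y = np.array([3.0, 5.0, 7.0, 9.0])   # hypothetical targets, exactly y = 1 + 2x

def hypothesis(theta0, theta1, x):
    # h_theta(x) = theta_0 + theta_1 * x
    return theta0 + theta1 * x

def cost(theta0, theta1, x, y):
    # J(theta_0, theta_1) = 1/(2m) * sum((h_theta(x_i) - y_i)^2)
    m = len(x)
    errors = hypothesis(theta0, theta1, x) - y
    return np.sum(errors ** 2) / (2 * m)

print(cost(1.0, 2.0, x, y))   # 0.0, because this hypothesis fits the data perfectly
print(cost(0.0, 0.0, x, y))   # 20.5, a worse fit gives a larger cost
```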

Gradient descent

Finds a local minimum of the cost function J(θ_0,θ_1)

Gradient descent algorithm

repeat until convergence {
θ_j ≔ θ_j-α \frac{∂}{∂θ_j} J(θ_0,θ_1)
}

The parameters θ_0 and θ_1 should be updated simultaneously at each iteration

Example:

temp0 ≔ θ_0-α \frac{∂}{∂θ_0} J(θ_0,θ_1)
temp1 ≔ θ_1-α \frac{∂}{∂θ_1} J(θ_0,θ_1)
θ_0 := temp0
θ_1 := temp1
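
A small sketch of one such simultaneous update step in Python; `dJ_dtheta0` and `dJ_dtheta1` are hypothetical functions returning the two partial derivatives:

```python
def gradient_step(theta0, theta1, alpha, dJ_dtheta0, dJ_dtheta1):
    # Compute both updates from the current parameter values first...
    temp0 = theta0 - alpha * dJ_dtheta0(theta0, theta1)
    temp1 = theta1 - alpha * dJ_dtheta1(theta0, theta1)
    # ...and only then assign, so theta_0 and theta_1 change simultaneously
    return temp0, temp1
```

Computing both temporary values before assigning is what makes the update simultaneous; updating θ_0 first and then using the new θ_0 to update θ_1 would be a subtle bug.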

Gradient Descent Intuition

  • Finds local minimum
  • Converges to a local minimum with fixed α because the derivative becomes smaller (baby steps)

Learning Rate α

  • If α is too small, Gradient Descent is slow
  • If α is too large, Gradient Descent can overshoot the minimum and fail to converge

Choose a proper learning rate α at the beginning and stick to it at each iteration, since Gradient Descent will automatically take smaller steps as it approaches a local minimum.
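
A minimal illustration of the two failure modes on the toy function J(θ) = θ² (whose derivative is 2θ); the specific α values are only examples:

```python
def run(alpha, steps=10, theta=1.0):
    # Gradient descent on J(theta) = theta^2, derivative 2 * theta
    for _ in range(steps):
        theta = theta - alpha * 2 * theta
    return theta

print(run(alpha=0.01))   # too small: after 10 steps still far from the minimum at 0
print(run(alpha=0.5))    # reasonable: reaches 0 immediately here
print(run(alpha=1.1))    # too large: each step overshoots and |theta| keeps growing
```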

Gradient Descent for Linear Regression

Linear Regression with one variable

Gradient descent algorithm

repeat until convergence {
θ_j ≔ θ_j-α \frac{∂}{∂θ_j} J(θ_0,θ_1)
(for j=1 and j=0)
}

Linear Regression Model

h_θ (x)=θ_0+θ_1 x
J(θ_0,θ_1 )=\frac{1}{2m} \displaystyle\sum_{i=1}^m (h_θ (x_i )-y_i )^2

  • Calculate the partial derivatives $\frac{∂}{∂θ_j} J(θ_0,θ_1)$
    \frac{∂}{∂θ_0}J(θ_0,θ_1 )=\frac{1}{m} \displaystyle\sum_{i=1}^m (h_θ (x_i )-y_i )
    \frac{∂}{∂θ_1}J(θ_0,θ_1 )=\frac{1}{m} \displaystyle\sum_{i=1}^m ((h_θ (x_i )-y_i)x_i)

  • So gradient descent becomes
    repeat until convergence {
    θ_0 := θ_0 - \alpha \frac{1}{m} \displaystyle\sum_{i=1}^m (h_θ (x^{(i)} )-y^{(i)})
    θ_1 := θ_1 - \alpha \frac{1}{m} \displaystyle\sum_{i=1}^m (h_θ (x^{(i)} )-y^{(i)})x^{(i)}
    }

  • m => size of the training set

  • θ_0, θ_1 => parameters that are updated simultaneously

  • x_i, y_i => values of the given training set

Gradient descent is called batch gradient descent because it needs to look at every example in the entire training set at each step
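
A minimal sketch of batch gradient descent for one-variable linear regression, assuming NumPy and made-up data:

```python
import numpy as np

def batch_gradient_descent(x, y, alpha=0.1, iterations=1000):
    m = len(x)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        errors = (theta0 + theta1 * x) - y     # h_theta(x^(i)) - y^(i) for all i
        grad0 = np.sum(errors) / m             # uses the whole training set ("batch")
        grad1 = np.sum(errors * x) / m
        # Simultaneous update of both parameters
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])      # generated from y = 1 + 2x
print(batch_gradient_descent(x, y))     # ends up close to (1.0, 2.0)
```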

Example of gradient descent used to minimize the cost function (plots omitted): over Steps 1–8 the parameters move downhill step by step; at Step 9 the global minimum is found – a good time to predict.

Linear Algebra

  • Matrices are two-dimensional arrays

\begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix}

  • Dimension: rows x columns
    • 3×2 matrix => \R^{3 \times 2}
  • Entry: A_{ij} -> i^{th} row, j^{th} column

Ex. A_{12} = b

  • Vector – matrix with one column and many rows

\begin{bmatrix} a \\ b \\ c \\ d \end{bmatrix}

  • Dimension: n-dimensional vector
    • 3-dimensional vector -> \R^3

Convention

  • Uppercase letters – matrices (A, B, C, …)
  • Lowercase letters – vectors (a, b, c, …)

Scalar

Single value – not a matrix or vector

Matrix addition

Add each corresponding element:

\begin{bmatrix} a & b \\ c & d \end{bmatrix} + \begin{bmatrix} w & x \\ y & z \end{bmatrix} = \begin{bmatrix} a + w & b + x \\ c + y & d + z \end{bmatrix}

Example:
\begin{bmatrix} 1 & 2 & 4 \\ 5 & 3 & 2 \end{bmatrix} + \begin{bmatrix} 1 & 3 & 4 \\ 1 & 1 & 3 \end{bmatrix} = \begin{bmatrix} 1 + 1 & 2 + 3 & 4 + 4 \\ 5 + 1 & 3 + 1 & 2 + 3 \end{bmatrix} = \begin{bmatrix} 2 & 5 & 8 \\ 6 & 4 & 5 \end{bmatrix}

Undefined for matrices of different dimensions

Matrix subtraction

Subtract each corresponding element

\begin{bmatrix} a & b \\ c & d \end{bmatrix} - \begin{bmatrix} w & x \\ y & z \end{bmatrix} = \begin{bmatrix} a - w & b - x \\ c - y & d - z \end{bmatrix}

Example:

\begin{bmatrix} 1 & 2 & 4 \\ 5 & 3 & 2 \end{bmatrix} - \begin{bmatrix} 1 & 3 & 4 \\ 1 & 1 & 3 \end{bmatrix} = \begin{bmatrix} 1 - 1 & 2 - 3 & 4 - 4 \\ 5 - 1 & 3 - 1 & 2 - 3 \end{bmatrix} = \begin{bmatrix} 0 & -1 & 0 \\ 4 & 2 & -1 \end{bmatrix}

Undefined for matrices of different dimensions

Scalar multiplication

Multiply each element by the scalar value

\begin{bmatrix} a & b \\ c & d \end{bmatrix} * x = \begin{bmatrix} a * x & b * x \\ c * x & d * x \end{bmatrix}

Example:

\begin{bmatrix} 1 & 2 & 4 \\ 5 & 3 & 2 \end{bmatrix} * 2 = \begin{bmatrix} 1 * 2 & 2 * 2 & 4 * 2 \\ 5 * 2 & 3 * 2 & 2 * 2 \end{bmatrix} = \begin{bmatrix} 2 & 4 & 8 \\ 10 & 6 & 4 \end{bmatrix}

Scalar division

Divide each element by the scalar value

\begin{bmatrix} a & b \\ c & d \end{bmatrix} / x = \begin{bmatrix} a / x & b / x \\ c / x & d / x \end{bmatrix}

Example:

\begin{bmatrix} 1 & 2 & 4 \\ 5 & 3 & 2 \end{bmatrix} / 2 = \begin{bmatrix} 1 / 2 & 2 / 2 & 4 / 2 \\ 5 / 2 & 3 / 2 & 2 / 2 \end{bmatrix} = \begin{bmatrix} 0.5 & 1 & 2 \\ 2.5 & 1.5 & 1 \end{bmatrix}
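
All four element-wise operations map directly onto NumPy, which can be used to check the examples above:

```python
import numpy as np

A = np.array([[1, 2, 4],
              [5, 3, 2]])
B = np.array([[1, 3, 4],
              [1, 1, 3]])

print(A + B)    # matrix addition:       [[2 5 8] [6 4 5]]
print(A - B)    # matrix subtraction:    [[0 -1 0] [4 2 -1]]
print(A * 2)    # scalar multiplication: [[2 4 8] [10 6 4]]
print(A / 2)    # scalar division:       [[0.5 1. 2.] [2.5 1.5 1.]]
```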

Matrix-Vector Multiplication

A * x = y, where A is an m×n matrix (m rows, n columns), x is an n×1 vector (n-dimensional vector), and the result y is an m-dimensional vector.

To get y_i, multiply the elements of A’s i^{th} row with the corresponding elements of vector x, and add them up:

\begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix} * \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} a * x + b * y \\ c * x + d * y \\ e * x + f * y \end{bmatrix}

Example:

\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} * \begin{bmatrix} 3 \\ 2 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 * 3 + 2 * 2 + 3 * 1 \\ 4 * 3 + 5 * 2 + 6 * 1 \\ 7 * 3 + 8 * 2 + 9 * 1 \end{bmatrix} = \begin{bmatrix} 10 \\ 28 \\ 46 \end{bmatrix}
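
In NumPy the same product is written with the `@` operator. As an aside that goes slightly beyond the notes, this is also how the hypothesis can be evaluated for every training example at once, if the inputs are stacked into a design matrix with a column of 1’s (the numbers below are made up):

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
x = np.array([3, 2, 1])
print(A @ x)    # [10 28 46], matching the example above

# Hypothetical design matrix: first column of 1's (for theta_0), then house sizes
X = np.array([[1.0, 2104],
              [1.0, 1416],
              [1.0, 1534]])
theta = np.array([50.0, 0.1])
print(X @ theta)   # h_theta(x^(i)) for every example in one matrix-vector product
```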

Matrix-Matrix Multiplication

A * B = C, where A is an m×n matrix (m rows, n columns), B is an n×o matrix (n rows, o columns), and the result C is an m×o matrix (m rows, o columns).

The i^{th} column of the matrix C is obtained by multiplying A with the i^{th} column of B (for i = 1, 2, …, o)

\begin{bmatrix} a & b \\ c & d \\ e & f\end{bmatrix} * \begin{bmatrix} w & x \\ y & z \end{bmatrix} = \begin{bmatrix} a * w + b * y & a * x + b * z \\ c * w + d * y & c * x + d * z \\ e * w + f * y & e * x + f * z \end{bmatrix}
(a 3×2 matrix times a 2×2 matrix gives a 3×2 matrix)

Example:

\begin{bmatrix} 1 & 2 & 1 \\ 3 & 4 & 3 \\ 5 & 6 & 5 \end{bmatrix} * \begin{bmatrix} 1 & 2 \\ 3 & 2 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 1 * 1 + 2 * 3 + 1 * 1 & 1 * 2 + 2 * 2 + 1 * 3 \\ 3 * 1 + 4 * 3 + 3 * 1 & 3 * 2 + 4 * 2 + 3 * 3 \\ 5 * 1 + 6 * 3 + 5 * 1 & 5 * 2 + 6 * 2 + 5 * 3 \end{bmatrix} = \begin{bmatrix} 8 & 9 \\ 18 & 23 \\ 28 & 37 \end{bmatrix}

Matrix Multiplication Properties

  • Matrices are not commutative: A * B != B * A
  • Matrices are associative: (A * B) * C = A * (B * C)
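
A quick NumPy check of both properties (the random matrices here are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.random((2, 2))
B = rng.random((2, 2))
C = rng.random((2, 2))

print(np.allclose(A @ B, B @ A))              # almost always False: not commutative
print(np.allclose((A @ B) @ C, A @ (B @ C)))  # True: associative
```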

Identity Matrix

Denoted I (or I_{n×n}) – has 1’s on the diagonal (upper left to lower right) and 0’s elsewhere

Example:

\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 & ... \\ 0 & 1 & 0 & 0 & ... \\ 0 & 0 & 1 & 0 & ... \\ 0 & 0 & 0 & 1 & ... \\ ... & ... & ... & ... & 1 \\ \end{bmatrix}

For any matrix A

A * I = I * A = A

Matrix Inverse

If A is an m×m matrix and it has an inverse, then

A(A^{-1}) = A^{-1} * A = I

Matrices that don’t have an inverse are called:

  • Singular
  • Degenerate
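
A short NumPy sketch of the identity and inverse properties; note that `np.linalg.inv` raises an error when given a singular matrix:

```python
import numpy as np

A = np.array([[4.0, 7.0],
              [2.0, 6.0]])
I = np.eye(2)                      # 2x2 identity matrix

print(np.allclose(A @ I, A))       # A * I = A
print(np.allclose(I @ A, A))       # I * A = A

A_inv = np.linalg.inv(A)
print(np.allclose(A @ A_inv, I))   # A * A^-1 = I
print(np.allclose(A_inv @ A, I))   # A^-1 * A = I

# A singular (degenerate) matrix has no inverse:
# np.linalg.inv(np.array([[1.0, 2.0], [2.0, 4.0]]))   # raises LinAlgError
```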

Matrix Transpose

Let A be an m×n matrix and let B = A^T

Then B is an n×m matrix and

B_{ij} = A_{ji}

Example:

A = \begin{bmatrix} 1 & 2 & 0 \\ 3 & 5 & 9 \end{bmatrix}   B = A^T = \begin{bmatrix} 1 & 3 \\ 2 & 5 \\ 0 & 9 \end{bmatrix}

The first row of A becomes the first column of A^T
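
In NumPy the transpose is available as the `.T` attribute:

```python
import numpy as np

A = np.array([[1, 2, 0],
              [3, 5, 9]])

B = A.T                       # a 2x3 matrix becomes a 3x2 matrix
print(B)                      # [[1 3] [2 5] [0 9]]
print(B[0, 1] == A[1, 0])     # True: B_ij = A_ji
```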
