My notes from the Machine Learning Course provided by Andrew Ng on Coursera

## What is Machine Learning?

"The field of study that gives computers the ability to learn without being explicitly programmed"

~Arthur Samuel (1959)

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E"

~Tom Mitchell (1998)

## Supervised learning

• We know how the correct output should look like
• We know the relationship between the input and output
• Two types of supervised learning:
• Regression
• Classification

### Regression

• Predict values based on labelled data
• Continuous valued output

Example: House size -> Sell prize

### Classification

• Predict discrete valued label

Example: Email -> Spam or not

Binary or Binary Classification – group data in two types of label

Multi-class or Multinomial Classification – group data in more than two kinds of label

## Unsupervised learning

• Have only dataset given (without labelled data)
• Algorithms finding structure in data
• No feedback based on the prediction results
• Two types of unsupervised learning:
• Clustering
• Non-clustering

### Clustering

Example: Collection on 1 000 000 different genes and find in automatically group these genes into groups that are somehow similar or related by different variables, such as lifespan, location, roles, etc.

### Non-clustering

Example: Cocktail Party Algorithm – allows to find structure in a chaotic environment

## Supervised Learning vs Unsupervised Learning

• Supervised Learning will ALWAYS have an input-output pair
• Unsupervised Learning will have ONLY data with no label and will try to find some structure

## Model Representation

• x^{(i)}– input variable (input feature)
• y^{(i)} – output variable (target variable that we are trying to predict)
• x^{(i)},y^{(i)} – training example
• x^{(i)},y^{(i)});ⅈ=1,…,m – training set
• Function h:X→Y – hypothesis, the output of the learning algorithm, where h(x) is a good predictor for y

## Regression Problem (Linear Regression)

Linear Regression with one variable x and two parameters θ

h_θ (x)=θ_0+θ_1 x – linear function

Regression problem predict a continuous-valued label

## Cost Function

The occurrence of Hypothesis Function can be measured by using Cost Function

• Hypothesis: h_θ (x)=θ_0+θ_1 x
• Parameters: θ_0,θ_1
• Cost function: J(θ_0,θ_1 )=\frac{1}{2m} \displaystyle\sum_{i=1}^m (h_θ (x_i )-y_i )^2
• Goal: \frac{minimize}{θ_0,θ_1} J(θ_0,θ_1)

Other names for this J(θ_0,θ_1) cost function:

• Squared error function
• Mean squared error

Finds local minimum for cost function J(θ_0,θ_1)

repeat until convergence {
θ_j ≔ θ_j-α \frac{∂}{∂θ_0} J(θ_0,θ_1)
}

Parameter θ should be simultaneously updated at each iteration

Example:

temp0 ≔ θ_0-α \frac{∂}{∂θ_0} J(θ_0,θ_1)
temp1 ≔ θ_1-α \frac{∂}{∂θ_0} J(θ_0,θ_1)
θ_0 := temp0
θ_1 := temp1

• Finds local minimum
• Coverages to a local minimum with fixed α because derivative becomes smaller (baby steps)

## Learning Rate α

• If α is too small, Gradient Descent is slow
• If α is too large, Gradient Descent can overstep minimum and not coverage

Choosing a proper learning rate α in the beginning and stick to it at each iteration since Gradient Descent will automatically take smaller steps when approaching local minimum.

## Gradient Descent for Linear Regression

Linear Regression with one variable

repeat until convergence {
θ_j ≔ θ_j-α \frac{∂}{∂θ_0} J(θ_0,θ_1)
(for j=1 and j=0)
}

### Linear Regression Model

h_θ (x)=θ_0+θ_1 x
J(θ_0,θ_1 )=\frac{1}{2m} \displaystyle\sum_{i=1}^m (h_θ (x_i )-y_i )^2

• Calculate partial deviation $\frac{∂}{∂θ_0} J(θ_0,θ_1)$
\frac{∂}{∂θ_0}J(θ_0,θ_1 )=\frac{1}{2m} \displaystyle\sum_{i=1}^m (h_θ (x_i )-y_i )
\frac{∂}{∂θ_1}J(θ_0,θ_1 )=\frac{1}{2m} \displaystyle\sum_{i=1}^m ((h_θ (x_i )-y_i)x_i)

repeat until convergence {
θ_0 := θ_0 - \alpha \frac{1}{m} \displaystyle\sum_{i=1}^m (h_θ (x^{(i)} )-y^{(i)})
θ_1 := θ_1 - \alpha \frac{1}{m} \displaystyle\sum_{i=1}^m (h_θ (x^{(i)} )-y^{(i)})x^{(i)}
}

• m => size of the training set

• θ_0, θ_1 => constants that are changing simultaneously

• x_i, y_i => values of the given training set

Gradient descent is called batch gradient if it needs to look at every example in the entire training set at each step

### Example of gradient descent to minimalize cost function

Step 1

Step 2

Step 3

Step 4

Step 5

Step 6

Step 7

Step 8

Step 9 – a good time to predict (global minimum was found)

## Linear Algebra

• Matrices are two-dimensional arrays

\begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix}

• Dimension: rows x columns
• 3×2 matrix => \R^{3x2}
• Entry: A_{ij} -> i^{th} row, j^{th} column

Ex. A_{12} = b

• Vector – matrix with one column and many rows

\begin{bmatrix} a \\ b \\ c \\ d \end{bmatrix}

• Dimension: n-dimensional vector
• 3-dimensional vector -> \R^3

## Convention

• Uppercase letters – matrices (A, B, C, …)
• Lowercase letters – vectors (a, b, c, …)

## Scalar

Single value – not a matrix or vector

\begin{bmatrix} a & b \\ c & d \end{bmatrix} + \begin{bmatrix} w & x \\ y & z \end{bmatrix} = \begin{bmatrix} a + w & b + x \\ c + y & d + z \end{bmatrix}

Example:
\begin{bmatrix} 1 & 2 & 4 \\ 5 & 3 & 2 \end{bmatrix} + \begin{bmatrix} 1 & 3 & 4 \\ 1 & 1 & 3 \end{bmatrix} = \begin{bmatrix} 1 + 1 & 2 + 3 & 4 + 4 \\ 5 + 1 & 3 + 1 & 2 + 3 \end{bmatrix} = \begin{bmatrix} 2 & 5 & 8 \\ 6 & 4 & 1 \end{bmatrix}

Undefined for matrices with different dimension

## Matrix subtracting

Subtract each corresponding element

\begin{bmatrix} a & b \\ c & d \end{bmatrix} - \begin{bmatrix} w & x \\ y & z \end{bmatrix} = \begin{bmatrix} a - w & b - x \\ c - y & d - z \end{bmatrix}

Example:

\begin{bmatrix} 1 & 2 & 4 \\ 5 & 3 & 2 \end{bmatrix} - \begin{bmatrix} 1 & 3 & 4 \\ 1 & 1 & 3 \end{bmatrix} = \begin{bmatrix} 1 - 1 & 2 - 3 & 4 - 4 \\ 5 - 1 & 3 - 1 & 2 - 3 \end{bmatrix} = \begin{bmatrix} 0 & -1 & 0 \\ 4 & 2 & -1 \end{bmatrix}

Undefined for matrices with different dimension

## Scalar multiplication

Multiply each element by the scalar value

\begin{bmatrix} a & b \\ c & d \end{bmatrix} * x = \begin{bmatrix} a * x & b * x \\ c * x & d * x \end{bmatrix}

Example:

\begin{bmatrix} 1 & 2 & 4 \\ 5 & 3 & 2 \end{bmatrix} * 2 = \begin{bmatrix} 1 * 2 & 2 * 2 & 4 * 2 \\ 5 * 2 & 3 * 2 & 2 * 2 \end{bmatrix} = \begin{bmatrix} 2 & 4 & 8 \\ 10 & 6 & 4 \end{bmatrix}

## Scalar division

Divide each element by the scalar value

\begin{bmatrix} a & b \\ c & d \end{bmatrix} / x = \begin{bmatrix} a / x & b / x \\ c / x & d / x \end{bmatrix}

Example:

\begin{bmatrix} 1 & 2 & 4 \\ 5 & 3 & 2 \end{bmatrix} / 2 = \begin{bmatrix} 1 / 2 & 2 / 2 & 4 / 2 \\ 5 / 2 & 3 / 2 & 2 / 2 \end{bmatrix} = \begin{bmatrix} 0.5 & 1 & 2 \\ 2.5 & 1.5 & 1 \end{bmatrix}

## Matrix-Vector Multiplication

A x X = Y
\begin{bmatrix} & & & \\ & & & \end{bmatrix} x \begin{bmatrix} & \\ & \\ & \end{bmatrix} = \begin{bmatrix} & \\ & \end{bmatrix}
mxn nx1 m-dimensional \ vector
(m rows, n columns) (n-dimensional vector)

To get y_i, multiply A’s i^{th} row with an element of vector x, and add them up:

\begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix} * \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} a * x & b * y \\ c * x & d * y \\ e * x & f * y \end{bmatrix}

Example:

\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} * \begin{bmatrix} 3 \\ 2 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 * 3 & 2 * 2 & 3 * 1 \\ 4 * 3 & 5 * 2 & 6 * 1 \\ 7 * 3 & 8 * 2 & 9 * 1 \end{bmatrix} = \begin{bmatrix} 10 \\ 28 \\ 46 \end{bmatrix}

## Matrix-Matrix Multiplication

A x X = Y
\begin{bmatrix} & & & \\ & & & \end{bmatrix} x \begin{bmatrix} & & & \\ & & & \end{bmatrix} = \begin{bmatrix} & & &\\ & & & \end{bmatrix}
mxn matrix nxo matrix mxo matrix
(m rows, n columns) (n rows, o columns) (m rows, o columns)

The i^{th} column of the matrix C is obtained by multiplying A with the ith column of B (for i = 1, 2,… o)

\begin{bmatrix} a & b \\ c & d \\ e & f\end{bmatrix} * \begin{bmatrix} w & x \\ y & z \end{bmatrix} = \begin{bmatrix} a * w + b * y & a * x + b * z \\ c * w + d * y & c * x + d * z \\ e * w + f * y & e * x + f * z \end{bmatrix}
\textcolor{blue}{3x2} \textcolor{blue}{2x2} \textcolor{blue}{3x2}

Example:

\begin{bmatrix} 1 & 2 & 1 \\ 3 & 4 & 3 \\ 5 & 6 & 5 \end{bmatrix} * \begin{bmatrix} 1 & 2 \\ 3 & 2 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 1 * 1 + 2 * 3 + 1 * 1 & 1 * 2 + 2 * 2 + 1 * 3 \\ 3 * 1 + 4 * 3 + 3 * 1 & 3 * 2 + 4 * 2 + 3 * 3 \\ 5 * 1 + 6 * 3 + 5 * 1 & 5 * 2 + 6 * 2 + 5 * 3 \end{bmatrix} = \begin{bmatrix} 8 & 9 \\ 18 & 23 \\ 28 & 37 \end{bmatrix}

## Matrix Multiplication Properties

• Matrices are not commutative: A * B != B * C
• Matrices are associative: (A * B) * C = A * (B * C)

## Identity Matrix

Denoted I (or I_{nxn}) = simply – has 1’s in the diagonal (upper left to lower right diagonal) and 0’s elsewhere

Example:

\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 & ... \\ 0 & 1 & 0 & 0 & ... \\ 0 & 0 & 1 & 0 & ... \\ 0 & 0 & 0 & 1 & ... \\ ... & ... & ... & ... & 1 \\ \end{bmatrix}

For any matrix A

A * I = I * A = A

## Matrix Inverse

If is an mxm matrix, and it has an inverse

A(A^{-1}) = A^{-1} * A = I

Matrices that don’t have an inverse are:

• Singular
• Degenerate

## Matrix Transpose

Let A be on mxn matrix and let B = A^T

Then B is nxm matrix and

B_{ij} = A_{ji}

Example:

A = \begin{bmatrix} 1 & 2 & 0 \\ 3 & 5 & 9 \end{bmatrix}   B = A^T \begin{bmatrix} 1 & 3 \\ 2 & 5 \\ 0 & 9 \end{bmatrix}

The first row of A becomes first columns of A^T