# Machine Learning Coursera Notes – Week 1

12 July 2020

My notes from the Machine Learning course taught by Andrew Ng on Coursera.

## What is Machine Learning?

"The field of study that gives computers the ability to learn without being explicitly programmed"
~Arthur Samuel (1959)

"A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E"
~Tom Mitchell (1998)

## Supervised learning

• We know what the correct output should look like
• We know the relationship between the input and output
• Two types of supervised learning:

• Regression
• Classification

### Regression

• Predict values based on labelled data
• Continuous valued output

Example: House size -> Sale price

### Classification

• Predict discrete valued label

Example: Email -> Spam or not

Binary or Binomial Classification – groups data into two kinds of labels

Multi-class or Multinomial Classification – groups data into more than two kinds of labels

## Unsupervised learning

• Only the dataset is given (no labelled data)
• The algorithm finds structure in the data
• No feedback based on the prediction results
• Two types of unsupervised learning:

• Clustering
• Non-clustering

### Clustering

Example: Take a collection of 1,000,000 different genes and automatically group them into clusters that are somehow similar or related by different variables, such as lifespan, location, roles, etc.

### Non-clustering

Example: Cocktail Party Algorithm – allows finding structure in a chaotic environment (e.g., separating individual voices from a mesh of overlapping sounds)

## Supervised Learning vs Unsupervised Learning

• Supervised Learning will ALWAYS have an input-output pair
• Unsupervised Learning will have ONLY data with no label and will try to find some structure

## Model Representation

• $x^{(i)}$ - input variable (input feature)
• $y^{(i)}$ - output variable (target variable that we are trying to predict)
• $(x^{(i)},y^{(i)})$ - training example
• $(x^{(i)},y^{(i)});\ i=1,…,m$ - training set
• Function $h:X→Y$ - hypothesis, the output of the learning algorithm, where $h(x)$ is a good predictor for $y$

## Regression Problem (Linear Regression)

Linear Regression with one variable $x$ and two parameters $θ_0, θ_1$

$h_θ (x)=θ_0+θ_1 x$ - linear function

A regression problem predicts a continuous-valued output
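
A minimal Python sketch of this hypothesis (the function name and house-price numbers are my own, not from the course):

```python
# Hypothesis for linear regression with one variable: h(x) = theta0 + theta1 * x
def h(x, theta0, theta1):
    return theta0 + theta1 * x

# Made-up example: predict a house price (in $1000s) from its size (in square feet)
print(h(1500, 50.0, 0.1))  # 50.0 + 0.1 * 1500 = 200.0
```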

## Cost Function

The accuracy of the hypothesis function can be measured by using a cost function

• Hypothesis: $h_θ (x)=θ_0+θ_1 x$
• Parameters: $θ_0,θ_1$
• Cost function: $J(θ_0,θ_1 )=\frac{1}{2m} \displaystyle\sum_{i=1}^m (h_θ (x^{(i)} )-y^{(i)} )^2$
• Goal: $\displaystyle\min_{θ_0,θ_1} J(θ_0,θ_1)$
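
A minimal NumPy sketch of this cost function, assuming `x` and `y` are arrays holding the training set (the toy values are mine):

```python
import numpy as np

def compute_cost(x, y, theta0, theta1):
    """J(theta0, theta1) = (1 / 2m) * sum((h(x_i) - y_i)^2)."""
    m = len(y)                           # size of the training set
    predictions = theta0 + theta1 * x    # h applied to every example at once
    return np.sum((predictions - y) ** 2) / (2 * m)

x = np.array([1.0, 2.0, 3.0])
y = np.array([1.0, 2.0, 3.0])
print(compute_cost(x, y, 0.0, 1.0))  # perfect fit -> 0.0
print(compute_cost(x, y, 0.0, 0.5))  # worse fit -> positive cost
```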

Other names for this $J(θ_0,θ_1)$ cost function:

• Squared error function
• Mean squared error (the extra factor of $\frac{1}{2}$ simplifies the derivative later)

## Gradient Descent

Gradient Descent finds a local minimum of the cost function $J(θ_0,θ_1)$:

repeat until convergence {
$θ_j ≔ θ_j-α \frac{∂}{∂θ_j} J(θ_0,θ_1)$
}

Parameters $θ_0$ and $θ_1$ should be updated simultaneously at each iteration

Example:

$temp0 ≔ θ_0-α \frac{∂}{∂θ_0} J(θ_0,θ_1)$
$temp1 ≔ θ_1-α \frac{∂}{∂θ_1} J(θ_0,θ_1)$
$θ_0$ := temp0
$θ_1$ := temp1
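
In code, "simultaneous" means both temps are computed from the old values before either parameter is overwritten. A sketch, using the linear regression gradients derived in a later section:

```python
import numpy as np

# One step of gradient descent with a SIMULTANEOUS update: both temps are
# computed from the OLD parameter values before either theta is overwritten.
def gradient_step(x, y, theta0, theta1, alpha):
    m = len(y)
    error = (theta0 + theta1 * x) - y                 # h(x) - y, old parameters
    temp0 = theta0 - alpha * error.sum() / m          # dJ/dtheta0
    temp1 = theta1 - alpha * (error * x).sum() / m    # dJ/dtheta1
    return temp0, temp1

x = np.array([1.0, 2.0])
y = np.array([2.0, 4.0])
print(gradient_step(x, y, 0.0, 0.0, 0.1))  # (0.3, 0.5)
```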

• Finds a local minimum
• Converges to a local minimum with fixed α because the derivative becomes smaller near the minimum (baby steps)

## Learning Rate α

• If α is too small, Gradient Descent is slow
• If α is too large, Gradient Descent can overshoot the minimum and fail to converge (or even diverge)

Choose a proper learning rate α at the beginning and stick to it at each iteration, since Gradient Descent automatically takes smaller steps as it approaches a local minimum.

## Gradient Descent for Linear Regression

Linear Regression with one variable

repeat until convergence {
$θ_j ≔ θ_j-α \frac{∂}{∂θ_j} J(θ_0,θ_1)$
(for $j=0$ and $j=1$)
}

### Linear Regression Model

$h_θ (x)=θ_0+θ_1 x$
$J(θ_0,θ_1 )=\frac{1}{2m} \displaystyle\sum_{i=1}^m (h_θ (x^{(i)} )-y^{(i)} )^2$

• Calculate the partial derivatives $\frac{∂}{∂θ_j} J(θ_0,θ_1)$ (the factor of 2 from the square cancels the $\frac{1}{2}$):
$\frac{∂}{∂θ_0}J(θ_0,θ_1 )=\frac{1}{m} \displaystyle\sum_{i=1}^m (h_θ (x^{(i)} )-y^{(i)} )$
$\frac{∂}{∂θ_1}J(θ_0,θ_1 )=\frac{1}{m} \displaystyle\sum_{i=1}^m ((h_θ (x^{(i)} )-y^{(i)} )x^{(i)})$
repeat until convergence {
$θ_0 := θ_0 - \alpha \frac{1}{m} \displaystyle\sum_{i=1}^m (h_θ (x^{(i)} )-y^{(i)})$
$θ_1 := θ_1 - \alpha \frac{1}{m} \displaystyle\sum_{i=1}^m (h_θ (x^{(i)} )-y^{(i)})x^{(i)}$
}
• $m$ => size of the training set
• $θ_0, θ_1$ => parameters that are updated simultaneously
• $x^{(i)}, y^{(i)}$ => values of the given training set
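
Putting it together – a sketch of batch gradient descent for one-variable linear regression; the toy data and hyperparameters are my own choices:

```python
import numpy as np

def gradient_descent(x, y, alpha=0.1, iterations=1000):
    """Batch gradient descent: every step sums over the whole training set."""
    theta0, theta1 = 0.0, 0.0
    m = len(y)
    for _ in range(iterations):
        error = (theta0 + theta1 * x) - y
        # Simultaneous update: the tuple is built before either theta changes
        theta0, theta1 = (theta0 - alpha * error.sum() / m,
                          theta1 - alpha * (error * x).sum() / m)
    return theta0, theta1

# Toy data lying exactly on y = 1 + 2x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([1.0, 3.0, 5.0, 7.0])
print(gradient_descent(x, y))  # approaches (1.0, 2.0)
```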

Gradient descent is called batch gradient descent when each step uses every example in the entire training set

### Example of gradient descent to minimize the cost function

[Figure: nine successive gradient descent steps on the cost surface; by Step 9 the global minimum was found – a good time to predict.]

## Linear Algebra

• Matrices are two-dimensional arrays

$\begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix}$

• Dimension: rows × columns

• A 3×2 matrix => $\mathbb{R}^{3\times 2}$
• Entry: $A_{ij}$ -> $i^{th}$ row, $j^{th}$ column

Ex. $A_{12}$ = b

• Vector – matrix with one column and many rows

$\begin{bmatrix} a \\ b \\ c \\ d \end{bmatrix}$

• Dimension: an n-element vector is n-dimensional

• The 4-element vector above -> $\mathbb{R}^4$
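
A quick NumPy illustration of these dimension conventions:

```python
import numpy as np

A = np.array([[1, 2],
              [3, 4],
              [5, 6]])               # a 3x2 matrix
v = np.array([[1], [2], [3], [4]])   # a 4-dimensional column vector

print(A.shape)   # (3, 2) -> rows x columns
print(A[0, 1])   # entry A_12 = 2 (NumPy indexes from 0)
print(v.shape)   # (4, 1)
```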

## Convention

• Uppercase letters – matrices (A, B, C, …)
• Lowercase letters – vectors (a, b, c, …)

## Scalar

Single value – not a matrix or vector

## Matrix addition

Add each corresponding element

$\begin{bmatrix} a & b \\ c & d \end{bmatrix}$ + $\begin{bmatrix} w & x \\ y & z \end{bmatrix}$ = $\begin{bmatrix} a + w & b + x \\ c + y & d + z \end{bmatrix}$

Example:

$\begin{bmatrix} 1 & 2 & 4 \\ 5 & 3 & 2 \end{bmatrix}$ + $\begin{bmatrix} 1 & 3 & 4 \\ 1 & 1 & 3 \end{bmatrix}$ = $\begin{bmatrix} 1 + 1 & 2 + 3 & 4 + 4 \\ 5 + 1 & 3 + 1 & 2 + 3 \end{bmatrix}$ = $\begin{bmatrix} 2 & 5 & 8 \\ 6 & 4 & 5 \end{bmatrix}$

Undefined for matrices with different dimensions

## Matrix subtracting

Subtract each corresponding element

$\begin{bmatrix} a & b \\ c & d \end{bmatrix}$ - $\begin{bmatrix} w & x \\ y & z \end{bmatrix}$ = $\begin{bmatrix} a - w & b - x \\ c - y & d - z \end{bmatrix}$

Example:

$\begin{bmatrix} 1 & 2 & 4 \\ 5 & 3 & 2 \end{bmatrix}$ - $\begin{bmatrix} 1 & 3 & 4 \\ 1 & 1 & 3 \end{bmatrix}$ = $\begin{bmatrix} 1 - 1 & 2 - 3 & 4 - 4 \\ 5 - 1 & 3 - 1 & 2 - 3 \end{bmatrix}$ = $\begin{bmatrix} 0 & -1 & 0 \\ 4 & 2 & -1 \end{bmatrix}$

Undefined for matrices with different dimensions

## Scalar multiplication

Multiply each element by the scalar value

$\begin{bmatrix} a & b \\ c & d \end{bmatrix}$ * $x$ = $\begin{bmatrix} a * x & b * x \\ c * x & d * x \end{bmatrix}$

Example:

$\begin{bmatrix} 1 & 2 & 4 \\ 5 & 3 & 2 \end{bmatrix}$ * $2$ = $\begin{bmatrix} 1 * 2 & 2 * 2 & 4 * 2 \\ 5 * 2 & 3 * 2 & 2 * 2 \end{bmatrix}$ = $\begin{bmatrix} 2 & 4 & 8 \\ 10 & 6 & 4 \end{bmatrix}$

## Scalar division

Divide each element by the scalar value

$\begin{bmatrix} a & b \\ c & d \end{bmatrix}$ / $x$ = $\begin{bmatrix} a / x & b / x \\ c / x & d / x \end{bmatrix}$

Example:

$\begin{bmatrix} 1 & 2 & 4 \\ 5 & 3 & 2 \end{bmatrix}$ / $2$ = $\begin{bmatrix} 1 / 2 & 2 / 2 & 4 / 2 \\ 5 / 2 & 3 / 2 & 2 / 2 \end{bmatrix}$ = $\begin{bmatrix} 0.5 & 1 & 2 \\ 2.5 & 1.5 & 1 \end{bmatrix}$
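
All four elementwise operations above map directly onto NumPy operators; a quick check using the matrices from the examples:

```python
import numpy as np

A = np.array([[1, 2, 4],
              [5, 3, 2]])
B = np.array([[1, 3, 4],
              [1, 1, 3]])

print(A + B)  # matrix addition:       [[2 5 8] [6 4 5]]
print(A - B)  # matrix subtraction:    [[0 -1 0] [4 2 -1]]
print(A * 2)  # scalar multiplication: [[2 4 8] [10 6 4]]
print(A / 2)  # scalar division:       [[0.5 1. 2.] [2.5 1.5 1.]]
```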

## Matrix-Vector Multiplication

$A * x = y$

where $A$ is an $m\times n$ matrix (m rows, n columns), $x$ is an $n\times 1$ column vector (n-dimensional), and the result $y$ is an m-dimensional vector.

To get $y_i$, multiply the elements of $A$'s $i^{th}$ row with the corresponding elements of vector $x$, and add them up:

$\begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix}$ $*$ $\begin{bmatrix} x \\ y \end{bmatrix}$ = $\begin{bmatrix} a * x + b * y \\ c * x + d * y \\ e * x + f * y \end{bmatrix}$

Example:

$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$ $*$ $\begin{bmatrix} 3 \\ 2 \\ 1 \end{bmatrix}$ = $\begin{bmatrix} 1 * 3 + 2 * 2 + 3 * 1 \\ 4 * 3 + 5 * 2 + 6 * 1 \\ 7 * 3 + 8 * 2 + 9 * 1 \end{bmatrix}$ = $\begin{bmatrix} 10 \\ 28 \\ 46 \end{bmatrix}$
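
The same example in NumPy, where `@` performs matrix multiplication:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])
x = np.array([3, 2, 1])

print(A @ x)  # [10 28 46]
```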

## Matrix-Matrix Multiplication

$A * B = C$

where $A$ is an $m\times n$ matrix (m rows, n columns), $B$ is an $n\times o$ matrix (n rows, o columns), and the result $C$ is an $m\times o$ matrix (m rows, o columns).

The $i^{th}$ column of the matrix $C$ is obtained by multiplying $A$ with the $i^{th}$ column of $B$ (for $i = 1, 2, …, o$)

$\begin{bmatrix} a & b \\ c & d \\ e & f\end{bmatrix}$ $*$ $\begin{bmatrix} w & x \\ y & z \end{bmatrix}$ = $\begin{bmatrix} a * w + b * y & a * x + b * z \\ c * w + d * y & c * x + d * z \\ e * w + f * y & e * x + f * z \end{bmatrix}$
$\textcolor{blue}{3\times 2}$ $\textcolor{blue}{2\times 2}$ $\textcolor{blue}{3\times 2}$

Example:

$\begin{bmatrix} 1 & 2 & 1 \\ 3 & 4 & 3 \\ 5 & 6 & 5 \end{bmatrix}$ $*$ $\begin{bmatrix} 1 & 2 \\ 3 & 2 \\ 1 & 3 \end{bmatrix}$ = $\begin{bmatrix} 1 * 1 + 2 * 3 + 1 * 1 & 1 * 2 + 2 * 2 + 1 * 3 \\ 3 * 1 + 4 * 3 + 3 * 1 & 3 * 2 + 4 * 2 + 3 * 3 \\ 5 * 1 + 6 * 3 + 5 * 1 & 5 * 2 + 6 * 2 + 5 * 3 \end{bmatrix}$ = $\begin{bmatrix} 8 & 9 \\ 18 & 23 \\ 28 & 37 \end{bmatrix}$

## Matrix Multiplication Properties

• Matrix multiplication is not commutative: $A * B \neq B * A$
• Matrix multiplication is associative: $(A * B) * C = A * (B * C)$
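
Both properties are easy to check numerically; a sketch with small matrices of my own choosing:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
C = np.array([[2, 0], [0, 2]])

print(np.array_equal(A @ B, B @ A))              # False: not commutative
print(np.array_equal((A @ B) @ C, A @ (B @ C)))  # True: associative
```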

## Identity Matrix

Denoted $I$ (or $I_{n\times n}$) – has 1's on the main diagonal (upper left to lower right) and 0's elsewhere

Example:

$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}$ $\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}$ $\begin{bmatrix} 1 & 0 & 0 & 0 & ... \\ 0 & 1 & 0 & 0 & ... \\ 0 & 0 & 1 & 0 & ... \\ 0 & 0 & 0 & 1 & ... \\ ... & ... & ... & ... & 1 \\ \end{bmatrix}$

For any matrix $A$

$A * I = I * A = A$
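
In NumPy, `np.eye(n)` builds $I_{n\times n}$, and the property checks out:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
I = np.eye(2, dtype=int)  # 2x2 identity matrix

print(np.array_equal(A @ I, A))  # True
print(np.array_equal(I @ A, A))  # True
```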

## Matrix Inverse

If $A$ is an $m\times m$ matrix and it has an inverse, then

$A * A^{-1} = A^{-1} * A = I$

Matrices that don’t have an inverse are called:

• Singular
• Degenerate
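
A sketch using `np.linalg.inv`; the second matrix (my own example) is singular, so inverting it raises an error:

```python
import numpy as np

A = np.array([[3.0, 4.0],
              [2.0, 16.0]])
A_inv = np.linalg.inv(A)
print(np.round(A @ A_inv))  # the identity matrix, up to floating point error

singular = np.array([[1.0, 2.0],
                     [2.0, 4.0]])  # rows are linearly dependent
try:
    np.linalg.inv(singular)
except np.linalg.LinAlgError as e:
    print("no inverse:", e)  # singular (degenerate) matrix
```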

## Matrix Transpose

Let $A$ be an $m\times n$ matrix and let $B = A^T$

Then $B$ is an $n\times m$ matrix and

$B_{ij} = A_{ji}$

Example:

$A = \begin{bmatrix} 1 & 2 & 0 \\ 3 & 5 & 9 \end{bmatrix}$, $B = A^T = \begin{bmatrix} 1 & 3 \\ 2 & 5 \\ 0 & 9 \end{bmatrix}$

The first row of $A$ becomes the first column of $A^T$.
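
The same example in NumPy, where `.T` gives the transpose:

```python
import numpy as np

A = np.array([[1, 2, 0],
              [3, 5, 9]])

print(A.T)                   # [[1 3] [2 5] [0 9]]
print(A.T[0, 1] == A[1, 0])  # B_ij = A_ji -> True
```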