Machine Learning Coursera’s Notes – Week 1 – C:\Users\Gosia Blog\

My notes from the Machine Learning Course provided by Andrew Ng on Coursera

What is Machine Learning?

"The field of study that gives computers the ability to learn without being explicitly programmed"

~Arthur Samuel (1959)

“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E"

~Tom Mitchell (1998)

Supervised learning

We know how the correct output should look like
We know the relationship between the input and output
Two types of supervised learning:
- Regression
- Classification

Regression

Predict values based on labelled data
Continuous valued output

Example: House size -> Sell prize

Classification

Predict discrete valued label

Example: Email -> Spam or not

Binary or Binary Classification – group data in two types of label

Multi-class or Multinomial Classification – group data in more than two kinds of label

Unsupervised learning

Have only dataset given (without labelled data)
Algorithms finding structure in data
No feedback based on the prediction results
Two types of unsupervised learning:
- Clustering
- Non-clustering

Clustering

Example: Collection on 1 000 000 different genes and find in automatically group these genes into groups that are somehow similar or related by different variables, such as lifespan, location, roles, etc.

Non-clustering

Example: Cocktail Party Algorithm – allows to find structure in a chaotic environment

Supervised Learning vs Unsupervised Learning

Supervised Learning will ALWAYS have an input-output pair
Unsupervised Learning will have ONLY data with no label and will try to find some structure

Model Representation

$x^{(i)}$ – input variable (input feature)
$y^{(i)}$ – output variable (target variable that we are trying to predict)
$x^{(i)},y^{(i)}$ – training example
$x^{(i)},y^{(i)});ⅈ=1,…,m$ – training set
Function $h:X\toY$ – hypothesis, the output of the learning algorithm, where $h(x)$ is a good predictor for $y$

Regression -> if $y$ is continuous

Classification – if $y$ is a small set of discrete values

Regression Problem (Linear Regression)

Linear Regression with one variable $x$ and two parameters $θ$

$h_θ (x)=θ_0+θ_1 x$ – linear function

Regression problem predict a continuous-valued label

Cost Function

The occurrence of Hypothesis Function can be measured by using Cost Function

Hypothesis: $h_θ (x)=θ_0+θ_1 x$
Parameters: $θ_0,θ_1$
Cost function: $J(θ_0,θ_1 )=\frac{1}{2m} \displaystyle\sum_{i=1}^m (h_θ (x_i )-y_i )^2$
Goal: $\frac{minimize}{θ_0,θ_1} J(θ_0,θ_1)$

Other names for this $J(θ_0,θ_1)$ cost function:

Squared error function
Mean squared error

Gradient descent

Finds local minimum for cost function $J(θ_0,θ_1)$

Gradient descent algorithm

repeat until convergence {
$θ_j ≔ θ_j-α \frac{∂}{∂θ_0} J(θ_0,θ_1)$
}

Parameter $θ$ should be simultaneously updated at each iteration

Example:

$temp0 ≔ θ_0-α \frac{∂}{∂θ_0} J(θ_0,θ_1)$
$temp1 ≔ θ_1-α \frac{∂}{∂θ_0} J(θ_0,θ_1)$
$θ_0 := temp0$
$θ_1 := temp1$

Gradient Descent Intuition

Finds local minimum
Coverages to a local minimum with fixed α because derivative becomes smaller (baby steps)

Learning Rate α

If α is too small, Gradient Descent is slow
If α is too large, Gradient Descent can overstep minimum and not coverage

Choosing a proper learning rate α in the beginning and stick to it at each iteration since Gradient Descent will automatically take smaller steps when approaching local minimum.

Gradient Descent for Linear Regression

Linear Regression with one variable

Gradient descent algorithm

repeat until convergence {
$θ_j ≔ θ_j-α \frac{∂}{∂θ_0} J(θ_0,θ_1)$
(for j=1 and j=0)
}

Linear Regression Model

$h_θ (x)=θ_0+θ_1 x$
$J(θ_0,θ_1 )=\frac{1}{2m} \displaystyle\sum_{i=1}^m (h_θ (x_i )-y_i )^2$

Calculate partial deviation $\frac{∂}{∂θ_0} J(θ_0,θ_1)$
$\frac{∂}{∂θ_0}J(θ_0,θ_1 )=\frac{1}{2m} \displaystyle\sum_{i=1}^m (h_θ (x_i )-y_i )$
$\frac{∂}{∂θ_1}J(θ_0,θ_1 )=\frac{1}{2m} \displaystyle\sum_{i=1}^m ((h_θ (x_i )-y_i)x_i)$
So gradient descent becomes
repeat until convergence {
$θ_0 := θ_0 - \alpha \frac{1}{m} \displaystyle\sum_{i=1}^m (h_θ (x^{(i)} )-y^{(i)})$
$θ_1 := θ_1 - \alpha \frac{1}{m} \displaystyle\sum_{i=1}^m (h_θ (x^{(i)} )-y^{(i)})x^{(i)}$
}
$m$ => size of the training set
$θ_0, θ_1$ => constants that are changing simultaneously
$x_i, y_i$ => values of the given training set

Gradient descent is called batch gradient if it needs to look at every example in the entire training set at each step

Example of gradient descent to minimalize cost function

Step 1

Step 2

Step 3

Step 4

Step 5

Step 6

Step 7

Step 8

Step 9 – a good time to predict (global minimum was found)

Linear Algebra

Matrices are two-dimensional arrays

$\begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix}$

Dimension: rows x columns
- 3×2 matrix => $\R^{3x2}$
Entry: $A_{ij}$ -> $i^{th}$ row, $j^{th}$ column

Ex. $A_{12} = b$

Vector – matrix with one column and many rows

$\begin{bmatrix} a \\ b \\ c \\ d \end{bmatrix}$

Dimension: n-dimensional vector
- 3-dimensional vector -> $\R^3$

Convention

Uppercase letters – matrices (A, B, C, …)
Lowercase letters – vectors (a, b, c, …)

Scalar

Single value – not a matrix or vector

Matrix addition

Add each corresponding element:

$\begin{bmatrix} a & b \\ c & d \end{bmatrix} + \begin{bmatrix} w & x \\ y & z \end{bmatrix} = \begin{bmatrix} a + w & b + x \\ c + y & d + z \end{bmatrix}$

Example:
$\begin{bmatrix} 1 & 2 & 4 \\ 5 & 3 & 2 \end{bmatrix} + \begin{bmatrix} 1 & 3 & 4 \\ 1 & 1 & 3 \end{bmatrix} = \begin{bmatrix} 1 + 1 & 2 + 3 & 4 + 4 \\ 5 + 1 & 3 + 1 & 2 + 3 \end{bmatrix} = \begin{bmatrix} 2 & 5 & 8 \\ 6 & 4 & 1 \end{bmatrix}$

Undefined for matrices with different dimension

Matrix subtracting

Subtract each corresponding element

$\begin{bmatrix} a & b \\ c & d \end{bmatrix} - \begin{bmatrix} w & x \\ y & z \end{bmatrix} = \begin{bmatrix} a - w & b - x \\ c - y & d - z \end{bmatrix}$

Example:

$\begin{bmatrix} 1 & 2 & 4 \\ 5 & 3 & 2 \end{bmatrix} - \begin{bmatrix} 1 & 3 & 4 \\ 1 & 1 & 3 \end{bmatrix} = \begin{bmatrix} 1 - 1 & 2 - 3 & 4 - 4 \\ 5 - 1 & 3 - 1 & 2 - 3 \end{bmatrix} = \begin{bmatrix} 0 & -1 & 0 \\ 4 & 2 & -1 \end{bmatrix}$

Undefined for matrices with different dimension

Scalar multiplication

Multiply each element by the scalar value

$\begin{bmatrix} a & b \\ c & d \end{bmatrix} * x = \begin{bmatrix} a * x & b * x \\ c * x & d * x \end{bmatrix}$

Example:

$\begin{bmatrix} 1 & 2 & 4 \\ 5 & 3 & 2 \end{bmatrix} * 2 = \begin{bmatrix} 1 * 2 & 2 * 2 & 4 * 2 \\ 5 * 2 & 3 * 2 & 2 * 2 \end{bmatrix} = \begin{bmatrix} 2 & 4 & 8 \\ 10 & 6 & 4 \end{bmatrix}$

Scalar division

Divide each element by the scalar value

$\begin{bmatrix} a & b \\ c & d \end{bmatrix} / x = \begin{bmatrix} a / x & b / x \\ c / x & d / x \end{bmatrix}$

Example:

$\begin{bmatrix} 1 & 2 & 4 \\ 5 & 3 & 2 \end{bmatrix} / 2 = \begin{bmatrix} 1 / 2 & 2 / 2 & 4 / 2 \\ 5 / 2 & 3 / 2 & 2 / 2 \end{bmatrix} = \begin{bmatrix} 0.5 & 1 & 2 \\ 2.5 & 1.5 & 1 \end{bmatrix}$

Matrix-Vector Multiplication

A	x	X	=	Y
$\begin{bmatrix} & & & \\ & & & \end{bmatrix}$	x	$\begin{bmatrix} & \\ & \\ & \end{bmatrix}$	=	$\begin{bmatrix} & \\ & \end{bmatrix}$
$mxn$		$nx1$		$m-dimensional \ vector$
(m rows, n columns)		(n-dimensional vector)

To get $y_i$ , multiply A’s $i^{th}$ row with an element of vector $x$ , and add them up:

$\begin{bmatrix} a & b \\ c & d \\ e & f \end{bmatrix} * \begin{bmatrix} x \\ y \end{bmatrix} = \begin{bmatrix} a * x & b * y \\ c * x & d * y \\ e * x & f * y \end{bmatrix}$

Example:

$\begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} * \begin{bmatrix} 3 \\ 2 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 * 3 & 2 * 2 & 3 * 1 \\ 4 * 3 & 5 * 2 & 6 * 1 \\ 7 * 3 & 8 * 2 & 9 * 1 \end{bmatrix} = \begin{bmatrix} 10 \\ 28 \\ 46 \end{bmatrix}$

Matrix-Matrix Multiplication

A	x	X	=	Y
$\begin{bmatrix} & & & \\ & & & \end{bmatrix}$	x	$\begin{bmatrix} & & & \\ & & & \end{bmatrix}$	=	$\begin{bmatrix} & & &\\ & & & \end{bmatrix}$
$mxn$ matrix		$nxo$ matrix		$mxo$ matrix
(m rows, n columns)		(n rows, o columns)		(m rows, o columns)

The $i^{th}$ column of the matrix $C$ is obtained by multiplying $A$ with the ith column of $B$ (for $i = 1, 2,\dots o$ )

$\begin{bmatrix} a & b \\ c & d \\ e & f\end{bmatrix}$	$*$	$\begin{bmatrix} w & x \\ y & z \end{bmatrix}$	=	$\begin{bmatrix} a * w + b * y & a * x + b * z \\ c * w + d * y & c * x + d * z \\ e * w + f * y & e * x + f * z \end{bmatrix}$
$\textcolor{blue}{3x2}$		$\textcolor{blue}{2x2}$		$\textcolor{blue}{3x2}$

Example:

$\begin{bmatrix} 1 & 2 & 1 \\ 3 & 4 & 3 \\ 5 & 6 & 5 \end{bmatrix} * \begin{bmatrix} 1 & 2 \\ 3 & 2 \\ 1 & 3 \end{bmatrix} = \begin{bmatrix} 1 * 1 + 2 * 3 + 1 * 1 & 1 * 2 + 2 * 2 + 1 * 3 \\ 3 * 1 + 4 * 3 + 3 * 1 & 3 * 2 + 4 * 2 + 3 * 3 \\ 5 * 1 + 6 * 3 + 5 * 1 & 5 * 2 + 6 * 2 + 5 * 3 \end{bmatrix} = \begin{bmatrix} 8 & 9 \\ 18 & 23 \\ 28 & 37 \end{bmatrix}$

Matrix Multiplication Properties

Matrices are not commutative: $A * B != B * C$
Matrices are associative: $(A * B) * C = A * (B * C)$

Identity Matrix

Denoted $I$ (or $I_{nxn}$ ) = simply – has 1’s in the diagonal (upper left to lower right diagonal) and 0’s elsewhere

Example:

$\begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 0 & 0 & ... \\ 0 & 1 & 0 & 0 & ... \\ 0 & 0 & 1 & 0 & ... \\ 0 & 0 & 0 & 1 & ... \\ ... & ... & ... & ... & 1 \\ \end{bmatrix}$

For any matrix $A$

$A * I = I * A = A$

Matrix Inverse

If is an $mxm$ matrix, and it has an inverse

$A(A^{-1}) = A^{-1} * A = I$

Matrices that don’t have an inverse are:

Singular
Degenerate

Matrix Transpose

Let A be on $mxn$ matrix and let $B = A^T$

Then $B$ is $nxm$ matrix and

$B_{ij} = A_{ji}$

Example:

$A = \begin{bmatrix} 1 & 2 & 0 \\ 3 & 5 & 9 \end{bmatrix} B = A^T \begin{bmatrix} 1 & 3 \\ 2 & 5 \\ 0 & 9 \end{bmatrix}$

The first row of $A$ becomes first columns of $A^T$

Cookie	Duration	Description
cookielawinfo-checkbox-analytics	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Analytics".
cookielawinfo-checkbox-functional	11 months	The cookie is set by GDPR cookie consent to record the user consent for the cookies in the category "Functional".
cookielawinfo-checkbox-necessary	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookies is used to store the user consent for the cookies in the category "Necessary".
cookielawinfo-checkbox-others	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Other.
cookielawinfo-checkbox-performance	11 months	This cookie is set by GDPR Cookie Consent plugin. The cookie is used to store the user consent for the cookies in the category "Performance".
viewed_cookie_policy	11 months	The cookie is set by the GDPR Cookie Consent plugin and is used to store whether or not user has consented to the use of cookies. It does not store any personal data.

Machine Learning Coursera’s Notes – Week 1

Recent Posts

Categories

What is Machine Learning?