1 Predictive tasks vs Descriptive tasks

  1. Predictive tasks: Predict the value of a particular attribute based on the values of other attributes

  2. Descriptive tasks: Find human-interpretable patterns that describe data

2 Data

2.1 Variable: A characteristic of an object

Features, Attributes, Dimension, Field

2.2 Object: A collection of attributes describes an object

Entity, Instance, Event, Case, Record, Observation

3 Data Quality

  1. Range: How narrow or wide is the scope of the data?

  2. Relevancy: Is the data relevant to the problem?

  3. Recency: How recently was the data generated?

  4. Robustness: Signal to noise ratio

  5. Reliability: How accurate is the data?

4 Applications

  1. Web mining: recommendation systems

  2. Screening images: Early warning of ecological disasters

  3. Marketing and sales

  4. Diagnosis

  5. Load forecasting

  6. Decisions involving judgement

Many more…

5 Machine Learning Algorithms

  1. Supervised learning algorithms

    Deals with labelled datasets

  2. Unsupervised learning algorithm

    Deals with unlabelled datasets

6 Loss function

\(e_i = y_i - \hat{y}_i\)

\(e_i^2 = (y_i - \hat{y}_i)^2\)

7 Cost function

8 Prediction accuracy measures (cost functions)

8.1 Mean Error

\[ME = \frac{1}{n}\sum_{i=1}^n e_i\]

  • Errors can be both negative and positive, so they can cancel each other out in the summation.

8.2 Mean Absolute Error (L1 loss)

\[MAE = \frac{1}{n}\sum_{i=1}^n |e_i|\]

8.3 Mean Squared Error (L2 loss)

\[MSE = \frac{1}{n}\sum_{i=1}^n e^2_i\]

8.4 Mean Percentage Error

\[MPE = \frac{1}{n}\sum_{i=1}^n \frac{e_i}{y_i}\]

8.5 Mean Absolute Percentage Error

\[MAPE = \frac{1}{n}\sum_{i=1}^n |\frac{e_i}{y_i}|\]

8.6 Root Mean Squared Error

\[RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n e^2_i}\]
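As a quick illustration (with made-up values, not from the notes), all of the measures above can be computed directly in R:

# Hypothetical observed and predicted values, for illustration only
y    <- c(10, 12, 15, 11, 14)
yhat <- c(11, 11, 16, 10, 15)
e    <- y - yhat                 # errors e_i

c(ME   = mean(e),                # mean error
  MAE  = mean(abs(e)),           # mean absolute error (L1)
  MSE  = mean(e^2),              # mean squared error (L2)
  RMSE = sqrt(mean(e^2)),        # root mean squared error
  MPE  = mean(e / y),            # mean percentage error
  MAPE = mean(abs(e / y)))       # mean absolute percentage error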

8.7 Visualization of error distribution

Graphical representations reveal more than metrics alone.

8.8 Accuracy Measures on Training Set vs Test Set

Accuracy measures on the training set: indicate how well the model fits the data it was trained on

Accuracy measures on the test set: indicate the model's ability to predict new data

8.9 Evaluate Classifier Against Benchmark

Naive approach: relies solely on \(Y\)

Outcome: Numeric

Naive Benchmark: Average (\(\bar{Y}\))

A good prediction model should outperform the benchmark criterion in terms of predictive accuracy.

8.10 Accuracy evaluation: Categorical

Confusion matrix / classification matrix, with cell counts \(a\), \(b\), \(c\), \(d\) and \(n = a + b + c + d\) records in total:

\[\text{error} = \frac{c+b}{n}\]

\[\text{accuracy} = \frac{a+d}{n}\]

8.11 Performance in Case of Unequal Importance of Classes

Suppose the most important class is “Yes”

\[\text{sensitivity} = \frac{a}{a+b}\]

\[\text{specificity} = \frac{d}{c+d}\]

\[\text{False Discovery Rate} = \frac{b}{a+b}\]

\[\text{False Omission Rate} = \frac{c}{c+d}\]
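As a minimal sketch with hypothetical counts (rows = actual class, columns = predicted class, and “Yes” the important class), the accuracy, error, sensitivity, and specificity above can be computed directly from the cells of the confusion matrix:

# Hypothetical 2 x 2 confusion matrix: rows = actual, columns = predicted
cm <- matrix(c(40,  5,    # actual "Yes"
                8, 47),   # actual "No"
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("Yes", "No"),
                             predicted = c("Yes", "No")))
n <- sum(cm)

c(accuracy    = (cm["Yes", "Yes"] + cm["No", "No"]) / n,
  error       = (cm["Yes", "No"]  + cm["No", "Yes"]) / n,
  sensitivity = cm["Yes", "Yes"] / sum(cm["Yes", ]),   # correct "Yes" out of all actual "Yes"
  specificity = cm["No", "No"]   / sum(cm["No", ]))    # correct "No"  out of all actual "No"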

9 Classification and Regression Trees (CART)

9.1 Model

\[Y = f(X_1, X_2,... X_n) + \epsilon\] Goal: What is \(f\)?

9.2 How do we estimate \(f\) ?

Data-driven methods:

estimate \(f\) using observed data without making explicit assumptions about the functional form of \(f\).

Parametric methods:

estimate \(f\) using observed data by making assumptions about the functional form of \(f\).

9.3 Classification and Regression Trees

  1. Classification tree - Outcome is categorical

  2. Regression tree - Outcome is numeric

9.4 Classification and Regression Trees

  • CART models work by partitioning the feature space into a number of simple rectangular regions, divided up by axis-parallel splits.

  • Each split is a logical rule that divides the feature space into two non-overlapping subregions.

9.5 Example: Feature space

Features: Sepal Length, Sepal Width

Outcome: setosa/versicolor

## Extracted only two species for easy explanation
data <- iris[1:100,]
library(ggplot2)
library(viridis)
ggplot(data, aes(x=Sepal.Length, y=Sepal.Width, col=Species)) + geom_point() + scale_color_manual(values = c("#1b9e77", "#d95f02")) + coord_fixed()

9.6 Decision tree

# Load rpart and rpart.plot
library(rpart)
library(rpart.plot)
# Create a decision tree model
tree <- rpart(Species~Sepal.Length + Sepal.Width, data=data, cp=.02)
# Visualize the decision tree with rpart.plot
rpart.plot(tree, box.palette="RdBu", shadow.col="gray", nn=TRUE)

9.7 Parts of a decision tree

  • Root node

  • Decision node

  • Terminal node/ Leaf node (gives outputs/class assignments)

  • Subtree

9.9 Decision tree

# Load rpart and rpart.plot
library(rpart)
library(rpart.plot)
# Create a decision tree model
tree <- rpart(Species~Sepal.Length + Sepal.Width, data=data, cp=.02)
# Visualize the decision tree with rpart.plot
rpart.plot(tree, box.palette="RdBu", shadow.col="gray", nn=TRUE)

9.10 Root node split

ggplot(data, aes(x=Sepal.Length, y=Sepal.Width, col=Species)) + geom_point() + scale_color_manual(values = c("#1b9e77", "#d95f02")) + coord_fixed() + geom_vline(xintercept = 5.5) 

9.11 Root node split, Decision node split - right

ggplot(data, aes(x=Sepal.Length, y=Sepal.Width, col=Species)) + geom_point() + scale_color_manual(values = c("#1b9e77", "#d95f02")) + coord_fixed() + geom_vline(xintercept = 5.5) + geom_hline(yintercept = 3)

9.12 Root node split, Decision node splits

ggplot(data, aes(x=Sepal.Length, y=Sepal.Width, col=Species)) + geom_point() + scale_color_manual(values = c("#1b9e77", "#d95f02")) + coord_fixed() + geom_vline(xintercept = 5.5) + geom_hline(yintercept = 3) + geom_hline(yintercept = 3.3)

9.13 Shallow decision tree

# Create a decision tree model
tree <- rpart(Species~Sepal.Length + Sepal.Width, data=data, cp=.5)
# Visualize the decision tree with rpart.plot
rpart.plot(tree, box.palette="RdBu", shadow.col="gray", nn=TRUE)

9.14 Two key ideas underlying trees

  • Recursive partitioning (for constructing the tree)

  • Pruning (for cutting the tree back)

  • Pruning is a useful strategy for avoiding overfitting.

  • There are alternative methods for avoiding overfitting as well.

9.15 Constructing Classification Trees

Recursive Partitioning

  • Recursive partitioning splits P-dimensional feature space into nonoverlapping multidimensional rectangles.

  • The division is accomplished recursively (i.e., operating on the results of prior divisions)

9.16 Main questions

  • Splitting variable

    Which attribute/ feature should be placed at the root node?

    Which features will act as internal nodes?

  • Splitting point

  • We look for the split that most increases the homogeneity of the resulting subsets (i.e., makes them as “pure” as possible).

9.17 Example

A split that increases the homogeneity:

ggplot(data, aes(x=Sepal.Length, y=Sepal.Width, col=Species)) + geom_point() + scale_color_manual(values = c("#1b9e77", "#d95f02")) + coord_fixed() 

9.18 Example (cont.)

A split that increases the homogeneity:

ggplot(data, aes(x=Sepal.Length, y=Sepal.Width, col=Species)) + geom_point() + scale_color_manual(values = c("#1b9e77", "#d95f02")) + coord_fixed() + geom_vline(xintercept = 5.5) 

9.19 Key idea

  1. Iteratively split variables into groups

  2. Evaluate “homogeneity” within each group

  3. Split again if necessary

9.20 How does a decision tree determine the best split?

A decision tree uses an impurity measure, such as entropy (via information gain) or the Gini index, to select the feature and split point that give the best split.

9.21 Measures of Impurity

  • An impurity measure is a heuristic for selection of the splitting criterion that best separates a given feature space.

  • The two most popular measures

    • Gini index

    • Entropy measure

9.22 Gini index

Gini index for rectangle \(A\) is defined by

\[I(A) = 1- \sum_{k=1}^mp_k^2\]

\(p_k\) - proportion of records in rectangle \(A\) that belong to class \(k\)

  • Gini index takes value 0 when all the records belong to the same class.

9.23 Gini index (cont)

In the two-class case, the Gini index reaches its maximum value of 0.5 when \(p_1 = p_2 = 0.5\).

9.24 Entropy measure

\[\text{entropy}(A) = - \sum_{k=1}^{m}p_k \log_2(p_k)\]
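A small sketch of both impurity measures as R functions of the class proportions \(p_k\) (not from the slides, just to make the formulas concrete):

# Impurity of a node, given the vector of class proportions p
gini    <- function(p) 1 - sum(p^2)
entropy <- function(p) {
  p <- p[p > 0]            # treat 0 * log2(0) as 0
  -sum(p * log2(p))
}

gini(c(1, 0))           # pure node: 0
gini(c(0.5, 0.5))       # two-class maximum: 0.5
entropy(c(0.5, 0.5))    # two-class maximum: 1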

9.25 Example: Calculation (left)

df <- data.frame(x=rep(c(2, 4, 6, 8), each=4),
                 y=rep(c(2, 4, 6, 8), times=4), col=factor(c(rep("red", 15), "blue")))
ggplot(df, aes(x=x, y=y, col=col)) + geom_point(size=4)

9.26 Example: calculation (right) (cont.)

df <- data.frame(x=rep(c(2, 4, 6, 8), each=4),
                 y=rep(c(2, 4, 6, 8), times=4), col=factor(c(rep("red", 8), rep("blue", 8))))
ggplot(df, aes(x=x, y=y, col=col)) + geom_point(size=4)
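Using the class proportions from the two panels above (left: 15 red and 1 blue; right: 8 of each), the impurity values work out as follows:

# Left panel: 15 red, 1 blue
p_left <- c(15, 1) / 16
1 - sum(p_left^2)              # Gini    = 0.1171875
-sum(p_left * log2(p_left))    # entropy ~ 0.337

# Right panel: 8 red, 8 blue
p_right <- c(8, 8) / 16
1 - sum(p_right^2)             # Gini    = 0.5 (two-class maximum)
-sum(p_right * log2(p_right))  # entropy = 1   (two-class maximum)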

9.27 Finding the best threshold split?

In-class demonstration

9.28 Overfitting in decision trees

  • Overfitting refers to the condition where the model fits the training data (almost) perfectly but fails to generalize to unseen test data.

  • If a decision tree is fully grown or when you increase the depth of the decision tree, it may lose some generalization capability.

  • Pruning is a technique that is used to reduce overfitting. Pruning simplifies a decision tree by removing the weakest rules.

9.29 Stopping criteria

  • Tree depth (number of splits)

  • Minimum number of records in a terminal node

  • Minimum reduction in impurity

  • Complexity parameter (\(CP\)) - available in the rpart package

9.30 Pre-pruning (early stopping)

  • Stop the learning algorithm before the tree becomes too complex

  • Hyperparameters of the decision tree algorithm that can be tuned to get a robust model

max_depth

min_samples_leaf

min_samples_split

9.31 Post pruning

Simplify the tree after the learning algorithm terminates

The idea here is to allow the decision tree to grow fully, observe the \(CP\) values, and then prune the tree back.

9.32 Simplify the tree after the learning algorithm terminates

  • Complexity of tree is measured by number of leaves.

\(L(T) = \text{number of leaf nodes}\)

  • The more leaf nodes you have, the more complexity.

  • We need a balance between complexity and predictive power

Total cost = measure of fit + measure of complexity

9.33 Total cost = measure of fit + measure of complexity

measure of fit: error

measure of complexity: number of leaf nodes (\(L(T)\))

\(\text{Total cost } (C(T)) = Error(T) + \lambda L(T)\)

The parameter \(\lambda\) trades off complexity against predictive power: it is a penalty factor for tree size.

\(\lambda = 0\): Fully grown decision tree

\(\lambda = \infty\): Root node only

\(\lambda\) between 0 and \(\infty\) balances predictive power and complexity.
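A minimal sketch of post-pruning with rpart (reusing the two-species iris subset from earlier; the cp value chosen below is only for illustration): grow the tree fully, inspect the complexity-parameter table, and prune back at a chosen \(CP\).

library(rpart)
data <- iris[1:100,]

# Grow a (nearly) full tree by turning off the complexity penalty
full_tree <- rpart(Species ~ Sepal.Length + Sepal.Width, data = data,
                   control = rpart.control(cp = 0, minsplit = 2))

printcp(full_tree)                     # CP table: complexity vs cross-validated error
pruned <- prune(full_tree, cp = 0.02)  # cut the tree back at the chosen CP value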

9.34 Example: candidate for pruning (in-class)

# Load rpart and rpart.plot
library(rpart)
library(rpart.plot)
# Create a decision tree model
tree <- rpart(Species~Sepal.Length + Sepal.Width, data=data, cp=.02)
# Visualize the decision tree with rpart.plot
rpart.plot(tree, box.palette="RdBu", shadow.col="gray", nn=TRUE)

9.35 Classification trees - label of terminal node

labels are based on majority votes.

9.36 Regression Trees

# Load rpart and rpart.plot
library(rpart)
library(rpart.plot)
# Create a decision tree model
tree <- rpart(Petal.Length~Sepal.Length + Sepal.Width, data=data, cp=.02)
# Visualize the decision tree with rpart.plot
rpart.plot(tree, box.palette="RdBu", shadow.col="gray", nn=TRUE)

9.37 Regression Trees

Value of the terminal node: average outcome value of the training records that were in that terminal node.

Your turn: Impurity measures for regression tree

9.38 Decision trees - advantages

  • Easy to interpret

  • Better performance in non-linear settings

  • No feature scaling required

9.39 Decision trees - disadvantages

  • Unstable: adding a new data point or a small amount of noise can change the splits, so the whole tree may need to be regrown and all nodes recalculated.

  • Not suitable for large datasets

9.40 Decision Tree

data <- iris[1:100,]
library(rpart)
library(rpart.plot)
# Create a decision tree model
tree <- rpart(Species~Sepal.Length + Sepal.Width, data=data, cp=.02)
# Visualize the decision tree with rpart.plot
rpart.plot(tree, box.palette="RdBu", shadow.col="gray", nn=TRUE)

9.41 Decision boundary

library(tidyverse)
ggplot(data, aes(x=Sepal.Length, y=Sepal.Width, col=Species)) + geom_point() + scale_color_manual(values = c("#1b9e77", "#d95f02")) + coord_fixed() + geom_vline(xintercept = 5.5) + geom_hline(yintercept = 3) + geom_hline(yintercept = 3.3)

9.42 Decision trees - Limitation

To capture a complex decision boundary we need to use a deep tree

In-class explanation

9.43 Bias-Variance Tradeoff

  • A deep decision tree has low bias and high variance.

9.44 Bagging (Bootstrap Aggregation)

  • Technique for reducing the variance of an estimated prediction function

  • Works well for high-variance, low-bias procedures, such as trees

9.45 Ensemble Methods

  • Combines several base models

  • Bagging (Bootstrap Aggregation) is an ensemble method

9.46 Ensemble Methods

“Ensemble learning gives credence to the idea of the “wisdom of crowds,” which suggests that the decision-making of a larger group of people is typically better than that of an individual expert.”

Source: https://www.ibm.com/cloud/learn/boosting

9.47 Bootstrap

  • Generate multiple samples of training data, via bootstrapping

Example

Training data: \(\{(y_1, x_1), (y_2, x_2), (y_3, x_3), (y_4, x_4)\}\)

Three samples generated from bootstrapping

Sample 1 = \(\{(y_1, x_1), (y_2, x_2), (y_3, x_3), (y_4, x_4)\}\)

Sample 2 = \(\{(y_1, x_1), (y_1, x_1), (y_1, x_1), (y_4, x_4)\}\)

Sample 3 = \(\{(y_1, x_1), (y_2, x_2), (y_1, x_1), (y_4, x_4)\}\)
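A quick sketch of how such samples are drawn in R (the toy training data below is hypothetical): each bootstrap sample has the same size as the training set and is drawn with replacement, so some records repeat and others are left out.

set.seed(42)
train <- data.frame(y = c(3, 5, 2, 7), x = c(1.2, 0.7, 2.1, 1.8))

# Three bootstrap samples of the training rows, drawn with replacement
boot_samples <- lapply(1:3, function(b) train[sample(nrow(train), replace = TRUE), ])
boot_samples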

9.48 Aggregation

  • Train a decision tree on each bootstrap sample of data without pruning.

  • Aggregate prediction using either voting or averaging

9.49 Bagging - in class diagram

9.50 Bagging

Pros

  • Ease of implementation

  • Reduction of variance

Cons

  • Loss of interpretability

  • Computationally expensive

9.51 Bagging

  • Bootstrapped subsamples are created

  • A Decision Tree is formed on each bootstrapped sample.

  • The results of each tree are aggregated
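These three steps can be sketched by hand in R (a minimal illustration on the two-species iris subset used earlier, not the document's own implementation):

library(rpart)
data <- iris[1:100,]
data$Species <- droplevels(data$Species)

set.seed(1)
B <- 25
# 1.-2. Draw bootstrapped subsamples and fit an (unpruned) tree on each
trees <- lapply(seq_len(B), function(b) {
  idx <- sample(nrow(data), replace = TRUE)
  rpart(Species ~ Sepal.Length + Sepal.Width, data = data[idx, ],
        control = rpart.control(cp = 0))
})

# 3. Aggregate the trees' predictions by majority vote
votes <- sapply(trees, function(t) as.character(predict(t, data, type = "class")))
bagged_pred <- apply(votes, 1, function(v) names(which.max(table(v))))
mean(bagged_pred == data$Species)   # training-set accuracy of the bagged ensemble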

9.52 Random Forests: Improving on Bagging

  • The ensembles of trees in Bagging tend to be highly correlated.

  • All of the bagged trees will look quite similar to each other. Hence, the predictions from the bagged trees will be highly correlated.

9.53 Random Forests

  1. Bootstrap samples

  2. At each split, randomly select a set of predictors from the full set of predictors

  3. From the selected predictors we select the optimal predictor and the optimal corresponding threshold for the split.

  4. Grow multiple trees and aggregate

9.54 Random Forests - Hyper parameters

  1. Number of variables randomly sampled as candidates at each split

  2. Number of trees to grow

  3. Minimum size of terminal nodes. Setting this number larger causes smaller trees to be grown (and thus take less time).

Note: In theory, each tree in the random forest is fully grown (not pruned), but in practice this can be computationally expensive; thus, imposing a minimum node size is not unusual.
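As an illustration (not from the slides), these hyperparameters correspond to the mtry, ntree, and nodesize arguments of randomForest():

library(randomForest)
data(iris)
rf <- randomForest(Species ~ ., data = iris,
                   mtry = 2,        # 1. variables randomly sampled at each split
                   ntree = 500,     # 2. number of trees to grow
                   nodesize = 5)    # 3. minimum size of terminal nodes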

9.55 Random Forests

  • Bagging ensemble method

  • Gives final prediction by aggregating the predictions of bootstrapped decision tree samples.

  • Trees in a random forest are grown independently of each other.

9.56 Random Forests

Pros

  • Accuracy

Cons

  • Speed

  • Interpretability

  • Overfitting

9.57 Out-of-bag error

With ensemble methods, we get a new metric for assessing the predictive performance of the model: the out-of-bag (OOB) error.
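A minimal sketch (assuming the randomForest package): the OOB error is reported automatically, without a separate validation set.

library(randomForest)
set.seed(123)
rf <- randomForest(Species ~ ., data = iris, ntree = 500)
rf                  # printed summary includes the OOB estimate of the error rate
head(rf$err.rate)   # OOB (and per-class) error, tree by tree
plot(rf)            # OOB error as a function of the number of trees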

9.58 Random Forests, Out-of-Bag (OOB) Samples, and Predictions Based on OOB Observations

(In-class diagrams.)

9.71 Variable Importance in Random Forest

A variable's importance is its contribution to predictive accuracy. Two common measures:

  • Permutation-based variable importance

  • Mean decrease in Gini coefficient

9.72 Permutation-based variable importance

  • the OOB samples are passed down the tree, and the prediction accuracy is recorded

  • the values for the \(j^{th}\) variable are randomly permuted in the OOB samples, and the accuracy is again computed.

  • the decrease in accuracy as a result of this permuting is averaged over all trees, and is used as a measure of the importance of variable \(j\) in the random forests
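A small sketch covering both importance measures (assuming the randomForest package; importance = TRUE must be set at fit time to obtain the permutation-based measure):

library(randomForest)
set.seed(123)
rf <- randomForest(Species ~ ., data = iris, importance = TRUE)
importance(rf, type = 1)   # permutation-based: mean decrease in accuracy
importance(rf, type = 2)   # mean decrease in Gini (node impurity)
varImpPlot(rf)             # plots both measures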

9.73 Mean decrease in Gini coefficient

  • Measure of how each variable contributes to the homogeneity of the nodes and leaves in the resulting random forest

  • The higher the value of mean decrease accuracy or mean decrease Gini score, the higher the importance of the variable in the model

10 Practical Session

10.1 Packages

library(tidymodels)
## ── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──
## ✔ broom        1.0.0     ✔ rsample      1.1.0
## ✔ dials        1.0.0     ✔ tune         1.0.0
## ✔ infer        1.0.2     ✔ workflows    1.0.0
## ✔ modeldata    1.0.0     ✔ workflowsets 1.0.0
## ✔ parsnip      1.0.0     ✔ yardstick    1.0.0
## ✔ recipes      1.0.1
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard() masks purrr::discard()
## ✖ dplyr::filter()   masks stats::filter()
## ✖ recipes::fixed()  masks stringr::fixed()
## ✖ dplyr::lag()      masks stats::lag()
## ✖ dials::prune()    masks rpart::prune()
## ✖ yardstick::spec() masks readr::spec()
## ✖ recipes::step()   masks stats::step()
## • Use suppressPackageStartupMessages() to eliminate package startup messages
library(tidyverse)
library(palmerpenguins)
## 
## Attaching package: 'palmerpenguins'
## The following object is masked from 'package:modeldata':
## 
##     penguins
library(rpart)
library(skimr)
library(rpart.plot)

10.2 Data

data(penguins)
skim(penguins)
Data summary
Name penguins
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
species 0 1.00 FALSE 3 Ade: 152, Gen: 124, Chi: 68
island 0 1.00 FALSE 3 Bis: 168, Dre: 124, Tor: 52
sex 11 0.97 FALSE 2 mal: 168, fem: 165

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
bill_length_mm 2 0.99 43.92 5.46 32.1 39.23 44.45 48.5 59.6 ▃▇▇▆▁
bill_depth_mm 2 0.99 17.15 1.97 13.1 15.60 17.30 18.7 21.5 ▅▅▇▇▂
flipper_length_mm 2 0.99 200.92 14.06 172.0 190.00 197.00 213.0 231.0 ▂▇▃▅▂
body_mass_g 2 0.99 4201.75 801.95 2700.0 3550.00 4050.00 4750.0 6300.0 ▃▇▆▃▂
year 0 1.00 2008.03 0.82 2007.0 2007.00 2008.00 2009.0 2009.0 ▇▁▇▁▇

10.3 Split data

set.seed(123)
penguin_split <- initial_split(penguins)
penguin_train <- training(penguin_split)
dim(penguin_train)
## [1] 258   8
head(penguin_train)
## # A tibble: 6 × 8
##   species   island    bill_length_mm bill_depth_mm flipper…¹ body_…² sex    year
##   <fct>     <fct>              <dbl>         <dbl>     <int>   <int> <fct> <int>
## 1 Gentoo    Biscoe              44.5          14.3       216    4100 <NA>   2007
## 2 Adelie    Torgersen           38.6          21.2       191    3800 male   2007
## 3 Gentoo    Biscoe              45.3          13.7       210    4300 fema…  2008
## 4 Chinstrap Dream               52.8          20         205    4550 male   2008
## 5 Adelie    Torgersen           37.3          20.5       199    3775 male   2009
## 6 Chinstrap Dream               43.2          16.6       187    2900 fema…  2007
## # … with abbreviated variable names ¹​flipper_length_mm, ²​body_mass_g
penguin_test <- testing(penguin_split)
dim(penguin_test)
## [1] 86  8

10.4 Build decision tree

tree1 <- rpart(species ~ ., penguin_train,  cp = 0.1)
rpart.plot(tree1, box.palette="RdBu", shadow.col="gray", nn=TRUE)

tree2 <- rpart(species ~ ., penguin_train,  cp = 0.5)
rpart.plot(tree2, box.palette="RdBu", shadow.col="gray", nn=TRUE)

10.5 Predict

predict(tree1, penguin_test)
##        Adelie  Chinstrap     Gentoo
## 1  0.95726496 0.04273504 0.00000000
## 2  0.95726496 0.04273504 0.00000000
## 3  0.95726496 0.04273504 0.00000000
## 4  0.95726496 0.04273504 0.00000000
## 5  0.95726496 0.04273504 0.00000000
## 6  0.95726496 0.04273504 0.00000000
## 7  0.95726496 0.04273504 0.00000000
## 8  0.95726496 0.04273504 0.00000000
## 9  0.95726496 0.04273504 0.00000000
## 10 0.95726496 0.04273504 0.00000000
## 11 0.95726496 0.04273504 0.00000000
## 12 0.95726496 0.04273504 0.00000000
## 13 0.95726496 0.04273504 0.00000000
## 14 0.95726496 0.04273504 0.00000000
## 15 0.95726496 0.04273504 0.00000000
## 16 0.95726496 0.04273504 0.00000000
## 17 0.95726496 0.04273504 0.00000000
## 18 0.95726496 0.04273504 0.00000000
## 19 0.95726496 0.04273504 0.00000000
## 20 0.95726496 0.04273504 0.00000000
## 21 0.95726496 0.04273504 0.00000000
## 22 0.04545455 0.93181818 0.02272727
## 23 0.95726496 0.04273504 0.00000000
## 24 0.95726496 0.04273504 0.00000000
## 25 0.95726496 0.04273504 0.00000000
## 26 0.95726496 0.04273504 0.00000000
## 27 0.95726496 0.04273504 0.00000000
## 28 0.95726496 0.04273504 0.00000000
## 29 0.01030928 0.04123711 0.94845361
## 30 0.95726496 0.04273504 0.00000000
## 31 0.95726496 0.04273504 0.00000000
## 32 0.95726496 0.04273504 0.00000000
## 33 0.95726496 0.04273504 0.00000000
## 34 0.95726496 0.04273504 0.00000000
## 35 0.95726496 0.04273504 0.00000000
## 36 0.95726496 0.04273504 0.00000000
## 37 0.95726496 0.04273504 0.00000000
## 38 0.01030928 0.04123711 0.94845361
## 39 0.01030928 0.04123711 0.94845361
## 40 0.01030928 0.04123711 0.94845361
## 41 0.01030928 0.04123711 0.94845361
## 42 0.01030928 0.04123711 0.94845361
## 43 0.01030928 0.04123711 0.94845361
## 44 0.01030928 0.04123711 0.94845361
## 45 0.01030928 0.04123711 0.94845361
## 46 0.01030928 0.04123711 0.94845361
## 47 0.01030928 0.04123711 0.94845361
## 48 0.01030928 0.04123711 0.94845361
## 49 0.01030928 0.04123711 0.94845361
## 50 0.01030928 0.04123711 0.94845361
## 51 0.01030928 0.04123711 0.94845361
## 52 0.01030928 0.04123711 0.94845361
## 53 0.01030928 0.04123711 0.94845361
## 54 0.01030928 0.04123711 0.94845361
## 55 0.01030928 0.04123711 0.94845361
## 56 0.01030928 0.04123711 0.94845361
## 57 0.01030928 0.04123711 0.94845361
## 58 0.01030928 0.04123711 0.94845361
## 59 0.01030928 0.04123711 0.94845361
## 60 0.01030928 0.04123711 0.94845361
## 61 0.01030928 0.04123711 0.94845361
## 62 0.01030928 0.04123711 0.94845361
## 63 0.01030928 0.04123711 0.94845361
## 64 0.01030928 0.04123711 0.94845361
## 65 0.01030928 0.04123711 0.94845361
## 66 0.01030928 0.04123711 0.94845361
## 67 0.01030928 0.04123711 0.94845361
## 68 0.01030928 0.04123711 0.94845361
## 69 0.04545455 0.93181818 0.02272727
## 70 0.04545455 0.93181818 0.02272727
## 71 0.04545455 0.93181818 0.02272727
## 72 0.04545455 0.93181818 0.02272727
## 73 0.04545455 0.93181818 0.02272727
## 74 0.04545455 0.93181818 0.02272727
## 75 0.04545455 0.93181818 0.02272727
## 76 0.04545455 0.93181818 0.02272727
## 77 0.04545455 0.93181818 0.02272727
## 78 0.04545455 0.93181818 0.02272727
## 79 0.04545455 0.93181818 0.02272727
## 80 0.04545455 0.93181818 0.02272727
## 81 0.04545455 0.93181818 0.02272727
## 82 0.04545455 0.93181818 0.02272727
## 83 0.04545455 0.93181818 0.02272727
## 84 0.01030928 0.04123711 0.94845361
## 85 0.95726496 0.04273504 0.00000000
## 86 0.04545455 0.93181818 0.02272727
t_pred <- predict(tree1, penguin_test, type = "class")
t_pred
##         1         2         3         4         5         6         7         8 
##    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie 
##         9        10        11        12        13        14        15        16 
##    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie    Adelie 
##        17        18        19        20        21        22        23        24 
##    Adelie    Adelie    Adelie    Adelie    Adelie Chinstrap    Adelie    Adelie 
##        25        26        27        28        29        30        31        32 
##    Adelie    Adelie    Adelie    Adelie    Gentoo    Adelie    Adelie    Adelie 
##        33        34        35        36        37        38        39        40 
##    Adelie    Adelie    Adelie    Adelie    Adelie    Gentoo    Gentoo    Gentoo 
##        41        42        43        44        45        46        47        48 
##    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo 
##        49        50        51        52        53        54        55        56 
##    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo 
##        57        58        59        60        61        62        63        64 
##    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo    Gentoo 
##        65        66        67        68        69        70        71        72 
##    Gentoo    Gentoo    Gentoo    Gentoo Chinstrap Chinstrap Chinstrap Chinstrap 
##        73        74        75        76        77        78        79        80 
## Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap Chinstrap 
##        81        82        83        84        85        86 
## Chinstrap Chinstrap Chinstrap    Gentoo    Adelie Chinstrap 
## Levels: Adelie Chinstrap Gentoo

10.6 Accuracy

confMat <- table(penguin_test$species,t_pred)
confMat
##            t_pred
##             Adelie Chinstrap Gentoo
##   Adelie        35         1      1
##   Chinstrap      1        16      1
##   Gentoo         0         0     31
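The overall accuracy is the proportion of correctly classified test records, i.e. the diagonal of the confusion matrix over the total:

sum(diag(confMat)) / sum(confMat)   # (35 + 16 + 31) / 86 ~ 0.953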

11 Random forest

# packages
library(tidyverse)
library(randomForest)
## randomForest 4.7-1.1
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
# Split data
data(iris)
df <- iris %>% mutate(id = row_number())
## set the seed to make your partition reproducible
set.seed(123)
train <- df %>% sample_frac(.80)
dim(train)
## [1] 120   6
test <- anti_join(df, train, by = 'id')
dim(test)
## [1] 30  6
# Model building
?randomForest
rf1 <- randomForest(Species ~  Sepal.Length+
                      Sepal.Width+  Petal.Length + 
                      Petal.Width,
                    data=train)
rf1
## 
## Call:
##  randomForest(formula = Species ~ Sepal.Length + Sepal.Width +      Petal.Length + Petal.Width, data = train) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 4.17%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         40          0         0  0.00000000
## versicolor      0         32         3  0.08571429
## virginica       0          2        43  0.04444444
rf2 <- randomForest(Species ~  Sepal.Length+
                      Sepal.Width+  Petal.Length + 
                      Petal.Width,
                    data=train, ntree=1000)
rf2
## 
## Call:
##  randomForest(formula = Species ~ Sepal.Length + Sepal.Width +      Petal.Length + Petal.Width, data = train, ntree = 1000) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 4.17%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         40          0         0  0.00000000
## versicolor      0         32         3  0.08571429
## virginica       0          2        43  0.04444444
rf3 <- randomForest(Species ~  Sepal.Length+
                      Sepal.Width+  Petal.Length + 
                      Petal.Width,
                    data=train, ntree=1000,
                    mtry=3)
rf3
## 
## Call:
##  randomForest(formula = Species ~ Sepal.Length + Sepal.Width +      Petal.Length + Petal.Width, data = train, ntree = 1000, mtry = 3) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 4.17%
## Confusion matrix:
##            setosa versicolor virginica class.error
## setosa         40          0         0  0.00000000
## versicolor      0         32         3  0.08571429
## virginica       0          2        43  0.04444444

11.1 Obtain predictions for the test set

pred <- predict(rf2, test)
pred
##          1          2          3          4          5          6          7 
##     setosa     setosa     setosa     setosa     setosa     setosa     setosa 
##          8          9         10         11         12         13         14 
##     setosa     setosa     setosa versicolor versicolor versicolor versicolor 
##         15         16         17         18         19         20         21 
## versicolor versicolor versicolor versicolor versicolor versicolor versicolor 
##         22         23         24         25         26         27         28 
## versicolor  virginica versicolor versicolor  virginica  virginica  virginica 
##         29         30 
##  virginica  virginica 
## Levels: setosa versicolor virginica
table(pred, test$Species)
##             
## pred         setosa versicolor virginica
##   setosa         10          0         0
##   versicolor      0         14         0
##   virginica       0          1         5
## variable importance
varImpPlot(rf2, sort=T, main="Variable Importance")

## variable importance table

var.imp <- data.frame(importance(rf2, type=2))
var.imp
##              MeanDecreaseGini
## Sepal.Length         7.937010
## Sepal.Width          1.967727
## Petal.Length        35.391987
## Petal.Width         33.556159