Decision Tree in Machine Learning

Introduction

A decision tree is a supervised learning algorithm that can be applied to both classification and regression tasks. In classification it predicts categorical outcomes, while in regression it predicts continuous values.
In machine learning, data is split into a training dataset and a test dataset. A decision tree is a model built by analyzing the training dataset; the model can then be used to make predictions on new, unseen test data.

Decision Tree

Structure of Decision Tree

A decision tree has a tree-like structure with three types of nodes:
Root Node: The topmost node, from which the tree starts.
Internal Node or Decision Node: A node that splits into sub-nodes; it represents a decision based on a feature. Root and internal nodes are drawn as rectangles.
Leaf / Terminal Node: A node that does not split and terminates a branch; it represents a final outcome (a class label or a continuous value) and is drawn as an oval.

[Figure: structure of a decision tree]

Building a decision tree

A decision tree can be built systematically using two split algorithms:
1. Split algorithm based on information gain: used to build ID3 (Iterative Dichotomiser 3) for classification tasks.
2. Split algorithm based on the Gini index: used to build CART (Classification and Regression Trees) for both classification and regression tasks.

Split algorithm based on Information gain

ID3 is an algorithm developed by Ross Quinlan in 1986 and used to generate a decision tree from a dataset. The algorithm repeatedly splits (dichotomizes) the dataset into smaller subsets until it reaches the most informative and pure classification.
At each step it selects the best feature by calculating information gain and represents that feature as a node.
Information Gain: a metric that measures the importance of a feature and helps in choosing the best one for splitting. Information gain is higher when the resulting subsets are purer, with a maximum value of 1 for a binary target.

Information Gain(S, A) = Entropy(S) - Sum over values v of A of (|Sv| / |S|) x Entropy(Sv)

where S is the dataset, A is the feature, and Sv is the subset of S for which feature A has value v.

Entropy: a measure of uncertainty or impurity in a dataset. Entropy is high when the data is spread across different classes and low, or zero, when all data belongs to the same class. Entropy is used to calculate information gain during tree construction.

Entropy(S) = -Sum over classes i of pi x log2(pi)

where pi is the proportion of instances in S that belong to class i.

Pure Node: A node is pure if all the data points in it belong to the same class. For example, if all instances are labeled "Yes" or all are labeled "No," the node is pure.
Impure Node: A node is impure when it contains instances of multiple classes, indicating uncertainty in the classification. For example, some instances are labeled "Yes" and the remaining ones "No."

How to build a Decision Tree
  1. Calculate the entropy of the target variable.
  2. Calculate the entropy for each feature: compute the entropy of the subsets formed by splitting the dataset on that feature's values.
  3. Calculate the information gain of each feature.
  4. Select the best feature: the feature with the highest information gain is selected as the splitting criterion and becomes the root node (or an internal node on later iterations).
  5. Repeat steps 2–4 recursively until a stopping condition is met (e.g., the node is pure or a predefined tree depth is reached).
  6. Create leaf nodes: once a branch reaches a pure state or a stopping condition, the leaf node is assigned a class label.
Example

In this example, the entire dataset is used for training. There are two features, "Have Feathers?" and "Can Fly?", along with the target variable "Bird?". The task is to predict whether a species is a bird or not.

Have Feathers?   Can Fly?   Bird?
Yes              Yes        Yes
Yes              Yes        Yes
Yes              No         No
No               Yes        No
No               Yes        No

Solution

Step 1: Calculate the entropy of the target variable ("Bird?")

There are 5 instances with two classes in the target variable (Bird?): Yes = 2, No = 3

P(Yes) = 2/5 = 0.4
P(No)  = 3/5 = 0.6


Entropy(Bird) = -(2/5 log2(2/5)) - (3/5 log2(3/5))
      = -(0.4 log2(0.4)) - (0.6 log2(0.6))
      = -((0.4 x (-1.3219)) + (0.6 x (-0.737)))
      = -(-0.5288 - 0.4422)
      = 0.971

Note: to calculate log2 on a scientific calculator, use the change-of-base rule:

log2(0.4) = log(0.4) / log(2)
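
Alternatively, the value can be checked in Python, which provides a base-2 logarithm directly. The snippet below is a minimal check of the Step 1 calculation:

import math

# Class proportions of the target variable "Bird?": Yes = 2/5, No = 3/5
p_yes, p_no = 2/5, 3/5

# Entropy(Bird) = -(p_yes x log2(p_yes) + p_no x log2(p_no))
entropy_bird = -(p_yes * math.log2(p_yes) + p_no * math.log2(p_no))
print(round(entropy_bird, 3))  # 0.971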

Step 2: Calculate the entropy of each feature

(a) Calculate the entropy of the feature "Have Feathers?":

There are 5 instances with two values of the feature "Have Feathers?":
Yes = 3, No = 2
(i) Yes = 3 instances (2 Yes, 1 No in the target variable)

Have Feathers?   Bird?
Yes              Yes
Yes              Yes
Yes              No

Entropy(Have Feathers=Yes) = -(2/3 log2(2/3)) - (1/3 log2(1/3))
        = -(0.667 log2(0.667)) - (0.333 log2(0.333))
        = -((0.667 x (-0.585)) + (0.333 x (-1.585)))
        = -(-0.390 - 0.528)
        = 0.918

(ii) No = 2 instances (0 Yes, 2 No in the target variable)

Have Feathers?   Bird?
No               No
No               No

Entropy(Have Feathers=No) = -(0/2 log2(0/2)) - (2/2 log2(2/2))
        = -0 - (1 x log2(1))        (taking 0 x log2(0) = 0)
        = 0 (pure node)

Therefore, Entropy(Have Feathers) = (No. of Yes / Total Instances) x Entropy(HF=Yes) + (No. of No / Total Instances) x Entropy(HF=No)
        = (3/5) x 0.918 + (2/5) x 0
        = 0.6 x 0.918 + 0
        = 0.5508

(b) Calculate the entropy of the feature "Can Fly?":

There are 5 instances with two values of the feature "Can Fly?":
Yes = 4, No = 1
(i) Yes = 4 instances (2 Yes, 2 No in the target variable)

Can Fly?   Bird?
Yes        Yes
Yes        Yes
Yes        No
Yes        No

Entropy(Can Fly=Yes) = -(2/4 log2(2/4)) - (2/4 log2(2/4))
        = -(0.5 log2(0.5)) - (0.5 log2(0.5))
        = -((0.5 x (-1)) + (0.5 x (-1)))
        = -(-0.5 - 0.5)
        = 1

(ii) No = 1 instance (0 Yes, 1 No in the target variable)

Can Fly?   Bird?
No         No

Entropy(Can Fly=No) = -(0/1 log2(0/1)) - (1/1 log2(1/1))
       = -0 - (1 x log2(1))
       = 0 (pure node)

Therefore, Entropy(Can Fly) = (No. of Yes / Total Instances) x Entropy(CF=Yes) + (No. of No / Total Instances) x Entropy(CF=No)
        = (4/5) x 1 + (1/5) x 0
        = 0.8 + 0
        = 0.8

Step 3: Calculate the Information Gain of the features

Information Gain = Entropy (target variable) - Entropy (feature)
          = Entropy (Bird) - Entropy (feature), computed separately for "Have Feathers?" and "Can Fly?"
In this example, Entropy (Bird) = 0.971. Therefore, the information gain of each feature is:

Feature          Entropy   Information Gain
Have Feathers?   0.5508    0.971 - 0.5508 = 0.4202
Can Fly?         0.8       0.971 - 0.8 = 0.171
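
As a quick programmatic check of the table above, the sketch below recomputes the weighted feature entropies and both information gains from the training data:

import math

def entropy(labels):
    # Entropy = -sum(p x log2(p)) over the classes present in labels
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

# Rows of the training table: (Have Feathers?, Can Fly?, Bird?)
rows = [("Yes", "Yes", "Yes"), ("Yes", "Yes", "Yes"), ("Yes", "No", "No"),
        ("No", "Yes", "No"), ("No", "Yes", "No")]
target = [r[2] for r in rows]

for i, name in [(0, "Have Feathers?"), (1, "Can Fly?")]:
    weighted = sum(
        (len([r for r in rows if r[i] == v]) / len(rows))
        * entropy([r[2] for r in rows if r[i] == v])
        for v in set(r[i] for r in rows))
    print(name, "information gain =", round(entropy(target) - weighted, 4))
# Have Feathers? information gain = 0.42   (0.4202 with the rounded entropies above)
# Can Fly? information gain = 0.171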

Step 4: Select the best feature or best split

The feature “Have Feathers?” has the highest information gain (0.4202); hence, it is selected as the first split or root node of the decision tree.

Step 5: Create the Decision Tree

[Figure: the final decision tree for the bird example]

Root Node: Start with the best feature, "Have Feathers?"
(i) Right branch (No): This is a pure subset (all instances belong to the "No" class), so it is terminated with the label "Not Bird."
(ii) Left branch (Yes): This subset is not pure, so it requires a further split on the feature "Can Fly?"
   (a) Right branch (No): This is a pure subset (all instances belong to the "No" class), so it is terminated with the label "Not Bird."
   (b) Left branch (Yes): This is also a pure subset (all instances belong to the "Yes" class), so it is terminated with the label "Bird."

Step 6: Generate the rules from the model to make predictions

If “Have Feathers?” = No
      Prediction: Not Bird
else
      If “Have Feathers?” = Yes and “Can Fly?” = No
           Prediction: Not Bird
      else if “Have Feathers?” = Yes and “Can Fly?” = Yes
            Prediction: Bird

Test the Model
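
The rules from Step 6 translate directly into Python conditional statements. The sketch below is a minimal, hand-coded version of the model; the penguin query is a hypothetical instance chosen to probe the model's limits (a penguin has feathers but cannot fly):

def predict_species(have_feathers, can_fly):
    # Rules generated from the decision tree in Step 6
    if have_feathers == "No":
        return "Not Bird"
    elif can_fly == "No":
        return "Not Bird"
    else:
        return "Bird"

print(predict_species(have_feathers="Yes", can_fly="Yes"))  # Bird
# The model misclassifies flightless birds such as penguins, a consequence
# of training on only five instances:
print(predict_species(have_feathers="Yes", can_fly="No"))   # Not Bird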

Advantages

1. Easy to understand: the tree can be visualized and its decisions interpreted directly.
2. Simple to implement.
3. Requires little data preparation; for example, feature scaling is not needed.
4. Can be converted into simple if-then rules (as in Step 6).
5. It handles both categorical data and numerical data.

Disadvantages

1. Prone to overfitting, especially when the tree is allowed to grow deep.
2. Small changes in the training data can produce a completely different tree (instability).
3. Choosing the right stopping condition (e.g., maximum tree depth) can be tricky.
4. Splitting is greedy: each node is chosen locally, so the final tree is not guaranteed to be globally optimal.

Applications

The decision tree algorithm is versatile and widely used across various domains. Some of the applications include:
1. Text Classification and  Image Classification
2. Handwriting Recognition and Gesture Recognition
3. Recommending products, movies, etc., to users
4. Classifying patient data to diagnose diseases
5. Categorizing genetic sequences and proteins
6. Detecting fraudulent transactions through pattern analysis
7. Weather Forecasting
8. Stock  Market Predictions
9. Classifying air quality from sensor data
10. Detecting unusual network activity and security breaches

Python Implementation from scratch

This practical demonstration highlights how the ID3 decision tree algorithm makes predictions. We implement the algorithm on the bird example above in Python from scratch, classifying a species as 'Bird' or 'Not Bird.'
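
The sketch below is one possible from-scratch implementation (an illustrative version, not the only way to write it): it computes entropy and information gain, builds the tree recursively, and predicts the class of a new instance.

import math
from collections import Counter

# Training data from the worked example: (Have Feathers?, Can Fly?, Bird?)
data = [
    ("Yes", "Yes", "Yes"),
    ("Yes", "Yes", "Yes"),
    ("Yes", "No",  "No"),
    ("No",  "Yes", "No"),
    ("No",  "Yes", "No"),
]
features = ["Have Feathers?", "Can Fly?"]

def entropy(labels):
    # Entropy of a list of class labels: -sum(p x log2(p))
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(rows, i):
    # Entropy of the target minus the weighted entropy after splitting on feature i
    before = entropy([row[-1] for row in rows])
    weighted = sum(
        (len(subset) / len(rows)) * entropy([row[-1] for row in subset])
        for subset in ([row for row in rows if row[i] == v]
                       for v in sorted(set(row[i] for row in rows))))
    return before - weighted

def build_tree(rows, feature_indices):
    # Recursively build an ID3 tree as nested dicts; leaves are class labels
    labels = [row[-1] for row in rows]
    if len(set(labels)) == 1:          # pure node -> leaf
        return labels[0]
    if not feature_indices:            # no features left -> majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(feature_indices, key=lambda i: information_gain(rows, i))
    rest = [i for i in feature_indices if i != best]
    return {features[best]: {
        v: build_tree([row for row in rows if row[best] == v], rest)
        for v in sorted(set(row[best] for row in rows))}}

def predict(tree, instance):
    # Walk the tree using the instance's feature values until a leaf is reached
    while isinstance(tree, dict):
        feature = next(iter(tree))
        tree = tree[feature][instance[features.index(feature)]]
    return tree

tree = build_tree(data, [0, 1])
print(tree)

label = predict(tree, ("Yes", "Yes"))
print("Have Feathers = Yes, Can Fly = Yes is classified as:",
      "Bird" if label == "Yes" else "Not Bird")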

Output

The learned tree and the prediction for the new instance:

{'Have Feathers?': {'No': 'No', 'Yes': {'Can Fly?': {'No': 'No', 'Yes': 'Yes'}}}}
Have Feathers = Yes, Can Fly = Yes is classified as: Bird

Test the classifier

Testing a classifier involves evaluating the performance of the decision tree by applying it to new, unseen data. The basic metrics to measure the performance of a classifier are:
(i) Accuracy
(ii) Error Rate

(i) Accuracy:
Accuracy represents the proportion of correctly classified test data out of the total test data. It is calculated as:

Accuracy = (Number of correct predictions) / (Total number of test instances)

(ii) Error Rate:
The error rate, also known as the misclassification rate, represents the proportion of test data that was incorrectly classified. It is calculated as:

Error Rate = (Number of incorrect predictions) / (Total number of test instances) = 1 - Accuracy

Steps to test the classifier

1. Prepare Datasets: Define the training dataset and test dataset.
2. Implement the prediction function: Create a function that predicts labels for the test data by walking the trained decision tree from the root to a leaf.
3. Test Classifier: Compare the prediction data with actual data (labels).
4. Calculate Metrics: Count the predictions that do not match the actual labels and compute the error rate.
5. Print Results: Display the error rate to evaluate the performance of the classifier.

Source code
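
A minimal testing sketch, continuing from the implementation above. The three test rows are hypothetical and were chosen so that exactly one (a penguin, which has feathers but cannot fly) is misclassified:

# Hypothetical test set: (Have Feathers?, Can Fly?, actual Bird? label)
test_data = [
    ("Yes", "Yes", "Yes"),  # e.g., an eagle   -> predicted Bird      (correct)
    ("No",  "No",  "No"),   # e.g., a dog      -> predicted Not Bird  (correct)
    ("Yes", "No",  "Yes"),  # e.g., a penguin  -> predicted Not Bird  (wrong)
]

# Count predictions that disagree with the actual labels
errors = sum(1 for row in test_data if predict(tree, row[:2]) != row[-1])

print("Total number of tests :", len(test_data))
print("Number of errors :", errors)
print(f"Error rate for the decision tree: {errors / len(test_data):.2f}")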

Output

Total number of tests : 3
Number of errors : 1
Error rate for the decision tree: 0.33

Frequently Asked Questions (FAQs)

What role does tree depth play?
The depth of the tree controls its complexity. A very shallow tree may underfit the data, while a very deep tree may overfit it by memorizing noise in the training set.

How does a decision tree handle multi-class classification?
If there are more than two outcome labels or categories, the tree handles them naturally: each leaf node is simply assigned the class with the majority count among the training instances that reach it.

What is the time complexity of the algorithm?
Prediction is fast: classifying an instance only requires walking from the root to a leaf, which is O(depth of the tree). Training is more expensive, since every candidate feature is evaluated at every node, roughly O(n x d) work per node, where 'n' is the number of training instances and 'd' is the number of features.

Which splitting criteria are commonly used?
The most common splitting criteria are information gain (based on entropy), used by ID3, and the Gini index, used by CART. C4.5 uses the gain ratio, a normalized form of information gain.

What is overfitting in a decision tree?
Overfitting refers to the problem of a tree growing until it memorizes the noise in the training data. Such a tree fits the training set almost perfectly but generalizes poorly, leading to less reliable predictions.

How can overfitting be reduced?
(i) Pruning: Remove branches that do not improve performance on validation data.
(ii) Early stopping: Limit the maximum depth of the tree or the minimum number of instances required to split a node.
(iii) Ensembles: Methods like Random Forest combine many trees to reduce variance and improve generalization.

We trust that this article has helped readers grasp the fundamentals of the decision tree algorithm. Feel free to subscribe to our platform to stay updated with future articles and tutorials on machine learning and data science.

For the latest updates and discussions, we welcome you to connect with us on platforms like Twitter, Instagram, Youtube, LinkedIn, Pinterest and Facebook.
