Decision Tree in Machine Learning

Introduction

A decision tree is a supervised learning algorithm that can be applied to both classification and regression tasks. In classification it predicts categorical outcomes, while in regression it predicts continuous values.
In machine learning, data is split into a training dataset and a test dataset. A decision tree is a model built by analyzing the training dataset; the model can then be used to make predictions on new, unseen test data.

Decision Tree

Structure of Decision Tree

A decision tree has a tree-like structure with three types of nodes:
Root Node: The topmost node, from which the tree starts.
Internal Node or Decision Node: A node that splits into sub-nodes; it represents a decision based on a feature. Root and internal nodes are drawn as rectangles.
Leaf / Terminal Node: A node that does not split and terminates a branch; it represents a final outcome (a class label or a continuous value) and is drawn as an oval.

[Figure: structure of a decision tree]

Building a decision tree

A decision tree can be built systematically using two split algorithms:
1. Split algorithm based on information gain: used to build ID3 (Iterative Dichotomiser 3) for classification tasks.
2. Split algorithm based on the Gini index: used to build CART (Classification and Regression Trees) for both classification and regression tasks.

Split algorithm based on Information gain

ID3 is an algorithm developed by Ross Quinlan in 1986 and used to generate a decision tree from a dataset. The algorithm repeatedly splits (dichotomizes) the dataset into smaller subsets until it reaches the most informative and pure classification.
At each step it selects the best feature by calculating information gain and represents that feature as a node.
Information Gain: a metric that measures the importance of a feature and helps in choosing the best one for splitting. Information gain is higher when the resulting subsets are purer, with a maximum value of 1 for a binary target.

Information Gain(S, A) = Entropy(S) - Sum over values v of A of (|Sv| / |S|) x Entropy(Sv)

where S is the dataset, A is the feature, and Sv is the subset of S for which feature A has value v.

Entropy: a measure of uncertainty or impurity in a dataset. Entropy is high when the data is spread across different classes and low, or zero, when all data belongs to the same class. Entropy is used to calculate information gain during tree construction.

Entropy(S) = -Sum over classes i of pi x log2(pi)

where pi is the proportion of instances in S that belong to class i.

Pure Node: A node is pure if all the data points in it belong to the same class. For example, if all instances are labeled "Yes" or all are labeled "No," the node is pure.
Impure Node: A node is impure when it contains instances of multiple classes, indicating uncertainty in the classification. For example, some instances are labeled "Yes" and the remaining ones "No."

How to build a Decision Tree
  1. Calculate the entropy of the target variable.
  2. Calculate the entropy for each feature: compute the entropy of the subsets formed by splitting the dataset on that feature's values.
  3. Calculate the information gain of each feature.
  4. Select the best feature: the feature with the highest information gain is selected as the splitting criterion and becomes the root node (or an internal node on later iterations).
  5. Repeat steps 2–4 recursively until a stopping condition is met (e.g., the node is pure or a predefined tree depth is reached).
  6. Create leaf nodes: once a branch reaches a pure state or a stopping condition, the leaf node is assigned a class label.
Example

In this example, the entire dataset is used for training. There are two features, "Have Feathers?" and "Can Fly?", along with the target variable "Bird?". The task is to predict whether a species is a bird or not.

Have Feathers?   Can Fly?   Bird?
Yes              Yes        Yes
Yes              Yes        Yes
Yes              No         No
No               Yes        No
No               Yes        No

Solution

Step 1: Calculate the entropy of the target variable ("Bird?")

There are 5 instances with two classes in the target variable (Bird?): Yes = 2, No = 3

P(Yes) = 2/5 = 0.4
P(No)  = 3/5 = 0.6


Entropy(Bird) = -(2/5 log2(2/5)) - (3/5 log2(3/5))
      = -(0.4 log2(0.4)) - (0.6 log2(0.6))
      = -((0.4 x (-1.3219)) + (0.6 x (-0.737)))
      = -(-0.5288 - 0.4422)
      = 0.971

Note: to calculate log2 on a scientific calculator, use the change-of-base rule:

log2(0.4) = log(0.4) / log(2)
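
Alternatively, the value can be checked in Python, which provides a base-2 logarithm directly. The snippet below is a minimal check of the Step 1 calculation:

import math

# Class proportions of the target variable "Bird?": Yes = 2/5, No = 3/5
p_yes, p_no = 2/5, 3/5

# Entropy(Bird) = -(p_yes x log2(p_yes) + p_no x log2(p_no))
entropy_bird = -(p_yes * math.log2(p_yes) + p_no * math.log2(p_no))
print(round(entropy_bird, 3))  # 0.971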

Step 2: Calculate the entropy of each feature

(a) Calculate the entropy of the feature "Have Feathers?":

There are 5 instances with two values of the feature "Have Feathers?":
Yes = 3, No = 2
(i) Yes = 3 instances (2 Yes, 1 No in the target variable)

Have Feathers?   Bird?
Yes              Yes
Yes              Yes
Yes              No

Entropy(Have Feathers=Yes) = -(2/3 log2(2/3)) - (1/3 log2(1/3))
        = -(0.667 log2(0.667)) - (0.333 log2(0.333))
        = -((0.667 x (-0.585)) + (0.333 x (-1.585)))
        = -(-0.390 - 0.528)
        = 0.918

(ii) No = 2 instances (0 Yes, 2 No in the target variable)

Have Feathers?   Bird?
No               No
No               No

Entropy(Have Feathers=No) = -(0/2 log2(0/2)) - (2/2 log2(2/2))
        = -0 - (1 x log2(1))        (taking 0 x log2(0) = 0)
        = 0 (pure node)

Therefore, Entropy(Have Feathers) = (No. of Yes / Total Instances) x Entropy(HF=Yes) + (No. of No / Total Instances) x Entropy(HF=No)
        = (3/5) x 0.918 + (2/5) x 0
        = 0.6 x 0.918 + 0
        = 0.5508

(b) Calculate the entropy of the feature "Can Fly?":

There are 5 instances with two values of the feature "Can Fly?":
Yes = 4, No = 1
(i) Yes = 4 instances (2 Yes, 2 No in the target variable)

Can Fly?   Bird?
Yes        Yes
Yes        Yes
Yes        No
Yes        No

Entropy(Can Fly=Yes) = -(2/4 log2(2/4)) - (2/4 log2(2/4))
        = -(0.5 log2(0.5)) - (0.5 log2(0.5))
        = -((0.5 x (-1)) + (0.5 x (-1)))
        = -(-0.5 - 0.5)
        = 1

(ii) No = 1 instance (0 Yes, 1 No in the target variable)

Can Fly?   Bird?
No         No

Entropy(Can Fly=No) = -(0/1 log2(0/1)) - (1/1 log2(1/1))
       = -0 - (1 x log2(1))
       = 0 (pure node)

Therefore, Entropy(Can Fly) = (No. of Yes / Total Instances) x Entropy(CF=Yes) + (No. of No / Total Instances) x Entropy(CF=No)
        = (4/5) x 1 + (1/5) x 0
        = 0.8 + 0
        = 0.8

Step 3: Calculate the Information Gain of the features

Information Gain = Entropy (target variable) - Entropy (feature)
          = Entropy (Bird) - Entropy (feature), computed separately for "Have Feathers?" and "Can Fly?"
In this example, Entropy (Bird) = 0.971. Therefore, the information gain of each feature is:

Feature          Entropy   Information Gain
Have Feathers?   0.5508    0.971 - 0.5508 = 0.4202
Can Fly?         0.8       0.971 - 0.8 = 0.171
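
As a quick programmatic check of the table above, the sketch below recomputes the weighted feature entropies and both information gains from the training data:

import math

def entropy(labels):
    # Entropy = -sum(p x log2(p)) over the classes present in labels
    total = len(labels)
    return -sum((labels.count(c) / total) * math.log2(labels.count(c) / total)
                for c in set(labels))

# Rows of the training table: (Have Feathers?, Can Fly?, Bird?)
rows = [("Yes", "Yes", "Yes"), ("Yes", "Yes", "Yes"), ("Yes", "No", "No"),
        ("No", "Yes", "No"), ("No", "Yes", "No")]
target = [r[2] for r in rows]

for i, name in [(0, "Have Feathers?"), (1, "Can Fly?")]:
    weighted = sum(
        (len([r for r in rows if r[i] == v]) / len(rows))
        * entropy([r[2] for r in rows if r[i] == v])
        for v in set(r[i] for r in rows))
    print(name, "information gain =", round(entropy(target) - weighted, 4))
# Have Feathers? information gain = 0.42   (0.4202 with the rounded entropies above)
# Can Fly? information gain = 0.171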

Step 4: Select the best feature or best split

The feature “Have Feathers?” has the highest information gain (0.4202); hence, it is selected as the first split or root node of the decision tree.

Step 5: Create the Decision Tree

[Figure: the final decision tree for the bird example]

Root Node: Start with the best feature, "Have Feathers?"
(i) Right branch (No): This is a pure subset (all instances belong to the "No" class), so it is terminated with the label "Not Bird."
(ii) Left branch (Yes): This subset is not pure, so it requires a further split on the feature "Can Fly?"
   (a) Right branch (No): This is a pure subset (all instances belong to the "No" class), so it is terminated with the label "Not Bird."
   (b) Left branch (Yes): This is also a pure subset (all instances belong to the "Yes" class), so it is terminated with the label "Bird."

Step 6: Generate the rules from the model to make predictions

If “Have Feathers?” = No
      Prediction: Not Bird
else
      If “Have Feathers?” = Yes and “Can Fly?” = No
           Prediction: Not Bird
      else if “Have Feathers?” = Yes and “Can Fly?” = Yes
            Prediction: Bird

Test the Model
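
The rules from Step 6 translate directly into Python conditional statements. The sketch below is a minimal, hand-coded version of the model; the penguin query is a hypothetical instance chosen to probe the model's limits (a penguin has feathers but cannot fly):

def predict_species(have_feathers, can_fly):
    # Rules generated from the decision tree in Step 6
    if have_feathers == "No":
        return "Not Bird"
    elif can_fly == "No":
        return "Not Bird"
    else:
        return "Bird"

print(predict_species(have_feathers="Yes", can_fly="Yes"))  # Bird
# The model misclassifies flightless birds such as penguins, a consequence
# of training on only five instances:
print(predict_species(have_feathers="Yes", can_fly="No"))   # Not Bird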

Advantages

1. Easy to understand: the tree can be visualized and its decisions interpreted directly.
2. Simple to implement.
3. Requires little data preparation; for example, feature scaling is not needed.
4. Can be converted into simple if-then rules (as in Step 6).
5. It handles both categorical data and numerical data.

Disadvantages

1. Prone to overfitting, especially when the tree is allowed to grow deep.
2. Small changes in the training data can produce a completely different tree (instability).
3. Choosing the right stopping condition (e.g., maximum tree depth) can be tricky.
4. Splitting is greedy: each node is chosen locally, so the final tree is not guaranteed to be globally optimal.

Applications

The decision tree algorithm is versatile and widely used across various domains. Some of the applications include:
1. Text Classification and  Image Classification
2. Handwriting Recognition and Gesture Recognition
3. Recommending products, movies, etc., to users
4. Classifying patient data to diagnose diseases
5. Categorizing genetic sequences and proteins
6. Detecting fraudulent transactions through pattern analysis
7. Weather Forecasting
8. Stock  Market Predictions
9. Classifying air quality from sensor data
10. Detecting unusual network activity and security breaches

Python Implementation from scratch

This practical demonstration highlights how the ID3 decision tree algorithm makes predictions. We implement the algorithm on the bird example above in Python from scratch, classifying a species as 'Bird' or 'Not Bird.'
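
The sketch below is one possible from-scratch implementation (an illustrative version, not the only way to write it): it computes entropy and information gain, builds the tree recursively, and predicts the class of a new instance.

import math
from collections import Counter

# Training data from the worked example: (Have Feathers?, Can Fly?, Bird?)
data = [
    ("Yes", "Yes", "Yes"),
    ("Yes", "Yes", "Yes"),
    ("Yes", "No",  "No"),
    ("No",  "Yes", "No"),
    ("No",  "Yes", "No"),
]
features = ["Have Feathers?", "Can Fly?"]

def entropy(labels):
    # Entropy of a list of class labels: -sum(p x log2(p))
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())

def information_gain(rows, i):
    # Entropy of the target minus the weighted entropy after splitting on feature i
    before = entropy([row[-1] for row in rows])
    weighted = sum(
        (len(subset) / len(rows)) * entropy([row[-1] for row in subset])
        for subset in ([row for row in rows if row[i] == v]
                       for v in sorted(set(row[i] for row in rows))))
    return before - weighted

def build_tree(rows, feature_indices):
    # Recursively build an ID3 tree as nested dicts; leaves are class labels
    labels = [row[-1] for row in rows]
    if len(set(labels)) == 1:          # pure node -> leaf
        return labels[0]
    if not feature_indices:            # no features left -> majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(feature_indices, key=lambda i: information_gain(rows, i))
    rest = [i for i in feature_indices if i != best]
    return {features[best]: {
        v: build_tree([row for row in rows if row[best] == v], rest)
        for v in sorted(set(row[best] for row in rows))}}

def predict(tree, instance):
    # Walk the tree using the instance's feature values until a leaf is reached
    while isinstance(tree, dict):
        feature = next(iter(tree))
        tree = tree[feature][instance[features.index(feature)]]
    return tree

tree = build_tree(data, [0, 1])
print(tree)

label = predict(tree, ("Yes", "Yes"))
print("Have Feathers = Yes, Can Fly = Yes is classified as:",
      "Bird" if label == "Yes" else "Not Bird")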

Output

The learned tree and the prediction for the new instance:

{'Have Feathers?': {'No': 'No', 'Yes': {'Can Fly?': {'No': 'No', 'Yes': 'Yes'}}}}
Have Feathers = Yes, Can Fly = Yes is classified as: Bird

Test the classifier

Testing a classifier involves evaluating the performance of the decision tree by applying it to new, unseen data. The basic metrics to measure the performance of a classifier are:
(i) Accuracy
(ii) Error Rate

(i) Accuracy:
Accuracy represents the proportion of correctly classified test data out of the total test data. It is calculated as:

Accuracy = (Number of correct predictions) / (Total number of test instances)

(ii) Error Rate:
The error rate, also known as the misclassification rate, represents the proportion of test data that was incorrectly classified. It is calculated as:

Error Rate = (Number of incorrect predictions) / (Total number of test instances) = 1 - Accuracy

Steps to test the classifier

1. Prepare Datasets: Define the training dataset and test dataset.
2. Implement the prediction function: Create a function that predicts labels for the test data by walking the trained decision tree from the root to a leaf.
3. Test Classifier: Compare the prediction data with actual data (labels).
4. Calculate Metrics: Count the predictions that do not match the actual labels and compute the error rate.
5. Print Results: Display the error rate to evaluate the performance of the classifier.

Source code
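
A minimal testing sketch, continuing from the implementation above. The three test rows are hypothetical and were chosen so that exactly one (a penguin, which has feathers but cannot fly) is misclassified:

# Hypothetical test set: (Have Feathers?, Can Fly?, actual Bird? label)
test_data = [
    ("Yes", "Yes", "Yes"),  # e.g., an eagle   -> predicted Bird      (correct)
    ("No",  "No",  "No"),   # e.g., a dog      -> predicted Not Bird  (correct)
    ("Yes", "No",  "Yes"),  # e.g., a penguin  -> predicted Not Bird  (wrong)
]

# Count predictions that disagree with the actual labels
errors = sum(1 for row in test_data if predict(tree, row[:2]) != row[-1])

print("Total number of tests :", len(test_data))
print("Number of errors :", errors)
print(f"Error rate for the decision tree: {errors / len(test_data):.2f}")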

Output

Total number of tests : 3
Number of errors : 1
Error rate for the decision tree: 0.33

Frequently Asked Questions (FAQs)

What role does tree depth play?
The depth of the tree controls its complexity. A very shallow tree may underfit the data, while a very deep tree may overfit it by memorizing noise in the training set.

How does a decision tree handle multi-class classification?
If there are more than two outcome labels or categories, the tree handles them naturally: each leaf node is simply assigned the class with the majority count among the training instances that reach it.

What is the time complexity of the algorithm?
Prediction is fast: classifying an instance only requires walking from the root to a leaf, which is O(depth of the tree). Training is more expensive, since every candidate feature is evaluated at every node, roughly O(n x d) work per node, where 'n' is the number of training instances and 'd' is the number of features.

Which splitting criteria are commonly used?
The most common splitting criteria are information gain (based on entropy), used by ID3, and the Gini index, used by CART. C4.5 uses the gain ratio, a normalized form of information gain.

What is overfitting in a decision tree?
Overfitting refers to the problem of a tree growing until it memorizes the noise in the training data. Such a tree fits the training set almost perfectly but generalizes poorly, leading to less reliable predictions.

How can overfitting be reduced?
(i) Pruning: Remove branches that do not improve performance on validation data.
(ii) Early stopping: Limit the maximum depth of the tree or the minimum number of instances required to split a node.
(iii) Ensembles: Methods like Random Forest combine many trees to reduce variance and improve generalization.

We trust that this article has helped readers grasp the fundamentals of the decision tree algorithm. Feel free to subscribe to our platform to stay updated with future articles and tutorials on machine learning and data science.

For the latest updates and discussions, we welcome you to connect with us on platforms like Twitter, Instagram, Youtube, LinkedIn, Pinterest and Facebook.
