
Decision Tree in Sklearn

Decision Trees are hierarchical models in machine learning that can be applied to classification and regression problems. They recursively partition the input data based on feature values and predict the output at a leaf node. This article discusses Decision Trees and their implementation in the sklearn library.

What Is a Decision Tree Algorithm?

The decision tree technique constructs classification and regression models. It maps vectors of feature values to labels and represents the classifier as a tree. Such a tree can be compared to nested if-then-else statements, where each condition is a straightforward test of a value in the vector, and each then or else branch is either a further if-then-else statement or a classification label. A decision tree learns from the data, finds the features that best separate the outputs, and recursively tests the given input to predict its label. For instance, if the input vector is (a, b, c), a decision tree might look like this:

IF a > 10
THEN IF b < 20
     THEN RETURN "1"
     ELSE IF a < 15
          THEN RETURN "0"
          ELSE RETURN "1"
ELSE IF c < 5
     THEN IF a < 15
          THEN RETURN "1"
          ELSE RETURN "0"
     ELSE RETURN "1"
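
To make the structure concrete, the same pseudo-tree can be written as an ordinary Python function (a hypothetical sketch; a, b, and c are simply the components of the input vector):

def classify(a, b, c):
    # Each condition mirrors one internal node of the pseudo-tree above
    if a > 10:
        if b < 20:
            return "1"
        return "0" if a < 15 else "1"
    if c < 5:
        return "1" if a < 15 else "0"
    return "1"

print(classify(12, 15, 3))  # prints "1": a > 10 and b < 20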

Note that this is not the only decision tree with this property; many other trees classify the same data. Consequently, the problem is not only to find such a decision tree but also to identify the most suitable one. What "suitable" means here is determined by the fact that the input is a sample from a larger real-world collection and that the decision tree is built to classify the vectors in this wider set accurately. Therefore, the definition of "suitable" depends on (1) the properties of this wider set (for example, the probability of each vector) and (2) the cost of misclassification in each specific instance.

Terminologies Related to Decision Tree

Root Node: The decision tree's root node is where it all begins. It represents the entire dataset, which is then divided into two or more homogeneous sets.

Leaf Node: The leaf nodes are the final output nodes of the tree, after which the tree cannot be split further.

Splitting: The division of a decision node or the root node into sub-nodes according to the specified conditions is known as splitting.

Branch: A branch or subtree is a tree created from a node of a parent tree.

Pruning: Pruning is the procedure of removing the tree's undesirable branches (see the sklearn sketch after this list).

Parent and Child Nodes: A node that is divided into sub-nodes is called a parent node, while the sub-nodes that originate from it are called its child nodes.
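
In sklearn, some of these ideas map directly onto DecisionTreeClassifier parameters: max_depth limits how far splitting can go, and ccp_alpha enables cost-complexity pruning. A minimal sketch (the parameter values here are arbitrary illustrative choices):

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(random_state=42)

# An unconstrained tree keeps splitting until its leaves are pure
full_tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# max_depth pre-prunes by capping the tree's depth; ccp_alpha post-prunes
# branches whose complexity is not justified by their impurity reduction
pruned_tree = DecisionTreeClassifier(max_depth=3, ccp_alpha=0.01,
                                     random_state=0).fit(X, y)

print(full_tree.get_depth(), pruned_tree.get_depth())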

Implementing Decision Trees in Sklearn

Importing the libraries:

from sklearn.datasets import make_classification     # synthetic dataset generator
from sklearn.tree import DecisionTreeClassifier      # the decision tree model
from sklearn.model_selection import cross_val_score  # cross-validation helper

Creating the dataset:

# Generate a synthetic binary classification dataset
X, y = make_classification(random_state=42)
print('The features are', X)
print('The labels are', y)

Output:

The features are [[-2.02514259  0.0291022  -0.47494531 ... -0.33450124  0.86575519
  -1.20029641]
 [ 1.61371127  0.65992405 -0.15005559 ...  1.37570681  0.70117274
  -0.2975635 ]
 [ 0.16645221  0.95057302  1.42050425 ...  1.18901653 -0.55547712
  -0.63738713]
 ...
 [-0.03955515 -1.60499282  0.22213377 ... -0.30917212 -0.46227529
  -0.43449623]
 [ 1.08589557  1.2031659  -0.6095122  ... -0.3052247  -1.31183623
  -1.06511366]
 [-0.00607091  1.30857636 -0.17495976 ...  0.99204235  0.32169781
  -0.66809045]]
The labels are [0 0 1 1 0 0 0 1 0 1 1 0 0 0 1 1 1 0 0 1 1 0 0 0 0 1 1 0 1 0 0 0 0 0 0 1 0
 0 1 1 1 0 1 0 0 1 1 0 0 1 1 1 0 1 0 0 1 1 0 1 1 1 1 1 0 1 0 0 1 0 1 0 1 0
 1 1 1 0 0 0 1 0 1 0 1 1 1 1 1 0 0 1 0 1 1 0 1 1 0 0]
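
A quick way to confirm what make_classification returned: by default it generates 100 samples with 20 features each, so the shapes below are expected:

print(X.shape)  # (100, 20): 100 samples, 20 features each
print(y.shape)  # (100,): one binary label per sample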

Creating the model:

# Build the classifier and estimate its accuracy with 10-fold cross-validation
model = DecisionTreeClassifier(random_state=0)
cross_val_score(model, X, y, cv=10)

Output:

array([0.9, 1. , 0.8, 1. , 1. , 0.9, 0.9, 1. , 0.9, 1. ])
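
Each of the ten values above is the accuracy on one held-out fold; cross_val_score fits the model internally on each split. For comparison, a minimal sketch of an explicit train/test workflow (the 0.25 test size is an arbitrary choice) might look like this:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(random_state=42)

# Hold out 25% of the samples for evaluation
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)           # learn the splits from the training set
print(model.score(X_test, y_test))    # mean accuracy on the held-out set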

Conclusion

We discussed Decision Tree models in sklearn, which build a tree-like structure for classifying inputs or predicting output labels. The tree splits nodes recursively on feature values, and pruning keeps its depth in check. We also covered the terminology related to Decision Trees, such as leaf nodes, parent nodes, and pruning, and then walked through the implementation in sklearn.

About the author

Simran Kaur

Simran works as a technical writer. She holds an MS in Computer Science from a university in Silicon Valley, the well-known CS hub, and is also an editor of the website. She enjoys writing about any tech topic, including programming, algorithms, cloud, data science, and AI. Her hobbies include travelling, sketching, and gardening.