CS5350/CS6350, Machine Learning
Learning Decision Trees

Decision Tree Learning Overview
- Decision Tree Learning is one of the most widely used and practical methods for inductive inference over supervised data
- It approximates discrete-valued functions (as opposed to continuous ones)
- It is robust to noisy data
- Decision trees can represent any discrete function on discrete features
- It is also efficient for processing large amounts of data, so it is often used in data mining applications
- Decision tree learners' bias typically prefers small trees over larger ones
Decision Tree Representation
- A type of concept hierarchy (clustering algorithms also form hierarchies)
- Allows arbitrary conjunction & disjunction
- Can categorize instances into multiple disjoint categories
- Each internal node specifies an attribute that an object must be tested on
- Each leaf node represents a classification
Decision Tree Example
[Figure: an example diagnosis tree. The root tests Pain (none, throat, chest, abdomen); internal nodes test Fever and Cough; the leaves are the diagnoses heart attack, appendicitis, strep, cold, flu, and none.]
This tree can be rewritten as rules in disjunctive normal form, as sketched below.
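As a sketch of how that works: each root-to-leaf path becomes a conjunction of attribute tests, and a class label's rule is the disjunction of all paths ending in its leaves. The specific tests below are hypothetical readings of the figure, not taken from the slides:

  strep \leftarrow (Pain = throat) \wedge (Fever = yes)
  flu   \leftarrow \big((Pain = none) \wedge (Fever = yes) \wedge (Cough = yes)\big) \vee \ldots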
Basic Algorithm
AF Top-down (starting from root),divisive (versus agglomerative)
algorithm
AF Asks:,Which attribute should be tested at this node?”
AF Funnels instances down the tree according to their values for that
attribute.
AF Picks most common class if no attributes left to test
AF Greedy (never backtracking) search for the,best” tree that fits the
examples.
Choosing your Instance Representation
- Including informative attributes in your representation is very important! If a crucial attribute is not represented, then no decision tree will be able to learn the concept.
- If two (noise-free) training instances have the same representation but belong to different classes, then the attribute set is said to be inadequate. For example:

  Name    Cough   Fever   Weight   Pain     Diagnosis
  Ernie   no      yes     normal   throat   flu
  Bert    no      yes     normal   throat   appendicitis
Choosing a Good Decision Tree
- If the attributes are adequate, it is always possible to construct a decision tree that correctly classifies all of the training instances.
- There are often many correct decision trees.
- Many algorithms prefer the simplest tree that correctly classifies the training instances (an "Occam's razor" bias).
- The simplest tree captures the most generalization and hopefully represents only the most essential relationships.
Preference for Simple Trees
Given examples: (+ <big, red, circle>), (+ <small, red, square>), (- <medium, red, circle>)

[Figure: the simplest consistent tree tests only Size: big -> +, med -> -, small -> +]
A Complex Tree
This tree is also consistent with:
(+ <big, red, circle>), (+ <small, red, square>), (- <medium, red, circle>)

[Figure: a much deeper tree that tests Color (red, blue), Shape (circle, square), and Size (small, med, big) at multiple nodes, yet classifies the same three examples correctly]
Top-Down Induction of Decision Trees
DTree(examples, attributes)
  If all examples are in one category,
    then return a leaf node with this category as label
  Else if attributes is empty,
    then return a leaf node with the most common category in examples as label
  Else A <- the "best" decision attribute for node
    Assign A as decision attribute for node
    For each value v_i of A:
      Let examples_i be the subset of examples with value v_i for A
      If examples_i is empty,
        then create a leaf node with the label most common in examples
        else call DTree(examples_i, attributes - {A})
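A minimal runnable sketch of this procedure in Python, under assumptions not stated in the slides: each example is a dict with a 'class' key, attributes is a set of attribute names, domains maps each attribute to its set of possible values, and the attribute-selection criterion is passed in as choose (information gain, defined on the following slides, is the usual choice). Leaves are returned as bare class labels; internal nodes as small dicts.

from collections import Counter

def dtree(examples, attributes, domains, choose):
    """Top-down induction of a decision tree, following the pseudocode above.

    examples   -- list of dicts, each with a 'class' key plus attribute values
    attributes -- set of attribute names still available to test
    domains    -- dict mapping each attribute name to its set of possible values
    choose     -- function (examples, attributes) -> the "best" attribute to test
    """
    labels = [ex["class"] for ex in examples]
    if len(set(labels)) == 1:            # all examples are in one category
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    if not attributes:                   # no attributes left to test
        return majority
    best = choose(examples, attributes)  # e.g. the attribute with highest gain
    node = {"attribute": best, "children": {}}
    for value in domains[best]:
        subset = [ex for ex in examples if ex[best] == value]
        if not subset:                   # empty subset: label with parent majority
            node["children"][value] = majority
        else:
            node["children"][value] = dtree(subset, attributes - {best},
                                            domains, choose)
    return node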
Picking the Best Attribute
- Finding a minimal decision tree consistent with a set of data is NP-hard.
- DTree produces a fairly simple tree but cannot guarantee optimality.
- Want a test that creates subsets which are relatively "pure" in one class, so they are closer to being leaf nodes.
- Information gain is the most popular of many such possible methods.
Entropy
[Plot: Entropy(S) as a function of the proportion p_+ of positive examples; the curve is 0 at p_+ = 0, rises to its maximum of 1.0 at p_+ = 0.5, and falls back to 0 at p_+ = 1]
- S is a sample of training examples
- p_+ is the proportion of positive examples in S
- p_- is the proportion of negative examples in S
- Entropy measures the impurity, or disorder, of S:

  Entropy(S) = -p_+ \log_2 p_+ - p_- \log_2 p_-
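As a quick check against the plot, evaluating this formula at the midpoint and at an endpoint (using the convention 0 \log_2 0 = 0):

  Entropy(S) = -\frac{1}{2}\log_2\frac{1}{2} - \frac{1}{2}\log_2\frac{1}{2} = 1 bit at p_+ = 0.5
  Entropy(S) = -1 \log_2 1 - 0 \log_2 0 = 0 bits at p_+ = 1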
What does Entropy Mean?
- Expected number of bits needed to encode the class of a randomly drawn member of S, under the optimal, shortest-length code
- Information theory says this is -\log_2 p for a "message" having probability p
- So the expected number of bits to encode + or - of a random member of S is:

  p_+ (-\log_2 p_+) + p_- (-\log_2 p_-)

- For multiple classes this generalizes to

  Entropy(S) = \sum_{i=1}^{c} -p_i \log_2 p_i

  where c is the number of classes and p_i is the proportion of S belonging to class i
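A small Python sketch of this multi-class entropy (an illustration, not part of the slides), with spot checks that agree with the binary curve above:

from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a collection of class labels, in bits:
    the sum over classes of -p_i * log2(p_i)."""
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

print(entropy(["+", "-"]))            # 1.0 bit for a 50/50 split
print(entropy(["+", "+", "+", "+"]))  # 0.0 bits for a pure sample
print(entropy(["a", "b", "c", "d"]))  # 2.0 bits for four equally likely classes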
Evaluating Attributes with Information Gain
- Not the only metric for evaluating attributes
- Like most methods used, it has the effect of preferring attributes whose values are correlated with the class
- Gain(S, A) = the expected reduction in entropy due to distributing the instances on A:

  Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)
[Figure: two candidate splits of S = [29+, 35-]; attribute A1 splits it into [21+, 5-] and [8+, 30-], while A2 splits it into [18+, 33-] and [11+, 2-]]
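To check the figure's comparison numerically, here is a hedged Python sketch that works directly from the [pos+, neg-] counts shown:

from math import log2

def H(pos, neg):
    """Binary entropy of a [pos+, neg-] collection, in bits."""
    total = pos + neg
    return sum((c / total) * log2(total / c) for c in (pos, neg) if c > 0)

def gain(parent, splits):
    """Gain(S, A) = Entropy(S) - sum over subsets of |S_v|/|S| * Entropy(S_v)."""
    n = sum(parent)
    return H(*parent) - sum(((p + q) / n) * H(p, q) for p, q in splits)

# Counts from the figure: S = [29+, 35-] split two different ways.
print(gain((29, 35), [(21, 5), (8, 30)]))   # A1: roughly 0.27 bits
print(gain((29, 35), [(18, 33), (11, 2)]))  # A2: roughly 0.12 bits, so A1 is preferred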
A simple example
X       Y        CLASS
blue    large    yes
red     small    yes
green   small    no
white   medium   yes

Entropy(S) = ?
Gain(X) = Entropy(S) - ? = ?
Gain(Y) = Entropy(S) - ? = ?
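One way to fill in these blanks, as a worked computation from the table (3 yes, 1 no):

  Entropy(S) = -\frac{3}{4}\log_2\frac{3}{4} - \frac{1}{4}\log_2\frac{1}{4} \approx 0.811
  Gain(X) = 0.811 - (\frac{1}{4}\cdot 0 + \frac{1}{4}\cdot 0 + \frac{1}{4}\cdot 0 + \frac{1}{4}\cdot 0) = 0.811
  Gain(Y) = 0.811 - (\frac{1}{4}\cdot 0 + \frac{2}{4}\cdot 1 + \frac{1}{4}\cdot 0) \approx 0.311

Note that X, whose four values each occur exactly once, achieves the maximum possible gain; the next slide explains why such apparent gains should be treated with caution.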
Assumptions of Information Gain
Let p = the number of positive training instances and
    n = the number of negative training instances.

An appropriate decision tree will classify objects in the same proportions as they are found in the training set. So an arbitrary object will be assigned to class P with probability p/(p+n), and assigned to class N with probability n/(p+n).
Suppose the training instances are assigned a random value for attribute A. This produces a partition C_1, C_2, ..., C_m. Unless the proportion of class P objects in each C_i is exactly the same as the proportion of class P objects in C, attribute A will appear to yield an information gain!
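A quick illustration with a hypothetical setup (not from the slides): take 4 P and 4 N instances, and let a meaningless attribute A be assigned random values that happen to split them into {3P, 1N} and {1P, 3N}. Each subset has entropy of about 0.811, so

  Gain(S, A) \approx 1 - \frac{1}{2}(0.811) - \frac{1}{2}(0.811) \approx 0.19 > 0

even though A carries no real information about the class.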
Hypothesis Space in Decision Tree Induction
- Conducts a search of the space of decision trees, which can represent all possible discrete functions
- Creates a single, discrete hypothesis consistent with the data
  – No way to provide confidences
  – No way to create useful queries
- Performs hill-climbing search, so it may find a locally optimal solution.
- Guaranteed to find a tree that fits any noise-free training set, but it may not be the smallest.
- Batch learner: bases each decision on all examples, but can terminate early to avoid fitting noisy data.
Inductive Bias in Decision Tree Learning
- Prefers shallow trees over deeper ones (a search bias, or as the book calls it, a preference bias).
- No language bias (or, as the book calls it, a restriction bias).
  ID3 can be viewed as a heuristic approximation of breadth-first search for decision tree generation!
- Prefers to branch on attributes with the highest information gain first.
- Outputs a single hypothesis – all eggs in one basket!
Occam’s Razor Bias
Occam's Razor heuristic: Prefer the simplest hypothesis that fits the data.

Why prefer simpler hypotheses over complex hypotheses?

Common rationale: There are fewer simple hypotheses, so it is less likely to be a coincidence when a simple hypothesis fits the data.

Counter arguments:
- But there are lots of meaningless small hypothesis sets!
- The complexity of a hypothesis depends on the representation used by the learner.
Learning Decision Trees from Unsupervised Data
- Cannot evaluate attributes based on their ability to discriminate among classes
- Instead, pick an attribute based on its ability to discriminate among the values of all the other remaining attributes!
- Entropy is based on the proportion of S having each value for an attribute
- Average over all of these information gains (see the sketch below)
- When given a test instance, the tree can be used to predict any attribute that is missing
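A hedged Python sketch of this idea (the function and variable names are assumptions, not from the slides): score a candidate attribute by its average information gain for predicting each of the other remaining attributes' values.

from collections import Counter
from math import log2

def entropy(values):
    """Entropy of a collection of attribute values, in bits."""
    n = len(values)
    return sum((c / n) * log2(n / c) for c in Counter(values).values())

def gain(examples, split_attr, target_attr):
    """Information gain of split_attr for predicting target_attr's values."""
    n = len(examples)
    total = entropy([ex[target_attr] for ex in examples])
    for v in {ex[split_attr] for ex in examples}:
        subset = [ex[target_attr] for ex in examples if ex[split_attr] == v]
        total -= (len(subset) / n) * entropy(subset)
    return total

def unsupervised_score(examples, attr, attributes):
    """Average information gain of attr over every other remaining attribute,
    each treated in turn as the 'class' to be predicted."""
    others = [a for a in attributes if a != attr]
    return sum(gain(examples, attr, target) for target in others) / len(others)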