Document Information
Package base, version 2.8.0
4/19/2009
Project 5: Statistical Learning with Multi-Scale Cardiovascular Data
Contact email: [email protected]
1.1.1. Special Adjustment for Root-NA-NA Node
1.5. Depth, Real Depth and Label
2.4. Functions split.cont() and split.discrete()
2.5. Function impurity.at.node()
2.6. Function predict.na.tree()
2.7. Functions draw.tree() and draw.node()
In traditional classification tree modeling, it is assumed that missing (NA) data are absent, or at least relatively rare. For the Reynolds cardiovascular datasets we have been studying, missing data are the rule rather than the exception. For example, imaging data are available only for a small subset of subjects. To address this situation we developed a classification method which we refer to as the NA-Tree Classifier. It makes use of traditional classification tree concepts. However, instead of creating a decision tree with the familiar binary structure, it creates a ternary tree, adding an extra NA branch and corresponding node associated with subjects missing data for the variable defining a given split.
For instance, when we ask subjects survey questions there may be informative responses such as "Yes" or "No", but there is also the possibility of noninformative responses such as "Not Applicable" or "Not Available". For informative responses, we can follow the traditional tree approach and create branches and child nodes for each possible response. For the latter two responses, we would not know what to do with such subjects in the traditional setting. In the NA tree approach, however, we solve this by categorizing such subjects into the NA node.
Conceptually, NA trees use the same ideas as traditional classification trees, but making use of a possible third branch (NA branch) at each node gives us much more flexibility when dealing with incomplete data. The method described here should prove useful in large medical studies when subsets of the subject population are chosen for particular analyses.
As described above, each non-leaf node of the NA tree has at most three branches. Subjects traverse the tree starting at the root node. Each non-leaf node has an associated feature variable and a threshold. A subject is passed to one of the node's children as follows. If the subject's feature value is available and less than or equal to the threshold, it passes to the left child. If it is available and greater than the threshold, it passes to the right child. If the feature value is unavailable for the subject, then it passes to the middle (NA) node.
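The routing rule can be summarized in a few lines of R (a minimal sketch; route() is an illustrative helper of ours, not part of the NA tree code):

route <- function(x, t) {
  # x is a subject's feature value, t the node's threshold
  if (is.na(x))      "NA (middle) child"
  else if (x <= t)   "left child"
  else               "right child"
}
route(NA, 0.8)   # subjects missing the feature go to the NA node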
A typical NA tree
would have the following appearance:
As described in Section 1.2, "Pruned Tree Classifier", of the Painted Tree Classifier document, our sample size is somewhat limited, so in order to avoid over-fitting we restrict trees to have a maximum depth of two, meaning that classification of a subject involves decisions made in at most two nodes. In instances where the sample size is larger, higher-depth trees should certainly be considered.
1.1.1. Special Adjustment for Root-NA-NA Node
Most NA trees resemble the one shown above. The only exception concerns the split Root -> NA -> NA: although that node is at depth 2, its real depth is still 0; that is, it has not been split by any real feature at all. So if there is impurity at that node, we split it again to further classify that part of the data. If node * (indicated in the previous graph) contains both firings and not firings (i.e., impurity), the NA tree will have the following appearance:
Some of the features we use are categorical variables, for example, sex and blood type. To incorporate such features in the NA tree, we create binary variables associated with each possible variable value, and split the tree based on the criterion "feature equals value" vs. "feature does not equal value", and of course NA.
The discrete tree
will look like:
Note: In the
above figure "!=" is used to denote "not equal to".
At each node, we need to find an optimal split, which amounts to picking a best feature and a threshold (or a value to split on, in the categorical case). For this selection we use the entropy criterion. We try all the possible features and a large collection of thresholds (or values), and we calculate the entropy decrease caused by each split. We then select the feature and threshold (value) pair giving rise to the largest entropy decrease. If none of the choices leads to a decrease, or if the decrease in entropy is not sufficiently large, the node is made into a leaf.
We define the entropy at each node as

$H = -p_1 \log_2 p_1 - p_2 \log_2 p_2$

where $p_1$ and $p_2$ are the proportions of firings and not firings among the subjects at the node.
Definitions of entropy and other relevant concepts are introduced in Document 2, "Feature Selection Instruction", Section 1.1, "Information Theory, Entropy, CMI".
And the entropy after splitting is defined as

$H_{split} = w_L H_L + w_{NA} H_{NA} + w_R H_R$

where $H_L$, $H_{NA}$ and $H_R$ are the entropies at the left, NA and right child nodes, and $w_L$, $w_{NA}$ and $w_R$ are the fractions of the parent node's subjects passed to each child. From the definition, we know that $H_{split}$ is a weighted average over the three child nodes. It describes the average entropy these three nodes carry.
Naturally, the entropy difference is

$D = H - H_{split}$

Finally, we pick the feature $i$ and threshold $t$ which maximize $D$ at each node.
We illustrate the
construction with an example. We reach a node (parent node) where there
are 10 firings and 100 not firings. After we split the node with an
arbitrary feature and threshold, we arrive at the following tree:
Then the entropy at the parent node is $H = 0.439497$. Following the same formula, we obtain the entropies at the left node, NA node and right node as 0.7219281, 0.6500224 and 0.2472944, respectively. The entropy after the split works out to $H_{split} = 0.3990997$, so the difference (i.e., entropy decrease) is $D = 0.439497 - 0.3990997 = 0.0403973$.
We repeat the
same procedure for every possible feature and threshold, then pick the pair
with the largest D. That is our best split selection at that node.
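The parent-node entropy and the resulting decrease can be checked directly in R (a small verification of the arithmetic above, not part of the package code):

entropy <- function(n1, n2) {
  p <- c(n1, n2) / (n1 + n2)
  p <- p[p > 0]            # avoid log2(0)
  -sum(p * log2(p))
}
entropy(10, 100)           # 0.439497, the parent node entropy
0.439497 - 0.3990997       # D = 0.0403973, the entropy decrease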
For numeric features, it is inefficient to try every possible real number as a threshold. In addition, allowing for every possible threshold is likely to lead to over-fitting. Instead, we calculate the 20%, 50% and 80% percentiles of the feature in the training data and use these as thresholds.
For discrete features, we simply use every possible value (or level) as a threshold. For each level in that feature, the splitting criterion becomes: 1. feature A = level i; 2. feature A != level i; 3. feature A is NA.
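Both kinds of candidate split values can be generated as follows (a sketch with made-up data; only the quantile() and levels() calls mirror what the NA tree code actually uses):

x.num <- c(61, 72, NA, 85, 90, 77, 68, NA, 81, 74)        # toy numeric feature
quantile(x.num, probs = c(0.2, 0.5, 0.8), na.rm = TRUE)   # the three thresholds tried
x.disc <- factor(c("AA", "AB", NA, "BB", "AA"))           # toy discrete feature
levels(x.disc)    # each level i gives the splits ==, != and NA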
1.5. Depth, Real Depth and Label
As discussed in section 1.1.1, it is important to distinguish between the depth of a node in an NA-tree (tree depth) and the decision depth, i.e. the number of decisions that have actually been made based on available features to get to a given node.
A graphical illustration will make this clearer.
Since there are so many nodes and features, it is very hard to keep track of where each observation is classified. So we assign numerical labels to the nodes. The labels are shown in the graph. These labels are fixed and do not change when we change datasets. Thus, in the future, Node 5 always refers to the node reached by first splitting to the left (<= for a numeric feature, = for a discrete feature) and then to the right (> for a numeric feature, != for a discrete feature). And node 12 will always refer to the NA-NA node.
If node 12 has impurity, we will further split it and label the three new nodes as node 13 (left, <=), node 14 (right, >) and node 15 (middle, NA).
Our CVRG dataset contains approximately 100 features. Given the available sample size, it is not surprising that using all of these features for our NA tree produces poor cross-validated classification rates. So, instead, we pre-select roughly 12 of the 100 features based on cardiovascular experts' judgement, and perform the analysis based on these 12 features.
These features
are "meanHRbpm", "percPVC", "QTVI_log",
"LFdivHFPow", "SDNN_RST_msec", "snp2",
"snp5", "snp6", "Gender.x", "Inducible",
"LVEF", "DEmass".
We will update our pre-selected feature list according to the results.
We note the full
feature lists of all the important numeric features and discrete features here:
Discrete list
contains: "snp1", "snp2", "snp3",
"snp4", "snp5", "snp6", "Gender.x",
"Race", "cardiom.type", "Inducible".
Continuous list
contains: "meanHRbpm", "perckept", "percPVC",
"QTVI_log", "QTVInumer.1000", "QTVIdenom.1000",
"QTV_msecsqu", "HRV_bpmsqu", "QTintmean_sec",
"HRmean_bpm", "QTRRslope", "QTRRintercept",
"QTRR_r2", "meancoh", "VLFPow",
"LFPow", "HFPow", "TotPow_clinical_use",
"TotPow_mathematical", "LFdivHFPow",
"LFdivHFplLF", "RMSSD_msec", "pNN50",
"meanNN_msec", "SDNN_msec", "total_RRvariance_msec",
"SDNN_RST_msec", "sdRLTESQTintmsec",
"avgNssthoseelim", "avgNssthosekept", "LVEDV",
"LVESV", "LVEF", "LVEDMASS", "DEmass".
In our analysis, we have a total of 937 observations. Our data cleaning procedure, i.e. removing observations with NA implant dates and removing observations with negative days of observation, removed 343 observations, leaving us 594 observations. We use as a phenotype "fired within ndays of implant." Since the phenotype is based on observing individuals for at least ndays, in order to eliminate sampling bias we must exclude all subjects who were implanted less than ndays ago. We then build our NA tree for the response variable "fired" using the other features. The phenotype "fired" is defined by d$fired<-d$Days.To.First.App.Firing<=ndays, where ndays is a pre-specified number. This means that if the days to firing is smaller than or equal to ndays, we code a new variable "fired" as 1, and otherwise 0. The value of ndays can be changed depending on the researcher's goal. We have focused on an ndays value of 545, i.e. one and one half years. There are multiple reasons why we use the 545-day criterion. One point to make is that if we choose too small an observation-time criterion, we end up with too small a number of firings. On the other hand, if we require that subjects be observed for too long a time to be included, then we lose sample size.
For example, if we include all the patients who fired before 545 days, then both patients X and Y should be included, since both have days to firing less than 545 days (300 and 500, respectively). However, only patient X is observable to us; it is impossible to observe patient Y, because we only observed patient Y for 400 days and cannot know whether he fires after 400 days. So, being unable to include both patients X and Y, we would introduce bias into our model.
Note: Days of firing are independent of days of observation.
So, to sum up, in order to analyze firings that happen before 545 days, we have to exclude all the observations with days of observation less than 545 days. The observations left will have days to firing either before or after 545 days, but all of them have days of observation of at least 545 days.
As with every analysis, we need to cross-validate this method. The definition of cross validation and the basic method are explained in Document 5, "Painted Tree Classifier", Section 1.3, "Leave One Out Cross Validation (LOOCV)".
Cross validation is a method in machine learning for testing the performance of a statistical analysis. It is mainly used when the goal of the analysis is prediction and one wants to test its accuracy. The routine procedure is first to partition a sample of data into complementary subsets, then perform the analysis on one subset (called the training set or learning set) to obtain the model and parameters, and finally validate the model on the remaining data (called the testing set).
n-fold cross-validation partitions the original dataset into n subsets of approximately equal size. One of them is drawn as the testing set, and the remaining n−1 subsets are used as the training set. The cross-validation process is then repeated n times (the folds), with each subset used exactly once as the testing set. The n results from the folds can then be averaged to produce a single evaluation. The advantage of this method is that all observations are used for both training and validation, and each observation is used in the testing set exactly once.
The leave one out
cross validation mentioned in "Painted Tree Classifier" is the
special case of n-fold cross validation with n being the sample size.
In our analysis, we used 8-fold cross-validation.
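A minimal sketch of how such an 8-fold partition can be set up in R (illustrative only; fold.id and the loop below are our own names, not the exact partitioning code we used):

n.fold <- 8
fold.id <- sample(rep(1:n.fold, length.out = dim(d)[1]))   # random fold labels
for (k in 1:n.fold) {
  d.train <- d[fold.id != k, ]    # training set: the other 7 folds
  d.test  <- d[fold.id == k, ]    # testing set: fold k
  # grow the NA tree on d.train and validate it on d.test ...
}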
Since we changed the structure of the tree model, there is no existing software package that can do such an analysis. We programmed our own NA tree from scratch, which involves many functions. I am going to introduce them one by one, with their basic inputs, outputs and functionality.
2.1. Function tree.init()
This function basically takes a dataset which contains the features and the response variable, and creates a tree frame. Please note that this function does not really generate the NA tree; instead, it initiates the tree model by building the starting point of an NA tree root and the tree frame that tree.grow() will be based on. We can see below that the output of tree.init() is just a table with one entry (the root node).
Output of tree.init():
  node.label depth real.depth split.var.name split.var.value dec.rule
1          0     0          0         <leaf>              NA       NA
  parent.node  call count.fired count.nonfired
1          NA FALSE          26            411
As we can see,
the dataset has 411+26=437 patients (or observations), of which 26 patients
have fired and 411 patients have not fired.
2.2. Function tree.grow()
This function takes a dataset and a tree frame created by tree.init(), and generates the whole NA tree along with the classification of the training data. This is the biggest function, since it calls nearly all of the other functions at some point during the program. It can be regarded as an outline of the whole procedure, while each other function does one single step of the procedure. We need to explain the output of this function carefully, since it is the most important information our model gives. It is crucial to fully understand the output.
For example, we build up a tree and print the output.
> tree.d <- tree.init(d)
> tree.struct.d <- tree.grow(d, tree.d, 1, features, 0.5)
> tree.struct.d
Output:
$frame
   node.label depth real.depth split.var.name split.var.value dec.rule
1           0     0          0           snp2              AA     <NA>
2           1     1          1     LFdivHFPow          0.8012       ==
3           2     1          1      meanHRbpm         83.8006       !=
4           3     1          0           LVEF           34.62     <NA>
5           4     2          2         <leaf>            <NA>       <=
6           5     2          2         <leaf>            <NA>        >
7           6     2          1         <leaf>            <NA>     <NA>
8           7     2          2         <leaf>            <NA>       <=
9           8     2          2         <leaf>            <NA>        >
10          9     2          1         <leaf>            <NA>     <NA>
11         10     2          1         <leaf>            <NA>       <=
12         11     2          1         <leaf>            <NA>        >
13         12     2          0         <leaf>            <NA>     <NA>
   parent.node  call count.fired count.nonfired
1           NA FALSE          26            411
2            0 FALSE           2            109
3            0 FALSE          22            219
4            0 FALSE           2             83
5            1 FALSE           2             13
6            1 FALSE           0             57
7            1 FALSE           0             39
8            2 FALSE          13            153
9            2 FALSE           7             35
10           2 FALSE           2             31
11           3 FALSE           0             21
12           3 FALSE           2              3
13           3 FALSE           0             59
$where
  [1]  7  7  9  9  6  6  6  6  7  6  7 12  9  6 12  7 12  9  9 12  6
 [22]  9  6  6  6 10  6  9  6  9  6  6  6  9  9  9  6 12  9  9  9  7
 [43]  7  6  9  6  9  9  7  9  7  7  9 12  9  7  7  6  7  9  9  7  6
 [64]  9  7  7  7  7  8  9 12 12  7  7  8  6  7  6  9  7  9  7  9  6
 [85]  6  7  6  7  6  6  7  5  7  5  7  7  7  5  8  7  8 10  9  8  6
[106]  8  9  7  7  6  7  6  7  7  7  7  8  7  5  7  4  7  8  4  8  7
[127]  7  7  5  6  7  6  8  8  6  6  8  5  4  7  5  8  9  7  7  7  7
[148]  4  8  5  6  5  7  7  7  8  9  7  7  7  8  7  7  7  4  7  8  7
[169]  4  7  7  5  7  5  5  4  7  7  8  7  7 10  5  8 12  7  7 12  7
[190]  7  7  7  5  7  7  7  7  5  7  7  7  7  7  5 10  7 12 12  7 12
[211]  7  7  7  7  7  7  7  8  5  7  7  8  7  7  7  5  5  5 10  7  7
[232] 12  8  5  7  8  9  6  8 12  7  8  8  8  5  5  5  4  7  7  7  8
[253]  7  7  4  8  7  7  5  7 12  5  7  7  8  7  4  7  8  5  8  7  7
[274]  8  5  8  7  7  7  5  7  7 12  7  5  7  7  5  7  5  5  7  7 12
[295]  5  5  8  7  8  7  7  7  7  7  8  7  5  5  5  7  8 12  5  7  8
[316]  4  5  7  4  7 10  4  5  7  7  6  9  8  7  7  5  8  7  8  7  4
[337]  5  7  7  7  5  7  5  7  7 12  7  5 12  5  7 12 10 12 12 12 12
[358] 12 12 12 12 12 12 11 10 12 12 12 12 12 10 12 10 12 12 11 12 12
[379] 11 12 10 12 12 12 12 12 10 11 11 12 10 10 12 12 10 10 12 10 12
[400] 10 10 12  7  7  6  7  7  5  7 12  5  7  5  5  5  7  5 10  7  7
[421]  7  5  6  7  7  7  5  7  4  7  5  5 12 10 12 12 12
This is the main output. The field $frame gives the structure of the NA tree as a table. Since R cannot draw a tree with arrows and lines, we can only express the tree using a table. Each row in the output table stands for a node and its related information. For example, there is basic information such as the node label, its depth and real depth. It also shows whether or not the node is a leaf. If it is not a leaf, it shows what feature is used to split the node, along with the corresponding threshold. If the node is not the root, it shows how this node was split from its parent node (i.e., dec.rule) and the parent node label. In the end, it shows how the firings and not firings are distributed at each node, to give the reader an idea of how the NA tree is splitting.
The table above can be translated into the graph below, which represents exactly the same NA tree structure.
The second part of the output, $where, shows where each training sample is classified. Since only nodes 4 to 12 are leaves, the 437 patients are classified into these 9 leaves.
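A quick way to see how the patients are distributed over the leaves is to tabulate the $where vector (standard R; the counts match the count.fired plus count.nonfired columns of $frame):

table(tree.struct.d$where)   # patients per leaf node (labels 4 to 12)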
2.3. Function best.split()
This function is used at each node. It takes the dataset at that node and a list of features, and returns the best feature and threshold based on the criterion described in the previous section. It is one of the most important functions in this NA tree model.
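A typical call, as it appears inside tree.grow() (d.node is the data at node i, var.list the feature column numbers):

split.prop <- best.split(d.node, var.list, i)
split.prop$var.split.index   # column number of the best feature
split.prop$var.split.value   # the chosen threshold (or level)
split.prop$best.goodness     # the entropy decrease D of this split
# split.prop is NULL when no split decreases the entropy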
2.4. Functions split.cont() and split.discrete()
These two functions are used after best.split() has decided which feature and threshold to use. They are also used at each node. They take the feature and threshold and split the dataset at that node.
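Inside tree.grow() they are invoked as follows (taken from the growing loop; i is the label of the node being split):

split.tree <- split.cont(d, tree.frame, i, split.prop, where.samp, roc.thres.ratio)
tree.frame <- split.tree$frame   # tree frame with three new child rows
where.samp <- split.tree$where   # updated node assignment for each sample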
In the output
table of tree.grow(), each row is actually generated by either split.cont() or
split.discrete() depending on whether it is a numeric or discrete feature.
2.5. Function impurity.at.node()
This function is used to calculate the entropy at each node. It simply implements the entropy formula given above. It takes two numbers, the counts of firings and not firings, and returns a single number.
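For example, the entropy of the parent node in the worked example above can be computed as follows (the argument order in the code is not firings, then firings; for entropy the order does not matter):

impurity.at.node(100, 10)   # 0.439497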
2.6. Function predict.na.tree()
This function is used to predict the classification of new observations (patients). It takes the existing NA tree model and a vector or matrix of new observations, and returns the classification of these observations. It basically walks through the NA tree from top to bottom for every single observation and records the node where that observation ends up.
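A minimal usage sketch (numeric.or.factor is a named logical vector saying, for each feature, whether it is numeric; building it with sapply() and flattening the row with unlist() are our suggestions, not prescribed by the code):

numeric.or.factor <- sapply(d, is.numeric)   # TRUE where a column is numeric
new.obs <- unlist(d[1, ])                    # one observation as a named vector
pred <- predict.na.tree(tree.struct.d$frame, new.obs, numeric.or.factor)
pred$call    # the class call (fired / not fired) at the leaf reached
pred$where   # the label of the leaf node the observation ends up in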
2.7. Functions draw.tree() and draw.node()
These two functions print the NA tree in a more readable fashion, making it easier to distinguish between different depths and to understand the tree structure. Function draw.tree() prints the whole NA tree, while draw.node() only prints one node. For example, typical output looks like the following (it is the same NA tree developed above):
> draw.tree(tree.struct.d$frame)
 Node 0 : Root node  Fired 26 Nonfired 411 Total 437
   Node 1 : snp2 == AA  Fired 2 Nonfired 109 Total 111
     Node 4 : LFdivHFPow <= 0.8012  Fired 2 Nonfired 13 Total 15
     Node 5 : LFdivHFPow > 0.8012  Fired 0 Nonfired 57 Total 57
     Node 6 : LFdivHFPow is.na  Fired 0 Nonfired 39 Total 39
   Node 2 : snp2 != AA  Fired 22 Nonfired 219 Total 241
     Node 7 : meanHRbpm <= 83.8006  Fired 13 Nonfired 153 Total 166
     Node 8 : meanHRbpm > 83.8006  Fired 7 Nonfired 35 Total 42
     Node 9 : meanHRbpm is.na  Fired 2 Nonfired 31 Total 33
   Node 3 : snp2 is.na  Fired 2 Nonfired 83 Total 85
     Node 10 : LVEF <= 34.62  Fired 0 Nonfired 21 Total 21
     Node 11 : LVEF > 34.62  Fired 2 Nonfired 3 Total 5
     Node 12 : LVEF is.na  Fired 0 Nonfired 59 Total 59
> draw.node(tree.struct.d$frame,2)
   Node 2 : snp2 != AA  Fired 22 Nonfired 219 Total 241
In order to prepare data to be used in the NA tree model, we need the following code to organize the data. The following piece of code cleans the data and generates the key variable ("fired") on which our NA tree model is developed. The data.csv file is just the file we get after merging all the datasets.
d <- read.csv("./data.csv",as.is=TRUE)
# keep only the implanted subjects with positive days of observation
d <- d[!is.na(d$Implant.Date),]
d <- d[d$Days.Of.Observation>0,]
# keep only subjects with at least ndays of observation
ndays<-545
d <- d[d$Days.Of.Observation>=ndays,]
#
# Create a "fired" [logical] variable
d$fired<-d$Days.To.First.App.Firing<=ndays
3. R Code
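The first function below, make.fired.5050(), replicates the fired observations so that the fired and not fired cases in the training data are roughly balanced; the header comment below is our gloss, inferred from the function body rather than taken from the original listing.

#######################################################################################
#
# make.fired.5050: replicates the fired observations so that fired and not fired
# cases are roughly balanced (hence "5050"); used to prepare the training data.
# (Description inferred from the function body.)
#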
make.fired.5050<-function(d)
{
  # Split d into two halves: one with the fired data, the other with the not fired data.
  d.fired<-d[d$fired==1,]
  d.not.fired<-d[!d$fired,]
  # Calculate how many fired observations there are.
  k.fired<-dim(d.fired)[1]
  # Calculate how many not fired observations there are.
  k.not.fired<-dim(d.not.fired)[1]
  # Get the ratio of the number of not fired observations to fired observations
  # (rounded down to the nearest integer).
  rat<-floor(k.not.fired/k.fired)
  if (rat>0)
  {
    # If the ratio is larger than 0, store d.fired in d.temp.
    d.temp<-d.fired
    if (rat>1)
    {
      # If the ratio is larger than 1, attach replicates of d.fired to the end of d.temp
      # rat-1 times. (A ratio larger than 1 means the not fired observations are at least
      # twice as many as the fired ones.)
      for (i in seq(1,rat-1))
      {
        d.temp<-rbind(d.temp,d.fired)
      }
    }
  }
  else
  {
    # If the ratio is 0, give an error message.
    print("error with replication: no firings for this dataset\n")
  }
  # Combine d.temp and d.not.fired together.
  rbind(d.temp,d.not.fired)
}
#######################################################################################
#
# Finds the best split in a node, given the data, over all the variables in list.vars.
# Returns a list that includes the index of the best feature for a split, a measure of
# the goodness of the split, and the best splitting value for that feature.
# d.node is the subset of data this node contains.
# list.vars is the list of column numbers of the features.
# node.number is the node label of the node that is being considered for a split.
best.split<-function(d.node,list.vars,node.number)
{
  # Assign d.node to d.
  d<-d.node
  # Assign the "fired" column of the data to y.
  y<-d$fired
  # Assign node.n the total number of observations.
  node.n<-dim(d)[1]
  # print(paste( "dimension of d.node",dim(d)))
  # Create empty lists goodness.list, var.key.list and var.value.list.
  goodness.list<-list()
  var.key.list<-list()
  var.value.list<-list()
  # For each variable and each threshold (all levels for discrete variables,
  # a limited number of thresholds for continuous ones) compute the goodness
  # and return the maximum.
  # var.key.list and var.index.vec store which variable the split is for.
  # goodness.list and goodness.vec store the value of goodness for that split.
  # var.value.vec holds the splitting value (a threshold for continuous variables,
  # or the level for the equal/not-equal check for discrete variables).
  # For discrete variables only partitions of (single level vs. all others) are checked:
  # l checks for l levels instead of 2^(l-1)-1 checks (all possible partitions of levels).
  # This only matters if there are more than 3 levels.
  # For every variable in list.vars, we run the following to get an optimal split.
  for (i in list.vars) {
    # Create empty vectors for the variable index, goodness of split and threshold.
    var.index.vec<-c()
    goodness.vec<-c()
    var.value.vec<-c()
    # If the variable is numeric
    if (is.numeric(d[,i])) {
      # Set the discrete indicator to FALSE; it is used later in goodness.of.split().
      is.discrete<-F
      # Create an empty vector for the sample quantiles.
      samp.qt<-c()
      # Get the number of non-NA values of the variable.
      node.not.na<-sum(!is.na(d[,i]))
      # Choose the number of possible thresholds; if the number of samples in the
      # node is not large enough, use the mean as the only threshold.
      # If the number of non-NA values is larger than or equal to 10
      if (node.not.na >= 10){
        # Create a list of probabilities (0.2 0.5 0.8)
        list.qt<-seq(0.2,0.8,by=0.3)
        # Get the corresponding quantiles of the variable.
        samp.qt<- quantile(d[,i],probs=list.qt,na.rm=T)
      }
      else {
        # If the number of non-NA values is less than 10,
        # we only test one threshold, which is the mean.
        samp.qt<-mean(d[,i],na.rm=T)
      }
      # Assign samp.qt to var.value.vec.
      var.value.vec<-samp.qt
      # For every quantile in samp.qt, run the loop to build up the vectors of
      # goodness of split values and variable indices.
      for (j in samp.qt) {
        # Get the goodness of split for this quantile.
        goodness.val<-goodness.of.split(y,d[,i],j,is.discrete)
        # Put the goodness of split into a vector.
        goodness.vec<-c(goodness.vec, goodness.val)
        # Put the variable index into a vector.
        var.index.vec<-c(var.index.vec,i)
        # var.value.vec<-c( var.value.vec, as.character(j))
      }
    } else {
      # If the variable d[,i] is discrete, run the following instead.
      # Convert the variable into a factor and store it in "disc.var".
      disc.var<-as.factor(d[,i])
      # Set the discrete indicator to TRUE; it is used later in goodness.of.split().
      is.discrete<-T
      # List all the unique values (levels) of disc.var.
      disc.levels<-levels(disc.var)
      # Put all the unique values into a vector.
      var.value.vec<-disc.levels
      # For every value in disc.levels, run the loop to build up the vectors of
      # goodness of split values and variable indices (as in the numeric case).
      for (thres in disc.levels) {
        goodness.val<-goodness.of.split(y, disc.var, thres, is.discrete)
        goodness.vec<-c(goodness.vec, goodness.val)
        var.index.vec<-c(var.index.vec,i)
      }
    }
    # Append to the list of goodness values;
    # var.key.list holds which variable each goodness value is for.
    goodness.list<-list(goodness.list,goodness.vec)
    var.key.list<-list(var.key.list,var.index.vec)
    var.value.list<-list(var.value.list, var.value.vec)
  }
  # Unlist goodness.list, var.key.list and var.value.list so that they become
  # three vectors instead of three nested lists.
  goodness.unlist <- unlist(goodness.list)
  var.key.unlist <- unlist(var.key.list)
  var.value.unlist <- unlist(var.value.list)
  # Look for the maximum in the vector "goodness.unlist" and find the corresponding
  # entry in the vector "var.key.unlist".
  argmax.goodness<-which.max(goodness.unlist)
  var.best.split<-var.key.unlist[argmax.goodness]
  # If the variable we just found is numeric, store the threshold as a numeric value in
  # "var.thres"; otherwise store it as it is.
  if (is.numeric(d[, var.best.split])) {
    var.thres<-as.numeric(var.value.unlist[argmax.goodness])
  } else {
    var.thres<-var.value.unlist[argmax.goodness]
  }
  # Record the maximum value in the vector "goodness.unlist".
  best.goodness.in.list <- goodness.unlist[argmax.goodness]
  # If there is no split that decreases impurity in the node, return NULL;
  # otherwise return the properties of the best split, which include the maximum
  # goodness of split, the best variable index and the corresponding threshold value.
  if (best.goodness.in.list>1e-06) {
    return(list(best.goodness = best.goodness.in.list, var.split.index = var.best.split ,
                var.split.value = var.thres ))
  } else {
    return(NULL)
  }
}
#######################################################################################
#
# Initializes the tree frame data structure
#
tree.init <- function(d)
{
  # Create the basic variables for the tree frame
  # and initialize them with the value 0, NA or an empty vector.
  # For the split variable name, assign "<leaf>" for now.
  node.depth<-c(0)
  node.real.depth<-c(0)
  node.split.var.name<-c("<leaf>")
  node.split.var.value<-c(NA)
  labels<-c(0)
  node.dec.rule<-NA
  node.parent.node<-NA
  # Record the number of observations with and without firings.
  num.fired<-sum(d$fired==1)
  num.nonfired<-sum(d$fired==0)
  # If the number of not firings is smaller than or equal to the number of firings,
  # assign TRUE to class.call, otherwise FALSE.
  class.call <-c(num.nonfired <= num.fired )
  # Create the tree frame as a data.frame object with fields for node label, depth,
  # real depth, split variable name, split variable value, decision rule, parent node,
  # call, number of firings and number of not firings.
  tree.frame<-as.data.frame(list(node.label=labels,
    depth=node.depth, real.depth=node.real.depth,
    split.var.name=node.split.var.name, split.var.value =node.split.var.value,
    dec.rule=node.dec.rule, parent.node = node.parent.node, call=class.call,
    count.fired=num.fired, count.nonfired=num.nonfired),
    stringsAsFactors=F)
  # Return the tree frame.
  return(tree.frame)
}
##############################################################################################
#
# Grows a tree starting from the root node using the variables with indices in $var.list$.
# Returns a list containing the data frame that stores the description of the tree,
# and a vector that stores in which node each sample in d ends up.
# $d$ is the dataset.
# $tree.frame$ is the initialized data frame that stores the description of the tree.
# $var.list$ is the list of column numbers of the features.
# $roc.thres.ratio$ is the minimum ratio of fireds to nonfireds at which a node is
# classified as fired.
tree.grow<- function(d, tree.frame,tree.size, var.list,roc.thres.ratio)
{
  # Assign n the number of observations in the dataset.
  n<-dim(d)[1]
  # Assign a vector of 0s to where.samp.
  where.samp<-rep(0,n)
  # Assign FALSE to the stop.grow indicator.
  stop.grow<-F
  # Assign TRUE to the no.further.splits indicator.
  no.further.splits<-T
  while(stop.grow==F)
  {
    # Get the labels of all the leaves.
    leaf.labels<-tree.frame$node.label[tree.frame$split.var.name=="<leaf>"]
    # print(leaf.labels)
    # Reset the no.further.splits indicator to TRUE.
    no.further.splits<-T
    # Reset the split variable to 0.
    split.using.variable <- 0
    # For every leaf, run the following procedure to grow the tree.
    for (i in leaf.labels) {
      # If the node has no samples, skip the current loop iteration.
      if ( sum(where.samp==i)==0) next
      # Assign d.node the dataset at node i.
      d.node<-d[where.samp==i,]
      # print(dim(d.node))
      # Call best.split() to get the best split variable and threshold for node i.
      split.prop<-best.split(d.node, var.list, i)
      # print(split.prop)
      # print(paste("i is ",i))
      # If split.prop returns a split instead of NULL, split the node and
      # add more rows to the tree frame table.
      if(!is.null(split.prop)){
        # Assign FALSE to no.further.splits.
        no.further.splits<-F
        # Get the split variable.
        split.using.variable <- split.prop$var.split.index
        # If the split variable is numeric, use split.cont(); otherwise use split.discrete()
        # to add more rows to the tree frame and to update where.samp.
        if (!is.numeric(d[,split.using.variable])){
          # Call function split.discrete().
          split.tree<-split.discrete(d,tree.frame, i, split.prop, where.samp,roc.thres.ratio)
          tree.frame<-split.tree$frame
          where.samp<-split.tree$where
        } else {
          # Call function split.cont().
          split.tree<-split.cont(d,tree.frame, i, split.prop, where.samp,roc.thres.ratio)
          tree.frame<-split.tree$frame
          where.samp<-split.tree$where
        }
      }
      # print(where.samp)
      # draw.tree(tree.frame)
    }
    # Extract the "leaf" part of the tree frame and assign it to leaf.tree.
    leaf.tree<-tree.frame[tree.frame$split.var.name=="<leaf>",]
    # print(tree.frame)
    # print(min(leaf.tree$depth))
    # If the NA tree reaches a depth of 2, or no.further.splits is TRUE, or there
    # are no variables in var.list, then stop growing the tree (i.e. stop.grow<-T).
    if ( (max(leaf.tree$depth)==2)||(no.further.splits)||(length(var.list)==0) ) {
      stop.grow<-T
    }
    # print(paste("No Further Splits",no.further.splits) )
  }
  # If we have impurity in the NA/NA node of the tree, that node is split again.
  # if (0) {
  # Extract the "leaf" part of the tree frame and assign it to leaf.tree.
  leaf.tree<-tree.frame[tree.frame$split.var.name=="<leaf>",]
  # If the minimum real depth is 0, find the position of the leaf whose real.depth is 0.
  if (min(leaf.tree$real.depth)==0) {
    all.na<-which.min(leaf.tree$real.depth)
    # Extract the label of the 0-real-depth leaf.
    i<-leaf.tree$node.label[all.na]
    if ( sum(where.samp==i)!=0) {
      d.node<-d[where.samp==i,]
      # print(dim(d.node))
      split.prop<-best.split(d.node, var.list, i)
      # print(split.prop)
      # print(paste("i is ",i))
      if(!is.null(split.prop)){
        split.using.variable <- split.prop$var.split.index
        if (!is.numeric(d[,split.using.variable])){
          split.tree<-split.discrete(d,tree.frame, i, split.prop, where.samp,roc.thres.ratio)
          tree.frame<-split.tree$frame
          where.samp<-split.tree$where
        }
        else {
          split.tree<-split.cont(d,tree.frame, i, split.prop, where.samp,roc.thres.ratio)
          tree.frame<-split.tree$frame
          where.samp<-split.tree$where
        }
      }
    }
  }
  # }
  return(list(frame=tree.frame, where=where.samp) )
}
#
# Split a node using a discrete variable and update (and return) the tree frame data structure.
# Returns the updated tree frame data structure and the updated $where.samp$ vector.
# $d$ is the dataset.
# $tree.frame$ is the current data frame that stores the description of the tree.
# $parent.node.label$ is the label of the node to be split.
# $split.prop$ is a list that contains the index of the splitting variable and the value
# to be used for splitting.
# $where.samp$ is a vector that stores in which node each sample in d ends up.
# $roc.thres$ is the minimum ratio of fireds to nonfireds at which a node is classified as fired.
split.discrete <- function(d,tree.frame,parent.node.label,split.prop,where.samp,roc.thres)
{
  y<-d$fired
  tree.size <- dim(tree.frame)[1]-1
  var.index<-split.prop$var.split.index
  split.value<-split.prop$var.split.value
  varname<-names(d)[var.index]
  parent.node.index<-which(tree.frame$node.label==parent.node.label)
  #
  # Create three new child nodes
  #
  left.node.indic <- as.character(d[,varname])==as.character(split.value)
  right.node.indic <- !left.node.indic
  na.node.indic <- is.na( d[,varname])
  left.node.indic [na.node.indic]<-F
  right.node.indic[na.node.indic]<-F
  left.node.indic <- left.node.indic &(where.samp==parent.node.label)
  right.node.indic <- right.node.indic &(where.samp==parent.node.label)
  na.node.indic <- na.node.indic &(where.samp==parent.node.label)
  where.samp[ left.node.indic ] <- tree.size+1
  where.samp[ right.node.indic ] <- tree.size+2
  where.samp[ na.node.indic ] <- tree.size+3
  #print(paste("parent.node ", parent.node.label) )
  #print(paste("parent.node.label is numeric ", is.numeric(parent.node.label)) )
  #print(paste("parent.node row ", parent.node.label+1) )
  #print(tree.frame)
  #print(paste("current depth is ",tree.frame$depth))
  #print(paste("current depth is ",tree.frame[parent.node.label+1,2]))
  fired.count.l<-sum(y[left.node.indic]==1)
  fired.count.r<-sum(y[right.node.indic]==1)
  fired.count.na<-sum(y[na.node.indic]==1)
  nonfired.count.l<-sum(y[left.node.indic]==0)
  nonfired.count.r<-sum(y[right.node.indic]==0)
  nonfired.count.na<-sum(y[na.node.indic]==0)
  leaf.call.l <- (fired.count.l/(fired.count.l+nonfired.count.l))>roc.thres
  leaf.call.r <- (fired.count.r/(fired.count.r+nonfired.count.r))>roc.thres
  leaf.call.na <- (fired.count.na/(fired.count.na+nonfired.count.na))>roc.thres
  new.node.labels <- c(tree.size+1,tree.size+2,tree.size+3)
  current.depth<-tree.frame$depth[parent.node.index]
  current.real.depth<-tree.frame$real.depth[parent.node.index]
  new.depth <- rep(current.depth+1,3)
  new.real.depth <- rep(c(current.real.depth+1,current.real.depth),c(2,1))
  new.split.varname <- rep("<leaf>",3)
  new.split.var.value <- rep(NA,3)
  new.dec.rule <- c("==","!=",NA)
  new.parent.node <- rep(parent.node.label,3)
  new.call <- c(leaf.call.l, leaf.call.r, leaf.call.na)
  new.fired.count <- c(fired.count.l,fired.count.r,fired.count.na)
  new.nonfired.count <- c(nonfired.count.l,nonfired.count.r, nonfired.count.na)
  #
  # add the new nodes to the tree.frame data frame
  #
  new.frame<-as.data.frame(cbind(new.node.labels,new.depth,new.real.depth,
    new.split.varname, new.split.var.value, new.dec.rule,
    new.parent.node,new.call,new.fired.count,new.nonfired.count ),
    stringsAsFactors=F)
  names(new.frame)<-c("node.label", "depth", "real.depth", "split.var.name",
    "split.var.value", "dec.rule", "parent.node","call","count.fired","count.nonfired")
  #print(new.frame)
  tree.frame<-rbind(tree.frame,new.frame)
  tree.frame$node.label<-as.numeric(tree.frame$node.label)
  tree.frame$depth<-as.numeric(tree.frame$depth)
  tree.frame$real.depth<-as.numeric(tree.frame$real.depth)
  tree.frame$parent.node<-as.numeric(tree.frame$parent.node)
  tree.frame$count.fired<-as.numeric(tree.frame$count.fired)
  tree.frame$count.nonfired<-as.numeric(tree.frame$count.nonfired)
  #
  # update the parent node
  #
  tree.frame$split.var.name[parent.node.index]<-varname
  tree.frame$split.var.value[parent.node.index]<-split.value
  return(list(frame= tree.frame, where=where.samp))
}
#
# Split a node using a continuous variable and update (and return) the tree frame data structure.
# Returns the updated tree frame data structure and the updated $where.samp$ vector.
# $d$ is the dataset.
# $tree.frame$ is the current data frame that stores the description of the tree.
# $parent.node.label$ is the label of the node to be split.
# $split.prop$ is a list that contains the index of the splitting variable and the value
# to be used for splitting.
# $where.samp$ is a vector that stores in which node each sample in d ends up.
# $roc.thres$ is the minimum ratio of fireds to nonfireds at which a node is classified as fired.
split.cont <- function(d,tree.frame,parent.node.label,split.prop,where.samp,roc.thres)
{
  y<-d$fired
  # print(split.prop)
  tree.size <- dim(tree.frame)[1] - 1
  var.index<-split.prop$var.split.index
  split.value<-split.prop$var.split.value
  varname<-names(d)[var.index]
  parent.node.index<-which(tree.frame$node.label==parent.node.label)
  # print(paste("is.numeric.value ",is.numeric(split.value)))
  #
  # Create three new child nodes
  #
  left.node.indic <- d[,varname]<=split.value
  right.node.indic <- !left.node.indic
  na.node.indic <- is.na( d[,varname])
  left.node.indic [na.node.indic]<-F
  right.node.indic[na.node.indic]<-F
  left.node.indic <- left.node.indic &(where.samp==parent.node.label)
  right.node.indic <- right.node.indic &(where.samp==parent.node.label)
  na.node.indic <- na.node.indic &(where.samp==parent.node.label)
  where.samp[ left.node.indic ] <- tree.size+1
  where.samp[ right.node.indic ] <- tree.size+2
  where.samp[ na.node.indic ] <- tree.size+3
  fired.count.l<-sum(y[left.node.indic]==1)
  fired.count.r<-sum(y[right.node.indic]==1)
  fired.count.na<-sum(y[na.node.indic]==1)
  nonfired.count.l<-sum(y[left.node.indic]==0)
  nonfired.count.r<-sum(y[right.node.indic]==0)
  nonfired.count.na<-sum(y[na.node.indic]==0)
  leaf.call.l <- (fired.count.l/(fired.count.l+nonfired.count.l))>roc.thres
  leaf.call.r <- (fired.count.r/(fired.count.r+nonfired.count.r))>roc.thres
  leaf.call.na <- (fired.count.na/(fired.count.na+nonfired.count.na))>roc.thres
  new.node.labels <- c(tree.size+1,tree.size+2,tree.size+3)
  current.depth<-tree.frame$depth[parent.node.index]
  current.real.depth<-tree.frame$real.depth[parent.node.index]
  new.depth <- rep(current.depth+1,3)
  new.real.depth <- rep(c(current.real.depth+1,current.real.depth),c(2,1))
  new.split.varname <- rep("<leaf>",3)
  new.split.var.value <- rep(NA,3)
  new.dec.rule <- c("<=",">",NA)
  new.parent.node <- rep(parent.node.label,3)
  new.call <- c(leaf.call.l, leaf.call.r, leaf.call.na)
  new.fired.count <- c(fired.count.l,fired.count.r,fired.count.na)
  new.nonfired.count <- c(nonfired.count.l,nonfired.count.r, nonfired.count.na)
  new.frame<-as.data.frame(cbind(new.node.labels,new.depth,new.real.depth,
    new.split.varname,
    new.split.var.value,new.dec.rule,new.parent.node,new.call,new.fired.count,new.nonfired.count),
    stringsAsFactors=F)
  names(new.frame)<-c("node.label", "depth", "real.depth", "split.var.name",
    "split.var.value", "dec.rule", "parent.node","call","count.fired","count.nonfired")
  tree.frame<-rbind(tree.frame,new.frame)
  tree.frame$node.label<-as.numeric(tree.frame$node.label)
  tree.frame$depth<-as.numeric(tree.frame$depth)
  tree.frame$real.depth<-as.numeric(tree.frame$real.depth)
  tree.frame$parent.node<-as.numeric(tree.frame$parent.node)
  tree.frame$count.fired<-as.numeric(tree.frame$count.fired)
  tree.frame$count.nonfired<-as.numeric(tree.frame$count.nonfired)
  #
  # update the parent node
  #
  tree.frame$split.var.name[parent.node.index]<-varname
  tree.frame$split.var.value[parent.node.index]<-split.value
  # print(paste("update parent.node", varname," ",split.value) )
  return(list(frame= tree.frame, where=where.samp))
}
#
# Computes the goodness (decrease in entropy) of a particular split based on the
# feature in split.var.
# Returns a goodness-of-split measure.
# $y$ is the vector of class labels.
# $split.var$ is the feature vector (the splitting variable).
# $thres$ is the value of the splitting variable used for splitting the node.
# $is.discrete$ is a flag that is TRUE if the splitting variable is a discrete variable.
goodness.of.split <- function (y,split.var,thres,is.discrete)
{
  # If a split is not necessary, return 0
  if ((sum(y==0)==0)||(sum(y==1)==0))
    return(0)
  na.node<-is.na(split.var)
  num.sample<-length(y)
  left.node<-rep(FALSE,num.sample)
  right.node<-rep(FALSE,num.sample)
  # print(num.sample)
  #print(paste("threshold",thres))
  if (is.discrete) {
    left.node<-(as.character(split.var)==as.character(thres))
    left.node[na.node]<- F
    right.node<-(as.character(split.var)!=as.character(thres))
    right.node[na.node]<- F
  } else {
    left.node<- split.var<=thres
    left.node[na.node]<- F
    right.node<- split.var > thres
    right.node[na.node]<- F
  }
  y.l<-y[left.node]
  y.r<-y[right.node]
  y.na<-y[na.node]
  # print(paste("class 0, class 1",sum(y==0) ,sum(y==1 )))
  # print(paste("left.node size",length(y.l),sum(y.l==0) ,sum(y.l==1 )))
  # print(paste("right.node size",length(y.r),sum(y.r==0) ,sum(y.r==1 )))
  # print(paste("na.node size",length(y.na),sum(y.na==0) ,sum(y.na==1 )))
  if ((length(y.l)==0)||(length(y.r)==0))
    return(-5)
  orig.imp <- impurity.at.node(sum(y==0) ,sum(y==1 ))
  left.imp <- impurity.at.node(sum(y.l==0) ,sum(y.l==1 ))
  right.imp <- impurity.at.node(sum(y.r==0) ,sum(y.r==1 ))
  na.imp <- 0
  if(length(y.na)!=0)
    na.imp <- impurity.at.node(sum(y.na==0),sum(y.na==1))
  wt.left<-length(y.l)
  wt.right<-length(y.r)
  wt.na<- length(y.na)
  wt.left<- wt.left/ num.sample
  wt.right<- wt.right/ num.sample
  wt.na <- wt.na/ num.sample
  imp.diff<- wt.left*left.imp+wt.right*right.imp+wt.na*na.imp-orig.imp
  # print(paste("imp.diff",imp.diff))
  return(-1*imp.diff)
}
#
# Entropy at a node
#
impurity.at.node <- function (y.1,y.2)
{
  tot.y<-y.1+y.2
  impurity<-0
  if (tot.y==0) {
    print("Entropy Comp. Error")
    impurity<- -1
  } else {
    p.1<-y.1/(tot.y)
    p.2<-y.2/(tot.y)
    impurity<-0
    if ((p.1!=0)&(p.2!=0))
    {
      impurity <- -1*(p.1*log2(p.1)+p.2*log2(p.2))
    } else {
      impurity <- 0
    }
  }
  return(impurity)
}
#impurity.at.node<- function(y.1,y.2)
#{
#  tot.y<-y.1+y.2
#  impurity<-0
#  if (tot.y==0) {
#    print("Entropy Comp. Error")
#    impurity<- -1
#  } else {
#    impurity<- min(y.1,y.2)/tot.y
#  }
#  return(impurity)
#}
#impurity.at.node<- function(y.1,y.2)
#{
#  tot.y<-y.1+y.2
#  impurity<-0
#  if (tot.y==0) {
#    print("Entropy Comp. Error")
#    impurity<- -1
#  } else {
#    impurity<- y.1*y.2/tot.y^2
#  }
#  return(impurity)
#}
predict.na.tree<-function(tree.frame, test.sample, numeric.or.factor)
{
  at.leaf.node<-F
  at.node<- 0
  test.where <- 0
  child.frame.index <- 0
  while(at.leaf.node==F) {
    node.frame.index<- which(tree.frame$node.label==at.node)
    # print(paste("node.frame.index",node.frame.index))
    split.variable.name<-tree.frame$split.var.name[node.frame.index]
    # print(paste("split.variable.name",split.variable.name))
    # print(paste("numeric.or.factor",numeric.or.factor[split.variable.name]))
    threshold.value<-tree.frame$split.var.value[node.frame.index]
    # print(split.variable.name)
    test.value<-test.sample[split.variable.name]
    # print(numeric.or.factor)
    # print(split.variable.name)
    if (numeric.or.factor[split.variable.name]) {
      #
      # the splitting variable is continuous
      #
      test.value <- as.numeric(test.value)
      threshold.value <- as.numeric(threshold.value)
      NA.child.node<-is.na(test.value)
      if (!NA.child.node) {
        left.node<- test.value<=threshold.value
        if (left.node) {
          child.frame.index <- which((tree.frame$parent.node==at.node) & (tree.frame$dec.rule=="<=") )
        } else {
          child.frame.index <- which((tree.frame$parent.node==at.node) & (tree.frame$dec.rule==">") )
        }
      } else {
        child.frame.index <- which((tree.frame$parent.node==at.node) & (is.na(tree.frame$dec.rule) ) )
      }
    }
    else {
      #
      # the splitting variable is discrete
      #
      test.value <- as.character(test.value)
      threshold.value <- as.character(threshold.value)
      NA.child.node<-is.na(test.value)
      if (!NA.child.node) {
        left.node<- test.value==threshold.value
        if (left.node) {
          child.frame.index <- which((tree.frame$parent.node==at.node) & (tree.frame$dec.rule=="==") )
        } else {
          child.frame.index <- which((tree.frame$parent.node==at.node) & (tree.frame$dec.rule=="!=") )
        }
      } else {
        child.frame.index <- which((tree.frame$parent.node==at.node) & (is.na(tree.frame$dec.rule) ) )
      }
    }
    # The child node becomes the current node
    at.node<-tree.frame$node.label[child.frame.index]
    # Check if this current node is a leaf.
    at.leaf.node <- (tree.frame$split.var.name[child.frame.index] =="<leaf>")
    # print(paste("child.frame.index",child.frame.index))
    #print(paste("at.node",at.node))
    #print(paste("at.leaf.node",at.leaf.node))
    if (at.leaf.node) {
      test.where<-at.node
    }
    # set the index to the row whose node.label is the current node
    node.frame.index<- which(tree.frame$node.label==at.node)
  }
  class.call<-tree.frame$call[node.frame.index]
  if (is.na(class.call)) {
    print(paste("leaf.node.index",node.frame.index))
    print(tree.frame)
    class.call<-F
  }
  return(list(call=class.call,where=test.where))
}
draw.node<-function(tree.frame,node.drawn)
{
  node.index<-which(tree.frame$node.label==node.drawn)
  if (node.index==0) {
    print("No such node exists")
    return()
  }
  parent.node.label<-tree.frame$parent.node[node.index]
  parent.node.index<-which(tree.frame$node.label==parent.node.label)
  num.fired<-tree.frame$count.fired[node.index]
  num.nonfired<-tree.frame$count.nonfired[node.index]
  num.total<-num.fired+num.nonfired
  node.depth<-tree.frame$depth[node.index]
  node.str<-character()
  if (node.depth>0) {
    for (i in 1:node.depth) {
      node.str<-paste(node.str,"\t")
    }
  }
  node.str<-paste(node.str,"Node ",node.drawn,": ")
  if (is.na(parent.node.label)) {
    node.str<-paste(node.str, "Root node ",
      " Fired" ,num.fired, "Nonfired",num.nonfired, "Total", num.total)
  } else if (!is.na(tree.frame$dec.rule[node.index])) {
    node.str<-paste(node.str,
      tree.frame$split.var.name[parent.node.index],tree.frame$dec.rule[node.index],
      tree.frame$split.var.value[parent.node.index],
      " Fired" ,num.fired, "Nonfired",num.nonfired, "Total", num.total)
  } else {
    node.str<-paste(node.str, tree.frame$split.var.name[parent.node.index] ,"is.na",
      " Fired" ,num.fired, "Nonfired",num.nonfired, "Total", num.total)
  }
  node.str<-paste(node.str,"\n")
  cat(node.str)
}
draw.tree<-function(tree.frame)
{
  all.nodes.drawn<-F
  current.node<-0
  tree.size<-dim(tree.frame)[1]
  drawn.nodes<-rep(0,tree.size)
  node.queue<-c()
  while(all.nodes.drawn==F)
  {
    draw.node(tree.frame,current.node)
    current.node.index <- which(tree.frame$node.label ==current.node)
    child.nodes.index <- which(tree.frame$parent.node==current.node)
    if (length(child.nodes.index)==0) {
    } else {
      child.nodes <- tree.frame$node.label[child.nodes.index]
      node.queue <- c(child.nodes,node.queue)
    }
    node.queue <- node.queue[ node.queue!=current.node]
    drawn.nodes[current.node.index] <-1
    if(sum(drawn.nodes)==tree.size) {
      all.nodes.drawn <- T
    }
    else
      current.node<-node.queue[1]
  }
}
make.fired.5050<-function(d)
{
d.fired<-d[d$fired==1,]
d.not.fired<-d[!d$fired,]
k.fired<-dim(d.fired)[1]
k.not.fired<-dim(d.not.fired)[1]
rat<-floor(k.not.fired/k.fired)
if (rat>0)
{
d.temp<-d.fired
if (rat>1)
{
for (i in seq(1,rat-1))
{
d.temp<-rbind(d.temp,d.fired)
}
}
}
else
{
print("error with replication: no firings for this dataset\n")
}
rbind(d.temp,d.not.fired)
}
#######################################################################################
#
#
Function finds best split in the node given data for all the variables in
list.vars
# returns
a list that includes the name of the feature best for a split, a measure for
the goodness of the split, and the best splitting value for that feature
# d.node
is the subset of data this node contains.
#
list.vars is the list of column numbers of the features
#
node.number is the node label of the node that's being considered for a split
best.split<-function(d.node,list.vars,node.number)
{
d<-d.node
y<-d$fired
node.n<-dim(d)[1]
goodness.list<-list()
var.key.list<-list()
var.value.list<-list()
for (i in
list.vars) {
var.index.vec<-c()
goodness.vec<-c()
var.value.vec<-c()
if (is.numeric(d[,i])) {
is.discrete<-F
samp.qt<-c()
node.not.na<-sum(!is.na(d[,i]))
if
(node.not.na >= 10){
list.qt<-seq(0.2,0.8,by=0.3)
samp.qt<- quantile(d[,i],probs=list.qt,na.rm=T)
}
else {
samp.qt<-mean(d[,i],na.rm=T)
}
var.value.vec<-samp.qt
for
(j in samp.qt) {
goodness.val<-goodness.of.split(y,d[,i],j,is.discrete)
goodness.vec<-c(goodness.vec, goodness.val)
var.index.vec<-c(var.index.vec,i)
}
} else {
disc.var<-as.factor(d[,i])
is.discrete<-T
disc.levels<-levels(disc.var)
var.value.vec<-disc.levels
for
(thres in disc.levels) {
goodness.val<-goodness.of.split(y,
disc.var, thres, is.discrete)
goodness.vec<-c(goodness.vec, goodness.val)
var.index.vec<-c(var.index.vec,i)
}
}
goodness.list<-list(goodness.list,goodness.vec)
var.key.list<-list(var.key.list,var.index.vec)
var.value.list<-list(var.value.list, var.value.vec)
}
goodness.unlist <- unlist(goodness.list)
var.key.unlist <- unlist(var.key.list)
var.value.unlist
<- unlist(var.value.list)
argmax.goodness<-which.max(goodness.unlist)
var.best.split<-var.key.unlist[argmax.goodness]
if
(is.numeric(d[, var.best.split])) {
var.thres<-as.numeric(var.value.unlist[argmax.goodness])
} else {
var.thres<-var.value.unlist[argmax.goodness]
}
best.goodness.in.list
<- goodness.unlist[argmax.goodness]
if
(best.goodness.in.list>1e-06) {
return(list(best.goodness =
best.goodness.in.list, var.split.index = var.best.split ,
var.split.value = var.thres ))
} else {
return(NULL)
}
}
#######################################################################################
#
#
Initializes tree frame data structure
#
tree.init
<- function(d)
{
node.depth<-c(0)
node.real.depth<-c(0)
node.split.var.name<-c("<leaf>")
node.split.var.value<-c(NA)
labels<-c(0)
node.dec.rule<-NA
node.parent.node<-NA
num.fired<-sum(d$fired==1)
num.nonfired<-sum(d$fired==0)
class.call
<-c(num.nonfired <=
num.fired )
tree.frame<-as.data.frame(list(node.label=labels,
depth=node.depth, real.depth=node.real.depth,
split.var.name=node.split.var.name, split.var.value
=node.split.var.value,
dec.rule=node.dec.rule, parent.node = node.parent.node, call=class.call,
count.fired=num.fired,
count.nonfired=num.nonfired),
stringsAsFactors=F)
# Return
the tree frame.
return(tree.frame)
}
##############################################################################################
#
# Grows a
tree starting from the root node using variables with indices in $var.list$
# returns
a list containing the data frame that stores the description of the tree ,
# and a
vector that stores in which
node each sample in d ends up.
# $d$ is
the dataset
#
$tree.frame$ is the initalized
data frame that stores the
description of the tree
#
$var.list$ is the list of column numbers of the features
#
$roc.thres.ratio$ is the minimum ratio of fireds to nonfireds at which a node
is classified
# as
fired
tree.grow<-
function(d, tree.frame,tree.size, var.list,roc.thres.ratio)
{
n<-dim(d)[1]
where.samp<-rep(0,n)
stop.grow<-F
no.further.splits<-T
while(stop.grow==F)
{
leaf.labels<-tree.frame$node.label[tree.frame$split.var.name=="<leaf>"]
no.further.splits<-T
split.using.variable <-
0
for (i in leaf.labels) {
if
( sum(where.samp==i)==0) next
d.node<-d[where.samp==i,]
split.prop<-best.split(d.node, var.list, i)
if(!is.null(split.prop)){
no.further.splits<-F
split.using.variable <- split.prop$var.split.index
if
(!is.numeric(d[,split.using.variable])){
split.tree<-split.discrete(d,tree.frame, i, split.prop,
where.samp,roc.thres.ratio)
tree.frame<-split.tree$frame
where.samp<-split.tree$where
} else {
split.tree<-split.cont(d,tree.frame, i, split.prop,
where.samp,roc.thres.ratio)
tree.frame<-split.tree$frame
where.samp<-split.tree$where
}
}
}
leaf.tree<-tree.frame[tree.frame$split.var.name=="<leaf>",]
if (
(max(leaf.tree$depth)==2)||(no.further.splits)||(length(var.list)==0) ) {
stop.grow<-T
}
}
# If we
have impurity in the NA/NA node of the tree
# that
node is split again.
# if (0)
{
leaf.tree<-tree.frame[tree.frame$split.var.name=="<leaf>",]
if
(min(leaf.tree$real.depth)==0) {
all.na<-which.min(leaf.tree$real.depth)
i<-leaf.tree$node.label[all.na]
if
( sum(where.samp==i)!=0) {
d.node<-d[where.samp==i,]
split.prop<-best.split(d.node, var.list, i)
if(!is.null(split.prop)){
split.using.variable <- split.prop$var.split.index
if (!is.numeric(d[,split.using.variable])){
split.tree<-split.discrete(d,tree.frame,
i, split.prop, where.samp,roc.thres.ratio)
tree.frame<-split.tree$frame
where.samp<-split.tree$where
} else {
split.tree<-split.cont(d,tree.frame, i, split.prop,
where.samp,roc.thres.ratio)
tree.frame<-split.tree$frame
where.samp<-split.tree$where
}
}
}
}
# }
return(list(frame=tree.frame,
where=where.samp) )
}
#
# Split a
node using a discrete variable and update(and return) tree frame data structure
# returns
the updated tree frame data structure and the updated $where.samp$ vector
# $d$ is
the dataset
#
$tree.frame$ is the current data
frame that stores the description
of the tree
#
$parent.node.label$ is the label of node to be split
#
$split.prop$ is a list that contains the name for the splitting variable and
the value to be used for splitting
#
$where.samp$ a vector that stores in
which node each sample in d ends
up.
#
$roc.thres$ is the minimum ratio of fireds to nonfireds at which a node is
classified as fired
split.discrete
<- function(d,tree.frame,parent.node.label,split.prop,where.samp,roc.thres)
{
y<-d$fired
tree.size <- dim(tree.frame)[1]-1
var.index<-split.prop$var.split.index
split.value<-split.prop$var.split.value
varname<-names(d)[var.index]
parent.node.index<-which(tree.frame$node.label==parent.node.label)
left.node.indic <- as.character(d[,varname])==as.character(split.value)
right.node.indic
<- !left.node.indic
na.node.indic <- is.na( d[,varname])
left.node.indic
[na.node.indic]<-F
right.node.indic[na.node.indic]<-F
left.node.indic <- left.node.indic &(where.samp==parent.node.label)
right.node.indic
<- right.node.indic &(where.samp==parent.node.label)
na.node.indic <- na.node.indic
&(where.samp==parent.node.label)
where.samp[
left.node.indic ] <-
tree.size+1
where.samp[
right.node.indic ] <- tree.size+2
where.samp[
na.node.indic ] <-
tree.size+3
fired.count.l<-sum(y[left.node.indic]==1)
fired.count.r<-sum(y[right.node.indic]==1)
fired.count.na<-sum(y[na.node.indic]==1)
nonfired.count.l<-sum(y[left.node.indic]==0)
nonfired.count.r<-sum(y[right.node.indic]==0)
nonfired.count.na<-sum(y[na.node.indic]==0)
leaf.call.l <-
(fired.count.l/(fired.count.l+nonfired.count.l))>roc.thres
leaf.call.r <-
(fired.count.r/(fired.count.r+nonfired.count.r))>roc.thres
leaf.call.na
<- (fired.count.na/(fired.count.na+nonfired.count.na))>roc.thres
new.node.labels
<- c(tree.size+1,tree.size+2,tree.size+3)
current.depth<-tree.frame$depth[parent.node.index]
current.real.depth<-tree.frame$real.depth[parent.node.index]
new.depth <-
rep(current.depth+1,3)
new.real.depth <- rep(c(current.real.depth+1,current.real.depth),c(2,1))
new.split.varname <-
rep("<leaf>",3)
new.split.var.value
<- rep(NA,3)
new.dec.rule
<- c("==","!=",NA)
new.parent.node <-
rep(parent.node.label,3)
new.call
<- c(leaf.call.l, leaf.call.r, leaf.call.na)
new.fired.count <-
c(fired.count.l,fired.count.r,fired.count.na)
new.nonfired.count <-
c(nonfired.count.l,nonfired.count.r, nonfired.count.na)
new.frame<-as.data.frame(cbind(new.node.labels,new.depth,new.real.depth,
new.split.varname,
new.split.var.value, new.dec.rule,
new.parent.node,new.call,new.fired.count,new.nonfired.count ),
stringsAsFactors=F)
names(new.frame)<-c("node.label"
, "depth", "real.depth", "split.var.name",
"split.var.value", "dec.rule",
"parent.node","call","count.fired",
"count.nonfired")
tree.frame<-rbind(tree.frame,new.frame)
tree.frame$node.label<-as.numeric(tree.frame$node.label)
tree.frame$depth<-as.numeric(tree.frame$depth)
tree.frame$real.depth<-as.numeric(tree.frame$real.depth)
tree.frame$parent.node<-as.numeric(tree.frame$parent.node)
tree.frame$count.fired<-as.numeric(tree.frame$count.fired)
tree.frame$count.nonfired<-as.numeric(tree.frame$count.nonfired)
tree.frame$split.var.name[parent.node.index]<-varname
tree.frame$split.var.value[parent.node.index]<-split.value
return(list(frame=
tree.frame, where=where.samp))
}
#
# Split a
node using a continous variable and update(and return) tree frame data
structure
# returns
the updated tree frame data structure and the updated $where.samp$ vector
# $d$ is
the dataset
#
$tree.frame$ is the current data
frame that stores the description
of the tree
#
$parent.node.label$ is the label of node to be split
#
$split.prop$ is a list that contains the name for the splitting variable and
the value to be used for splitting
#
$where.samp$ a vector that stores
in which node each sample in d
ends up.
#
$roc.thres$ is the minimum ratio of fireds to nonfireds at which a node is
classified as fired
split.cont <- function(d, tree.frame, parent.node.label, split.prop, where.samp, roc.thres)
{
  y <- d$fired
  # Node labels start at 0, so the largest label in use is (number of rows - 1)
  tree.size <- dim(tree.frame)[1] - 1
  var.index   <- split.prop$var.split.index
  split.value <- split.prop$var.split.value
  varname     <- names(d)[var.index]
  parent.node.index <- which(tree.frame$node.label == parent.node.label)
  # Left child: value <= threshold; right child: value > threshold; NA child: value missing
  left.node.indic  <- d[, varname] <= split.value
  right.node.indic <- !left.node.indic
  na.node.indic    <- is.na(d[, varname])
  left.node.indic[na.node.indic]  <- F
  right.node.indic[na.node.indic] <- F
  # Restrict to samples currently sitting at the parent node
  left.node.indic  <- left.node.indic  & (where.samp == parent.node.label)
  right.node.indic <- right.node.indic & (where.samp == parent.node.label)
  na.node.indic    <- na.node.indic    & (where.samp == parent.node.label)
  # Assign the three new child-node labels
  where.samp[left.node.indic]  <- tree.size + 1
  where.samp[right.node.indic] <- tree.size + 2
  where.samp[na.node.indic]    <- tree.size + 3
  # Class counts in each child
  fired.count.l     <- sum(y[left.node.indic]  == 1)
  fired.count.r     <- sum(y[right.node.indic] == 1)
  fired.count.na    <- sum(y[na.node.indic]    == 1)
  nonfired.count.l  <- sum(y[left.node.indic]  == 0)
  nonfired.count.r  <- sum(y[right.node.indic] == 0)
  nonfired.count.na <- sum(y[na.node.indic]    == 0)
  # Call a child "fired" when its fired fraction exceeds the ROC threshold
  leaf.call.l  <- (fired.count.l /(fired.count.l  + nonfired.count.l))  > roc.thres
  leaf.call.r  <- (fired.count.r /(fired.count.r  + nonfired.count.r))  > roc.thres
  leaf.call.na <- (fired.count.na/(fired.count.na + nonfired.count.na)) > roc.thres
  new.node.labels    <- c(tree.size + 1, tree.size + 2, tree.size + 3)
  current.depth      <- tree.frame$depth[parent.node.index]
  current.real.depth <- tree.frame$real.depth[parent.node.index]
  new.depth <- rep(current.depth + 1, 3)
  # The NA child keeps the parent's real depth; only the left and right children deepen
  new.real.depth      <- rep(c(current.real.depth + 1, current.real.depth), c(2, 1))
  new.split.varname   <- rep("<leaf>", 3)
  new.split.var.value <- rep(NA, 3)
  new.dec.rule        <- c("<=", ">", NA)
  new.parent.node     <- rep(parent.node.label, 3)
  new.call            <- c(leaf.call.l, leaf.call.r, leaf.call.na)
  new.fired.count     <- c(fired.count.l, fired.count.r, fired.count.na)
  new.nonfired.count  <- c(nonfired.count.l, nonfired.count.r, nonfired.count.na)
  new.frame <- as.data.frame(cbind(new.node.labels, new.depth, new.real.depth,
                                   new.split.varname, new.split.var.value,
                                   new.dec.rule, new.parent.node, new.call,
                                   new.fired.count, new.nonfired.count),
                             stringsAsFactors = F)
  names(new.frame) <- c("node.label", "depth", "real.depth", "split.var.name",
                        "split.var.value", "dec.rule", "parent.node", "call",
                        "count.fired", "count.nonfired")
  tree.frame <- rbind(tree.frame, new.frame)
  # cbind() coerced everything to character; restore the numeric columns
  tree.frame$node.label     <- as.numeric(tree.frame$node.label)
  tree.frame$depth          <- as.numeric(tree.frame$depth)
  tree.frame$real.depth     <- as.numeric(tree.frame$real.depth)
  tree.frame$parent.node    <- as.numeric(tree.frame$parent.node)
  tree.frame$count.fired    <- as.numeric(tree.frame$count.fired)
  tree.frame$count.nonfired <- as.numeric(tree.frame$count.nonfired)
  # The parent is no longer a leaf: record its split variable and value
  tree.frame$split.var.name[parent.node.index]  <- varname
  tree.frame$split.var.value[parent.node.index] <- split.value
  return(list(frame = tree.frame, where = where.samp))
}
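To make the calling convention concrete, here is a minimal sketch of a single continuous split on toy data; the dataset, the one-row root frame, and the threshold are all hypothetical (the real data and the code that grows the full tree appear elsewhere in this document), and split.discrete() is invoked in exactly the same way:
# Minimal sketch: hypothetical toy data, not the Reynolds datasets
d <- data.frame(fired = c(1, 0, 1, 0, 1),
                sbp   = c(150, 120, NA, 118, 160))    # one missing value
# One-row frame describing just the root node (label 0), matching the
# column layout used by split.cont() above
tree.frame <- data.frame(node.label = 0, depth = 0, real.depth = 0,
                         split.var.name = "<leaf>", split.var.value = NA,
                         dec.rule = NA, parent.node = NA, call = NA,
                         count.fired = 3, count.nonfired = 2,
                         stringsAsFactors = F)
where.samp <- rep(0, nrow(d))              # every sample starts at the root
split.prop <- list(var.split.index = 2,    # column index of "sbp" in d
                   var.split.value = 130)  # hypothetical threshold
res <- split.cont(d, tree.frame, parent.node.label = 0,
                  split.prop, where.samp, roc.thres = 0.5)
res$where    # c(2, 1, 3, 1, 2): nodes 1 (<= 130), 2 (> 130), 3 (NA)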
#
# Computes the goodness (decrease in entropy) of a particular split based on the feature in split.var
# returns a goodness-of-split measure
# $y$ is the vector of class labels
# $split.var$ is the feature vector (the splitting variable)
# $thres$ is the value of the splitting variable used for splitting the node
# $is.discrete$ is a flag that is 1 if the splitting variable is a discrete variable
goodness.of.split <- function(y, split.var, thres, is.discrete)
{
  # A node that is already pure cannot be improved by any split
  if ((sum(y == 0) == 0) || (sum(y == 1) == 0))
    return(0)
  na.node <- is.na(split.var)
  num.sample <- length(y)
  left.node  <- rep(FALSE, num.sample)
  right.node <- rep(FALSE, num.sample)
  if (is.discrete) {
    left.node  <- (as.character(split.var) == as.character(thres))
    left.node[na.node]  <- F
    right.node <- (as.character(split.var) != as.character(thres))
    right.node[na.node] <- F
  } else {
    left.node <- split.var <= thres
    left.node[na.node] <- F
    right.node <- split.var > thres
    right.node[na.node] <- F
  }
  y.l  <- y[left.node]
  y.r  <- y[right.node]
  y.na <- y[na.node]
  # Reject splits that leave the left or right child empty (sentinel value)
  if ((length(y.l) == 0) || (length(y.r) == 0))
    return(-5)
  orig.imp  <- impurity.at.node(sum(y == 0),   sum(y == 1))
  left.imp  <- impurity.at.node(sum(y.l == 0), sum(y.l == 1))
  right.imp <- impurity.at.node(sum(y.r == 0), sum(y.r == 1))
  na.imp <- 0
  if (length(y.na) != 0)
    na.imp <- impurity.at.node(sum(y.na == 0), sum(y.na == 1))
  # Weight each child's entropy by the fraction of samples it receives
  wt.left  <- length(y.l)/num.sample
  wt.right <- length(y.r)/num.sample
  wt.na    <- length(y.na)/num.sample
  imp.diff <- wt.left*left.imp + wt.right*right.imp + wt.na*na.imp - orig.imp
  # Negate so that a larger value means a larger decrease in entropy
  return(-1*imp.diff)
}
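As a quick check of goodness.of.split() (using the entropy function impurity.at.node() defined next), consider ten samples with five fired and five nonfired, so the root has exactly one bit of entropy, and a hypothetical continuous feature that separates the classes well except for one missing value:
y <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
x <- c(2, 3, 2, 3, 9, 8, 9, 8, NA, 9)    # hypothetical feature
goodness.of.split(y, x, thres = 5, is.discrete = 0)
# The left child (x <= 5) is pure, the NA child is pure, and the right child
# holds 1 fired / 4 nonfired, so the weighted child entropy is
# 0.5 * 0.722 = 0.361 and the returned decrease is 1 - 0.361 = 0.639 bits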
#
# Entropy at a node with $y.1$ samples of one class and $y.2$ of the other
#
impurity.at.node <- function(y.1, y.2)
{
  tot.y <- y.1 + y.2
  impurity <- 0
  if (tot.y == 0) {
    print("Entropy Comp. Error")
    impurity <- -1
  } else {
    p.1 <- y.1/tot.y
    p.2 <- y.2/tot.y
    if ((p.1 != 0) & (p.2 != 0)) {
      impurity <- -1*(p.1*log2(p.1) + p.2*log2(p.2))
    } else {
      impurity <- 0   # a pure node has zero entropy
    }
  }
  return(impurity)
}
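For instance, a balanced node attains the maximum entropy of one bit, while nodes that are nearly or completely pure score much lower:
impurity.at.node(5, 5)     # 1 bit: maximally impure
impurity.at.node(9, 1)     # about 0.469 bits
impurity.at.node(10, 0)    # 0: pure node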
# Alternative impurity measure (kept for reference): misclassification rate
#impurity.at.node <- function(y.1, y.2)
#{
#  tot.y <- y.1 + y.2
#  impurity <- 0
#  if (tot.y == 0) {
#    print("Entropy Comp. Error")
#    impurity <- -1
#  } else {
#    impurity <- min(y.1, y.2)/tot.y
#  }
#  return(impurity)
#}
# Alternative impurity measure (kept for reference): Gini-type index
#impurity.at.node <- function(y.1, y.2)
#{
#  tot.y <- y.1 + y.2
#  impurity <- 0
#  if (tot.y == 0) {
#    print("Entropy Comp. Error")
#    impurity <- -1
#  } else {
#    impurity <- y.1*y.2/tot.y^2
#  }
#  return(impurity)
#}
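The two commented-out alternatives above are the misclassification rate, min(y.1, y.2)/(y.1 + y.2), and a Gini-type index, y.1*y.2/(y.1 + y.2)^2, which equals half the usual binary Gini index. All three measures vanish at a pure node and peak at a 50/50 split, but entropy and Gini are strictly concave in the fired fraction, which generally makes them more sensitive to candidate splits; the short sketch below plots the three curves for comparison:
p <- seq(0.01, 0.99, by = 0.01)                  # fired fraction at a node
entropy  <- -(p*log2(p) + (1 - p)*log2(1 - p))   # measure used above
misclass <- pmin(p, 1 - p)                       # first alternative
gini     <- p*(1 - p)                            # second alternative
matplot(p, cbind(entropy, misclass, gini), type = "l",
        xlab = "fired fraction", ylab = "impurity")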
#
# Classify a single test sample by dropping it down the NA tree
# $tree.frame$ is the data frame describing the tree
# $test.sample$ is a single sample (named list or one-row frame) of feature values
# $numeric.or.factor$ is a named flag vector: TRUE where the variable is continuous
# returns the class call and the label of the leaf node where the sample lands
predict.na.tree <- function(tree.frame, test.sample, numeric.or.factor)
{
  at.leaf.node <- F
  at.node <- 0               # start at the root node (label 0)
  test.where <- 0
  child.frame.index <- 0
  while (at.leaf.node == F) {
    node.frame.index    <- which(tree.frame$node.label == at.node)
    split.variable.name <- tree.frame$split.var.name[node.frame.index]
    threshold.value     <- tree.frame$split.var.value[node.frame.index]
    test.value          <- test.sample[split.variable.name]
    if (numeric.or.factor[split.variable.name]) {
      # Continuous variable: route on <= / > / NA
      test.value      <- as.numeric(test.value)
      threshold.value <- as.numeric(threshold.value)
      NA.child.node   <- is.na(test.value)
      if (!NA.child.node) {
        left.node <- test.value <= threshold.value
        if (left.node) {
          child.frame.index <- which((tree.frame$parent.node == at.node) &
                                     (tree.frame$dec.rule == "<="))
        } else {
          child.frame.index <- which((tree.frame$parent.node == at.node) &
                                     (tree.frame$dec.rule == ">"))
        }
      } else {
        child.frame.index <- which((tree.frame$parent.node == at.node) &
                                   (is.na(tree.frame$dec.rule)))
      }
    } else {
      # Discrete variable: route on == / != / NA
      test.value      <- as.character(test.value)
      threshold.value <- as.character(threshold.value)
      NA.child.node   <- is.na(test.value)
      if (!NA.child.node) {
        left.node <- test.value == threshold.value
        if (left.node) {
          child.frame.index <- which((tree.frame$parent.node == at.node) &
                                     (tree.frame$dec.rule == "=="))
        } else {
          child.frame.index <- which((tree.frame$parent.node == at.node) &
                                     (tree.frame$dec.rule == "!="))
        }
      } else {
        child.frame.index <- which((tree.frame$parent.node == at.node) &
                                   (is.na(tree.frame$dec.rule)))
      }
    }
    at.node      <- tree.frame$node.label[child.frame.index]
    at.leaf.node <- (tree.frame$split.var.name[child.frame.index] == "<leaf>")
    if (at.leaf.node) {
      test.where <- at.node
    }
    node.frame.index <- which(tree.frame$node.label == at.node)
  }
  class.call <- tree.frame$call[node.frame.index]
  if (is.na(class.call)) {
    # Diagnostic output: a leaf should always carry a class call
    print(paste("leaf.node.index", node.frame.index))
    print(tree.frame)
    class.call <- F
  }
  return(list(call = class.call, where = test.where))
}
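Continuing the hypothetical split.cont() sketch above, a new subject with a missing value for the split variable is routed to the NA child; numeric.or.factor is assumed here to be a named logical vector flagging the continuous variables:
test.sample <- list(fired = NA, sbp = NA)    # hypothetical subject, sbp missing
numeric.or.factor <- c(fired = TRUE, sbp = TRUE)
out <- predict.na.tree(res$frame, test.sample, numeric.or.factor)
out$where    # 3: the NA child of the root
out$call     # the class call recorded at that leaf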
#
# Print a one-line description of a single node, indented by its depth
#
draw.node <- function(tree.frame, node.drawn)
{
  node.index <- which(tree.frame$node.label == node.drawn)
  # which() returns an empty vector (not 0) when the label is absent
  if (length(node.index) == 0) {
    print("No such node exists")
    return()
  }
  parent.node.label <- tree.frame$parent.node[node.index]
  parent.node.index <- which(tree.frame$node.label == parent.node.label)
  num.fired    <- tree.frame$count.fired[node.index]
  num.nonfired <- tree.frame$count.nonfired[node.index]
  num.total    <- num.fired + num.nonfired
  node.depth   <- tree.frame$depth[node.index]
  # Indent one tab per level of depth
  node.str <- character()
  if (node.depth > 0) {
    for (i in 1:node.depth) {
      node.str <- paste(node.str, "\t")
    }
  }
  node.str <- paste(node.str, "Node ", node.drawn, ": ")
  if (is.na(parent.node.label)) {
    node.str <- paste(node.str, "Root node ",
                      " Fired", num.fired, "Nonfired", num.nonfired,
                      "Total", num.total)
  } else if (!is.na(tree.frame$dec.rule[node.index])) {
    # Left or right child: show the decision rule that leads here
    node.str <- paste(node.str,
                      tree.frame$split.var.name[parent.node.index],
                      tree.frame$dec.rule[node.index],
                      tree.frame$split.var.value[parent.node.index],
                      " Fired", num.fired, "Nonfired", num.nonfired,
                      "Total", num.total)
  } else {
    # NA child: the parent's split variable was missing
    node.str <- paste(node.str,
                      tree.frame$split.var.name[parent.node.index], "is.na",
                      " Fired", num.fired, "Nonfired", num.nonfired,
                      "Total", num.total)
  }
  node.str <- paste(node.str, "\n")
  cat(node.str)
}
#
# Print the whole tree, one node per line, in depth-first order
#
draw.tree <- function(tree.frame)
{
  all.nodes.drawn <- F
  current.node <- 0
  tree.size <- dim(tree.frame)[1]
  drawn.nodes <- rep(0, tree.size)
  node.queue <- c()
  while (all.nodes.drawn == F) {
    draw.node(tree.frame, current.node)
    current.node.index <- which(tree.frame$node.label == current.node)
    child.nodes.index  <- which(tree.frame$parent.node == current.node)
    # Push any children onto the front of the queue so children print before siblings
    if (length(child.nodes.index) > 0) {
      child.nodes <- tree.frame$node.label[child.nodes.index]
      node.queue  <- c(child.nodes, node.queue)
    }
    node.queue <- node.queue[node.queue != current.node]
    drawn.nodes[current.node.index] <- 1
    if (sum(drawn.nodes) == tree.size) {
      all.nodes.drawn <- T
    } else {
      current.node <- node.queue[1]
    }
  }
}
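Finally, draw.tree() prints the frame built in the hypothetical sketch above roughly as follows (one line per node, indented by depth):
draw.tree(res$frame)
#  Node  0 :  Root node   Fired 3 Nonfired 2 Total 5
#  	 Node  1 :  sbp <= 130   Fired 0 Nonfired 2 Total 2
#  	 Node  2 :  sbp > 130   Fired 2 Nonfired 0 Total 2
#  	 Node  3 :  sbp is.na   Fired 1 Nonfired 0 Total 1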