Document Information
Package base, version 2.8.0
4/19/2009
Project 5: Statistical Learning with Multi-Scale Cardiovascular Data
Contact email: [email protected]
1.1.1. Special Adjustment for Root-NA-NA Node
1.5. Depth, Real Depth and Label
2.4. Functions split.cont() and split.discrete()
2.5. Function impurity.at.node()
2.6. Function predict.na.tree()
2.7. Functions draw.tree() and draw.node()
In traditional classification tree modeling, it is assumed that missing (NA) data are absent, or at least relatively rare. For the Reynolds cardiovascular datasets we have been studying, missing data are the rule rather than the exception. For example, imaging data are available only for a small subset of subjects. To address this situation we developed a classification method which we refer to as the NA-Tree Classifier. It makes use of traditional classification tree concepts. However, instead of creating a decision tree with the familiar binary structure, it creates a ternary tree, adding an extra NA branch and corresponding node associated with subjects missing data for the variable defining a given split.
For instance, when we ask subjects survey questions there may be informative responses such as "Yes" or "No", but there is also the possibility of noninformative responses such as "Not Applicable" or "Not Available". For informative responses, we can follow the traditional tree approach and create branches and child nodes for each possible response. For the latter two responses, we would not know what to do with such subjects in the traditional setting. In the NA tree approach, however, we solve this by categorizing such subjects into the NA node.
Conceptually, NA trees use the same ideas as traditional classification trees, but making use of a possible third branch (NA branch) at each node gives us much more flexibility when dealing with incomplete data. The method described here should prove useful in large medical studies when subsets of the subject population are chosen for particular analyses.
As described above, each non-leaf node of the NA tree has at most three branches. Subjects traverse the tree starting at the root node. Each non-leaf node has an associated feature variable and a threshold. A subject is passed to one of the node's children as follows. If the subject's feature value is available and less than or equal to the threshold, it passes to the left child. If it is available and greater than the threshold, it passes to the right child. If the feature value is unavailable for the subject, then it passes to the middle (NA) node.
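The routing rule can be summarized in a few lines of R (a minimal sketch; route() is an illustrative helper of ours, not part of the NA tree code):

route <- function(x, t) {
  # x is a subject's feature value, t the node's threshold
  if (is.na(x))      "NA (middle) child"
  else if (x <= t)   "left child"
  else               "right child"
}
route(NA, 0.8)   # subjects missing the feature go to the NA node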
A typical NA tree
would have the following appearance:
As described in Section 1.2, "Pruned Tree Classifier", of the Painted Tree Classifier document, our sample size is somewhat limited, so in order to avoid over-fitting we restrict trees to have a maximum depth of two, meaning that classification of a subject involves decisions made in at most two nodes. In instances where the sample size is larger, higher-depth trees should certainly be considered.
1.1.1. Special Adjustment for Root-NA-NA Node
Most NA trees resemble the one shown above. The only exception concerns the split Root -> NA -> NA: although that node is at depth 2, its real depth is still 0; that is, it has not been split by any real feature at all. So if there is impurity at that node, we split it again to further classify that part of the data. If node * (indicated in the previous graph) contains both firings and not firings (i.e., impurity), the NA tree will have the following appearance:
Some of the features we use are categorical variables, for example, sex and blood type. To incorporate such features in the NA tree, we create binary variables associated with each possible variable value, and split the tree based on the criterion "feature equals value" vs. "feature does not equal value", and of course NA.
The discrete tree
will look like:
Note: In the
above figure "!=" is used to denote "not equal to".
At each node, we need to find an optimal split, which amounts to picking a best feature and a threshold (or a value to split on, in the categorical case). For this selection we use the entropy criterion. We try all the possible features and a large collection of thresholds (or values), and we calculate the entropy decrease caused by each split. We then select the feature and threshold (value) pair giving rise to the largest entropy decrease. If none of the choices leads to a decrease, or if the decrease in entropy is not sufficiently large, the node is made into a leaf.
We define the entropy at each node as

$H = -p_1 \log_2 p_1 - p_2 \log_2 p_2$

where $p_1$ and $p_2$ are the proportions of firings and not firings among the subjects at the node.
Definitions of entropy and other relevant concepts are introduced in Document 2, "Feature Selection Instruction", Section 1.1, "Information Theory, Entropy, CMI".
And the entropy after splitting is defined as

$H_{split} = w_L H_L + w_{NA} H_{NA} + w_R H_R$

where $H_L$, $H_{NA}$ and $H_R$ are the entropies at the left, NA and right child nodes, and $w_L$, $w_{NA}$ and $w_R$ are the fractions of the parent node's subjects passed to each child. From the definition, we know that $H_{split}$ is a weighted average over the three child nodes. It describes the average entropy these three nodes carry.
Naturally, the entropy difference is

$D = H - H_{split}$

Finally, we pick the feature $i$ and threshold $t$ which maximize $D$ at each node.
We illustrate the
construction with an example. We reach a node (parent node) where there
are 10 firings and 100 not firings. After we split the node with an
arbitrary feature and threshold, we arrive at the following tree:
Then the entropy at the parent node is $H = 0.439497$. Following the same formula, we obtain the entropies at the left node, NA node and right node as 0.7219281, 0.6500224 and 0.2472944, respectively. The entropy after the split works out to $H_{split} = 0.3990997$, so the difference (i.e., entropy decrease) is $D = 0.439497 - 0.3990997 = 0.0403973$.
We repeat the
same procedure for every possible feature and threshold, then pick the pair
with the largest D. That is our best split selection at that node.
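The parent-node entropy and the resulting decrease can be checked directly in R (a small verification of the arithmetic above, not part of the package code):

entropy <- function(n1, n2) {
  p <- c(n1, n2) / (n1 + n2)
  p <- p[p > 0]            # avoid log2(0)
  -sum(p * log2(p))
}
entropy(10, 100)           # 0.439497, the parent node entropy
0.439497 - 0.3990997       # D = 0.0403973, the entropy decrease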
For numeric features, it is inefficient to try every possible real number as a threshold. In addition, allowing for every possible threshold is likely to lead to over-fitting. Instead, we calculate the 20%, 50% and 80% percentiles of the feature in the training data and use these as thresholds.
For discrete features, we simply use every possible value (or level) as a threshold. For each level in that feature, the splitting criterion becomes: 1. feature A = level i; 2. feature A != level i; 3. feature A is NA.
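Both kinds of candidate split values can be generated as follows (a sketch with made-up data; only the quantile() and levels() calls mirror what the NA tree code actually uses):

x.num <- c(61, 72, NA, 85, 90, 77, 68, NA, 81, 74)        # toy numeric feature
quantile(x.num, probs = c(0.2, 0.5, 0.8), na.rm = TRUE)   # the three thresholds tried
x.disc <- factor(c("AA", "AB", NA, "BB", "AA"))           # toy discrete feature
levels(x.disc)    # each level i gives the splits ==, != and NA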
1.5. Depth, Real Depth and Label
As discussed in section 1.1.1, it is important to distinguish between the depth of a node in an NA-tree (tree depth) and the decision depth, i.e. the number of decisions that have actually been made based on available features to get to a given node.
A graphical illustration will make this clearer.
Since there are so many nodes and features, it is very hard to keep track of where each observation is classified. So we assign numerical labels to the nodes. The labels are shown in the graph. These labels are fixed and do not change when we change datasets. Thus, in the future, Node 5 always refers to the node reached by first splitting to the left (<= for a numeric feature, = for a discrete feature) and then to the right (> for a numeric feature, != for a discrete feature). And node 12 will always refer to the NA-NA node.
If node 12 has impurity, we will further split it and label the three new nodes as node 13 (left, <=), node 14 (right, >) and node 15 (middle, NA).
Our CVRG dataset contains approximately 100 features. Given the available sample size, it is not surprising that using all of these features for our NA tree produces poor cross-validated classification rates. So, instead, we pre-select roughly 12 of the 100 features based on cardiovascular experts' judgement, and perform the analysis based on these 12 features.
These features
are "meanHRbpm", "percPVC", "QTVI_log",
"LFdivHFPow", "SDNN_RST_msec", "snp2",
"snp5", "snp6", "Gender.x", "Inducible",
"LVEF", "DEmass".
We will update our pre-selected feature list according to the results.
We note the full
feature lists of all the important numeric features and discrete features here:
Discrete list
contains: "snp1", "snp2", "snp3",
"snp4", "snp5", "snp6", "Gender.x",
"Race", "cardiom.type", "Inducible".
Continuous list
contains: "meanHRbpm", "perckept", "percPVC",
"QTVI_log", "QTVInumer.1000", "QTVIdenom.1000",
"QTV_msecsqu", "HRV_bpmsqu", "QTintmean_sec",
"HRmean_bpm", "QTRRslope", "QTRRintercept",
"QTRR_r2", "meancoh", "VLFPow",
"LFPow", "HFPow", "TotPow_clinical_use",
"TotPow_mathematical", "LFdivHFPow",
"LFdivHFplLF", "RMSSD_msec", "pNN50",
"meanNN_msec", "SDNN_msec", "total_RRvariance_msec",
"SDNN_RST_msec", "sdRLTESQTintmsec",
"avgNssthoseelim", "avgNssthosekept", "LVEDV",
"LVESV", "LVEF", "LVEDMASS", "DEmass".
In our analysis, we have a total of 937 observations. Our data cleaning procedure, i.e. removing observations with NA implant dates and removing observations with negative days of observation, removed 343 observations, leaving us 594 observations. We use as a phenotype "fired within ndays of implant." Since the phenotype is based on observing individuals for at least ndays, in order to eliminate sampling bias we must exclude all subjects who were implanted less than ndays ago. We then build our NA tree for the response variable "fired" using the other features. The phenotype "fired" is defined by d$fired<-d$Days.To.First.App.Firing<=ndays, where ndays is a pre-specified number. This means that if the days to firing is smaller than or equal to ndays, we code a new variable "fired" as 1, and otherwise 0. The value of ndays can be changed depending on the researcher's goal. We have focused on an ndays value of 545, i.e. one and one half years. There are multiple reasons why we use the 545-day criterion. One point to make is that if we choose too small an observation-time criterion, we end up with too small a number of firings. On the other hand, if we require that subjects be observed for too long a time to be included, then we lose sample size.
For example, if we include all the patients who fired before 545 days, then both patients X and Y should be included, since both have days to firing less than 545 days (300 and 500, respectively). However, only patient X is observable to us; it is impossible to observe patient Y, because we only observed patient Y for 400 days and cannot know whether he fires after 400 days. So, being unable to include both patients X and Y, we would introduce bias into our model.
Note: Days of firing are independent of days of observation.
So, to sum up, in order to analyze firings that happen before 545 days, we have to exclude all the observations with days of observation less than 545 days. The observations left will have days to firing either before or after 545 days, but all of them have days of observation of at least 545 days.
As with every analysis, we need to cross-validate this method. The definition of cross validation and the basic method are explained in Document 5, "Painted Tree Classifier", Section 1.3, "Leave One Out Cross Validation (LOOCV)".
Cross validation is a method in machine learning for testing the performance of a statistical analysis. It is mainly used when the goal of the analysis is prediction and one wants to test its accuracy. The routine procedure is first to partition a sample of data into complementary subsets, then perform the analysis on one subset (called the training set or learning set) to obtain the model and parameters, and finally validate the model on the remaining data (called the testing set).
n-fold cross-validation partitions the original dataset into n subsets of approximately equal size. One of them is drawn as the testing set, and the remaining n−1 subsets are used as the training set. The cross-validation process is then repeated n times (the folds), with each subset used exactly once as the testing set. The n results from the folds can then be averaged to produce a single evaluation. The advantage of this method is that all observations are used for both training and validation, and each observation is used in the testing set exactly once.
The leave one out
cross validation mentioned in "Painted Tree Classifier" is the
special case of n-fold cross validation with n being the sample size.
In our analysis, we used 8-fold cross-validation.
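A minimal sketch of how such an 8-fold partition can be set up in R (illustrative only; fold.id and the loop below are our own names, not the exact partitioning code we used):

n.fold <- 8
fold.id <- sample(rep(1:n.fold, length.out = dim(d)[1]))   # random fold labels
for (k in 1:n.fold) {
  d.train <- d[fold.id != k, ]    # training set: the other 7 folds
  d.test  <- d[fold.id == k, ]    # testing set: fold k
  # grow the NA tree on d.train and validate it on d.test ...
}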
Since we changed the structure of the tree model, there is no existing software package that can do such an analysis. We programmed our own NA tree from scratch, which involves many functions. I am going to introduce them one by one, with their basic inputs, outputs and functionality.
2.1. Function tree.init()
This function basically takes a dataset which contains the features and the response variable, and creates a tree frame. Please note that this function does not really generate the NA tree; instead, it initiates the tree model by building the starting point of an NA tree root and the tree frame that tree.grow() will be based on. We can see below that the output of tree.init() is just a table with one entry (the root node).
Output of tree.init():
  node.label depth real.depth split.var.name split.var.value dec.rule
1          0     0          0         <leaf>              NA       NA
  parent.node  call count.fired count.nonfired
1          NA FALSE          26            411
As we can see,
the dataset has 411+26=437 patients (or observations), of which 26 patients
have fired and 411 patients have not fired.
2.2. Function tree.grow()
This function takes a dataset and a tree frame created by tree.init(), and generates the whole NA tree along with the classification of the training data. This is the biggest function, since it calls nearly all of the other functions at some point during the program. It can be regarded as an outline of the whole procedure, while each other function does one single step of the procedure. We need to explain the output of this function carefully, since it is the most important information our model gives. It is crucial to fully understand the output.
For example, we build up a tree and print the output.
> tree.d <- tree.init(d)
> tree.struct.d <- tree.grow(d, tree.d, 1, features, 0.5)
> tree.struct.d
Output:
$frame
   node.label depth real.depth split.var.name split.var.value dec.rule
1           0     0          0           snp2              AA     <NA>
2           1     1          1     LFdivHFPow          0.8012       ==
3           2     1          1      meanHRbpm         83.8006       !=
4           3     1          0           LVEF           34.62     <NA>
5           4     2          2         <leaf>            <NA>       <=
6           5     2          2         <leaf>            <NA>        >
7           6     2          1         <leaf>            <NA>     <NA>
8           7     2          2         <leaf>            <NA>       <=
9           8     2          2         <leaf>            <NA>        >
10          9     2          1         <leaf>            <NA>     <NA>
11         10     2          1         <leaf>            <NA>       <=
12         11     2          1         <leaf>            <NA>        >
13         12     2          0         <leaf>            <NA>     <NA>
   parent.node  call count.fired count.nonfired
1           NA FALSE          26            411
2            0 FALSE           2            109
3            0 FALSE          22            219
4            0 FALSE           2             83
5            1 FALSE           2             13
6            1 FALSE           0             57
7            1 FALSE           0             39
8            2 FALSE          13            153
9            2 FALSE           7             35
10           2 FALSE           2             31
11           3 FALSE           0             21
12           3 FALSE           2              3
13           3 FALSE           0             59
$where
  [1]  7  7  9  9  6  6  6  6  7  6  7 12  9  6 12  7 12  9  9 12  6
 [22]  9  6  6  6 10  6  9  6  9  6  6  6  9  9  9  6 12  9  9  9  7
 [43]  7  6  9  6  9  9  7  9  7  7  9 12  9  7  7  6  7  9  9  7  6
 [64]  9  7  7  7  7  8  9 12 12  7  7  8  6  7  6  9  7  9  7  9  6
 [85]  6  7  6  7  6  6  7  5  7  5  7  7  7  5  8  7  8 10  9  8  6
[106]  8  9  7  7  6  7  6  7  7  7  7  8  7  5  7  4  7  8  4  8  7
[127]  7  7  5  6  7  6  8  8  6  6  8  5  4  7  5  8  9  7  7  7  7
[148]  4  8  5  6  5  7  7  7  8  9  7  7  7  8  7  7  7  4  7  8  7
[169]  4  7  7  5  7  5  5  4  7  7  8  7  7 10  5  8 12  7  7 12  7
[190]  7  7  7  5  7  7  7  7  5  7  7  7  7  7  5 10  7 12 12  7 12
[211]  7  7  7  7  7  7  7  8  5  7  7  8  7  7  7  5  5  5 10  7  7
[232] 12  8  5  7  8  9  6  8 12  7  8  8  8  5  5  5  4  7  7  7  8
[253]  7  7  4  8  7  7  5  7 12  5  7  7  8  7  4  7  8  5  8  7  7
[274]  8  5  8  7  7  7  5  7  7 12  7  5  7  7  5  7  5  5  7  7 12
[295]  5  5  8  7  8  7  7  7  7  7  8  7  5  5  5  7  8 12  5  7  8
[316]  4  5  7  4  7 10  4  5  7  7  6  9  8  7  7  5  8  7  8  7  4
[337]  5  7  7  7  5  7  5  7  7 12  7  5 12  5  7 12 10 12 12 12 12
[358] 12 12 12 12 12 12 11 10 12 12 12 12 12 10 12 10 12 12 11 12 12
[379] 11 12 10 12 12 12 12 12 10 11 11 12 10 10 12 12 10 10 12 10 12
[400] 10 10 12  7  7  6  7  7  5  7 12  5  7  5  5  5  7  5 10  7  7
[421]  7  5  6  7  7  7  5  7  4  7  5  5 12 10 12 12 12
This is the main output. The field $frame gives the structure of the NA tree as a table. Since R cannot draw a tree with arrows and lines, we can only express the tree using a table. Each row in the output table stands for a node and its related information. For example, there is basic information such as the node label, its depth and real depth. It also shows whether or not the node is a leaf. If it is not a leaf, it shows what feature is used to split the node, along with the corresponding threshold. If the node is not the root, it shows how this node was split from its parent node (i.e., dec.rule) and the parent node label. In the end, it shows how the firings and not firings are distributed at each node, to give the reader an idea of how the NA tree is splitting.
The table above can be translated into the graph below, which represents exactly the same NA tree structure.
The second part of the output, $where, shows where each training sample is classified. Since only nodes 4 to 12 are leaves, the 437 patients are classified into these 9 leaves.
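A quick way to see how the patients are distributed over the leaves is to tabulate the $where vector (standard R; the counts match the count.fired plus count.nonfired columns of $frame):

table(tree.struct.d$where)   # patients per leaf node (labels 4 to 12)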
2.3. Function best.split()
This function is used at each node. It takes the dataset at that node and a list of features, and returns the best feature and threshold based on the criterion described in the previous section. It is one of the most important functions in this NA tree model.
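A typical call, as it appears inside tree.grow() (d.node is the data at node i, var.list the feature column numbers):

split.prop <- best.split(d.node, var.list, i)
split.prop$var.split.index   # column number of the best feature
split.prop$var.split.value   # the chosen threshold (or level)
split.prop$best.goodness     # the entropy decrease D of this split
# split.prop is NULL when no split decreases the entropy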
2.4. Functions split.cont() and split.discrete()
These two functions are used after best.split() has decided which feature and threshold to use. They are also used at each node. They take the feature and threshold and split the dataset at that node.
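Inside tree.grow() they are invoked as follows (taken from the growing loop; i is the label of the node being split):

split.tree <- split.cont(d, tree.frame, i, split.prop, where.samp, roc.thres.ratio)
tree.frame <- split.tree$frame   # tree frame with three new child rows
where.samp <- split.tree$where   # updated node assignment for each sample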
In the output
table of tree.grow(), each row is actually generated by either split.cont() or
split.discrete() depending on whether it is a numeric or discrete feature.
2.5. Function impurity.at.node()
This function is used to calculate the entropy at each node. It simply implements the entropy formula given above. It takes two numbers, the counts of firings and not firings, and returns a single number.
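For example, the entropy of the parent node in the worked example above can be computed as follows (the argument order in the code is not firings, then firings; for entropy the order does not matter):

impurity.at.node(100, 10)   # 0.439497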
2.6. Function predict.na.tree()
This function is used to predict the classification of new observations (patients). It takes the existing NA tree model and a vector or matrix of new observations, and returns the classification of these observations. It basically walks through the NA tree from top to bottom for every single observation and records the node where that observation ends up.
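A minimal usage sketch (numeric.or.factor is a named logical vector saying, for each feature, whether it is numeric; building it with sapply() and flattening the row with unlist() are our suggestions, not prescribed by the code):

numeric.or.factor <- sapply(d, is.numeric)   # TRUE where a column is numeric
new.obs <- unlist(d[1, ])                    # one observation as a named vector
pred <- predict.na.tree(tree.struct.d$frame, new.obs, numeric.or.factor)
pred$call    # the class call (fired / not fired) at the leaf reached
pred$where   # the label of the leaf node the observation ends up in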
2.7. Functions draw.tree() and draw.node()
These two functions print the NA tree in a more readable fashion, making it easier to distinguish between different depths and to understand the tree structure. Function draw.tree() prints the whole NA tree, while draw.node() only prints one node. For example, typical output looks like the following (it is the same NA tree developed above):
> draw.tree(tree.struct.d$frame)
 Node 0 : Root node  Fired 26 Nonfired 411 Total 437
   Node 1 : snp2 == AA  Fired 2 Nonfired 109 Total 111
     Node 4 : LFdivHFPow <= 0.8012  Fired 2 Nonfired 13 Total 15
     Node 5 : LFdivHFPow > 0.8012  Fired 0 Nonfired 57 Total 57
     Node 6 : LFdivHFPow is.na  Fired 0 Nonfired 39 Total 39
   Node 2 : snp2 != AA  Fired 22 Nonfired 219 Total 241
     Node 7 : meanHRbpm <= 83.8006  Fired 13 Nonfired 153 Total 166
     Node 8 : meanHRbpm > 83.8006  Fired 7 Nonfired 35 Total 42
     Node 9 : meanHRbpm is.na  Fired 2 Nonfired 31 Total 33
   Node 3 : snp2 is.na  Fired 2 Nonfired 83 Total 85
     Node 10 : LVEF <= 34.62  Fired 0 Nonfired 21 Total 21
     Node 11 : LVEF > 34.62  Fired 2 Nonfired 3 Total 5
     Node 12 : LVEF is.na  Fired 0 Nonfired 59 Total 59
> draw.node(tree.struct.d$frame,2)
   Node 2 : snp2 != AA  Fired 22 Nonfired 219 Total 241
In order to prepare data to be used in the NA tree model, we need the following code to organize the data. The following piece of code cleans the data and generates the key variable ("fired") on which our NA tree model is developed. The data.csv file is just the file we get after merging all the datasets.
d <- read.csv("./data.csv",as.is=TRUE)
# keep only the implanted subjects with positive days of observation
d <- d[!is.na(d$Implant.Date),]
d <- d[d$Days.Of.Observation>0,]
# keep only subjects with at least ndays of observation
ndays<-545
d <- d[d$Days.Of.Observation>=ndays,]
#
# Create a "fired" [logical] variable
d$fired<-d$Days.To.First.App.Firing<=ndays
3. R Code
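The first function below, make.fired.5050(), replicates the fired observations so that the fired and not fired cases in the training data are roughly balanced; the header comment below is our gloss, inferred from the function body rather than taken from the original listing.

#######################################################################################
#
# make.fired.5050: replicates the fired observations so that fired and not fired
# cases are roughly balanced (hence "5050"); used to prepare the training data.
# (Description inferred from the function body.)
#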
make.fired.5050<-function(d)
{
  # Split d into two halves: one with the fired data, the other with the not fired data.
  d.fired<-d[d$fired==1,]
  d.not.fired<-d[!d$fired,]
  # Calculate how many fired observations there are.
  k.fired<-dim(d.fired)[1]
  # Calculate how many not fired observations there are.
  k.not.fired<-dim(d.not.fired)[1]
  # Get the ratio of the number of not fired observations to fired observations
  # (rounded down to the nearest integer).
  rat<-floor(k.not.fired/k.fired)
  if (rat>0)
  {
    # If the ratio is larger than 0, store d.fired in d.temp.
    d.temp<-d.fired
    if (rat>1)
    {
      # If the ratio is larger than 1, attach replicates of d.fired to the end of d.temp
      # rat-1 times. (A ratio larger than 1 means the not fired observations are at least
      # twice as many as the fired ones.)
      for (i in seq(1,rat-1))
      {
        d.temp<-rbind(d.temp,d.fired)
      }
    }
  }
  else
  {
    # If the ratio is 0, give an error message.
    print("error with replication: no firings for this dataset\n")
  }
  # Combine d.temp and d.not.fired together.
  rbind(d.temp,d.not.fired)
}
#######################################################################################
#
# Finds the best split in a node, given the data, over all the variables in list.vars.
# Returns a list that includes the index of the best feature for a split, a measure of
# the goodness of the split, and the best splitting value for that feature.
# d.node is the subset of data this node contains.
# list.vars is the list of column numbers of the features.
# node.number is the node label of the node that is being considered for a split.
best.split<-function(d.node,list.vars,node.number)
{
  # Assign d.node to d.
  d<-d.node
  # Assign the "fired" column of the data to y.
  y<-d$fired
  # Assign node.n the total number of observations.
  node.n<-dim(d)[1]
  # print(paste( "dimension of d.node",dim(d)))
  # Create empty lists goodness.list, var.key.list and var.value.list.
  goodness.list<-list()
  var.key.list<-list()
  var.value.list<-list()
  # For each variable and each threshold (all levels for discrete variables,
  # a limited number of thresholds for continuous ones) compute the goodness
  # and return the maximum.
  # var.key.list and var.index.vec store which variable the split is for.
  # goodness.list and goodness.vec store the value of goodness for that split.
  # var.value.vec holds the splitting value (a threshold for continuous variables,
  # or the level for the equal/not-equal check for discrete variables).
  # For discrete variables only partitions of (single level vs. all others) are checked:
  # l checks for l levels instead of 2^(l-1)-1 checks (all possible partitions of levels).
  # This only matters if there are more than 3 levels.
  # For every variable in list.vars, we run the following to get an optimal split.
  for (i in list.vars) {
    # Create empty vectors for the variable index, goodness of split and threshold.
    var.index.vec<-c()
    goodness.vec<-c()
    var.value.vec<-c()
    # If the variable is numeric
    if (is.numeric(d[,i])) {
      # Set the discrete indicator to FALSE; it is used later in goodness.of.split().
      is.discrete<-F
      # Create an empty vector for the sample quantiles.
      samp.qt<-c()
      # Get the number of non-NA values of the variable.
      node.not.na<-sum(!is.na(d[,i]))
      # Choose the number of possible thresholds; if the number of samples in the
      # node is not large enough, use the mean as the only threshold.
      # If the number of non-NA values is larger than or equal to 10
      if (node.not.na >= 10){
        # Create a list of probabilities (0.2 0.5 0.8)
        list.qt<-seq(0.2,0.8,by=0.3)
        # Get the corresponding quantiles of the variable.
        samp.qt<- quantile(d[,i],probs=list.qt,na.rm=T)
      }
      else {
        # If the number of non-NA values is less than 10,
        # we only test one threshold, which is the mean.
        samp.qt<-mean(d[,i],na.rm=T)
      }
      # Assign samp.qt to var.value.vec.
      var.value.vec<-samp.qt
      # For every quantile in samp.qt, run the loop to build up the vectors of
      # goodness of split values and variable indices.
      for (j in samp.qt) {
        # Get the goodness of split for this quantile.
        goodness.val<-goodness.of.split(y,d[,i],j,is.discrete)
        # Put the goodness of split into a vector.
        goodness.vec<-c(goodness.vec, goodness.val)
        # Put the variable index into a vector.
        var.index.vec<-c(var.index.vec,i)
        # var.value.vec<-c( var.value.vec, as.character(j))
      }
    } else {
      # If the variable d[,i] is discrete, run the following instead.
      # Convert the variable into a factor and store it in "disc.var".
      disc.var<-as.factor(d[,i])
      # Set the discrete indicator to TRUE; it is used later in goodness.of.split().
      is.discrete<-T
      # List all the unique values (levels) of disc.var.
      disc.levels<-levels(disc.var)
      # Put all the unique values into a vector.
      var.value.vec<-disc.levels
      # For every value in disc.levels, run the loop to build up the vectors of
      # goodness of split values and variable indices (as in the numeric case).
      for (thres in disc.levels) {
        goodness.val<-goodness.of.split(y, disc.var, thres, is.discrete)
        goodness.vec<-c(goodness.vec, goodness.val)
        var.index.vec<-c(var.index.vec,i)
      }
    }
    # Append to the list of goodness values;
    # var.key.list holds which variable each goodness value is for.
    goodness.list<-list(goodness.list,goodness.vec)
    var.key.list<-list(var.key.list,var.index.vec)
    var.value.list<-list(var.value.list, var.value.vec)
  }
  # Unlist goodness.list, var.key.list and var.value.list so that they become
  # three vectors instead of three nested lists.
  goodness.unlist <- unlist(goodness.list)
  var.key.unlist <- unlist(var.key.list)
  var.value.unlist <- unlist(var.value.list)
  # Look for the maximum in the vector "goodness.unlist" and find the corresponding
  # entry in the vector "var.key.unlist".
  argmax.goodness<-which.max(goodness.unlist)
  var.best.split<-var.key.unlist[argmax.goodness]
  # If the variable we just found is numeric, store the threshold as a numeric value in
  # "var.thres"; otherwise store it as it is.
  if (is.numeric(d[, var.best.split])) {
    var.thres<-as.numeric(var.value.unlist[argmax.goodness])
  } else {
    var.thres<-var.value.unlist[argmax.goodness]
  }
  # Record the maximum value in the vector "goodness.unlist".
  best.goodness.in.list <- goodness.unlist[argmax.goodness]
  # If there is no split that decreases impurity in the node, return NULL;
  # otherwise return the properties of the best split, which include the maximum
  # goodness of split, the best variable index and the corresponding threshold value.
  if (best.goodness.in.list>1e-06) {
    return(list(best.goodness = best.goodness.in.list, var.split.index = var.best.split ,
                var.split.value = var.thres ))
  } else {
    return(NULL)
  }
}
#######################################################################################
#
# Initializes the tree frame data structure
#
tree.init <- function(d)
{
  # Create the basic variables for the tree frame
  # and initialize them with the value 0, NA or an empty vector.
  # For the split variable name, assign "<leaf>" for now.
  node.depth<-c(0)
  node.real.depth<-c(0)
  node.split.var.name<-c("<leaf>")
  node.split.var.value<-c(NA)
  labels<-c(0)
  node.dec.rule<-NA
  node.parent.node<-NA
  # Record the number of observations with and without firings.
  num.fired<-sum(d$fired==1)
  num.nonfired<-sum(d$fired==0)
  # If the number of not firings is smaller than or equal to the number of firings,
  # assign TRUE to class.call, otherwise FALSE.
  class.call <-c(num.nonfired <= num.fired )
  # Create the tree frame as a data.frame object with fields for node label, depth,
  # real depth, split variable name, split variable value, decision rule, parent node,
  # call, number of firings and number of not firings.
  tree.frame<-as.data.frame(list(node.label=labels,
    depth=node.depth, real.depth=node.real.depth,
    split.var.name=node.split.var.name, split.var.value =node.split.var.value,
    dec.rule=node.dec.rule, parent.node = node.parent.node, call=class.call,
    count.fired=num.fired, count.nonfired=num.nonfired),
    stringsAsFactors=F)
  # Return the tree frame.
  return(tree.frame)
}
##############################################################################################
#
# Grows a tree starting from the root node using the variables with indices in $var.list$.
# Returns a list containing the data frame that stores the description of the tree,
# and a vector that stores in which node each sample in d ends up.
# $d$ is the dataset.
# $tree.frame$ is the initialized data frame that stores the description of the tree.
# $var.list$ is the list of column numbers of the features.
# $roc.thres.ratio$ is the minimum ratio of fireds to nonfireds at which a node is
# classified as fired.
tree.grow<- function(d, tree.frame,tree.size, var.list,roc.thres.ratio)
{
  # Assign n the number of observations in the dataset.
  n<-dim(d)[1]
  # Assign a vector of 0s to where.samp.
  where.samp<-rep(0,n)
  # Assign FALSE to the stop.grow indicator.
  stop.grow<-F
  # Assign TRUE to the no.further.splits indicator.
  no.further.splits<-T
  while(stop.grow==F)
  {
    # Get the labels of all the leaves.
    leaf.labels<-tree.frame$node.label[tree.frame$split.var.name=="<leaf>"]
    # print(leaf.labels)
    # Reset the no.further.splits indicator to TRUE.
    no.further.splits<-T
    # Reset the split variable to 0.
    split.using.variable <- 0
    # For every leaf, run the following procedure to grow the tree.
    for (i in leaf.labels) {
      # If the node has no samples, skip the current loop iteration.
      if ( sum(where.samp==i)==0) next
      # Assign d.node the dataset at node i.
      d.node<-d[where.samp==i,]
      # print(dim(d.node))
      # Call best.split() to get the best split variable and threshold for node i.
      split.prop<-best.split(d.node, var.list, i)
      # print(split.prop)
      # print(paste("i is ",i))
      # If split.prop returns a split instead of NULL, split the node and
      # add more rows to the tree frame table.
      if(!is.null(split.prop)){
        # Assign FALSE to no.further.splits.
        no.further.splits<-F
        # Get the split variable.
        split.using.variable <- split.prop$var.split.index
        # If the split variable is numeric, use split.cont(); otherwise use split.discrete()
        # to add more rows to the tree frame and to update where.samp.
        if (!is.numeric(d[,split.using.variable])){
          # Call function split.discrete().
          split.tree<-split.discrete(d,tree.frame, i, split.prop, where.samp,roc.thres.ratio)
          tree.frame<-split.tree$frame
          where.samp<-split.tree$where
        } else {
          # Call function split.cont().
          split.tree<-split.cont(d,tree.frame, i, split.prop, where.samp,roc.thres.ratio)
          tree.frame<-split.tree$frame
          where.samp<-split.tree$where
        }
      }
      # print(where.samp)
      # draw.tree(tree.frame)
    }
    # Extract the "leaf" part of the tree frame and assign it to leaf.tree.
    leaf.tree<-tree.frame[tree.frame$split.var.name=="<leaf>",]
    # print(tree.frame)
    # print(min(leaf.tree$depth))
    # If the NA tree reaches a depth of 2, or no.further.splits is TRUE, or there
    # are no variables in var.list, then stop growing the tree (i.e. stop.grow<-T).
    if ( (max(leaf.tree$depth)==2)||(no.further.splits)||(length(var.list)==0) ) {
      stop.grow<-T
    }
    # print(paste("No Further Splits",no.further.splits) )
  }
  # If we have impurity in the NA/NA node of the tree, that node is split again.
  # if (0) {
  # Extract the "leaf" part of the tree frame and assign it to leaf.tree.
  leaf.tree<-tree.frame[tree.frame$split.var.name=="<leaf>",]
  # If the minimum real depth is 0, find the position of the leaf whose real.depth is 0.
  if (min(leaf.tree$real.depth)==0) {
    all.na<-which.min(leaf.tree$real.depth)
    # Extract the label of the 0-real-depth leaf.
    i<-leaf.tree$node.label[all.na]
    if ( sum(where.samp==i)!=0) {
      d.node<-d[where.samp==i,]
      # print(dim(d.node))
      split.prop<-best.split(d.node, var.list, i)
      # print(split.prop)
      # print(paste("i is ",i))
      if(!is.null(split.prop)){
        split.using.variable <- split.prop$var.split.index
        if (!is.numeric(d[,split.using.variable])){
          split.tree<-split.discrete(d,tree.frame, i, split.prop, where.samp,roc.thres.ratio)
          tree.frame<-split.tree$frame
          where.samp<-split.tree$where
        }
        else {
          split.tree<-split.cont(d,tree.frame, i, split.prop, where.samp,roc.thres.ratio)
          tree.frame<-split.tree$frame
          where.samp<-split.tree$where
        }
      }
    }
  }
  # }
  return(list(frame=tree.frame, where=where.samp) )
}
#
# Split a node using a discrete variable and update (and return) the tree frame data structure.
# Returns the updated tree frame data structure and the updated $where.samp$ vector.
# $d$ is the dataset.
# $tree.frame$ is the current data frame that stores the description of the tree.
# $parent.node.label$ is the label of the node to be split.
# $split.prop$ is a list that contains the index of the splitting variable and the value
# to be used for splitting.
# $where.samp$ is a vector that stores in which node each sample in d ends up.
# $roc.thres$ is the minimum ratio of fireds to nonfireds at which a node is classified as fired.
split.discrete <- function(d,tree.frame,parent.node.label,split.prop,where.samp,roc.thres)
{
  y<-d$fired
  tree.size <- dim(tree.frame)[1]-1
  var.index<-split.prop$var.split.index
  split.value<-split.prop$var.split.value
  varname<-names(d)[var.index]
  parent.node.index<-which(tree.frame$node.label==parent.node.label)
  #
  # Create three new child nodes
  #
  left.node.indic <- as.character(d[,varname])==as.character(split.value)
  right.node.indic <- !left.node.indic
  na.node.indic <- is.na( d[,varname])
  left.node.indic [na.node.indic]<-F
  right.node.indic[na.node.indic]<-F
  left.node.indic <- left.node.indic &(where.samp==parent.node.label)
  right.node.indic <- right.node.indic &(where.samp==parent.node.label)
  na.node.indic <- na.node.indic &(where.samp==parent.node.label)
  where.samp[ left.node.indic ] <- tree.size+1
  where.samp[ right.node.indic ] <- tree.size+2
  where.samp[ na.node.indic ] <- tree.size+3
  #print(paste("parent.node ", parent.node.label) )
  #print(paste("parent.node.label is numeric ", is.numeric(parent.node.label)) )
  #print(paste("parent.node row ", parent.node.label+1) )
  #print(tree.frame)
  #print(paste("current depth is ",tree.frame$depth))
  #print(paste("current depth is ",tree.frame[parent.node.label+1,2]))
  fired.count.l<-sum(y[left.node.indic]==1)
  fired.count.r<-sum(y[right.node.indic]==1)
  fired.count.na<-sum(y[na.node.indic]==1)
  nonfired.count.l<-sum(y[left.node.indic]==0)
  nonfired.count.r<-sum(y[right.node.indic]==0)
  nonfired.count.na<-sum(y[na.node.indic]==0)
  leaf.call.l <- (fired.count.l/(fired.count.l+nonfired.count.l))>roc.thres
  leaf.call.r <- (fired.count.r/(fired.count.r+nonfired.count.r))>roc.thres
  leaf.call.na <- (fired.count.na/(fired.count.na+nonfired.count.na))>roc.thres
  new.node.labels <- c(tree.size+1,tree.size+2,tree.size+3)
  current.depth<-tree.frame$depth[parent.node.index]
  current.real.depth<-tree.frame$real.depth[parent.node.index]
  new.depth <- rep(current.depth+1,3)
  new.real.depth <- rep(c(current.real.depth+1,current.real.depth),c(2,1))
  new.split.varname <- rep("<leaf>",3)
  new.split.var.value <- rep(NA,3)
  new.dec.rule <- c("==","!=",NA)
  new.parent.node <- rep(parent.node.label,3)
  new.call <- c(leaf.call.l, leaf.call.r, leaf.call.na)
  new.fired.count <- c(fired.count.l,fired.count.r,fired.count.na)
  new.nonfired.count <- c(nonfired.count.l,nonfired.count.r, nonfired.count.na)
  #
  # add the new nodes to the tree.frame data frame
  #
  new.frame<-as.data.frame(cbind(new.node.labels,new.depth,new.real.depth,
    new.split.varname, new.split.var.value, new.dec.rule,
    new.parent.node,new.call,new.fired.count,new.nonfired.count ),
    stringsAsFactors=F)
  names(new.frame)<-c("node.label", "depth", "real.depth", "split.var.name",
    "split.var.value", "dec.rule", "parent.node","call","count.fired","count.nonfired")
  #print(new.frame)
  tree.frame<-rbind(tree.frame,new.frame)
  tree.frame$node.label<-as.numeric(tree.frame$node.label)
  tree.frame$depth<-as.numeric(tree.frame$depth)
  tree.frame$real.depth<-as.numeric(tree.frame$real.depth)
  tree.frame$parent.node<-as.numeric(tree.frame$parent.node)
  tree.frame$count.fired<-as.numeric(tree.frame$count.fired)
  tree.frame$count.nonfired<-as.numeric(tree.frame$count.nonfired)
  #
  # update the parent node
  #
  tree.frame$split.var.name[parent.node.index]<-varname
  tree.frame$split.var.value[parent.node.index]<-split.value
  return(list(frame= tree.frame, where=where.samp))
}
#
# Split a node using a continuous variable and update (and return) the tree frame data structure.
# Returns the updated tree frame data structure and the updated $where.samp$ vector.
# $d$ is the dataset.
# $tree.frame$ is the current data frame that stores the description of the tree.
# $parent.node.label$ is the label of the node to be split.
# $split.prop$ is a list that contains the index of the splitting variable and the value
# to be used for splitting.
# $where.samp$ is a vector that stores in which node each sample in d ends up.
# $roc.thres$ is the minimum ratio of fireds to nonfireds at which a node is classified as fired.
split.cont <- function(d,tree.frame,parent.node.label,split.prop,where.samp,roc.thres)
{
  y<-d$fired
  # print(split.prop)
  tree.size <- dim(tree.frame)[1] - 1
  var.index<-split.prop$var.split.index
  split.value<-split.prop$var.split.value
  varname<-names(d)[var.index]
  parent.node.index<-which(tree.frame$node.label==parent.node.label)
  # print(paste("is.numeric.value ",is.numeric(split.value)))
  #
  # Create three new child nodes
  #
  left.node.indic <- d[,varname]<=split.value
  right.node.indic <- !left.node.indic
  na.node.indic <- is.na( d[,varname])
  left.node.indic [na.node.indic]<-F
  right.node.indic[na.node.indic]<-F
  left.node.indic <- left.node.indic &(where.samp==parent.node.label)
  right.node.indic <- right.node.indic &(where.samp==parent.node.label)
  na.node.indic <- na.node.indic &(where.samp==parent.node.label)
  where.samp[ left.node.indic ] <- tree.size+1
  where.samp[ right.node.indic ] <- tree.size+2
  where.samp[ na.node.indic ] <- tree.size+3
  fired.count.l<-sum(y[left.node.indic]==1)
  fired.count.r<-sum(y[right.node.indic]==1)
  fired.count.na<-sum(y[na.node.indic]==1)
  nonfired.count.l<-sum(y[left.node.indic]==0)
  nonfired.count.r<-sum(y[right.node.indic]==0)
  nonfired.count.na<-sum(y[na.node.indic]==0)
  leaf.call.l <- (fired.count.l/(fired.count.l+nonfired.count.l))>roc.thres
  leaf.call.r <- (fired.count.r/(fired.count.r+nonfired.count.r))>roc.thres
  leaf.call.na <- (fired.count.na/(fired.count.na+nonfired.count.na))>roc.thres
  new.node.labels <- c(tree.size+1,tree.size+2,tree.size+3)
  current.depth<-tree.frame$depth[parent.node.index]
  current.real.depth<-tree.frame$real.depth[parent.node.index]
  new.depth <- rep(current.depth+1,3)
  new.real.depth <- rep(c(current.real.depth+1,current.real.depth),c(2,1))
  new.split.varname <- rep("<leaf>",3)
  new.split.var.value <- rep(NA,3)
  new.dec.rule <- c("<=",">",NA)
  new.parent.node <- rep(parent.node.label,3)
  new.call <- c(leaf.call.l, leaf.call.r, leaf.call.na)
  new.fired.count <- c(fired.count.l,fired.count.r,fired.count.na)
  new.nonfired.count <- c(nonfired.count.l,nonfired.count.r, nonfired.count.na)
  new.frame<-as.data.frame(cbind(new.node.labels,new.depth,new.real.depth,
    new.split.varname,
    new.split.var.value,new.dec.rule,new.parent.node,new.call,new.fired.count,new.nonfired.count),
    stringsAsFactors=F)
  names(new.frame)<-c("node.label", "depth", "real.depth", "split.var.name",
    "split.var.value", "dec.rule", "parent.node","call","count.fired","count.nonfired")
  tree.frame<-rbind(tree.frame,new.frame)
  tree.frame$node.label<-as.numeric(tree.frame$node.label)
  tree.frame$depth<-as.numeric(tree.frame$depth)
  tree.frame$real.depth<-as.numeric(tree.frame$real.depth)
  tree.frame$parent.node<-as.numeric(tree.frame$parent.node)
  tree.frame$count.fired<-as.numeric(tree.frame$count.fired)
  tree.frame$count.nonfired<-as.numeric(tree.frame$count.nonfired)
  #
  # update the parent node
  #
  tree.frame$split.var.name[parent.node.index]<-varname
  tree.frame$split.var.value[parent.node.index]<-split.value
  # print(paste("update parent.node", varname," ",split.value) )
  return(list(frame= tree.frame, where=where.samp))
}
#
# Computes the goodness (decrease in entropy) of a particular split based on the
# feature in split.var.
# Returns a goodness-of-split measure.
# $y$ is the vector of class labels.
# $split.var$ is the feature vector (the splitting variable).
# $thres$ is the value of the splitting variable used for splitting the node.
# $is.discrete$ is a flag that is TRUE if the splitting variable is a discrete variable.
goodness.of.split <- function (y,split.var,thres,is.discrete)
{
  # If a split is not necessary, return 0
  if ((sum(y==0)==0)||(sum(y==1)==0))
    return(0)
  na.node<-is.na(split.var)
  num.sample<-length(y)
  left.node<-rep(FALSE,num.sample)
  right.node<-rep(FALSE,num.sample)
  # print(num.sample)
  #print(paste("threshold",thres))
  if (is.discrete) {
    left.node<-(as.character(split.var)==as.character(thres))
    left.node[na.node]<- F
    right.node<-(as.character(split.var)!=as.character(thres))
    right.node[na.node]<- F
  } else {
    left.node<- split.var<=thres
    left.node[na.node]<- F
    right.node<- split.var > thres
    right.node[na.node]<- F
  }
  y.l<-y[left.node]
  y.r<-y[right.node]
  y.na<-y[na.node]
  # print(paste("class 0, class 1",sum(y==0) ,sum(y==1 )))
  # print(paste("left.node size",length(y.l),sum(y.l==0) ,sum(y.l==1 )))
  # print(paste("right.node size",length(y.r),sum(y.r==0) ,sum(y.r==1 )))
  # print(paste("na.node size",length(y.na),sum(y.na==0) ,sum(y.na==1 )))
  if ((length(y.l)==0)||(length(y.r)==0))
    return(-5)
  orig.imp <- impurity.at.node(sum(y==0) ,sum(y==1 ))
  left.imp <- impurity.at.node(sum(y.l==0) ,sum(y.l==1 ))
  right.imp <- impurity.at.node(sum(y.r==0) ,sum(y.r==1 ))
  na.imp <- 0
  if(length(y.na)!=0)
    na.imp <- impurity.at.node(sum(y.na==0),sum(y.na==1))
  wt.left<-length(y.l)
  wt.right<-length(y.r)
  wt.na<- length(y.na)
  wt.left<- wt.left/ num.sample
  wt.right<- wt.right/ num.sample
  wt.na <- wt.na/ num.sample
  imp.diff<- wt.left*left.imp+wt.right*right.imp+wt.na*na.imp-orig.imp
  # print(paste("imp.diff",imp.diff))
  return(-1*imp.diff)
}
#
# Entropy at a node
#
impurity.at.node <- function (y.1,y.2)
{
  tot.y<-y.1+y.2
  impurity<-0
  if (tot.y==0) {
    print("Entropy Comp. Error")
    impurity<- -1
  } else {
    p.1<-y.1/(tot.y)
    p.2<-y.2/(tot.y)
    impurity<-0
    if ((p.1!=0)&(p.2!=0))
    {
      impurity <- -1*(p.1*log2(p.1)+p.2*log2(p.2))
    } else {
      impurity <- 0
    }
  }
  return(impurity)
}
#impurity.at.node<- function(y.1,y.2)
#{
#  tot.y<-y.1+y.2
#  impurity<-0
#  if (tot.y==0) {
#    print("Entropy Comp. Error")
#    impurity<- -1
#  } else {
#    impurity<- min(y.1,y.2)/tot.y
#  }
#  return(impurity)
#}
#impurity.at.node<- function(y.1,y.2)
#{
#  tot.y<-y.1+y.2
#  impurity<-0
#  if (tot.y==0) {
#    print("Entropy Comp. Error")
#    impurity<- -1
#  } else {
#    impurity<- y.1*y.2/tot.y^2
#  }
#  return(impurity)
#}
predict.na.tree<-function(tree.frame, test.sample, numeric.or.factor)
{
  at.leaf.node<-F
  at.node<- 0
  test.where <- 0
  child.frame.index <- 0
  while(at.leaf.node==F) {
    node.frame.index<- which(tree.frame$node.label==at.node)
    # print(paste("node.frame.index",node.frame.index))
    split.variable.name<-tree.frame$split.var.name[node.frame.index]
    # print(paste("split.variable.name",split.variable.name))
    # print(paste("numeric.or.factor",numeric.or.factor[split.variable.name]))
    threshold.value<-tree.frame$split.var.value[node.frame.index]
    # print(split.variable.name)
    test.value<-test.sample[split.variable.name]
    # print(numeric.or.factor)
    # print(split.variable.name)
    if (numeric.or.factor[split.variable.name]) {
      #
      # the splitting variable is continuous
      #
      test.value <- as.numeric(test.value)
      threshold.value <- as.numeric(threshold.value)
      NA.child.node<-is.na(test.value)
      if (!NA.child.node) {
        left.node<- test.value<=threshold.value
        if (left.node) {
          child.frame.index <- which((tree.frame$parent.node==at.node) & (tree.frame$dec.rule=="<=") )
        } else {
          child.frame.index <- which((tree.frame$parent.node==at.node) & (tree.frame$dec.rule==">") )
        }
      } else {
        child.frame.index <- which((tree.frame$parent.node==at.node) & (is.na(tree.frame$dec.rule) ) )
      }
    }
    else {
      #
      # the splitting variable is discrete
      #
      test.value <- as.character(test.value)
      threshold.value <- as.character(threshold.value)
      NA.child.node<-is.na(test.value)
      if (!NA.child.node) {
        left.node<- test.value==threshold.value
        if (left.node) {
          child.frame.index <- which((tree.frame$parent.node==at.node) & (tree.frame$dec.rule=="==") )
        } else {
          child.frame.index <- which((tree.frame$parent.node==at.node) & (tree.frame$dec.rule=="!=") )
        }
      } else {
        child.frame.index <- which((tree.frame$parent.node==at.node) & (is.na(tree.frame$dec.rule) ) )
      }
    }
    # The child node becomes the current node
    at.node<-tree.frame$node.label[child.frame.index]
    # Check if this current node is a leaf.
    at.leaf.node <- (tree.frame$split.var.name[child.frame.index] =="<leaf>")
    # print(paste("child.frame.index",child.frame.index))
    #print(paste("at.node",at.node))
    #print(paste("at.leaf.node",at.leaf.node))
    if (at.leaf.node) {
      test.where<-at.node
    }
    # set the index to the row whose node.label is the current node
    node.frame.index<- which(tree.frame$node.label==at.node)
  }
  class.call<-tree.frame$call[node.frame.index]
  if (is.na(class.call)) {
    print(paste("leaf.node.index",node.frame.index))
    print(tree.frame)
    class.call<-F
  }
  return(list(call=class.call,where=test.where))
}
draw.node<-function(tree.frame,node.drawn)
{
  node.index<-which(tree.frame$node.label==node.drawn)
  if (node.index==0) {
    print("No such node exists")
    return()
  }
  parent.node.label<-tree.frame$parent.node[node.index]
  parent.node.index<-which(tree.frame$node.label==parent.node.label)
  num.fired<-tree.frame$count.fired[node.index]
  num.nonfired<-tree.frame$count.nonfired[node.index]
  num.total<-num.fired+num.nonfired
  node.depth<-tree.frame$depth[node.index]
  node.str<-character()
  if (node.depth>0) {
    for (i in 1:node.depth) {
      node.str<-paste(node.str,"\t")
    }
  }
  node.str<-paste(node.str,"Node ",node.drawn,": ")
  if (is.na(parent.node.label)) {
    node.str<-paste(node.str, "Root node ",
      " Fired" ,num.fired, "Nonfired",num.nonfired, "Total", num.total)
  } else if (!is.na(tree.frame$dec.rule[node.index])) {
    node.str<-paste(node.str,
      tree.frame$split.var.name[parent.node.index],tree.frame$dec.rule[node.index],
      tree.frame$split.var.value[parent.node.index],
      " Fired" ,num.fired, "Nonfired",num.nonfired, "Total", num.total)
  } else {
    node.str<-paste(node.str, tree.frame$split.var.name[parent.node.index] ,"is.na",
      " Fired" ,num.fired, "Nonfired",num.nonfired, "Total", num.total)
  }
  node.str<-paste(node.str,"\n")
  cat(node.str)
}
draw.tree<-function(tree.frame)
{
  all.nodes.drawn<-F
  current.node<-0
  tree.size<-dim(tree.frame)[1]
  drawn.nodes<-rep(0,tree.size)
  node.queue<-c()
  while(all.nodes.drawn==F)
  {
    draw.node(tree.frame,current.node)
    current.node.index <- which(tree.frame$node.label ==current.node)
    child.nodes.index <- which(tree.frame$parent.node==current.node)
    if (length(child.nodes.index)==0) {
    } else {
      child.nodes <- tree.frame$node.label[child.nodes.index]
      node.queue <- c(child.nodes,node.queue)
    }
    node.queue <- node.queue[ node.queue!=current.node]
    drawn.nodes[current.node.index] <-1
    if(sum(drawn.nodes)==tree.size) {
      all.nodes.drawn <- T
    }
    else
      current.node<-node.queue[1]
  }
}
make.fired.5050<-function(d)
{
d.fired<-d[d$fired==1,]
d.not.fired<-d[!d$fired,]
k.fired<-dim(d.fired)[1]
k.not.fired<-dim(d.not.fired)[1]
rat<-floor(k.not.fired/k.fired)
if (rat>0)
{
d.temp<-d.fired
if (rat>1)
{
for (i in seq(1,rat-1))
{
d.temp<-rbind(d.temp,d.fired)
}
}
}
else
{
print("error with replication: no firings for this dataset\n")
}
rbind(d.temp,d.not.fired)
}
#######################################################################################
#
#
Function finds best split in the node given data for all the variables in
list.vars
# returns
a list that includes the name of the feature best for a split, a measure for
the goodness of the split, and the best splitting value for that feature
# d.node
is the subset of data this node contains.
#
list.vars is the list of column numbers of the features
#
node.number is the node label of the node that's being considered for a split
best.split<-function(d.node,list.vars,node.number)
{
d<-d.node
y<-d$fired
node.n<-dim(d)[1]
goodness.list<-list()
var.key.list<-list()
var.value.list<-list()
for (i in
list.vars) {
var.index.vec<-c()
goodness.vec<-c()
var.value.vec<-c()
if (is.numeric(d[,i])) {
is.discrete<-F
samp.qt<-c()
node.not.na<-sum(!is.na(d[,i]))
if
(node.not.na >= 10){
list.qt<-seq(0.2,0.8,by=0.3)
samp.qt<- quantile(d[,i],probs=list.qt,na.rm=T)
}
else {
samp.qt<-mean(d[,i],na.rm=T)
}
var.value.vec<-samp.qt
for
(j in samp.qt) {
goodness.val<-goodness.of.split(y,d[,i],j,is.discrete)
goodness.vec<-c(goodness.vec, goodness.val)
var.index.vec<-c(var.index.vec,i)
}
} else {
disc.var<-as.factor(d[,i])
is.discrete<-T
disc.levels<-levels(disc.var)
var.value.vec<-disc.levels
for
(thres in disc.levels) {
goodness.val<-goodness.of.split(y,
disc.var, thres, is.discrete)
goodness.vec<-c(goodness.vec, goodness.val)
var.index.vec<-c(var.index.vec,i)
}
}
goodness.list<-list(goodness.list,goodness.vec)
var.key.list<-list(var.key.list,var.index.vec)
var.value.list<-list(var.value.list, var.value.vec)
}
goodness.unlist <- unlist(goodness.list)
var.key.unlist <- unlist(var.key.list)
var.value.unlist
<- unlist(var.value.list)
argmax.goodness<-which.max(goodness.unlist)
var.best.split<-var.key.unlist[argmax.goodness]
if
(is.numeric(d[, var.best.split])) {
var.thres<-as.numeric(var.value.unlist[argmax.goodness])
} else {
var.thres<-var.value.unlist[argmax.goodness]
}
best.goodness.in.list
<- goodness.unlist[argmax.goodness]
if
(best.goodness.in.list>1e-06) {
return(list(best.goodness =
best.goodness.in.list, var.split.index = var.best.split ,
var.split.value = var.thres ))
} else {
return(NULL)
}
}
#######################################################################################
#
#
Initializes tree frame data structure
#
tree.init
<- function(d)
{
node.depth<-c(0)
node.real.depth<-c(0)
node.split.var.name<-c("<leaf>")
node.split.var.value<-c(NA)
labels<-c(0)
node.dec.rule<-NA
node.parent.node<-NA
num.fired<-sum(d$fired==1)
num.nonfired<-sum(d$fired==0)
class.call
<-c(num.nonfired <=
num.fired )
tree.frame<-as.data.frame(list(node.label=labels,
depth=node.depth, real.depth=node.real.depth,
split.var.name=node.split.var.name, split.var.value
=node.split.var.value,
dec.rule=node.dec.rule, parent.node = node.parent.node, call=class.call,
count.fired=num.fired,
count.nonfired=num.nonfired),
stringsAsFactors=F)
# Return
the tree frame.
return(tree.frame)
}
##############################################################################################
#
# Grows a
tree starting from the root node using variables with indices in $var.list$
# returns
a list containing the data frame that stores the description of the tree ,
# and a
vector that stores in which
node each sample in d ends up.
# $d$ is
the dataset
#
$tree.frame$ is the initalized
data frame that stores the
description of the tree
#
$var.list$ is the list of column numbers of the features
#
$roc.thres.ratio$ is the minimum ratio of fireds to nonfireds at which a node
is classified
# as
fired
tree.grow<-
function(d, tree.frame,tree.size, var.list,roc.thres.ratio)
{
n<-dim(d)[1]
where.samp<-rep(0,n)
stop.grow<-F
no.further.splits<-T
while(stop.grow==F)
{
leaf.labels<-tree.frame$node.label[tree.frame$split.var.name=="<leaf>"]
no.further.splits<-T
split.using.variable <-
0
for (i in leaf.labels) {
if
( sum(where.samp==i)==0) next
d.node<-d[where.samp==i,]
split.prop<-best.split(d.node, var.list, i)
if(!is.null(split.prop)){
no.further.splits<-F
split.using.variable <- split.prop$var.split.index
if
(!is.numeric(d[,split.using.variable])){
split.tree<-split.discrete(d,tree.frame, i, split.prop,
where.samp,roc.thres.ratio)
tree.frame<-split.tree$frame
where.samp<-split.tree$where
} else {
split.tree<-split.cont(d,tree.frame, i, split.prop,
where.samp,roc.thres.ratio)
tree.frame<-split.tree$frame
where.samp<-split.tree$where
}
}
}
leaf.tree<-tree.frame[tree.frame$split.var.name=="<leaf>",]
if (
(max(leaf.tree$depth)==2)||(no.further.splits)||(length(var.list)==0) ) {
stop.grow<-T
}
}
# If we
have impurity in the NA/NA node of the tree
# that
node is split again.
# if (0)
{
leaf.tree<-tree.frame[tree.frame$split.var.name=="<leaf>",]
if
(min(leaf.tree$real.depth)==0) {
all.na<-which.min(leaf.tree$real.depth)
i<-leaf.tree$node.label[all.na]
if
( sum(where.samp==i)!=0) {
d.node<-d[where.samp==i,]
split.prop<-best.split(d.node, var.list, i)
if(!is.null(split.prop)){
split.using.variable <- split.prop$var.split.index
if (!is.numeric(d[,split.using.variable])){
split.tree<-split.discrete(d,tree.frame,
i, split.prop, where.samp,roc.thres.ratio)
tree.frame<-split.tree$frame
where.samp<-split.tree$where
} else {
split.tree<-split.cont(d,tree.frame, i, split.prop,
where.samp,roc.thres.ratio)
tree.frame<-split.tree$frame
where.samp<-split.tree$where
}
}
}
}
# }
return(list(frame=tree.frame,
where=where.samp) )
}
#
# Split a
node using a discrete variable and update(and return) tree frame data structure
# returns
the updated tree frame data structure and the updated $where.samp$ vector
# $d$ is
the dataset
#
$tree.frame$ is the current data
frame that stores the description
of the tree
#
$parent.node.label$ is the label of node to be split
#
$split.prop$ is a list that contains the name for the splitting variable and
the value to be used for splitting
#
$where.samp$ a vector that stores in
which node each sample in d ends
up.
#
$roc.thres$ is the minimum ratio of fireds to nonfireds at which a node is
classified as fired
split.discrete
<- function(d,tree.frame,parent.node.label,split.prop,where.samp,roc.thres)
{
y<-d$fired
tree.size <- dim(tree.frame)[1]-1
var.index<-split.prop$var.split.index
split.value<-split.prop$var.split.value
varname<-names(d)[var.index]
parent.node.index<-which(tree.frame$node.label==parent.node.label)
left.node.indic <- as.character(d[,varname])==as.character(split.value)
right.node.indic
<- !left.node.indic
na.node.indic <- is.na( d[,varname])
left.node.indic
[na.node.indic]<-F
right.node.indic[na.node.indic]<-F
left.node.indic <- left.node.indic &(where.samp==parent.node.label)
right.node.indic
<- right.node.indic &(where.samp==parent.node.label)
na.node.indic <- na.node.indic
&(where.samp==parent.node.label)
where.samp[
left.node.indic ] <-
tree.size+1
where.samp[
right.node.indic ] <- tree.size+2
where.samp[
na.node.indic ] <-
tree.size+3
fired.count.l<-sum(y[left.node.indic]==1)
fired.count.r<-sum(y[right.node.indic]==1)
fired.count.na<-sum(y[na.node.indic]==1)
nonfired.count.l<-sum(y[left.node.indic]==0)
nonfired.count.r<-sum(y[right.node.indic]==0)
nonfired.count.na<-sum(y[na.node.indic]==0)
leaf.call.l <-
(fired.count.l/(fired.count.l+nonfired.count.l))>roc.thres
leaf.call.r <-
(fired.count.r/(fired.count.r+nonfired.count.r))>roc.thres
leaf.call.na
<- (fired.count.na/(fired.count.na+nonfired.count.na))>roc.thres
new.node.labels
<- c(tree.size+1,tree.size+2,tree.size+3)
current.depth<-tree.frame$depth[parent.node.index]
current.real.depth<-tree.frame$real.depth[parent.node.index]
new.depth <-
rep(current.depth+1,3)
new.real.depth <- rep(c(current.real.depth+1,current.real.depth),c(2,1))
new.split.varname <-
rep("<leaf>",3)
new.split.var.value
<- rep(NA,3)
new.dec.rule
<- c("==","!=",NA)
new.parent.node <-
rep(parent.node.label,3)
new.call
<- c(leaf.call.l, leaf.call.r, leaf.call.na)
new.fired.count <-
c(fired.count.l,fired.count.r,fired.count.na)
new.nonfired.count <-
c(nonfired.count.l,nonfired.count.r, nonfired.count.na)
new.frame<-as.data.frame(cbind(new.node.labels,new.depth,new.real.depth,
new.split.varname,
new.split.var.value, new.dec.rule,
new.parent.node,new.call,new.fired.count,new.nonfired.count ),
stringsAsFactors=F)
names(new.frame)<-c("node.label"
, "depth", "real.depth", "split.var.name",
"split.var.value", "dec.rule",
"parent.node","call","count.fired",
"count.nonfired")
tree.frame<-rbind(tree.frame,new.frame)
tree.frame$node.label<-as.numeric(tree.frame$node.label)
tree.frame$depth<-as.numeric(tree.frame$depth)
tree.frame$real.depth<-as.numeric(tree.frame$real.depth)
tree.frame$parent.node<-as.numeric(tree.frame$parent.node)
tree.frame$count.fired<-as.numeric(tree.frame$count.fired)
tree.frame$count.nonfired<-as.numeric(tree.frame$count.nonfired)
tree.frame$split.var.name[parent.node.index]<-varname
tree.frame$split.var.value[parent.node.index]<-split.value
return(list(frame=
tree.frame, where=where.samp))
}
#
# Split a
node using a continous variable and update(and return) tree frame data
structure
# returns
the updated tree frame data structure and the updated $where.samp$ vector
# $d$ is
the dataset
#
$tree.frame$ is the current data
frame that stores the description
of the tree
#
$parent.node.label$ is the label of node to be split
#
$split.prop$ is a list that contains the name for the splitting variable and
the value to be used for splitting
#
$where.samp$ a vector that stores
in which node each sample in d
ends up.
#
$roc.thres$ is the minimum ratio of fireds to nonfireds at which a node is
classified as fired
split.cont <- function(d, tree.frame, parent.node.label, split.prop, where.samp, roc.thres)
{
  y <- d$fired
  # Node labels start at 0, so the largest label in use is (number of rows - 1)
  tree.size <- dim(tree.frame)[1] - 1
  var.index   <- split.prop$var.split.index
  split.value <- split.prop$var.split.value
  varname     <- names(d)[var.index]
  parent.node.index <- which(tree.frame$node.label == parent.node.label)
  # Left child: value <= threshold; right child: value > threshold; NA child: value missing
  left.node.indic  <- d[, varname] <= split.value
  right.node.indic <- !left.node.indic
  na.node.indic    <- is.na(d[, varname])
  left.node.indic[na.node.indic]  <- F
  right.node.indic[na.node.indic] <- F
  # Restrict to samples currently sitting at the parent node
  left.node.indic  <- left.node.indic  & (where.samp == parent.node.label)
  right.node.indic <- right.node.indic & (where.samp == parent.node.label)
  na.node.indic    <- na.node.indic    & (where.samp == parent.node.label)
  # Assign the three new child-node labels
  where.samp[left.node.indic]  <- tree.size + 1
  where.samp[right.node.indic] <- tree.size + 2
  where.samp[na.node.indic]    <- tree.size + 3
  # Class counts in each child
  fired.count.l     <- sum(y[left.node.indic]  == 1)
  fired.count.r     <- sum(y[right.node.indic] == 1)
  fired.count.na    <- sum(y[na.node.indic]    == 1)
  nonfired.count.l  <- sum(y[left.node.indic]  == 0)
  nonfired.count.r  <- sum(y[right.node.indic] == 0)
  nonfired.count.na <- sum(y[na.node.indic]    == 0)
  # Call a child "fired" when its fired fraction exceeds the ROC threshold
  leaf.call.l  <- (fired.count.l /(fired.count.l  + nonfired.count.l))  > roc.thres
  leaf.call.r  <- (fired.count.r /(fired.count.r  + nonfired.count.r))  > roc.thres
  leaf.call.na <- (fired.count.na/(fired.count.na + nonfired.count.na)) > roc.thres
  new.node.labels    <- c(tree.size + 1, tree.size + 2, tree.size + 3)
  current.depth      <- tree.frame$depth[parent.node.index]
  current.real.depth <- tree.frame$real.depth[parent.node.index]
  new.depth <- rep(current.depth + 1, 3)
  # The NA child keeps the parent's real depth; only the left and right children deepen
  new.real.depth      <- rep(c(current.real.depth + 1, current.real.depth), c(2, 1))
  new.split.varname   <- rep("<leaf>", 3)
  new.split.var.value <- rep(NA, 3)
  new.dec.rule        <- c("<=", ">", NA)
  new.parent.node     <- rep(parent.node.label, 3)
  new.call            <- c(leaf.call.l, leaf.call.r, leaf.call.na)
  new.fired.count     <- c(fired.count.l, fired.count.r, fired.count.na)
  new.nonfired.count  <- c(nonfired.count.l, nonfired.count.r, nonfired.count.na)
  new.frame <- as.data.frame(cbind(new.node.labels, new.depth, new.real.depth,
                                   new.split.varname, new.split.var.value,
                                   new.dec.rule, new.parent.node, new.call,
                                   new.fired.count, new.nonfired.count),
                             stringsAsFactors = F)
  names(new.frame) <- c("node.label", "depth", "real.depth", "split.var.name",
                        "split.var.value", "dec.rule", "parent.node", "call",
                        "count.fired", "count.nonfired")
  tree.frame <- rbind(tree.frame, new.frame)
  # cbind() coerced everything to character; restore the numeric columns
  tree.frame$node.label     <- as.numeric(tree.frame$node.label)
  tree.frame$depth          <- as.numeric(tree.frame$depth)
  tree.frame$real.depth     <- as.numeric(tree.frame$real.depth)
  tree.frame$parent.node    <- as.numeric(tree.frame$parent.node)
  tree.frame$count.fired    <- as.numeric(tree.frame$count.fired)
  tree.frame$count.nonfired <- as.numeric(tree.frame$count.nonfired)
  # The parent is no longer a leaf: record its split variable and value
  tree.frame$split.var.name[parent.node.index]  <- varname
  tree.frame$split.var.value[parent.node.index] <- split.value
  return(list(frame = tree.frame, where = where.samp))
}
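To make the calling convention concrete, here is a minimal sketch of a single continuous split on toy data; the dataset, the one-row root frame, and the threshold are all hypothetical (the real data and the code that grows the full tree appear elsewhere in this document), and split.discrete() is invoked in exactly the same way:
# Minimal sketch: hypothetical toy data, not the Reynolds datasets
d <- data.frame(fired = c(1, 0, 1, 0, 1),
                sbp   = c(150, 120, NA, 118, 160))    # one missing value
# One-row frame describing just the root node (label 0), matching the
# column layout used by split.cont() above
tree.frame <- data.frame(node.label = 0, depth = 0, real.depth = 0,
                         split.var.name = "<leaf>", split.var.value = NA,
                         dec.rule = NA, parent.node = NA, call = NA,
                         count.fired = 3, count.nonfired = 2,
                         stringsAsFactors = F)
where.samp <- rep(0, nrow(d))              # every sample starts at the root
split.prop <- list(var.split.index = 2,    # column index of "sbp" in d
                   var.split.value = 130)  # hypothetical threshold
res <- split.cont(d, tree.frame, parent.node.label = 0,
                  split.prop, where.samp, roc.thres = 0.5)
res$where    # c(2, 1, 3, 1, 2): nodes 1 (<= 130), 2 (> 130), 3 (NA)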
#
# Computes the goodness (decrease in entropy) of a particular split based on the feature in split.var
# returns a goodness-of-split measure
# $y$ is the vector of class labels
# $split.var$ is the feature vector (the splitting variable)
# $thres$ is the value of the splitting variable used for splitting the node
# $is.discrete$ is a flag that is 1 if the splitting variable is a discrete variable
goodness.of.split <- function(y, split.var, thres, is.discrete)
{
  # A node that is already pure cannot be improved by any split
  if ((sum(y == 0) == 0) || (sum(y == 1) == 0))
    return(0)
  na.node <- is.na(split.var)
  num.sample <- length(y)
  left.node  <- rep(FALSE, num.sample)
  right.node <- rep(FALSE, num.sample)
  if (is.discrete) {
    left.node  <- (as.character(split.var) == as.character(thres))
    left.node[na.node]  <- F
    right.node <- (as.character(split.var) != as.character(thres))
    right.node[na.node] <- F
  } else {
    left.node <- split.var <= thres
    left.node[na.node] <- F
    right.node <- split.var > thres
    right.node[na.node] <- F
  }
  y.l  <- y[left.node]
  y.r  <- y[right.node]
  y.na <- y[na.node]
  # Reject splits that leave the left or right child empty (sentinel value)
  if ((length(y.l) == 0) || (length(y.r) == 0))
    return(-5)
  orig.imp  <- impurity.at.node(sum(y == 0),   sum(y == 1))
  left.imp  <- impurity.at.node(sum(y.l == 0), sum(y.l == 1))
  right.imp <- impurity.at.node(sum(y.r == 0), sum(y.r == 1))
  na.imp <- 0
  if (length(y.na) != 0)
    na.imp <- impurity.at.node(sum(y.na == 0), sum(y.na == 1))
  # Weight each child's entropy by the fraction of samples it receives
  wt.left  <- length(y.l)/num.sample
  wt.right <- length(y.r)/num.sample
  wt.na    <- length(y.na)/num.sample
  imp.diff <- wt.left*left.imp + wt.right*right.imp + wt.na*na.imp - orig.imp
  # Negate so that a larger value means a larger decrease in entropy
  return(-1*imp.diff)
}
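As a quick check of goodness.of.split() (using the entropy function impurity.at.node() defined next), consider ten samples with five fired and five nonfired, so the root has exactly one bit of entropy, and a hypothetical continuous feature that separates the classes well except for one missing value:
y <- c(1, 1, 1, 1, 1, 0, 0, 0, 0, 0)
x <- c(2, 3, 2, 3, 9, 8, 9, 8, NA, 9)    # hypothetical feature
goodness.of.split(y, x, thres = 5, is.discrete = 0)
# The left child (x <= 5) is pure, the NA child is pure, and the right child
# holds 1 fired / 4 nonfired, so the weighted child entropy is
# 0.5 * 0.722 = 0.361 and the returned decrease is 1 - 0.361 = 0.639 bits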
#
# Entropy at a node with $y.1$ samples of one class and $y.2$ of the other
#
impurity.at.node <- function(y.1, y.2)
{
  tot.y <- y.1 + y.2
  impurity <- 0
  if (tot.y == 0) {
    print("Entropy Comp. Error")
    impurity <- -1
  } else {
    p.1 <- y.1/tot.y
    p.2 <- y.2/tot.y
    if ((p.1 != 0) & (p.2 != 0)) {
      impurity <- -1*(p.1*log2(p.1) + p.2*log2(p.2))
    } else {
      impurity <- 0   # a pure node has zero entropy
    }
  }
  return(impurity)
}
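For instance, a balanced node attains the maximum entropy of one bit, while nodes that are nearly or completely pure score much lower:
impurity.at.node(5, 5)     # 1 bit: maximally impure
impurity.at.node(9, 1)     # about 0.469 bits
impurity.at.node(10, 0)    # 0: pure node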
# Alternative impurity measure (kept for reference): misclassification rate
#impurity.at.node <- function(y.1, y.2)
#{
#  tot.y <- y.1 + y.2
#  impurity <- 0
#  if (tot.y == 0) {
#    print("Entropy Comp. Error")
#    impurity <- -1
#  } else {
#    impurity <- min(y.1, y.2)/tot.y
#  }
#  return(impurity)
#}
# Alternative impurity measure (kept for reference): Gini-type index
#impurity.at.node <- function(y.1, y.2)
#{
#  tot.y <- y.1 + y.2
#  impurity <- 0
#  if (tot.y == 0) {
#    print("Entropy Comp. Error")
#    impurity <- -1
#  } else {
#    impurity <- y.1*y.2/tot.y^2
#  }
#  return(impurity)
#}
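The two commented-out alternatives above are the misclassification rate, min(y.1, y.2)/(y.1 + y.2), and a Gini-type index, y.1*y.2/(y.1 + y.2)^2, which equals half the usual binary Gini index. All three measures vanish at a pure node and peak at a 50/50 split, but entropy and Gini are strictly concave in the fired fraction, which generally makes them more sensitive to candidate splits; the short sketch below plots the three curves for comparison:
p <- seq(0.01, 0.99, by = 0.01)                  # fired fraction at a node
entropy  <- -(p*log2(p) + (1 - p)*log2(1 - p))   # measure used above
misclass <- pmin(p, 1 - p)                       # first alternative
gini     <- p*(1 - p)                            # second alternative
matplot(p, cbind(entropy, misclass, gini), type = "l",
        xlab = "fired fraction", ylab = "impurity")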
#
# Classify a single test sample by dropping it down the NA tree
# $tree.frame$ is the data frame describing the tree
# $test.sample$ is a single sample (named list or one-row frame) of feature values
# $numeric.or.factor$ is a named flag vector: TRUE where the variable is continuous
# returns the class call and the label of the leaf node where the sample lands
predict.na.tree <- function(tree.frame, test.sample, numeric.or.factor)
{
  at.leaf.node <- F
  at.node <- 0               # start at the root node (label 0)
  test.where <- 0
  child.frame.index <- 0
  while (at.leaf.node == F) {
    node.frame.index    <- which(tree.frame$node.label == at.node)
    split.variable.name <- tree.frame$split.var.name[node.frame.index]
    threshold.value     <- tree.frame$split.var.value[node.frame.index]
    test.value          <- test.sample[split.variable.name]
    if (numeric.or.factor[split.variable.name]) {
      # Continuous variable: route on <= / > / NA
      test.value      <- as.numeric(test.value)
      threshold.value <- as.numeric(threshold.value)
      NA.child.node   <- is.na(test.value)
      if (!NA.child.node) {
        left.node <- test.value <= threshold.value
        if (left.node) {
          child.frame.index <- which((tree.frame$parent.node == at.node) &
                                     (tree.frame$dec.rule == "<="))
        } else {
          child.frame.index <- which((tree.frame$parent.node == at.node) &
                                     (tree.frame$dec.rule == ">"))
        }
      } else {
        child.frame.index <- which((tree.frame$parent.node == at.node) &
                                   (is.na(tree.frame$dec.rule)))
      }
    } else {
      # Discrete variable: route on == / != / NA
      test.value      <- as.character(test.value)
      threshold.value <- as.character(threshold.value)
      NA.child.node   <- is.na(test.value)
      if (!NA.child.node) {
        left.node <- test.value == threshold.value
        if (left.node) {
          child.frame.index <- which((tree.frame$parent.node == at.node) &
                                     (tree.frame$dec.rule == "=="))
        } else {
          child.frame.index <- which((tree.frame$parent.node == at.node) &
                                     (tree.frame$dec.rule == "!="))
        }
      } else {
        child.frame.index <- which((tree.frame$parent.node == at.node) &
                                   (is.na(tree.frame$dec.rule)))
      }
    }
    at.node      <- tree.frame$node.label[child.frame.index]
    at.leaf.node <- (tree.frame$split.var.name[child.frame.index] == "<leaf>")
    if (at.leaf.node) {
      test.where <- at.node
    }
    node.frame.index <- which(tree.frame$node.label == at.node)
  }
  class.call <- tree.frame$call[node.frame.index]
  if (is.na(class.call)) {
    # Diagnostic output: a leaf should always carry a class call
    print(paste("leaf.node.index", node.frame.index))
    print(tree.frame)
    class.call <- F
  }
  return(list(call = class.call, where = test.where))
}
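Continuing the hypothetical split.cont() sketch above, a new subject with a missing value for the split variable is routed to the NA child; numeric.or.factor is assumed here to be a named logical vector flagging the continuous variables:
test.sample <- list(fired = NA, sbp = NA)    # hypothetical subject, sbp missing
numeric.or.factor <- c(fired = TRUE, sbp = TRUE)
out <- predict.na.tree(res$frame, test.sample, numeric.or.factor)
out$where    # 3: the NA child of the root
out$call     # the class call recorded at that leaf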
#
# Print a one-line description of a single node, indented by its depth
#
draw.node <- function(tree.frame, node.drawn)
{
  node.index <- which(tree.frame$node.label == node.drawn)
  # which() returns an empty vector (not 0) when the label is absent
  if (length(node.index) == 0) {
    print("No such node exists")
    return()
  }
  parent.node.label <- tree.frame$parent.node[node.index]
  parent.node.index <- which(tree.frame$node.label == parent.node.label)
  num.fired    <- tree.frame$count.fired[node.index]
  num.nonfired <- tree.frame$count.nonfired[node.index]
  num.total    <- num.fired + num.nonfired
  node.depth   <- tree.frame$depth[node.index]
  # Indent one tab per level of depth
  node.str <- character()
  if (node.depth > 0) {
    for (i in 1:node.depth) {
      node.str <- paste(node.str, "\t")
    }
  }
  node.str <- paste(node.str, "Node ", node.drawn, ": ")
  if (is.na(parent.node.label)) {
    node.str <- paste(node.str, "Root node ",
                      " Fired", num.fired, "Nonfired", num.nonfired,
                      "Total", num.total)
  } else if (!is.na(tree.frame$dec.rule[node.index])) {
    # Left or right child: show the decision rule that leads here
    node.str <- paste(node.str,
                      tree.frame$split.var.name[parent.node.index],
                      tree.frame$dec.rule[node.index],
                      tree.frame$split.var.value[parent.node.index],
                      " Fired", num.fired, "Nonfired", num.nonfired,
                      "Total", num.total)
  } else {
    # NA child: the parent's split variable was missing
    node.str <- paste(node.str,
                      tree.frame$split.var.name[parent.node.index], "is.na",
                      " Fired", num.fired, "Nonfired", num.nonfired,
                      "Total", num.total)
  }
  node.str <- paste(node.str, "\n")
  cat(node.str)
}
#
# Print the whole tree, one node per line, in depth-first order
#
draw.tree <- function(tree.frame)
{
  all.nodes.drawn <- F
  current.node <- 0
  tree.size <- dim(tree.frame)[1]
  drawn.nodes <- rep(0, tree.size)
  node.queue <- c()
  while (all.nodes.drawn == F) {
    draw.node(tree.frame, current.node)
    current.node.index <- which(tree.frame$node.label == current.node)
    child.nodes.index  <- which(tree.frame$parent.node == current.node)
    # Push any children onto the front of the queue so children print before siblings
    if (length(child.nodes.index) > 0) {
      child.nodes <- tree.frame$node.label[child.nodes.index]
      node.queue  <- c(child.nodes, node.queue)
    }
    node.queue <- node.queue[node.queue != current.node]
    drawn.nodes[current.node.index] <- 1
    if (sum(drawn.nodes) == tree.size) {
      all.nodes.drawn <- T
    } else {
      current.node <- node.queue[1]
    }
  }
}
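Finally, draw.tree() prints the frame built in the hypothetical sketch above roughly as follows (one line per node, indented by depth):
draw.tree(res$frame)
#  Node  0 :  Root node   Fired 3 Nonfired 2 Total 5
#  	 Node  1 :  sbp <= 130   Fired 0 Nonfired 2 Total 2
#  	 Node  2 :  sbp > 130   Fired 2 Nonfired 0 Total 2
#  	 Node  3 :  sbp is.na   Fired 1 Nonfired 0 Total 1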