
A function to select the splitting variable and split point of a node using one of four criteria.

Usage

best.cut.node(
  X,
  y,
  Xsplit = X,
  split,
  lambda = "log",
  weights = 1,
  MinLeaf = 10,
  numLabels = ifelse(split %in% c("gini", "entropy"), length(unique(y)), 0),
  glmnetParList = NULL
)

Arguments

X

An n by d numeric matrix (preferred) or data frame.

y

A response vector of length n.

Xsplit

Splitting variables used to construct linear model trees. The default is X itself; supplying a separate Xsplit is only meaningful when split = "linear".

split

The criterion used for splitting the nodes. "entropy": information gain and "gini": Gini impurity index for classification; "mse": mean squared error for regression; "linear": mean squared error for multiple linear regression.

lambda

A penalty applied to the criterion chosen by split. Three options are provided: lambda = 0 (no penalty), lambda = 2 (AIC penalty), and lambda = "log" (the default, a BIC penalty). In addition, lambda can be any value from 0 to n (the training-set size).

weights

A numeric vector of sample weights used when evaluating a split (default 1, i.e. equal weights).

MinLeaf

Minimal node size (Default 10).

numLabels

The number of response categories (used for classification; 0 for regression, as in the Usage default).

glmnetParList

A list of parameters passed on to the functions glmnet and cv.glmnet in package glmnet. If left as NULL, the defaults of those functions are used; see glmnet and cv.glmnet for details.
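As a hedged illustration of how lambda interacts with the split criteria, the following sketch reuses the iris data from the Examples below and assumes best.cut.node is attached; the particular cuts it finds are not guaranteed to differ.

```r
# Sketch: the same Gini split evaluated under the three documented
# penalty levels; a heavier penalty can shift or suppress borderline cuts.
data(iris)
X <- as.matrix(iris[, 1:4])
y <- iris[[5]]

cut_none <- best.cut.node(X, y, split = "gini", lambda = 0)     # no penalty
cut_aic  <- best.cut.node(X, y, split = "gini", lambda = 2)     # AIC penalty
cut_bic  <- best.cut.node(X, y, split = "gini", lambda = "log") # BIC penalty (default)

# Compare the chosen variable and cut point across penalty levels
rbind(none = c(cut_none$BestCutVar, cut_none$BestCutVal),
      aic  = c(cut_aic$BestCutVar,  cut_aic$BestCutVal),
      bic  = c(cut_bic$BestCutVar,  cut_bic$BestCutVal))
```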

Value

A list which contains:

  • BestCutVar: The best split variable.

  • BestCutVal: The best split points for the best split variable.

  • BestIndex: For each candidate variable, the maximum decrease in Gini impurity, information gain, or mean squared error, depending on split.

  • fitL and fitR: When split = "linear", the multivariate linear models fitted by glmnet to the left and right child nodes produced by the split.

Examples

### Find the best split variable ###
#Classification
data(iris)
X <- as.matrix(iris[, 1:4])
y <- iris[[5]]
(bestcut <- best.cut.node(X, y, split = "gini"))
#> $BestCutVar
#> [1] 3
#> 
#> $BestCutVal
#> [1] 2.45
#> 
#> $BestIndex
#> [1] 33.70157 17.41287 52.08715 52.08715
#> 
(bestcut <- best.cut.node(X, y, split = "entropy"))
#> $BestCutVar
#> [1] 3
#> 
#> $BestCutVal
#> [1] 2.45
#> 
#> $BestIndex
#> [1] 38169.08 38169.08 38169.08 38169.08
#> 

#Regression
data(body_fat)
X <- body_fat[, -1]
y <- body_fat[, 1]
(bestcut <- best.cut.node(X, y, split = "mse"))
#> $BestCutVar
#> [1] 1
#> 
#> $BestCutVal
#> [1] 18.65
#> 
#> $BestIndex
#>  [1]  0.063141385  0.004454660  0.024274816 -0.002024766  0.016210553
#>  [6]  0.027524967  0.042242211  0.021316394  0.017042437  0.013917053
#> [11]  0.005538530  0.018069179  0.011068821  0.005087794
#> 

#Linear model split
set.seed(10)
cutpoint <- 50
X <- matrix(rnorm(100 * 10), 100, 10)
age <- sample(seq(20, 80), 100, replace = TRUE)
height <- sample(seq(50, 200), 100, replace = TRUE)
weight <- sample(seq(5, 150), 100, replace = TRUE)
Xsplit <- cbind(age = age, height = height, weight = weight)
mu <- rep(0, 100)
mu[age <= cutpoint] <- X[age <= cutpoint, 1] + X[age <= cutpoint, 2]
mu[age > cutpoint] <- X[age > cutpoint, 1] + X[age > cutpoint, 3]
y <- mu + rnorm(100)
bestcut <- best.cut.node(X, y, Xsplit, split = "linear",
                         glmnetParList = list(lambda = 0))
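Assuming the return structure described under Value, the fitted child-node models can then be inspected with glmnet's usual accessors; this is a sketch, not output captured from this page.

```r
# The chosen split variable among age/height/weight, and the
# coefficients of the left/right child models (coef() is the
# standard glmnet accessor for fitted coefficients).
bestcut$BestCutVar
coef(bestcut$fitL)
coef(bestcut$fitR)
```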