
A function to select the splitting variable and split point of a node using one of four criteria.

Usage

best.cut.node(
  X,
  y,
  Xsplit = X,
  split,
  lambda = "log",
  weights = 1,
  MinLeaf = 10,
  numLabels = ifelse(split %in% c("gini", "entropy"), length(unique(y)), 0),
  glmnetParList = NULL
)

Arguments

X

An n by d numeric matrix (preferred) or data frame.

y

A response vector of length n.

Xsplit

Splitting variables used to construct linear model trees. The default is X itself; supplying a separate Xsplit is only meaningful when split = "linear".

split

The criterion used for splitting the nodes. "entropy": information gain and "gini": Gini impurity index for classification; "mse": mean squared error for regression; "linear": mean squared error for multiple linear regression.

lambda

A penalty applied to the criterion chosen by split. Three options are provided: lambda = 0 (no penalty), lambda = 2 (AIC penalty), and lambda = "log" (the default, a BIC penalty). In addition, lambda can be any value from 0 to n (the training-set size).

weights

A numeric vector of sample weights used when evaluating a split (default 1, i.e. equal weights).

MinLeaf

Minimal node size (Default 10).

numLabels

The number of response categories (used for classification; 0 for regression, as in the Usage default).

glmnetParList

A list of parameters passed on to the functions glmnet and cv.glmnet in package glmnet. If left as NULL, the defaults of those functions are used; see glmnet and cv.glmnet for details.
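As a hedged illustration of how lambda interacts with the split criteria, the following sketch reuses the iris data from the Examples below and assumes best.cut.node is attached; the particular cuts it finds are not guaranteed to differ.

```r
# Sketch: the same Gini split evaluated under the three documented
# penalty levels; a heavier penalty can shift or suppress borderline cuts.
data(iris)
X <- as.matrix(iris[, 1:4])
y <- iris[[5]]

cut_none <- best.cut.node(X, y, split = "gini", lambda = 0)     # no penalty
cut_aic  <- best.cut.node(X, y, split = "gini", lambda = 2)     # AIC penalty
cut_bic  <- best.cut.node(X, y, split = "gini", lambda = "log") # BIC penalty (default)

# Compare the chosen variable and cut point across penalty levels
rbind(none = c(cut_none$BestCutVar, cut_none$BestCutVal),
      aic  = c(cut_aic$BestCutVar,  cut_aic$BestCutVal),
      bic  = c(cut_bic$BestCutVar,  cut_bic$BestCutVal))
```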

Value

A list which contains:

  • BestCutVar: The best split variable.

  • BestCutVal: The best split points for the best split variable.

  • BestIndex: For each candidate variable, the maximum decrease in Gini impurity, information gain, or mean squared error, depending on split.

  • fitL and fitR: When split = "linear", the multivariate linear models fitted by glmnet to the left and right child nodes produced by the split.

Examples

### Find the best split variable ###
#Classification
data(iris)
X <- as.matrix(iris[, 1:4])
y <- iris[[5]]
(bestcut <- best.cut.node(X, y, split = "gini"))
#> $BestCutVar
#> [1] 3
#> 
#> $BestCutVal
#> [1] 2.45
#> 
#> $BestIndex
#> [1] 33.70157 17.41287 52.08715 52.08715
#> 
(bestcut <- best.cut.node(X, y, split = "entropy"))
#> $BestCutVar
#> [1] 3
#> 
#> $BestCutVal
#> [1] 2.45
#> 
#> $BestIndex
#> [1] 38169.08 38169.08 38169.08 38169.08
#> 

#Regression
data(body_fat)
X <- body_fat[, -1]
y <- body_fat[, 1]
(bestcut <- best.cut.node(X, y, split = "mse"))
#> $BestCutVar
#> [1] 1
#> 
#> $BestCutVal
#> [1] 18.65
#> 
#> $BestIndex
#>  [1]  0.063141385  0.004454660  0.024274816 -0.002024766  0.016210553
#>  [6]  0.027524967  0.042242211  0.021316394  0.017042437  0.013917053
#> [11]  0.005538530  0.018069179  0.011068821  0.005087794
#> 

#Linear model split
set.seed(10)
cutpoint <- 50
X <- matrix(rnorm(100 * 10), 100, 10)
age <- sample(seq(20, 80), 100, replace = TRUE)
height <- sample(seq(50, 200), 100, replace = TRUE)
weight <- sample(seq(5, 150), 100, replace = TRUE)
Xsplit <- cbind(age = age, height = height, weight = weight)
mu <- rep(0, 100)
mu[age <= cutpoint] <- X[age <= cutpoint, 1] + X[age <= cutpoint, 2]
mu[age > cutpoint] <- X[age > cutpoint, 1] + X[age > cutpoint, 3]
y <- mu + rnorm(100)
bestcut <- best.cut.node(X, y, Xsplit, split = "linear",
                         glmnetParList = list(lambda = 0))
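Assuming the return structure described under Value, the fitted child-node models can then be inspected with glmnet's usual accessors; this is a sketch, not output captured from this page.

```r
# The chosen split variable among age/height/weight, and the
# coefficients of the left/right child models (coef() is the
# standard glmnet accessor for fitted coefficients).
bestcut$BestCutVar
coef(bestcut$fitL)
coef(bestcut$fitR)
```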