A function to select the splitting variable and split point of a node using one of four criteria.
Arguments
- X
An n by d numeric matrix (preferred) or data frame.
- y
A response vector of length n.
- Xsplit
Splitting variables used to construct linear model trees. The default is NULL; this argument is used only when split = "linear".
- split
The criterion used for splitting the nodes. "entropy": information gain and "gini": Gini impurity index for classification; "mse": mean squared error for regression; "linear": mean squared error for multiple linear regression.
- lambda
Used together with split to determine the penalty level of the partition criterion. Three options are provided: lambda = 0: no penalty; lambda = 2: AIC penalty; lambda = 'log' (default): BIC penalty. In addition, lambda can be any value from 0 to n (the training set size).
- weights
A vector of weights applied to the samples when evaluating a split.
- MinLeaf
Minimum node size (default 10).
- numLabels
The number of categories.
- glmnetParList
List of parameters used by the functions
glmnet
andcv.glmnet
in packageglmnet
. If left unchanged, default values will be used, for details seeglmnet
andcv.glmnet
.
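As a point of reference, the classification criteria above can be sketched in a few lines of base R (an illustration of the underlying idea only, not the package's internal implementation):

```r
# Illustration only: impurity measures behind the "gini" and "entropy"
# criteria, and the decrease achieved by a candidate split x <= cutpoint.
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}
entropy <- function(y) {
  p <- table(y) / length(y)
  p <- p[p > 0]            # avoid 0 * log2(0) = NaN on pure nodes
  -sum(p * log2(p))
}
impurity_decrease <- function(x, y, cutpoint, impurity = gini) {
  left <- x <= cutpoint
  impurity(y) -
    mean(left) * impurity(y[left]) -
    mean(!left) * impurity(y[!left])
}

x <- c(1, 2, 3, 10, 11, 12)
y <- factor(c("a", "a", "a", "b", "b", "b"))
impurity_decrease(x, y, cutpoint = 3)            # 0.5 (perfect split)
impurity_decrease(x, y, cutpoint = 3, entropy)   # 1 bit (perfect split)
```

best.cut.node searches all variables and candidate cut points for the split with the largest such decrease.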
Value
A list which contains:
BestCutVar: The best split variable.
BestCutVal: The best split points for the best split variable.
BestIndex: For each variable, the maximum decrease in Gini impurity index, information gain, or mean squared error achieved by its best split point.
fitL and fitR: The multivariate linear models for the left and right nodes after splitting, trained using the function glmnet.
Examples
### Find the best split variable ###
# Classification
data(iris)
X <- as.matrix(iris[, 1:4])
y <- iris[[5]]
(bestcut <- best.cut.node(X, y, split = "gini"))
#> $BestCutVar
#> [1] 3
#>
#> $BestCutVal
#> [1] 2.45
#>
#> $BestIndex
#> [1] 33.70157 17.41287 52.08715 52.08715
#>
(bestcut <- best.cut.node(X, y, split = "entropy"))
#> $BestCutVar
#> [1] 3
#>
#> $BestCutVal
#> [1] 2.45
#>
#> $BestIndex
#> [1] 38169.08 38169.08 38169.08 38169.08
#>
# Regression
data(body_fat)
X <- body_fat[, -1]
y <- body_fat[, 1]
(bestcut <- best.cut.node(X, y, split = "mse"))
#> $BestCutVar
#> [1] 1
#>
#> $BestCutVal
#> [1] 18.65
#>
#> $BestIndex
#> [1] 0.063141385 0.004454660 0.024274816 -0.002024766 0.016210553
#> [6] 0.027524967 0.042242211 0.021316394 0.017042437 0.013917053
#> [11] 0.005538530 0.018069179 0.011068821 0.005087794
#>
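The penalty level of the criterion can also be varied through the lambda argument; for example, an AIC-style penalty on the same regression data (a usage sketch; output omitted):

```r
# Same regression split search, but with an AIC penalty (lambda = 2)
# instead of the default BIC penalty (lambda = 'log').
bestcut_aic <- best.cut.node(X, y, split = "mse", lambda = 2)
```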
set.seed(10)
cutpoint <- 50
X <- matrix(rnorm(100 * 10), 100, 10)
age <- sample(seq(20, 80), 100, replace = TRUE)
height <- sample(seq(50, 200), 100, replace = TRUE)
weight <- sample(seq(5, 150), 100, replace = TRUE)
Xsplit <- cbind(age = age, height = height, weight = weight)
mu <- rep(0, 100)
mu[age <= cutpoint] <- X[age <= cutpoint, 1] + X[age <= cutpoint, 2]
mu[age > cutpoint] <- X[age > cutpoint, 1] + X[age > cutpoint, 3]
y <- mu + rnorm(100)
bestcut <- best.cut.node(X, y, Xsplit,
split = "linear",
glmnetParList = list(lambda = 0)
)
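Since fitL and fitR returned by a split = "linear" call are glmnet fits, predictions for new observations can be obtained with glmnet's predict method. Continuing from the example above (a sketch only; X_new and Xsplit_new are hypothetical new data with the same columns as X and Xsplit):

```r
# Route new observations by the chosen split on the Xsplit variables,
# then predict with the child models via glmnet's predict().
X_new <- matrix(rnorm(10 * 10), 10, 10)
Xsplit_new <- cbind(age = sample(20:80, 10, replace = TRUE),
                    height = sample(50:200, 10, replace = TRUE),
                    weight = sample(5:150, 10, replace = TRUE))
left <- Xsplit_new[, bestcut$BestCutVar] <= bestcut$BestCutVal
predL <- predict(bestcut$fitL, newx = X_new[left, , drop = FALSE])
predR <- predict(bestcut$fitR, newx = X_new[!left, , drop = FALSE])
```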