| Title: | Projection Pursuit Classification Tree Extensions |
|---|---|
| Description: | Implements extensions to the projection pursuit tree algorithm for supervised classification, see Lee, Y. (2013), <doi:10.1214/13-EJS810> and Lee, E-K. (2018) <doi:10.18637/jss.v083.i08>. The algorithm is changed in two ways: improving prediction boundaries by modifying the choice of split points through class subsetting; and increasing flexibility by allowing multiple splits per group. |
| Authors: | Natalia da Silva [aut, cre] (ORCID: <https://orcid.org/0000-0002-6031-7451>), Dianne Cook [aut], Eun-Kyung Lee [aut] |
| Maintainer: | Natalia da Silva <[email protected]> |
| License: | GPL (>= 2) |
| Version: | 0.1.0 |
| Built: | 2026-06-05 07:12:14 UTC |
| Source: | https://github.com/natydasilva/PPtreeExt |
Measurements on rock crabs of the genus Leptograpsus. The data set contains 200 observations from two species of crab (blue and orange), with 50 specimens of each sex of each species, collected on site at Fremantle, Western Australia.
The class variable has 4 classes, the combinations of species and sex (BlueMale, BlueFemale, OrangeMale and OrangeFemale).
the size of the frontal lobe length, in mm
rear width, in mm
length of midline of the carapace, in mm
maximum width of carapace, in mm
depth of the body; for females, measured after displacement of the abdomen, in mm
data(crab)
A data frame with 200 rows and 6 variables
Campbell, N. A. & Mahon, R. J. (1974), A Multivariate Study of Variation in Two Species of Rock Crab of genus Leptograpsus, Australian Journal of Zoology 22(3), 417 - 425.
Shiny app to compare PPtree, PPtreeExt and rpart boundaries in 2D with different simulation scenarios
explorapp(ui, server)
ui |
user interface |
server |
server function |
No return value, called for side effects. A Shiny app is launched.
if (interactive()) { explorapp(ui, server) }
Finds an optimal 1D projection of multivariate data that best separates classes using Linear Discriminant Analysis (LDA) or Penalized Discriminant Analysis (PDA), then determines a cutpoint for classification based on entropy splitting.
findproj_Ext(origclass, origdata, PPmethod = "LDA", q = 1, weight = TRUE, lambda = 0.1)
origclass |
Factor or numeric vector containing the class labels for each observation. |
origdata |
Numeric matrix or data frame containing the predictor variables. Each row represents an observation and each column represents a variable. |
PPmethod |
Character string specifying the projection pursuit method. Either "LDA" or "PDA". Default is "LDA". |
q |
Integer specifying the dimension of the projected data. Default is 1 for 1D projection. |
weight |
Logical indicating whether to use weighted LDA index calculation. Default is TRUE. |
lambda |
Numeric penalty parameter for the PDA method. Default is 0.1. Only used when PPmethod = "PDA". |
This function performs projection pursuit to find a one-dimensional projection that optimally separates classes in multivariate data. The process involves:
Finding the optimal projection direction using either LDA or PDA
Projecting all observations onto this direction
Determining an optimal cutpoint using entropy-based splitting
Creating binary classification indicators based on the cutpoint
The cutpoint is calculated to minimize the weighted entropy of the resulting split. In edge cases where the cutpoint equals the maximum projected value, the function uses the second-largest value to ensure a valid split.
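The entropy-based cutpoint search can be sketched in base R. This is a minimal illustration and not the package's internal code: the helper names `entropy` and `entropy_cutpoint`, the use of natural logarithms, the midpoint candidates, and the direction `alpha` are all assumptions made for the example.

```r
# Shannon entropy (natural log, an assumption) of a vector of class labels.
entropy <- function(cl) {
  p <- table(cl) / length(cl)
  p <- p[p > 0]
  -sum(p * log(p))
}

# Scan midpoints between consecutive distinct projected values and return
# the candidate with the smallest size-weighted entropy of the two groups.
entropy_cutpoint <- function(proj, cl) {
  v <- sort(unique(proj))
  stopifnot(length(v) >= 2)
  cand <- (v[-length(v)] + v[-1]) / 2
  score <- vapply(cand, function(cut) {
    left <- proj < cut
    (sum(left) * entropy(cl[left]) + sum(!left) * entropy(cl[!left])) /
      length(cl)
  }, numeric(1))
  cand[which.min(score)]
}

# Toy data: two separated classes and a hypothetical projection direction.
x <- cbind(x1 = c(1, 2, 3, 7, 8, 9), x2 = c(1, 1, 2, 5, 6, 6))
cl <- factor(c("a", "a", "a", "b", "b", "b"))
alpha <- c(1, 1) / sqrt(2)

proj <- as.vector(x %*% alpha)   # project every observation onto alpha
cut <- entropy_cutpoint(proj, cl)
left <- proj < cut               # binary indicators for the two groups
right <- !left
```

Because this sketch only considers midpoints between distinct projected values, neither group can be empty, so the edge case of a cutpoint at the maximum projected value does not arise here.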
A list with the following components:
Index |
Numeric value representing the optimization criterion achieved by the best projection. Higher values indicate better class separation. |
Alpha |
Numeric vector of projection coefficients defining the optimal 1D projection, with one entry per predictor variable. |
C |
Numeric scalar representing the optimal cutpoint (threshold) on the projected data. This value is determined using entropy-based splitting and divides observations into two groups for classification. |
IOindexL |
Logical vector with one entry per observation, indicating the observations assigned to the left group. |
IOindexR |
Logical vector with one entry per observation, indicating the observations assigned to the right group. |
The vectors IOindexL and IOindexR are complementary
(mutually exclusive and exhaustive), meaning every observation is assigned
to exactly one group.
Lee, YD, Cook, D., Park JW, and Lee, EK (2013) PPtree: Projection Pursuit Classification Tree, Electronic Journal of Statistics, 7:1369-1386.
A total of 159 fish of 7 species were caught and measured. Altogether there are 7 variables. All the fish were caught from the same lake (Laengelmavesi) near Tampere in Finland.
The class variable has 7 fish classes, with 35 cases of Bream, 11 cases of Parkki, 56 cases of Perch, 17 cases of Pike, 20 cases of Roach, 14 cases of Smelt and 6 cases of Whitewish.
Weight of the fish (in grams)
Length from the nose to the beginning of the tail (in cm)
Length from the nose to the notch of the tail (in cm)
Length from the nose to the end of the tail (in cm)
Maximal height as % of Length3
Maximal width as % of Length3
data(fishcatch)
A data frame with 159 rows and 7 variables
[fishcatch](http://www.amstat.org/publications/jse/jse_data_archive.htm)
Contains measurements on 214 observations of 6 types of glass, defined in terms of their oxide content.
The class variable has 6 types of glass.
refractive index
Sodium (unit measurement: weight percent in corresponding oxide).
Magnesium
Aluminum
Silicon
Potassium
Calcium
Barium
Iron
data(glass)
A data frame with 214 rows and 10 variables
Contains 2310 instances drawn from 7 outdoor images.
The class variable has 7 types of outdoor images: brickface, cement, foliage, grass, path, sky, and window.
the column of the center pixel of the region
the row of the center pixel of the region.
the number of pixels in a region = 9.
the results of a line extraction algorithm that counts how many lines of length 5 (any orientation) with low contrast, less than or equal to 5, go through the region.
measures the contrast of horizontally adjacent pixels in the region. There are 6; the mean and standard deviation are given. This attribute is used as a vertical edge detector.
X5 sd
measures the contrast of vertically adjacent pixels. Used for horizontal line detection.
sd X7
the average over the region of (R + G + B)/3
the average over the region of the R value.
the average over the region of the B value.
the average over the region of the G value.
measures the excess red: (2R - (G + B))
measures the excess blue: (2B - (G + R))
measures the excess green: (2G - (R + B))
3-d nonlinear transformation of RGB. (Algorithm can be found in Foley and VanDam, Fundamentals of Interactive Computer Graphics)
mean of X16
hue mean
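The intensity and excess-colour formulas listed above can be checked with a few lines of base R. This is only a sketch of the arithmetic: the function name `rgb_features` and the output names are hypothetical and are not column names of the data set.

```r
# Compute the intensity and excess-colour measures from region-average
# R, G and B values, following the formulas in the variable descriptions.
rgb_features <- function(R, G, B) {
  c(intensity = (R + G + B) / 3,
    exred     = 2 * R - (G + B),
    exblue    = 2 * B - (G + R),
    exgreen   = 2 * G - (R + B))
}

f <- rgb_features(R = 3, G = 2, B = 1)
f
```

By construction the three excess measures always sum to zero, so they carry only two independent pieces of information.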
data(image)
A data frame with 2310 observations and 19 variables
Projection Pursuit Optimization Using LDA Index
LDAopt_Ext(origclass, origdata, q = 1, weight = TRUE, ...)
origclass |
Factor or numeric vector containing the class labels for each observation. |
origdata |
Numeric matrix or data frame containing the predictor variables without class information. Each row represents an observation and each column represents a variable. |
q |
Integer specifying the dimension of the projection space. Default is 1 for 1-dimensional projection. |
weight |
Logical indicating whether to use weighted LDA index calculation. Default is TRUE. |
... |
Additional arguments to be passed to internal optimization methods. |
Finds the q-dimensional optimal projection using the Linear Discriminant Analysis (LDA) projection pursuit index. This implementation follows the method described in PPtree.
The LDA projection pursuit index measures class separation by maximizing the ratio of between-class variance to within-class variance in the projected space. This function:
Calls LDAopt to find the optimal q-dimensional projection directions
Evaluates the LDA index for the optimal projection using LDAindex2
Returns both the projection matrix and its associated index value
When weight = TRUE, the index calculation accounts for class proportions,
giving appropriate weight to each class in the optimization.
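As an illustration of what the index measures, the sketch below computes an LDA projection pursuit index of the form 1 - |A'WA| / |A'(W + B)A|, the form given in Lee et al. (2005), with class-size weights in B. It is written in base R under the assumption that this is the quantity being reported; the helper names `scatter` and `lda_index` are hypothetical and are not part of the package.

```r
# Between-class (B) and within-class (W) sums of squares and cross-products.
scatter <- function(x, cl) {
  x <- as.matrix(x)
  cl <- droplevels(as.factor(cl))
  xbar <- colMeans(x)
  B <- W <- matrix(0, ncol(x), ncol(x))
  for (g in levels(cl)) {
    xg <- x[cl == g, , drop = FALSE]
    mg <- colMeans(xg)
    B <- B + nrow(xg) * tcrossprod(mg - xbar)   # weighted by class size
    W <- W + crossprod(sweep(xg, 2, mg))        # centred within the class
  }
  list(B = B, W = W)
}

# LDA index of a projection `a` (a vector for 1D, a p x q matrix otherwise).
lda_index <- function(x, cl, a) {
  s <- scatter(x, cl)
  a <- as.matrix(a)
  1 - det(t(a) %*% s$W %*% a) / det(t(a) %*% (s$W + s$B) %*% a)
}

# Compare two axis-aligned 1D projections of the built-in iris data.
idx_petal <- lda_index(iris[, 1:4], iris$Species, c(0, 0, 1, 0))
idx_sepal <- lda_index(iris[, 1:4], iris$Species, c(0, 1, 0, 0))
c(petal_length = idx_petal, sepal_width = idx_sepal)
```

The projection onto petal length scores higher than the projection onto sepal width, which matches the usual observation that petal length separates the three species far better.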
An object of class "PPoptim", which is a list containing:
indexbest |
Numeric value representing the maximum LDA index achieved by the optimal projection. Higher values indicate better class separation. |
projbest |
Numeric matrix of optimal projection coefficients with dimensions p x q, where p is the number of predictor variables. |
origclass |
The original class information vector passed as input, preserved for reference. |
origdata |
The original data matrix without class information, preserved for reference. |
Lee, EK., Cook, D., Klinke, S., and Lumley, T. (2005) Projection Pursuit for Exploratory Supervised Classification, Journal of Computational and Graphical Statistics, 14(4):831-846.
Leukemia data set
data(leukemia)
This dataset comes from a study of gene expression in two types of acute leukemias, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML). Gene expression levels were measured using Affymetrix high density oligonucleotide arrays containing 6817 human genes. The data set contains 72 observations from 3 leukemia classes.
The class variable has 3 classes, with 38 cases of B-cell ALL, 25 cases of AML and 9 cases of T-cell ALL.
gene expression levels.
A data frame with 72 rows and 41 variables
Dudoit, S., Fridlyand, J. and Speed, T. P. (2002). Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. Journal of the American Statistical Association, 97, 77-87.
Gene expression in the three most prevalent adult lymphoid malignancies: B-cell chronic lymphocytic leukemia (B-CLL), follicular lymphoma (FL), and diffuse large B-cell lymphoma (DLBCL). Gene expression levels were measured using a specialized cDNA microarray, the Lymphochip, containing genes that are preferentially expressed in lymphoid cells or that are of known immunologic or oncologic importance. This data set contains 80 observations from 3 lymphoma types.
The class variable has 3 classes, with 29 cases of B-cell chronic lymphocytic leukemia (B-CLL), 42 cases of diffuse large B-cell lymphoma (DLBCL) and 9 cases of follicular lymphoma (FL).
gene expression
data(lymphoma)
A data frame with 80 rows and 51 variables
Dudoit, S., Fridlyand, J. and Speed, T. P. (2002). Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. Journal of the American Statistical Association, 97, 77-87.
cDNA microarrays were used to examine the variation in gene expression among the 60 cell lines. The cell lines are derived from tumors with different sites of origin. This data set contains 61 observations and 30 feature variables from 8 different tissue types.
The class variable has 8 different tissue types: 9 cases of breast, 5 cases of central nervous system (CNS), 7 cases of colon, 8 cases of leukemia, 8 cases of melanoma, 9 cases of non-small-cell lung carcinoma (NSCLC), 6 cases of ovarian and 9 cases of renal.
gene expression information
data(NCI60)
A data frame with 61 rows and 31 variables
Dudoit, S., Fridlyand, J. and Speed, T. P. (2002). Comparison of Discrimination Methods for the Classification of Tumors Using Gene Expression Data. Journal of the American Statistical Association, 97, 77-87.
Contains 572 observations and 10 variables.
Three super-classes of Italy: North, South and the island of Sardinia
Nine collection areas: three from North, four from South and two from Sardinia
fatty acids percent x 100
fatty acids percent x 100
fatty acids percent x 100
fatty acids percent x 100
fatty acids percent x 100
fatty acids percent x 100
fatty acids percent x 100
fatty acids percent x 100
data(olive)
A data frame with 572 observations and 10 variables
A data set containing 195 observations from 2 classes (healthy and Parkinson).
The class variable has 2 classes, with 48 cases of healthy people and 147 cases with Parkinson. The feature variables are biomedical voice measures.
Average vocal fundamental frequency
Maximum vocal fundamental frequency
Minimum vocal fundamental frequency
MDVP:Jitter(%) measures of variation in fundamental frequency
MDVP:Jitter(Abs) measures of variation in fundamental frequency
MDVP:RAP measures of variation in fundamental frequency
MDVP:PPQ measures of variation in fundamental frequency
Jitter:DDP measures of variation in fundamental frequency
MDVP:Shimmer measures of variation in amplitude
MDVP:Shimmer(dB) measures of variation in amplitude
Shimmer:APQ3 measures of variation in amplitude
Shimmer:APQ5 measures of variation in amplitude
MDVP:APQ measures of variation in amplitude
Shimmer:DDA measures of variation in amplitude
NHR measures of ratio of noise to tonal components in the voice
HNR measures of ratio of noise to tonal components in the voice
RPDE nonlinear dynamical complexity measures
D2 nonlinear dynamical complexity measures
DFA - Signal fractal scaling exponent
spread1 Nonlinear measures of fundamental frequency variation
spread2 Nonlinear measures of fundamental frequency variation
PPE Nonlinear measures of fundamental frequency variation
data(parkinson)
A data frame with 195 rows and 23 variables
[Parkinson](https://archive.ics.uci.edu/ml/datasets/Parkinsons)
Projection Pursuit Optimization Using PDA Index
PDAopt_Ext(origclass, origdata, q = 1, weight = TRUE, lambda = 0.1, ...)
origclass |
Factor or numeric vector containing the class labels for each observation. |
origdata |
Numeric matrix or data frame containing the predictor variables without class information. Each row represents an observation and each column represents a variable. |
q |
Integer specifying the dimension of the projection space. Default is 1 for 1-dimensional projection. |
weight |
Logical indicating whether to use weighted PDA index calculation. Default is TRUE. |
lambda |
Numeric penalty parameter for the PDA index. Controls the amount of regularization applied. Default is 0.1. Higher values increase regularization, which is useful for high-dimensional or collinear data. |
... |
Additional arguments to be passed to internal optimization methods. |
Finds the q-dimensional optimal projection using the Penalized Discriminant Analysis (PDA) projection pursuit index. This implementation follows the method described in PPtree and is particularly useful for high-dimensional data (large p, small n).
The Penalized Discriminant Analysis (PDA) projection pursuit index extends LDA by incorporating a penalty term, making it particularly suitable for:
High-dimensional data where the number of variables exceeds the number of observations (p > n)
Data with multicollinearity among predictor variables
Cases where standard LDA fails due to singular covariance matrices
The function performs the following steps:
Calls PDAopt to find the optimal q-dimensional projection directions with regularization
Evaluates the PDA index for the optimal projection using PDAindex2
Returns both the projection matrix and its associated index value
The lambda parameter controls the trade-off between maximizing class separation
and regularization. When weight = TRUE, the index calculation accounts for
class proportions in the optimization.
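The role of lambda can be illustrated with a base R sketch of a penalized index. It assumes the form described in Lee and Cook (2010), in which the within-class matrix W is replaced by (1 - lambda) W + lambda diag(W), so that lambda = 0 recovers the LDA index. The helper names `scatter` and `pda_index` and the simulated data are hypothetical and are not taken from the package.

```r
# Between-class (B) and within-class (W) sums of squares and cross-products.
scatter <- function(x, cl) {
  x <- as.matrix(x)
  cl <- droplevels(as.factor(cl))
  xbar <- colMeans(x)
  B <- W <- matrix(0, ncol(x), ncol(x))
  for (g in levels(cl)) {
    xg <- x[cl == g, , drop = FALSE]
    mg <- colMeans(xg)
    B <- B + nrow(xg) * tcrossprod(mg - xbar)
    W <- W + crossprod(sweep(xg, 2, mg))
  }
  list(B = B, W = W)
}

# Penalized index: shrink the off-diagonal elements of W by (1 - lambda).
pda_index <- function(x, cl, a, lambda = 0.1) {
  s <- scatter(x, cl)
  a <- as.matrix(a)
  Wl <- (1 - lambda) * s$W + lambda * diag(diag(s$W), nrow = nrow(s$W))
  1 - det(t(a) %*% Wl %*% a) / det(t(a) %*% (Wl + s$B) %*% a)
}

# Simulated "large p, small n" data: 10 observations, 30 variables.
set.seed(42)
n <- 10
p <- 30
cl <- rep(c("a", "b"), each = n / 2)
x <- matrix(rnorm(n * p), n, p)
x[cl == "b", ] <- x[cl == "b", ] + 2   # shift the second class
a <- rep(1, p) / sqrt(p)               # an arbitrary unit-length direction

idx <- pda_index(x, cl, a, lambda = 0.1)
idx
```

In this form the penalty leaves the diagonal of W untouched and only damps the estimated correlations, which is what keeps the penalized matrix well conditioned when p exceeds n.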
An object of class "PPoptim", which is a list containing:
indexbest |
Numeric value representing the maximum PDA index achieved by the optimal projection. Higher values indicate better class separation with appropriate regularization. |
projbest |
Numeric matrix of optimal projection coefficients with dimensions p x q, where p is the number of predictor variables. |
origclass |
The original class information vector passed as input, preserved for reference. |
origdata |
The original data matrix without class information, preserved for reference. |
Lee, EK, Cook, D. (2010) A Projection Pursuit Index for Large p Small n Data, Statistics and Computing, 20:381-392.
Visualizes a Projection Pursuit (PP) classification tree using grid graphics. The function creates a hierarchical tree diagram showing the structure of splits and terminal nodes with class assignments. Supports automatic scaling for large trees.
## S3 method for class 'PPtreeExtclass'
plot(x, font.size = 17, width.size = 1, main = "Projection Pursuit Classification Tree", sub = NULL, auto.scale = TRUE, min.width = NULL, min.height = NULL, ...)
x |
An object of class "PPtreeExtclass". |
font.size |
Numeric. Font size for text labels in the plot. Default is 17. Will be automatically reduced for large trees when auto.scale = TRUE. |
width.size |
Numeric. Width scaling factor for graphical elements (nodes, edges). Default is 1. Will be automatically adjusted for large trees when auto.scale = TRUE. |
main |
Character string. Main title for the plot. Default is "Projection Pursuit Classification Tree". |
sub |
Character string or NULL. Subtitle for the plot. Default is NULL (no subtitle). |
auto.scale |
Logical. If TRUE (default), automatically adjusts plot dimensions and font size based on tree size. Recommended for trees with more than 20 terminal nodes or depth greater than 15. |
min.width |
Numeric or NULL. Minimum width (number of terminal nodes) for the plot when auto.scale = FALSE. Default is NULL. |
min.height |
Numeric or NULL. Minimum height (tree depth) for the plot when auto.scale = FALSE. Default is NULL. |
... |
Additional arguments (currently not used). |
The plot displays:
Internal nodes: Shown as ellipses with the projection used for splitting (e.g., "proj1 * X")
Terminal nodes: Shown as gray rectangles with the assigned class label
Edges: Labeled with split rules ("< cutN" for left child, ">= cutN" for right child)
Node IDs: Small boxes at the top of each node
Auto-scaling behavior:
When auto.scale = TRUE and the tree has more than 20 terminal nodes
or depth greater than 15:
Font size is reduced: max(10, 17 - floor((n_terminal - 20) / 5))
Width size is reduced: max(0.7, 1 - (n_terminal - 20) * 0.01)
Width is set to: max(n_terminal, n_classes)
Height is set to: tree depth
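The auto-scaling rules above amount to a few lines of arithmetic. The sketch below restates them in base R so the effect of tree size can be checked by hand; the function name `auto_scale` is hypothetical, and returning the width and height for small trees as well is an assumption of this example.

```r
# Restate the documented auto-scaling formulas for a tree with
# `n_terminal` terminal nodes, depth `depth` and `n_classes` classes.
auto_scale <- function(n_terminal, depth, n_classes,
                       font.size = 17, width.size = 1) {
  if (n_terminal > 20 || depth > 15) {
    font.size <- max(10, 17 - floor((n_terminal - 20) / 5))
    width.size <- max(0.7, 1 - (n_terminal - 20) * 0.01)
  }
  list(font.size = font.size,
       width.size = width.size,
       width = max(n_terminal, n_classes),
       height = depth)
}

medium <- auto_scale(n_terminal = 30, depth = 8, n_classes = 5)
large  <- auto_scale(n_terminal = 100, depth = 12, n_classes = 5)
str(medium)  # font.size 15, width.size 0.9, width 30, height 8
str(large)   # font.size 10, width.size 0.7, width 100, height 12
```

For 30 terminal nodes the font drops by two points and the width factor by 0.1; for 100 terminal nodes both values hit their floors of 10 and 0.7.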
Manual scaling:
When auto.scale = FALSE, you can control dimensions with
min.width and min.height.
Invisibly returns a list with:
width |
Numeric. The width used for plotting (number of terminal nodes) |
height |
Numeric. The height used for plotting (tree depth) |
font.size |
Numeric. The final font size used (after auto-scaling) |
This function requires the grid package. It will create a new graphics
page using grid.newpage().
For very large trees (>50 terminal nodes), consider:
Exporting to a large PNG or PDF file
Manually reducing font.size further
Pruning the tree before plotting
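Exporting to a large file can be done with the standard graphics devices. The sketch below opens a PDF device sized from the tree dimensions and draws on a fresh grid page; the sizing factors, the file name, and the tree dimensions are assumptions, and the `grid.text()` call is only a placeholder for the actual `plot()` call on a fitted tree.

```r
library(grid)

# Hypothetical tree size used to choose the page dimensions (in inches).
n_terminal <- 60
depth <- 12
out <- file.path(tempdir(), "pptree_large.pdf")

pdf(out, width = max(7, n_terminal * 0.4), height = max(7, depth * 0.8))
grid.newpage()
# Placeholder drawing. With a fitted PPtreeExtclass object `fit` (not
# created here) this line would be: plot(fit, font.size = 8)
grid.text("tree plot goes here", gp = gpar(fontsize = 8))
dev.off()

file.exists(out)
```

A vector format such as PDF keeps small labels legible when zooming, which is usually the main problem with very wide trees.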
library(grid)
# Example with penguins dataset
data(penguins)
penguins <- na.omit(penguins[, -c(2, 7)])
penguins_ppt <- PPtreeExtclass(species ~ bill_len + bill_dep + flipper_len + body_mass,
                               data = penguins, PPmethod = "PDA", srule = FALSE)
plot(penguins_ppt, main = "Penguins Classification with PPtreeExt",
     font.size = 8, width.size = 0.7)
Constructs a projection pursuit classification tree using various projection pursuit indices. Optionally performs random variable selection at each split, which allows the tree to be used as a building block in a random forest methodology. When size.p = 1, this reduces to the PPtree algorithm.
PPtreeExt_split(formula, data, PPmethod = "LDA", size.p = 1, lambda = 0.1, entro = FALSE, entroindiv = FALSE, ...)
formula |
A formula of the form class ~ x1 + x2 + ..., where the left-hand side is the class variable. |
data |
Data frame containing both the class variable and predictor variables. |
PPmethod |
Character string specifying the projection pursuit index to use. Either "LDA" or "PDA". Default is "LDA". |
size.p |
Numeric value between 0 and 1 specifying the proportion of variables to randomly sample at each split. Default is 1, which uses all variables at each split (standard PPtree). Values less than 1 introduce randomness similar to random forests, which can improve robustness and reduce overfitting. |
lambda |
Numeric penalty parameter for the PDA index, ranging from 0 to 1. Default is 0.1. |
entro |
Logical indicating whether to use entropy-based stopping rules for tree construction. Default is FALSE. |
entroindiv |
Logical indicating whether to compute entropy for each individual observation in the 1D projection. Default is FALSE. |
... |
Additional arguments to be passed to internal tree construction methods. |
This function extends the standard PPtree algorithm by incorporating random variable selection at each split and by defining the split through subsetting groups. The algorithm:
At each node, randomly samples size.p * 100% of the predictor variables
Finds the optimal projection using the selected variables and specified index (LDA or PDA)
Determines a cutpoint based on entropy splitting if entropy parameters are set
Recursively splits the data until stopping criteria are met
The entro parameter enables entropy-based stopping rules that halt splitting
when nodes become sufficiently pure or small. The entroindiv parameter computes
entropy at the individual observation level in the projected space, which can provide
more refined splitting decisions.
When size.p = 1, all variables are used at each split and the function
behaves as a standard PPtree. Values of size.p < 1 introduce randomness
that can improve model robustness, especially for high-dimensional data or when
building ensemble models.
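The variable sampling controlled by size.p can be pictured as follows. This base R sketch only illustrates the idea: the helper name `sample_vars` is hypothetical, and the rounding rule (round to the nearest integer, keeping at least one variable) is an assumption, since the rule used internally is not stated here.

```r
# Draw a random subset containing a proportion `size.p` of the variables.
# Assumption: the count is rounded and never drops below one variable.
sample_vars <- function(var_names, size.p = 1) {
  stopifnot(size.p > 0, size.p <= 1)
  k <- max(1, round(size.p * length(var_names)))
  sample(var_names, k)
}

set.seed(1)
vars <- c("bill_len", "bill_dep", "flipper_len", "body_mass")
half <- sample_vars(vars, size.p = 0.5)  # two of the four names
full <- sample_vars(vars, size.p = 1)    # all four, in random order
half
full
```

With size.p = 1 every split sees the same set of variables, so the only source of variation between trees would be the data they are fitted to.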
An object of class "PPtreeclass", which is a list containing:
Tree.Struct |
A matrix defining the tree structure of the projection pursuit classification tree. Each row represents a node with columns: node ID, left child node ID, right child node ID (or final class if terminal), coefficient ID, and index value. |
projbest.node |
A matrix where each row contains the optimal 1-dimensional projection coefficients for each split node. The number of columns equals the number of predictor variables. |
splitCutoff.node |
A data frame containing the cutoff values and splitting rules for each split node. Contains 8 rule columns defining the classification boundaries. |
origclass |
Factor vector of the original class labels from the input data. |
origdata |
Matrix of the original predictor variables (without the class variable). |
Lee, YD, Cook, D., Park JW, and Lee, EK (2013) PPtree: Projection pursuit classification tree, Electronic Journal of Statistics, 7:1369-1386.
TreeExt.construct, findproj_Ext,
LDAopt_Ext, PDAopt_Ext
data(penguins)
penguins <- na.omit(penguins[, -c(2, 7, 8)])
require(rsample)
penguins_spl <- rsample::initial_split(penguins, strata = species)
penguins_train <- training(penguins_spl)
penguins_test <- testing(penguins_spl)
penguins_ppt2 <- PPtreeExt_split(species ~ bill_len + bill_dep + flipper_len + body_mass,
                                 data = penguins_train, PPmethod = "LDA",
                                 tot = nrow(penguins_train), tol = 0.5, entro = TRUE)
Projection Pursuit Classification Tree with Extensions
PPtreeExtclass(formula, data, PPmethod = "LDA", weight = TRUE, lambda = 0.1, srule, tot = nrow(data), tol = 0.5, ...)
formula |
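An object of class "formula" giving the class variable on the left-hand side and the predictor variables on the right-hand side. |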
data |
Data frame containing both the class variable and predictor variables specified in the formula. |
PPmethod |
Character string specifying the projection pursuit index to use. Either "LDA" or "PDA". Default is "LDA". |
weight |
Logical indicating whether to use weighted index calculation in LDA and PDA. When TRUE, the index calculation accounts for class proportions. Default is TRUE. |
lambda |
Numeric penalty parameter for the PDA index, ranging from 0 to 1. Default is 0.1. Only used when PPmethod = "PDA". |
srule |
Logical flag for the stopping rule. If TRUE, the stopping rule is used; if FALSE, splitting stops only for pure or empty nodes. |
tot |
Integer specifying the total number of observations in the original dataset. Default is nrow(data). |
tol |
Numeric tolerance value for the entropy-based stopping rule. Nodes with entropy below this threshold will not be split further. Default is 0.5. Lower values create deeper trees. |
... |
Additional arguments to be passed to internal tree construction methods. |
Constructs a projection pursuit classification tree using various projection pursuit indices (LDA or PDA) at each split. This extended version includes customizable stopping rules based on entropy and node size criteria.
This function builds a binary classification tree where each split is determined by finding an optimal projection of the data onto a one-dimensional space using either LDA or PDA indices. The algorithm works as follows:
At each node, find the optimal 1D projection that best separates classes
Project the data onto this direction and find an optimal cutpoint
Split observations based on the cutpoint into left and right child nodes
Recursively repeat until stopping criteria are met
When srule = TRUE, a node stops splitting if any of the following conditions hold:
The node is pure (contains only one class)
The node contains at most 5% of the total observations (n/tot <= 0.05)
The node entropy is below the tolerance threshold (entropy < tol)
When srule = FALSE, splitting only stops for pure or empty nodes, potentially
creating deeper, more complex trees.
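The stopping conditions can be written as a small predicate. The sketch below is a base R restatement of the three documented conditions, not the package's internal code; the function names are hypothetical and the entropy is assumed to use natural logarithms.

```r
# Shannon entropy (natural log, an assumption) of the class labels in a node.
node_entropy <- function(cl) {
  p <- table(cl) / length(cl)
  p <- p[p > 0]
  -sum(p * log(p))
}

# TRUE when a node holding the labels `cl` should not be split further.
stop_splitting <- function(cl, tot, tol = 0.5, srule = TRUE) {
  n <- length(cl)
  pure_or_empty <- n == 0 || length(unique(cl)) == 1
  if (!srule) return(pure_or_empty)
  pure_or_empty || n / tot <= 0.05 || node_entropy(cl) < tol
}

mixed <- rep(c("a", "b"), each = 50)        # entropy log(2), about 0.69
skew  <- c(rep("a", 95), rep("b", 5))       # entropy about 0.20

stop_splitting(mixed, tot = 100, tol = 0.5)      # FALSE: keep splitting
stop_splitting(skew, tot = 100, tol = 0.5)       # TRUE: entropy below tol
stop_splitting(skew, tot = 100, srule = FALSE)   # FALSE: node is not pure
stop_splitting(c("a", "b", "a"), tot = 100)      # TRUE: only 3% of the data
```

The last two calls show why srule = FALSE tends to give deeper trees: a nearly pure or very small node is still split as long as it contains more than one class.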
LDA: Suitable for most classification problems with moderate dimensionality
PDA: Recommended for high-dimensional data (p > n) or data with multicollinearity
The tol parameter controls tree complexity: smaller values allow more splits
(deeper trees with potentially better training accuracy but higher risk of overfitting),
while larger values create simpler trees (better generalization but potentially
underfitting).
An object of class c("PPtreeExtclass", "PPtreeclass"), which is a list containing:
Tree.Struct |
A matrix defining the tree structure. Each row represents a node with 5 columns: node ID, left child node ID, right/final node ID (class label if terminal node), coefficient ID (projection index), and optimization index value. |
projbest.node |
A matrix where each row contains the optimal 1-dimensional projection coefficients for each split node. Each row has length equal to the number of predictor variables. |
splitCutoff.node |
A numeric vector or matrix containing the cutoff values (thresholds) used at each split node for classification decisions. |
origclass |
Factor vector of the original class labels from the input data. |
origdata |
Matrix of the original predictor variables (without the class variable). |
terms |
The terms object from the model frame, preserving the formula structure. |
This function does not support interaction terms in the formula. Use only
additive terms (e.g., y ~ x1 + x2) and not multiplicative terms (e.g.,
y ~ x1 * x2).
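A formula can be checked for interaction terms before it is passed to the function. The sketch below uses the `order` attribute of the terms object from the stats package, where any term of order greater than one is an interaction; the helper name `has_interaction` is hypothetical.

```r
# TRUE when the formula contains at least one interaction term.
has_interaction <- function(f) {
  any(attr(terms(f), "order") > 1)
}

has_interaction(y ~ x1 + x2)   # FALSE
has_interaction(y ~ x1 * x2)   # TRUE
has_interaction(y ~ x1:x2)     # TRUE
```

A guard of this kind can be placed in front of a model-fitting call so that an unsupported formula fails early with a clear message.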
Lee, YD, Cook, D., Park JW, and Lee, EK (2013) PPtree: Projection Pursuit Classification Tree, Electronic Journal of Statistics, 7:1369-1386.
TreeExt.construct, PPtreeExt_split,
findproj_Ext, predict.PPtreeExtclass
set.seed(234)
data(penguins)
penguins <- na.omit(penguins[, -c(2, 7, 8)])
require(rsample)
penguins_spl <- rsample::initial_split(penguins, strata = species)
penguins_train <- training(penguins_spl)
penguins_test <- testing(penguins_spl)
penguins_ppt <- PPtreeExtclass(species ~ bill_len + bill_dep + flipper_len + body_mass,
                               data = penguins_train, PPmethod = "LDA",
                               tot = nrow(penguins_train), tol = 0.2, srule = TRUE)
Predicts class labels for new observations using a fitted projection pursuit classification tree and optionally calculates prediction error when true class labels are provided.
## S3 method for class 'PPtreeExtclass'
predict(object, newdata, true.class = NULL, ...)
object |
An object of class "PPtreeExtclass". |
newdata |
A data frame or matrix containing the predictor variables for which predictions are to be made. Must contain the same variables (in the same order) as used in the training data, but without the class variable. |
true.class |
Optional vector of true class labels for the test data. If provided, prediction error will be calculated. Can be either numeric or factor. Default is NULL. |
... |
Additional arguments (currently not used). |
A list with two components:
predict.class |
A character vector of predicted class labels for each observation in newdata. |
predict.error |
Integer count of prediction errors (misclassifications). Only computed when true.class is provided. |
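Since the error is reported as a count, turning it into a rate or a confusion table takes one more step. The sketch below uses made-up label vectors in place of real predictions; `pred` stands in for the predict.class component and `truth` for the labels passed as true.class.

```r
# Hypothetical predicted labels (character) and true labels (factor).
pred  <- c("Adelie", "Gentoo", "Gentoo", "Chinstrap", "Adelie")
truth <- factor(c("Adelie", "Gentoo", "Chinstrap", "Chinstrap", "Gentoo"))

n_err <- sum(pred != as.character(truth))   # count of misclassifications
err_rate <- n_err / length(truth)           # proportion misclassified

n_err      # 2
err_rate   # 0.4
table(predicted = pred, observed = truth)   # confusion table
```

Converting the factor with `as.character()` before comparing avoids relying on the two vectors sharing the same set of levels.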
data(penguins)
penguins <- na.omit(penguins[, -c(2, 7, 8)])
require(rsample)
penguins_spl <- rsample::initial_split(penguins, strata = species)
penguins_train <- training(penguins_spl)
penguins_test <- testing(penguins_spl)
penguins_ppt <- PPtreeExtclass(species ~ bill_len + bill_dep + flipper_len + body_mass,
                               data = penguins_train, PPmethod = "LDA",
                               tot = nrow(penguins_train), tol = 0.5)
predict(object = penguins_ppt, newdata = penguins_test[, -1],
        true.class = penguins_test$species)
Prints a summary of a fitted projection pursuit classification tree, including the tree structure, optionally the projection coefficients and cutoff values, and the training error rate.
## S3 method for class 'PPtreeExtclass'
print(x, coef.print = FALSE, cutoff.print = FALSE, verbose = TRUE, ...)
x |
An object of class "PPtreeExtclass". |
coef.print |
Logical indicating whether to print the projection coefficients for each split node. Default is FALSE. |
cutoff.print |
Logical indicating whether to print the cutoff values for each split node. Default is FALSE. |
verbose |
Logical indicating whether to print the tree structure and error rate. Default is TRUE. |
... |
Additional arguments (currently not used). |
The function traverses the tree structure stored in x$Tree.Struct and
creates a hierarchical text representation. When coef.print = TRUE,
the projection coefficients (linear combinations of features) used at each
split are displayed. When cutoff.print = TRUE, the threshold values
used to determine left/right splits are shown.
The training error rate is computed by applying the fitted tree to the original training data.
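The traversal that produces a hierarchical text view can be sketched on a hand-made structure matrix. Everything here is illustrative: the matrix values, the column names, the output format, and in particular the convention that a terminal node is marked by a left child ID of 0 are assumptions of this example, not a description of the printed output.

```r
# Hypothetical tree structure: node ID, left child ID, right child ID
# (or class code for a terminal node), coefficient ID, index value.
tree <- rbind(
  c(1, 2, 3, 1, 0.95),
  c(2, 0, 1, 0, 0),
  c(3, 4, 5, 2, 0.80),
  c(4, 0, 2, 0, 0),
  c(5, 0, 3, 0, 0)
)
colnames(tree) <- c("id", "left", "right_or_class", "coef_id", "index")

# Build one text line per node, indenting by depth (pre-order traversal).
tree_lines <- function(tree, id = 1, depth = 0) {
  row <- tree[tree[, "id"] == id, ]
  pad <- strrep("  ", depth)
  if (row["left"] == 0) {
    # Assumed terminal-node marker: no left child.
    return(paste0(pad, id, ") class ", row["right_or_class"], " *"))
  }
  c(paste0(pad, id, ") proj", row["coef_id"], " * X"),
    tree_lines(tree, row["left"], depth + 1),
    tree_lines(tree, row["right_or_class"], depth + 1))
}

out <- tree_lines(tree)
cat(out, sep = "\n")
```

Returning the lines as a character vector and printing them in a separate step keeps the traversal easy to test.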
The object x, invisibly
PPtreeExtclass, PPtreeExt_split,
predict.PPtreeExtclass
Construct the projection pursuit classification tree extensions
TreeExt.construct(origclass, origdata, Tree.Struct, id, rep, rep1, rep2, projbest.node, splitCutoff.node, PPmethod, lambda = NULL, q = 1, weight = TRUE, srule = TRUE, tot = NULL, tol = 0.5, ...)
origclass |
factor or numeric vector containing the class labels for each observation. |
origdata |
data frame with the original data without class variable |
Tree.Struct |
tree structure of projection pursuit classification tree |
id |
tree node id |
rep |
internal counter for nodes |
rep1 |
internal counter for nodes |
rep2 |
internal counter for nodes |
projbest.node |
best projection node |
splitCutoff.node |
cutoff node |
PPmethod |
method for projection pursuit; "LDA", "PDA" |
lambda |
lambda in PDA index |
q |
numeric value with dimension of the projected data, if it is 1 then 1D projection is used |
weight |
weight flag in LDA, PDA |
srule |
stopping rule flag; if TRUE use stopping rule, if FALSE stop only for pure or empty nodes |
tot |
total number of observations |
tol |
tolerance value for entropy stopping rule for splitting a node |
... |
additional arguments to pass through |
Find tree structure using various projection pursuit indices of classification in each split.
This function recursively constructs a binary classification tree using projection pursuit. At each node, it finds the optimal projection direction that best separates classes, determines a cutpoint, and creates child nodes until stopping criteria are met (pure nodes, small node size, or low entropy).
A list containing the complete tree structure and node information:
Tree.Struct |
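A matrix where each row represents a node in the projection pursuit classification tree. The matrix has 5 columns: node ID, left child node ID, right/final node ID (class label if terminal node), coefficient ID, and index value. |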
projbest.node |
A matrix where each row contains the optimal projection coefficients (Alpha vector) for each split node. |
splitCutoff.node |
A matrix/vector containing the optimal cutpoint values used at each split node. |
rep |
Integer counter tracking the current node being processed (internal use). |
rep1 |
Integer counter for assigning child node IDs (internal use). |
rep2 |
Integer counter for tracking projection indices (internal use). |
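To show how the three returned pieces fit together, the sketch below classifies single observations by walking a small hand-made tree. The matrices are invented for the example, a single cutoff per split is assumed, and a terminal node is assumed to be marked by a left child ID of 0; the rule of going left when the projected value is below the cutoff follows the "< cut" and ">= cut" edge labels used in the tree plot.

```r
# Hypothetical tree: node ID, left child, right child or class,
# coefficient ID, index value.
tree <- rbind(
  c(1, 2, 3, 1, 0.95),
  c(2, 0, 1, 0, 0),
  c(3, 4, 5, 2, 0.80),
  c(4, 0, 2, 0, 0),
  c(5, 0, 3, 0, 0)
)
proj <- rbind(c(1, 0), c(0, 1))   # one row of coefficients per split
cuts <- c(2, 5)                   # one cutoff per split (assumption)

# Walk from the root to a terminal node and return its class code.
classify_one <- function(x, tree, proj, cuts) {
  id <- 1
  repeat {
    row <- tree[tree[, 1] == id, ]
    if (row[2] == 0) return(row[3])      # assumed terminal-node marker
    k <- row[4]                          # which projection and cutoff to use
    score <- sum(proj[k, ] * x)          # 1D projection of the observation
    id <- if (score < cuts[k]) row[2] else row[3]
  }
}

classify_one(c(1, 9), tree, proj, cuts)   # 1
classify_one(c(3, 4), tree, proj, cuts)   # 2
classify_one(c(3, 6), tree, proj, cuts)   # 3
```

Each split uses its own projection, so an observation can be routed by different linear combinations of the variables at different depths.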
A data set containing 178 observations of wines from 3 cultivars grown in Italy.
The class variable has 3 classes, one for each of the 3 cultivars.
Check vbles
data(wine)
A data frame with 178 rows and 14 variables