TY - JOUR

T1 - PLS1-MD

T2 - A partial least squares regression algorithm for solving missing data problems

AU - González, Víctor

AU - Giraldo, Ramón

AU - Leiva, Víctor

N1 - Publisher Copyright: © 2023 Elsevier B.V.

PY - 2023/9/15

Y1 - 2023/9/15

AB - In this article, we propose a methodology that modifies the partial least squares (PLS) regression algorithm. Certain steps of the algorithm are adjusted to address the estimation problem in multiple linear regression when there are missing data (MD). The modified algorithm is called PLS1-MD and is based on the available data principle, allowing for multiple regression analysis even when there are missing values in the response or some of the explanatory variables, without the need for imputation. PLS1-MD can be applied under conditions of multicollinearity (where the explanatory variables are correlated, resulting in linear combinations among columns of the design matrix) and high dimensionality (where the number of individuals is less than the number of variables). The PLS1-MD algorithm ensures orthogonality, orthonormality of the coefficient vector, and optimality at each stage. The procedure is illustrated using the Cornell and Yarn datasets, which are widely known in the context of PLS1 regression. For this purpose, 10% of the data is randomly deleted and labeled as MD. The results indicate that the estimates obtained with the PLS1-MD algorithm are very similar to those generated when applying PLS1 to the set of observations with no MD. This new algorithm does not require imputing missing values, thus preserving the properties of centrality and orthogonality. We compare the results obtained using our approach with those obtained using the R libraries named pls and plsdepot. Under the scenario of no MD, we obtain the same results. In the presence of MD, the library pls cannot be used and only plsdepot solves the problem when there are MD in the explanatory variables.

KW - Missing data

KW - Orthonormality

KW - PLS1 regression

KW - Sparse matrices

UR - http://www.scopus.com/inward/record.url?scp=85161972379&partnerID=8YFLogxK

U2 - 10.1016/j.chemolab.2023.104876

DO - 10.1016/j.chemolab.2023.104876

M3 - Article

AN - SCOPUS:85161972379

SN - 0169-7439

VL - 240

JO - Chemometrics and Intelligent Laboratory Systems

JF - Chemometrics and Intelligent Laboratory Systems

M1 - 104876

ER -