Some notes on the consequences of pretreatment of multivariate data

Author's Department

Mathematics & Actuarial Science Department

Find in your Library

https://doi.org/10.1016/j.ins.2024.121580

All Authors

Ali S. Hadi, Rida Moustafa

Document Type

Research Article

Publication Title

Information Sciences

Publication Date

2-1-2025

doi

10.1016/j.ins.2024.121580

Abstract

With the advent of data technologies, we now have various types of data, such as structured, unstructured, and semi-structured. Applying certain statistical or machine learning techniques may require careful preprocessing, or pretreatment, of the data to make them suitable for analysis. For example, given a data matrix X, which represents n multivariate observations or cases on p variables or features, the columns or rows of X may be pretreated before statistical or machine learning techniques are applied to the data. While centering and/or scaling the variables alters neither the correlation structure nor the graphical representation of the data, centering or scaling the observations does. We investigate various row pretreatment methods more closely and show, with theoretical proofs and numerical examples, that centering or scaling the rows of X changes both the graphical structure of the observations in the multi-dimensional space and the correlation structure among the variables. There may be good reasons for performing row centering or scaling on the data, and we are not against it, but analysts who use such row operations should be aware of the changes to the geometrical and correlation structures that these operations impose on the data, and should also demonstrate that the process results in a new, more appropriate structure for their questions.
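
The contrast between column and row pretreatment can be checked numerically. The following is a minimal sketch, not the authors' code: it assumes NumPy, uses simulated data with independent columns (not data from the article), and all names (`X`, `corr`, `X_col`, `X_row`, `X_unit`) are illustrative. It compares the correlation matrix of the variables before and after each pretreatment.

```python
import numpy as np

# Simulated data (an assumption for illustration): n observations on
# p independent variables, so the true correlations are all zero.
rng = np.random.default_rng(0)
n, p = 500, 4
X = rng.normal(size=(n, p))

def corr(M):
    """Correlation matrix among the columns (variables) of M."""
    return np.corrcoef(M, rowvar=False)

# Column pretreatment: center and scale each variable.
X_col = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)

# Row pretreatment: center each observation on its own mean.
X_row = X - X.mean(axis=1, keepdims=True)

# Row scaling: rescale each observation to unit Euclidean length.
X_unit = X / np.linalg.norm(X, axis=1, keepdims=True)

# Largest absolute change in any pairwise correlation.
col_change = np.abs(corr(X_col) - corr(X)).max()
row_change = np.abs(corr(X_row) - corr(X)).max()

print("max correlation change after column centering/scaling:", col_change)
print("max correlation change after row centering:", row_change)
print("row sums after row centering (first 3):", X_row.sum(axis=1)[:3])
print("row norms after row scaling (first 3):", np.linalg.norm(X_unit, axis=1)[:3])
```

In this sketch the column-pretreated correlations agree with the originals up to floating-point error, while row centering forces every observation to sum to zero. That constraint confines the observations to a (p - 1)-dimensional hyperplane and induces negative correlations among variables that were generated independently. Row scaling to unit length likewise moves every observation onto the unit sphere.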