Principal Component Analysis
Before delving into the workings of the algorithm, it is worth setting the context: where principal component analysis comes in, and why it is essential in machine learning.
Suppose you are a car dealer and you need to evaluate the value of a vehicle based on its features. A vehicle has a great many features, specifications and parts that affect its price, and it would be very complicated to factor every single feature and component of the vehicle into the price determination process.
Imagine even factoring in the screws that hold the car together. That would be far too complicated and hectic. It would be much easier to identify the most important features of the vehicle, the ones that strongly affect the price, because other features (such as the screws mentioned above) matter far less. Having fewer features to consider makes the evaluation process much simpler. That is where PCA comes in: it strips away the intricacies of a dataset, simplifying it for easier processing and for easier application of machine learning algorithms.
Key terms: variance, dimensionality reduction, eigenvectors, eigenvalues, singular value decomposition.
What is PCA?
Principal Component Analysis is a dimensionality reduction technique that aims to simplify complex datasets while capturing their most important aspects. The algorithm does this by projecting ("flattening") the data onto fewer dimensions, focusing on the directions in which the data points differ most, so that almost all of the structure of the data is retained.
From a statistical point of view, PCA aims to capture the directions in a dataset along which the variance of the data is greatest. It is worth noting that when the PCA algorithm is applied to data, some of the original information is lost. Ideally, we are trading a little bit of the data’s accuracy for simplicity.
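To make this trade-off concrete, here is a minimal sketch using scikit-learn's PCA on a synthetic dataset; the data, shapes and component count are purely illustrative:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))        # 100 samples, 10 features (synthetic)

pca = PCA(n_components=2)             # keep only the 2 strongest directions
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2): simpler, but some detail lost
print(pca.explained_variance_ratio_)  # share of variance each component keeps
```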
Just for the record, PCA’s history traces back to 1901, when it was invented by Karl Pearson as an analogue of the principal axis theorem in mechanics; it was later independently developed and named by Harold Hotelling in the 1930s.
Statistical interpretation
This section walks through the statistical steps used to arrive at the principal components.
1. Standardisation
This step involves a few mathematical computations:
(i) Compute the mean of the rows (equivalently, the mean of each variable) using the formula below.
x̄ = (1/n) ∑ᵢ₌₁ⁿ xᵢ
(ii) Calculate the average matrix of the dataset.
This is done by multiplying a column vector of ones by the mean row obtained in the step above, as the formula below shows.
X̄ = [1, 1, …, 1]ᵀ · x̄
X̄ is now an average matrix of the dataset: every one of its rows equals the mean vector x̄.
(iii) Subtract the average matrix from the data matrix.
M = X − X̄
where X is the data matrix and X̄ is the average matrix from step (ii). M is the mean-centred data.
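The following is a minimal NumPy sketch of this standardisation (mean-centring) step; the toy matrix and its shape are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))              # toy data: 6 samples, 3 variables

x_bar = X.mean(axis=0)                   # (i) mean of the rows: the mean vector
X_bar = np.ones((X.shape[0], 1)) @ x_bar.reshape(1, -1)  # (ii) average matrix
M = X - X_bar                            # (iii) mean-centred data matrix

print(np.allclose(M.mean(axis=0), 0))    # True: every column now averages to 0
```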
2. Compute the covariance matrix of the centred data using the formula below.
K = Mᵀ · M / (n − 1)
The 1/(n − 1) factor turns the raw product into the sample covariance; some texts omit it, since a constant scaling does not change the eigenvectors found in the next step.
The main aim of computing the covariance matrix is to determine the direction and strength of the relationships between the variables.
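A short sketch of this step, continuing from the centred matrix M above (again on made-up toy data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
M = X - X.mean(axis=0)                   # centred data from step 1

n = M.shape[0]
K = (M.T @ M) / (n - 1)                  # 3 x 3 sample covariance matrix

print(np.allclose(K, np.cov(X, rowvar=False)))  # True: matches NumPy's np.cov
```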
3. Perform eigendecomposition.
Eigendecomposition is a matrix decomposition method that breaks a matrix into its constituent parts, namely its eigenvalues and eigenvectors, which are used in the next step of the process. Given the matrix K, eigendecomposition yields a matrix whose columns are the eigenvectors, denoted F, and a diagonal matrix of the corresponding eigenvalues, denoted G.
Therefore K = F · G · Fᵀ (because K is symmetric, its eigenvectors can be chosen orthonormal, so Fᵀ plays the role of the inverse of F).
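As a sketch, NumPy's np.linalg.eigh performs this decomposition for symmetric matrices such as K (same toy data as before):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
M = X - X.mean(axis=0)
K = (M.T @ M) / (M.shape[0] - 1)

G, F = np.linalg.eigh(K)                 # eigenvalues G, eigenvector columns F
order = np.argsort(G)[::-1]              # sort by variance, largest first
G, F = G[order], F[:, order]

print(np.allclose(K, F @ np.diag(G) @ F.T))  # True: K = F G F^T
```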
4. Get the principal components.
The principal components are the product of the centred data matrix from the standardisation step and the eigenvector matrix:
T = M · F
where T holds the principal components (the scores), F holds the eigenvectors (also referred to as loadings), and M is the result of the standardisation process.
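Putting the steps together, here is a sketch of the final projection on the same toy data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))
M = X - X.mean(axis=0)                   # step 1: centre
K = (M.T @ M) / (M.shape[0] - 1)         # step 2: covariance
G, F = np.linalg.eigh(K)                 # step 3: eigendecomposition
order = np.argsort(G)[::-1]
G, F = G[order], F[:, order]

T = M @ F                                # step 4: principal component scores

# The scores are uncorrelated and their variances equal the eigenvalues.
print(np.allclose(np.cov(T, rowvar=False), np.diag(G)))  # True
```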
In a nutshell, the whole process explained above decomposes the data matrix along its directions of maximum variance, in order to capture the most important features of the data.
Principal components are uncorrelated by construction: each one captures variation in the data that is not already explained by the components before it.
The number of principal components produced by this process equals the number of variables in the data.
The first principal component captures the maximum variance in the data; each subsequent component captures the maximum variance that remains.
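For instance, with hypothetical eigenvalues of 4.2, 1.1 and 0.4, the first component alone would account for roughly 74% of the total variance:

```python
import numpy as np

eigenvalues = np.array([4.2, 1.1, 0.4])   # hypothetical values for illustration
print(eigenvalues / eigenvalues.sum())    # [0.737 0.193 0.070]
```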
Lastly, it is worth noting that principal components are hard to interpret: each one is a linear combination of the original variables, which makes its meaning difficult to pin down.