PCA - Principal Component Analysis

PCA is a standard technique for visualizing high-dimensional data and for data pre-processing. PCA reduces the dimensionality (the number of variables) of a data set while retaining as much of the variance as possible.

PCA and Bioinformatics

Illustrated are three-dimensional gene expression data which lie mainly within a two-dimensional subspace. PCA is used to visualize these data by reducing their dimensionality: the three original variables (genes) are reduced to two new variables termed principal components (PCs). Left: Using PCA, we can identify the two-dimensional plane that captures the most variance in the data. Right: This two-dimensional subspace can then be rotated and presented as a two-dimensional component space.
Such a two-dimensional visualization of the samples allows us to draw qualitative conclusions about the separability of experimental conditions (marked by different colors).
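
The situation described above can be imitated with synthetic data. The following is a minimal sketch, assuming NumPy is available; the data, the plane and the noise level are invented for illustration and are not the gene expression data from the figure. It builds three-dimensional samples that lie close to a two-dimensional plane and checks how much variance the first two components capture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples, 3 variables ("genes"), close to a 2-D plane.
n_samples = 200
latent = rng.normal(size=(n_samples, 2)) * np.array([3.0, 1.5])
plane = np.array([[1.0, 0.5, 0.2],     # two direction vectors spanning
                  [0.0, 1.0, -0.7]])   # the assumed plane in 3-D
X = latent @ plane + rng.normal(scale=0.1, size=(n_samples, 3))

# Center the data, then obtain the principal directions (rows of Vt) by SVD.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Fraction of the total variance captured by each component.
explained = s**2 / np.sum(s**2)

# Coordinates of every sample in the 2-D component space (PC 1, PC 2).
scores_2d = Xc @ Vt[:2].T

print("variance fraction per component:", np.round(explained, 4))
print("shape of 2-D representation:", scores_2d.shape)
```

Plotting the two columns of scores_2d against each other, colored by experimental condition, would give the kind of two-dimensional view described above.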

PCA transformation

Principal component analysis (PCA) rotates the original data space such that the axes of the new coordinate system point in the directions of highest variance of the data. The axes or new variables are termed principal components (PCs) and are ordered by variance: The first component, PC 1, represents the direction of the highest variance of the data. The second component, PC 2, represents the direction of highest remaining variance among all directions orthogonal to the first component. This extends naturally to any required number of components, which together span a component space covering the desired amount of variance.
Since components describe specific directions in the data space, each component depends to some degree on each of the original variables: Each component is a linear combination of all original variables.
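
One common way to compute this rotation is an eigendecomposition of the covariance matrix of the centered data. The sketch below assumes NumPy; the function name pca_eig and the random test data are illustrative choices, not part of the original text. It returns the components ordered by variance and shows that each component score is a weighted sum of the original variables.

```python
import numpy as np

def pca_eig(X):
    """PCA via eigendecomposition of the covariance matrix.

    X has shape (n_samples, n_variables).
    Returns (variances, components, scores); the rows of `components`
    are the principal directions, ordered by decreasing variance.
    """
    Xc = X - X.mean(axis=0)                  # center each variable
    cov = np.cov(Xc, rowvar=False)           # (n_variables, n_variables)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # reorder: highest variance first
    variances = eigvals[order]
    components = eigvecs[:, order].T
    scores = Xc @ components.T               # data in the rotated coordinates
    return variances, components, scores

# Hypothetical example data: 100 samples, 4 correlated variables.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))

variances, components, scores = pca_eig(X)

# PC 1 as an explicit linear combination of the original (centered) variables.
Xc = X - X.mean(axis=0)
pc1_by_hand = sum(components[0, j] * Xc[:, j] for j in range(X.shape[1]))

print("variance per component:", np.round(variances, 3))
print("weights of PC 1:", np.round(components[0], 3))
```

The rows of components are mutually orthogonal unit vectors, so the transformation is a rotation (possibly combined with a reflection) of the centered data; the sign of each component is arbitrary.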

Dimensionality reduction

Low variance can often be assumed to represent undesired background noise. Under this assumption, the dimensionality of the data can be reduced with little loss of relevant information by extracting a lower-dimensional component space covering the highest variance. Using a small number of principal components instead of the high-dimensional original data is a common pre-processing step that often improves the results of subsequent analyses such as classification.
For visualization, the first and second components can be plotted against each other to obtain a two-dimensional representation of the data that captures as much of the variance (assumed to be most of the relevant information) as any two-dimensional linear projection can. Such a plot is useful for analyzing and interpreting the structure of a data set.