## Wednesday, July 28, 2010

### [Math] Principal Component Analysis

Principal Component Analysis (PCA) is a simple and useful technique for identifying patterns in high-dimensional data.

MATLAB has a function, princomp, that performs PCA (newer MATLAB releases replace it with pca):

[coeff, score, latent]=princomp(X)

X: the data arranged by row, i.e. each row is an observation (an instance) and each column corresponds to a random variable. Suppose the dimension of X is [n,p], that is, there are n observations in total, each of dimension p;

coeff: the returned eigenvector matrix. Each column is an eigenvector (a principal component), and the columns are ordered by their corresponding eigenvalues from largest to smallest. coeff has dimension [p,p];

score: the representation of X in the principal component space, i.e. the mean-centered data projected onto the eigenvectors. It has dimension [n,p]. If all the eigenvectors are used, X can be recovered exactly as score * coeff' plus the column means; if the eigenvectors with small eigenvalues are dropped, the reconstruction differs from X only slightly;

latent: the eigenvalues corresponding to the eigenvectors in coeff, i.e. the variance of the data along each principal component.

With the principal components (i.e. the eigenvectors), the projected data are obtained by:

score = ( X - mean(X) ) * coeff

where mean(X) is the row vector of column means, subtracted from every row of X.
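The projection formula can be checked numerically. The following is a minimal NumPy sketch, not MATLAB's princomp itself: the data matrix X is made up for illustration, and the names mu, Xc, X_full, X_k and k are assumptions of this example. It computes score from the centered data, then rebuilds X from score using either all eigenvectors or only the first k.

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up data: n = 200 observations (rows), p = 3 correlated variables (columns).
X = rng.normal(size=(200, 3)) @ np.array([[2.0, 0.5, 0.1],
                                          [0.0, 1.0, 0.3],
                                          [0.0, 0.0, 0.2]])

mu = X.mean(axis=0)   # column means, shape (p,)
Xc = X - mu           # mean-centered data

# Eigen-decomposition of the covariance matrix; eigh returns ascending
# eigenvalues, so reorder them from largest to smallest.
latent, coeff = np.linalg.eigh(np.cov(Xc, rowvar=False))
order = np.argsort(latent)[::-1]
latent, coeff = latent[order], coeff[:, order]

# score = (X - mean(X)) * coeff
score = Xc @ coeff

# Using all eigenvectors recovers X up to floating-point error.
X_full = score @ coeff.T + mu

# Keeping only the first k eigenvectors gives an approximation of X.
k = 2
X_k = score[:, :k] @ coeff[:, :k].T + mu

print("max error, all components:", np.abs(X_full - X).max())
print("max error, first k components:", np.abs(X_k - X).max())
```

The squared reconstruction error with k components, divided by n - 1, equals the sum of the discarded eigenvalues, which is why dropping eigenvectors with small eigenvalues changes the data only a little.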

The eigenvectors and eigenvalues are actually the eigenvectors and eigenvalues of the covariance matrix of the data.

Thus, to compute them:
1) first subtract the mean of the data;
2) compute the covariance matrix of the centered data;
3) compute the eigenvectors and eigenvalues of the covariance matrix, and sort them by eigenvalue from largest to smallest;

then it's done.
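The three steps can be sketched in a few lines of NumPy. This is an illustrative stand-in rather than the princomp implementation: the function name pca_via_covariance and the small sample matrix are hypothetical, and the covariance is normalized by n - 1 as an assumption about the convention.

```python
import numpy as np

def pca_via_covariance(X):
    """Hypothetical helper returning (coeff, score, latent) for row-wise data X."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]

    # 1) subtract the mean of each column
    Xc = X - X.mean(axis=0)

    # 2) covariance matrix of the data, shape (p, p); n - 1 normalization assumed
    C = Xc.T @ Xc / (n - 1)

    # 3) eigenvectors and eigenvalues of the symmetric covariance matrix;
    #    eigh returns them in ascending order, so flip to largest first
    latent, coeff = np.linalg.eigh(C)
    order = np.argsort(latent)[::-1]
    latent, coeff = latent[order], coeff[:, order]

    # projection of the centered data onto the eigenvectors
    score = Xc @ coeff
    return coeff, score, latent

# Made-up sample: 5 observations of 2 variables.
X = np.array([[2.5, 2.4],
              [0.5, 0.7],
              [2.2, 2.9],
              [1.9, 2.2],
              [3.1, 3.0]])
coeff, score, latent = pca_via_covariance(X)
print(coeff.shape, score.shape, latent.shape)
```

Eigenvectors are only defined up to sign, so individual columns of coeff (and the matching columns of score) may be flipped relative to what another PCA routine returns.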