Principal Components Analysis

The purpose of this TP is to learn to compute and interpret principal components using R.



As usual, make sure that you read the help for any new functions that you use.


Iris data

Start off by loading the iris dataset, and learn a little about it. To compute the principal components you will need only the first 4 columns, so it is easiest to extract those. Using some of the functions from the earlier practicals, explore the data by obtaining summary statistics and graphs. In addition to any univariate graphs, you can also look at pairwise scatterplots of the variables with the pairs function.

data(iris)
names(iris)
ir <- as.matrix(iris[,1:4])
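One way to do the suggested exploration (a sketch; exactly which summaries and plots you produce is up to you):

```r
ir <- as.matrix(iris[, 1:4])               # as defined above
summary(ir)                                # five-number summaries plus means
apply(ir, 2, sd)                           # standard deviations of the four variables
pairs(ir, col = as.numeric(iris$Species))  # pairwise scatterplots, colored by species
```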

Use the function prcomp to obtain the principal components without scaling the variables; make sure to save the result (as, for example, ir.pca):

ir.pca <- prcomp(ir)
ir.pca

You can visualize the data in principal component space by plotting, for example, as follows (here the three species are plotted in different colors; this is not necessarily the most efficient way to do it, but it is probably the easiest way to see how the code works):

plot(ir.pca$x[,1],ir.pca$x[,2],type="n")
points(ir.pca$x[1:50,1],ir.pca$x[1:50,2],col=2)
points(ir.pca$x[51:100,1],ir.pca$x[51:100,2],col=3)
points(ir.pca$x[101:150,1],ir.pca$x[101:150,2],col=4)
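The same picture can be produced in a single call by using the species factor directly to set the colors; this is just an alternative idiom, not a required step:

```r
ir.pca <- prcomp(as.matrix(iris[, 1:4]))  # as computed above
# color by species: setosa=2, versicolor=3, virginica=4, matching the code above
plot(ir.pca$x[, 1], ir.pca$x[, 2],
     col = as.numeric(iris$Species) + 1,
     xlab = "PC1", ylab = "PC2")
```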

You should now assess how many components it makes sense to retain. There are a few ways to do this. First, look at the screeplot (use the argument type="lines"). Next, see how many eigenvalues are bigger than the average eigenvalue; if you have read the help documentation for prcomp carefully enough, you should be able to find these easily (hint: look at the output component sdev; you can get the mean eigenvalue as mean(ir.pca$sdev^2)).
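Putting the hint into code (a sketch; the variable name eig is mine):

```r
ir.pca <- prcomp(as.matrix(iris[, 1:4]))  # unscaled PCA, as above
screeplot(ir.pca, type = "lines")         # look for an "elbow" in the curve
eig <- ir.pca$sdev^2                      # eigenvalues are the squared standard deviations
eig > mean(eig)                           # which eigenvalues exceed the average?
```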

Which variable(s) seem most important? You can get an idea of this by looking at the component loadings, that is, the coefficients of the variables in the linear combinations (ir.pca$rotation).
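For example, to inspect the loadings on the first component (a sketch):

```r
ir.pca <- prcomp(as.matrix(iris[, 1:4]))  # unscaled PCA, as above
ir.pca$rotation[, 1]                      # loadings of the four variables on PC1
which.max(abs(ir.pca$rotation[, 1]))      # the variable weighted most heavily
```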

Now redo your analysis with the variables scaled (that is, using the correlation matrix rather than the covariance matrix). This is easily done by setting the scale. argument of prcomp to TRUE (notice the '.'). Do you come to the same conclusions? What do you recommend?
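For comparison, the scaled analysis might look like this (ir.pca.s is my name, not a required one):

```r
ir <- as.matrix(iris[, 1:4])
ir.pca.s <- prcomp(ir, scale. = TRUE)  # PCA on the correlation matrix
summary(ir.pca.s)                      # proportion of variance per component
ir.pca.s$sdev^2                        # eigenvalues now sum to 4, the number of variables
```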

Food data

If you have time and feel like a little more fun, you can download the food dataset (below the lab link).

To read it into R, type (assuming you have saved food.txt in your working directory)

food <- read.table("food.txt",header=TRUE,row.names=1)

Again, make a pairs plot, and obtain the principal components (using scaling). Look at the screeplot and eigenvalues to decide on a reasonable number of principal components to retain. You can also look at a scatterplot of the data on the first two principal components as follows (assuming you have saved the output from prcomp as food.pc):

plot(food.pc$x[,1],food.pc$x[,2],type="n")
text(food.pc$x[,1],food.pc$x[,2],labels=row.names(food))

There is a general pattern that most of the foods fall into; which foods are different? Which foods are at the left end of the first dimension? Which foods are high on the second dimension?

Go back to the pairs plot: you should notice a number of outliers in some of the plots. These may be influential in determining the principal components. Can you think of a way to deal with this problem in the analysis?
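One possibility, among others, is to replace the ordinary covariance matrix with a robust estimate that downweights outliers, and extract the components from that. The sketch below uses MASS::cov.rob (MASS ships with standard R installations) and, since food.txt may not be at hand, illustrates the idea on the iris matrix; substitute your food data:

```r
library(MASS)                      # for cov.rob; MASS is included with R
x <- as.matrix(iris[, 1:4])        # stand-in data; replace with the food matrix
rob <- cov.rob(x)                  # robust (minimum volume ellipsoid) covariance estimate
pc.rob <- eigen(cov2cor(rob$cov))  # principal components of the robust correlation matrix
pc.rob$values                      # robust eigenvalues, for a screeplot-style comparison
```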