Linear Regression Analysis

The purpose of this TP is to practice using R for linear modeling. We will apply linear regression to two problems with data in the HSAUR package. To carry out these analyses, first install and then load the R packages HSAUR3 and gamair.

As usual, make sure that you read the help for any new functions that you use.

Install R

If you have not done so already, install R on your computer. You can find the software at the Comprehensive R Archive Network, or CRAN. There is a Swiss mirror site at http://cran.ch.r-project.org/. If you go to that site, you will find several links under 'Documentation' (the fourth major entry on the left side). 'Official' documentation is available under 'Manuals'; other helpful documentation is under 'Contributed'. 

If you are not used to using R, or if you need a refresher, you can work on the first few sections (Basic R and the hellung data of TP 0) 

To proceed, you will need to start R. You can copy/paste the code below at the R prompt. 

Hubble data

Here, these data will be used to estimate the age of the universe. The hubble data are found in the the gamair package: 

library(gamair)
data(hubble)
names(hubble)

Make sure to read the help about the hubble data. 

The function plot is used to make a scatterplot of velocity (y) vs. distance (x) \- read the help for plot to see how it is used. You can access the components of hubble using the $ operator: i.e. hubble$x for the "x" component. 

Now fit a linear model to predict velocity from distance. Here, we will fit without an intercept term (`-1'), as it does not make physical sense in this case. 

hmod <- lm(y ~ x - 1, data=hubble)

and extract the estimated model coefficient via: 

coef(hmod) 

You can visualize the results with a number of plots: 

layout(matrix(1:2,ncol=2))
plot(y ~ x, data = hubble)
abline(hmod)
plot(hmod, which = 1) 

You can have a look at some of the other plot types for lm objects by varying `which' between 1 and 6 (you can also read the help for the plot method on an lm object, plot.lm). 

We can use the estimated coefficient to find an approximate value for the age of the universe. The Hubble constant has units of km^(-1) x sec^(-1) x Mpc^(-1). A mega-parsec (Mpc) is 3.09 x 10^(19) km, so we need to divide the estimated value b1 by this amount to obtain Hubble's constant wiht units of sec^(-1). The approximate age of the universe in seconds will then be the inverse of this calculation (you will want to divide Mpc by the number of seconds in a year to obtain a result in years. 

Carry out the appropriate calculations in R; you should obtain a result of approximately 12.8 billion years. 

Cloud Seeding Data

After loading the HSAUR3 package, have a look at the clouds dataset. In this problem, we are interested in modeling rainfall as a function of seeding and other variables. 

Construct boxplots of rainfall in each category for the dichotomous variables and scatterplots for the continuous variables: 

data("clouds", package = "HSAUR3")
layout(matrix(1:2, nrow = 2))
bxpseeding <- boxplot(rainfall ~ seeding, data = clouds, ylab = "Rainfall", xlab = "Seeding")
bxpecho <- boxplot(rainfall ~ echomotion, data = clouds, ylab = "Rainfall", xlab = "Echo Motion")
layout(matrix(1:4, nrow = 2))
plot(rainfall ~ time, data = clouds)
plot(rainfall ~ cloudcover, data = clouds)
plot(rainfall ~ sne, data = clouds)
plot(rainfall ~ prewetness, data = clouds)
clouds.formula <- rainfall ~ seeding + seeding:(sne + cloudcover + prewetness + echomotion) + time
clouds.lm <- lm(clouds.formula, data = clouds)

Get a summary of the model results and extract the coefficients as you did above. 

What do these results suggest? 

Using the lm plotting method, you can examine diagnostic plots for these results to assess model assumptions and identify any influential points. You could instead access the residuals and fitted values and make plots with those directly if you want: 

clouds.resid <- residuals(clouds.lm)
clouds.fitted <- fitted(clouds.lm) 

We can also look at the regression relationship between S-Ne and rainfall with and without seeding: 

layout(1)
psymb <- as.numeric(clouds$seeding)
plot(rainfall ~ sne, data = clouds, pch = psymb, xlab = "S-Ne criterion")
abline(lm(rainfall ~ sne, data = clouds, subset = seeding == "no"))
abline(lm(rainfall ~ sne, data = clouds, subset = seeding == "yes"), lty = 2)
legend("topright", legend = c("No seeding", "Seeding"), pch = 1:2, lty = 1:2, bty = "n") 

Make sure you can interpret this plot. Is there an interaction between seeding and S-Ne? Explain.