An Introduction to Statistical Learning:

2.3 Lab: Introduction to R

2.3.1 Basic Commands

Here's an example of the R function c() that we can use to concatenate values into a vector or a list.

x <- c(1, 3, 2, 5)
x
## [1] 1 3 2 5
x <- c(1, 6, 2)
x
## [1] 1 6 2
y <- c(1, 4, 3)

We can check the length of a vector or list with the length() function.

length(x)
## [1] 3
length(y)
## [1] 3
x + y
## [1]  2 10  5

To get a list of all objects in the current environment we use the ls() function. Objects can be removed from the environment with the rm() function.

ls()
## [1] "reinstall_rgitbook" "x"                  "y"
rm(x, y)
ls()
## [1] "reinstall_rgitbook"

To remove all objects from the environment, we first get the list of all objects with the ls() function, and pass this list to the rm() function.

rm(list = ls())

To get help on any function, we can use the ? operator.

?matrix

A matrix is defined using the matrix() function.

x=matrix(data=c(1,2,3,4), nrow=2, ncol=2)
x
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

When calling a function in R, positional arguments can be speficied without the names of the arguments as shown below. Note that the resulting matrix is identical to the one in the previous call to matrix()

x <- matrix(c(1, 2, 3, 4), 2, 2)
x
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

Finally, we can also tell the matrix() function whether the data provided are in row-order or column-order. The following example shows, matrix creation with data in row-order.

matrix(c(1, 2, 3, 4), 2, 2, byrow = TRUE)
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4

We can take the square-root of each element of the matrix with the sqrt() function and use the ^ operator raise each element of the matrix to a power.

sqrt(x)
##          [,1]     [,2]
## [1,] 1.000000 1.732051
## [2,] 1.414214 2.000000
x^2
##      [,1] [,2]
## [1,]    1    9
## [2,]    4   16

The rnorm() function can be used to generate random numbers. In the following example, we use rnorm() to generate a vector of 50 random numbers.

x <- rnorm(50)
y <- x + rnorm(50, mean = 50, sd = 0.1)
cor(x, y)
## [1] 0.9940236

To get consistent results from repeated calls to rnorm(), we use the set.seed().

set.seed(1303)
rnorm(50)
##  [1] -1.1439763145  1.3421293656  2.1853904757  0.5363925179  0.0631929665
##  [6]  0.5022344825 -0.0004167247  0.5658198405 -0.5725226890 -1.1102250073
## [11] -0.0486871234 -0.6956562176  0.8289174803  0.2066528551 -0.2356745091
## [16] -0.5563104914 -0.3647543571  0.8623550343 -0.6307715354  0.3136021252
## [21] -0.9314953177  0.8238676185  0.5233707021  0.7069214120  0.4202043256
## [26] -0.2690521547 -1.5103172999 -0.6902124766 -0.1434719524 -1.0135274099
## [31]  1.5732737361  0.0127465055  0.8726470499  0.4220661905 -0.0188157917
## [36]  2.6157489689 -0.6931401748 -0.2663217810 -0.7206364412  1.3677342065
## [41]  0.2640073322  0.6321868074 -1.3306509858  0.0268888182  1.0406363208
## [46]  1.3120237985 -0.0300020767 -0.2500257125  0.0234144857  1.6598706557

The functions mean(), var(), and sd() can be used to calculate the mean, variance and standard deviation of a vector.

set.seed(3)
y <- rnorm(100)
mean(y)
## [1] 0.01103557
var(y)
## [1] 0.7328675
sqrt(var(y))
## [1] 0.8560768
sd(y)
## [1] 0.8560768

2.3.2 Graphics

Basic plotting functionality is provided by the plot() function in R. The following example shows how to produce a scatter plot and how to label the x and y axis.

x <- rnorm(100)
y <- rnorm(100)
plot(x, y)

plot(x, y, xlab = "this is the x-axis", ylab = "this is the y-axis", main = "Plot of X vs Y")

We can save the plots generated in R to a file in PDF format, or to a graphics image format such as JPEG or PNG. In the following example, we use the pdf() function to create a PDF file called Figure.pdf.

pdf("Figure.pdf")
plot(x, y, col = "green")
dev.off()
## quartz_off_screen 
##                 2

To generate sequence of numbers, we can call the seq() function.

x <- seq(1, 10) > x
x <- 1:10
x
##  [1]  1  2  3  4  5  6  7  8  9 10
x <- seq(-pi, pi, length = 50)

Contour plots can be created in R with the contour() function.

y <- x
f <- outer(x, y, function(x, y) cos(y)/(1 + x^2))
contour(x, y, f)
contour(x, y, f, nlevels = 45, add = T)

fa <- (f - t(f))/2
contour(x, y, fa, nlevels = 15)

Heatmaps can be generated with the image() function.

image(x, y, fa)

Perspective plots of a three-dimentional surface are plottted using the persp() function.

persp(x, y, fa)

persp(x, y, fa, theta = 30)

persp(x, y, fa, theta = 30, phi = 20)

persp(x, y, fa, theta = 30, phi = 70)

persp(x, y, fa, theta = 30, phi = 40)

2.3.3 Indexing Data

Indexing elements of a vector or a matrix is done using the square-brackets [ and ]. In this example we create a 4x4 matrix and then index the element at row 2, column 3.

A <- matrix(1:16, 4, 4)
A
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
## [3,]    3    7   11   15
## [4,]    4    8   12   16
A[2, 3]
## [1] 10

We can also select multiple rows or columns at once by providing the row or the column indices as vectors instead of single integers.

A[c(1, 3), c(2, 4)]
##      [,1] [,2]
## [1,]    5   13
## [2,]    7   15
A[1:3, 2:4]
##      [,1] [,2] [,3]
## [1,]    5    9   13
## [2,]    6   10   14
## [3,]    7   11   15
A[1:2, ]
##      [,1] [,2] [,3] [,4]
## [1,]    1    5    9   13
## [2,]    2    6   10   14
A[, 1:2]
##      [,1] [,2]
## [1,]    1    5
## [2,]    2    6
## [3,]    3    7
## [4,]    4    8

If a row or column index is omitted, then R assumes that we're looking for all rows or all columns.

A[1, ]
## [1]  1  5  9 13

To exclude certain elements, we can use the negative sign - before the index.

A[-c(1, 3), ]
##      [,1] [,2] [,3] [,4]
## [1,]    2    6   10   14
## [2,]    4    8   12   16
A[-c(1, 3), -c(1, 3, 4)]
## [1] 6 8

The dim() can be used to show the dimensions of a vector or a matrix.

dim(A)
## [1] 4 4

2.3.4 Loading Data

The read.table() and read.csv() functions are used to import data from a file into R.

Auto <- read.table("http://www-bcf.usc.edu/~gareth/ISL/Auto.data")
head(Auto)
##     V1        V2           V3         V4     V5           V6   V7     V8
## 1  mpg cylinders displacement horsepower weight acceleration year origin
## 2 18.0         8        307.0      130.0  3504.         12.0   70      1
## 3 15.0         8        350.0      165.0  3693.         11.5   70      1
## 4 18.0         8        318.0      150.0  3436.         11.0   70      1
## 5 16.0         8        304.0      150.0  3433.         12.0   70      1
## 6 17.0         8        302.0      140.0  3449.         10.5   70      1
##                          V9
## 1                      name
## 2 chevrolet chevelle malibu
## 3         buick skylark 320
## 4        plymouth satellite
## 5             amc rebel sst
## 6               ford torino

The read.table() and read.csv() family of functions offer a number of options for controlling how the data are imported, such as control over header rows, blank rows, or missing values.

Auto <- read.csv("http://www-bcf.usc.edu/~gareth/ISL/Auto.csv", header = T, na.strings = "?")
head(Auto)
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
## 5  17         8          302        140   3449         10.5   70      1
## 6  15         8          429        198   4341         10.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
## 5               ford torino
## 6          ford galaxie 500
dim(Auto)
## [1] 397   9
Auto[1:4, ]
##   mpg cylinders displacement horsepower weight acceleration year origin
## 1  18         8          307        130   3504         12.0   70      1
## 2  15         8          350        165   3693         11.5   70      1
## 3  18         8          318        150   3436         11.0   70      1
## 4  16         8          304        150   3433         12.0   70      1
##                        name
## 1 chevrolet chevelle malibu
## 2         buick skylark 320
## 3        plymouth satellite
## 4             amc rebel sst
Auto <- na.omit(Auto)
dim(Auto)
## [1] 392   9

We can get a list of all variables in the dataset we imported with the names() function.

names(Auto)
## [1] "mpg"          "cylinders"    "displacement" "horsepower"  
## [5] "weight"       "acceleration" "year"         "origin"      
## [9] "name"

2.3.5 Additional Graphical and Numerical Summaries

There are two ways to reference a variable from a dataset. The first uses the $ symbol to tell R that the variable cylinders is from the Auto dataset.

plot(Auto$cylinders, Auto$mpg)

The second method uses the attach() function which tells R to search for variable names in the Auto dataset.

attach(Auto)
plot(cylinders, mpg)

We can convert quantitative variables to qualitative or categorical variables with the as.factor() function.

cylinders <- as.factor(cylinders)

Catrgotrical variables are automatically plotted as box-plots by R.

plot(cylinders, mpg)

plot(cylinders, mpg, col = "red")

plot(cylinders, mpg, col = "red", varwidth = T)

plot(cylinders, mpg, col = "red", varwidth = T, horizontal = T)

plot(cylinders, mpg, col = "red", varwidth = T, xlab = "cylinders", ylab = " MPG ")

We can also get a hsitogram of the data with the hist() function.

hist(mpg)

hist(mpg, col = 2)

hist(mpg, col = 2, breaks = 15)

To generate a matrix of scatterplots, we can use the pairs()

pairs(Auto)

pairs(~mpg + displacement + horsepower + weight + acceleration, Auto)

The identify() function can be used to identify specific points on a plot. In an interactive session in R-Studio, identify() allows you to click on specific points and labels them on screen for convenient identification.

plot(horsepower, mpg)
identify(horsepower, mpg, name)

## integer(0)

The summary() function can be used to print a summary of the dataset.

summary(Auto)
##       mpg          cylinders      displacement     horsepower   
##  Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0  
##  1st Qu.:17.00   1st Qu.:4.000   1st Qu.:105.0   1st Qu.: 75.0  
##  Median :22.75   Median :4.000   Median :151.0   Median : 93.5  
##  Mean   :23.45   Mean   :5.472   Mean   :194.4   Mean   :104.5  
##  3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:275.8   3rd Qu.:126.0  
##  Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0  
##                                                                 
##      weight      acceleration        year           origin     
##  Min.   :1613   Min.   : 8.00   Min.   :70.00   Min.   :1.000  
##  1st Qu.:2225   1st Qu.:13.78   1st Qu.:73.00   1st Qu.:1.000  
##  Median :2804   Median :15.50   Median :76.00   Median :1.000  
##  Mean   :2978   Mean   :15.54   Mean   :75.98   Mean   :1.577  
##  3rd Qu.:3615   3rd Qu.:17.02   3rd Qu.:79.00   3rd Qu.:2.000  
##  Max.   :5140   Max.   :24.80   Max.   :82.00   Max.   :3.000  
##                                                                
##                  name    
##  amc matador       :  5  
##  ford pinto        :  5  
##  toyota corolla    :  5  
##  amc gremlin       :  4  
##  amc hornet        :  4  
##  chevrolet chevette:  4  
##  (Other)           :365
summary(mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00   17.00   22.75   23.45   29.00   46.60