The following parts are optional and not mandatory to pass the course. They include detailed descriptions and exercises about tables, lists and graphics, and how to save them on your computer. You can try them out if you want to learn more about how to work with tables/data.frames in R.

Section 1: Tables

Before we can read a table into ‘R’, we need to know where we are and where the file is.

getwd()

gives you the “working directory” of ‘R’. You can change it with

setwd("newdirectory")

On Windows your directory usually starts with “C:/Users/YOURNAME/”; on Linux and Mac it usually starts with “~/”. Windows itself writes paths with \(\backslash\). However, that character has another meaning in ‘R’, so it has to be replaced manually by \(/\).
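One way to avoid typing separators by hand is file.path(), which joins the parts of a path with forward slashes. The following is only a small sketch: the folder and user name are the placeholders from the text above, not real locations, and the variable names are made up for illustration.

```r
# Build a path from its parts; file.path() joins them with "/"
my_path = file.path("C:", "Users", "YOURNAME", "diabetes.txt")
print(my_path)

# Inside an R string a backslash must be doubled, because a single
# backslash starts an escape sequence
windows_path = "C:\\Users\\YOURNAME\\diabetes.txt"

# Replace every backslash by a forward slash
converted = gsub("\\", "/", windows_path, fixed = TRUE)
print(converted)

# file.exists() tells you whether R can find a file at that path
file.exists(my_path)
```

If file.exists() returns FALSE for your real file, the path or the working directory is not what you think it is.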

In order to read-in a table in text-format we can use

my_tbl=read.table("diabetes.txt")

If you can’t read the table, check where diabetes.txt is located on your computer and manually adjust the path. See the help for read.table for additional advice.

A click on my_tbl in the top-right window shows it as a formatted table.

If we check the table, we see that the column names ended up in the first data row! Quite bad! The option header=TRUE fixes this:

my_tbl=read.table("diabetes.txt",header=TRUE)

In the first column we actually don’t have data but different names in each row. Hence let’s assign them as row names:

my_tbl=read.table("diabetes.txt",header=TRUE,row.names=1)

Attention! German decimals! The file uses a comma as the decimal sign, and we have to tell ‘R’ about it:

my_tbl=read.table("diabetes.txt",header=TRUE,row.names=1,
             dec=",")
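To see what the dec option actually changes, you can try it on a tiny stand-in file. This is only a sketch: the file is a temporary one created on the spot (not the real diabetes.txt), and the names tmp, wrong and right are made up for illustration.

```r
# Write a tiny example file with comma decimals
tmp = tempfile(fileext = ".txt")
writeLines(c("Name Age BMI",
             "Person_1 64 28,71",
             "Person_2 66 29,72"), tmp)

# Without dec="," the BMI column is not recognised as numbers
wrong = read.table(tmp, header = TRUE, row.names = 1)
is.numeric(wrong$BMI)

# With dec="," the BMI column becomes numeric
right = read.table(tmp, header = TRUE, row.names = 1, dec = ",")
is.numeric(right$BMI)
right["Person_1", "BMI"]
```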

It is easier to do that with the menu File -> Import Dataset. However, you should know the underlying commands.

We know already how to address columns

my_tbl$Age
##  [1] 64 66 69 68 68 74 70 73 77 62 70 64 65 66 73 71 66 70 69 62

or

my_tbl[[2]]
##  [1] 64 66 69 68 68 74 70 73 77 62 70 64 65 66 73 71 66 70 69 62

yields the vector with the ages, while

my_tbl[2]
##           Age
## Person_1   64
## Person_2   66
## Person_3   69
## Person_4   68
## Person_5   68
## Person_6   74
## Person_7   70
## Person_8   73
## Person_9   77
## Person_10  62
## Person_11  70
## Person_12  64
## Person_13  65
## Person_14  66
## Person_15  73
## Person_16  71
## Person_17  66
## Person_18  70
## Person_19  69
## Person_20  62

yields a new table with the single column ‘Age’

Additional ways to access data

my_tbl[1,2]
## [1] 64

or

my_tbl["Person_1","Age"]
## [1] 64

yields the first element of the second column (same as with a matrix!)

my_tbl[1,]
##          Affection Age   BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_1         0  64 28.71      m  55  39           95   CC   AG

yields a table with the first row

To build a new table from a column, it would seem logical to use my_tbl[,2]. That form, however, returns a plain vector rather than a table. We already know a command that keeps the table form, namely

my_tbl[2]  # without comma!
##           Age
## Person_1   64
## Person_2   66
## Person_3   69
## Person_4   68
## Person_5   68
## Person_6   74
## Person_7   70
## Person_8   73
## Person_9   77
## Person_10  62
## Person_11  70
## Person_12  64
## Person_13  65
## Person_14  66
## Person_15  73
## Person_16  71
## Person_17  66
## Person_18  70
## Person_19  69
## Person_20  62
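Whether a selection is still a table can be checked with is.data.frame(). A small sketch, assuming a hand-typed three-row excerpt called small_tbl (a made-up name); the option drop=FALSE is an extra that is not used elsewhere in this text.

```r
# A three-row excerpt of the table, typed in by hand for illustration
small_tbl = data.frame(Age = c(64, 66, 69),
                       BMI = c(28.71, 29.72, 31.51),
                       row.names = c("Person_1", "Person_2", "Person_3"))

is.data.frame(small_tbl[2])                   # TRUE: still a table
is.data.frame(small_tbl[, 2])                 # FALSE: reduced to a vector
is.data.frame(small_tbl[, 2, drop = FALSE])   # TRUE: drop=FALSE keeps the table
```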

Several columns at once:

my_tbl[c(2,3)]
##           Age   BMI
## Person_1   64 28.71
## Person_2   66 29.72
## Person_3   69 31.51
## Person_4   68 30.44
## Person_5   68 28.52
## Person_6   74 28.73
## Person_7   70 23.39
## Person_8   73 31.25
## Person_9   77 21.61
## Person_10  62 27.51
## Person_11  70 29.45
## Person_12  64 24.45
## Person_13  65 29.24
## Person_14  66 23.99
## Person_15  73 31.24
## Person_16  71 26.58
## Person_17  66 28.70
## Person_18  70 24.28
## Person_19  69 31.70
## Person_20  62 23.24

or

my_tbl[c("Age","BMI")]
##           Age   BMI
## Person_1   64 28.71
## Person_2   66 29.72
## Person_3   69 31.51
## Person_4   68 30.44
## Person_5   68 28.52
## Person_6   74 28.73
## Person_7   70 23.39
## Person_8   73 31.25
## Person_9   77 21.61
## Person_10  62 27.51
## Person_11  70 29.45
## Person_12  64 24.45
## Person_13  65 29.24
## Person_14  66 23.99
## Person_15  73 31.24
## Person_16  71 26.58
## Person_17  66 28.70
## Person_18  70 24.28
## Person_19  69 31.70
## Person_20  62 23.24

yields a table containing only the columns two and three

Several rows at once:

my_tbl[c(1,4),]  # with comma!
##          Affection Age   BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_1         0  64 28.71      m  55  39           95   CC   AG
## Person_4         0  68 30.44      m  55 156          204   CC   AG

yields a table with the rows 1 and 4

my_tbl[1:10,]
##           Affection Age   BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_1          0  64 28.71      m  55  39           95   CC   AG
## Person_2          0  66 29.72      f  67 161          148   AC   AA
## Person_3          1  69 31.51      m  29 171          180   AA   AA
## Person_4          0  68 30.44      m  55 156          204   CC   AG
## Person_5          1  68 28.52      m  85 159           84   AA   AA
## Person_6          1  74 28.73      f  60 179          132   AC   AG
## Person_7          0  70 23.39      f  47 115           94   CC   AG
## Person_8          0  73 31.25      m  56  99          153   AC   AG
## Person_9          0  77 21.61      f 114 156          106   CC   AA
## Person_10         1  62 27.51      f  69 179          102   AA   AA

yields a table with the rows 1 to 10

my_tbl[my_tbl$Age > 70,]
##           Affection Age   BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_6          1  74 28.73      f  60 179          132   AC   AG
## Person_8          0  73 31.25      m  56  99          153   AC   AG
## Person_9          0  77 21.61      f 114 156          106   CC   AA
## Person_15         1  73 31.24      f  88 141          234   AC   AG
## Person_16        NA  71 26.58      f  70 153           93   AC   AA

yields only persons older than 70

How does this work? Try out the inner term alone

my_tbl$Age > 70
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE
## [13] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE
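Such a vector of TRUE and FALSE can be used for more than selecting rows. A minimal sketch with a few ages typed in by hand; the names age and older are made up for illustration.

```r
age = c(64, 66, 74, 73, 62)   # a few ages typed in by hand

older = age > 70   # logical vector: one TRUE/FALSE per element
older
which(older)       # positions of the TRUE entries
sum(older)         # TRUE counts as 1, so this is the number of matches
age[older]         # the same selection mechanism as for the table rows
```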

If we want to test for equality, we need two equal signs

my_tbl[my_tbl$Age == 70,]
##           Affection Age   BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_7          0  70 23.39      f  47 115           94   CC   AG
## Person_11         1  70 29.45      f  54 129          105   AA   AG
## Person_18         0  70 24.28      m  54 128          325   AC   AG

You should be able to guess what happens, if there is only one …

Let’s combine conditions

my_tbl[my_tbl$Age > 70 & my_tbl$SNP1 == "CC",]
##          Affection Age   BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_9         0  77 21.61      f 114 156          106   CC   AA

And what does this one mean?

my_tbl[my_tbl$Age >= 70 | my_tbl$SNP1 != "CC",]
##           Affection Age   BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_2          0  66 29.72      f  67 161          148   AC   AA
## Person_3          1  69 31.51      m  29 171          180   AA   AA
## Person_5          1  68 28.52      m  85 159           84   AA   AA
## Person_6          1  74 28.73      f  60 179          132   AC   AG
## Person_7          0  70 23.39      f  47 115           94   CC   AG
## Person_8          0  73 31.25      m  56  99          153   AC   AG
## Person_9          0  77 21.61      f 114 156          106   CC   AA
## Person_10         1  62 27.51      f  69 179          102   AA   AA
## Person_11         1  70 29.45      f  54 129          105   AA   AG
## Person_13         1  65 29.24      m  50 164          162   AC   AG
## Person_15         1  73 31.24      f  88 141          234   AC   AG
## Person_16        NA  71 26.58      f  70 153           93   AC   AA
## Person_18         0  70 24.28      m  54 128          325   AC   AG
## Person_19         1  69 31.70      f  76 274          191   AA   AG
## Person_20         1  62 23.24      m  48  82          150   AA   AG

my_tbl$fake=1:20

adds a column with the name fake (containing the numbers 1 to 20). What happens if we do not supply 20 numbers? Try it out!

my_tbl$fake=NULL

removes the column (no way to get it back! There is no trash bin!)

How else could one remove the column?

Sorting tables

my_tbl[20:1,]
##           Affection Age   BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_20         1  62 23.24      m  48  82          150   AA   AG
## Person_19         1  69 31.70      f  76 274          191   AA   AG
## Person_18         0  70 24.28      m  54 128          325   AC   AG
## Person_17         0  66 28.70      m 115 123           60   CC   AA
## Person_16        NA  71 26.58      f  70 153           93   AC   AA
## Person_15         1  73 31.24      f  88 141          234   AC   AG
## Person_14         0  66 23.99      m  48  97           80   CC   AA
## Person_13         1  65 29.24      m  50 164          162   AC   AG
## Person_12         0  64 24.45      f  51 164          381   CC   GG
## Person_11         1  70 29.45      f  54 129          105   AA   AG
## Person_10         1  62 27.51      f  69 179          102   AA   AA
## Person_9          0  77 21.61      f 114 156          106   CC   AA
## Person_8          0  73 31.25      m  56  99          153   AC   AG
## Person_7          0  70 23.39      f  47 115           94   CC   AG
## Person_6          1  74 28.73      f  60 179          132   AC   AG
## Person_5          1  68 28.52      m  85 159           84   AA   AA
## Person_4          0  68 30.44      m  55 156          204   CC   AG
## Person_3          1  69 31.51      m  29 171          180   AA   AA
## Person_2          0  66 29.72      f  67 161          148   AC   AA
## Person_1          0  64 28.71      m  55  39           95   CC   AG

reverses the order of rows

my_tbl[order(my_tbl$Age),]
##           Affection Age   BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_10         1  62 27.51      f  69 179          102   AA   AA
## Person_20         1  62 23.24      m  48  82          150   AA   AG
## Person_1          0  64 28.71      m  55  39           95   CC   AG
## Person_12         0  64 24.45      f  51 164          381   CC   GG
## Person_13         1  65 29.24      m  50 164          162   AC   AG
## Person_2          0  66 29.72      f  67 161          148   AC   AA
## Person_14         0  66 23.99      m  48  97           80   CC   AA
## Person_17         0  66 28.70      m 115 123           60   CC   AA
## Person_4          0  68 30.44      m  55 156          204   CC   AG
## Person_5          1  68 28.52      m  85 159           84   AA   AA
## Person_3          1  69 31.51      m  29 171          180   AA   AA
## Person_19         1  69 31.70      f  76 274          191   AA   AG
## Person_7          0  70 23.39      f  47 115           94   CC   AG
## Person_11         1  70 29.45      f  54 129          105   AA   AG
## Person_18         0  70 24.28      m  54 128          325   AC   AG
## Person_16        NA  71 26.58      f  70 153           93   AC   AA
## Person_8          0  73 31.25      m  56  99          153   AC   AG
## Person_15         1  73 31.24      f  88 141          234   AC   AG
## Person_6          1  74 28.73      f  60 179          132   AC   AG
## Person_9          0  77 21.61      f 114 156          106   CC   AA

sorts the entries with increasing age

my_tbl[order(my_tbl$Age,decreasing=TRUE),]
##           Affection Age   BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_9          0  77 21.61      f 114 156          106   CC   AA
## Person_6          1  74 28.73      f  60 179          132   AC   AG
## Person_8          0  73 31.25      m  56  99          153   AC   AG
## Person_15         1  73 31.24      f  88 141          234   AC   AG
## Person_16        NA  71 26.58      f  70 153           93   AC   AA
## Person_7          0  70 23.39      f  47 115           94   CC   AG
## Person_11         1  70 29.45      f  54 129          105   AA   AG
## Person_18         0  70 24.28      m  54 128          325   AC   AG
## Person_3          1  69 31.51      m  29 171          180   AA   AA
## Person_19         1  69 31.70      f  76 274          191   AA   AG
## Person_4          0  68 30.44      m  55 156          204   CC   AG
## Person_5          1  68 28.52      m  85 159           84   AA   AA
## Person_2          0  66 29.72      f  67 161          148   AC   AA
## Person_14         0  66 23.99      m  48  97           80   CC   AA
## Person_17         0  66 28.70      m 115 123           60   CC   AA
## Person_13         1  65 29.24      m  50 164          162   AC   AG
## Person_1          0  64 28.71      m  55  39           95   CC   AG
## Person_12         0  64 24.45      f  51 164          381   CC   GG
## Person_10         1  62 27.51      f  69 179          102   AA   AA
## Person_20         1  62 23.24      m  48  82          150   AA   AG

sorts the entries with decreasing age

Attention: The variable my_tbl remains unchanged! In order to work with the sorted table, one has to store it either in the same or (preferably) in another variable.
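Storing the sorted table under a new name could look like the sketch below. It assumes a hand-typed four-row excerpt called small_tbl (a made-up name) and additionally sorts by BMI within equal ages, which order() supports by taking several vectors.

```r
small_tbl = data.frame(Age = c(69, 64, 66, 64),
                       BMI = c(31.51, 28.71, 29.72, 24.45),
                       row.names = c("Person_3", "Person_1",
                                     "Person_2", "Person_12"))

# Sort by Age, and by BMI within equal ages; store the result in a NEW variable
sorted_tbl = small_tbl[order(small_tbl$Age, small_tbl$BMI), ]

rownames(sorted_tbl)   # row order after sorting
rownames(small_tbl)    # the original table is unchanged
```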

Factors

How do ‘factors’ work?

Let’s add a new column for the smoking status of a person. We have two categories (1 and 2), and two labels for them (“smoker” and “non-smoker”).

my_tbl$smoking_status=factor(c(1,1,2,1,2,2,1,1,1,1,
                               2,1,2,2,2,2,1,2,2,1),
                             labels=c("smoker","non-smoker"))

Check the table with

my_tbl
##           Affection Age   BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_1          0  64 28.71      m  55  39           95   CC   AG
## Person_2          0  66 29.72      f  67 161          148   AC   AA
## Person_3          1  69 31.51      m  29 171          180   AA   AA
## Person_4          0  68 30.44      m  55 156          204   CC   AG
## Person_5          1  68 28.52      m  85 159           84   AA   AA
## Person_6          1  74 28.73      f  60 179          132   AC   AG
## Person_7          0  70 23.39      f  47 115           94   CC   AG
## Person_8          0  73 31.25      m  56  99          153   AC   AG
## Person_9          0  77 21.61      f 114 156          106   CC   AA
## Person_10         1  62 27.51      f  69 179          102   AA   AA
## Person_11         1  70 29.45      f  54 129          105   AA   AG
## Person_12         0  64 24.45      f  51 164          381   CC   GG
## Person_13         1  65 29.24      m  50 164          162   AC   AG
## Person_14         0  66 23.99      m  48  97           80   CC   AA
## Person_15         1  73 31.24      f  88 141          234   AC   AG
## Person_16        NA  71 26.58      f  70 153           93   AC   AA
## Person_17         0  66 28.70      m 115 123           60   CC   AA
## Person_18         0  70 24.28      m  54 128          325   AC   AG
## Person_19         1  69 31.70      f  76 274          191   AA   AG
## Person_20         1  62 23.24      m  48  82          150   AA   AG
##           smoking_status
## Person_1          smoker
## Person_2          smoker
## Person_3      non-smoker
## Person_4          smoker
## Person_5      non-smoker
## Person_6      non-smoker
## Person_7          smoker
## Person_8          smoker
## Person_9          smoker
## Person_10         smoker
## Person_11     non-smoker
## Person_12         smoker
## Person_13     non-smoker
## Person_14     non-smoker
## Person_15     non-smoker
## Person_16     non-smoker
## Person_17         smoker
## Person_18     non-smoker
## Person_19     non-smoker
## Person_20         smoker
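To look inside a factor, a few helper functions are useful. A short sketch with only five entries; the variable name status is made up for illustration.

```r
status = factor(c(1, 1, 2, 1, 2), labels = c("smoker", "non-smoker"))

levels(status)       # the labels, in the order of the categories
as.integer(status)   # the underlying category numbers
table(status)        # how many entries fall into each category
```

The factor stores the category numbers and only displays the labels, which is why both views are available.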

Saving a table

Check the file on your computer after each command to see the differences.

The basic command for saving a table is

write.table(my_tbl,file="mydiabetes.txt")

If we want to have a comma as decimal sign

write.table(my_tbl,file="mydiabetes.txt",dec=",")

In order to get rid of the quotes, we explicitly must say so

write.table(my_tbl,file="mydiabetes.txt",dec=",",quote=FALSE)
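You can also inspect the written file from within ‘R’ and read it back in. The sketch below assumes a hand-typed two-row excerpt and writes to a temporary file instead of mydiabetes.txt; the names small_tbl, out_file and back are made up for illustration.

```r
small_tbl = data.frame(Age = c(64, 66),
                       BMI = c(28.71, 29.72),
                       row.names = c("Person_1", "Person_2"))

out_file = tempfile(fileext = ".txt")   # temporary file for this example
write.table(small_tbl, file = out_file, dec = ",", quote = FALSE)

readLines(out_file)   # look at the raw text that was written

# Reading the file back needs the same decimal sign
back = read.table(out_file, header = TRUE, dec = ",")
back
```

Because the header line has one entry fewer than the data lines, read.table uses the first column as row names on its own.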

Exercise

  1. Add a column for the “atherosclerosis index” defined as \[\frac{LDL}{HDL}\]

  2. Add another column that gives the atherosclerosis risk status of a person. For males, “risk” is defined as having an atherosclerosis index higher than 3.5, and for females higher than 3.

Section 2: Lists

A list can contain anything (even other lists). Let’s create an example list:

my_vector=c(1,2,3)
my_matrix=matrix(1:4,nrow = 2)
my_list=list(my_vector,my_matrix,my_tbl)
str(my_list)
## List of 3
##  $ : num [1:3] 1 2 3
##  $ : int [1:2, 1:2] 1 2 3 4
##  $ :'data.frame':    20 obs. of  10 variables:
##   ..$ Affection     : int [1:20] 0 0 1 0 1 1 0 0 0 1 ...
##   ..$ Age           : int [1:20] 64 66 69 68 68 74 70 73 77 62 ...
##   ..$ BMI           : num [1:20] 28.7 29.7 31.5 30.4 28.5 ...
##   ..$ Gender        : chr [1:20] "m" "f" "m" "m" ...
##   ..$ HDL           : int [1:20] 55 67 29 55 85 60 47 56 114 69 ...
##   ..$ LDL           : int [1:20] 39 161 171 156 159 179 115 99 156 179 ...
##   ..$ Triglyceride  : int [1:20] 95 148 180 204 84 132 94 153 106 102 ...
##   ..$ SNP1          : chr [1:20] "CC" "AC" "AA" "CC" ...
##   ..$ SNP2          : chr [1:20] "AG" "AA" "AA" "AG" ...
##   ..$ smoking_status: Factor w/ 2 levels "smoker","non-smoker": 1 1 2 1 2 2 1 1 1 1 ...

As with tables,

my_list[[2]]
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

returns the second element of the list (the matrix my_matrix) and

my_list[2]
## [[1]]
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

returns a sub-list with the single element my_matrix.
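The type of the result can be checked directly. A minimal sketch with a two-element list; the element names numbers and grid in the second part are made up, and naming list elements is an extra that is not used elsewhere in this text.

```r
my_list = list(c(1, 2, 3), matrix(1:4, nrow = 2))

is.matrix(my_list[[2]])   # TRUE: double brackets hand back the element itself
is.list(my_list[2])       # TRUE: single brackets hand back a list
length(my_list[2])        # 1: a list with a single element

# Elements can also be given names and then be accessed with $
named_list = list(numbers = c(1, 2, 3), grid = matrix(1:4, nrow = 2))
named_list$grid
```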

Exercise

  1. What does my_list[2][1,1] yield, and what does my_list[[2]][1,1] yield? Can you explain the difference?

  2. Load the dataset kitchen using load("kitchen.RData"). The variable Kitchen is a nested dataset with lists, tables and vectors.

     1. Find two different ways to show only the cookies. Hint: str might be a good start to have a look at the dataset.
     2. Describe the “location” inside the dataset in words (e.g. “first entry, second table of the third list …”).

Section 3: Descriptive statistics and Basic Graphics

Typical quantities in descriptive statistics:

sum(my_tbl$Age) 
## [1] 1367
mean(my_tbl$Age)     # average
## [1] 68.35
var(my_tbl$Age)      # variance
## [1] 16.45
sd(my_tbl$Age)       # standard deviation
## [1] 4.05586
min(my_tbl$Age)      # minimum
## [1] 62
max(my_tbl$Age)      # maximum
## [1] 77
median(my_tbl$Age)   # median
## [1] 68.5
quantile(my_tbl$Age) # calculates 5 quantiles!
##    0%   25%   50%   75%  100% 
## 62.00 65.75 68.50 70.25 77.00
summary(my_tbl$Age)  # a summary of multiple statistics
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   62.00   65.75   68.50   68.35   70.25   77.00

If there are NAs (missing values in the experiment or the table), one has to add the option na.rm=TRUE!
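The effect of a single missing value is easy to try out. A tiny sketch with a hand-typed vector; the name ages is made up for illustration.

```r
ages = c(64, 66, NA, 68)   # one missing value, typed in by hand

mean(ages)                 # NA: one missing value makes the whole result NA
mean(ages, na.rm = TRUE)   # 66: NAs are removed before averaging
sum(is.na(ages))           # 1: the number of missing values
```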

Basic graphics with tables

Three quantiles together in a single plot!

boxplot(my_tbl$Age)

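The box shows the quartiles \(q_{0.25}\), \(q_{0.75}\) and the median \(q_{0.5}\). The meaning of the “whiskers” is not generally fixed. By default, ‘R’ draws each whisker to the most extreme data point that lies at most 1.5 times the interquartile range away from the box: \[\text{upper whisker} = \max\{x_i : x_i \le q_{0.75} + 1.5 \cdot (q_{0.75}-q_{0.25})\}\] \[\text{lower whisker} = \min\{x_i : x_i \ge q_{0.25} - 1.5 \cdot (q_{0.75}-q_{0.25})\}\] Points beyond the whiskers are drawn individually. (For the box, ‘R’ uses the so-called hinges, which are very close to the quartiles.)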
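The numbers behind a boxplot can be inspected with boxplot.stats(), which returns the values that boxplot() draws. A small sketch with made-up data containing one extreme value; the names x and st are chosen only for illustration.

```r
x = c(1, 2, 3, 4, 5, 100)   # made-up data with one extreme value

st = boxplot.stats(x)
st$stats   # lower whisker, lower hinge, median, upper hinge, upper whisker
st$out     # points beyond the whiskers, drawn individually by boxplot()
```

Here the upper whisker ends at a real data point (5) and the value 100 is reported separately.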

What about a histogram?

hist(my_tbl$Age)

The number of rows/persons belonging to each category can be counted by

table(my_tbl$Gender)
## 
##  f  m 
## 10 10

and we can plot this immediately by

barplot(table(my_tbl$Gender))

or the genotypes of SNP1

barplot(table(my_tbl$SNP1))

or the same plot as a pie

pie(table(my_tbl$SNP1))

Two variables yield a scatter diagram

plot(my_tbl$Age,my_tbl$BMI)

Impressive, but not always meaningful:

plot(my_tbl)

If we now want to save our plot on our computer as a PDF file, we can use the following structure:

pdf("plotname.pdf") # opens a PDF file; give a file name (or a full path) in quotation marks
barplot(table(my_tbl$Gender)) # all plotting commands you want to include go here
# each new plot starts a separate page
# comments are not included
dev.off() # closes the file and saves the PDF
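A version that runs without the diabetes table could look like the sketch below. It assumes hand-typed example data and a temporary file name (out_file is a made-up name); the width and height arguments of pdf() give the page size in inches.

```r
out_file = tempfile(fileext = ".pdf")   # temporary file for this example

pdf(out_file, width = 7, height = 5)         # open the PDF, size in inches
barplot(table(c("m", "f", "f", "m", "f")))   # page 1
hist(c(64, 66, 69, 68, 68, 74, 70))          # page 2
dev.off()                                    # close the file, otherwise it stays incomplete

file.exists(out_file)   # check that the file was written
```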

Exercise

  1. Make a barplot of the genotypes of SNP2 and save it into a pdf called “MySNP2_plot.pdf”

  2. Now change the output path of the pdf to a directory other than your working directory and save the pdf again. Hint: In case you forgot, you can see your current working directory by using getwd()