The following parts are optional; they are not required to pass the course. They include detailed descriptions and exercises about tables, lists and graphics, and how to save them on your computer. You can try them out if you want to learn more about how to work with tables/data.frames in R.

## Section 1: Tables

Before we can read a table into ‘R’, we need to know where we are and where the file is.

getwd()

gives you the “working directory” of ‘R’. You can change it with

setwd("newdirectory")

On Windows, your directory usually starts with “C:/Users/YOURNAME/”; on Linux and Mac it starts with “~/”. Windows itself writes paths with $$\backslash$$. However, that character has another meaning in ‘R’, and hence it has to be replaced manually by $$/$$.
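The path rules above can be tried out without touching any real folder. The following is only a small sketch; the directory names are placeholders and the variable names (`p_forward`, `p_escaped`, `p_built`, `p_converted`) are made up for this illustration.

```r
# Two equivalent ways to write a Windows path in R.
# "C:/Users/YOURNAME/data" is a placeholder, not a real directory.
p_forward <- "C:/Users/YOURNAME/data"
p_escaped <- "C:\\Users\\YOURNAME\\data"   # every backslash has to be doubled

# file.path() joins the parts of a path with "/"
p_built <- file.path("C:", "Users", "YOURNAME", "data")
print(p_built)

# After parsing, the escaped string contains single backslashes
cat(p_escaped, "\n")

# Backslashes can also be converted to forward slashes by a command
p_converted <- gsub("\\\\", "/", p_escaped)
print(identical(p_converted, p_forward))
```

Building paths with `file.path()` avoids typing the separators by hand at all.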

In order to read-in a table in text-format we can use

my_tbl=read.table("diabetes.txt")

A click on my_tbl in the top-right window shows it as a formatted table.

If we check the table, we see that column names got into the first data row! Quite bad!

my_tbl=read.table("diabetes.txt",header=TRUE)

In the first column we actually don’t have data but different names in each row. Hence let’s assign them as row names:

my_tbl=read.table("diabetes.txt",header=TRUE,row.names=1)

Attention! German decimals (a comma instead of a point)!

my_tbl=read.table("diabetes.txt",header=TRUE,row.names=1,
dec=",")

It is easier to do that with the menu File -> Import Dataset. However, you should know the underlying commands.
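If the file diabetes.txt is not at hand, the effect of the options can be checked on a tiny self-made file. This is a sketch with made-up values; the names `demo_lines`, `demo_file`, `wrong` and `right` are assumptions for this illustration only.

```r
# A tiny made-up table in the same layout as described above:
# header line, row names in the first column, comma as decimal sign.
demo_lines <- c(
  "Person Age BMI",
  "Person_1 64 28,71",
  "Person_2 66 29,72"
)
demo_file <- tempfile(fileext = ".txt")   # temporary file, not diabetes.txt
writeLines(demo_lines, demo_file)

# Without dec="," the BMI column is not recognised as numbers
wrong <- read.table(demo_file, header = TRUE, row.names = 1)
str(wrong)

# With dec="," the BMI column becomes numeric
right <- read.table(demo_file, header = TRUE, row.names = 1, dec = ",")
str(right)
```

Checking the result with `str()` right after reading is a quick way to notice a wrong decimal sign.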

my_tbl$Age
##  [1] 64 66 69 68 68 74 70 73 77 62 70 64 65 66 73 71 66 70 69 62

or

my_tbl[[2]]
##  [1] 64 66 69 68 68 74 70 73 77 62 70 64 65 66 73 71 66 70 69 62

yields the vector with the ages, while

my_tbl[2]
##           Age
## Person_1   64
## Person_2   66
## Person_3   69
## Person_4   68
## Person_5   68
## Person_6   74
## Person_7   70
## Person_8   73
## Person_9   77
## Person_10  62
## Person_11  70
## Person_12  64
## Person_13  65
## Person_14  66
## Person_15  73
## Person_16  71
## Person_17  66
## Person_18  70
## Person_19  69
## Person_20  62

yields a new table with the single column ‘Age’

### Additional ways to access data

my_tbl[1,2]
## [1] 64

or

my_tbl["Person_1","Age"]
## [1] 64

yields the first element of the second column (same as with a matrix!)

my_tbl[1,]
##          Affection Age   BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_1         0  64 28.71      m  55  39           95   CC   AG

yields a table with the first row

For a new table made from a column it would seem logical to use my_tbl[,2]; that, however, returns a plain vector. We already know a command that keeps the table form, namely

my_tbl[2] # without comma!
##           Age
## Person_1   64
## Person_2   66
## Person_3   69
## Person_4   68
## Person_5   68
## Person_6   74
## Person_7   70
## Person_8   73
## Person_9   77
## Person_10  62
## Person_11  70
## Person_12  64
## Person_13  65
## Person_14  66
## Person_15  73
## Person_16  71
## Person_17  66
## Person_18  70
## Person_19  69
## Person_20  62

Several columns at once:

my_tbl[c(2,3)]
##           Age   BMI
## Person_1   64 28.71
## Person_2   66 29.72
## Person_3   69 31.51
## Person_4   68 30.44
## Person_5   68 28.52
## Person_6   74 28.73
## Person_7   70 23.39
## Person_8   73 31.25
## Person_9   77 21.61
## Person_10  62 27.51
## Person_11  70 29.45
## Person_12  64 24.45
## Person_13  65 29.24
## Person_14  66 23.99
## Person_15  73 31.24
## Person_16  71 26.58
## Person_17  66 28.70
## Person_18  70 24.28
## Person_19  69 31.70
## Person_20  62 23.24

or

my_tbl[c("Age","BMI")]
##           Age   BMI
## Person_1   64 28.71
## Person_2   66 29.72
## Person_3   69 31.51
## Person_4   68 30.44
## Person_5   68 28.52
## Person_6   74 28.73
## Person_7   70 23.39
## Person_8   73 31.25
## Person_9   77 21.61
## Person_10  62 27.51
## Person_11  70 29.45
## Person_12  64 24.45
## Person_13  65 29.24
## Person_14  66 23.99
## Person_15  73 31.24
## Person_16  71 26.58
## Person_17  66 28.70
## Person_18  70 24.28
## Person_19  69 31.70
## Person_20  62 23.24

yields a table containing only the columns two and three

Several rows at once:

my_tbl[c(1,4),] # with comma!
##          Affection Age   BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_1         0  64 28.71      m  55  39           95   CC   AG
## Person_4         0  68 30.44      m  55 156          204   CC   AG

yields a table with the rows 1 and 4

my_tbl[1:10,]
##           Affection Age   BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_1          0  64 28.71      m  55  39           95   CC   AG
## Person_2          0  66 29.72      f  67 161          148   AC   AA
## Person_3          1  69 31.51      m  29 171          180   AA   AA
## Person_4          0  68 30.44      m  55 156          204   CC   AG
## Person_5          1  68 28.52      m  85 159           84   AA   AA
## Person_6          1  74 28.73      f  60 179          132   AC   AG
## Person_7          0  70 23.39      f  47 115           94   CC   AG
## Person_8          0  73 31.25      m  56  99          153   AC   AG
## Person_9          0  77 21.61      f 114 156          106   CC   AA
## Person_10         1  62 27.51      f  69 179          102   AA   AA

yields a table with the rows 1 to 10

my_tbl[my_tbl$Age > 70,]
##           Affection Age   BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_6          1  74 28.73      f  60 179          132   AC   AG
## Person_8          0  73 31.25      m  56  99          153   AC   AG
## Person_9          0  77 21.61      f 114 156          106   CC   AA
## Person_15         1  73 31.24      f  88 141          234   AC   AG
## Person_16        NA  71 26.58      f  70 153           93   AC   AA

yields only persons older than 70

How does this work? Try out the inner term alone

my_tbl$Age > 70
##  [1] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE
## [13] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE

If we want to test for equality, we need two equal signs

my_tbl[my_tbl$Age == 70,]
##           Affection Age   BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_7          0  70 23.39      f  47 115           94   CC   AG
## Person_11         1  70 29.45      f  54 129          105   AA   AG
## Person_18         0  70 24.28      m  54 128          325   AC   AG

You should be able to guess what happens, if there is only one …

Let’s combine conditions

my_tbl[my_tbl$Age > 70 & my_tbl$SNP1 == "CC",]
##          Affection Age   BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_9         0  77 21.61      f 114 156          106   CC   AA

And what does this one mean?

my_tbl[my_tbl$Age >= 70 | my_tbl$SNP1 != "CC",]
##           Affection Age   BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_2          0  66 29.72      f  67 161          148   AC   AA
## Person_3          1  69 31.51      m  29 171          180   AA   AA
## Person_5          1  68 28.52      m  85 159           84   AA   AA
## Person_6          1  74 28.73      f  60 179          132   AC   AG
## Person_7          0  70 23.39      f  47 115           94   CC   AG
## Person_8          0  73 31.25      m  56  99          153   AC   AG
## Person_9          0  77 21.61      f 114 156          106   CC   AA
## Person_10         1  62 27.51      f  69 179          102   AA   AA
## Person_11         1  70 29.45      f  54 129          105   AA   AG
## Person_13         1  65 29.24      m  50 164          162   AC   AG
## Person_15         1  73 31.24      f  88 141          234   AC   AG
## Person_16        NA  71 26.58      f  70 153           93   AC   AA
## Person_18         0  70 24.28      m  54 128          325   AC   AG
## Person_19         1  69 31.70      f  76 274          191   AA   AG
## Person_20         1  62 23.24      m  48  82          150   AA   AG
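How such row selections work can be followed step by step on a very small table. The sketch below uses made-up data (not the diabetes table); the names `demo`, `older` and `is_cc` are assumptions for this illustration.

```r
# Made-up mini table (not the diabetes data)
demo <- data.frame(
  Age  = c(62, 70, 77, 71),
  SNP1 = c("AA", "CC", "CC", "AC"),
  row.names = c("P1", "P2", "P3", "P4")
)

older <- demo$Age > 70        # one TRUE/FALSE value per row
is_cc <- demo$SNP1 == "CC"

older & is_cc                 # TRUE only where both conditions hold
older | is_cc                 # TRUE where at least one condition holds

# A logical vector in the row position keeps the rows marked TRUE
demo[older & is_cc, ]

# which() turns the logical vector into row numbers
which(older | is_cc)
```

Storing the conditions in variables first makes longer combinations easier to read and to check.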
my_tbl$fake=1:20

adds a column with name fake (containing the numbers 1 to 20)

What happens, if we do not supply 20 numbers? Try it out!

my_tbl$fake=NULL

removes the column (no way to get it back! There is no trash bin!)

How else could one remove the column?

### Sorting tables

my_tbl[20:1,]
##           Affection Age   BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_20         1  62 23.24      m  48  82          150   AA   AG
## Person_19         1  69 31.70      f  76 274          191   AA   AG
## Person_18         0  70 24.28      m  54 128          325   AC   AG
## Person_17         0  66 28.70      m 115 123           60   CC   AA
## Person_16        NA  71 26.58      f  70 153           93   AC   AA
## Person_15         1  73 31.24      f  88 141          234   AC   AG
## Person_14         0  66 23.99      m  48  97           80   CC   AA
## Person_13         1  65 29.24      m  50 164          162   AC   AG
## Person_12         0  64 24.45      f  51 164          381   CC   GG
## Person_11         1  70 29.45      f  54 129          105   AA   AG
## Person_10         1  62 27.51      f  69 179          102   AA   AA
## Person_9          0  77 21.61      f 114 156          106   CC   AA
## Person_8          0  73 31.25      m  56  99          153   AC   AG
## Person_7          0  70 23.39      f  47 115           94   CC   AG
## Person_6          1  74 28.73      f  60 179          132   AC   AG
## Person_5          1  68 28.52      m  85 159           84   AA   AA
## Person_4          0  68 30.44      m  55 156          204   CC   AG
## Person_3          1  69 31.51      m  29 171          180   AA   AA
## Person_2          0  66 29.72      f  67 161          148   AC   AA
## Person_1          0  64 28.71      m  55  39           95   CC   AG

reverses the order of rows

my_tbl[order(my_tbl$Age),]
##           Affection Age   BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_10         1  62 27.51      f  69 179          102   AA   AA
## Person_20         1  62 23.24      m  48  82          150   AA   AG
## Person_1          0  64 28.71      m  55  39           95   CC   AG
## Person_12         0  64 24.45      f  51 164          381   CC   GG
## Person_13         1  65 29.24      m  50 164          162   AC   AG
## Person_2          0  66 29.72      f  67 161          148   AC   AA
## Person_14         0  66 23.99      m  48  97           80   CC   AA
## Person_17         0  66 28.70      m 115 123           60   CC   AA
## Person_4          0  68 30.44      m  55 156          204   CC   AG
## Person_5          1  68 28.52      m  85 159           84   AA   AA
## Person_3          1  69 31.51      m  29 171          180   AA   AA
## Person_19         1  69 31.70      f  76 274          191   AA   AG
## Person_7          0  70 23.39      f  47 115           94   CC   AG
## Person_11         1  70 29.45      f  54 129          105   AA   AG
## Person_18         0  70 24.28      m  54 128          325   AC   AG
## Person_16        NA  71 26.58      f  70 153           93   AC   AA
## Person_8          0  73 31.25      m  56  99          153   AC   AG
## Person_15         1  73 31.24      f  88 141          234   AC   AG
## Person_6          1  74 28.73      f  60 179          132   AC   AG
## Person_9          0  77 21.61      f 114 156          106   CC   AA

sorts the entries with increasing age

my_tbl[order(my_tbl$Age,decreasing=TRUE),]
##           Affection Age   BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_9          0  77 21.61      f 114 156          106   CC   AA
## Person_6          1  74 28.73      f  60 179          132   AC   AG
## Person_8          0  73 31.25      m  56  99          153   AC   AG
## Person_15         1  73 31.24      f  88 141          234   AC   AG
## Person_16        NA  71 26.58      f  70 153           93   AC   AA
## Person_7          0  70 23.39      f  47 115           94   CC   AG
## Person_11         1  70 29.45      f  54 129          105   AA   AG
## Person_18         0  70 24.28      m  54 128          325   AC   AG
## Person_3          1  69 31.51      m  29 171          180   AA   AA
## Person_19         1  69 31.70      f  76 274          191   AA   AG
## Person_4          0  68 30.44      m  55 156          204   CC   AG
## Person_5          1  68 28.52      m  85 159           84   AA   AA
## Person_2          0  66 29.72      f  67 161          148   AC   AA
## Person_14         0  66 23.99      m  48  97           80   CC   AA
## Person_17         0  66 28.70      m 115 123           60   CC   AA
## Person_13         1  65 29.24      m  50 164          162   AC   AG
## Person_1          0  64 28.71      m  55  39           95   CC   AG
## Person_12         0  64 24.45      f  51 164          381   CC   GG
## Person_10         1  62 27.51      f  69 179          102   AA   AA
## Person_20         1  62 23.24      m  48  82          150   AA   AG

sorts the entries with decreasing age

Attention: The variable my_tbl remains unchanged! In order to work with the sorted table, one has to store it either in the same or (preferably) in another variable.
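The following sketch illustrates both points on made-up data: what `order()` actually returns, and that the sorted table has to be stored in a variable. The names `demo` and `demo_sorted` are assumptions for this illustration; passing a second column to `order()` to break ties is an addition not used in the text above.

```r
# Made-up mini table (not the diabetes data)
demo <- data.frame(
  Age = c(70, 62, 70, 64),
  BMI = c(29.45, 27.51, 23.39, 28.71),
  row.names = c("P1", "P2", "P3", "P4")
)

# order() returns row positions in sorted order, not the sorted values
order(demo$Age)

# Ties in Age are broken by the second argument (BMI);
# the result is stored in a new variable
demo_sorted <- demo[order(demo$Age, demo$BMI), ]
demo_sorted

# The original table still has its original row order
rownames(demo)
```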

### Factors

How do ‘factors’ work?
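As a first impression, here is a minimal sketch with made-up codes and labels (the names `codes` and `answer` and the labels "yes"/"no" are assumptions for this illustration). A factor stores integer codes internally and shows the matching label for each of them.

```r
# Made-up codes: 1 and 2 stand for two categories
codes <- c(1, 1, 2, 1, 2)

# The labels are assigned to the sorted distinct codes: 1 -> "yes", 2 -> "no"
answer <- factor(codes, labels = c("yes", "no"))

answer
levels(answer)        # the category names
as.integer(answer)    # the internal integer codes
table(answer)         # number of entries per category
```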

Let's add a new column for the smoking status of a person. We have two categories (1 and 2), and two labels for them (“smoker” and “non-smoker”).

my_tbl$smoking_status=factor(c(1,1,2,1,2,2,1,1,1,1,
2,1,2,2,2,2,1,2,2,1),
labels=c("smoker","non-smoker"))

Check the table with

my_tbl
##           Affection Age   BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_1          0  64 28.71      m  55  39           95   CC   AG
## Person_2          0  66 29.72      f  67 161          148   AC   AA
## Person_3          1  69 31.51      m  29 171          180   AA   AA
## Person_4          0  68 30.44      m  55 156          204   CC   AG
## Person_5          1  68 28.52      m  85 159           84   AA   AA
## Person_6          1  74 28.73      f  60 179          132   AC   AG
## Person_7          0  70 23.39      f  47 115           94   CC   AG
## Person_8          0  73 31.25      m  56  99          153   AC   AG
## Person_9          0  77 21.61      f 114 156          106   CC   AA
## Person_10         1  62 27.51      f  69 179          102   AA   AA
## Person_11         1  70 29.45      f  54 129          105   AA   AG
## Person_12         0  64 24.45      f  51 164          381   CC   GG
## Person_13         1  65 29.24      m  50 164          162   AC   AG
## Person_14         0  66 23.99      m  48  97           80   CC   AA
## Person_15         1  73 31.24      f  88 141          234   AC   AG
## Person_16        NA  71 26.58      f  70 153           93   AC   AA
## Person_17         0  66 28.70      m 115 123           60   CC   AA
## Person_18         0  70 24.28      m  54 128          325   AC   AG
## Person_19         1  69 31.70      f  76 274          191   AA   AG
## Person_20         1  62 23.24      m  48  82          150   AA   AG
##           smoking_status
## Person_1          smoker
## Person_2          smoker
## Person_3      non-smoker
## Person_4          smoker
## Person_5      non-smoker
## Person_6      non-smoker
## Person_7          smoker
## Person_8          smoker
## Person_9          smoker
## Person_10         smoker
## Person_11     non-smoker
## Person_12         smoker
## Person_13     non-smoker
## Person_14     non-smoker
## Person_15     non-smoker
## Person_16     non-smoker
## Person_17         smoker
## Person_18     non-smoker
## Person_19     non-smoker
## Person_20         smoker

### Saving a table

Check the file on your computer after each command to see the differences.

The basic command for saving a table is

write.table(my_tbl,file="mydiabetes.txt")

If we want to have a comma as decimal sign

write.table(my_tbl,file="mydiabetes.txt",dec=",")

In order to get rid of the quotes, we explicitly must say so

write.table(my_tbl,file="mydiabetes.txt",dec=",",quote=FALSE)

## Exercise

1. Add a column for the “atherosclerosis index” defined as $\frac{LDL}{HDL}$

2. Add another column that gives the atherosclerotic risk status of a person. For males, “risk” is defined as having an atherosclerosis index higher than 3.5, and for females higher than 3.

## Section 2: Lists

A list can contain anything (even other lists). Let's create an example list:

my_vector=c(1,2,3)
my_matrix=matrix(1:4,nrow = 2)
my_list=list(my_vector,my_matrix,my_tbl)
str(my_list)
## List of 3
##  $ : num [1:3] 1 2 3
##  $ : int [1:2, 1:2] 1 2 3 4
##  $ :'data.frame':    20 obs. of  10 variables:
##   ..$ Affection     : int [1:20] 0 0 1 0 1 1 0 0 0 1 ...
##   ..$ Age           : int [1:20] 64 66 69 68 68 74 70 73 77 62 ...
##   ..$ BMI           : num [1:20] 28.7 29.7 31.5 30.4 28.5 ...
##   ..$ Gender        : chr [1:20] "m" "f" "m" "m" ...
##   ..$ HDL           : int [1:20] 55 67 29 55 85 60 47 56 114 69 ...
##   ..$ LDL           : int [1:20] 39 161 171 156 159 179 115 99 156 179 ...
##   ..$ Triglyceride  : int [1:20] 95 148 180 204 84 132 94 153 106 102 ...
##   ..$ SNP1          : chr [1:20] "CC" "AC" "AA" "CC" ...
##   ..$ SNP2          : chr [1:20] "AG" "AA" "AA" "AG" ...
##   ..$ smoking_status: Factor w/ 2 levels "smoker","non-smoker": 1 1 2 1 2 2 1 1 1 1 ...

As with tables,

my_list[[2]]
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

returns the second element of the list (the matrix my_matrix) and

my_list[2]
## [[1]]
##      [,1] [,2]
## [1,]    1    3
## [2,]    2    4

returns a sub-list with the single element my_matrix.
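The difference between single and double brackets can also be checked by asking ‘R’ what kind of object comes back. This is a small sketch on a separate, made-up list; the list name `demo_list` and the element names `numbers` and `grid` are assumptions for this illustration.

```r
# A small named list (all names are made up)
demo_list <- list(numbers = c(1, 2, 3), grid = matrix(1:4, nrow = 2))

class(demo_list["grid"])     # single brackets: still a list
class(demo_list[["grid"]])   # double brackets: the matrix itself
length(demo_list["grid"])    # a list with exactly one element

# For named elements, $ is a shortcut for [[ ]]
identical(demo_list$grid, demo_list[["grid"]])
```

Giving list elements names, as done here, allows access by name in the same way as for table columns.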

## Exercise

1. What do my_list[2][1,1] and my_list[[2]][1,1] yield? Can you explain the difference?

2. Load the dataset kitchen using load("kitchen.RData"). The variable Kitchen is a nested dataset with lists, tables and vectors.

    1. Find two different ways to show only the cookies. Hint: str might be a good start to have a look at the dataset.
    2. Describe the “location” inside the dataset in words (e.g. “first entry, second table of the third list …”).

## Section 3: Descriptive statistics and Basic Graphics

Typical quantities in descriptive statistics:

sum(my_tbl$Age)      # sum
## [1] 1367
mean(my_tbl$Age)     # average
## [1] 68.35
var(my_tbl$Age)      # variance
## [1] 16.45
sd(my_tbl$Age)       # standard deviation
## [1] 4.05586
min(my_tbl$Age)      # minimum
## [1] 62
max(my_tbl$Age)      # maximum
## [1] 77
median(my_tbl$Age)   # median
## [1] 68.5
quantile(my_tbl$Age) # calculates 5 quantiles!
##    0%   25%   50%   75%  100%
## 62.00 65.75 68.50 70.25 77.00
summary(my_tbl$Age)  # a summary of multiple statistics
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   62.00   65.75   68.50   68.35   70.25   77.00

If there are NAs (missing values in the experiment or the table), one has to add the option na.rm=TRUE! Otherwise functions such as mean() return NA; compare mean(my_tbl$Affection) with mean(my_tbl$Affection,na.rm=TRUE).

## Basic graphics with tables

Three quantiles together in a single plot!

boxplot(my_tbl$Age)

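The box shows the quartiles $$q_{0.25}$$, $$q_{0.75}$$ and the median $$q_{0.5}$$. The meaning of the “whiskers” is not generally fixed. ‘R’ draws each whisker to the most extreme data point that lies at most 1.5 times the box height away from the box:

$\text{upper whisker} = \max\{x_i : x_i \le q_{0.75} + 1.5 \cdot (q_{0.75}-q_{0.25})\}$

$\text{lower whisker} = \min\{x_i : x_i \ge q_{0.25} - 1.5 \cdot (q_{0.75}-q_{0.25})\}$

Data points beyond the whiskers are drawn individually.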
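Where ‘R’ places the whiskers can be checked numerically with boxplot.stats(), which returns the five numbers that boxplot() draws. The sketch below uses made-up values with one extreme point; the names `x`, `s` and `fence` are assumptions for this illustration. Note that boxplot.stats() works with so-called hinges, a version of the quartiles that can differ slightly from the output of quantile().

```r
# Made-up ages with one extreme value
x <- c(62, 64, 65, 66, 68, 69, 70, 71, 73, 95)

# boxplot.stats() returns what boxplot() draws:
# lower whisker, lower hinge, median, upper hinge, upper whisker
s <- boxplot.stats(x)
s$stats
s$out      # points drawn individually beyond the whiskers

# Upper limit: upper hinge + 1.5 * box height
fence <- s$stats[4] + 1.5 * (s$stats[4] - s$stats[2])

# The upper whisker is the largest data point that does not exceed the limit
max(x[x <= fence])
```

In this example the whisker ends at an actual data value, and the extreme point is reported separately in `s$out`.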

hist(my_tbl$Age)

The number of rows/persons belonging to each category can be counted by

table(my_tbl$Gender)
##
##  f  m
## 10 10

and we can plot this immediately by

barplot(table(my_tbl$Gender))

or the genotypes of SNP1

barplot(table(my_tbl$SNP1))

or the same plot as a pie

pie(table(my_tbl$SNP1))

Two variables yield a scatter diagram

plot(my_tbl$Age,my_tbl$BMI)

Impressive, but not always sensible:

plot(my_tbl)

If we now want to save our plot on our computer in a PDF file, we can use the following structure:

pdf("plotname.pdf") # opens a pdf file for plotting. You can specify a file name (or even a full path) in quotation marks
barplot(table(my_tbl$Gender)) # everything you want to include comes here
# each new plot starts a separate page
dev.off() # closes the file and saves the pdf
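The same structure can be tried out without creating a file in the working directory. The following sketch writes two plots of made-up data into a temporary PDF; the name `out_file` and the chosen page size are assumptions for this illustration.

```r
# Write two plots into one PDF; tempfile() is used so that no real file is overwritten
out_file <- tempfile(fileext = ".pdf")

pdf(out_file, width = 7, height = 5)   # page size in inches
barplot(table(c("f", "m", "f", "f")))  # page 1 (made-up data)
hist(c(62, 64, 66, 68, 70, 73, 77))    # page 2 (made-up data)
dev.off()                              # without this the file stays incomplete

file.exists(out_file)
```

If a PDF cannot be opened afterwards, a forgotten dev.off() is a likely cause.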