The following parts are optional and not mandatory to pass the course. They include detailed descriptions and exercises about tables, lists and graphics, and how to save them on your computer. You can try them out if you want to learn more about working with tables (data.frames) in R.
Before we can read a table into ‘R’, we need to know where we are and where the file is.
getwd()
gives you the “working directory” of ‘R’. You can change it with
setwd("newdirectory")
On Windows, paths usually start with “C:/Users/YOURNAME/”; on Linux and Mac, paths inside your home directory start with “~/”. Windows itself writes paths with a backslash, \(\backslash\). However, that character has another meaning in ‘R’ (it starts an escape sequence inside a string), so it has to be replaced manually by \(/\) or doubled to \(\backslash\backslash\).
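The following minimal sketch illustrates the point about slashes. The path is purely hypothetical (YOURNAME is a placeholder), and nothing is read from disk; the code only compares strings.

```r
# Hypothetical Windows path, written in three equivalent ways
p_forward <- "C:/Users/YOURNAME/data"
p_escaped <- "C:\\Users\\YOURNAME\\data"   # each \\ is ONE backslash in the string
p_built   <- file.path("C:", "Users", "YOURNAME", "data")  # joins parts with "/"

# Convert the backslashes of a pasted Windows path to forward slashes
p_converted <- gsub("\\", "/", p_escaped, fixed = TRUE)

print(p_converted)
print(identical(p_converted, p_forward))
print(identical(p_built, p_forward))
```

Writing a single backslash, as in "C:\Users", would fail because ‘R’ tries to interpret \U as an escape sequence.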
In order to read-in a table in text-format we can use
my_tbl=read.table("diabetes.txt")
If you can’t read the table, check where the diabetes.txt is located on your computer and manually adjust the path. See the help for read.table for additional advice.
A click on my_tbl in the top-right window shows it as a formatted table.
If we check the table, we see that column names got into the first data row! Quite bad!
my_tbl=read.table("diabetes.txt",header=TRUE)
In the first column we actually don’t have data but different names in each row. Hence let’s assign them as row names:
my_tbl=read.table("diabetes.txt",header=TRUE,row.names=1)
Attention! German decimals: the file uses a comma as the decimal separator!
my_tbl=read.table("diabetes.txt",header=TRUE,row.names=1,
dec=",")
It is easier to do that with the menu File -> Import Dataset. However, you should know the underlying commands.
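To see the effect of the options without needing diabetes.txt, here is a small sketch that writes a two-row toy file to a temporary location and reads it back. The file content and the column name "Name" are made up for illustration; only the read.table arguments are the ones used above.

```r
# Write a tiny toy file with a header line and comma decimals
tmp <- tempfile(fileext = ".txt")
writeLines(c("Name Age BMI",
             "Person_1 64 28,71",
             "Person_2 66 29,72"), tmp)

# Without dec=",": BMI is not recognised as a number
wrong <- read.table(tmp, header = TRUE, row.names = 1)
# With dec=",": BMI becomes numeric
right <- read.table(tmp, header = TRUE, row.names = 1, dec = ",")

str(wrong)
str(right)
```

Comparing the two str() outputs shows why the decimal sign matters: only the second table allows calculations on BMI.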
We know already how to address columns
my_tbl$Age
## [1] 64 66 69 68 68 74 70 73 77 62 70 64 65 66 73 71 66 70 69 62
or
my_tbl[[2]]
## [1] 64 66 69 68 68 74 70 73 77 62 70 64 65 66 73 71 66 70 69 62
yields the vector with the ages, while
my_tbl[2]
## Age
## Person_1 64
## Person_2 66
## Person_3 69
## Person_4 68
## Person_5 68
## Person_6 74
## Person_7 70
## Person_8 73
## Person_9 77
## Person_10 62
## Person_11 70
## Person_12 64
## Person_13 65
## Person_14 66
## Person_15 73
## Person_16 71
## Person_17 66
## Person_18 70
## Person_19 69
## Person_20 62
yields a new table with the single column ‘Age’
my_tbl[1,2]
## [1] 64
or
my_tbl["Person_1","Age"]
## [1] 64
yields the first element of the second column (same as with a matrix!)
my_tbl[1,]
## Affection Age BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_1 0 64 28.71 m 55 39 95 CC AG
yields a table with the first row
To get a new table from a column, it would seem logical to use my_tbl[,2]; however, that returns a plain vector unless drop=FALSE is added. We already know a command that does return a table, namely
my_tbl[2] # without comma!
## Age
## Person_1 64
## Person_2 66
## Person_3 69
## Person_4 68
## Person_5 68
## Person_6 74
## Person_7 70
## Person_8 73
## Person_9 77
## Person_10 62
## Person_11 70
## Person_12 64
## Person_13 65
## Person_14 66
## Person_15 73
## Person_16 71
## Person_17 66
## Person_18 70
## Person_19 69
## Person_20 62
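The different ways of addressing a column can be compared on a small toy table. This is only a sketch; the data frame df is made up from the first three rows shown above.

```r
# Toy table with three persons
df <- data.frame(Age = c(64, 66, 69),
                 BMI = c(28.71, 29.72, 31.51),
                 row.names = c("Person_1", "Person_2", "Person_3"))

as_table1  <- df[2]                  # table with one column
as_table2  <- df[, 2, drop = FALSE]  # also a table with one column
as_vector1 <- df[[2]]                # plain vector
as_vector2 <- df[, 2]                # plain vector (the table structure is dropped)

print(class(as_table1))
print(class(as_vector2))
```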
Several columns at once:
my_tbl[c(2,3)]
## Age BMI
## Person_1 64 28.71
## Person_2 66 29.72
## Person_3 69 31.51
## Person_4 68 30.44
## Person_5 68 28.52
## Person_6 74 28.73
## Person_7 70 23.39
## Person_8 73 31.25
## Person_9 77 21.61
## Person_10 62 27.51
## Person_11 70 29.45
## Person_12 64 24.45
## Person_13 65 29.24
## Person_14 66 23.99
## Person_15 73 31.24
## Person_16 71 26.58
## Person_17 66 28.70
## Person_18 70 24.28
## Person_19 69 31.70
## Person_20 62 23.24
or
my_tbl[c("Age","BMI")]
## Age BMI
## Person_1 64 28.71
## Person_2 66 29.72
## Person_3 69 31.51
## Person_4 68 30.44
## Person_5 68 28.52
## Person_6 74 28.73
## Person_7 70 23.39
## Person_8 73 31.25
## Person_9 77 21.61
## Person_10 62 27.51
## Person_11 70 29.45
## Person_12 64 24.45
## Person_13 65 29.24
## Person_14 66 23.99
## Person_15 73 31.24
## Person_16 71 26.58
## Person_17 66 28.70
## Person_18 70 24.28
## Person_19 69 31.70
## Person_20 62 23.24
yields a table containing only the columns two and three
Several rows at once:
my_tbl[c(1,4),] # with comma!
## Affection Age BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_1 0 64 28.71 m 55 39 95 CC AG
## Person_4 0 68 30.44 m 55 156 204 CC AG
yields a table with the rows 1 and 4
my_tbl[1:10,]
## Affection Age BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_1 0 64 28.71 m 55 39 95 CC AG
## Person_2 0 66 29.72 f 67 161 148 AC AA
## Person_3 1 69 31.51 m 29 171 180 AA AA
## Person_4 0 68 30.44 m 55 156 204 CC AG
## Person_5 1 68 28.52 m 85 159 84 AA AA
## Person_6 1 74 28.73 f 60 179 132 AC AG
## Person_7 0 70 23.39 f 47 115 94 CC AG
## Person_8 0 73 31.25 m 56 99 153 AC AG
## Person_9 0 77 21.61 f 114 156 106 CC AA
## Person_10 1 62 27.51 f 69 179 102 AA AA
yields a table with rows 1 to 10
my_tbl[my_tbl$Age > 70,]
## Affection Age BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_6 1 74 28.73 f 60 179 132 AC AG
## Person_8 0 73 31.25 m 56 99 153 AC AG
## Person_9 0 77 21.61 f 114 156 106 CC AA
## Person_15 1 73 31.24 f 88 141 234 AC AG
## Person_16 NA 71 26.58 f 70 153 93 AC AA
yields only persons older than 70
How does this work? Try out the inner term alone
my_tbl$Age > 70
## [1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE
## [13] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE
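The logical vector is what does the selecting: every TRUE keeps the corresponding row. A short sketch with the first ten ages from above (stored in a plain vector called age, a name chosen here for illustration):

```r
# First ten ages from the table
age <- c(64, 66, 69, 68, 68, 74, 70, 73, 77, 62)

older <- age > 70      # one TRUE/FALSE per entry
print(sum(older))      # TRUE counts as 1, so this is the number of matches
print(which(older))    # positions of the TRUE entries
print(age[older])      # the selected values themselves
```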
If we want to test for equality, we need two equals signs
my_tbl[my_tbl$Age == 70,]
## Affection Age BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_7 0 70 23.39 f 47 115 94 CC AG
## Person_11 1 70 29.45 f 54 129 105 AA AG
## Person_18 0 70 24.28 m 54 128 325 AC AG
You should be able to guess what happens, if there is only one …
Let’s combine conditions
my_tbl[my_tbl$Age > 70 & my_tbl$SNP1 == "CC",]
## Affection Age BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_9 0 77 21.61 f 114 156 106 CC AA
And what does this one mean?
my_tbl[my_tbl$Age >= 70 | my_tbl$SNP1 != "CC",]
## Affection Age BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_2 0 66 29.72 f 67 161 148 AC AA
## Person_3 1 69 31.51 m 29 171 180 AA AA
## Person_5 1 68 28.52 m 85 159 84 AA AA
## Person_6 1 74 28.73 f 60 179 132 AC AG
## Person_7 0 70 23.39 f 47 115 94 CC AG
## Person_8 0 73 31.25 m 56 99 153 AC AG
## Person_9 0 77 21.61 f 114 156 106 CC AA
## Person_10 1 62 27.51 f 69 179 102 AA AA
## Person_11 1 70 29.45 f 54 129 105 AA AG
## Person_13 1 65 29.24 m 50 164 162 AC AG
## Person_15 1 73 31.24 f 88 141 234 AC AG
## Person_16 NA 71 26.58 f 70 153 93 AC AA
## Person_18 0 70 24.28 m 54 128 325 AC AG
## Person_19 1 69 31.70 f 76 274 191 AA AG
## Person_20 1 62 23.24 m 48 82 150 AA AG
my_tbl$fake=1:20
adds a column named fake (containing the numbers 1 to 20). What happens if we do not supply 20 numbers? Try it out!
my_tbl$fake=NULL
removes the column (no way to get it back! There is no trash bin!)
How else could one remove the column?
my_tbl[20:1,]
## Affection Age BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_20 1 62 23.24 m 48 82 150 AA AG
## Person_19 1 69 31.70 f 76 274 191 AA AG
## Person_18 0 70 24.28 m 54 128 325 AC AG
## Person_17 0 66 28.70 m 115 123 60 CC AA
## Person_16 NA 71 26.58 f 70 153 93 AC AA
## Person_15 1 73 31.24 f 88 141 234 AC AG
## Person_14 0 66 23.99 m 48 97 80 CC AA
## Person_13 1 65 29.24 m 50 164 162 AC AG
## Person_12 0 64 24.45 f 51 164 381 CC GG
## Person_11 1 70 29.45 f 54 129 105 AA AG
## Person_10 1 62 27.51 f 69 179 102 AA AA
## Person_9 0 77 21.61 f 114 156 106 CC AA
## Person_8 0 73 31.25 m 56 99 153 AC AG
## Person_7 0 70 23.39 f 47 115 94 CC AG
## Person_6 1 74 28.73 f 60 179 132 AC AG
## Person_5 1 68 28.52 m 85 159 84 AA AA
## Person_4 0 68 30.44 m 55 156 204 CC AG
## Person_3 1 69 31.51 m 29 171 180 AA AA
## Person_2 0 66 29.72 f 67 161 148 AC AA
## Person_1 0 64 28.71 m 55 39 95 CC AG
reverses the order of rows
my_tbl[order(my_tbl$Age),]
## Affection Age BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_10 1 62 27.51 f 69 179 102 AA AA
## Person_20 1 62 23.24 m 48 82 150 AA AG
## Person_1 0 64 28.71 m 55 39 95 CC AG
## Person_12 0 64 24.45 f 51 164 381 CC GG
## Person_13 1 65 29.24 m 50 164 162 AC AG
## Person_2 0 66 29.72 f 67 161 148 AC AA
## Person_14 0 66 23.99 m 48 97 80 CC AA
## Person_17 0 66 28.70 m 115 123 60 CC AA
## Person_4 0 68 30.44 m 55 156 204 CC AG
## Person_5 1 68 28.52 m 85 159 84 AA AA
## Person_3 1 69 31.51 m 29 171 180 AA AA
## Person_19 1 69 31.70 f 76 274 191 AA AG
## Person_7 0 70 23.39 f 47 115 94 CC AG
## Person_11 1 70 29.45 f 54 129 105 AA AG
## Person_18 0 70 24.28 m 54 128 325 AC AG
## Person_16 NA 71 26.58 f 70 153 93 AC AA
## Person_8 0 73 31.25 m 56 99 153 AC AG
## Person_15 1 73 31.24 f 88 141 234 AC AG
## Person_6 1 74 28.73 f 60 179 132 AC AG
## Person_9 0 77 21.61 f 114 156 106 CC AA
sorts the entries by increasing age
my_tbl[order(my_tbl$Age,decreasing=TRUE),]
## Affection Age BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_9 0 77 21.61 f 114 156 106 CC AA
## Person_6 1 74 28.73 f 60 179 132 AC AG
## Person_8 0 73 31.25 m 56 99 153 AC AG
## Person_15 1 73 31.24 f 88 141 234 AC AG
## Person_16 NA 71 26.58 f 70 153 93 AC AA
## Person_7 0 70 23.39 f 47 115 94 CC AG
## Person_11 1 70 29.45 f 54 129 105 AA AG
## Person_18 0 70 24.28 m 54 128 325 AC AG
## Person_3 1 69 31.51 m 29 171 180 AA AA
## Person_19 1 69 31.70 f 76 274 191 AA AG
## Person_4 0 68 30.44 m 55 156 204 CC AG
## Person_5 1 68 28.52 m 85 159 84 AA AA
## Person_2 0 66 29.72 f 67 161 148 AC AA
## Person_14 0 66 23.99 m 48 97 80 CC AA
## Person_17 0 66 28.70 m 115 123 60 CC AA
## Person_13 1 65 29.24 m 50 164 162 AC AG
## Person_1 0 64 28.71 m 55 39 95 CC AG
## Person_12 0 64 24.45 f 51 164 381 CC GG
## Person_10 1 62 27.51 f 69 179 102 AA AA
## Person_20 1 62 23.24 m 48 82 150 AA AG
sorts the entries by decreasing age
Attention: The variable my_tbl remains unchanged! In order to work with the sorted table, one has to store it either in the same or (preferably) in another variable.
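A minimal sketch of storing the sorted table in a new variable, using a made-up four-row table (df and sorted are names chosen for this example):

```r
# Toy table with four persons
df <- data.frame(Age = c(64, 66, 69, 68),
                 BMI = c(28.71, 29.72, 31.51, 30.44),
                 row.names = paste0("Person_", 1:4))

# order() returns row positions; indexing with them builds a NEW table
sorted <- df[order(df$Age, decreasing = TRUE), ]

print(rownames(df))      # original order is untouched
print(rownames(sorted))  # oldest person first
```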
How do ‘factors’ work?
Let’s add a new column for the smoking status of a person. We have two categories (1 and 2) and two labels for them (“smoker” and “non-smoker”).
my_tbl$smoking_status=factor(c(1,1,2,1,2,2,1,1,1,1,
2,1,2,2,2,2,1,2,2,1),
labels=c("smoker","non-smoker"))
Check the table with
my_tbl
## Affection Age BMI Gender HDL LDL Triglyceride SNP1 SNP2
## Person_1 0 64 28.71 m 55 39 95 CC AG
## Person_2 0 66 29.72 f 67 161 148 AC AA
## Person_3 1 69 31.51 m 29 171 180 AA AA
## Person_4 0 68 30.44 m 55 156 204 CC AG
## Person_5 1 68 28.52 m 85 159 84 AA AA
## Person_6 1 74 28.73 f 60 179 132 AC AG
## Person_7 0 70 23.39 f 47 115 94 CC AG
## Person_8 0 73 31.25 m 56 99 153 AC AG
## Person_9 0 77 21.61 f 114 156 106 CC AA
## Person_10 1 62 27.51 f 69 179 102 AA AA
## Person_11 1 70 29.45 f 54 129 105 AA AG
## Person_12 0 64 24.45 f 51 164 381 CC GG
## Person_13 1 65 29.24 m 50 164 162 AC AG
## Person_14 0 66 23.99 m 48 97 80 CC AA
## Person_15 1 73 31.24 f 88 141 234 AC AG
## Person_16 NA 71 26.58 f 70 153 93 AC AA
## Person_17 0 66 28.70 m 115 123 60 CC AA
## Person_18 0 70 24.28 m 54 128 325 AC AG
## Person_19 1 69 31.70 f 76 274 191 AA AG
## Person_20 1 62 23.24 m 48 82 150 AA AG
## smoking_status
## Person_1 smoker
## Person_2 smoker
## Person_3 non-smoker
## Person_4 smoker
## Person_5 non-smoker
## Person_6 non-smoker
## Person_7 smoker
## Person_8 smoker
## Person_9 smoker
## Person_10 smoker
## Person_11 non-smoker
## Person_12 smoker
## Person_13 non-smoker
## Person_14 non-smoker
## Person_15 non-smoker
## Person_16 non-smoker
## Person_17 smoker
## Person_18 non-smoker
## Person_19 non-smoker
## Person_20 smoker
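To look inside a factor, the following sketch builds a short one from five made-up codes and inspects it. Internally the factor still stores the integer codes; the labels are kept separately as levels.

```r
# Five made-up category codes: 1 = smoker, 2 = non-smoker
codes <- c(1, 1, 2, 1, 2)
smoking <- factor(codes, labels = c("smoker", "non-smoker"))

print(levels(smoking))      # the labels
print(as.integer(smoking))  # the underlying codes
print(table(smoking))       # counts per category
```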
Check the file on your computer after each command to see the differences.
The basic command for saving a table is
write.table(my_tbl,file="mydiabetes.txt")
If we want to have a comma as decimal sign
write.table(my_tbl,file="mydiabetes.txt",dec=",")
In order to get rid of the quotes, we explicitly must say so
write.table(my_tbl,file="mydiabetes.txt",dec=",",quote=FALSE)
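If you prefer to inspect the result from within ‘R’, the following sketch writes a made-up two-row table to a temporary file, prints the raw lines and reads the file back. The file location comes from tempfile(), so nothing in your working directory is touched.

```r
# Toy table with two persons
df <- data.frame(Age = c(64, 66),
                 BMI = c(28.71, 29.72),
                 row.names = c("Person_1", "Person_2"))

out <- tempfile(fileext = ".txt")
write.table(df, file = out, dec = ",", quote = FALSE)

# Raw file content: header line, then one line per row
file_lines <- readLines(out)
print(file_lines)

# Read it back; the comma decimals must be announced again
back <- read.table(out, header = TRUE, dec = ",")
print(back)
```

Note that the header line has one entry fewer than the data lines, because the row names have no column name; read.table then uses the first column as row names.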
Add a column for the “atherosclerosis index” defined as \[\frac{LDL}{HDL}\]
Add another column that gives the atherosclerotic risk status of a person. For males, “risk” is defined as having an atherosclerosis index higher than 3.5, and for females as having one higher than 3.
A list can contain anything (even other lists). Let’s create an example list:
my_vector=c(1,2,3)
my_matrix=matrix(1:4,nrow = 2)
my_list=list(my_vector,my_matrix,my_tbl)
str(my_list)
## List of 3
## $ : num [1:3] 1 2 3
## $ : int [1:2, 1:2] 1 2 3 4
## $ :'data.frame': 20 obs. of 10 variables:
## ..$ Affection : int [1:20] 0 0 1 0 1 1 0 0 0 1 ...
## ..$ Age : int [1:20] 64 66 69 68 68 74 70 73 77 62 ...
## ..$ BMI : num [1:20] 28.7 29.7 31.5 30.4 28.5 ...
## ..$ Gender : chr [1:20] "m" "f" "m" "m" ...
## ..$ HDL : int [1:20] 55 67 29 55 85 60 47 56 114 69 ...
## ..$ LDL : int [1:20] 39 161 171 156 159 179 115 99 156 179 ...
## ..$ Triglyceride : int [1:20] 95 148 180 204 84 132 94 153 106 102 ...
## ..$ SNP1 : chr [1:20] "CC" "AC" "AA" "CC" ...
## ..$ SNP2 : chr [1:20] "AG" "AA" "AA" "AG" ...
## ..$ smoking_status: Factor w/ 2 levels "smoker","non-smoker": 1 1 2 1 2 2 1 1 1 1 ...
As with tables,
my_list[[2]]
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
returns the second element of the list (the matrix my_matrix) and
my_list[2]
## [[1]]
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
returns a sub-list with the single element my_matrix.
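The difference between the two kinds of brackets can be checked directly. This sketch uses a small list with named elements (the names vec and mat are chosen here; the list above has unnamed elements):

```r
# A small list with named elements
my_list <- list(vec = c(1, 2, 3), mat = matrix(1:4, nrow = 2))

sub_list <- my_list[2]    # single brackets: still a list, of length 1
element  <- my_list[[2]]  # double brackets: the matrix itself

print(is.list(sub_list))
print(is.matrix(element))

# Named elements can also be addressed with $ or by name
print(identical(my_list$mat, my_list[["mat"]]))
```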
What does my_list[2][1,1] yield, and what my_list[[2]][1,1]? Can you explain the differences?
Load the dataset kitchen using load("kitchen.RData"). The variable Kitchen is a nested dataset with lists, tables and vectors.
Typical quantities in descriptive statistics:
sum(my_tbl$Age)
## [1] 1367
mean(my_tbl$Age) # average
## [1] 68.35
var(my_tbl$Age) # variance
## [1] 16.45
sd(my_tbl$Age) # standard deviation
## [1] 4.05586
min(my_tbl$Age) # minimum
## [1] 62
max(my_tbl$Age) # maximum
## [1] 77
median(my_tbl$Age) # median
## [1] 68.5
quantile(my_tbl$Age) # calculates 5 quantiles!
## 0% 25% 50% 75% 100%
## 62.00 65.75 68.50 70.25 77.00
summary(my_tbl$Age) # a summary of multiple statistics
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 62.00 65.75 68.50 68.35 70.25 77.00
If there are NAs (missing values in the experiment or the table), one has to add the option na.rm=TRUE!
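A short sketch of what happens with and without the option, using a made-up vector with one missing value:

```r
# Made-up measurements with one missing value
x <- c(64, NA, 69)

print(mean(x))                 # NA: a single missing value spoils the result
print(mean(x, na.rm = TRUE))   # missing values are removed first
print(sum(is.na(x)))           # how many values are missing?
```

The same option works for sum, var, sd, min, max, median and quantile.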
Three quantiles together in a single plot!
boxplot(my_tbl$Age)
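The box shows the quartiles \(q_{0.25}\), \(q_{0.75}\) and the median \(q_{0.5}\). The meaning of the “whiskers” is not generally fixed. ‘R’ places them at the most extreme data points that lie within 1.5 times the interquartile range of the box, which is essentially \[\text{upper whisker} = \min(\max(x),\, q_{0.75} + 1.5 \cdot (q_{0.75}-q_{0.25}))\] \[\text{lower whisker} = \max(\min(x),\, q_{0.25} - 1.5 \cdot (q_{0.75}-q_{0.25}))\] Points outside the whiskers are drawn individually as outliers.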
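The whisker formulas can be evaluated by hand and compared with what ‘R’ reports. The sketch below copies the 20 ages from the output of my_tbl$Age into a plain vector; boxplot.stats() is the function that computes the numbers boxplot() draws. Be aware that boxplot() uses so-called hinges for the box, which can differ slightly from the values of quantile().

```r
# Ages of the 20 persons, copied from the output of my_tbl$Age
age <- c(64, 66, 69, 68, 68, 74, 70, 73, 77, 62,
         70, 64, 65, 66, 73, 71, 66, 70, 69, 62)

q   <- unname(quantile(age, c(0.25, 0.75)))  # lower and upper quartile
iqr <- q[2] - q[1]                           # interquartile range

# Whiskers according to the formulas in the text
upper <- min(max(age), q[2] + 1.5 * iqr)
lower <- max(min(age), q[1] - 1.5 * iqr)
print(c(lower = lower, upper = upper))

# Five numbers used by boxplot(): whisker, hinge, median, hinge, whisker
print(boxplot.stats(age)$stats)
```

For these data both approaches give the same whisker ends, namely the minimum and the maximum age.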
What about a histogram?
hist(my_tbl$Age)
The number of rows/persons belonging to each category can be counted by
table(my_tbl$Gender)
##
## f m
## 10 10
and we can plot this immediately by
barplot(table(my_tbl$Gender))
or the genotypes of SNP1
barplot(table(my_tbl$SNP1))
or the same plot as a pie
pie(table(my_tbl$SNP1))
Two variables yield a scatter diagram
plot(my_tbl$Age,my_tbl$BMI)
Impressive, but not always meaningful:
plot(my_tbl)
If we now want to save our plot on our computer in a pdf-file, we can use the following structure:
pdf("plotname.pdf") # opens a pdf device. You can specify a file name (or even a full path) in quotation marks
barplot(table(my_tbl$Gender)) # everything you want to include comes here
# each new plot starts a separate page
# comments are not included
dev.off() # closes the device and saves the pdf
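As a self-contained sketch of the same structure, the following code writes two plots of made-up data into one pdf with two pages. The file name example_plots.pdf and the use of tempdir() as the target folder are choices made for this example only.

```r
# Target file in R's temporary folder
out <- file.path(tempdir(), "example_plots.pdf")

pdf(out, width = 7, height = 5)            # open the pdf device (size in inches)
barplot(table(c("f", "m", "f", "m", "m"))) # page 1
hist(c(64, 66, 69, 68, 68, 74))            # page 2
dev.off()                                  # close the device, the file is written now

print(file.exists(out))
```

If dev.off() is forgotten, the pdf stays open and usually cannot be viewed.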
Make a barplot of the genotypes of SNP2 and save it into a pdf called “MySNP2_plot.pdf”.
Now change the output path of the pdf to a directory other than your working directory and save the pdf again. Hint: In case you forgot, you can see your current working directory by using getwd()