We will show here a typical pipeline for analysis of gene expression data. This tutorial is based on the use of Monocytes and Macrophage data from the following paper:

van de Laar L, Saelens W, De Prijck S, Martens L et al.Ā Yolk Sac Macrophages, Fetal Liver, and Adult Monocytes Can Colonize an Empty Niche and Develop into Functional Tissue-Resident Macrophages. Immunity 2016 Apr 19;44(4):755-68. PMID: 26992565

This tutorial contains the following steps.

  1. Download the data from GEO
  2. Process the expression table for analysis
  3. DE Analysis by limma
  4. PCA (Principal component analysis)
  5. Clustering by Heatmap (only top 250 genes)
  6. GO (Gene Ontology) Enrichment Analysis
  7. Combining other dataset for PCA and clustering

Steps 1 through 3 correspond to code that can be automatically generated by GEO2R from Gene Expression Omnibus (GEO), as shown during the course.

Here you can find the files necessary to complete this tutorial. Inside this zip folder, course_files.zip, you have:

File Description
handout.pdf PDF version of this page, a step-by-step explanation of the tasks.
Imun_gen.ex.Rdump Preprocessed object from GEO for loading in the analysis.
pheno.table A table storing the phenotype data for comparison.

Set the working directory

Before you start to work on this handout, please set your working directory to the folder where you saved the files from the handout and where you will perform this exercise. For example,

setwd(file.path("~", "project/courses/practical_bioinformatics_june2022/handouts"))
# Please use your path instead of the above example

Loading the libraries

Before running the analysis, all the necessary libraries need to be loaded. We assume you have successfully installed all libraries as described here (http://costalab.org/r-installation/)

library(Biobase)
library(GEOquery)
library(gplots)
library(RColorBrewer)
library(gProfileR)
library(sva)
library(limma)

Biobase contains base functions for Bioconductor;
GEOquery is the bridge between GEO and BioConductor;
limma is a library for the analysis of gene expression microarray data, especially the use of linear models for analyzing designed experiments and the assessment of differential expression;
gplots is for making graphical figures;
RColorBrewer can make colorful palettes;
gProfileR is the package to perform gene enrichment analysis;
sva (ComBat) is used here to remove known batch effects from microarray data.

1 Download the data from GEO

First, we need to download the data from GEO by the GEO number.

gset <- getGEO("GSE76999", GSEMatrix =TRUE, AnnotGPL=TRUE)
if (length(gset) > 1) idx <- grep("GPL6246", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

ExpressionSet (gset here) is a Container for high-throughput assays and experimental metadata.

2 Process the expression table for analysis

Here we select only the first 12 samples for downstream analysis and eliminate others. These 12 samples are BM-MOs (bone marrow monocytes), FL-MOs (fetal liver monocytes), and YS-Macs (yolk sac-derived macrophages), with 4 replicates for each.

# make proper column names to match toptable
fvarLabels(gset) <- make.names(fvarLabels(gset))
# keep only first 12 samples
gset <- gset[ ,1:12]
# pData function allows us to check the sample data information in the gset container
# as a sanity check, we examine the sample information to make sure we picked the correct samples
# the columns "title" and "tissue:ch1" give us the relevant information for this
pData(gset)[,c("title","tissue:ch1")]
##                                                                                 title
## GSM2042244 Monocyte extracted from adult (wk6-12) Bone Marrow, biological replicate 1
## GSM2042245 Monocyte extracted from adult (wk6-12) Bone Marrow, biological replicate 2
## GSM2042246 Monocyte extracted from adult (wk6-12) Bone Marrow, biological replicate 3
## GSM2042247 Monocyte extracted from adult (wk6-12) Bone Marrow, biological replicate 4
## GSM2042248          Monocyte extracted from E15.5 Fetal Liver, biological replicate 1
## GSM2042249          Monocyte extracted from E15.5 Fetal Liver, biological replicate 2
## GSM2042250          Monocyte extracted from E15.5 Fetal Liver, biological replicate 3
## GSM2042251          Monocyte extracted from E15.5 Fetal Liver, biological replicate 4
## GSM2042252           Macrophage extracted from E12.5 Yolk Sac, biological replicate 1
## GSM2042253           Macrophage extracted from E12.5 Yolk Sac, biological replicate 2
## GSM2042254           Macrophage extracted from E12.5 Yolk Sac, biological replicate 3
## GSM2042255           Macrophage extracted from E12.5 Yolk Sac, biological replicate 4
##             tissue:ch1
## GSM2042244 Bone Marrow
## GSM2042245 Bone Marrow
## GSM2042246 Bone Marrow
## GSM2042247 Bone Marrow
## GSM2042248 Fetal Liver
## GSM2042249 Fetal Liver
## GSM2042250 Fetal Liver
## GSM2042251 Fetal Liver
## GSM2042252    Yolk Sac
## GSM2042253    Yolk Sac
## GSM2042254    Yolk Sac
## GSM2042255    Yolk Sac
# make label (remember 12 samples, 3 different types, 4 replicates of each)
factor_label <- as.factor(c(rep(0,4), rep(1,4), rep(2,4)))
# line below creates the same label but directly from the sample data information 
# factor_label <- as.factor(as.numeric(as.factor(pData(gset)[,"tissue:ch1"]))-1)

labels <- c("BM-MOs", "FL-MOs", "YS-Macs")

Before we do any further analysis, we need to do quality control of the data. We use boxplot to check whether we need further normalization on the data.

palette(c("#dfeaf4","#f4dfdf", "#f2cb98"))
par(mar=c(2+round(max(nchar(sampleNames(gset)))/2),4,2,1))
title <- paste ("GSE76999", '/', annotation(gset), " selected samples", sep ='')
boxplot(exprs(gset), boxwex=0.6, notch=T, main=title, outline=FALSE, las=2, col=factor_label)
legend("topleft", labels, fill=palette(), bty="n")

The boxplot indicates all samples have the same distribution, therefore there is no need for any normalization.

3 Differential Expression Analysis by limma

Next, we want to find genes that are differentially expressed between BM-MOs, FL-MOs and YS-Macs using limma. To this end, we usually perform a log2 transformation of the expression data. This is how gene expression data is often displayed. However, this dataset is already log2 transformed. Then, we create a design matrix defining the groups to compare. Finally, using limma, we fit a model to the data given the design matrix we defined. Here we show you how to get the top 250 DE genes.

# transform into log2 scale if necessary
# exprs(gset) <- log2(exprs(gset))

# set up the data and proceed with analysis
tmp <- paste("G", factor_label, sep="") # set group names
factor_label <- as.factor(tmp)
gset$description <- factor_label
design <- model.matrix(~ description - 1, gset) # "- 1" removes the intercept term
colnames(design) <- levels(factor_label)
fit <- lmFit(gset, design) # fit model to data given design
cont.matrix <- makeContrasts(G2-G0, G1-G0, G2-G1, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2, 0.05)
tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250)
tT <- subset(tT, select=c("ID","adj.P.Val","P.Value","F","Gene.symbol","Gene.title"))
write.table(tT, file="de_genes.csv", row.names=F, sep="\t")

Please open the output table de_genes.csv and try to understand each column.

The above script is (basically) from GEO2R and only gets the top 250 DE genes. In order to get all significant DE genes, we use the script below:

tT <- topTable(fit2, adjust="fdr", sort.by="B", p.value=0.05, number=Inf)

Up to this point most of the code can be automatically generated with GEO2R. For these data specifically, you can do it from https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE76999, by selecting the correct sample groups and clicking the R Script tab.

MA plot

The MA plot visualizes the differences between measurements taken in two samples, by transforming the data onto M (log ratio) and A (mean average) scales, then plotting these values. Here we take BM-MOs (G0) v.s. YS-Macs (G2) as an example.

ma <- fit2[,"G2 - G0"]
limma::plotMA(ma)

Volcano plot

A volcano plot is a type of scatter-plot that is used to quickly identify changes in large data sets. It plots significance versus fold-change on the y and x axes, respectively.

volcanoplot(fit2, coef=1, col="gray")

# If you want to save the previous plot in PDF format, you can run the 3 lines below
# pdf("volcanoplot.pdf", width=4, height=4) # "open" the pdf file
# volcanoplot(fit2, coef=1, col="gray") # create the plot in the pdf file
# dev.off() # "close" the pdf file

4 PCA (Principal component analysis)

Principal component analysis (PCA) is a mathematical algorithm that reduces the dimensionality of the data while retaining most of the variation in the data set1. It accomplishes this reduction by identifying directions, called principal components, along which the variation in the data is maximal. By using a few components, each sample can be represented by relatively few numbers instead of by values for thousands of variables. Samples can then be plotted, making it possible to visually assess similarities and differences between samples and determine whether samples can be grouped.

(see Nature Biotechnology 26, 303 - 304 (2008))

palette(c("#dfeaf4","#f4dfdf", "#f2cb98")) # Same color palette as before
# palette(c("Green","Blue","Red")) # use this one alternatively for better visual distinction

ex <- exprs(gset)
pca <- prcomp(t(ex))
plot(pca$x[,1:2], col=factor_label, pch=19)
legend("topright", inset=.05, labels, pch=19, col=1:3, horiz=TRUE)

5 Clustering by Heatmap (only top 250 genes)

Another way to visualize the similarity and the difference of the samples is to perform hierarchical clustering and plot the results as a heatmap.

# define the color palette for the plot
hmcol <- colorRampPalette(rev(brewer.pal(9, "RdBu")))(256)
# perform complete linkage clustering
clust <- function(x) hclust(x, method="complete")
# use the inverse of correlation as distance.
dist <- function(x) as.dist((1-cor(t(x)))/2)

hp <- heatmap.2(
  ex[match(tT$ID, rownames(ex)),],     # data to plot
  scale="row",                         # scale the values for each gene (row)
  density.info="none",                 # turns off density plot inside color legend
  trace="none",                        # turns off trace lines inside the heat map
  col=hmcol,                           # define the color map
  labRow=FALSE,                        # Not showing row labels
  dendrogram="both",                   # Show dendrograms for both axis
  labCol=rep(labels,each=4),           # Sample labels
  ColSideColors=rep(c("#dfeaf4","#f4dfdf","#f2cb98"),each=4), # Show colors for each sample type
  hclust=clust,                        # Define hierarchical clustering method
  distfun=dist,                        # Using correlation coefficient for distance function
  margins = c(8,3)
)

As expected samples with the same type cluster together. We can also easily observe gene clusters that differentiate the different sample types.

Get genes from clustering

If we want to get the clusters from the dendrogram on the left hand side, we can use the script below:

clusters <- cutree(as.hclust(hp$rowDendrogram),k=5)
clustered_genes <- list(
  tT$Gene.symbol[match(names(clusters)[clusters==1],tT$ID)],
  tT$Gene.symbol[match(names(clusters)[clusters==2],tT$ID)],
  tT$Gene.symbol[match(names(clusters)[clusters==3],tT$ID)],
  tT$Gene.symbol[match(names(clusters)[clusters==4],tT$ID)],
  tT$Gene.symbol[match(names(clusters)[clusters==5],tT$ID)]
)
write.table(clustered_genes[[1]],file="cluster1_genes.txt",quote=F,row.names=F,col.names=F)
write.table(clustered_genes[[2]],file="cluster2_genes.txt",quote=F,row.names=F,col.names=F)
write.table(clustered_genes[[3]],file="cluster3_genes.txt",quote=F,row.names=F,col.names=F)
write.table(clustered_genes[[4]],file="cluster4_genes.txt",quote=F,row.names=F,col.names=F)
write.table(clustered_genes[[5]],file="cluster5_genes.txt",quote=F,row.names=F,col.names=F)

6 GO (Gene Ontology) Enrichment Analysis

GO analysis requires a list of genes to find which GO terms are over-represented. Here we take BM-MOs (G0) v.s. YS-Macs (G2) as an example.

go_enrich_up <- gprofiler(as.vector(tT$Gene.symbol[tT$G2...G0 > 0]),organism="mmusculus")
go_enrich_down <- gprofiler(as.vector(tT$Gene.symbol[tT$G2...G0 < 0]),organism="mmusculus")
write.table(go_enrich_up, "GO_BM-MOs_YS-Macs_up_genes.csv")
write.table(go_enrich_down, "GO_BM-MOs_YS-Macs_down_genes.csv")

7 Combining other dataset for PCA and clustering

Finally, as an advanced example of combined data, we have selected Monocyte and Macrophage populations from the ImmGen (Immunological Genome Project) data.

For simplicity, we will load the pre-processed data and a table containing details on the experiments (pheno data).

# get imm gene data from saved object. Make sure this file is in your working directory.
ex_imun_gen <- dget(file.path("Imun_gen.ex.Rdump"))
# get all pheno data from tab separated file. Make sure this file is in your working directory.
phenoData_table <- read.table(file=file.path("pheno.table.csv"), header=T, sep=",", quote="")
rownames(phenoData_table)<-phenoData_table$geo_acession

ex_all <- merge(ex_imun_gen, ex, by="row.names")
rownames(ex_all) <- ex_all$"Row.names"
ex_all$"Row.names" <- NULL
ex_all <- as.matrix(ex_all)

To make these data comparable, we need to remove batch effects and technical variation caused by different labs/experiment runs/etc.. Optimally, we want to remove differences caused by technical artifacts but not biological differences. To this end we will use a tool called ComBat.

# We use the tissue of origin as a covariate.
# This lets the tool know that there is specific biological
# variation that shouldn't be removed.
modcombat <- model.matrix(~Tissue, data=phenoData_table) 

combat_edata <- ComBat(
  dat=ex_all,                  # expression data
  batch=phenoData_table$batch, # batch information
  mod=modcombat,               # model with covariates
  par.prior=TRUE
)

PCA on merged data

Once the data has been combined, we can perform PCA and clustering analysis in both data sets (before and after batch correction) and compare how well we fared.

s_cols<-c(
  colorRampPalette(brewer.pal(9,"Set1"))(length(unique(phenoData_table$title))-3),
  "#dfeaf4", "#f4dfdf", "#f2cb98"
)
names(s_cols)<-unique(phenoData_table$title)

boxplot(
  ex_all,
  notch=TRUE,
  main="Expression before batch correction",
  outline=FALSE,
  xaxt="n",
  col=s_cols[as.character(phenoData_table$title)]
)

pca_im=prcomp(t(ex_all))
plot(pca_im$x[,1:2],pch=19,col=s_cols[as.character(phenoData_table$title)])
text(pca_im$x[,1]+1, pca_im$x[,2]+3, phenoData_table[,2], cex=0.5)
title("PCA before batch correction")

Now we plot exactly the same plots but using the batch corrected data.

boxplot(
  combat_edata,
  notch=TRUE,
  main="Expression after batch correction",
  outline=FALSE,
  xaxt="n",
  col=s_cols[as.character(phenoData_table$title)]
)

pca_im=prcomp(t(combat_edata))
plot(pca_im$x[,1:2],pch=19,col=s_cols[as.character(phenoData_table$title)])
text(pca_im$x[,1]+1, pca_im$x[,2]+3, phenoData_table[,2], cex=0.5)
title("PCA after batch correction")

As you can observe, batch correction greatly diminishes the technical variation in the data.

Clustering of merged data

Here, again, we show how non-batch corrected data compares against batch corrected data but this time in the context of clustering.

sel <- match(tT$ID, rownames(combat_edata))
sel <- sel[!is.na(sel)]


heatmap.2(
  ex_all[sel,], 
  dendrogram="both", 
  col=hmcol, 
  scale="row",
  labRow=FALSE, 
  main="Clustering before batch correction",
  density.info="none", 
  trace="none", 
  ColSideColors=s_cols[as.character(phenoData_table$title)],
  margins=c(9,3),
  labCol=phenoData_table[,2]
)

heatmap.2(
  combat_edata[sel,], 
  dendrogram="both", 
  col=hmcol, 
  scale="row",
  labRow=FALSE, 
  density.info="none", 
  main="Clustering after batch correction",
  trace="none", 
  ColSideColors=s_cols[as.character(phenoData_table$title)],
  margins=c(9,3),
  labCol=phenoData_table[,2]
)

8 Gene Set Enrichment analysis (based on limma)

What should we do if we wish to test a specific gene set, to see whether they are up-regulated or down-regulated compared to other genes in terms of differential expression? Then here is the Gene set enrichment GSEA. We assume that the first 20 genes are a set of genes interested. We use camera from limma to see whether it is up or down.

# assume the set of first 20 genes is the interested gene set
interested_gene_set <- rownames(exprs(gset))[1:20]
camera(exprs(gset), list(set1 = interested_gene_set), design)
##      NGenes Direction    PValue
## set1     20        Up 0.1745419