We will show here a typical pipeline for analysis of gene expression data. This tutorial is based on the use of Monocytes and Macrophage data from the following paper:
van de Laar L, Saelens W, De Prijck S, Martens L et al. Yolk Sac Macrophages, Fetal Liver, and Adult Monocytes Can Colonize an Empty Niche and Develop into Functional Tissue-Resident Macrophages. Immunity 2016 Apr 19;44(4):755-68. PMID: 26992565
This tutorial contains the following steps.
Steps 1 through 3 correspond to code that can be automatically generated by GEO2R from Gene Expression Omnibus (GEO), as shown during the course.
Here you can find the files necessary to complete this tutorial. Inside this zip folder, course_files.zip, you have:
|handout.pdf||PDF version of this page, a step-by-step explanation of the tasks.|
|Imun_gen.ex.Rdump||Preprocessed object from GEO for loading in the analysis.|
|pheno.table||A table storing the phenotype data for comparison.|
Before you start to work on this handout, please set your working directory to the folder where you saved the files from the handout and where you will perform this exercise. For example,
setwd(file.path("~", "project/courses/practical_bioinformatics_june2022/handouts")) # Please use your path instead of the above example
Before running the analysis, all the necessary libraries need to be loaded. We assume you have successfully installed all libraries as described here (http://costalab.org/r-installation/)
library(Biobase) library(GEOquery) library(gplots) library(RColorBrewer) library(gProfileR) library(sva) library(limma)
Biobase contains base functions for Bioconductor;
GEOquery is the bridge between GEO and BioConductor;
limma is a library for the analysis of gene expression microarray data, especially the use of linear models for analyzing designed experiments and the assessment of differential expression;
gplots is for making graphical figures;
RColorBrewer can make colorful palettes;
gProfileR is the package to perform gene enrichment analysis;
sva (ComBat) is used here to remove known batch effects from microarray data.
First, we need to download the data from GEO by the GEO number.
gset <- getGEO("GSE76999", GSEMatrix =TRUE, AnnotGPL=TRUE) if (length(gset) > 1) idx <- grep("GPL6246", attr(gset, "names")) else idx <- 1 gset <- gset[[idx]]
ExpressionSet (gset here) is a Container for high-throughput assays and experimental metadata.
Here we select only the first 12 samples for downstream analysis and eliminate others. These 12 samples are BM-MOs (bone marrow monocytes), FL-MOs (fetal liver monocytes), and YS-Macs (yolk sac-derived macrophages), with 4 replicates for each.
# make proper column names to match toptable fvarLabels(gset) <- make.names(fvarLabels(gset)) # keep only first 12 samples gset <- gset[ ,1:12] # pData function allows us to check the sample data information in the gset container # as a sanity check, we examine the sample information to make sure we picked the correct samples # the columns "title" and "tissue:ch1" give us the relevant information for this pData(gset)[,c("title","tissue:ch1")]
## title ## GSM2042244 Monocyte extracted from adult (wk6-12) Bone Marrow, biological replicate 1 ## GSM2042245 Monocyte extracted from adult (wk6-12) Bone Marrow, biological replicate 2 ## GSM2042246 Monocyte extracted from adult (wk6-12) Bone Marrow, biological replicate 3 ## GSM2042247 Monocyte extracted from adult (wk6-12) Bone Marrow, biological replicate 4 ## GSM2042248 Monocyte extracted from E15.5 Fetal Liver, biological replicate 1 ## GSM2042249 Monocyte extracted from E15.5 Fetal Liver, biological replicate 2 ## GSM2042250 Monocyte extracted from E15.5 Fetal Liver, biological replicate 3 ## GSM2042251 Monocyte extracted from E15.5 Fetal Liver, biological replicate 4 ## GSM2042252 Macrophage extracted from E12.5 Yolk Sac, biological replicate 1 ## GSM2042253 Macrophage extracted from E12.5 Yolk Sac, biological replicate 2 ## GSM2042254 Macrophage extracted from E12.5 Yolk Sac, biological replicate 3 ## GSM2042255 Macrophage extracted from E12.5 Yolk Sac, biological replicate 4 ## tissue:ch1 ## GSM2042244 Bone Marrow ## GSM2042245 Bone Marrow ## GSM2042246 Bone Marrow ## GSM2042247 Bone Marrow ## GSM2042248 Fetal Liver ## GSM2042249 Fetal Liver ## GSM2042250 Fetal Liver ## GSM2042251 Fetal Liver ## GSM2042252 Yolk Sac ## GSM2042253 Yolk Sac ## GSM2042254 Yolk Sac ## GSM2042255 Yolk Sac
# make label (remember 12 samples, 3 different types, 4 replicates of each) factor_label <- as.factor(c(rep(0,4), rep(1,4), rep(2,4))) # line below creates the same label but directly from the sample data information # factor_label <- as.factor(as.numeric(as.factor(pData(gset)[,"tissue:ch1"]))-1) labels <- c("BM-MOs", "FL-MOs", "YS-Macs")
Before we do any further analysis, we need to do quality control of the data. We use boxplot to check whether we need further normalization on the data.
palette(c("#dfeaf4","#f4dfdf", "#f2cb98")) par(mar=c(2+round(max(nchar(sampleNames(gset)))/2),4,2,1)) title <- paste ("GSE76999", '/', annotation(gset), " selected samples", sep ='') boxplot(exprs(gset), boxwex=0.6, notch=T, main=title, outline=FALSE, las=2, col=factor_label) legend("topleft", labels, fill=palette(), bty="n")
The boxplot indicates all samples have the same distribution, therefore there is no need for any normalization.
Next, we want to find genes that are differentially expressed between BM-MOs, FL-MOs and YS-Macs using limma. To this end, we usually perform a log2 transformation of the expression data. This is how gene expression data is often displayed. However, this dataset is already log2 transformed. Then, we create a design matrix defining the groups to compare. Finally, using limma, we fit a model to the data given the design matrix we defined. Here we show you how to get the top 250 DE genes.
# transform into log2 scale if necessary # exprs(gset) <- log2(exprs(gset)) # set up the data and proceed with analysis tmp <- paste("G", factor_label, sep="") # set group names factor_label <- as.factor(tmp) gset$description <- factor_label design <- model.matrix(~ description - 1, gset) # "- 1" removes the intercept term colnames(design) <- levels(factor_label) fit <- lmFit(gset, design) # fit model to data given design cont.matrix <- makeContrasts(G2-G0, G1-G0, G2-G1, levels=design) fit2 <- contrasts.fit(fit, cont.matrix) fit2 <- eBayes(fit2, 0.05) tT <- topTable(fit2, adjust="fdr", sort.by="B", number=250) tT <- subset(tT, select=c("ID","adj.P.Val","P.Value","F","Gene.symbol","Gene.title")) write.table(tT, file="de_genes.csv", row.names=F, sep="\t")
Please open the output table de_genes.csv and try to understand each column.
The above script is (basically) from GEO2R and only gets the top 250 DE genes. In order to get all significant DE genes, we use the script below:
tT <- topTable(fit2, adjust="fdr", sort.by="B", p.value=0.05, number=Inf)
Up to this point most of the code can be automatically generated with GEO2R. For these data specifically, you can do it from https://www.ncbi.nlm.nih.gov/geo/geo2r/?acc=GSE76999, by selecting the correct sample groups and clicking the R Script tab.
The MA plot visualizes the differences between measurements taken in two samples, by transforming the data onto M (log ratio) and A (mean average) scales, then plotting these values. Here we take BM-MOs (G0) v.s. YS-Macs (G2) as an example.
ma <- fit2[,"G2 - G0"] limma::plotMA(ma)