1 Intro

This exercise will show how to obtain clinical and genomic data from the Cancer Genome Atlas (TGCA) and to perform classical analysis important for clinical data.

These include:

  1. Download the data (clinical and expression) from TGCA
  2. Processing of the data (normalization) and saving it locally using simple table formats.
  3. Unsupervised analysis includes differential expression, PCA and clustering.
  4. Build a machine learning model (classifier) to predict cancer.
  5. Perform survival analysis of molecular markers detected in previous analysis.

First, we start by loading all libraries necessary for this exercise. Please check their documentation if you want to know more.

# Load packages
library("TCGAbiolinks")
library("limma")
library("edgeR")
library("glmnet")
library("factoextra")
library("FactoMineR")
library("caret")
library("SummarizedExperiment")
library("gplots")
library("survival")
library("survminer")
library("RColorBrewer")
library("clusterProfiler")
library("genefilter")

2 TCGA data

In this tutorial, we will focus on Liver Hepatocellular Carcinoma, which is identified in TCGA as LIHC. For LIHC, TCGA provides data for 377 patients including: clinical, expression, DNA methylation and genotyping data. In this tutorial, we will work with clinical and expression data (RNA-seq). Go to https://portal.gdc.cancer.gov/ and search for TCGA-LIHC if you want to understand the data deposited in TCGA. You can also try to find your way through the previous data to look for other data sets of your interest.

We will make use of the TCGAbiolinks library, which allows us to query, prepare and download data from the TCGA portal. TCGAbiolinks provides important functionality as matching data of same the donors across distinct data types (clinical vs expression) and provides data structures to make its analysis in R easy.

To download TCGA data with TCGAbiolinks, you need to follow 3 steps. First, you will query the TCGA database through R with the function GDCquery. This will allow you to investigate the data available at the TCGA database. Next, we use GDCdownload to download raw version of desired files into your computer. Finally GDCprepare will read these files and make R data structures so that we can further analyse them.

Before we get there however we need to know what we are searching for. We can check all the available projects at TCGA with the command bellow. Since there are many lets look at the first 6 projects using the command head().

GDCprojects = getGDCprojects()

head(GDCprojects[c("project_id", "name")])
##      project_id
## 1 CGCI-HTMCP-CC
## 2    TARGET-AML
## 3     GENIE-JHU
## 4     GENIE-MSK
## 5    GENIE-VICC
## 6     GENIE-MDA
##                                                                                          name
## 1                             HIV+ Tumor Molecular Characterization Project - Cervical Cancer
## 2                                                                      Acute Myeloid Leukemia
## 3 AACR Project GENIE - Contributed by Johns Hopkins Sidney Kimmel Comprehensive Cancer Center
## 4                  AACR Project GENIE - Contributed by Memorial Sloan Kettering Cancer Center
## 5                         AACR Project GENIE - Contributed by Vanderbilt-Ingram Cancer Center
## 6                               AACR Project GENIE - Contributed by MD Anderson Cancer Center

As a general rule in R (and especially if you are working in RStudio) whenever some method returns some value or table you are not familiar with, you should check its structure and dimensions. You can always use functions such as head() to only show the first entries and dim() to check the dimension of the data.

We already know that Liver Hepatocellular Carcinoma has as id TCGA-LIHC. We can use the following function to get details on all data deposited for TCGA-LIHC.

TCGAbiolinks:::getProjectSummary("TCGA-LIHC")
## $file_count
## [1] 20162
## 
## $data_categories
##   file_count case_count               data_category
## 1       2634        377            Sequencing Reads
## 2       1634        377                 Biospecimen
## 3       4175        377       Copy Number Variation
## 4       6048        377 Simple Nucleotide Variation
## 5       1698        376     Transcriptome Profiling
## 6       1290        377             DNA Methylation
## 7        803        377                    Clinical
## 8       1696        371        Structural Variation
## 9        184        184          Proteome Profiling
## 
## $case_count
## [1] 377
## 
## $file_size
## [1] 7.142195e+13

Of note, not all patients were measured for all data types. Also, some data types have more files than samples. This is the case when more experiments were performed per patient, i.e.Ā transcriptome profiling was performed both in mRNA and miRNA, or that data have been analysed by distinct computational strategies.

Let us start by querying all RNA-seq data from LIHC project. When using GDCquery we always need to specify the id of the project, i.e.Ā ā€œTCGA_LIHCā€, and the data category we are interested in, i.e.Ā ā€œTranscriptome Profilingā€. Here, we will focus on a particular type of data summarization for mRNA-seq data (workflow.type), which is based on raw counts estimated with HTSeq.

Note that performing this query will take a few of minutes

query_TCGA = GDCquery(
  project = "TCGA-LIHC",
  data.category = "Transcriptome Profiling", # parameter enforced by GDCquery
  data.type = "Gene Expression Quantification",
  experimental.strategy = "RNA-Seq",
  #workflow.type = "HTSeq - Counts"
  workflow.type = "STAR - Counts"
)

To visualize the query results in a more readable way, we can use the command getResults.

lihc_res = getResults(query_TCGA) # make results as table
# head(lihc_res) # data of the first 6 patients.
colnames(lihc_res) # columns present in the table
##  [1] "id"                        "data_format"              
##  [3] "cases"                     "access"                   
##  [5] "file_name"                 "submitter_id"             
##  [7] "data_category"             "type"                     
##  [9] "file_size"                 "created_datetime"         
## [11] "md5sum"                    "updated_datetime"         
## [13] "file_id"                   "data_type"                
## [15] "state"                     "experimental_strategy"    
## [17] "version"                   "data_release"             
## [19] "project"                   "analysis_id"              
## [21] "analysis_state"            "analysis_submitter_id"    
## [23] "analysis_workflow_link"    "analysis_workflow_type"   
## [25] "analysis_workflow_version" "sample_type"              
## [27] "is_ffpe"                   "cases.submitter_id"       
## [29] "sample.submitter_id"

One interesting question is the tissue type measured at an experiment (normal, solid tissue, cell line). This information is present at column sample_type.

head(lihc_res$sample_type) # first 6 types of tissue.
## [1] "Solid Tissue Normal" "Primary Tumor"       "Primary Tumor"      
## [4] "Primary Tumor"       "Primary Tumor"       "Primary Tumor"

sample_type is a categorical variable, which can be better visualized with the table function, so that we can check how many different categories exist in the data.

table(lihc_res$sample_type) # summary of distinct tissues types present in this study
## 
##       Primary Tumor     Recurrent Tumor Solid Tissue Normal 
##                 371                   3                  50

As you see, there are 50 controls (Solid Tissue Normal) and 371 cancer samples (Primary Tumors). For simplicity, we will ignore the small class of recurrent tumors. Therefore, we will redo the query as

query_TCGA = GDCquery(
  project = "TCGA-LIHC",
  data.category = "Transcriptome Profiling", # parameter enforced by GDCquery
  data.type = "Gene Expression Quantification",
  experimental.strategy = "RNA-Seq",
  workflow.type = "STAR - Counts",
  sample.type = c("Primary Tumor", "Solid Tissue Normal"))

Next, we need to download the files from the query. Before, be sure that you set your current working directory to the place you want to save your data. TCGA will save the data in a directory structure starting with a directory ā€œGDCdataā€.

Let us now download the files specified in the query.

GDCdownload(query = query_TCGA)

Given that you need to download many files, the previous operation might take some time. Often the download fails for one or another file. You can re-run the previous command until no error message is given. The method will recognize that the data has already been downloaded and won’t download the data again.

Finally, lets load the actual RNAseq data into R. Remember that the output directory set must be the same to where you downloaded the data.

tcga_data = GDCprepare(query_TCGA)

We can then check the size of the object with the command.

dim(tcga_data)
## [1] 60660   421

There are 3 functions that allow us to access to most important data present in this object, these are: colData(), rowData(), assays(). Use the command ?SummarizedExperiment to find more details. colData() allows us to access the clinical data associated with our samples. The functions colnames() and rownames() can be used to extract the column and rows names from a given table respectively. But be careful if you have large objects. The colnames() and rownames() functions return all column and row names, even if you have tens of thousands. The combination header(colnames()) can make more sense in these cases.

# In R (and other programming languages) you can often
# chain functions to save time and space
colnames(colData(tcga_data))
##  [1] "barcode"                     "patient"                    
##  [3] "sample"                      "shortLetterCode"            
##  [5] "definition"                  "sample_submitter_id"        
##  [7] "sample_type_id"              "tumor_descriptor"           
##  [9] "sample_id"                   "sample_type"                
## [11] "composition"                 "days_to_collection"         
## [13] "state"                       "initial_weight"             
## [15] "pathology_report_uuid"       "submitter_id"               
## [17] "oct_embedded"                "is_ffpe"                    
## [19] "tissue_type"                 "synchronous_malignancy"     
## [21] "ajcc_pathologic_stage"       "days_to_diagnosis"          
## [23] "treatments"                  "last_known_disease_status"  
## [25] "tissue_or_organ_of_origin"   "days_to_last_follow_up"     
## [27] "age_at_diagnosis"            "primary_diagnosis"          
## [29] "prior_malignancy"            "year_of_diagnosis"          
## [31] "prior_treatment"             "ajcc_staging_system_edition"
## [33] "ajcc_pathologic_t"           "morphology"                 
## [35] "ajcc_pathologic_n"           "ajcc_pathologic_m"          
## [37] "classification_of_tumor"     "diagnosis_id"               
## [39] "icd_10_code"                 "site_of_resection_or_biopsy"
## [41] "tumor_grade"                 "progression_or_recurrence"  
## [43] "alcohol_history"             "exposure_id"                
## [45] "race"                        "gender"                     
## [47] "ethnicity"                   "vital_status"               
## [49] "age_at_index"                "days_to_birth"              
## [51] "year_of_birth"               "demographic_id"             
## [53] "days_to_death"               "year_of_death"              
## [55] "bcr_patient_barcode"         "primary_site"               
## [57] "project_id"                  "disease_type"               
## [59] "name"                        "releasable"                 
## [61] "released"                    "sample.aux"

This link provides a basic explanation about tcga_data. Note that both clinical and expression data are present in this object.

Lets look at some potentially interesting features. The table() function (in this context) produces a small summary with the sum of each of the factors present in a given column.

table(tcga_data@colData$vital_status)
## 
##        Alive         Dead Not Reported 
##          255          164            2
table(tcga_data@colData$ajcc_pathologic_stage)
## 
##    Stage I   Stage II  Stage III Stage IIIA Stage IIIB Stage IIIC   Stage IV 
##        189         97          6         73          8         10          3 
##  Stage IVA  Stage IVB 
##          1          2
table(tcga_data@colData$definition)
## 
## Primary solid Tumor Solid Tissue Normal 
##                 371                  50
table(tcga_data@colData$tissue_or_organ_of_origin)
## 
## Liver 
##   421
table(tcga_data@colData$gender)
## 
## female   male 
##    143    278
table(tcga_data@colData$race)
## 
## american indian or alaska native                            asian 
##                                2                              164 
##        black or african american                     not reported 
##                               24                               13 
##                            white 
##                              218

Is there a particular column (feature) that allows you to distinguish tumor tissue from normal tissue?

What about the RNA-seq data? We can use the assay function to obtain the RNA-seq count matrices and rowData to see gene mapping information. Can you tell how many genes and how many samples are included there?

dim(assay(tcga_data))     # gene expression matrices.
## [1] 60660   421
head(assay(tcga_data)[,1:2]) # expression of first 6 genes and first 2 samples
##                    TCGA-DD-A1EG-11A-11R-A213-07 TCGA-RC-A7SF-01A-11R-A352-07
## ENSG00000000003.15                         3729                         2289
## ENSG00000000005.6                            14                            1
## ENSG00000000419.13                          954                          504
## ENSG00000000457.14                          327                          311
## ENSG00000000460.17                           56                           77
## ENSG00000000938.13                          598                          116
head(rowData(tcga_data))     # ensembl id and gene id of the first 6 genes.
## DataFrame with 6 rows and 10 columns
##                      source     type     score     phase            gene_id
##                    <factor> <factor> <numeric> <integer>        <character>
## ENSG00000000003.15   HAVANA     gene        NA        NA ENSG00000000003.15
## ENSG00000000005.6    HAVANA     gene        NA        NA  ENSG00000000005.6
## ENSG00000000419.13   HAVANA     gene        NA        NA ENSG00000000419.13
## ENSG00000000457.14   HAVANA     gene        NA        NA ENSG00000000457.14
## ENSG00000000460.17   HAVANA     gene        NA        NA ENSG00000000460.17
## ENSG00000000938.13   HAVANA     gene        NA        NA ENSG00000000938.13
##                         gene_type   gene_name       level     hgnc_id
##                       <character> <character> <character> <character>
## ENSG00000000003.15 protein_coding      TSPAN6           2  HGNC:11858
## ENSG00000000005.6  protein_coding        TNMD           2  HGNC:17757
## ENSG00000000419.13 protein_coding        DPM1           2   HGNC:3005
## ENSG00000000457.14 protein_coding       SCYL3           2  HGNC:19285
## ENSG00000000460.17 protein_coding    C1orf112           2  HGNC:25565
## ENSG00000000938.13 protein_coding         FGR           2   HGNC:3697
##                             havana_gene
##                             <character>
## ENSG00000000003.15 OTTHUMG00000022002.2
## ENSG00000000005.6  OTTHUMG00000022001.2
## ENSG00000000419.13 OTTHUMG00000032742.2
## ENSG00000000457.14 OTTHUMG00000035941.6
## ENSG00000000460.17 OTTHUMG00000035821.9
## ENSG00000000938.13 OTTHUMG00000003516.3

Finally, we can use some core functionality of R to save the TCGA_data as a .RDS file. This is faster than repeating the previous operations and useful if you need to work in your data for several days. To keep things tidy, we create a directory called data and save our file there.

dir.create("data")
## Warning in dir.create("data"): 'data' already exists
# Save the data as a file, if you need it later, you can just load this file
# instead of having to run the whole pipeline again
saveRDS(object = tcga_data,
        file = file.path("data", "tcga_data.RDS"),
        compress = FALSE)

The data can be loaded with the following command:

tcga_data = readRDS(file = file.path("data","tcga_data.RDS"))

which is the same as:

tcga_data = readRDS(file = "./data/tcga_data.RDS")

3 RNASeq Normalization

3.1 Defining a pipeline

Basics

A typical task on RNA-Seq data is differential expression (DE) analysis, based on some clinical phenotypes. This, in turn, requires normalization of the data, as in its raw format it may have batch effects and other artifacts.

A common approach to such complex tasks is to define a computational pipeline, performing several steps in sequence, allowing the user to select different parameters.

We will now define and run one such pipeline, through the use of an R function.

The function is called limma_pipeline(tcga_data, condition_variable, reference_group), where tcga_data is the data object we have gotten from TCGA and condition_variable is the interesting variable/condition by which you want to group your patient samples. You can also decide which one of the values of your conditional variable is going to be the reference group, with the reference_group parameter.

This function returns a list with three different objects:

  • A complex object, resulting from running voom, this contains the TMM+voom normalized data;
  • A complex object, resulting from running eBayes, this contains the the fitted model plus a number of statistics related to each of the probes;
  • A simple table, resulting from running topTable, which contains the top 100 differentially expressed genes sorted by p-value. This is how the code of this function:
limma_pipeline = function(
  tcga_data,
  condition_variable,
  reference_group = NULL){

  design_factor = colData(tcga_data)[, condition_variable, drop = TRUE]

  group = factor(design_factor)

  if(!is.null(reference_group)){
    group = relevel(group, ref = reference_group)
  }

  design = model.matrix(~ group)

  dge = DGEList(
    counts = assay(tcga_data),
    samples = colData(tcga_data),
    genes = as.data.frame(rowData(tcga_data))
  )

  # filtering
  keep = filterByExpr(dge, design)
  dge = dge[keep, , keep.lib.sizes = FALSE]
  rm(keep)

  # Normalization (TMM followed by voom)
  dge = calcNormFactors(dge)
  v = voom(dge, design, plot = TRUE)

  # Fit model to data given design
  fit = lmFit(v, design)
  fit = eBayes(fit)

  # Show top genes
  topGenes = topTable(fit, coef = ncol(design), number = 100, sort.by = "p")

  return(
    list(
      voomObj = v, # normalized data
      fit = fit, # fitted model and statistics
      topGenes = topGenes # the 100 most differentially expressed genes
    )
  )
}

Please read the Details tab for a step by step explanation of what this pipeline does. For now, copy the limma_pipeline function in your R console, so we can start using it on the TCGA data.

With the following command, we can obtain the DE analysis comparing Primary solid Tumor samples against Solid Tissue Normal. This will be used in the next section, on the classification task.

limma_res = limma_pipeline(
  tcga_data = tcga_data,
  condition_variable = "definition",
  reference_group = "Solid Tissue Normal"
)

Let’s save this object to file, like we did with tcga_data:

# Save the data as a file, if you need it later, you can just load this file
# instead of having to run the whole pipeline again
saveRDS(
  object = limma_res,
  file = file.path("data", "limma_res.RDS"),
  compress = FALSE
)

As an example, we also show here how we can use limma_pipeline to perform DE analysis by grouping patients by gender instead of by tissue type.

gender_limma_res = limma_pipeline(
  tcga_data = tcga_data,
  condition_variable = "gender",
  reference_group = "female"
)

Details

In our pipeline function, we use the package limma. We will select a particular clinical feature of the data to use as class for grouping the samples as either tumor vs normal tissue. This data is available under the column definition for tcga_data, but needs the use of the function colData to be accessed. In addition, limma requires this data to be a factor, so we convert it as such:

clinical_data = colData(tcga_data)
group = factor(clinical_data$definition)

As seen before, we have 2 distinct groups of tissues defined in this column, Solid Tissue Normal (our control samples) and Primary solid Tumor (our samples of interest). In this factor, we also want to define Solid Tissue Normal as being our base or reference level.

group = relevel(group, ref = "Solid Tissue Normal")

Next, we need to create a design matrix, which will indicate the conditions to be compared by the DE analysis. The ~ symbol represents that we are constructing a formula.

design = model.matrix(~group)
head(design)
##   (Intercept) groupPrimary solid Tumor
## 1           1                        0
## 2           1                        1
## 3           1                        1
## 4           1                        1
## 5           1                        1
## 6           1                        1

Before performing DE analysis, we remove genes, which have low amount of counts. We transform our tcga_data object as DGEList, which provides functions for filtering. By default genes with counts with less than 10 reads are removed.

dge = DGEList( # creating a DGEList object
  counts = assay(tcga_data),
  samples = colData(tcga_data),
  genes = as.data.frame(rowData(tcga_data))
)

# filtering
keep = filterByExpr(dge, design) # defining which genes to keep
dge = dge[keep, , keep.lib.sizes = FALSE] # filtering the dge object
rm(keep) #  use rm() to remove objects from memory if you don't need them anymore

Before we fit a model to our data, we normalize the data to minimize batch effects and technical variation with the TMM (trimmed mean of M-values) normalization method. Moreover, to apply limma on RNA-seq, we need to convert the data to have a similar variance as arrays. This is done with the VOOM method.

dge = calcNormFactors(dge, method = "TMM")
v = voom(dge, design, plot = TRUE)

Finally, using lmFit lets fit a series of linear models, one to each of the probes. These data will then be fed to eBayes to produce a complex object which holds a number of statistics that we can use to rank the differentially expressed genes.

fit = lmFit(v, design)
fit = eBayes(fit)

Using the function topTable we can check the top10 genes classified as being differentially expressed. ’

topGenes = topTable(fit, coef = 1, sort.by = "p")
print(topGenes)
##                    source type score phase            gene_id      gene_type
## ENSG00000009307.16 HAVANA gene    NA    NA ENSG00000009307.16 protein_coding
## ENSG00000011304.22 HAVANA gene    NA    NA ENSG00000011304.22 protein_coding
## ENSG00000022840.16 HAVANA gene    NA    NA ENSG00000022840.16 protein_coding
## ENSG00000044115.21 HAVANA gene    NA    NA ENSG00000044115.21 protein_coding
## ENSG00000048828.17 HAVANA gene    NA    NA ENSG00000048828.17 protein_coding
## ENSG00000054118.15 HAVANA gene    NA    NA ENSG00000054118.15 protein_coding
## ENSG00000057608.17 HAVANA gene    NA    NA ENSG00000057608.17 protein_coding
## ENSG00000067560.13 HAVANA gene    NA    NA ENSG00000067560.13 protein_coding
## ENSG00000068697.7  HAVANA gene    NA    NA  ENSG00000068697.7 protein_coding
## ENSG00000070831.17 HAVANA gene    NA    NA ENSG00000070831.17 protein_coding
##                    gene_name level    hgnc_id           havana_gene    logFC
## ENSG00000009307.16     CSDE1     1 HGNC:29905  OTTHUMG00000012060.6 8.426555
## ENSG00000011304.22     PTBP1     2  HGNC:9583 OTTHUMG00000181789.13 7.266500
## ENSG00000022840.16     RNF10     1 HGNC:10055 OTTHUMG00000168999.17 6.839278
## ENSG00000044115.21    CTNNA1     1  HGNC:2509  OTTHUMG00000163502.5 7.433791
## ENSG00000048828.17   FAM120A     2 HGNC:13247  OTTHUMG00000020252.4 7.202252
## ENSG00000054118.15    THRAP3     2 HGNC:22964  OTTHUMG00000007866.8 6.646039
## ENSG00000057608.17      GDI2     1  HGNC:4227 OTTHUMG00000017607.10 7.941541
## ENSG00000067560.13      RHOA     1   HGNC:667  OTTHUMG00000156838.8 7.947602
## ENSG00000068697.7    LAPTM4A     2  HGNC:6924  OTTHUMG00000090750.3 8.189099
## ENSG00000070831.17     CDC42     2  HGNC:1736 OTTHUMG00000002753.10 7.009241
##                     AveExpr        t P.Value adj.P.Val        B
## ENSG00000009307.16 8.203491 125.3471       0         0 759.8161
## ENSG00000011304.22 7.489407 159.3929       0         0 857.5700
## ENSG00000022840.16 6.882460 136.6479       0         0 795.1348
## ENSG00000044115.21 7.801750 121.1519       0         0 746.2174
## ENSG00000048828.17 6.997776 129.3667       0         0 772.8610
## ENSG00000054118.15 6.365231 130.6175       0         0 776.8190
## ENSG00000057608.17 7.627046 131.7043       0         0 779.9930
## ENSG00000067560.13 8.089706 158.9324       0         0 856.2014
## ENSG00000068697.7  7.866955 125.1309       0         0 759.1753
## ENSG00000070831.17 6.925490 125.9530       0         0 762.0301

topTable returns a table with the top genes sorted by P.value expressing if the gene is differentially expressed.

3.2 Visualization

We have also prepared a function that produces PCA plots given the voom object created by the limma_pipeline function. You can inspect this function and try to figure out how it works.

plot_PCA = function(voomObj, condition_variable){
  group = factor(voomObj$targets[, condition_variable])
  pca = prcomp(t(voomObj$E))
  # Take PC1 and PC2 for the plot
  plot(pca$x[, 1:2], col = group, pch = 19)
  # include a legend for points
  legend(
    "bottomleft",
    inset = .01,
    levels(group),
    pch = 19,
    col = seq_along(levels(group))
  )
  return(pca)
}

By calling the function plot_PCA, we get a plot of the first two principal components:

res_pca = plot_PCA(limma_res$voomObj, "definition")

This plot shows that the two sample groups (tumor tissue and healthy tissue) have a well-separated RNA expression profile.

With all the information already available to us created with the limma_pipeline function, we can also, for example, create a heatmap picturing the expression of the top20 differentially expressed genes on our samples.

# first lets get the normalized expression matrix from our limma_res object
expr_mat = as.matrix(t(limma_res$voomObj$E))

# then lets get gene names that are easier to look at
gene_names = limma_res$voomObj$genes[, "gene_name"]
# and use these to rename the genes in our expression matrix
colnames(expr_mat) = gene_names

# we want to get the top20 DE genes
# topGenes is already sorted by the adjusted p-value (in ascending order)
head(limma_res$topGenes, 20)
##                    source type score phase            gene_id      gene_type
## ENSG00000104938.18 HAVANA gene    NA    NA ENSG00000104938.18 protein_coding
## ENSG00000165682.14 HAVANA gene    NA    NA ENSG00000165682.14 protein_coding
## ENSG00000263761.3  HAVANA gene    NA    NA  ENSG00000263761.3 protein_coding
## ENSG00000163217.2  HAVANA gene    NA    NA  ENSG00000163217.2 protein_coding
## ENSG00000136011.15 HAVANA gene    NA    NA ENSG00000136011.15 protein_coding
## ENSG00000164619.10 HAVANA gene    NA    NA ENSG00000164619.10 protein_coding
## ENSG00000145708.11 HAVANA gene    NA    NA ENSG00000145708.11 protein_coding
## ENSG00000160339.16 HAVANA gene    NA    NA ENSG00000160339.16 protein_coding
## ENSG00000019169.10 HAVANA gene    NA    NA ENSG00000019169.10 protein_coding
## ENSG00000251049.2  HAVANA gene    NA    NA  ENSG00000251049.2         lncRNA
## ENSG00000184374.3  HAVANA gene    NA    NA  ENSG00000184374.3 protein_coding
## ENSG00000142748.13 HAVANA gene    NA    NA ENSG00000142748.13 protein_coding
## ENSG00000145824.13 HAVANA gene    NA    NA ENSG00000145824.13 protein_coding
## ENSG00000160323.19 HAVANA gene    NA    NA ENSG00000160323.19 protein_coding
## ENSG00000164100.9  HAVANA gene    NA    NA  ENSG00000164100.9 protein_coding
## ENSG00000114812.13 HAVANA gene    NA    NA ENSG00000114812.13 protein_coding
## ENSG00000164161.10 HAVANA gene    NA    NA ENSG00000164161.10 protein_coding
## ENSG00000181072.11 HAVANA gene    NA    NA ENSG00000181072.11 protein_coding
## ENSG00000160801.14 HAVANA gene    NA    NA ENSG00000160801.14 protein_coding
## ENSG00000183287.14 HAVANA gene    NA    NA ENSG00000183287.14 protein_coding
##                     gene_name level    hgnc_id           havana_gene     logFC
## ENSG00000104938.18     CLEC4M     2 HGNC:13523  OTTHUMG00000182432.4 -9.230079
## ENSG00000165682.14     CLEC1B     1 HGNC:24356  OTTHUMG00000168502.1 -7.672067
## ENSG00000263761.3        GDF2     2  HGNC:4217  OTTHUMG00000188320.2 -9.252944
## ENSG00000163217.2       BMP10     2 HGNC:20869  OTTHUMG00000129573.2 -7.967087
## ENSG00000136011.15      STAB2     2 HGNC:18629  OTTHUMG00000170056.2 -6.135552
## ENSG00000164619.10      BMPER     2 HGNC:24154 OTTHUMG00000128675.24 -6.068348
## ENSG00000145708.11      CRHBP     2  HGNC:2356  OTTHUMG00000102133.3 -6.477126
## ENSG00000160339.16       FCN2     2  HGNC:3624  OTTHUMG00000020892.2 -7.521944
## ENSG00000019169.10      MARCO     2  HGNC:6895  OTTHUMG00000131400.4 -7.222983
## ENSG00000251049.2  AC107396.1     2       <NA>  OTTHUMG00000162168.2 -4.878924
## ENSG00000184374.3     COLEC10     2  HGNC:2220  OTTHUMG00000164971.2 -5.970541
## ENSG00000142748.13       FCN3     2  HGNC:3625  OTTHUMG00000005722.2 -5.866426
## ENSG00000145824.13     CXCL14     2 HGNC:10640  OTTHUMG00000129139.7 -6.679269
## ENSG00000160323.19   ADAMTS13     2  HGNC:1366  OTTHUMG00000020876.4 -3.143669
## ENSG00000164100.9       NDST3     2  HGNC:7682  OTTHUMG00000132959.5 -5.221459
## ENSG00000114812.13      VIPR1     1 HGNC:12694  OTTHUMG00000131792.9 -4.663159
## ENSG00000164161.10       HHIP     2 HGNC:14866  OTTHUMG00000161428.3 -6.529694
## ENSG00000181072.11      CHRM2     2  HGNC:1951  OTTHUMG00000155658.3 -5.542917
## ENSG00000160801.14      PTH1R     2  HGNC:9608  OTTHUMG00000133515.6 -4.167186
## ENSG00000183287.14      CCBE1     2 HGNC:29426  OTTHUMG00000180087.6 -5.430651
##                        AveExpr         t       P.Value     adj.P.Val        B
## ENSG00000104938.18 -3.29792212 -45.73718 7.192449e-166 1.634844e-161 367.9031
## ENSG00000165682.14 -3.36630698 -38.85242 1.182235e-141 1.343610e-137 312.4139
## ENSG00000263761.3  -2.97511698 -38.49714 2.451971e-140 1.857777e-136 309.7851
## ENSG00000163217.2  -3.56629765 -37.47196 1.687960e-136 9.591831e-133 300.6683
## ENSG00000136011.15 -0.87316796 -35.47037 7.669370e-129 3.486495e-125 283.6029
## ENSG00000164619.10 -1.73614838 -32.47381 5.699011e-117 2.158975e-113 256.2907
## ENSG00000145708.11  0.50819057 -31.81064 2.807402e-114 9.116035e-111 250.3490
## ENSG00000160339.16 -0.71253149 -31.77862 3.792409e-114 1.077518e-110 250.0199
## ENSG00000019169.10  0.02106692 -30.75429 6.095338e-110 1.539412e-106 240.3967
## ENSG00000251049.2  -5.58008401 -29.74075 1.000157e-105 2.273357e-102 229.2715
## ENSG00000184374.3   0.41556403 -28.60462 6.106177e-101  1.261758e-97 219.7624
## ENSG00000142748.13  1.97353779 -28.11263  7.540429e-99  1.428283e-95 214.9809
## ENSG00000145824.13 -0.50761409 -26.98018  5.418220e-94  9.473550e-91 203.8152
## ENSG00000160323.19  2.63404340 -26.81833  2.707197e-93  4.395327e-90 202.2345
## ENSG00000164100.9  -4.20542375 -26.40795  1.617913e-91  2.451677e-88 197.4752
## ENSG00000114812.13  0.75603499 -26.26335  6.863412e-91  9.750334e-88 196.7036
## ENSG00000164161.10 -1.63232306 -25.69212  2.107944e-88  2.818445e-85 190.9591
## ENSG00000181072.11 -4.42726426 -25.33904  7.370333e-87  9.307093e-84 186.8836
## ENSG00000160801.14  1.07293354 -25.07932  1.013441e-85  1.212396e-82 184.8461
## ENSG00000183287.14 -1.45879734 -24.92830  4.664431e-85  5.301126e-82 183.2542
# therefore we can just go ahead and select the first 20 values
top_de_genes = limma_res$topGenes$gene_name[1:20]

# we'll get the sample type data so that we can show the groups in the heatmap
sample_type = factor(limma_res$voomObj$targets$sample_type)

# define the color palette for the plot
hmcol = colorRampPalette(rev(brewer.pal(9, "RdBu")))(256)

# perform complete linkage clustering
clust_func = function(x) hclust(x, method = "complete")

# use the inverse of correlation as distance.
dist_func = function(x) as.dist((1 - cor(t(x))) / 2)

# A good looking heatmap involves a lot of parameters and tinkering
gene_heatmap = heatmap.2(
  t(expr_mat[, top_de_genes]),
  scale = "row",          # scale the values for each gene (row)
  density.info = "none",  # turns off density plot inside color legend
  trace = "none",         # turns off trace lines inside the heat map
  col = hmcol,            # define the color map
  labCol = FALSE,         # Not showing column labels
  ColSideColors = as.character(as.numeric(sample_type)), # Show colors for each response class
  dendrogram = "both",    # Show dendrograms for both axis
  hclust = clust_func,  # Define hierarchical clustering method
  distfun = dist_func,  # Using correlation coefficient for distance function
  cexRow = 1.0,           # Resize row labels
  keysize = 1.25,       # Size of the legend
  margins = c(1, 6)        # Define margin spaces
)