Impact of Missing Data Imputation Methods on Gene Expression Clustering and Classification

Marcilio CP de Souto, Pablo A Jaskowiak, and Ivan G Costa


Several missing value imputation methods for gene expression data have been proposed in the literature. In the past few years, researchers have put a great deal of effort into systematic evaluations of the different imputation algorithms. Initially, most algorithms were assessed with an emphasis on the accuracy of the imputation, using metrics such as the root mean squared error. However, it has become clear that the success of the estimation of the expression value should be evaluated in more practical terms as well. One can consider, for example, the ability of the method to preserve the significant genes in the dataset, or its discriminative/predictive power for classification/clustering purposes. We performed a broad analysis of the impact of five well-known missing value imputation methods on three clustering and four classification methods, in the context of 12 cancer gene expression datasets. We employed a statistical framework, for the first time in this field, to assess whether different imputation methods improve the performance of the clustering/classification methods. Our results suggest that the imputation methods evaluated have a minor impact on the classification and downstream clustering analyses. Simple methods, such as replacing the missing values with the mean or the median, performed as well as more complex strategies.
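To make the simple baselines concrete, the following is a minimal sketch of mean and median imputation. It assumes that genes are rows, that missing entries are encoded as None, and that the replacement value is computed per gene from its observed values; the function name impute_by_gene and the toy matrix are hypothetical and are not taken from the study.

```python
from statistics import mean, median


def impute_by_gene(matrix, strategy="mean"):
    """Fill None entries in each row (gene) with that row's mean or median.

    Assumption: the statistic is computed per gene over its observed values.
    """
    stat = {"mean": mean, "median": median}[strategy]
    imputed = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        if not observed:
            raise ValueError("a gene with no observed values cannot be imputed")
        fill = stat(observed)
        imputed.append([fill if v is None else v for v in row])
    return imputed


# Toy expression matrix (hypothetical values): 2 genes x 4 samples.
expression = [
    [1.0, None, 3.0, 8.0],
    [0.5, 0.7, None, 0.9],
]
mean_filled = impute_by_gene(expression, "mean")
median_filled = impute_by_gene(expression, "median")
```

For the first gene the two strategies give different fill values, since the observed values 1.0, 3.0 and 8.0 are skewed; this is the kind of difference whose downstream effect the study evaluates.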


One can find the 12 datasets after missing value imputation by the WKNN, LLS, BPCA, EM_array, Mean and Median methods here. Two versions of each dataset are provided: (1) one version after filtering out genes with more than 10% of missing values and (2) another version after filtering out genes with more than 40% of missing values. Datasets are in Weka (ARFF) format.
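The filtering step described above can be sketched as follows. This is an illustration only: it assumes genes are rows, missing entries are encoded as None, and a gene is removed when its fraction of missing values is strictly greater than the threshold. The function name filter_genes and the toy data are hypothetical.

```python
def filter_genes(matrix, max_missing_fraction):
    """Keep rows (genes) whose fraction of missing values is at most the threshold."""
    kept = []
    for row in matrix:
        missing = sum(1 for v in row if v is None)
        if missing / len(row) <= max_missing_fraction:
            kept.append(row)
    return kept


# Toy data (hypothetical): 4 genes x 10 samples with 0%, 10%, 30% and 50% missing.
genes = [
    [1.0] * 10,
    [None] * 1 + [1.0] * 9,
    [None] * 3 + [1.0] * 7,
    [None] * 5 + [1.0] * 5,
]
strict = filter_genes(genes, 0.10)   # analogue of the 10% version
lenient = filter_genes(genes, 0.40)  # analogue of the 40% version
```

In this toy case the 10% threshold keeps two genes and the 40% threshold keeps three, so the lenient version retains more genes but leaves more values to impute.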


We also provide an Excel table with supplementary results here. In short, we describe: (Tab 1) the statistics of all datasets after each filtering and imputation step; (Tabs 2 and 3) the classification error rates of the different classifiers on the datasets after filtering out genes with more than 10% (and 40%) of missing values and applying missing value imputation; (Tabs 4 and 5) the corrected Rand Index of all clustering methods applied to the datasets after the same 10% (and 40%) filtering and missing value imputation; and (Tab 6) the p-values of the Friedman-Nemenyi test for the experiments, including the results of the imputation method EM_array.
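For readers unfamiliar with the corrected Rand Index reported in Tabs 4 and 5, the sketch below computes it from two label vectors using the standard Hubert-Arabie chance correction. It is not the code used to produce the tables; the function name corrected_rand_index and the example labels are hypothetical, and returning 1.0 for the degenerate case of two trivial partitions is a convention assumed here.

```python
from collections import Counter
from math import comb


def corrected_rand_index(labels_true, labels_pred):
    """Corrected (adjusted) Rand index between two partitions of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label vectors must have the same length")
    n = len(labels_true)
    if n < 2:
        raise ValueError("at least two items are required")

    # Contingency counts and their marginals.
    cells = Counter(zip(labels_true, labels_pred))
    rows = Counter(labels_true)
    cols = Counter(labels_pred)

    # Number of item pairs placed together within each cell, row and column.
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in rows.values())
    sum_cols = sum(comb(c, 2) for c in cols.values())

    expected = sum_rows * sum_cols / comb(n, 2)  # agreement expected by chance
    maximum = (sum_rows + sum_cols) / 2          # largest attainable agreement
    if maximum == expected:
        return 1.0  # assumed convention when both partitions are trivial
    return (sum_cells - expected) / (maximum - expected)


# Hypothetical labels: the same grouping with swapped cluster names.
same_grouping = corrected_rand_index([0, 0, 0, 1, 1, 1], [1, 1, 1, 0, 0, 0])
# Hypothetical labels: a clustering that cuts across the true classes.
crossed = corrected_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

The index is 1 for identical groupings regardless of how clusters are named, is close to 0 for agreement at chance level, and can be negative when the agreement is worse than chance.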