Biomedical data often contain missing values, and in many applications missing value imputation (MVI) is an important part of the data analysis workflow. However, the performance of MVI methods depends on details of the joint distribution of data and missingness patterns that are typically unknown in practice, making an a priori choice of MVI method challenging. Furthermore, the technical assumptions underlying MVI methods can be hard to verify directly. Motivated by these issues, we propose an approach for the context-specific selection of MVI methods. Because different methods may work well in different settings, we argue for a move away from a "one size fits all" view and put forward a standardized, empirical approach in which MVI methods are benchmarked in the specific context of a problem of interest. We connect our work to the large body of MVI research, along the way refining definitions of missing at random and missing not at random and providing a detailed review of existing work on benchmarking. Our approach can be tailored to reflect specific assumptions on missingness patterns, allowing for use in diverse applied problems. In addition to using real data, we study benchmarking via data simulation spanning a broad range of properties, such as latent factors, non-linearity and multi-modality, with interpretable simulation parameters that are amenable to user specification. The approaches we propose can be used to (i) select an MVI method for a given data set or (ii) benchmark a novel MVI method across a range of regimes. Alongside the general protocol, we provide a specific, reproducible implementation (the R package ImputeBench, available at github.com/richterrob/ImputeBench) that gives users a ready-to-use tool for MVI selection and assessment.
We illustrate the use of ImputeBench to study the behaviour of a range of existing imputation methods (k-nn, soft impute, missForest, MICE) in the context of real data from an ongoing large-scale population-level study.
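The benchmarking idea described above (hide some entries whose values are known, impute them with each candidate method, and compare against the hidden truth) can be illustrated with a minimal Python sketch. This is not the ImputeBench API and not the authors' protocol: all function names are hypothetical, the additional masking is assumed to be completely at random, the data are simulated from an assumed two-factor latent model, and the low-rank imputer is a simple iterative SVD stand-in rather than any of the methods named above.

```python
import numpy as np

def mask_additional(X, frac, rng):
    """Hide a random fraction of the observed entries (assumes MCAR masking)."""
    observed = ~np.isnan(X)
    idx = np.argwhere(observed)
    n_hide = int(round(frac * len(idx)))
    chosen = idx[rng.choice(len(idx), size=n_hide, replace=False)]
    X_masked = X.copy()
    X_masked[chosen[:, 0], chosen[:, 1]] = np.nan
    held_out = np.zeros_like(observed)  # boolean mask of the hidden entries
    held_out[chosen[:, 0], chosen[:, 1]] = True
    return X_masked, held_out

def impute_mean(X):
    """Baseline: replace each missing entry by its column mean."""
    col_means = np.nanmean(X, axis=0)
    return np.where(np.isnan(X), col_means, X)

def impute_low_rank(X, rank=2, n_iter=50):
    """Illustrative stand-in: iterate a truncated SVD fit on the filled matrix."""
    missing = np.isnan(X)
    filled = impute_mean(X)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        filled = np.where(missing, approx, X)  # keep observed values fixed
    return filled

def benchmark(X, methods, frac=0.2, n_rep=5, seed=0):
    """Average RMSE on held-out entries for each candidate method."""
    rng = np.random.default_rng(seed)
    scores = {name: [] for name in methods}
    for _ in range(n_rep):
        X_masked, held_out = mask_additional(X, frac, rng)
        for name, impute in methods.items():
            err = impute(X_masked)[held_out] - X[held_out]
            scores[name].append(float(np.sqrt(np.mean(err ** 2))))
    return {name: float(np.mean(v)) for name, v in scores.items()}

# Simulated data with two latent factors (an assumption made for this sketch).
rng = np.random.default_rng(1)
Z = rng.normal(size=(200, 2))
W = rng.normal(size=(2, 10))
X = Z @ W + 0.1 * rng.normal(size=(200, 10))
X[rng.random(X.shape) < 0.1] = np.nan  # pre-existing missing values

methods = {"mean": impute_mean, "low_rank": impute_low_rank}
results = benchmark(X, methods)
print(results)
```

Because scores are computed only on entries that were observed before masking, the comparison needs no knowledge of the values that are truly missing. Swapping the masking function for one that encodes a different missingness assumption changes the context being benchmarked without touching the rest of the loop.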
Competing Interest Statement
The authors have declared no competing interest.
Funding Statement
This work was supported in part by the German Federal Ministry of Education and Research (BMBF) project "MechML" and the Diet-Body-Brain Competence Cluster in Nutrition Research funded by the German Federal Ministry of Education and Research ('01EA1410C', '01EA1809C' to M.M.B.B.).
Author Declarations
I confirm all relevant ethical guidelines have been followed, and any necessary IRB and/or ethics committee approvals have been obtained.
Yes
The details of the IRB/oversight body that provided approval or exemption for the research described are given below:
Approval to undertake the Rhineland Study was obtained from the ethics committee of the University of Bonn, Medical Faculty. The study was carried out in accordance with the recommendations of the International Council for Harmonisation Good Clinical Practice standards. We obtained written informed consent from all participants in accordance with the Declaration of Helsinki.
I confirm that all necessary patient/participant consent has been obtained and the appropriate institutional forms have been archived, and that any patient/participant/sample identifiers included were not known to anyone (e.g., hospital staff, patients or participants themselves) outside the research group so cannot be used to identify individuals.
Yes
I understand that all clinical trials and any other prospective interventional studies must be registered with an ICMJE-approved registry, such as ClinicalTrials.gov. I confirm that any such study reported in the manuscript has been registered and the trial registration ID is provided (note: if posting a prospective study registered retrospectively, please provide a statement in the trial ID field explaining why the study was not registered in advance).
Yes
I have followed all appropriate research reporting guidelines, such as any relevant EQUATOR Network research reporting checklist(s) and other pertinent material, if applicable.
Yes
Data Availability
The accompanying R package ImputeBench is available on GitHub: https://github.com/richterrob/ImputeBench and its vignette is available on figshare: https://figshare.com/articles/online_resource/ImputeBench_Vignette_html/23896677. The data from the Rhineland Study used in this manuscript are not publicly available due to data protection regulations. Access to data can be provided to scientists in accordance with the Rhineland Study's Data Use and Access Policy. Requests for additional information and/or access to the datasets of the Rhineland Study can be sent to RS-DUAC@dzne.de.