Correcting data imbalance for semi-supervised Covid-19 detection using X-ray chest images

Abstract

A key factor in the fight against viral diseases such as the coronavirus (COVID-19) is the identification of virus carriers as early and quickly as possible, in a cheap and efficient manner. The application of deep learning for image classification of chest X-ray images of COVID-19 patients could become a useful pre-diagnostic detection methodology. However, deep learning architectures require large labelled datasets. This is often a limitation when the subject of research is relatively new, as in the case of the virus outbreak, where dealing with small labelled datasets is a challenge. Moreover, in such a context, the datasets are also highly imbalanced, with few observations from positive cases of the new disease. In this work we evaluate the performance of the semi-supervised deep learning architecture known as MixMatch with a very limited number of labelled observations and highly imbalanced labelled datasets. We demonstrate the critical impact of data imbalance on the model's accuracy. Therefore, we propose a simple approach for correcting data imbalance by re-weighting each observation in the loss function, giving a higher weight to the observations corresponding to the under-represented class. For unlabelled observations, we use the pseudo and augmented labels calculated by MixMatch to choose the appropriate weight. The proposed method improved classification accuracy by up to 18% with respect to the non-balanced MixMatch algorithm. We tested our proposed approach with several available datasets using 10, 15 and 20 labelled observations for binary classification (COVID-19 positive and normal cases). For multi-class classification (COVID-19 positive, pneumonia and normal cases), we tested 30, 50, 70 and 90 labelled observations. Additionally, a new dataset is included among the tested datasets, composed of chest X-ray images of Costa Rican adult patients.
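The re-weighting idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the exact weighting formula, so the inverse-frequency weights, the function names and the example class counts below are all assumptions made for illustration. The only parts taken from MixMatch itself are the general form of its two loss terms (cross-entropy for labelled observations, squared error against the pseudo-label for unlabelled ones).

```python
import math

def class_weights(label_counts):
    """Inverse-frequency class weights, normalised to sum to 1.
    Assumed scheme: the abstract does not state the exact formula."""
    inverse = [1.0 / count for count in label_counts]
    total = sum(inverse)
    return [value / total for value in inverse]

def weighted_labelled_loss(probs, labels, weights):
    """Labelled term: cross-entropy of each observation, scaled by the
    weight of its true class, averaged over the batch."""
    losses = [-weights[y] * math.log(p[y]) for p, y in zip(probs, labels)]
    return sum(losses) / len(losses)

def weighted_unlabelled_loss(probs, pseudo_labels, weights):
    """Unlabelled term: squared error against the soft pseudo-label,
    scaled by the weight of the class the pseudo-label favours (argmax)."""
    total = 0.0
    for p, q in zip(probs, pseudo_labels):
        favoured = max(range(len(q)), key=lambda i: q[i])
        total += weights[favoured] * sum((pi - qi) ** 2 for pi, qi in zip(p, q))
    return total / len(probs)

# Hypothetical labelled set: 18 normal images (class 0), 2 COVID-19 positive (class 1).
weights = class_weights([18, 2])

# One labelled positive observation predicted at 50/50.
labelled = weighted_labelled_loss([[0.5, 0.5]], [1], weights)

# One unlabelled observation whose pseudo-label favours the positive class.
unlabelled = weighted_unlabelled_loss([[0.5, 0.5]], [[0.0, 1.0]], weights)
```

With these hypothetical counts the under-represented positive class receives a weight of roughly 0.9 against roughly 0.1 for the normal class, so an error on a positive observation, or on an unlabelled observation pseudo-labelled as positive, contributes about nine times more to the loss.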

Description

The file attached to this record is the author's final peer reviewed version. The Publisher's final version can be found by following the DOI link.

Keywords

Coronavirus, Covid-19, Computer Aided Diagnosis, Data imbalance, Semi-Supervised Learning

Citation

Calderon-Ramirez, S., Yang, S., Moemeni, A., Elizondo, D., Colreavy-Donnelly, S., Chavarria-Estrada, L.F. and Molina-Cabello, M.A. (2021) Correcting data imbalance for semi-supervised Covid-19 detection using X-ray chest images. Applied Soft Computing, 111, 107692.

Research Institute