Dataset similarity to assess semi-supervised learning under distribution mismatch between the labelled and unlabelled datasets

dc.cclicence: CC-BY-NC
dc.contributor.author: Calderon-Ramirez, Saul
dc.contributor.author: Oala, Luis
dc.contributor.author: Torrents-Barrena, Jordina
dc.contributor.author: Yang, Shengxiang
dc.contributor.author: Elizondo, David
dc.contributor.author: Moemeni, Armaghan
dc.contributor.author: Colreavy-Donnelly, Simon
dc.contributor.author: Samek, Wojciech
dc.contributor.author: Molina-Cabello, Miguel
dc.contributor.author: Lopez-Rubio, Ezequiel
dc.date.acceptance: 2022-04-14
dc.date.accessioned: 2022-04-25T12:56:31Z
dc.date.available: 2022-04-25T12:56:31Z
dc.date.issued: 2022-04-22
dc.description: The file attached to this record is the author's final peer-reviewed version. The Publisher's final version can be found by following the DOI link.
dc.description.abstract: Semi-supervised deep learning (SSDL) is a popular strategy for leveraging unlabelled data in machine learning when labelled data is not readily available. In real-world scenarios, several unlabelled data sources are usually available, with varying degrees of distribution mismatch with respect to the labelled dataset. This raises the question of which unlabelled dataset to choose for good SSDL outcomes. Oftentimes, semantic heuristics are used to match unlabelled data with labelled data; however, a quantitative and systematic approach to this selection problem would be preferable. In this work, we first test the SSDL MixMatch algorithm under various distribution-mismatch configurations to study the impact on SSDL accuracy. Then, we propose a quantitative unlabelled-dataset selection heuristic based on dataset dissimilarity measures, designed to systematically assess how distribution mismatch between the labelled and unlabelled datasets affects MixMatch performance. We refer to the proposed measures as deep dataset dissimilarity measures (DeDiMs), designed to compare labelled and unlabelled datasets. They use the feature space of a generic Wide-ResNet, can be applied prior to learning, are quick to evaluate, and are model agnostic. The strong correlation in our tests between MixMatch accuracy and the proposed DeDiMs suggests that this approach is a good fit for quantitatively ranking different unlabelled datasets prior to SSDL training.
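The abstract describes comparing labelled and unlabelled datasets in the feature space of a pretrained network before any SSDL training. A minimal sketch of that idea is below, with two loud assumptions: a fixed random projection stands in for the paper's generic Wide-ResNet feature extractor, and cosine distance between mean feature vectors is one illustrative dissimilarity, not the paper's exact DeDiM definitions.

```python
import numpy as np

# Assumption: a fixed random projection plays the role of the pretrained
# Wide-ResNet feature extractor mentioned in the abstract.
rng = np.random.default_rng(0)
D_IN, D_FEAT = 32, 16
PROJ = rng.standard_normal((D_IN, D_FEAT))

def extract_features(x: np.ndarray) -> np.ndarray:
    """Map raw inputs of shape (n, D_IN) into a shared feature space."""
    return x @ PROJ

def dedim_cosine(labelled: np.ndarray, unlabelled: np.ndarray) -> float:
    """Illustrative dissimilarity: cosine distance between the mean
    feature vectors of the two datasets (0 = identical direction)."""
    mu_l = extract_features(labelled).mean(axis=0)
    mu_u = extract_features(unlabelled).mean(axis=0)
    cos = float(mu_l @ mu_u / (np.linalg.norm(mu_l) * np.linalg.norm(mu_u)))
    return 1.0 - cos

# Rank candidate unlabelled pools by dissimilarity to the labelled set,
# prior to any SSDL training (synthetic data for illustration):
labelled = rng.normal(1.0, 1.0, size=(200, D_IN))
in_dist  = rng.normal(1.0, 1.0, size=(200, D_IN))   # matched distribution
shifted  = rng.normal(-1.0, 1.0, size=(200, D_IN))  # mismatched distribution
ranking = sorted([("in_dist", dedim_cosine(labelled, in_dist)),
                  ("shifted", dedim_cosine(labelled, shifted))],
                 key=lambda kv: kv[1])
```

Under the paper's finding that SSDL accuracy correlates with such measures, the candidate ranked lowest (least dissimilar) would be the preferred unlabelled pool for MixMatch training.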
dc.funder: Other external funder (please detail below)
dc.funder.other: Ministry of Science, Innovation and Universities of Spain
dc.funder.other: Autonomous Government of Andalusia
dc.identifier.citation: S. Calderon-Ramirez, L. Oala, J. Torrents-Barrena, S. Yang, D. Elizondo, A. Moemeni, S. Colreavy-Donnelly, W. Samek, M. A. Molina-Cabello, and E. Lopez-Rubio. (2022) Dataset similarity to assess semi-supervised learning under distribution mismatch between the labelled and unlabelled datasets. IEEE Transactions on Artificial Intelligence, 4 (2), pp. 282-291
dc.identifier.doi: https://doi.org/10.1109/TAI.2022.3168804
dc.identifier.issn: 2691-4581
dc.identifier.uri: https://hdl.handle.net/2086/21835
dc.language.iso: en_US
dc.peerreviewed: Yes
dc.projectid: RTI2018-094645-B-I00
dc.projectid: UMA18-FEDERJA-084
dc.publisher: IEEE
dc.researchinstitute: Institute of Artificial Intelligence (IAI)
dc.subject: Semi-supervised deep learning
dc.subject: MixMatch
dc.subject: Out-of-distribution data
dc.subject: Deep learning
dc.subject: Distribution mismatch
dc.subject: Dataset similarity
dc.title: Dataset similarity to assess semi-supervised learning under distribution mismatch between the labelled and unlabelled datasets
dc.type: Article

Files

Original bundle
Name: IEEETAI22.pdf
Size: 582.02 KB
Format: Adobe Portable Document Format
Description: Main article
Name: Supplement.pdf
Size: 459.37 KB
Format: Adobe Portable Document Format
Description: Supplementary document
License bundle
Name: license.txt
Size: 4.2 KB
Description: Item-specific license agreed upon to submission