Look and Listen: A Multi-Modality Late Fusion Approach to Scene Classification for Autonomous Machines

Abstract

The novelty of this study lies in a multi-modality approach to scene classification, where image and audio complement each other in a process of deep late fusion. The approach is demonstrated on a difficult classification problem, consisting of two synchronised and balanced datasets of 16,000 data objects, encompassing 4.4 hours of video of 8 environments with varying degrees of similarity. We first extract video frames and accompanying audio at one-second intervals. The image and audio datasets are then classified independently, using a fine-tuned VGG16 and an evolutionarily optimised deep neural network, with accuracies of 89.27% and 93.72%, respectively. This is followed by late fusion of the two neural networks to enable a higher-order function, leading to an accuracy of 96.81% for this multi-modality classifier with synchronised video frames and audio clips. The tertiary neural network implemented for late fusion outperforms classical state-of-the-art classifiers by around 3% when the two primary networks are treated as feature generators. We show that situations where a single modality may be confused by anomalous data points are corrected through an emerging higher-order integration. Prominent examples include a water feature in a city misclassified as a river by the audio classifier alone, and a densely crowded street misclassified as a forest by the image classifier alone; both are correctly classified by our multi-modality approach.
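The late-fusion scheme described in the abstract can be sketched as follows. This is a minimal NumPy illustration only: the single linear-softmax layers standing in for VGG16, the audio network, and the tertiary fusion network, as well as all feature dimensions, are assumptions for demonstration and do not reflect the paper's actual architectures.

```python
import numpy as np

NUM_CLASSES = 8  # eight environments, per the abstract
rng = np.random.default_rng(0)

def softmax_layer(x, W, b):
    """A single linear + softmax layer, standing in for a primary
    network's class-probability output (illustrative only)."""
    z = x @ W + b
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical feature vectors for one synchronised frame/audio pair.
img_feat = rng.normal(size=(1, 512))   # assumed image feature dim
aud_feat = rng.normal(size=(1, 128))   # assumed audio feature dim

# Primary classifiers (image and audio branches).
p_img = softmax_layer(img_feat, rng.normal(size=(512, NUM_CLASSES)),
                      np.zeros(NUM_CLASSES))
p_aud = softmax_layer(aud_feat, rng.normal(size=(128, NUM_CLASSES)),
                      np.zeros(NUM_CLASSES))

# Late fusion: concatenate the two probability vectors and feed them
# to a tertiary network (here a single layer as a minimal stand-in).
fused_in = np.concatenate([p_img, p_aud], axis=-1)  # shape (1, 16)
p_fused = softmax_layer(fused_in,
                        rng.normal(size=(2 * NUM_CLASSES, NUM_CLASSES)),
                        np.zeros(NUM_CLASSES))

print(p_fused.shape)  # one probability per environment class
```

The design point is that the tertiary network sees both modalities' beliefs at once, so an anomalous prediction from one branch (e.g. a city water feature heard as a river) can be overruled by the other.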

Publication DOI: https://doi.org/10.1109/IROS45743.2020.9341557
Divisions: College of Engineering & Physical Sciences > School of Informatics and Digital Engineering > Computer Science
College of Engineering & Physical Sciences
College of Engineering & Physical Sciences > Aston Institute of Urban Technology and the Environment (ASTUTE)
College of Engineering & Physical Sciences > Systems analytics research institute (SARI)
Additional Information: © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.
Event Title: 2020 IEEE/RSJ International Conference on Intelligent Robots and Systems, IROS 2020
Event Type: Other
Event Dates: 2020-10-24 - 2021-01-24
Uncontrolled Keywords: Image analysis,Neural networks,Urban areas,Forestry,Generators,Intelligent robots,Rivers,Control and Systems Engineering,Software,Computer Vision and Pattern Recognition,Computer Science Applications
ISBN: 978-1-7281-6213-3, 9781728162126
Related URLs: http://www.scop ... tnerID=8YFLogxK (Scopus URL)
https://ieeexpl ... ocument/9341557 (Publisher URL)
PURE Output Type: Conference contribution
Published Date: 2021-02-10
Accepted Date: 2020-06-20
Authors: Bird, Jordan J.
Faria, Diego R. (ORCID Profile 0000-0002-2771-1713)
Premebida, Cristiano
Ekart, Aniko (ORCID Profile 0000-0001-6967-5397)
Vogiatzis, George (ORCID Profile 0000-0002-3226-0603)

Version: Accepted Version
