Dai, Zhuangzhuang, Tran, Vu, Markham, Andrew, Trigoni, Niki, Rahman, M. Arif, Wijayasingha, L.N.S., Stankovic, John and Li, Chen (2024). EgoCap and EgoFormer: First-person image captioning with context fusion. Pattern Recognition Letters, 181, pp. 50–56.
Abstract
First-person captioning is significant because it provides accurate descriptions of egocentric scenes from a unique perspective. There is also a need to caption scenes in an egocentric narrative, a.k.a. life-logging, for patients, travellers, and emergency responders. Ego-captioning is non-trivial because (1) ego-images can be noisy due to camera motion and viewing angles; (2) describing a scene in a first-person narrative involves drastically different semantics; and (3) inferences must be drawn beyond visual appearance because the cameraperson is often outside the field of view. We note that humans make good sense of casual footage thanks to contextual awareness of when and where an event unfolds and with whom the cameraperson is interacting. This inspires the fusion of such "context" for situation-aware captioning. To close the gap left by the lack of ego-captioning datasets, we create EgoCap, which contains 2.1K ego-images, over 10K ego-captions, and 6.3K contextual labels. We propose EgoFormer, a dual-encoder transformer-based network that fuses contextual and visual features. The context encoder is pre-trained on ImageNet and then fine-tuned on context classification tasks. Similar to visual attention, stacked multi-head attention layers in the captioning decoder reinforce attention to the context features. EgoFormer achieves state-of-the-art performance on EgoCap with a CIDEr score of 125.52. The EgoCap dataset and EgoFormer are publicly available at https://github.com/zdai257/EgoCap-EgoFormer.
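The sketch below is a minimal PyTorch illustration of the dual-encoder context-fusion idea described in the abstract: a visual feature stream and a context feature stream are combined into one memory sequence that stacked multi-head attention layers in the captioning decoder attend to. It is not the authors' code; the feature dimensions, the stub linear projections standing in for the pre-trained backbones, the auxiliary context-classification head, and the choice to concatenate a single context token into the decoder memory are all assumptions for illustration. The authors' implementation is available at the GitHub link above.

```python
# Minimal sketch of a dual-encoder captioning model with context fusion.
# All names, dimensions, and the fusion-by-concatenation scheme are
# illustrative assumptions, not the EgoFormer implementation.
import torch
import torch.nn as nn


class DualEncoderCaptioner(nn.Module):
    def __init__(self, vocab_size=10000, d_model=512, nhead=8,
                 num_decoder_layers=3, num_context_classes=32):
        super().__init__()
        # Stand-in for a pre-trained visual backbone producing patch features.
        self.visual_proj = nn.Linear(2048, d_model)
        # Stand-in for the context encoder (pre-trained on ImageNet, then
        # fine-tuned on context classification such as when/where/whom labels).
        self.context_proj = nn.Linear(2048, d_model)
        self.context_head = nn.Linear(d_model, num_context_classes)

        self.token_embed = nn.Embedding(vocab_size, d_model)
        decoder_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        # Stacked multi-head attention layers attend to the fused memory.
        self.decoder = nn.TransformerDecoder(decoder_layer, num_decoder_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, visual_feats, context_feats, caption_tokens):
        # visual_feats: (B, N_patches, 2048); context_feats: (B, 2048)
        vis = self.visual_proj(visual_feats)                  # (B, N, d)
        ctx = self.context_proj(context_feats).unsqueeze(1)   # (B, 1, d)
        memory = torch.cat([vis, ctx], dim=1)                 # fused visual + context memory
        tgt = self.token_embed(caption_tokens)                # (B, T, d)
        T = tgt.size(1)
        # Causal mask so each caption token only attends to earlier tokens.
        causal_mask = torch.triu(torch.full((T, T), float('-inf')), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=causal_mask)
        # Caption logits plus an auxiliary context-classification output.
        return self.lm_head(out), self.context_head(ctx.squeeze(1))


# Toy usage with random tensors
model = DualEncoderCaptioner()
caption_logits, ctx_logits = model(torch.randn(2, 49, 2048),
                                   torch.randn(2, 2048),
                                   torch.randint(0, 10000, (2, 12)))
print(caption_logits.shape, ctx_logits.shape)  # (2, 12, 10000) (2, 32)
```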
Publication DOI: | https://doi.org/10.1016/j.patrec.2024.03.012 |
Divisions: | College of Engineering & Physical Sciences > School of Computer Science and Digital Technologies > Applied AI & Robotics; College of Engineering & Physical Sciences > Smart and Sustainable Manufacturing; College of Engineering & Physical Sciences > Aston Centre for Artificial Intelligence Research and Application; College of Engineering & Physical Sciences > School of Computer Science and Digital Technologies; College of Engineering & Physical Sciences |
Funding Information: | This research is supported by EPSRC project “ACE-OPS: From Autonomy to Cognitive assistance in Emergency OPerationS”, United Kingdom (EP/S030832/1) and by NIST project “Pervasive, Accurate, and Reliable Location Based Services for Emergency Responders, Un |
Additional Information: | Copyright © 2024 Elsevier B.V. This manuscript version is made available under the CC-BY-NC-ND 4.0 license https://creativecommons.org/licenses/by-nc-nd/4.0/ |
Uncontrolled Keywords: | image captioning, storytelling, dataset |
Publication ISSN: | 1872-7344 |
Last Modified: | 11 Nov 2024 09:03 |
Date Deposited: | 02 May 2024 15:05 |
Related URLs: | https://www.sci ... 167865524000801 (Publisher URL); http://www.scop ... tnerID=8YFLogxK (Scopus URL) |
PURE Output Type: | Article |
Published Date: | 2024-05 |
Published Online Date: | 2024-03-20 |
Accepted Date: | 2024-03-15 |
Authors: | Dai, Zhuangzhuang (ORCID: 0000-0002-6098-115X); Tran, Vu; Markham, Andrew; Trigoni, Niki; Rahman, M. Arif; Wijayasingha, L.N.S.; Stankovic, John; Li, Chen |
Download
Version: Accepted Version
Access Restriction: Restricted to Repository staff only until 20 March 2025.
License: Creative Commons Attribution Non-commercial No Derivatives