MasakhaNEWS: News Topic Classification for African languages

Abstract

African languages are severely under-represented in NLP research due to a lack of datasets covering several NLP tasks. While individual language-specific datasets are being expanded to different tasks, only a handful of NLP tasks (e.g. named entity recognition and machine translation) have standardized benchmark datasets covering several geographically and typologically diverse African languages. In this paper, we develop MasakhaNEWS, a new benchmark dataset for news topic classification covering 16 languages widely spoken in Africa. We provide an evaluation of baseline models by training classical machine learning models and fine-tuning several language models. Furthermore, we explore several alternatives to full fine-tuning of language models that are better suited for zero-shot and few-shot learning, such as cross-lingual parameter-efficient fine-tuning (like MAD-X), pattern exploiting training (PET), prompting language models (like ChatGPT), and prompt-free sentence transformer fine-tuning (SetFit and Cohere Embedding API). Our evaluation in the zero-shot setting shows the potential of prompting ChatGPT for news topic classification in low-resource African languages, achieving an average performance of 70 F1 points without leveraging additional supervision like MAD-X. In the few-shot setting, we show that with as few as 10 examples per label, the PET approach achieves more than 90% (i.e. 86.0 F1 points) of the performance of full supervised training (92.6 F1 points).
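
The few-shot claim in the abstract rests on a simple ratio between two F1 scores. As a minimal sketch, the arithmetic can be checked as follows; the function and constant names are hypothetical and only the two F1 figures come from the abstract.

```python
def relative_performance(few_shot_f1: float, full_f1: float) -> float:
    """Return the few-shot F1 as a percentage of the fully supervised F1."""
    if full_f1 <= 0:
        raise ValueError("full_f1 must be positive")
    return 100.0 * few_shot_f1 / full_f1


# Figures quoted in the abstract (names are illustrative, not from the paper).
PET_10_SHOT_F1 = 86.0       # PET with 10 examples per label
FULL_SUPERVISED_F1 = 92.6   # full supervised training

ratio = relative_performance(PET_10_SHOT_F1, FULL_SUPERVISED_F1)
print(f"{ratio:.1f}% of fully supervised performance")
```

The ratio comes out a little under 93%, which is consistent with the stated "more than 90%".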

Publication DOI: https://doi.org/10.48550/arXiv.2304.09972
Divisions: College of Business and Social Sciences > Aston Business School > Operations & Information Management
Additional Information: This arXiv preprint version of this paper is licensed under a Creative Commons Attribution 4.0 International License (https://creativecommons.org/licenses/by/4.0/).
Uncontrolled Keywords: cs.CL
Last Modified: 05 Aug 2024 09:07
Date Deposited: 05 May 2023 09:28
Related URLs: https://aclanth ... ijcnlp-main.10/ (Publisher URL)
https://github. ... /masakhane-news (Related URL)
https://hugging ... ane/masakhanews (Related URL)
PURE Output Type: Conference contribution
Published Date: 2023-11
Accepted Date: 2023-09-04
Submitted Date: 2023-04-19
Authors: Adelani, David Ifeoluwa
Masiak, Marek
Azime, Israel Abebe
Alabi, Jesujoba Oluwadara
Tonja, Atnafu Lambebo
Mwase, Christine
Ogundepo, Odunayo
Dossou, Bonaventure F. P.
Oladipo, Akintunde
Nixdorf, Doreen
Emezue, Chris Chinenye
al-azzawi, Sana Sabah
Sibanda, Blessing K.
David, Davis
Ndolela, Lolwethu
Mukiibi, Jonathan
Ajayi, Tunde Oluwaseyi
Ngoli, Tatiana Moteu
Odhiambo, Brian
Owodunni, Abraham Toluwase
Obiefuna, Nnaemeka C.
Mohamed, Muhidin
Muhammad, Shamsuddeen Hassan
Ababu, Teshome Mulugeta
Abdullahi, Saheed Salahudeen
Yigezu, Mesay Gemeda
Gwadabe, Tajuddeen
Abdulmumin, Idris
Bame, Mahlet Taye
Awoyomi, Oluwabusayo Olufunke
Shode, Iyanuoluwa
Adelani, Tolulope Anu
Kailani, Habiba Abdulganiy
Omotayo, Abdul-Hakeem
Adeeko, Adetola
Abeeb, Afolabi
Aremu, Anuoluwapo
Samuel, Olanrewaju
Siro, Clemencia
Kimotho, Wangari
Ogbu, Onyekachi Raphael
Mbonu, Chinedu E.
Chukwuneke, Chiamaka I.
Fanijo, Samuel
Ojo, Jessica
Awosan, Oyinkansola F.
Guge, Tadesse Kebede
Sari, Sakayo Toadoum
Nyatsine, Pamela
Sidume, Freedmore
Yousuf, Oreen
Oduwole, Mardiyyah
Tshinu, Kanda Patrick
Kimanuka, Ussen
Diko, Thina
Nxakama, Siyanda
Nugussie, Sinodos G.
Johar, Abdulmejid Tuni
Mohamed, Shafie Abdi
Hassan, Fuad Mire
Mehamed, Moges Ahmed
Ngabire, Evrard
Twagirayezu, Jules
Ssenkungu, Ivan
Stenetorp, Pontus

Version: Draft Version

License: Creative Commons Attribution
