A Large-Scale English Multi-Label Twitter Dataset for Cyberbullying and Online Abuse Detection

Abstract

In this paper, we introduce a new English Twitter-based dataset for online abuse and cyberbullying detection. Comprising 62,587 tweets, the dataset was sourced from Twitter using query terms designed to retrieve tweets with a high probability of containing various forms of bullying and offensive content, including insult, profanity, sarcasm, threat, porn and exclusion. Analysis of the dataset confirmed common cyberbullying themes reported by other studies and revealed interesting relationships between the classes. The dataset was used to train a number of transformer-based deep learning models, which returned impressive results.

Publication DOI: https://doi.org/10.18653/v1/2021.woah-1.16
Divisions: College of Engineering & Physical Sciences > Aston STEM Education Centre
College of Engineering & Physical Sciences > School of Informatics and Digital Engineering > Computer Science
College of Engineering & Physical Sciences > Aston Institute of Urban Technology and the Environment (ASTUTE)
College of Engineering & Physical Sciences > Systems analytics research institute (SARI)
College of Engineering & Physical Sciences
Additional Information: © 2021 The Association for Computational Linguistics. Licensed under the Creative Commons Attribution license https://creativecommons.org/licenses/by/4.0/
Event Title: The 5th Workshop on Online Abuse and Harms
Event Type: Other
Event Dates: 2021-08-06
ISBN: 9781954085596
Related URLs: https://aclanth ... 1.woah-1.16.pdf (Publisher URL)
https://bitbuck ... llying-twitter/ (Related URL)
PURE Output Type: Conference contribution
Published Date: 2021-08
Authors: Salawu, Semiu
Lumsden, Jo (ORCID Profile 0000-0002-8637-7647)
He, Yulan (ORCID Profile 0000-0003-3948-5845)

Version: Published Version

License: Creative Commons Attribution
