On stopwords, filtering and data sparsity for sentiment analysis of Twitter

Abstract

Sentiment classification over Twitter is usually affected by the noisy nature (abbreviations, irregular forms) of tweet data. A popular procedure to reduce the noise of textual data is to remove stopwords by using pre-compiled stopword lists or more sophisticated methods for dynamic stopword identification. However, the effectiveness of removing stopwords in the context of Twitter sentiment classification has been debated in the last few years. In this paper, we investigate whether removing stopwords helps or hampers the effectiveness of Twitter sentiment classification methods. To this end, we apply six different stopword identification methods to Twitter data from six different datasets and observe how removing stopwords affects two well-known supervised sentiment classification methods. We assess the impact of removing stopwords by observing fluctuations in the level of data sparsity, the size of the classifier's feature space and its classification performance. Our results show that using pre-compiled lists of stopwords negatively impacts the performance of Twitter sentiment classification approaches. On the other hand, the dynamic generation of stopword lists, by removing those infrequent terms appearing only once in the corpus, appears to be the optimal method for maintaining a high classification performance while reducing the data sparsity and substantially shrinking the feature space.
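
The singleton-removal idea mentioned in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the pre-tokenized input, and the sparsity definition (fraction of zero cells in a binary tweet-term matrix) are all assumptions made for illustration, since this record does not specify them.

```python
from collections import Counter


def singleton_stopwords(tokenized_tweets):
    """Return the terms that occur exactly once in the whole corpus."""
    counts = Counter(tok for tweet in tokenized_tweets for tok in tweet)
    return {term for term, n in counts.items() if n == 1}


def remove_terms(tokenized_tweets, stoplist):
    """Drop every token that appears in the given stoplist."""
    return [[tok for tok in tweet if tok not in stoplist]
            for tweet in tokenized_tweets]


def vocabulary(tokenized_tweets):
    """Feature space under a bag-of-words model: the set of distinct terms."""
    return {tok for tweet in tokenized_tweets for tok in tweet}


def sparsity(tokenized_tweets):
    """Assumed definition: share of zero cells in the binary tweet-term matrix."""
    vocab = vocabulary(tokenized_tweets)
    if not tokenized_tweets or not vocab:
        return 0.0
    nonzero = sum(len(set(tweet)) for tweet in tokenized_tweets)
    return 1.0 - nonzero / (len(tokenized_tweets) * len(vocab))


# Toy corpus (hypothetical data, already tokenized).
tweets = [
    ["i", "love", "this", "phone"],
    ["i", "hate", "this", "phone"],
    ["love", "it", "sooo", "much"],
]
stoplist = singleton_stopwords(tweets)
filtered = remove_terms(tweets, stoplist)
print(len(vocabulary(tweets)), len(vocabulary(filtered)))
print(sparsity(tweets), sparsity(filtered))
```

On a corpus this small the singleton list also catches sentiment-bearing words such as "hate", so the toy output only shows the mechanics (a smaller vocabulary and a denser matrix), not the classification effect reported in the paper.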

Divisions: College of Engineering & Physical Sciences > Systems analytics research institute (SARI)
Additional Information: The LREC 2014 Proceedings are licensed under a Creative Commons Attribution-NonCommercial 4.0 International License
Event Title: 9th International Conference on Language Resources and Evaluation
Event Type: Other
Event Location: Iceland
Event Dates: 2014-05-26 - 2014-05-31
Uncontrolled Keywords: sentiment analysis, stopwords, data sparsity
ISBN: 978-2-9517408-8-4
Last Modified: 19 Nov 2024 08:42
Date Deposited: 02 Jun 2016 09:05
Full Text Link: http://www.lrec ... f/292_Paper.pdf
PURE Output Type: Conference contribution
Published Date: 2014
Authors: Saif, Hassan
Fernández, Miriam
He, Yulan (ORCID Profile 0000-0003-3948-5845)
Alani, Harith
