Features, Architecture, Combine Them Together

[12 June 2022] Word embeddings are powerful even when not accompanied by a neural network model. In this course project, we profiled each Twitter user by word distribution or embeddings, and then performed the classification task with machine learning techniques.

Background

This work is the course project for Language Processing 2 (LP2) at the University of Copenhagen. The project topic is a shared task at CLEF 2022 on classifying authors as ironic or not, depending on how many of their tweets contain ironic content. The code and report for this paper are available, and its abstract is shown below:

This research uses the CLEF 2022 dataset to profile irony and stereotype spreaders on Twitter, representing their tweets at three levels: naive features, sparse vectors, and embeddings. We employ three classic machine learning classifiers (logistic regression, support vector machine, and random forest), as well as a new architecture dubbed the voting model, to combine the strengths of each feature. The findings highlight the merit of embeddings, particularly those that include subword-level information. Furthermore, depending on the classifier’s mechanism, the quality of the features affects the classifiers to varying degrees.
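
The abstract does not spell out how the voting model combines the feature levels. As a rough illustration only, the sketch below assumes the simplest reading: one classifier is trained per feature level, and each author receives the label that most of those classifiers predict (hard majority voting). The function name `majority_vote`, the label strings, and the example predictions are all hypothetical and are not taken from the project code.

```python
from collections import Counter


def majority_vote(votes_per_model):
    """Combine label predictions from several models by hard majority voting.

    votes_per_model: a list of equal-length label lists, one list per model.
    Here each "model" stands for a classifier trained on one feature level
    (an assumption made for illustration, not the project's confirmed design).
    """
    if not votes_per_model:
        raise ValueError("need predictions from at least one model")
    n_authors = len(votes_per_model[0])
    if any(len(votes) != n_authors for votes in votes_per_model):
        raise ValueError("every model must predict a label for every author")

    combined = []
    # zip(*...) regroups the predictions so each tuple holds one author's votes.
    for author_votes in zip(*votes_per_model):
        label, _count = Counter(author_votes).most_common(1)[0]
        combined.append(label)
    return combined


# Hypothetical predictions for four authors from three per-feature classifiers.
naive_preds = ["ironic", "not_ironic", "ironic", "not_ironic"]
sparse_preds = ["ironic", "ironic", "not_ironic", "not_ironic"]
embedding_preds = ["not_ironic", "ironic", "ironic", "not_ironic"]

final_labels = majority_vote([naive_preds, sparse_preds, embedding_preds])
print(final_labels)
```

With three voters and two labels a tie cannot occur, so no tie-breaking rule is needed in this setting. A weighted or probability-based (soft) vote would be a natural variation if some feature levels are more reliable than others.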