Topic Detection on COVID-19 Tweets: A Comparative Study on Clustering and Transfer Learning Models
Research/Original Article (accredited ranking)
Automatic topic detection has become unavoidable in social media analysis because of the large volume of text that users generate. Clustering-based methods are among the most important and up-to-date categories of topic detection, and the goal of this research is a broad study of this category. To that end, this paper examines the main components of clustering-based topic detection: embedding methods, distance metrics, and clustering algorithms. Transfer learning, and with it pretrained language models and word embeddings, has attracted considerable attention in recent years. Given the importance of embedding methods, the paper compares the performance of five of them, ranging from earlier to recent ones. The study also covers two commonly used distance metrics and five clustering algorithms that are important in the field of topic detection, all implemented by the authors. Because COVID-19 has become a trending topic on social networks in recent years, the dataset consists of one month of tweets collected with COVID-19-related hashtags. More than 7,500 experiments are performed to tune the parameters. All combinations of embedding methods, distance metrics, and clustering algorithms (5 × 2 × 5 = 50 combinations) are then evaluated using the Silhouette metric. The results show that T5 strongly outperforms the other embedding methods, cosine distance is slightly better than the other distance metric, and DBSCAN is superior to the other clustering algorithms.
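The evaluation scheme described above (scoring every combination of embedding, distance metric, and clustering result with the Silhouette metric) can be illustrated with a minimal sketch. It is not the authors' implementation: the toy vectors, the label assignments, the names `toy_embedding`, `good`, and `bad`, and the choice of Euclidean distance as the second metric are all assumptions made only for illustration, and noise points such as those DBSCAN produces are not handled.

```python
import math
from itertools import product


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def cosine_distance(u, v):
    """Cosine distance, 1 - cos(u, v); assumes non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def silhouette(points, labels, dist):
    """Mean silhouette coefficient of a labelling under a distance function."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical stand-ins: one tiny "embedding" and two candidate labelings.
embeddings = {"toy_embedding": [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]}
metrics = {"euclidean": euclidean, "cosine": cosine_distance}
labelings = {"good": [0, 0, 1, 1], "bad": [0, 1, 0, 1]}

# Score every (embedding, metric, clustering) combination.
results = {
    (e, m, c): silhouette(embeddings[e], labelings[c], metrics[m])
    for e, m, c in product(embeddings, metrics, labelings)
}
best = max(results, key=results.get)
```

In a full study, `embeddings` would hold the tweet vectors produced by each embedding method and `labelings` the output of each clustering algorithm, so that `results` holds one score per combination and `best` identifies the highest-scoring one.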