The Somali Twitter Hate Speech Detection Dataset is a large-scale, AI-ready text corpus collected from Twitter (X) using automated data extraction tools and official APIs. It is designed to support advanced research in hate speech detection, sentiment analysis, natural language processing (NLP), and social impact analytics in the Somali language.
The dataset was processed through a structured AI preprocessing and feature engineering pipeline, enabling its direct use in machine learning, deep learning, and computational social science applications. A total of 7,219 Somali-language tweets were collected using scraping and API-based retrieval methods.
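The description does not list the individual preprocessing steps, so the following is only a generic sketch of the kind of tweet normalization such a pipeline commonly includes, not the authors' actual pipeline. The function name normalize_tweet, the placeholder tokens <url> and <user>, and the choice of steps (lowercasing, URL and mention masking, hashtag symbol removal) are all assumptions made for illustration.

```python
import re

# Assumed cleaning rules; the dataset's real pipeline is not documented here.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
WHITESPACE_RE = re.compile(r"\s+")


def normalize_tweet(text: str) -> str:
    """Lowercase a tweet and mask URLs and user mentions with placeholder tokens."""
    text = text.lower()
    text = URL_RE.sub("<url>", text)        # mask links
    text = MENTION_RE.sub("<user>", text)   # mask @handles
    text = text.replace("#", "")            # keep the hashtag word, drop the symbol
    return WHITESPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(normalize_tweet("Salaan @Ali  eeg https://t.co/abc #Soomaaliya"))
```

Masking handles and links rather than deleting them keeps the signal that a tweet addressed someone or shared a link, which can matter for hate speech detection, while avoiding storing user identifiers in model features.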
Dataset Structure
The dataset is organized in a machine-learning–friendly format and optimized for both NLP modelling and social media analytics, enabling:
Predictive and behavioral analytics
Text-based AI model training and evaluation
Trend and discourse analysis
Engagement and influence modeling
Hate speech and toxicity classification
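As a rough illustration of how the uses listed above might start, the sketch below reads a CSV export and summarizes the label distribution and average engagement per label. The column names (text, label, likes, retweets), the label values, and the three placeholder rows are hypothetical and are not taken from the dataset; check the actual file schema before adapting it.

```python
import csv
import io
from collections import Counter, defaultdict

# Placeholder rows standing in for the real file; columns and labels are hypothetical.
SAMPLE_CSV = """text,label,likes,retweets
placeholder tweet a,hate,4,1
placeholder tweet b,not_hate,10,3
placeholder tweet c,not_hate,2,0
"""


def summarize(csv_file):
    """Return per-label tweet counts and mean (likes + retweets) per label."""
    counts = Counter()
    engagement = defaultdict(int)
    for row in csv.DictReader(csv_file):
        label = row["label"]
        counts[label] += 1
        engagement[label] += int(row["likes"]) + int(row["retweets"])
    mean_engagement = {lbl: engagement[lbl] / counts[lbl] for lbl in counts}
    return counts, mean_engagement


counts, mean_engagement = summarize(io.StringIO(SAMPLE_CSV))
print(dict(counts))
print(mean_engagement)
```

To run this against the real data, pass an open file handle instead of the in-memory sample, for example open(path, newline="", encoding="utf-8"), where path is wherever the downloaded file is stored. Checking the label balance first is a useful step before training a classifier, since hate speech datasets are often imbalanced.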