The Goal
Swahili is spoken by over 100 million people across East and Central Africa. Yet large language models like GPT-4 perform significantly worse in Swahili than in English.
Our goal: create a Swahili LLM that approaches English-level performance for common tasks.
Baseline Performance
We evaluated GPT-4 and Claude on a range of Swahili tasks.
The gap is significant, especially for complex tasks.
Our Approach
Data Collection
We assembled the largest Swahili text corpus to date:
Sources:
- News articles (5M articles from 20+ outlets)
- Wikipedia (75K articles)
- Government documents (500K pages)
- Books and literature (1,000 titles)
- Social media (10M posts, anonymized)
- Human-generated instruction data (50K examples)
Quality control:
- Native speaker review
- Automated filtering for quality
- Deduplication (sketched below)
- Dialect annotation (Standard, Tanzanian, Kenyan, etc.)
Total: 10 billion tokens
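To give a feel for the cleaning pass, here is a minimal sketch of quality filtering plus exact deduplication, assuming documents arrive as plain-text strings. The function names and thresholds are illustrative, not the exact pipeline we ran.

```python
import hashlib
import re

def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies hash the same."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def is_high_quality(text: str, min_words: int = 50, max_nonalpha_ratio: float = 0.3) -> bool:
    """Drop very short documents and documents dominated by markup or boilerplate characters."""
    words = text.split()
    if len(words) < min_words:
        return False
    nonalpha = sum(1 for ch in text if not (ch.isalpha() or ch.isspace()))
    return nonalpha / max(len(text), 1) <= max_nonalpha_ratio

def dedup_and_filter(docs):
    """Yield documents that pass the quality filter and have not been seen before."""
    seen = set()
    for doc in docs:
        if not is_high_quality(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        yield doc
```

In practice near-duplicate detection (e.g. MinHash) catches more than exact hashing, but the structure of the pass is the same.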
Fine-Tuning Strategy
We used a multi-stage approach:
Stage 1: Continued Pre-training
- Started with Llama 2 70B
- Continued pre-training on the Swahili corpus
- 1 trillion tokens processed
- Cost: $50K in compute
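To make Stage 1 concrete, here is a minimal sketch of how the pre-training data can be prepared: documents are tokenized with the base model's tokenizer, concatenated, and packed into fixed 4096-token blocks for next-token prediction. The tokenizer id and helper name are assumptions for illustration, not our exact training code.

```python
from transformers import AutoTokenizer

SEQ_LEN = 4096
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-hf")

def pack_documents(docs, seq_len=SEQ_LEN):
    """Concatenate tokenized documents and slice them into equal-length training blocks."""
    buffer = []
    for doc in docs:
        buffer.extend(tokenizer(doc, add_special_tokens=False)["input_ids"])
        buffer.append(tokenizer.eos_token_id)  # mark document boundaries
        while len(buffer) >= seq_len:
            block, buffer = buffer[:seq_len], buffer[seq_len:]
            # For causal LM continued pre-training, labels are the inputs themselves.
            yield {"input_ids": block, "labels": list(block)}
```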
Stage 2: Instruction Tuning
- Human-written Swahili instructions
- Translated high-quality English datasets
- Emphasis on culturally relevant tasks
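For supervised instruction tuning, each instruction/response pair is rendered into a single training string. The template below is an assumption shown purely for illustration; the released dataset documents the actual format.

```python
# Hypothetical prompt template; Swahili field labels with English glosses.
PROMPT_TEMPLATE = (
    "### Maelekezo (Instruction):\n{instruction}\n\n"
    "### Jibu (Response):\n{response}"
)

def format_example(example: dict) -> str:
    """Render one instruction-tuning example into the training prompt."""
    return PROMPT_TEMPLATE.format(
        instruction=example["instruction"].strip(),
        response=example["response"].strip(),
    )

sample = {
    "instruction": "Eleza kwa ufupi historia ya lugha ya Kiswahili.",
    "response": "Kiswahili kilianzia pwani ya Afrika Mashariki...",
}
print(format_example(sample))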
Stage 3: RLHF
- Native speaker preferences
- Focused on natural, fluent Swahili
- Rewarded cultural appropriateness
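The core of the preference-modelling step can be sketched as a pairwise Bradley-Terry loss over native-speaker preference pairs: the reward model is pushed to score the preferred completion above the rejected one. This is a generic sketch, not our exact RLHF code; `reward_model` is a placeholder for any model that maps a tokenized sequence to a scalar score.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_model, chosen_ids, rejected_ids):
    """Pairwise preference loss: reward(chosen) should exceed reward(rejected)."""
    r_chosen = reward_model(chosen_ids)      # assumed shape: (batch,)
    r_rejected = reward_model(rejected_ids)  # assumed shape: (batch,)
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```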
Technical Details
Infrastructure:
- 8x A100 80GB GPUs
- DeepSpeed ZeRO Stage 3
- Mixed precision training (BF16)
- Gradient checkpointing
Hyperparameters:
- Learning rate: 1e-5 (continued pre-training), 2e-6 (fine-tuning)
- Batch size: 256
- Sequence length: 4096
- Training time: 2 weeks per stage
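As a rough sketch of how these settings map onto a Hugging Face TrainingArguments + DeepSpeed ZeRO-3 setup: the values below come from the lists above, while the per-device batch size and accumulation split are assumptions chosen so that 8 GPUs x 4 per device x 8 accumulation steps multiplies out to the global batch size of 256. The actual training scripts may differ.

```python
from transformers import TrainingArguments

ds_config = {
    "zero_optimization": {"stage": 3},       # ZeRO Stage 3: shard params, grads, optimizer state
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": "auto",
    "gradient_accumulation_steps": "auto",
}

args = TrainingArguments(
    output_dir="swahili-llm-stage1",
    learning_rate=1e-5,                      # 2e-6 for the later fine-tuning stages
    per_device_train_batch_size=4,           # assumed split; 8 GPUs x 4 x 8 = 256
    gradient_accumulation_steps=8,
    bf16=True,
    gradient_checkpointing=True,
    deepspeed=ds_config,
)
```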
Results
Our fine-tuned model (SwahiliLLM) shows significant improvements.
Human evaluators rated SwahiliLLM outputs as more natural and culturally appropriate in 73% of comparisons.
Challenges and Solutions
Dialectal Variation
Swahili varies across regions. Solutions:
- Annotated dialect information
- Training data from multiple regions
- Prompt-based dialect control (example below)
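As a hypothetical illustration of prompt-based dialect control, the desired variety can simply be stated in the system prompt; the exact control format SwahiliLLM expects may differ.

```python
# Illustrative chat-style request asking for Standard Swahili output.
messages = [
    {"role": "system",
     "content": "Wewe ni msaidizi. Jibu kwa Kiswahili sanifu (Standard Swahili)."},
    {"role": "user",
     "content": "Nieleze jinsi ya kuandika barua rasmi."},
]
```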
Code-Switching
East Africans frequently mix Swahili with English. Solutions:
- Include code-switched data in training
- Model handles mixed-language inputs naturally
Limited High-Quality Data
Some domains lack Swahili content. Solutions:
- Translation of English datasets with native speaker review
- Synthetic data generation with quality filtering
- Active learning to prioritize annotation
Cultural Nuance
LLMs trained on English encode Western cultural assumptions. Solutions:
- Culturally appropriate instruction data
- Native speaker feedback during RLHF
- Red teaming for cultural failures
Open Release
We're releasing:
- SwahiliLLM 7B and 13B model weights
- Swahili instruction dataset (50K examples)
- Evaluation benchmarks
- Fine-tuning code and recipes
Available at: github.com/veke/swahili-llm
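Loading the released weights should look like any other Hugging Face causal LM. The model id below is a placeholder for illustration; check the repository above for the actual identifiers and loading instructions.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "veke/swahili-llm-7b"  # hypothetical id, not confirmed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

prompt = "Andika shairi fupi kuhusu umoja wa Afrika Mashariki."
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```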
What's Next
We're working on:
- Expanding to other African languages
- Multimodal capabilities (voice, images)
- Domain-specific models (medical, legal, financial)
- Smaller models for on-device deployment
This is just the beginning. With focused effort, we can build LLMs that truly work for African language speakers.