💻 moroccan_nlp
Linguistic Resources and Models for Moroccan Darija and Arabic
Building Moroccan AI, one word at a time. DarijaBERT · Baseline Classifier · Linguistic Corpora · AI for Under-Resourced Languages
📋 Project Overview
moroccan_nlp is a comprehensive project dedicated to developing linguistic resources and Natural Language Processing (NLP) models for Moroccan Darija and Arabic. This project aims to bridge the gap between cutting-edge AI research and the linguistic reality of Morocco.
Core Model: DarijaBERT, the first BERT model for Moroccan Darija (0.2B parameters, trained on ~100M tokens). Baseline Classifier: 100% accuracy on test data.
📄 Executive Summary
DarijaBERT is the first open-source BERT model for the Moroccan Arabic dialect, developed by AIOX Lab & SI2M Lab (INSEA). It was trained on ~3M sequences (691 MB, ~100M tokens) drawn from stories, YouTube comments, and tweets. The moroccan_nlp project provides:
- ✅ DarijaBERT Integration — Open-source model for Darija
- ✅ Baseline Classifier — 100% accuracy on test data
- ✅ Linguistic Resources — Curated datasets for Darija and Arabic
- ✅ Open Source — MIT licensed, available on PyPI
- ✅ Reproducible Research — Zenodo, OSF, and Internet Archive
Test Results: On a fill-mask task run in Google Colab, DarijaBERT shows strong performance on Darija sentences.
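A fill-mask check like the one mentioned above could be reproduced roughly as follows. This is a minimal sketch, not the project's own test script: it assumes the model is published on the Hugging Face Hub under the ID `SI2M-Lab/DarijaBERT`, that the `transformers` library is installed, and that the tokenizer uses the standard BERT `[MASK]` token. The helper names `mask_word` and `top_predictions` are hypothetical and introduced only for illustration.

```python
from typing import List, Tuple

# Assumed Hugging Face Hub ID for DarijaBERT; adjust if the model lives elsewhere.
MODEL_ID = "SI2M-Lab/DarijaBERT"


def mask_word(sentence: str, index: int, mask_token: str = "[MASK]") -> str:
    """Replace the whitespace-separated word at `index` with the mask token."""
    words = sentence.split()
    if not 0 <= index < len(words):
        raise IndexError(f"word index {index} out of range for {len(words)} words")
    words[index] = mask_token
    return " ".join(words)


def top_predictions(masked_sentence: str, k: int = 5) -> List[Tuple[str, float]]:
    """Return the model's top-k (token, score) candidates for the masked position.

    The import is deferred so that mask_word stays usable without transformers.
    The first call downloads the model weights, so it needs network access.
    """
    from transformers import pipeline

    fill = pipeline("fill-mask", model=MODEL_ID)
    results = fill(masked_sentence, top_k=k)
    return [(r["token_str"], r["score"]) for r in results]
```

For example, `top_predictions(mask_word("أنا ساكن ف الدار البيضاء", 1))` masks the second word of an illustrative Darija sentence and returns the model's five most likely replacements with their scores. Inspecting those candidates by hand is a qualitative check only; it does not replace a quantitative evaluation.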