💻 moroccan_nlp

Linguistic Resources and Models for Moroccan Darija and Arabic

Building Moroccan AI, one word at a time.

DarijaBERT · Baseline Classifier · Linguistic Corpora · AI for Under-Resourced Languages

Baseline Classifier Accuracy: 100%

DarijaBERT Parameters: 0.2B

Training Tokens: ~100M

📋 Project Overview

moroccan_nlp is a project dedicated to developing linguistic resources and Natural Language Processing (NLP) models for Moroccan Darija and Arabic. Its goal is to bridge the gap between cutting-edge AI research and the linguistic reality of Morocco.

Core Model: DarijaBERT, the first BERT model for Moroccan Darija (0.2B parameters, trained on ~100M tokens).

Baseline Classifier: 100% accuracy on test data.

📄 Executive Summary

DarijaBERT is the first open-source BERT model for the Moroccan Arabic dialect (Darija), developed by AIOX Lab and SI2M Lab (INSEA). It was trained on ~3M sequences (691 MB of text, ~100M tokens) drawn from stories, YouTube comments, and tweets. The project provides:

✅ DarijaBERT Integration: open-source model for Darija
✅ Baseline Classifier: 100% accuracy on test data
✅ Linguistic Resources: curated datasets for Darija and Arabic
✅ Open Source: MIT licensed, available on PyPI
✅ Reproducible Research: archived on Zenodo, OSF, and the Internet Archive
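
To give a feel for the corpus figures quoted in the summary (~3M sequences, 691 MB, ~100M tokens), the short sketch below derives a few per-unit averages. It assumes 1 MB = 10^6 bytes and treats all three figures as exact, although the source gives them as approximations and does not say whether "tokens" means words or subword units.

```python
# Corpus figures quoted in the summary (all approximate).
NUM_SEQUENCES = 3_000_000    # ~3M sequences
CORPUS_BYTES = 691 * 10**6   # 691 MB; assumes 1 MB = 10^6 bytes
NUM_TOKENS = 100_000_000     # ~100M tokens

# Simple averages derived from the three figures above.
tokens_per_sequence = NUM_TOKENS / NUM_SEQUENCES
bytes_per_token = CORPUS_BYTES / NUM_TOKENS
bytes_per_sequence = CORPUS_BYTES / NUM_SEQUENCES

print(f"{tokens_per_sequence:.1f} tokens per sequence")  # 33.3
print(f"{bytes_per_token:.2f} bytes per token")          # 6.91
print(f"{bytes_per_sequence:.0f} bytes per sequence")    # 230
```

Averages of roughly 33 tokens and 230 bytes per sequence point to short texts, which would be consistent with a corpus that includes tweets and YouTube comments.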

Test Results: In a fill-mask test run on Google Colab, DarijaBERT showed strong performance on Darija sentences.
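
A fill-mask test like the one mentioned above could be reproduced along the following lines. This is a minimal sketch, not the project's own test script: it assumes the Hugging Face `transformers` library and the Hub model ID `SI2M-Lab/DarijaBERT`, neither of which is named on this page, and the helper names (`load_fill_mask`, `mask_word`, `top_predictions`, `demo`) and the example sentence are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Assumption: Hugging Face Hub ID for DarijaBERT (not stated on this page).
MODEL_ID = "SI2M-Lab/DarijaBERT"


def load_fill_mask(model_id: str = MODEL_ID):
    """Build a fill-mask pipeline; downloads the model on first use."""
    from transformers import pipeline  # imported lazily: heavy dependency
    return pipeline("fill-mask", model=model_id)


def mask_word(sentence: str, target: str, mask_token: str = "[MASK]") -> str:
    """Replace the first whitespace-delimited occurrence of `target`."""
    words = sentence.split()
    if target not in words:
        raise ValueError(f"{target!r} not found in sentence")
    words[words.index(target)] = mask_token
    return " ".join(words)


def top_predictions(
    fill_mask: Callable[..., List[Dict]], text: str, k: int = 5
) -> List[Tuple[str, float]]:
    """Return (token, score) pairs for the k most likely fillers."""
    results = fill_mask(text, top_k=k)
    return [(r["token_str"], float(r["score"])) for r in results]


def demo() -> None:
    """Run one prediction; needs network access and `transformers`."""
    unmasker = load_fill_mask()
    # Illustrative Darija sentence ("I want to drink tea"); mask the last word.
    sentence = "بغيت نشرب أتاي"
    masked = mask_word(sentence, "أتاي", unmasker.tokenizer.mask_token)
    for token, score in top_predictions(unmasker, masked, k=5):
        print(f"{score:.3f}  {token}")
```

Calling `demo()` in a Colab cell prints the five most likely fillers with their scores. The mask token is read from the tokenizer rather than hard-coded, so the same helpers work with other masked language models.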

📈 Dashboard

View live metrics and performance indicators

📄 Reports

Access research papers and analysis reports

📚 Documentation

Read technical documentation and methodology

🔗 Resources

Access datasets, citations, and external resources
