📚 Documentation

Complete technical documentation for moroccan_nlp

📘 README

moroccan_nlp - Linguistic Resources and Models for Moroccan Darija and Arabic

Building Moroccan AI, one word at a time. DarijaBERT · Baseline Classifier · Linguistic Corpora · AI for Under-Resourced Languages

Version: 1.0.0 | License: MIT

DOI: 10.5281/zenodo.21154423

📁 Project Structure

moroccan_nlp/
├── DATA/                     # Raw and processed datasets
│   ├── raw/                  # Original data
│   └── processed/            # Cleaned data
├── MODELS/                   # NLP models
│   └── DarijaBERT/           # DarijaBERT integration
│       ├── load_model.py     # Model loading script
│       └── results.txt       # Test results
├── scripts/                  # Utility scripts
│   ├── train_baseline_v6.py  # Baseline classifier
│   ├── preprocess_light.py   # Data preprocessing
│   └── load_data.py          # Data loading
├── ANALYSIS/                 # Data analysis notebooks
├── PUBLICATION/              # Research papers
├── REPORTS/                  # Progress reports
├── VALIDATION/               # Model validation
├── docs/                     # Technical documentation
├── README.md                 # This file
└── requirements.txt          # Python dependencies

🏗️ Architecture

The moroccan_nlp system follows a modular architecture with the following components:

DarijaBERT Model - First BERT model for Moroccan Darija (0.2B parameters, pretrained on ~100M tokens)
Baseline Classifier - Keyword-based classifier (reported accuracy: 100%)
Data Pipeline - Collection, preprocessing, and validation of Darija corpora
Evaluation Framework - Fill-Mask, sentiment analysis, and classification tasks
Reproducibility Layer - Zenodo, OSF, and Internet Archive integration
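
To make the Baseline Classifier component concrete, here is a minimal sketch of keyword-based classification with an accuracy check. It is an illustration under assumptions, not the code in scripts/train_baseline_v6.py: the labels, the keyword lists, and the names `KEYWORDS`, `classify`, and `accuracy` are all hypothetical.

```python
# Hypothetical labels and keyword lists, chosen only for illustration.
# The lists actually used by scripts/train_baseline_v6.py are not shown here.
KEYWORDS = {
    "positive": {"زوين", "مزيان"},
    "negative": {"خايب"},
}


def classify(text, keywords=KEYWORDS, default="unknown"):
    """Return the label whose keywords occur most often in `text`.

    Tokens are split on whitespace. Ties go to the label listed first
    in `keywords`; texts with no keyword hit get `default`.
    """
    tokens = text.split()
    scores = {
        label: sum(1 for tok in tokens if tok in words)
        for label, words in keywords.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


def accuracy(examples, keywords=KEYWORDS):
    """Fraction of (text, gold_label) pairs that `classify` gets right."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(1 for text, gold in examples if classify(text, keywords) == gold)
    return correct / len(examples)
```

A value of 1.0 from `accuracy` only says that every example in that particular list was matched, so the figure is tied to the evaluation set it was measured on.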

🔬 Methodology

The project follows a systematic methodology for Darija NLP:

Data Collection - Curating Darija datasets from stories, YouTube comments, and tweets (~3M sequences, 691 MB)
Model Training - Training DarijaBERT on ~100M tokens on a TPU v3-8 for 49 hours
Baseline Development - Creating a keyword-based classifier (reported accuracy: 100%)
Evaluation - Testing on Fill-Mask tasks and downstream NLP applications
Open Source Release - Publishing model, code, and documentation on GitHub, PyPI, and Zenodo
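
As an illustration of the kind of light cleaning a Data Collection step might apply to YouTube comments and tweets, the sketch below strips URLs, user mentions, tatweel, and diacritics, then drops duplicate lines. It is an assumption about what such a step could look like, not the contents of scripts/preprocess_light.py; the names `light_clean` and `deduplicate` are hypothetical.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
TATWEEL = "\u0640"  # Arabic elongation character (kashida)


def light_clean(text):
    """Lightly normalize one raw comment or tweet (hypothetical helper)."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace(TATWEEL, "")
    # Drop Unicode nonspacing marks, which include the Arabic diacritics
    # (harakat). Note this also removes combining marks in other scripts.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    # Collapse runs of whitespace and trim both ends.
    return " ".join(text.split())


def deduplicate(lines):
    """Clean each line and keep the first occurrence of each non-empty result."""
    seen = set()
    cleaned = []
    for line in lines:
        norm = light_clean(line)
        if norm and norm not in seen:
            seen.add(norm)
            cleaned.append(norm)
    return cleaned
```

Whether to remove diacritics or mentions at all is a design choice that depends on the downstream task, so each rule here is best treated as optional.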

📖 How to Cite

Software Archive (Zenodo)

@software{baladi2026moroccan_nlp,
  author       = {Baladi, Samir},
  title        = {moroccan_nlp v1.0.0: Linguistic Resources and Models for Darija},
  year         = {2026},
  month        = {July},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.21154423},
  url          = {https://doi.org/10.5281/zenodo.21154423}
}

DarijaBERT Paper

@article{gaanoun2023darijabert,
  title={{DarijaBERT}: a Step Forward in {NLP} for the Written Moroccan Dialect},
  author={Gaanoun, Kamel and Naira, Abdou Mohamed and Allak, Anass and Benelallam, Imade},
  year={2023}
}

⚖️ License

This project is licensed under the MIT License.

The DarijaBERT model is licensed separately, for research use only (contact: dbert@aiox-labs.com).