Technical Documentation · Project Guide
Documentation
Complete technical documentation for moroccan_nlp
README
moroccan_nlp - Linguistic Resources and Models for Moroccan Darija and Arabic
Building Moroccan AI, one word at a time. DarijaBERT · Baseline Classifier · Linguistic Corpora · AI for Under-Resourced Languages
Version: 1.0.0 | License: MIT
Project Structure
moroccan_nlp/
├── DATA/                       # Raw and processed datasets
│   ├── raw/                    # Original data
│   └── processed/              # Cleaned data
├── MODELS/                     # NLP models
│   ├── DarijaBERT/             # DarijaBERT integration
│   ├── load_model.py           # Model loading script
│   └── results.txt             # Test results
├── scripts/                    # Utility scripts
│   ├── train_baseline_v6.py    # Baseline classifier
│   ├── preprocess_light.py     # Data preprocessing
│   └── load_data.py            # Data loading
├── ANALYSIS/                   # Data analysis notebooks
├── PUBLICATION/                # Research papers
├── REPORTS/                    # Progress reports
├── VALIDATION/                 # Model validation
├── docs/                       # Technical documentation
├── README.md                   # This file
└── requirements.txt            # Python dependencies
Architecture
The moroccan_nlp system follows a modular architecture with the following components:
- DarijaBERT Model - First BERT model for Moroccan Darija (0.2B parameters, trained on ~100M tokens)
- Baseline Classifier - Keyword-based classification with a reported 100% accuracy
- Data Pipeline - Collection, preprocessing, and validation of Darija corpora
- Evaluation Framework - Fill-Mask, sentiment analysis, and classification tasks
- Reproducibility Layer - Zenodo, OSF, and Internet Archive integration
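To make the baseline component concrete, here is a minimal sketch of what a keyword-based classifier can look like. It is not the code in `scripts/train_baseline_v6.py`: the keyword lists, the label names, and the `classify` and `accuracy` helpers are all hypothetical and chosen only for illustration.

```python
# Hypothetical keyword lists for illustration only; the project's real
# keywords, labels, and matching rules are not documented here.
KEYWORDS = {
    "positive": {"zwin", "mezyan", "زوين", "مزيان"},
    "negative": {"khayb", "خايب"},
}


def classify(text, keywords=KEYWORDS, default="unknown"):
    """Return the label whose keywords occur most often in `text`.

    Tokens are obtained by lowercasing and splitting on whitespace.
    If no keyword matches, `default` is returned. On a tie, the label
    listed first in `keywords` wins.
    """
    tokens = text.lower().split()
    counts = {
        label: sum(1 for token in tokens if token in words)
        for label, words in keywords.items()
    }
    best = max(counts, key=counts.get)
    return best if counts[best] > 0 else default


def accuracy(examples, keywords=KEYWORDS):
    """Fraction of (text, gold_label) pairs that `classify` gets right."""
    if not examples:
        raise ValueError("examples must not be empty")
    correct = sum(
        1 for text, gold in examples if classify(text, keywords) == gold
    )
    return correct / len(examples)


if __name__ == "__main__":
    print(classify("had l film zwin bezaf"))
    print(accuracy([("zwin", "positive"), ("khayb", "negative")]))
```

The value returned by `accuracy` describes only the examples passed to it, so a score on a small hand-built set says how well the keyword lists cover that set and nothing more.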
Methodology
The project follows a systematic methodology for Darija NLP:
- Data Collection - Curating Darija datasets from stories, YouTube comments, and Tweets (~3M sequences, 691MB)
- Model Training - Training DarijaBERT on ~100M tokens using a TPU v3-8 for 49 hours
- Baseline Development - Creating a keyword-based classifier with a reported 100% accuracy
- Evaluation - Testing on Fill-Mask tasks and downstream NLP applications
- Open Source Release - Publishing model, code, and documentation on GitHub, PyPI, and Zenodo
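The data collection and preprocessing steps above can be sketched as a light cleaning pass over raw social-media text. This is an assumption about what such a pass might involve, not the contents of `scripts/preprocess_light.py`: the rules (dropping URLs and mentions, stripping Arabic diacritics and tatweel, shortening elongated characters) and the `preprocess` and `deduplicate` names are hypothetical.

```python
import re

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Arabic diacritics (U+064B..U+0652) and tatweel (U+0640).
DIACRITICS_RE = re.compile(r"[\u064B-\u0652\u0640]")
# Three or more repeats of the same character, e.g. "zwiiiin".
ELONGATION_RE = re.compile(r"(.)\1{2,}")
WHITESPACE_RE = re.compile(r"\s+")


def preprocess(text):
    """Apply a light, rule-based cleanup to one line of text."""
    text = URL_RE.sub(" ", text)              # drop links
    text = MENTION_RE.sub(" ", text)          # drop @mentions
    text = DIACRITICS_RE.sub("", text)        # strip diacritics and tatweel
    text = ELONGATION_RE.sub(r"\1\1", text)   # "zwiiiin" -> "zwiin"
    return WHITESPACE_RE.sub(" ", text).strip()


def deduplicate(lines):
    """Clean each line, then drop empty lines and exact duplicates.

    The order of first occurrences is preserved.
    """
    seen = set()
    cleaned_lines = []
    for line in lines:
        cleaned = preprocess(line)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            cleaned_lines.append(cleaned)
    return cleaned_lines


if __name__ == "__main__":
    print(preprocess("@user  zwiiiin bezaf https://example.com/x"))
```

Elongated runs are cut to two characters instead of one so that legitimately doubled letters survive the cleanup.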
How to Cite
Software Archive (Zenodo)
@software{baladi2026moroccan_nlp,
  author    = {Baladi, Samir},
  title     = {moroccan_nlp v1.0.0: Linguistic Resources and Models for Darija},
  year      = {2026},
  month     = {July},
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.21154423},
  url       = {https://doi.org/10.5281/zenodo.21154423}
}
DarijaBERT Paper
@article{gaanoun2023darijabert,
  title  = {{DarijaBERT}: a Step Forward in {NLP} for the Written Moroccan Dialect},
  author = {Gaanoun, Kamel and Naira, Abdou Mohamed and Allak, Anass and Benelallam, Imade},
  year   = {2023}
}
License
This project is licensed under the MIT License.
DarijaBERT is licensed for research use only (contact: dbert@aiox-labs.com).