Technical Documentation · Project Guide

📚 Documentation

Complete technical documentation for moroccan_nlp

📘 README

moroccan_nlp - Linguistic Resources and Models for Moroccan Darija and Arabic

Building Moroccan AI, one word at a time. DarijaBERT · Baseline Classifier · Linguistic Corpora · AI for Under-Resourced Languages

Version: 1.0.0 | License: MIT

DOI: 10.5281/zenodo.21154423

📁 Project Structure

moroccan_nlp/
├── DATA/                     # Raw and processed datasets
│   ├── raw/                  # Original data
│   └── processed/            # Cleaned data
├── MODELS/                   # NLP models
│   └── DarijaBERT/           # DarijaBERT integration
│       ├── load_model.py     # Model loading script
│       └── results.txt       # Test results
├── scripts/                  # Utility scripts
│   ├── train_baseline_v6.py  # Baseline classifier
│   ├── preprocess_light.py   # Data preprocessing
│   └── load_data.py          # Data loading
├── ANALYSIS/                 # Data analysis notebooks
├── PUBLICATION/              # Research papers
├── REPORTS/                  # Progress reports
├── VALIDATION/               # Model validation
├── docs/                     # Technical documentation
├── README.md                 # This file
└── requirements.txt          # Python dependencies
        

🏗️ Architecture

The moroccan_nlp system follows a modular architecture with the following components:

  • DarijaBERT Model - First BERT model for Moroccan Darija (0.2B parameters, trained on ~100M tokens)
  • Baseline Classifier - Keyword-based classification with a reported accuracy of 100%
  • Data Pipeline - Collection, preprocessing, and validation of Darija corpora
  • Evaluation Framework - Fill-Mask, sentiment analysis, and classification tasks
  • Reproducibility Layer - Zenodo, OSF, and Internet Archive integration
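
As an illustration of the Baseline Classifier and the preprocessing part of the Data Pipeline, here is a minimal sketch of keyword-based classification in plain Python. It is not the code in scripts/train_baseline_v6.py or scripts/preprocess_light.py, whose contents are not shown here. The labels, the keyword lists, and the function names `normalize`, `classify`, and `accuracy` are all hypothetical.

```python
import re
import unicodedata

# Hypothetical keyword lists for illustration only; the project's real
# lexicon and label set are not documented in this README.
KEYWORDS = {
    "positive": {"زوين", "مزيان", "zwin", "mzyan"},
    "negative": {"خايب", "khayb"},
}

_URL = re.compile(r"https?://\S+")
_TATWEEL = "\u0640"  # Arabic elongation character


def normalize(text: str) -> str:
    """Light cleanup: drop URLs, combining marks and tatweel, squeeze spaces."""
    text = _URL.sub(" ", text)
    # Unicode category Mn (non-spacing marks) includes the Arabic
    # short-vowel diacritics such as fatha and kasra.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    text = text.replace(_TATWEEL, "")
    return " ".join(text.lower().split())


def classify(text: str, default: str = "unknown") -> str:
    """Return the label with the most keyword hits, or `default` if none hit.

    Tokens are whitespace-split, so punctuation attached to a word prevents
    a match. Ties go to the label listed first in KEYWORDS.
    """
    tokens = normalize(text).split()
    scores = {
        label: sum(tok in words for tok in tokens)
        for label, words in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else default


def accuracy(examples) -> float:
    """Fraction of (text, label) pairs that `classify` labels correctly."""
    examples = list(examples)
    if not examples:
        raise ValueError("no examples to evaluate")
    return sum(classify(text) == label for text, label in examples) / len(examples)
```

A classifier of this kind scores 1.0 on any evaluation set in which every example contains one of its own keywords, so an accuracy figure is most informative when the evaluation set is described alongside it.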

🔬 Methodology

The project follows a systematic methodology for Darija NLP:

  1. Data Collection - Curating Darija datasets from stories, YouTube comments, and tweets (~3M sequences, 691 MB)
  2. Model Training - Training DarijaBERT on ~100M tokens using a TPU v3-8 for 49 hours
  3. Baseline Development - Creating a keyword-based classifier with a reported accuracy of 100%
  4. Evaluation - Testing on Fill-Mask tasks and downstream NLP applications
  5. Open Source Release - Publishing model, code, and documentation on GitHub, PyPI, and Zenodo
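
The Fill-Mask evaluation in step 4 can be sketched with the Hugging Face `transformers` fill-mask pipeline. This is a hedged example, not the contents of MODELS/DarijaBERT/load_model.py. It assumes the checkpoint is available on the Hugging Face Hub under the ID `SI2M-Lab/DarijaBERT`, and the helper names `build_fill_mask` and `top_predictions` are hypothetical.

```python
from typing import Callable, List, Tuple

# Assumption: the DarijaBERT checkpoint is published under this Hub ID.
MODEL_ID = "SI2M-Lab/DarijaBERT"


def build_fill_mask(model_id: str = MODEL_ID):
    """Create a fill-mask pipeline.

    Requires `transformers` plus a backend such as PyTorch, and downloads
    the model on first use. The import is done lazily so that
    `top_predictions` below has no third-party dependencies.
    """
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_predictions(
    fill_mask: Callable, text: str, k: int = 3
) -> List[Tuple[str, float]]:
    """Return the k best (token, score) candidates for the single mask in `text`.

    `fill_mask` is any callable with the fill-mask pipeline interface: it
    takes the text and a `top_k` argument and returns a list of dicts that
    contain `token_str` and `score`.
    """
    results = fill_mask(text, top_k=k)
    return [(r["token_str"].strip(), round(float(r["score"]), 4)) for r in results]


# Usage (needs network access, so it is left commented out):
# fill_mask = build_fill_mask()
# mask = fill_mask.tokenizer.mask_token  # the tokenizer's own mask string
# print(top_predictions(fill_mask, f"<a Darija sentence with one {mask}>"))
```

Passing the pipeline in as an argument keeps the scoring helper testable with a stub, without downloading the model.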

📖 How to Cite

Software Archive (Zenodo)

@software{baladi2026moroccan_nlp,
  author       = {Baladi, Samir},
  title        = {moroccan_nlp v1.0.0: Linguistic Resources and Models for Darija},
  year         = {2026},
  month        = {July},
  publisher    = {Zenodo},
  doi          = {10.5281/zenodo.21154423},
  url          = {https://doi.org/10.5281/zenodo.21154423}
}

DarijaBERT Paper

@article{gaanoun2023darijabert,
  title={DarijaBERT: a Step Forward in NLP for the Written Moroccan Dialect},
  author={Gaanoun, Kamel and Naira, Abdou Mohamed and Allak, Anass and Benelallam, Imade},
  year={2023}
}


⚖️ License

This project is licensed under the MIT License.

DarijaBERT is licensed for research use only (contact: dbert@aiox-labs.com).