---
base_model: facebook/w2v-bert-2.0
library_name: transformers
license: mit
metrics:
- accuracy
- f1
- precision
- recall
tags:
- generated_from_trainer
- arabic
- quran
- speech-segmentation
model-index:
- name: recitation-segmenter-v2
results: []
pipeline_tag: automatic-speech-recognition
language: ar
---
# recitation-segmenter-v2: Quranic Recitation Segmenter
This model is a fine-tuned version of [facebook/w2v-bert-2.0](https://huggingface.co/facebook/w2v-bert-2.0) for segmenting Holy Quran recitations based on pause points (waqf). It was presented in the paper [Automatic Pronunciation Error Detection and Correction of the Holy Quran's Learners Using Deep Learning](https://huggingface.co/papers/2509.00094).
Project Page: https://obadx.github.io/prepare-quran-dataset/
GitHub Repository: https://github.com/obadx/recitations-segmenter
It achieves the following results on the evaluation set:
- Accuracy: 0.9958
- F1: 0.9964
- Loss: 0.0132
- Precision: 0.9976
- Recall: 0.9951
## Model description
The `recitation-segmenter-v2` model is an enhanced AI model capable of segmenting Holy Quran recitations based on pause points (`waqf`) with high accuracy. It is built upon a fine-tuned [Wav2Vec2Bert](https://huggingface.co/docs/transformers/model_doc/wav2vec2-bert) model, performing sequence frame-level classification at a 20-millisecond resolution. The model and its accompanying Python library are designed for high-performance processing of Quranic recitations of any number and length, from a few seconds to several hours, without performance degradation.
Key Features:
* Segments Quranic recitations according to `waqf` (pause rules).
* Specifically trained for Quranic recitations.
* High accuracy, with 20-millisecond temporal precision.
* Requires only ~3 GB of GPU memory.
* Capable of processing recitations of any duration without performance loss.
The model is part of a larger effort described in the associated paper, aiming to bridge gaps in assessing spoken language for the Holy Quran. This includes an automated pipeline to produce high-quality Quranic datasets and a novel ASR-based approach for pronunciation error detection using a custom Quran Phonetic Script (QPS).
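To make the 20-millisecond frame resolution concrete, here is a minimal, illustrative sketch of how per-frame speech/silence predictions map onto time intervals. `frames_to_intervals` is a hypothetical helper; the packaged `segment_recitations` function performs this step for you:

```python
import torch

# Hypothetical helper: convert per-frame speech/silence labels (one label
# every 20 ms) into (start, end) intervals in seconds.
def frames_to_intervals(frame_labels: torch.Tensor, frame_ms: int = 20):
    intervals = []
    start = None
    for i, is_speech in enumerate(frame_labels.tolist()):
        if is_speech and start is None:
            start = i
        elif not is_speech and start is not None:
            intervals.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    if start is not None:
        intervals.append((start * frame_ms / 1000, len(frame_labels) * frame_ms / 1000))
    return intervals

# 1 second of audio -> 50 frames; speech detected between 0.2 s and 0.8 s
labels = torch.tensor([0] * 10 + [1] * 30 + [0] * 10)
print(frames_to_intervals(labels))  # [(0.2, 0.8)]
```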
## Intended uses & limitations
This model is primarily intended for:
* Automatic segmentation of Holy Quran recitations for educational purposes or content analysis.
* Building high-quality Quranic audio databases.
* As a foundational component for larger systems focused on pronunciation error detection and correction for Quran learners.
**Limitations**:
* The segmenter currently treats `sakt` (a very short pause without taking a breath) as a full `waqf` (stop), which may be a limitation for advanced Tajweed analysis.
* The model is specifically trained and optimized for Quranic recitations and might not generalize well to other forms of spoken Arabic.
## Training and evaluation data
The model was fine-tuned on a meticulously collected dataset of Quranic recitations. The data collection process, described in the associated paper, involved a 98% automated pipeline including collection from expert reciters, segmentation at pause points (`waqf`) using a fine-tuned `wav2vec2-BERT` model, transcription of segments, and transcript verification via a novel Tasmeea algorithm. The dataset comprises over 850 hours of audio (~300K annotated utterances).
The data preparation involved:
1. Downloading Quranic recitations and converting them to Hugging Face Audio Dataset format at 16000 Hz sample rate.
2. Pre-segmenting verses from [everyayah.com](https://everyayah.com) at pause points using `silero-vad-v4`.
3. Applying post-processing (e.g., `min_silence_duration_ms`, `min_speech_duration_ms`, `pad_duration_ms`) to refine segments and manual verification for high-quality divisions.
4. Applying data augmentation techniques, including time stretching (speeding up or slowing down 40% of recitations) and various audio effects (Aliasing, AddGaussianNoise, BandPassFilter, PitchShift, RoomSimulator, etc.) using the `audiomentations` library (illustrative sketches follow this list).
5. Normalizing audio segments to 16000 Hz and chunking them, with a maximum length of 20 seconds, using a sliding window approach for longer segments.
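As an illustration of the augmentation step (step 4), here is a minimal `audiomentations` sketch; the transform probabilities and parameter ranges below are assumptions, not the values used to build the released dataset:

```python
import numpy as np
from audiomentations import (
    AddGaussianNoise, Aliasing, BandPassFilter, Compose, PitchShift, RoomSimulator, TimeStretch
)

# Illustrative augmentation pipeline; probabilities and ranges are assumptions.
augment = Compose([
    TimeStretch(min_rate=0.8, max_rate=1.25, p=0.4),
    AddGaussianNoise(min_amplitude=0.001, max_amplitude=0.015, p=0.3),
    BandPassFilter(p=0.3),
    PitchShift(min_semitones=-2, max_semitones=2, p=0.3),
    Aliasing(min_sample_rate=8000, max_sample_rate=16000, p=0.2),
    RoomSimulator(p=0.2),  # requires the optional pyroomacoustics dependency
])

wave = np.random.randn(16_000 * 5).astype(np.float32)  # 5 s of dummy audio
augmented = augment(samples=wave, sample_rate=16_000)
```

For the normalization and chunking step (step 5), a minimal sliding-window sketch, assuming 16 kHz mono waveforms; the helper name and the stride value are illustrative:

```python
import torch

# Illustrative helper (not part of the library): split a mono 16 kHz waveform
# into chunks of at most `max_seconds`, stepping by `stride_seconds`.
def chunk_wave(wave: torch.Tensor, sample_rate: int = 16_000,
               max_seconds: float = 20.0, stride_seconds: float = 20.0):
    window = int(max_seconds * sample_rate)
    stride = int(stride_seconds * sample_rate)
    return [wave[start:start + window] for start in range(0, max(len(wave), 1), stride)]

chunks = chunk_wave(torch.randn(16_000 * 65))  # 65 s -> chunks of 20/20/20/5 s
print([len(c) / 16_000 for c in chunks])
```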
The training dataset and its augmented version are available on Hugging Face:
* [Training Data](https://huggingface.co/datasets/obadx/recitation-segmentation)
* [Augmented Training Data](https://huggingface.co/datasets/obadx/recitation-segmentation-augmented)
## Usage
You can use this model with its accompanying Python library, `recitations-segmenter`, which integrates with Hugging Face `transformers`.
### Requirements
First, install `ffmpeg` and `libsndfile` system-wide.
#### Linux
```bash
sudo apt-get update
sudo apt-get install -y ffmpeg libsndfile1 portaudio19-dev
```
#### Windows & Mac
You can create an `anaconda` environment and then install these libraries:
```bash
conda create -n segment python=3.12
conda activate segment
conda install -c conda-forge ffmpeg libsndfile
```
### Via pip
```bash
pip install recitations-segmenter
```
### Sample usage (Python API)
Here's a complete example for using the library in Python. A Google Colab example is also available: [Open in Colab](https://colab.research.google.com/drive/1-RuRQOj4l2MA_SG2p4m-afR7MAsT5I22?usp=sharing)
```python
from pathlib import Path
from recitations_segmenter import segment_recitations, read_audio, clean_speech_intervals
from transformers import AutoFeatureExtractor, AutoModelForAudioFrameClassification
import torch
if __name__ == '__main__':
    device = torch.device('cuda')
    dtype = torch.bfloat16

    processor = AutoFeatureExtractor.from_pretrained(
        "obadx/recitation-segmenter-v2")
    model = AutoModelForAudioFrameClassification.from_pretrained(
        "obadx/recitation-segmenter-v2",
    )
    model.to(device, dtype=dtype)

    # Paths to the Holy Quran recitation files to segment
    file_pathes = [
        './assets/dussary_002282.mp3',
        './assets/hussary_053001.mp3',
    ]
    waves = [read_audio(p) for p in file_pathes]

    # Extract speech intervals (in samples, at a 16,000 Hz sample rate)
    sampled_outputs = segment_recitations(
        waves,
        model,
        processor,
        device=device,
        dtype=dtype,
        batch_size=8,
    )

    for out, path in zip(sampled_outputs, file_pathes):
        # Clean the speech intervals by:
        # * merging short silence durations
        # * removing short speech durations
        # * adding padding to each speech interval
        # Raises:
        # * NoSpeechIntervals: if the wav is complete silence
        # * TooHighMinSpeechDruation: if `min_speech_duration` is too high,
        #   which results in deleting all speech intervals
        clean_out = clean_speech_intervals(
            out.speech_intervals,
            out.is_complete,
            min_silence_duration_ms=30,
            min_speech_duration_ms=30,
            pad_duration_ms=30,
            return_seconds=True,
        )

        print(f'Speech Intervals of: {Path(path).name}:')
        print(clean_out.clean_speech_intervals)
        print(f'Is Recitation Complete: {clean_out.is_complete}')
        print('-' * 40)
```
## Training procedure
The model was trained with `Wav2Vec2BertForAudioFrameClassification` using the `transformers` library. More detailed motivation, methodology, and setup can be found in the GitHub repository's "تفاصيل التدريب" (Training Details) section.
### Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 50
- eval_batch_size: 64
- seed: 42
- optimizer: adamw_torch with betas=(0.9, 0.999), epsilon=1e-08, and no additional optimizer arguments
- lr_scheduler_type: constant
- lr_scheduler_warmup_ratio: 0.2
- num_epochs: 1
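For orientation, these settings map roughly onto the standard `transformers` `TrainingArguments` as sketched below; the `output_dir` and any options not listed above are assumptions, not values taken from the actual training script:

```python
from transformers import TrainingArguments

# Sketch only: maps the hyperparameters listed above onto the standard
# Trainer API; output_dir and any unlisted options are assumptions.
training_args = TrainingArguments(
    output_dir="./recitation-segmenter-v2",
    learning_rate=5e-5,
    per_device_train_batch_size=50,
    per_device_eval_batch_size=64,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="constant",
    warmup_ratio=0.2,
    num_train_epochs=1,
)
```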
### Training results
| Training Loss | Epoch | Step | Accuracy | F1 | Validation Loss | Precision | Recall |
|:-------------:|:------:|:----:|:--------:|:------:|:---------------:|:---------:|:------:|
| 0.0701 | 0.2507 | 275 | 0.9953 | 0.9959 | 0.0249 | 0.9947 | 0.9971 |
| 0.0234 | 0.5014 | 550 | 0.9953 | 0.9959 | 0.0185 | 0.9940 | 0.9977 |
| 0.0186 | 0.7521 | 825 | 0.9958 | 0.9964 | 0.0132 | 0.9976 | 0.9951 |
### Framework versions
- Transformers 4.51.3
- Pytorch 2.2.1+cu121
- Datasets 3.5.0
- Tokenizers 0.21.1
## Citation
If you find our work helpful or inspiring, please feel free to cite it.
```bibtex
@article{ibrahim2025automatic,
  title={Automatic Pronunciation Error Detection and Correction of the Holy Quran's Learners Using Deep Learning},
  author={Abdullah Abdelfattah and Mahmoud I. Khalil and Hazem M. Abbas},
  journal={arXiv preprint arXiv:2509.00094},
  year={2025}
}
```