---
base_model: facebook/w2v-bert-2.0
library_name: transformers
license: mit
metrics:
- accuracy
- f1
- precision
- recall
tags:
- generated_from_trainer
- arabic
- quran
- speech-segmentation
model-index:
- name: recitation-segmenter-v2
  results: []
pipeline_tag: automatic-speech-recognition
language: ar
---

# recitation-segmenter-v2: Quranic Recitation Segmenter

This model is a fine-tuned version of [facebook/w2v-bert-2.0](https://huggingface.co/facebook/w2v-bert-2.0) for segmenting Holy Quran recitations based on pause points (waqf). It was presented in the paper [Automatic Pronunciation Error Detection and Correction of the Holy Quran's Learners Using Deep Learning](https://huggingface.co/papers/2509.00094).

*   Project Page: https://obadx.github.io/prepare-quran-dataset/
*   GitHub Repository: https://github.com/obadx/recitations-segmenter

It achieves the following results on the evaluation set:
- Accuracy: 0.9958
- F1: 0.9964
- Loss: 0.0132
- Precision: 0.9976
- Recall: 0.9951

## Model description

The `recitation-segmenter-v2` model is an enhanced AI model that segments Holy Quran recitations at pause points (`waqf`) with high accuracy. It is a fine-tuned [Wav2Vec2Bert](https://huggingface.co/docs/transformers/model_doc/wav2vec2-bert) model that performs frame-level sequence classification at a 20-millisecond resolution. The model and its accompanying Python library are designed to process any number of Quranic recitations of any length, from a few seconds to several hours, without performance degradation.

Key Features:
*   Segments Quranic recitations according to `waqf` (pause rules).
*   Specifically trained for Quranic recitations.
*   High accuracy, with segment boundaries resolved to 20 milliseconds.
*   Requires only ~3 GB of GPU memory.
*   Capable of processing recitations of any duration without performance loss.
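
As a rough illustration of what the 20-millisecond resolution means in practice, the sketch below converts a sequence of per-frame labels into time intervals. It is not the library's implementation: the function name `frames_to_intervals` is hypothetical, and the label convention (1 = speech, 0 = silence) is an assumption made for the example.

```python
FRAME_MS = 20  # frame resolution stated in the model description


def frames_to_intervals(labels, frame_ms=FRAME_MS):
    """Convert per-frame labels (assumed: 1 = speech) to (start_s, end_s) pairs."""
    intervals = []
    start = None  # index of the first frame of the current speech run
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i
        elif label != 1 and start is not None:
            intervals.append((start * frame_ms / 1000, i * frame_ms / 1000))
            start = None
    if start is not None:  # speech continues until the last frame
        intervals.append((start * frame_ms / 1000, len(labels) * frame_ms / 1000))
    return intervals


labels = [0, 1, 1, 1, 0, 0, 1, 1]
print(frames_to_intervals(labels))  # [(0.02, 0.08), (0.12, 0.16)]
```

Each boundary falls on a multiple of 20 ms, which is why the segment edges cannot be more precise than one frame.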

The model is part of a larger effort described in the associated paper, aiming to bridge gaps in assessing spoken language for the Holy Quran. This includes an automated pipeline to produce high-quality Quranic datasets and a novel ASR-based approach for pronunciation error detection using a custom Quran Phonetic Script (QPS).

## Intended uses & limitations

This model is primarily intended for:
*   Automatic segmentation of Holy Quran recitations for educational purposes or content analysis.
*   Building high-quality Quranic audio databases.
*   As a foundational component for larger systems focused on pronunciation error detection and correction for Quran learners.

**Limitations**:
*   The segmenter currently treats `sakt` (a very short pause without taking a breath) as a full `waqf` (stop), a distinction that may matter for advanced Tajweed analysis.
*   The model is specifically trained and optimized for Quranic recitations and might not generalize well to other forms of spoken Arabic.

## Training and evaluation data

The model was fine-tuned on a meticulously collected dataset of Quranic recitations. The data collection process, described in the associated paper, involved a 98% automated pipeline including collection from expert reciters, segmentation at pause points (`waqf`) using a fine-tuned `wav2vec2-BERT` model, transcription of segments, and transcript verification via a novel Tasmeea algorithm. The dataset comprises over 850 hours of audio (~300K annotated utterances).

The data preparation involved:
1.  Downloading Quranic recitations and converting them to Hugging Face Audio Dataset format at 16000 Hz sample rate.
2.  Pre-segmenting verses from [everyayah.com](https://everyayah.com) at pauses using `silero-vad-v4`.
3.  Applying post-processing (e.g., `min_silence_duration_ms`, `min_speech_duration_ms`, `pad_duration_ms`) to refine segments and manual verification for high-quality divisions.
4.  Applying data augmentation techniques, including time stretching (speeding up/slowing down 40% of recitations) and various audio effects (Aliasing, AddGaussianNoise, BandPassFilter, PitchShift, RoomSimulator, etc.) using the `audiomentations` library.
5.  Normalizing audio segments to 16000 Hz and chunking them, with a maximum length of 20 seconds, using a sliding window approach for longer segments.
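
The chunking in step 5 can be sketched as follows. This is a minimal illustration, not the project's code: the helper name `chunk_bounds` and the 50% overlap between windows are assumptions, since the card states only the 20-second maximum length and the 16000 Hz sample rate.

```python
SAMPLE_RATE = 16000  # Hz, as stated in the data preparation steps
MAX_SECONDS = 20     # maximum chunk length stated above


def chunk_bounds(num_samples, window=SAMPLE_RATE * MAX_SECONDS, stride=None):
    """Return (start, end) sample indices of windows that cover the audio."""
    if stride is None:
        stride = window // 2  # assumed 50% overlap; not specified in the card
    if num_samples <= window:
        return [(0, num_samples)]  # short audio fits in a single chunk
    bounds = []
    start = 0
    while start + window < num_samples:
        bounds.append((start, start + window))
        start += stride
    # Final window is aligned to the end so no trailing samples are lost.
    bounds.append((num_samples - window, num_samples))
    return bounds


# A 50-second recitation at 16000 Hz has 800000 samples.
print(chunk_bounds(50 * SAMPLE_RATE))
# [(0, 320000), (160000, 480000), (320000, 640000), (480000, 800000)]
```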

The training dataset and its augmented version are available on Hugging Face:
*   [Training Data](https://huggingface.co/datasets/obadx/recitation-segmentation)
*   [Augmented Training Data](https://huggingface.co/datasets/obadx/recitation-segmentation-augmented)

## Usage

You can use this model with its accompanying Python library, `recitations-segmenter`, which integrates with Hugging Face `transformers`.

### Requirements

Install `ffmpeg` and `libsndfile` system-wide.

#### Linux

```bash
sudo apt-get update
sudo apt-get install -y ffmpeg libsndfile1 portaudio19-dev
```

#### Windows & Mac

You can create an `anaconda` environment and then install these libraries:

```bash
conda create -n segment python=3.12
conda activate segment
conda install -c conda-forge ffmpeg libsndfile
```

### Via pip

```bash
pip install recitations-segmenter
```

### Sample usage (Python API)

Here's a complete example for using the library in Python. A Google Colab example is also available: [Open in Colab](https://colab.research.google.com/drive/1-RuRQOj4l2MA_SG2p4m-afR7MAsT5I22?usp=sharing)

```python
from pathlib import Path

from recitations_segmenter import segment_recitations, read_audio, clean_speech_intervals
from transformers import AutoFeatureExtractor, AutoModelForAudioFrameClassification
import torch

if __name__ == '__main__':
    device = torch.device('cuda')
    dtype = torch.bfloat16

    processor = AutoFeatureExtractor.from_pretrained(
        "obadx/recitation-segmenter-v2")
    model = AutoModelForAudioFrameClassification.from_pretrained(
        "obadx/recitation-segmenter-v2",
    )

    model.to(device, dtype=dtype)

    # Replace these with the paths to your Holy Quran recitation files
    file_paths = [
        './assets/dussary_002282.mp3',
        './assets/hussary_053001.mp3',
    ]
    waves = [read_audio(p) for p in file_paths]

    # Extract speech intervals, expressed in samples at a 16000 Hz sample rate
    sampled_outputs = segment_recitations(
        waves,
        model,
        processor,
        device=device,
        dtype=dtype,
        batch_size=8,
    )

    for out, path in zip(sampled_outputs, file_paths):
        # Clean the speech intervals by:
        # * merging short silences
        # * removing short speech segments
        # * adding padding to each speech segment
        # Raises:
        # * NoSpeechIntervals: if the wav is complete silence
        # * TooHighMinSpeechDruation: if `min_speech_duration` is so high
        #   that all speech intervals are deleted
        clean_out = clean_speech_intervals(
            out.speech_intervals,
            out.is_complete,
            min_silence_duration_ms=30,
            min_speech_duration_ms=30,
            pad_duration_ms=30,
            return_seconds=True,
        )

        print(f'Speech Intervals of: {Path(path).name}: ')
        print(clean_out.clean_speech_intervals)
        print(f'Is Recitation Complete: {clean_out.is_complete}')
        print('-' * 40)
```
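
To make the three cleaning parameters more concrete, here is a small standalone sketch of the same idea on plain `(start_ms, end_ms)` tuples. It is an illustrative re-implementation, not the code behind `clean_speech_intervals`: the function name `clean_intervals`, the millisecond units, and the order of operations (merge, then filter, then pad) are assumptions.

```python
def clean_intervals(intervals, min_silence_ms=30, min_speech_ms=30, pad_ms=30):
    """Merge short gaps, drop short speech, then pad each interval (all in ms)."""
    merged = []
    for start, end in sorted(intervals):
        # A gap shorter than min_silence_ms is treated as part of the speech.
        if merged and start - merged[-1][1] < min_silence_ms:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # Discard speech segments that are too short to be meaningful.
    kept = [(s, e) for s, e in merged if e - s >= min_speech_ms]
    # Pad both sides, clamping the start at zero.
    return [(max(0, s - pad_ms), e + pad_ms) for s, e in kept]


raw = [(100, 500), (510, 900), (1500, 1510), (2000, 2600)]
print(clean_intervals(raw))  # [(70, 930), (1970, 2630)]
```

In this example the 10 ms gap between the first two intervals is merged away, the 10 ms blip at 1500 ms is dropped, and the survivors are padded by 30 ms on each side. Note that this simple sketch does not re-merge intervals that overlap after padding.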

## Training procedure

The model was trained as a `Wav2Vec2BertForAudioFrameClassification` model using the `transformers` library. More detailed motivations, methodology, and setup can be found in the "تفاصيل التدريب" ("Training Details") section of the GitHub repository.

### Training hyperparameters

The following hyperparameters were used during training:
- learning_rate: 5e-05
- train_batch_size: 50
- eval_batch_size: 64
- seed: 42
- optimizer: adamw_torch with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: constant
- lr_scheduler_warmup_ratio: 0.2
- num_epochs: 1
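
For readers who want to reproduce a similar setup, the hyperparameters above map roughly onto `transformers.TrainingArguments` as sketched below. This is an assumption-laden sketch rather than the actual training script: the `output_dir` path is hypothetical, and mapping `train_batch_size` to `per_device_train_batch_size` assumes training on a single device.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./segmenter-checkpoints",  # hypothetical path, not from the card
    learning_rate=5e-5,
    per_device_train_batch_size=50,  # assumes a single device
    per_device_eval_batch_size=64,
    seed=42,
    optim="adamw_torch",
    adam_beta1=0.9,
    adam_beta2=0.999,
    adam_epsilon=1e-8,
    lr_scheduler_type="constant",
    warmup_ratio=0.2,
    num_train_epochs=1,
)
```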

### Training results

| Training Loss | Epoch  | Step | Accuracy | F1     | Validation Loss | Precision | Recall |
|:-------------:|:------:|:----:|:--------:|:------:|:---------------:|:---------:|:------:|
| 0.0701        | 0.2507 | 275  | 0.9953   | 0.9959 | 0.0249          | 0.9947    | 0.9971 |
| 0.0234        | 0.5014 | 550  | 0.9953   | 0.9959 | 0.0185          | 0.9940    | 0.9977 |
| 0.0186        | 0.7521 | 825  | 0.9958   | 0.9964 | 0.0132          | 0.9976    | 0.9951 |
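
As a quick consistency check on the table, the F1 column can be recomputed from the precision and recall columns, assuming F1 here is the usual harmonic mean of the two. The helper name `f1_score` is introduced only for this example. Because the table values are rounded to four decimals, the recomputed figures may differ from the reported ones in the last digit.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# (step, precision, recall, reported F1), copied from the table above
rows = [
    (275, 0.9947, 0.9971, 0.9959),
    (550, 0.9940, 0.9977, 0.9959),
    (825, 0.9976, 0.9951, 0.9964),
]

for step, precision, recall, reported in rows:
    computed = f1_score(precision, recall)
    print(f"step {step}: computed F1 = {computed:.4f}, reported F1 = {reported:.4f}")
```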

### Framework versions

- Transformers 4.51.3
- PyTorch 2.2.1+cu121
- Datasets 3.5.0
- Tokenizers 0.21.1

## Citation
If you find our work helpful or inspiring, please feel free to cite it.

```bibtex
@article{ibrahim2025automatic,
  title={Automatic Pronunciation Error Detection and Correction of the Holy Quran's Learners Using Deep Learning},
  author={Abdullah Abdelfattah and Mahmoud I. Khalil and Hazem M. Abbas},
  journal={arXiv preprint arXiv:2509.00094},
  year={2025}
}
```