# T5-Based Multilingual Text Translator
This repository presents a fine-tuned T5-small model for multilingual text translation across English, French, German, Italian, and Portuguese. It includes quantization for efficient inference and speech synthesis support for accessibility.
---
## Problem Statement
The goal is to translate text between English and multiple European languages using a transformer-based model. Instead of using black-box APIs, this project fine-tunes the T5 model on parallel multilingual corpora, enabling offline translation and potential customization.
---
## Dataset
- **Source:** Custom parallel corpus (`.txt` files) with one-to-one sentence alignments.
- **Languages Supported:**
  - English
  - French
  - German
  - Italian
  - Portuguese
- **Structure:**
  - Each language has a corresponding `.txt` file.
  - Lines are aligned by index to form translation pairs.
- **Example Input Format:**
```
Source: translate English to French: I am a student.
Target: Je suis un étudiant.
```
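The exact preprocessing lives in `multilingual_translator.py`; the snippet below is only a minimal sketch of how index-aligned files can be turned into prefixed source/target pairs in this format (the `load_pairs` helper and the English-to-French pairing are illustrative assumptions, not the repository's code).

```python
# Sketch: build source/target pairs from two index-aligned files.
def load_pairs(src_path, tgt_path, prefix):
    with open(src_path, encoding="utf-8") as f_src, open(tgt_path, encoding="utf-8") as f_tgt:
        src_lines = [line.strip() for line in f_src]
        tgt_lines = [line.strip() for line in f_tgt]
    assert len(src_lines) == len(tgt_lines), "files must be line-aligned"
    return [{"source": prefix + s, "target": t} for s, t in zip(src_lines, tgt_lines)]

pairs = load_pairs("english.txt", "french.txt", "translate English to French: ")
# pairs[0] -> {"source": "translate English to French: I am a student.",
#              "target": "Je suis un étudiant."}
```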
---
## Model Details
- **Architecture:** T5-small
- **Tokenizer:** `T5Tokenizer`
- **Model:** `T5ForConditionalGeneration`
- **Task Type:** Sequence-to-Sequence Translation (Supervised Fine-tuning)
---
## Installation
```bash
pip install transformers datasets torch gtts
```
---
## Loading the Model
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch

# Load quantized model (float16)
model = T5ForConditionalGeneration.from_pretrained("quantized_model", torch_dtype=torch.float16)
tokenizer = T5Tokenizer.from_pretrained("quantized_model")

# Translation example
source = "translate English to German: How are you?"
inputs = tokenizer(source, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model.generate(**inputs)
print("Translated:", tokenizer.decode(outputs[0], skip_special_tokens=True))
```
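The speech synthesis support mentioned above relies on gTTS (installed earlier). The following is a small sketch of speaking the translated output; it reuses the German example above, and the output file name is arbitrary.

```python
from gtts import gTTS

# Speak the translated text (requires internet access; see Limitations)
translated = tokenizer.decode(outputs[0], skip_special_tokens=True)
gTTS(text=translated, lang="de").save("translation.mp3")  # "de" matches the German target above
```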
---
## Performance Metrics
Because the model was fine-tuned for only a single epoch, translation quality metrics were not computed for this release. For a production-level system, BLEU or ROUGE scores should be evaluated on a held-out test set.
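As one option, BLEU can be computed with the Hugging Face `evaluate` library; the snippet below is a sketch only and assumes `pip install evaluate sacrebleu`, which is not part of the installation step above.

```python
import evaluate

# Sketch: score model translations against reference translations with SacreBLEU
bleu = evaluate.load("sacrebleu")
predictions = ["Je suis un étudiant."]   # model outputs on the test split
references = [["Je suis un étudiant."]]  # one or more references per prediction
print(bleu.compute(predictions=predictions, references=references)["score"])
```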
---
## Fine-Tuning Details
### Dataset Preparation
- A total of 5 text files (`english.txt`, `french.txt`, etc.)
- Each sentence aligned by index for parallel translation.
### Training Configuration
- **Epochs:** 1
- **Batch size:** 4
- **Max sequence length:** 128
- **Model base:** `t5-small`
- **Framework:** Hugging Face Transformers + PyTorch
- **Evaluation strategy:** 10% test split (a matching training sketch follows this list)
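
For reference, the configuration above maps roughly onto the `Seq2SeqTrainer` setup sketched below. This is not the exact contents of `multilingual_translator.py`; it assumes the `pairs` list built in the dataset sketch earlier and standard Hugging Face defaults for everything not listed.

```python
from datasets import Dataset
from transformers import (DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments, T5ForConditionalGeneration,
                          T5Tokenizer)

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def tokenize(batch):
    # Max sequence length of 128 for both source and target, as configured above
    enc = tokenizer(batch["source"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=128, truncation=True)
    enc["labels"] = labels["input_ids"]
    return enc

# 10% held-out test split
dataset = Dataset.from_list(pairs).train_test_split(test_size=0.1)
tokenized = dataset.map(tokenize, batched=True, remove_columns=["source", "target"])

args = Seq2SeqTrainingArguments(
    output_dir="model",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.save_model("model")
tokenizer.save_pretrained("model")
```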
---
## Quantization
Post-training quantization was performed using `.half()` precision (FP16) to reduce model size and improve inference speed.
```python
# Load the full-precision model and tokenizer
model_fp32 = T5ForConditionalGeneration.from_pretrained("model")
tokenizer = T5Tokenizer.from_pretrained("model")

# Convert to half precision and save alongside the tokenizer
model_fp16 = model_fp32.half()
model_fp16.save_pretrained("quantized_model")
tokenizer.save_pretrained("quantized_model")  # lets the loading example read everything from "quantized_model"
```
**Model Size Comparison:**

| Type             | Size (KB) |
|------------------|-----------|
| FP32 (Original)  | ~6,904    |
| FP16 (Quantized) | ~3,452    |
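As a quick check, the on-disk sizes of the two directories can be compared by summing their files. This is only a sketch and assumes both directories contain model files directly (no subdirectories).

```python
import os

# Sketch: report the total size of each saved model directory in KB
def dir_size_kb(path):
    return sum(os.path.getsize(os.path.join(path, f)) for f in os.listdir(path)) / 1024

print(f"FP32: ~{dir_size_kb('model'):,.0f} KB, FP16: ~{dir_size_kb('quantized_model'):,.0f} KB")
```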
---
## Repository Structure
```
.
├── model/                       # Contains FP32 model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   └── ...
├── quantized_model/             # Contains FP16 quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   └── ...
├── README.md                    # Documentation
└── multilingual_translator.py   # Training and inference script
```
---
## Limitations
- Trained on a small dataset for only one epoch, so it may not generalize well to all phrases or complex sentences.
- Language coverage is limited to the 5 predefined languages.
- gTTS depends on Google's text-to-speech service and requires internet access.
---
## Contributing
Feel free to submit issues or PRs to add more language pairs, extend training, or integrate a UI for real-time use.