AventIQ-AI/Multilingual_Text_Translator

KshitizTayal committed on Jun 3

Commit 974a7e1 (verified) · 1 Parent(s): 4a43961

Update README.md

README.md +150 -3

@@ -1,3 +1,150 @@
----
-license: apache-2.0
----

+# 🌐 T5-Based Multilingual Text Translator
+This repository presents a fine-tuned T5-small model for multilingual text translation across English, French, German, Italian, and Portuguese. It also provides an FP16 (half-precision) version for more efficient inference and speech synthesis via gTTS for accessibility.
+
+---
+## 📝 Problem Statement
+The goal is to translate text between English and multiple European languages using a transformer-based model. Instead of using black-box APIs, this project fine-tunes the T5 model on parallel multilingual corpora, enabling offline translation and potential customization.
+
+---
+## 📊 Dataset
+- **Source:** Custom parallel corpus (`.txt` files) with one-to-one sentence alignments.
+- **Languages Supported:**
+  - English
+  - French
+  - German
+  - Italian
+  - Portuguese
+- **Structure:**
+  - Each language has a corresponding `.txt` file.
+  - Lines are aligned by index to form translation pairs.
+- **Example Input Format:**
+  ```
+  Source: translate English to French: I am a student.
+  Target: Je suis un étudiant.
+  ```
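The index-aligned structure described above can be turned into training pairs roughly as follows. This is a minimal sketch, not the repository's actual script: the helper name `build_pairs`, the dictionary keys, and the in-memory sample lines are assumptions for illustration; only the `translate <source> to <target>: ` prefix comes from the example format above.

```python
# Hypothetical helper: the function name and dict keys are illustrative assumptions.
def build_pairs(src_lines, tgt_lines, src_lang, tgt_lang):
    """Pair index-aligned sentences and prepend a T5-style task prefix."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("parallel files must have the same number of lines")
    prefix = f"translate {src_lang} to {tgt_lang}: "
    pairs = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:  # skip pairs where either side is blank
            pairs.append({"source": prefix + src, "target": tgt})
    return pairs


# In practice the lines would be read from the per-language .txt files.
english = ["I am a student.\n", "How are you?\n"]
french = ["Je suis un étudiant.\n", "Comment allez-vous ?\n"]
pairs = build_pairs(english, french, "English", "French")
print(pairs[0])
```

Raising on a length mismatch is a deliberate choice here: with alignment by line index, a single missing line would silently shift every later pair.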
+---
+## 🧠 Model Details
+- **Architecture:** T5-small
+- **Tokenizer:** `T5Tokenizer`
+- **Model:** `T5ForConditionalGeneration`
+- **Task Type:** Sequence-to-Sequence Translation (Supervised Fine-tuning)
+---
+## 🔧 Installation
+```bash
+pip install transformers datasets torch sentencepiece gtts
+```
+---
+## 🚀 Loading the Model
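+```python
+from transformers import T5ForConditionalGeneration, T5Tokenizer
+import torch
+
+# Prefer a GPU: FP16 operations may be slow or unsupported on CPU, depending on the PyTorch version
+device = "cuda" if torch.cuda.is_available() else "cpu"
+# Load the FP16 model and its tokenizer from the local quantized_model/ directory
+model = T5ForConditionalGeneration.from_pretrained("quantized_model", torch_dtype=torch.float16).to(device)
+model.eval()
+tokenizer = T5Tokenizer.from_pretrained("quantized_model")
+# Translation example: the task prefix selects the language pair
+source = "translate English to German: How are you?"
+inputs = tokenizer(source, return_tensors="pt", padding=True, truncation=True, max_length=128).to(device)
+with torch.no_grad():
+    outputs = model.generate(**inputs, max_new_tokens=128)
+print("Translated:", tokenizer.decode(outputs[0], skip_special_tokens=True))
+```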
+---
+## 📈 Performance Metrics
+Because the model was fine-tuned for only a single epoch, no performance metrics were computed. A production-level system should be evaluated on the held-out test split with a translation metric such as BLEU (or ROUGE).
+
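As a starting point for such an evaluation, the sketch below computes clipped n-gram precision, the quantity that BLEU aggregates over several n-gram orders and combines with a brevity penalty. It is a simplified illustration in plain Python, not a full BLEU implementation and not part of the repository's script; whitespace tokenization and the function names are assumptions.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(hypothesis, reference, n=1):
    """Clipped n-gram precision of one hypothesis against one reference."""
    hyp_counts = Counter(ngrams(hypothesis.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(hyp_counts.values())
    if total == 0:
        return 0.0
    # Credit each hypothesis n-gram at most as often as it occurs in the reference.
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    return matched / total


print(clipped_precision("Je suis étudiant.", "Je suis un étudiant.", n=1))  # 1.0
print(clipped_precision("Je suis étudiant.", "Je suis un étudiant.", n=2))  # 0.5
```

For scores intended to be reported or compared, an established implementation such as `sacrebleu` is preferable to hand-rolled code.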
+---
+## 🏋️ Fine-Tuning Details
+### 📚 Dataset Preparation
+- Five text files, one per language (`english.txt`, `french.txt`, etc.).
+- Sentences are aligned by line index to form parallel translation pairs.
+### 🔧 Training Configuration
+- **Epochs:** 1
+- **Batch size:** 4
+- **Max sequence length:** 128
+- **Model base:** `t5-small`
+- **Framework:** Hugging Face Transformers + PyTorch
+- **Evaluation strategy:** 10% of the data held out as a test split
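The 10% test split listed above could be produced along these lines. This is a sketch under stated assumptions: the function name, the fixed seed, and the rounding rule are illustrative, and the README does not say how the split was actually made.

```python
import random


def train_test_split(examples, test_fraction=0.1, seed=42):
    """Shuffle a copy of the examples and hold out a fraction for testing."""
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible across runs (seed is an assumption).
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


examples = [f"sentence {i}" for i in range(50)]
train, test = train_test_split(examples)
print(len(train), len(test))  # 45 5
```

When the data is already a Hugging Face `datasets.Dataset`, its `train_test_split(test_size=0.1)` method serves the same purpose.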
+---
+## 🔄 Quantization
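+After training, the model was converted to half precision (FP16) with `.half()`. This is a floating-point cast rather than integer quantization; it halves the size of the weights and can speed up inference on hardware with native FP16 support.
+```python
+from transformers import T5ForConditionalGeneration, T5Tokenizer
+# Load full-precision model and tokenizer
+model_fp32 = T5ForConditionalGeneration.from_pretrained("model")
+tokenizer = T5Tokenizer.from_pretrained("model")
+# Convert weights to half precision (in place) and save
+model_fp16 = model_fp32.half()
+model_fp16.save_pretrained("quantized_model")
+# Save the tokenizer alongside so that "quantized_model" loads on its own
+tokenizer.save_pretrained("quantized_model")
+```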
+**Model Size Comparison:**
+
+| Type             | Size (KB) |
+|------------------|-----------|
+| FP32 (Original)  | ~6,904    |
+| FP16 (Quantized) | ~3,452    |
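The halving shown in the table follows from storage width: FP32 uses 4 bytes per parameter and FP16 uses 2. The sketch below reproduces that arithmetic from the table's FP32 figure. The function name is hypothetical, and the estimate ignores file metadata and any tensors that are not cast.

```python
# Bytes used to store one parameter in each floating-point format.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def converted_size_kb(size_kb, src="fp32", dst="fp16"):
    """Estimate the weight-file size after casting from one dtype to another."""
    return size_kb * BYTES_PER_PARAM[dst] / BYTES_PER_PARAM[src]


print(converted_size_kb(6904))  # 3452.0
```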
+---
+## 📁 Repository Structure
+```
+.
+├── model/                       # Contains FP32 model files
+│   ├── config.json
+│   ├── model.safetensors
+│   ├── tokenizer_config.json
+│   └── ...
+├── quantized_model/            # Contains FP16 quantized model files
+│   ├── config.json
+│   ├── model.safetensors
+│   ├── tokenizer_config.json
+│   └── ...
+├── README.md                   # Documentation
+└── multilingual_translator.py  # Training and inference script
+```
+---
+## ⚠️ Limitations
+- Trained on a small dataset for only one epoch, so it may not generalize well to uncommon phrases or complex sentences.
+- Language coverage is limited to five predefined languages.
+- Speech synthesis relies on gTTS, which calls Google's text-to-speech service and requires internet access.
+---
+## 🤝 Contributing
+Feel free to submit issues or PRs to add more language pairs, extend training, or integrate a UI for real-time use.