# T5-Based Multilingual Text Translator
This repository presents a fine-tuned T5-small model for multilingual text translation across English, French, German, Italian, and Portuguese. It includes quantization for efficient inference and speech synthesis support for accessibility.
---
## Problem Statement
The goal is to translate text between English and multiple European languages using a transformer-based model. Instead of using black-box APIs, this project fine-tunes the T5 model on parallel multilingual corpora, enabling offline translation and potential customization.
---
## Dataset
- **Source:** Custom parallel corpus (`.txt` files) with one-to-one sentence alignments.
- **Languages Supported:**
  - English
  - French
  - German
  - Italian
  - Portuguese
- **Structure:**
  - Each language has a corresponding `.txt` file.
  - Lines are aligned by index to form translation pairs.
- **Example Input Format:**
```
Source: translate English to French: I am a student.
Target: Je suis un étudiant.
```
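The exact preprocessing lives in `multilingual_translator.py`; the snippet below is only a minimal sketch of how index-aligned files can be turned into prefixed source/target pairs in this format (the `load_pairs` helper and the English-to-French pairing are illustrative assumptions, not the repository's code).

```python
# Sketch: build source/target pairs from two index-aligned files.
def load_pairs(src_path, tgt_path, prefix):
    with open(src_path, encoding="utf-8") as f_src, open(tgt_path, encoding="utf-8") as f_tgt:
        src_lines = [line.strip() for line in f_src]
        tgt_lines = [line.strip() for line in f_tgt]
    assert len(src_lines) == len(tgt_lines), "files must be line-aligned"
    return [{"source": prefix + s, "target": t} for s, t in zip(src_lines, tgt_lines)]

pairs = load_pairs("english.txt", "french.txt", "translate English to French: ")
# pairs[0] -> {"source": "translate English to French: I am a student.",
#              "target": "Je suis un étudiant."}
```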
---
## Model Details
- **Architecture:** T5-small
- **Tokenizer:** `T5Tokenizer`
- **Model:** `T5ForConditionalGeneration`
- **Task Type:** Sequence-to-Sequence Translation (Supervised Fine-tuning)
---
## Installation
```bash
pip install transformers datasets torch gtts
```
---
## Loading the Model
```python
from transformers import T5ForConditionalGeneration, T5Tokenizer
import torch

# Load quantized model (float16)
model = T5ForConditionalGeneration.from_pretrained("quantized_model", torch_dtype=torch.float16)
tokenizer = T5Tokenizer.from_pretrained("quantized_model")

# Translation example
source = "translate English to German: How are you?"
inputs = tokenizer(source, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    outputs = model.generate(**inputs)
print("Translated:", tokenizer.decode(outputs[0], skip_special_tokens=True))
```
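The speech synthesis support mentioned above relies on gTTS (installed earlier). The following is a small sketch of speaking the translated output; it reuses the German example above, and the output file name is arbitrary.

```python
from gtts import gTTS

# Speak the translated text (requires internet access; see Limitations)
translated = tokenizer.decode(outputs[0], skip_special_tokens=True)
gTTS(text=translated, lang="de").save("translation.mp3")  # "de" matches the German target above
```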
---
## Performance Metrics
Because the model was fine-tuned for only a single epoch, translation quality metrics were not computed for this release. For a production-level system, BLEU or ROUGE scores should be evaluated on a held-out test set.
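As one option, BLEU can be computed with the Hugging Face `evaluate` library; the snippet below is a sketch only and assumes `pip install evaluate sacrebleu`, which is not part of the installation step above.

```python
import evaluate

# Sketch: score model translations against reference translations with SacreBLEU
bleu = evaluate.load("sacrebleu")
predictions = ["Je suis un étudiant."]   # model outputs on the test split
references = [["Je suis un étudiant."]]  # one or more references per prediction
print(bleu.compute(predictions=predictions, references=references)["score"])
```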
---
## Fine-Tuning Details
### Dataset Preparation
- A total of 5 text files (`english.txt`, `french.txt`, etc.)
- Each sentence aligned by index for parallel translation.
### Training Configuration
- **Epochs:** 1
- **Batch size:** 4
- **Max sequence length:** 128
- **Model base:** `t5-small`
- **Framework:** Hugging Face Transformers + PyTorch
- **Evaluation strategy:** 10% test split (a matching training sketch follows this list)
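
For reference, the configuration above maps roughly onto the `Seq2SeqTrainer` setup sketched below. This is not the exact contents of `multilingual_translator.py`; it assumes the `pairs` list built in the dataset sketch earlier and standard Hugging Face defaults for everything not listed.

```python
from datasets import Dataset
from transformers import (DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments, T5ForConditionalGeneration,
                          T5Tokenizer)

tokenizer = T5Tokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

def tokenize(batch):
    # Max sequence length of 128 for both source and target, as configured above
    enc = tokenizer(batch["source"], max_length=128, truncation=True)
    labels = tokenizer(text_target=batch["target"], max_length=128, truncation=True)
    enc["labels"] = labels["input_ids"]
    return enc

# 10% held-out test split
dataset = Dataset.from_list(pairs).train_test_split(test_size=0.1)
tokenized = dataset.map(tokenize, batched=True, remove_columns=["source", "target"])

args = Seq2SeqTrainingArguments(
    output_dir="model",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
trainer.save_model("model")
tokenizer.save_pretrained("model")
```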
---
## Quantization
Post-training quantization was performed using `.half()` precision (FP16) to reduce model size and improve inference speed.
```python
# Load the full-precision model and tokenizer
model_fp32 = T5ForConditionalGeneration.from_pretrained("model")
tokenizer = T5Tokenizer.from_pretrained("model")

# Convert to half precision and save alongside the tokenizer
model_fp16 = model_fp32.half()
model_fp16.save_pretrained("quantized_model")
tokenizer.save_pretrained("quantized_model")  # lets the loading example read everything from "quantized_model"
```
**Model Size Comparison:**

| Type             | Size (KB) |
|------------------|-----------|
| FP32 (Original)  | ~6,904    |
| FP16 (Quantized) | ~3,452    |
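As a quick check, the on-disk sizes of the two directories can be compared by summing their files. This is only a sketch and assumes both directories contain model files directly (no subdirectories).

```python
import os

# Sketch: report the total size of each saved model directory in KB
def dir_size_kb(path):
    return sum(os.path.getsize(os.path.join(path, f)) for f in os.listdir(path)) / 1024

print(f"FP32: ~{dir_size_kb('model'):,.0f} KB, FP16: ~{dir_size_kb('quantized_model'):,.0f} KB")
```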
---
## Repository Structure
```
.
├── model/                       # Contains FP32 model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   └── ...
├── quantized_model/             # Contains FP16 quantized model files
│   ├── config.json
│   ├── model.safetensors
│   ├── tokenizer_config.json
│   └── ...
├── README.md                    # Documentation
└── multilingual_translator.py   # Training and inference script
```
---
## Limitations
- Trained on a small dataset for only one epoch, so it may not generalize well to all phrases or complex sentences.
- Language coverage is limited to the 5 predefined languages.
- gTTS depends on Google's text-to-speech service and requires internet access.
---
## Contributing
Feel free to submit issues or PRs to add more language pairs, extend training, or integrate a UI for real-time use.