---
title: Translate
emoji: 🌐
colorFrom: blue
colorTo: indigo
sdk: docker
app_file: app.py
pinned: false
---
# Saksi Translation: Nepali/Sinhala-to-English Machine Translation
This project provides machine translation from Nepali and Sinhala to English. It leverages the NLLB (No Language Left Behind) model from Meta AI, fine-tuned on a custom dataset for improved performance. The project includes a complete workflow from data acquisition to model deployment, along with a REST API for easy integration.
## Table of Contents
- [Features](#features)
- [Workflow](#workflow)
- [Tech Stack](#tech-stack)
- [Model Details](#model-details)
- [API Endpoints](#api-endpoints)
- [Getting Started](#getting-started)
- [Usage](#usage)
- [Project Structure](#project-structure)
- [Future Improvements](#future-improvements)
## Features
- **High-Quality Translation:** Utilizes a fine-tuned NLLB model for accurate translations.
- **Support for Multiple Languages:** Currently supports Nepali-to-English and Sinhala-to-English translation.
- **REST API:** Exposes the translation model through a high-performance FastAPI application.
- **Interactive Frontend:** A simple and intuitive web interface for easy translation.
- **Batch Translation:** Supports translating multiple texts in a single request.
- **PDF Translation:** Supports translating text directly from PDF files.
- **Scalable and Reproducible:** Built with a modular structure and uses MLflow for experiment tracking.
## Workflow
The project follows a standard machine learning workflow for building and deploying a translation model:
1. **Data Acquisition:** The process begins with collecting parallel text data (Nepali/Sinhala and English). The `scripts/fetch_parallel_data.py` script is used to download data from various online sources. The quality and quantity of this data are crucial for the model's performance.
2. **Data Cleaning and Preprocessing:** Raw data from the web is often noisy and requires cleaning. The `scripts/clean_text_data.py` script performs several preprocessing steps (a minimal sketch follows this list):
* **HTML Tag Removal:** Strips out HTML tags and other web artifacts.
* **Unicode Normalization:** Normalizes Unicode characters to ensure consistency.
* **Sentence Filtering:** Removes sentences that are too long or too short, which can negatively impact training.
* **Corpus Alignment:** Ensures a one-to-one correspondence between source and target sentences.
3. **Model Fine-tuning:** The core of the project is fine-tuning a pre-trained NLLB model on our custom parallel dataset. The `src/train.py` script, which leverages the Hugging Face `Trainer` API, handles this process. This script manages the entire training loop, including:
* Loading the pre-trained NLLB model and tokenizer.
* Creating a PyTorch Dataset from the preprocessed data.
* Configuring training arguments like learning rate, batch size, and number of epochs.
* Executing the training loop and saving the fine-tuned model checkpoints.
4. **Model Evaluation:** After training, the model's performance is evaluated using the `src/evaluation.py` script. This script calculates the **BLEU (Bilingual Evaluation Understudy)** score, a widely accepted metric for machine translation quality. It works by comparing the model's translations of a test set with a set of high-quality reference translations.
5. **Inference and Deployment:** Once the model is trained and evaluated, it's ready for use.
* `interactive_translate.py`: A command-line script for quick, interactive translation tests.
* `fast_api.py`: A production-ready REST API built with FastAPI that serves the translation model. This allows other applications to easily consume the translation service.
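As promised in step 2, here is a minimal sketch of that kind of cleaning. The regex, length thresholds, and function names are illustrative assumptions; see `scripts/clean_text_data.py` for the actual logic:
```python
import re
import unicodedata

HTML_TAG = re.compile(r"<[^>]+>")  # leftover markup from scraped pages

def clean_line(line: str) -> str:
    """Strip HTML tags, normalize Unicode (NFC), and collapse whitespace."""
    line = HTML_TAG.sub(" ", line)
    line = unicodedata.normalize("NFC", line)
    return " ".join(line.split())

def keep_pair(src: str, tgt: str, min_tokens: int = 3, max_tokens: int = 128) -> bool:
    """Drop pairs whose sentences are too short or too long for training."""
    return all(min_tokens <= len(s.split()) <= max_tokens for s in (src, tgt))

def clean_corpus(src_lines: list[str], tgt_lines: list[str]) -> tuple[list[str], list[str]]:
    """Clean both sides line-by-line, keeping source/target in one-to-one correspondence."""
    src_out, tgt_out = [], []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = clean_line(src), clean_line(tgt)
        if src and tgt and keep_pair(src, tgt):
            src_out.append(src)
            tgt_out.append(tgt)
    return src_out, tgt_out
```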
## Tech Stack
The technologies used in this project were chosen to create a robust, efficient, and maintainable machine translation pipeline:
- **Python:** The primary language for the project, offering a rich ecosystem of libraries and frameworks for machine learning.
- **PyTorch:** A flexible and powerful deep learning framework that provides fine-grained control over the model training process.
- **Hugging Face Transformers:** The backbone of the project, providing easy access to pre-trained models like NLLB and a standardized interface for training and inference.
- **Hugging Face Datasets:** Simplifies the process of loading and preprocessing large datasets, with efficient data loading and manipulation capabilities.
- **FastAPI:** A modern, high-performance web framework for building APIs with Python. It's used to serve the translation model as a REST API.
- **Uvicorn:** A lightning-fast ASGI server, used to run the FastAPI application.
- **MLflow:** Used for experiment tracking to ensure reproducibility. It logs training parameters, metrics, and model artifacts, which is crucial for managing machine learning projects.
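As an illustration of the tracking pattern (the parameter names and metric value below are placeholders, not the project's actual logging calls):
```python
import mlflow

with mlflow.start_run(run_name="nllb-finetune-ne-en"):
    # Log the hyperparameters of the run so it can be reproduced later.
    mlflow.log_params({"learning_rate": 2e-5, "batch_size": 8, "epochs": 3})
    # ... training happens here ...
    mlflow.log_metric("bleu", 31.4)  # placeholder value
    # Store the fine-tuned checkpoint alongside the run.
    mlflow.log_artifacts("models/nllb-finetuned-nepali-en")
```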
## Model Details
- **Base Model:** The project uses the `facebook/nllb-200-distilled-600M` model, a distilled version of the NLLB-200 model. This model is designed to be efficient while still providing high-quality translations for a large number of languages.
- **Fine-tuning:** The base model is fine-tuned on a custom dataset of Nepali-English and Sinhala-English parallel text to improve its performance on these specific language pairs.
- **Tokenizer:** The `NllbTokenizer` is used for tokenizing the text. It is a SentencePiece-based tokenizer designed specifically for the NLLB model.
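A minimal usage sketch for the base model, assuming the standard FLORES-200 language codes (`npi_Deva` for Nepali, `sin_Sinh` for Sinhala, `eng_Latn` for English); `src/translate.py` may wrap this differently:
```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="npi_Deva")  # Nepali source
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

inputs = tokenizer("नमस्ते संसार", return_tensors="pt")
generated = model.generate(
    **inputs,
    # Force the decoder to start generating in English.
    forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
    max_length=64,
)
print(tokenizer.batch_decode(generated, skip_special_tokens=True)[0])
```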
## API Endpoints
The FastAPI application provides the following endpoints:
- **`GET /`**: Returns the frontend HTML page.
- **`GET /languages`**: Returns a list of supported languages.
- **`POST /translate`**: Translates a single text.
- **Request Body:**
```json
{
  "text": "string",
  "source_language": "string"
}
```
- **Response Body:**
```json
{
  "original_text": "string",
  "translated_text": "string",
  "source_language": "string"
}
```
- **`POST /batch-translate`**: Translates a batch of texts.
- **Request Body:**
```json
{
  "texts": [
    "string"
  ],
  "source_language": "string"
}
```
- **Response Body:**
```json
{
  "original_texts": [
    "string"
  ],
  "translated_texts": [
    "string"
  ],
  "source_language": "string"
}
```
- **`POST /translate-pdf`**: Translates a PDF file.
- **Request:** `source_language: str`, `file: UploadFile`
- **Response Body:**
```json
{
  "filename": "string",
  "translated_text": "string",
  "source_language": "string"
}
```
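A quick client-side sketch against a locally running server. The `"nepali"` language identifier is an assumption; query `GET /languages` for the values the API actually accepts:
```python
import requests

BASE = "http://127.0.0.1:8000"

# Single translation
resp = requests.post(
    f"{BASE}/translate",
    json={"text": "नमस्ते", "source_language": "nepali"},
)
print(resp.json()["translated_text"])

# Batch translation
resp = requests.post(
    f"{BASE}/batch-translate",
    json={"texts": ["नमस्ते", "धन्यवाद"], "source_language": "nepali"},
)
print(resp.json()["translated_texts"])
```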
## Getting Started
### Prerequisites
- **Python 3.10 or higher:** Ensure you have a recent version of Python installed.
- **Git and Git LFS:** Git is required to clone the repository, and Git LFS is required to handle large model files.
- **(Optional) NVIDIA GPU with CUDA:** A GPU is highly recommended for training the model.
### Installation
1. **Clone the repository:**
```bash
git clone <repository-url>
cd saksi_translation
```
2. **Create and activate a virtual environment:**
```bash
python -m venv .venv
# On Windows
.venv\Scripts\activate
# On macOS/Linux
source .venv/bin/activate
```
3. **Install dependencies:**
```bash
pip install -r requirements.txt
```
## Usage
### Data Preparation
- **Fetch Parallel Data:**
```bash
python scripts/fetch_parallel_data.py --output_dir data/raw
```
- **Clean Text Data:**
```bash
python scripts/clean_text_data.py --input_dir data/raw --output_dir data/processed
```
### Training
- **Start Training:**
```bash
python src/train.py \
--model_name "facebook/nllb-200-distilled-600M" \
--dataset_path "data/processed" \
--output_dir "models/nllb-finetuned-nepali-en" \
--learning_rate 2e-5 \
--per_device_train_batch_size 8 \
--num_train_epochs 3
```
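The training loop in `src/train.py` is built on the Hugging Face `Trainer` API; the following condensed sketch shows the general shape. The toy in-memory dataset and the `Seq2SeqTrainer` specifics here are illustrative assumptions, not the script itself:
```python
from datasets import Dataset
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

model_name = "facebook/nllb-200-distilled-600M"
tokenizer = AutoTokenizer.from_pretrained(model_name, src_lang="npi_Deva", tgt_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Toy two-sentence corpus; the real script loads the files under data/processed instead.
raw = Dataset.from_dict({
    "ne": ["नमस्ते संसार", "धन्यवाद"],
    "en": ["Hello world", "Thank you"],
})

def preprocess(batch):
    # Tokenize source and target together; language tags come from src_lang/tgt_lang above.
    return tokenizer(batch["ne"], text_target=batch["en"], truncation=True, max_length=128)

train_dataset = raw.map(preprocess, batched=True, remove_columns=["ne", "en"])

args = Seq2SeqTrainingArguments(
    output_dir="models/nllb-finetuned-nepali-en",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    num_train_epochs=3,
    save_strategy="epoch",
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```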
### Evaluation
- **Evaluate the Model:**
```bash
python src/evaluation.py \
--model_path "models/nllb-finetuned-nepali-en" \
--test_data_path "data/test_sets/test.ne" \
--reference_data_path "data/test_sets/test.en"
```
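For reference, the BLEU score itself can be computed with `sacrebleu` as in the sketch below; whether `src/evaluation.py` uses this library is an assumption:
```python
import sacrebleu

hypotheses = ["The weather is nice today."]        # model translations, one per test sentence
references = [["Today the weather is pleasant."]]  # one reference corpus, parallel to hypotheses

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}")
```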
### Interactive Translation
- **Run the interactive script:**
```bash
python interactive_translate.py
```
### API
- **Run the API:**
```bash
uvicorn fast_api:app --reload
```
Open your browser and navigate to `http://127.0.0.1:8000` to use the web interface.
## Project Structure
```
saksi_translation/
├── .gitignore
├── fast_api.py                   # FastAPI application
├── interactive_translate.py      # Interactive translation script
├── README.md                     # Project documentation
├── requirements.txt              # Python dependencies
├── test_translation.py           # Script for testing the translation model
├── frontend/
│   ├── index.html                # Frontend HTML
│   ├── script.js                 # Frontend JavaScript
│   └── styles.css                # Frontend CSS
├── data/
│   ├── processed/                # Processed data for training
│   ├── raw/                      # Raw data downloaded from the web
│   └── test_sets/                # Test sets for evaluation
├── mlruns/                       # MLflow experiment tracking data
├── models/
│   └── nllb-finetuned-nepali-en/ # Fine-tuned model
├── notebooks/                    # Jupyter notebooks for experimentation
├── scripts/
│   ├── clean_text_data.py
│   ├── create_test_set.py
│   ├── download_model.py
│   ├── fetch_parallel_data.py
│   └── scrape_bbc_nepali.py
└── src/
    ├── __init__.py
    ├── evaluation.py             # Script for evaluating the model
    ├── train.py                  # Script for training the model
    └── translate.py              # Script for translating text
```
## Future Improvements
- **Support for more languages:** The project can be extended by collecting parallel data for additional language pairs and fine-tuning the model on them.
- **Improved Model:** The model can be improved by using a larger version of the NLLB model or by fine-tuning it on a larger and cleaner dataset.
- **Advanced Frontend:** The frontend can be improved by adding features like translation history, user accounts, and more advanced styling.
- **Containerization:** The application can be containerized using Docker for easier deployment and scaling.