---
title: Translate
emoji: π
colorFrom: blue
colorTo: indigo
sdk: docker
app_file: app.py
pinned: false
---
# Saksi Translation: Nepali-English Machine Translation

This project translates text from Nepali and Sinhala to English. It builds on the NLLB (No Language Left Behind) model from Meta AI, fine-tuned on a custom dataset to improve quality on these language pairs. The project covers the full workflow from data acquisition to model deployment and includes a REST API for easy integration.
## Table of Contents

- [Features](#features)
- [Workflow](#workflow)
- [Tech Stack](#tech-stack)
- [Model Details](#model-details)
- [API Endpoints](#api-endpoints)
- [Getting Started](#getting-started)
- [Usage](#usage)
- [Project Structure](#project-structure)
- [Future Improvements](#future-improvements)
## Features

- **High-Quality Translation:** Uses a fine-tuned NLLB model for accurate translations.
- **Support for Multiple Languages:** Currently supports Nepali-to-English and Sinhala-to-English translation.
- **REST API:** Exposes the translation model through a FastAPI application.
- **Interactive Frontend:** A simple web interface for translating text in the browser.
- **Batch Translation:** Supports translating multiple texts in a single request.
- **PDF Translation:** Supports translating text directly from PDF files.
- **Scalable and Reproducible:** Built with a modular structure and uses MLflow for experiment tracking.
## Workflow

The project follows a standard machine learning workflow for building and deploying a translation model:

1. **Data Acquisition:** The process begins with collecting parallel text data (Nepali-English and Sinhala-English). The `scripts/fetch_parallel_data.py` script downloads this data from various online sources. Both the quality and the quantity of the data strongly affect the model's performance.
2. **Data Cleaning and Preprocessing:** Raw data from the web is often noisy and requires cleaning. The `scripts/clean_text_data.py` script performs several preprocessing steps:
    * **HTML Tag Removal:** Strips out HTML tags and other web artifacts.
    * **Unicode Normalization:** Normalizes Unicode characters to ensure consistency.
    * **Sentence Filtering:** Removes sentences that are too long or too short, which can negatively impact training.
    * **Corpus Alignment:** Ensures a one-to-one correspondence between source and target sentences.
3. **Model Fine-tuning:** The core of the project is fine-tuning a pre-trained NLLB model on the custom parallel dataset. The `src/train.py` script, built on the Hugging Face `Trainer` API, manages the entire training loop, including:
    * Loading the pre-trained NLLB model and tokenizer.
    * Creating a PyTorch Dataset from the preprocessed data.
    * Configuring training arguments such as learning rate, batch size, and number of epochs.
    * Executing the training loop and saving the fine-tuned model checkpoints.
4. **Model Evaluation:** After training, the `src/evaluation.py` script evaluates the model. It calculates the **BLEU (Bilingual Evaluation Understudy)** score, a widely used metric for machine translation quality, by comparing the model's translations of a test set against high-quality reference translations.
5. **Inference and Deployment:** Once the model is trained and evaluated, it is ready for use.
    * `interactive_translate.py`: A command-line script for quick, interactive translation tests.
    * `fast_api.py`: A production-ready REST API built with FastAPI that serves the translation model, so other applications can consume the translation service.
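The cleaning steps in item 2 can be sketched with the standard library alone. This is a minimal illustration, not the contents of `scripts/clean_text_data.py`: the function names, the NFC normalization form, and the 3 to 200 word length bounds are assumptions chosen for the example.

```python
import html
import re
import unicodedata

TAG_RE = re.compile(r"<[^>]+>")  # naive HTML tag matcher, good enough for a sketch
SPACE_RE = re.compile(r"\s+")


def clean_sentence(text: str) -> str:
    """Strip HTML tags, unescape entities, normalize Unicode and whitespace."""
    text = TAG_RE.sub(" ", text)
    text = html.unescape(text)
    text = unicodedata.normalize("NFC", text)  # assumed normalization form
    return SPACE_RE.sub(" ", text).strip()


def clean_parallel_corpus(pairs, min_words=3, max_words=200):
    """Clean (source, target) pairs and drop any pair with a side out of bounds.

    Pairs are filtered together, never one side alone, so the corpus
    stays aligned one-to-one.
    """
    kept = []
    for src, tgt in pairs:
        src, tgt = clean_sentence(src), clean_sentence(tgt)
        if all(min_words <= len(s.split()) <= max_words for s in (src, tgt)):
            kept.append((src, tgt))
    return kept
```

Dropping the whole pair when either side fails the length check is what preserves the alignment described in the last sub-step.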
## Tech Stack

The technologies in this project were chosen to build a robust, efficient, and maintainable machine translation pipeline:

- **Python:** The primary language for the project, with a rich ecosystem of machine learning libraries and frameworks.
- **PyTorch:** A flexible deep learning framework that provides fine-grained control over the model training process.
- **Hugging Face Transformers:** The backbone of the project, providing access to pre-trained models like NLLB and a standardized interface for training and inference.
- **Hugging Face Datasets:** Simplifies loading and preprocessing large datasets, with efficient data loading and manipulation.
- **FastAPI:** A modern, high-performance web framework for building APIs with Python. It serves the translation model as a REST API.
- **Uvicorn:** A fast ASGI server, used to run the FastAPI application.
- **MLflow:** Used for experiment tracking to ensure reproducibility. It logs training parameters, metrics, and model artifacts.
## Model Details

- **Base Model:** The project uses `facebook/nllb-200-distilled-600M`, a distilled version of the NLLB-200 model. It is designed to be efficient while still providing high-quality translations for a large number of languages.
- **Fine-tuning:** The base model is fine-tuned on a custom dataset of Nepali-English and Sinhala-English parallel text to improve its performance on these specific language pairs.
- **Tokenizer:** The `NllbTokenizer` is used for tokenizing the text. It is a SentencePiece-based tokenizer designed specifically for the NLLB model.
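NLLB checkpoints identify languages with FLORES-200 codes such as `npi_Deva` (Nepali), `sin_Sinh` (Sinhala), and `eng_Latn` (English). The sketch below shows one way to run the base checkpoint with Hugging Face Transformers; it is not the project's `src/translate.py`. The `LANG_CODES` mapping, the function names, and `max_length=256` are assumptions made for the example, and `translate` downloads the checkpoint on first use.

```python
# Illustrative mapping from friendly names to FLORES-200 codes (names are assumed).
LANG_CODES = {"nepali": "npi_Deva", "sinhala": "sin_Sinh", "english": "eng_Latn"}


def resolve_lang_code(name: str) -> str:
    """Map a friendly language name to the code the NLLB tokenizer expects."""
    try:
        return LANG_CODES[name.strip().lower()]
    except KeyError:
        raise ValueError(f"Unsupported language: {name!r}") from None


def translate(text, source_language, model_name="facebook/nllb-200-distilled-600M"):
    """Translate one string to English with an NLLB checkpoint."""
    # Imported lazily so the helper above works without the heavy dependency.
    from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(
        model_name, src_lang=resolve_lang_code(source_language)
    )
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(text, return_tensors="pt")
    output_ids = model.generate(
        **inputs,
        # Force the decoder to start with the English language token.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("eng_Latn"),
        max_length=256,  # assumed cap on output length
    )
    return tokenizer.batch_decode(output_ids, skip_special_tokens=True)[0]
```

Pointing `model_name` at a local directory such as `models/nllb-finetuned-nepali-en` should load a fine-tuned checkpoint the same way, provided it was saved with `save_pretrained`.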
## API Endpoints

The FastAPI application provides the following endpoints:

- **`GET /`**: Returns the frontend HTML page.
- **`GET /languages`**: Returns a list of supported languages.
- **`POST /translate`**: Translates a single text.
    - **Request Body:**
      ```json
      {
        "text": "string",
        "source_language": "string"
      }
      ```
    - **Response Body:**
      ```json
      {
        "original_text": "string",
        "translated_text": "string",
        "source_language": "string"
      }
      ```
- **`POST /batch-translate`**: Translates a batch of texts.
    - **Request Body:**
      ```json
      {
        "texts": [
          "string"
        ],
        "source_language": "string"
      }
      ```
    - **Response Body:**
      ```json
      {
        "original_texts": [
          "string"
        ],
        "translated_texts": [
          "string"
        ],
        "source_language": "string"
      }
      ```
- **`POST /translate-pdf`**: Translates a PDF file.
    - **Request:** `source_language: str`, `file: UploadFile`
    - **Response Body:**
      ```json
      {
        "filename": "string",
        "translated_text": "string",
        "source_language": "string"
      }
      ```
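A client can call `POST /translate` with nothing but the standard library. The sketch below builds the request from the schema above. The base URL assumes a local development server, and the `"nepali"` value used in the usage note is a placeholder, since the accepted `source_language` values are whatever `GET /languages` returns.

```python
import json
import urllib.request

BASE_URL = "http://127.0.0.1:8000"  # assumed local development server


def build_translate_request(text, source_language, base_url=BASE_URL):
    """Build the POST /translate request without sending it."""
    payload = json.dumps(
        {"text": text, "source_language": source_language}
    ).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/translate",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def translate_remote(text, source_language, base_url=BASE_URL, timeout=30):
    """Send the request and return the `translated_text` field of the response."""
    request = build_translate_request(text, source_language, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["translated_text"]
```

With the API running, `translate_remote("नमस्ते", "nepali")` would return the translated string. Keeping request construction separate from sending makes the payload easy to inspect or test without a live server.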
## Getting Started

### Prerequisites

- **Python 3.10 or higher:** Ensure you have a recent version of Python installed.
- **Git and Git LFS:** Git is required to clone the repository, and Git LFS is required to handle large model files.
- **(Optional) NVIDIA GPU with CUDA:** A GPU is highly recommended for training the model.
### Installation

1. **Clone the repository:**
   ```bash
   git clone <repository-url>
   cd saksi_translation
   ```
2. **Create and activate a virtual environment:**
   ```bash
   python -m venv .venv
   # On Windows
   .venv\Scripts\activate
   # On macOS/Linux
   source .venv/bin/activate
   ```
3. **Install dependencies:**
   ```bash
   pip install -r requirements.txt
   ```
## Usage

### Data Preparation

- **Fetch Parallel Data:**
  ```bash
  python scripts/fetch_parallel_data.py --output_dir data/raw
  ```
- **Clean Text Data:**
  ```bash
  python scripts/clean_text_data.py --input_dir data/raw --output_dir data/processed
  ```
### Training

- **Start Training:**
  ```bash
  python src/train.py \
      --model_name "facebook/nllb-200-distilled-600M" \
      --dataset_path "data/processed" \
      --output_dir "models/nllb-finetuned-nepali-en" \
      --learning_rate 2e-5 \
      --per_device_train_batch_size 8 \
      --num_train_epochs 3
  ```
### Evaluation

- **Evaluate the Model:**
  ```bash
  python src/evaluation.py \
      --model_path "models/nllb-finetuned-nepali-en" \
      --test_data_path "data/test_sets/test.ne" \
      --reference_data_path "data/test_sets/test.en"
  ```
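To make the metric concrete, the sketch below computes a simplified corpus-level BLEU in pure Python: clipped n-gram precision for n = 1 to 4, combined by geometric mean and multiplied by the brevity penalty. It assumes whitespace tokenization, one reference per hypothesis, and no smoothing, so its scores will not exactly match `src/evaluation.py` or standard tools such as sacreBLEU.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hypotheses, references, max_n=4):
    """Simplified corpus BLEU in the range 0..1 (single reference, no smoothing)."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_tok, ref_tok = hyp.split(), ref.split()
        hyp_len += len(hyp_tok)
        ref_len += len(ref_tok)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp_tok, n), ngrams(ref_tok, n)
            # Clip each hypothesis n-gram count by its count in the reference.
            matches[n - 1] += sum((hyp_counts & ref_counts).values())
            totals[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matches) == 0:
        return 0.0  # unsmoothed BLEU is zero when any n-gram precision is zero
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Penalize hypotheses that are shorter than the references overall.
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)
```

Counts are accumulated over the whole test set before the precisions are computed, which is why BLEU is reported per corpus and not averaged per sentence.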
### Interactive Translation

- **Run the interactive script:**
  ```bash
  python interactive_translate.py
  ```

### API

- **Run the API:**
  ```bash
  uvicorn fast_api:app --reload
  ```
  Open your browser and navigate to `http://127.0.0.1:8000` to use the web interface.
## Project Structure

```
saksi_translation/
├── .gitignore
├── fast_api.py                 # FastAPI application
├── interactive_translate.py    # Interactive translation script
├── README.md                   # Project documentation
├── requirements.txt            # Python dependencies
├── test_translation.py         # Script for testing the translation model
├── frontend/
│   ├── index.html              # Frontend HTML
│   ├── script.js               # Frontend JavaScript
│   └── styles.css              # Frontend CSS
├── data/
│   ├── processed/              # Processed data for training
│   ├── raw/                    # Raw data downloaded from the web
│   └── test_sets/              # Test sets for evaluation
├── mlruns/                     # MLflow experiment tracking data
├── models/
│   └── nllb-finetuned-nepali-en/   # Fine-tuned model
├── notebooks/                  # Jupyter notebooks for experimentation
├── scripts/
│   ├── clean_text_data.py
│   ├── create_test_set.py
│   ├── download_model.py
│   ├── fetch_parallel_data.py
│   └── scrape_bbc_nepali.py
└── src/
    ├── __init__.py
    ├── evaluation.py           # Script for evaluating the model
    ├── train.py                # Script for training the model
    └── translate.py            # Script for translating text
```
## Future Improvements

- **Support for more languages:** The project can be extended to more languages by adding parallel data for them and fine-tuning the model on it.
- **Improved Model:** Translation quality can be improved by using a larger NLLB variant or by fine-tuning on a larger, cleaner dataset.
- **Advanced Frontend:** The frontend can be extended with features such as translation history, user accounts, and more advanced styling.
- **Containerization:** The application can be containerized using Docker for easier deployment and scaling.