---
title: Vietnamese Sentiment Analysis
emoji: 🚀
colorFrom: green
colorTo: blue
sdk: gradio
sdk_version: 4.44.0
app_file: app.py
pinned: false
---
# Vietnamese Sentiment Analysis

A Vietnamese sentiment analysis web interface built with Gradio and transformer models, optimized for deployment on Hugging Face Spaces.
## Features

- **Transformer-based Model**: Uses 5CD-AI/Vietnamese-Sentiment-visobert from the Hugging Face Hub
- **Interactive Web Interface**: Real-time sentiment analysis via Gradio
- **Memory Efficient**: Built-in memory management and batch-size limits
- **Visual Analysis**: Confidence scores with interactive charts
- **Batch Processing**: Analyze multiple texts at once
- **Memory Management**: Real-time memory monitoring and cleanup
## Usage

### Single Text Analysis

1. Enter Vietnamese text in the input field
2. Click "Analyze Sentiment"
3. View the sentiment prediction with confidence scores
4. See the probability distribution in the chart
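Outside the web UI, the same single-text prediction can be sketched with the Transformers `pipeline` API. This is an illustrative snippet, not the Space's actual `app.py` code; only the model ID comes from this README:

```python
from transformers import pipeline

# Load the sentiment classifier directly from the Hugging Face Hub.
# device=-1 forces CPU; pass device=0 to use the first GPU instead.
classifier = pipeline(
    "text-classification",
    model="5CD-AI/Vietnamese-Sentiment-visobert",
    device=-1,
)

# "The lecturer teaches very well and with dedication."
result = classifier("Giảng viên dạy rất hay và tâm huyết.")[0]
print(result["label"], round(result["score"], 3))
```

The pipeline returns the top label with its confidence score, mirroring what the interface displays.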
### Batch Analysis

1. Switch to the "Batch Analysis" tab
2. Enter multiple Vietnamese texts (one per line)
3. Click "Analyze All" to process all texts
4. View a comprehensive batch summary with the sentiment distribution
### Memory Management

- Monitor real-time memory usage
- Use the "Memory Cleanup" button if needed
- Automatic cleanup runs after each prediction
- Batches are capped at 10 texts for efficiency
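The monitoring and cleanup steps above can be approximated with `psutil` and `gc` plus `torch.cuda.empty_cache()`. The function names here are illustrative, not the app's actual API:

```python
import gc

import psutil


def memory_usage_mb() -> float:
    """Resident memory of the current process, in MB."""
    return psutil.Process().memory_info().rss / (1024 ** 2)


def cleanup() -> None:
    """Free Python garbage and, when available, the CUDA cache."""
    gc.collect()
    try:
        import torch
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
    except ImportError:
        pass  # torch not installed; nothing GPU-side to clear


before = memory_usage_mb()
cleanup()
print(f"Memory: {before:.1f} MB -> {memory_usage_mb():.1f} MB")
```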
## Model Details

- **Base Model**: 5CD-AI/Vietnamese-Sentiment-visobert
- **Pre-trained Base**: 5CD-AI/visobert-14gb-corpus (continually pretrained on 14 GB of Vietnamese social-media text)
- **Architecture**: XLM-RoBERTa (transformer-based)
- **Language**: Vietnamese (optimized for social-media content)
- **Parameters**: 97.6M (float32 weights)
- **Labels**: Negative (0), Positive (1), Neutral (2)
- **Max Sequence Length**: 256 tokens (matching the original model)
- **File Format**: Safetensors
- **Task**: Text classification
- **Device**: Automatic CUDA/CPU detection
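The device detection and label mapping listed above fit together roughly as follows. The logits here are dummy values standing in for a real forward pass, so the snippet runs without downloading the model:

```python
import torch

# Automatic CUDA/CPU detection, as described above.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Label IDs from the model card: 0=Negative, 1=Positive, 2=Neutral.
id2label = {0: "Negative", 1: "Positive", 2: "Neutral"}

# Dummy logits standing in for the model output (batch of 1, 3 classes).
logits = torch.tensor([[-1.2, 2.5, 0.3]], device=device)
probs = torch.softmax(logits, dim=-1)  # confidence scores summing to 1

pred_id = int(probs.argmax(dim=-1).item())
print(id2label[pred_id], float(probs[0, pred_id]))
```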
### Model Performance

- **Benchmark Results**: Outperforms phobert-base on all reported Vietnamese sentiment benchmarks
- **F1 Scores**: Up to 99.64% on some datasets
- **Training Data**: ~120K Vietnamese sentiment samples
- **Evaluation Metric**: Weighted F1 score (wf1)
## Fine-Tuning Configuration

### Training Parameters (Based on 5CD-AI/Vietnamese-Sentiment-visobert)

- **Learning Rate**: 2e-5 (same as the original model)
- **Batch Size**: 16 (train and eval)
- **Training Epochs**: 5 (matching the original training run)
- **Weight Decay**: 0.01 (same as the original)
- **Seed**: 42 (for reproducibility, matching the original)
- **Gradient Accumulation**: 1 step
- **Optimizer**: AdamW (betas=(0.9, 0.999), epsilon=1e-08)
- **Max Sequence Length**: 256 tokens (matching the original model)
### Training Strategy

- **Evaluation Strategy**: Epoch-based evaluation
- **Save Strategy**: Save a checkpoint at each epoch
- **Best Model Selection**: Based on weighted F1 score (wf1)
- **Best-Model Loading**: The best checkpoint is restored at the end of training (`load_best_model_at_end`; no early stopping is applied)
- **Logging**: Every 10 steps
- **Checkpoint Limit**: Keep the last 2 checkpoints
- **Metric**: Weighted F1 score (matching the original evaluation)
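Taken together, the parameters and strategy above correspond roughly to this `TrainingArguments` setup. This is a sketch, not the train script's exact code, and some argument names (e.g. `eval_strategy`) vary across transformers versions:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./vietnamese_sentiment_finetuned",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=5,
    weight_decay=0.01,
    seed=42,
    gradient_accumulation_steps=1,
    eval_strategy="epoch",         # evaluate once per epoch
    save_strategy="epoch",         # checkpoint once per epoch
    load_best_model_at_end=True,   # restore the best checkpoint after training
    metric_for_best_model="wf1",   # weighted F1, as in the original evaluation
    logging_steps=10,
    save_total_limit=2,            # keep only the last 2 checkpoints
)
```

The AdamW betas and epsilon listed above are the transformers defaults, so they need no explicit arguments here.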
### Data Processing

- **Tokenization**: AutoTokenizer with truncation and padding
- **Max Length**: 256 tokens (matching the original model configuration)
- **Data Collator**: DataCollatorWithPadding for dynamic padding
- **Text Columns**: Auto-detection (sentence, text, comment, feedback)
- **Label Columns**: Auto-detection (sentiment, label, labels)
- **Label Mapping**: 0=Negative, 1=Positive, 2=Neutral (matching the original)
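The column auto-detection can be sketched as a first-match lookup over the candidate names listed above. The helper below is illustrative, not the app's exact implementation:

```python
TEXT_CANDIDATES = ("sentence", "text", "comment", "feedback")
LABEL_CANDIDATES = ("sentiment", "label", "labels")


def detect_column(columns, candidates):
    """Return the first candidate present in the dataset's columns, else None."""
    for name in candidates:
        if name in columns:
            return name
    return None


# Example: a dataset with UIT-VSFC-style columns.
cols = ["sentence", "sentiment", "topic"]
text_col = detect_column(cols, TEXT_CANDIDATES)
label_col = detect_column(cols, LABEL_CANDIDATES)
print(text_col, label_col)  # sentence sentiment
```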
## Dataset Information

### Original Model Training Datasets (120K samples)

The 5CD-AI/Vietnamese-Sentiment-visobert model was trained on a broad mix of Vietnamese sentiment datasets:

**Academic Datasets**:
- **SA-VLSP2016**: Sentiment analysis dataset from the VLSP 2016 shared task
- **AIVIVN-2019**: Dataset from the AIVIVN 2019 Vietnamese sentiment analysis competition
- **UIT-VSFC**: Vietnamese Students' Feedback Corpus (UIT)
- **UIT-VSMEC**: Vietnamese Social Media Emotion Corpus (re-labeled)
- **UIT-ViCTSD**: Vietnamese Constructive and Toxic Speech Detection dataset (re-labeled)
- **UIT-ViHSD**: Vietnamese Hate Speech Detection dataset
- **UIT-ViSFD**: Vietnamese Smartphone Feedback Dataset
- **UIT-ViOCD**: Vietnamese Open-domain Complaint Detection dataset
**E-commerce and Social Media Datasets**:

- **Tiki-reviews**: Reviews from the Tiki Vietnamese e-commerce platform
- **VOZ-HSD**: Hate speech dataset from the VOZ Vietnamese forum (re-labeled)
- **Vietnamese-amazon-polarity**: Amazon polarity reviews translated/adapted to Vietnamese

**Label Processing**:

- Some datasets were re-labeled using the Gemini 1.5 Flash API for consistency
- Final label mapping: 0=Negative, 1=Positive, 2=Neutral
### Primary Dataset (for fine-tuning)

- **Name**: uitnlp/vietnamese_students_feedback
- **Type**: Student feedback sentiment analysis
- **Language**: Vietnamese
- **Labels**: 3-way classification (Negative, Neutral, Positive)
- **Purpose**: Recommended for educational-domain fine-tuning

### Alternative Dataset (Fallback)

- **Name**: linhtranvi/5cdAI-Vietnamese-sentiment
- **Type**: General Vietnamese sentiment
- **Purpose**: Backup dataset if the primary one fails to load
### Sample Dataset (Built-in)

If the external datasets fail to load, the system creates a sample dataset with:

- **Total Samples**: 20 Vietnamese texts
- **Distribution**:
  - Positive: 8 samples
  - Negative: 6 samples
  - Neutral: 6 samples
- **Split**: 60% train, 20% validation, 20% test
- **Content**: Educational feedback and reviews
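The 60/20/20 split can be sketched as index arithmetic. This deterministic contiguous split is illustrative; the real pipeline may shuffle first:

```python
def split_indices(n, train_frac=0.6, val_frac=0.2):
    """Split n sample indices into train/validation/test partitions."""
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    idx = list(range(n))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


# 20 samples -> 12 train, 4 validation, 4 test.
train_idx, val_idx, test_idx = split_indices(20)
print(len(train_idx), len(val_idx), len(test_idx))  # 12 4 4
```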
### Sample Data Examples

```python
# Positive examples
"Giảng viên dạy rất hay và tâm huyết, tôi học được nhiều kiến thức bổ ích."
# ("The lecturer teaches very well and with dedication; I learned a lot of useful knowledge.")
"Môn học này rất thú vị và practical, giúp tôi áp dụng được vào thực tế."
# ("This course is very interesting and practical, helping me apply it in real life.")

# Negative examples
"Môn học quá khó và nhàm chán, không có gì để học cả."
# ("The course is too hard and boring; there is nothing to learn at all.")
"Giảng viên dạy không rõ ràng, tốc độ quá nhanh, không theo kịp."
# ("The lecturer's teaching is unclear and too fast; I cannot keep up.")

# Neutral examples
"Môn học ổn định, không có gì đặc biệt để nhận xét."
# ("The course is okay; nothing special to comment on.")
"Nội dung cơ bản, phù hợp với chương trình đề ra."
# ("Basic content, in line with the planned curriculum.")
```
## Model Performance & Evaluation

### Metrics Tracked

- **Accuracy**: Overall prediction accuracy
- **F1 Score**: Weighted F1 score (primary metric)
- **Precision**: Weighted precision
- **Recall**: Weighted recall
- **Training Loss**: Loss progression over epochs
- **Evaluation Loss**: Validation loss per epoch
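The weighted metrics above can be computed with scikit-learn in a `compute_metrics` function of the shape a Trainer expects. This is a sketch of the approach, not the project's exact code:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support


def compute_metrics(eval_pred):
    """Compute weighted accuracy/F1/precision/recall from (logits, labels)."""
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    precision, recall, f1, _ = precision_recall_fscore_support(
        labels, preds, average="weighted", zero_division=0
    )
    return {
        "accuracy": accuracy_score(labels, preds),
        "wf1": f1,
        "precision": precision,
        "recall": recall,
    }


# Tiny smoke test with fake 3-class logits where every prediction is correct.
logits = np.array([[2.0, 0.1, 0.1], [0.1, 2.0, 0.1], [0.1, 0.1, 2.0]])
labels = np.array([0, 1, 2])
print(compute_metrics((logits, labels)))  # all metrics equal 1.0
```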
### Evaluation Output

- **Classification Report**: Detailed per-class metrics
- **Confusion Matrix**: Saved as a PNG visualization
- **Training History**: Loss and F1 plots saved as PNG
- **Best Model**: Saved based on the highest weighted F1 score
### Expected Performance

- **Target F1 Score**: >0.90 on the validation set (the original model reaches up to 99.64% on some benchmarks)
- **Target Accuracy**: >0.90 on the validation set
- **Training Time**: ~15-30 minutes, depending on hardware
- **Memory Usage**: ~2-4 GB during training
- **Benchmark Performance**: The original model outperformed phobert-base on all reported Vietnamese sentiment benchmarks
- **Model Size**: 97.6M parameters, small enough for efficient deployment
## Example Usage

Try these example Vietnamese texts:

- "Giảng viên dạy rất hay và tâm huyết." (Positive; "The lecturer teaches very well and with dedication.")
- "Môn học này quá khó và nhàm chán." (Negative; "This course is too hard and boring.")
- "Lớp học ổn định, không có gì đặc biệt." (Neutral; "The class is okay, nothing special.")
## Technical Features

### Memory Optimization

- Automatic GPU cache clearing
- Garbage-collection management
- Memory usage monitoring
- Batch-size limits
- Real-time memory tracking
### Performance

- ~100 ms processing time per text
- Input is truncated to 256 tokens (the model's maximum sequence length)
- Efficient batch processing
- Memory limit: 8 GB (Hugging Face Spaces)
## Project Structure

```
SentimentAnalysis/
├── app.py                      # Main Hugging Face Spaces app
├── train.py                    # Training entry point
├── test.py                     # Testing entry point
├── demo.py                     # Demo entry point
├── web.py                      # Web interface entry point
├── main.py                     # Main program entry point
├── requirements.txt            # Python dependencies
├── requirements_spaces.txt     # Hugging Face Spaces dependencies
├── .space.yaml                 # Hugging Face Spaces configuration
├── .gitignore                  # Git ignore rules
├── README.md                   # This file
├── py/                         # Core Python modules
│   ├── fine_tune_sentiment.py  # Fine-tuning implementation
│   ├── test_model.py           # Model testing utilities
│   └── demo.py                 # Demo implementation
├── pdf/                        # Documentation
│   └── paper.tex               # LaTeX paper (only tracked file)
├── vietnamese_sentiment_finetuned/  # Fine-tuned model output (if trained)
├── training_history.png        # Training history plot
├── confusion_matrix.png        # Confusion matrix visualization
└── deploy_package/             # Deployment artifacts
```
## Model Training & Fine-Tuning

### How to Fine-Tune the Model

1. **Using the training script**:

```bash
python train.py
```

2. **Direct fine-tuning** (recommended; matches the original model config):

```python
from py.fine_tune_sentiment import SentimentFineTuner

# Initialize the fine-tuner with the original model
fine_tuner = SentimentFineTuner()

# Run the complete fine-tuning pipeline with the original parameters
fine_tuner.run_fine_tuning(
    output_dir="./vietnamese_sentiment_finetuned",
    learning_rate=2e-5,  # Same as the original model
    batch_size=16,       # Recommended batch size
    num_epochs=5,        # Same as the original model
)
```
3. **Custom configuration**:

```python
# Load model and tokenizer
fine_tuner.load_model_and_tokenizer()

# Load and prepare the dataset
fine_tuner.load_and_prepare_dataset()

# Tokenize the datasets
fine_tuner.tokenize_datasets()

# Set up custom training (matching the original optimizer config)
fine_tuner.setup_trainer(
    output_dir="./custom_model",
    learning_rate=2e-5,  # Original learning rate
    batch_size=16,       # Standard batch size
    num_epochs=5,        # Same as the original model
)

# Train and evaluate
fine_tuner.train_model()
eval_results, y_pred, y_true = fine_tuner.evaluate_model()
```
### Training Outputs

- **Model Files**: Saved to the specified output directory
- **Tokenizer**: Saved with the model configuration
- **Training History**: `training_history.png`
- **Confusion Matrix**: `confusion_matrix.png`
- **Logs**: Training logs in `{output_dir}/logs/`
### Fine-Tuning Features

- **Automatic Dataset Loading**: Supports multiple Vietnamese datasets
- **Flexible Column Detection**: Auto-detects text and label columns
- **Fallback Sample Dataset**: Built-in dataset used if external loading fails
- **Comprehensive Evaluation**: Multiple metrics and visualizations
- **Memory Efficient**: Optimized for limited resources
## Model Capabilities

The model provides:

- **Sentiment Classification**: Positive, Neutral, Negative
- **Confidence Scores**: Probability distribution across classes
- **Real-time Processing**: Fast inference on CPU or GPU
- **Batch Analysis**: Efficient processing of multiple texts
## Deployment

This Space is configured for Hugging Face Spaces with:

- **SDK**: Gradio 4.44.0
- **Hardware**: CPU (with CUDA support if available)
- **Memory**: 8 GB limit, with optimization
- **Model Loading**: Directly from the Hugging Face Hub
## Requirements

See `requirements.txt` for the complete dependency list.

### Core Dependencies

- **torch>=2.0.0**: PyTorch for deep learning
- **transformers>=4.21.0**: Hugging Face Transformers
- **gradio>=4.44.0**: Web interface framework
- **psutil**: System and process monitoring

### Fine-Tuning Dependencies

- **datasets**: Hugging Face Datasets for loading training data
- **scikit-learn**: Machine-learning metrics and evaluation
- **pandas**: Data manipulation and analysis
- **numpy**: Numerical computing
- **matplotlib**: Plotting and visualization
- **seaborn**: Statistical data visualization
- **tqdm**: Progress bars for training
### Installation

```bash
pip install -r requirements.txt
```

For fine-tuning specifically:

```bash
pip install torch transformers datasets scikit-learn pandas numpy matplotlib seaborn tqdm psutil gradio
```
## Use Cases

- **Education**: Analyze student feedback
- **Customer Service**: Analyze customer reviews
- **Social Media**: Monitor sentiment in posts
- **Research**: Vietnamese text analysis
- **Business**: Customer sentiment tracking
## Troubleshooting

### Memory Issues

- Use the "Memory Cleanup" button
- Reduce the batch size
- Refresh the page if needed

### Model Loading

- The model loads automatically from the Hugging Face Hub
- No local training is required
- Automatic fallback to CPU if no GPU is available

### Performance Tips

- Clear, grammatically correct Vietnamese text works best
- Longer texts (20-200 words) provide better context
- Use batch processing for multiple texts
## Citation

If you use this model or Space, please cite the UIT-VSFC corpus used for fine-tuning:

```bibtex
@InProceedings{8573337,
  author={Nguyen, Kiet Van and Nguyen, Vu Duc and Nguyen, Phu X. V. and Truong, Tham T. H. and Nguyen, Ngan Luu-Thuy},
  booktitle={2018 10th International Conference on Knowledge and Systems Engineering (KSE)},
  title={UIT-VSFC: Vietnamese Students' Feedback Corpus for Sentiment Analysis},
  year={2018},
  pages={19-24},
  doi={10.1109/KSE.2018.8573337}
}
```
## Contributing

Feel free to:

- Submit issues and feedback
- Suggest improvements
- Report bugs
- Request new features

## License

This Space uses open-source components under the MIT license.
---

**Try it now!** Enter some Vietnamese text above to see the sentiment analysis in action.