---
title: Multimodal Product Classification
emoji: 📈
colorFrom: purple
colorTo: yellow
sdk: gradio
sdk_version: 5.44.0
app_file: app.py
pinned: true
license: mit
short_description: Product classification using image and text
---
# 🛍️ Multimodal Product Classification with Gradio
## Table of Contents

1. [Project Description](#1-project-description)
2. [Methodology & Key Features](#2-methodology--key-features)
3. [Technology Stack](#3-technology-stack)
4. [Model Details](#4-model-details)
## 1. Project Description
This project implements a multimodal product classification system for Best Buy products. The core objective is to categorize products using both their text descriptions and images. The system was trained on a dataset of almost 50,000 items.
The entire system is deployed as a lightweight web application using Gradio. The app allows users to:
- Use both text and an image for the most accurate prediction.
- Run predictions using only text or only an image to understand the contribution of each data modality.
This project showcases the power of combining different data types to build a more robust and intelligent classification system.
- Check out the deployed app here: 👉️ Multimodal Product Classification App 👈️
- Check out the Jupyter Notebook for a detailed walkthrough of the project here: 👉️ Jupyter Notebook 👈️
## 2. Methodology & Key Features
**Core Task:** Multimodal product classification on a Best Buy dataset.

**Pipeline:**

- **Data:** A dataset of ~50,000 products, each with a text description and an image.
- **Feature Extraction:** Pre-trained models convert the raw text and image data into high-dimensional embedding vectors (see the sketch after this list).
- **Classification:** A custom-trained Multilayer Perceptron (MLP) model performs the final classification based on the embeddings.
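
The feature-extraction step can be sketched as follows. This is a minimal, illustrative version built around the encoders named later in this README; the `facebook/` checkpoint prefix and the choice of the pooled output are assumptions, not necessarily what the project's notebook does.

```python
# Minimal feature-extraction sketch (illustrative; not the project's exact code).
# The "facebook/" checkpoint prefix and the use of the pooled output are assumptions.
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer
from transformers import AutoImageProcessor, TFConvNextV2Model

text_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
image_processor = AutoImageProcessor.from_pretrained("facebook/convnextv2-tiny-22k-224")
# from_pt=True converts the PyTorch checkpoint in case no TF weights are published
image_encoder = TFConvNextV2Model.from_pretrained(
    "facebook/convnextv2-tiny-22k-224", from_pt=True
)

def embed_text(description: str) -> np.ndarray:
    # 384-dimensional sentence embedding
    return text_encoder.encode(description)

def embed_image(image: Image.Image) -> np.ndarray:
    # Resize/normalize, run the ConvNeXt V2 backbone, take the pooled output
    inputs = image_processor(images=image, return_tensors="tf")
    outputs = image_encoder(**inputs)
    return outputs.pooler_output.numpy().squeeze()
```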
**Key Features:**

- **Multimodal:** Combines text and image data for a more accurate prediction.
- **Single-Service Deployment:** The entire application runs as a single, deployable Gradio app.
- **Flexible Inputs:** The app supports multimodal, text-only, and image-only prediction modes (see the interface sketch below).
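
A minimal sketch of how such flexible inputs can be wired up in Gradio is shown below; `predict_category` is a hypothetical helper standing in for the real inference code, and the actual `app.py` may be structured differently.

```python
# Minimal Gradio sketch of the flexible-input idea (illustrative).
import gradio as gr

def classify(description, image):
    # Both inputs are optional; at least one modality must be provided.
    if not description and image is None:
        return "Please provide a description, an image, or both."
    return predict_category(description, image)  # hypothetical inference helper

demo = gr.Interface(
    fn=classify,
    inputs=[
        gr.Textbox(label="Product description (optional)"),
        gr.Image(label="Product image (optional)", type="pil"),
    ],
    outputs=gr.Label(label="Predicted category"),
    title="Multimodal Product Classification",
)

if __name__ == "__main__":
    demo.launch()
```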
## 3. Technology Stack
This project was built using the following technologies:
**Deployment & Hosting:**

- Gradio – interactive web app frontend.
- Hugging Face Spaces – for cost-effective deployment.

**Modeling & Training:**

- TensorFlow / Keras – used to train the final MLP classification model.
- Sentence-Transformers – for generating text embeddings.
- Hugging Face Transformers – for the image feature extractor (`TFConvNextV2Model`).

**Development Tools:**
## 4. Model Details
The final classification is performed by a custom-trained Multilayer Perceptron (MLP) model that takes the extracted embeddings as input.
- **Text Embedding Model:** `SentenceTransformer` (`all-MiniLM-L6-v2`)
- **Image Embedding Model:** `TFConvNextV2Model` (`convnextv2-tiny-22k-224`)
- **Classifier:** A custom MLP model trained on top of the embeddings (a sketch follows below).
- **Classes:** The model classifies products into a set of specific Best Buy product categories.
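
As a rough illustration, an MLP over the concatenated embeddings might look like the sketch below. The 384-d text and 768-d image dimensions follow from the encoders above, but the layer sizes, dropout, and number of classes here are placeholders, not the trained architecture.

```python
# Illustrative MLP classifier over concatenated text + image embeddings.
# Layer sizes, dropout, and NUM_CLASSES are placeholders, not the trained model.
import tensorflow as tf

NUM_CLASSES = 20  # placeholder: set to the number of Best Buy categories

def build_mlp(input_dim: int = 384 + 768) -> tf.keras.Model:
    return tf.keras.Sequential([
        tf.keras.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(512, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])

model = build_mlp()
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

The table below compares the trained models on the test set: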
| Model | Modality | Accuracy | Macro Avg F1-Score | Weighted Avg F1-Score |
|---|---|---|---|---|
| Random Forest | Text | 0.90 | 0.83 | 0.90 |
| Logistic Regression | Text | 0.90 | 0.84 | 0.90 |
| Random Forest | Image | 0.80 | 0.70 | 0.79 |
| Random Forest | Combined | 0.89 | 0.79 | 0.89 |
| Logistic Regression | Combined | 0.89 | 0.83 | 0.89 |
| MLP | Image | 0.84 | 0.77 | 0.84 |
| MLP | Text | 0.92 | 0.87 | 0.92 |
| MLP | Combined | 0.92 | 0.85 | 0.92 |
Based on the evaluation on the test set, the multimodal MLP achieved 92% accuracy and a 0.92 weighted F1-score, matching the best text-only MLP on these headline metrics and outperforming every classical baseline, which demonstrates the robustness gained by leveraging both text and image data.
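
Tying the pieces together, a hypothetical end-to-end multimodal prediction (reusing `embed_text`, `embed_image`, and the MLP `model` from the sketches above) could look like:

```python
import numpy as np

# Hypothetical end-to-end prediction; CATEGORIES is a placeholder for the
# list of Best Buy category labels used during training.
def predict_category(description, image):
    features = np.concatenate([embed_text(description), embed_image(image)])
    probs = model.predict(features[np.newaxis, :], verbose=0)[0]
    return CATEGORIES[int(np.argmax(probs))]
```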
