---
title: Multimodal Product Classification
emoji: 🛍️
colorFrom: purple
colorTo: yellow
sdk: gradio
sdk_version: 5.44.0
app_file: app.py
pinned: true
license: mit
short_description: Product classification using image and text
---
|
|
|
|
|
# 🛍️ Multimodal Product Classification with Gradio
|
|
|
|
|
## Table of Contents |
|
|
|
|
|
1. [Project Description](#1-project-description) |
|
|
2. [Methodology & Key Features](#2-methodology--key-features) |
|
|
3. [Technology Stack](#3-technology-stack) |
|
|
4. [Model Details](#4-model-details) |
|
|
|
|
|
## 1. Project Description |
|
|
|
|
|
This project implements a **multimodal product classification system** for Best Buy products. The core objective is to categorize products using both their text descriptions and images. The system was trained on a dataset of **almost 50,000** items. |
|
|
|
|
|
The entire system is deployed as a lightweight web application using **Gradio**. The app allows users to:
|
|
|
|
|
- Use both text and an image for the most accurate prediction. |
|
|
- Run predictions using only text or only an image to understand the contribution of each data modality. |
|
|
|
|
|
This project showcases the power of combining different data types to build a more robust and intelligent classification system. |
|
|
|
|
|
> [!IMPORTANT] |
|
|
> |
|
|
> - Check out the deployed app here: 🛍️ [Multimodal Product Classification App](https://huggingface.co/spaces/iBrokeTheCode/Multimodal_Product_Classification) 🛍️
|
|
> - Check out the Jupyter Notebook for a detailed walkthrough of the project here: 🛍️ [Jupyter Notebook](https://huggingface.co/spaces/iBrokeTheCode/Multimodal_Product_Classification/blob/main/notebook_guide.ipynb) 🛍️
|
|
|
|
|
 |
|
|
|
|
|
## 2. Methodology & Key Features |
|
|
|
|
|
- **Core Task:** Multimodal Product Classification on a Best Buy dataset. |
|
|
|
|
|
- **Pipeline:** |
|
|
|
|
|
- **Data:** A dataset of ~50,000 products, each with a text description and an image.
|
|
- **Feature Extraction:** Pre-trained models are used to convert raw text and image data into high-dimensional embedding vectors. |
|
|
- **Classification:** A custom-trained **Multilayer Perceptron (MLP)** model performs the final classification based on the embeddings (a minimal inference sketch follows the list below).
|
|
|
|
|
- **Key Features:** |
|
|
|
|
|
- **Multimodal:** Combines text and image data for a more accurate prediction. |
|
|
- **Single-Service Deployment:** The entire application runs as a single, deployable Gradio app. |
|
|
- **Flexible Inputs:** The app supports multimodal, text-only, and image-only prediction modes. |
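
As referenced above, here is a minimal sketch of the inference path: the pre-trained encoders produce text and image embeddings, which are concatenated and passed to the MLP. The file name `mlp_combined.keras` and the helper `predict_category` are illustrative assumptions; the Space's actual `app.py` may organize this differently:

```python
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer
from tensorflow import keras
from transformers import AutoImageProcessor, TFConvNextV2Model

# Text encoder: 384-dimensional sentence embeddings
text_encoder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# Image encoder: ConvNeXt V2 backbone; the pooled output serves as the embedding
processor = AutoImageProcessor.from_pretrained("facebook/convnextv2-tiny-22k-224")
image_encoder = TFConvNextV2Model.from_pretrained("facebook/convnextv2-tiny-22k-224")

# Custom MLP head trained on the concatenated embeddings
# ("mlp_combined.keras" is a hypothetical file name)
classifier = keras.models.load_model("mlp_combined.keras")

def predict_category(description: str, image: Image.Image) -> int:
    text_emb = text_encoder.encode([description])              # shape (1, 384)
    inputs = processor(images=image, return_tensors="tf")
    image_emb = image_encoder(**inputs).pooler_output.numpy()  # shape (1, 768)
    features = np.concatenate([text_emb, image_emb], axis=1)   # shape (1, 1152)
    probs = classifier.predict(features)
    return int(np.argmax(probs, axis=1)[0])
```

Because the encoders are only used for feature extraction, the text-only and image-only modes simply feed the corresponding single embedding to an MLP trained on that modality.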
|
|
|
|
|
## 3. Technology Stack |
|
|
|
|
|
This project was built using the following technologies: |
|
|
|
|
|
**Deployment & Hosting:** |
|
|
|
|
|
- [Gradio](https://gradio.app/) – interactive web app frontend (a minimal wiring sketch follows this list).
|
|
- [Hugging Face Spaces](https://huggingface.co/docs/hub/spaces) – for cost-effective deployment.
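
A minimal sketch of how the classifier can be wired into a Gradio `Interface`; the `classify` body and the category labels below are placeholders, not the Space's actual code:

```python
import gradio as gr

def classify(description, image):
    # Placeholder: in the real app this would call the multimodal
    # pipeline and return a {label: probability} mapping.
    return {"TV & Home Theater": 0.90, "Computers & Tablets": 0.10}

demo = gr.Interface(
    fn=classify,
    inputs=[gr.Textbox(label="Product description"), gr.Image(type="pil")],
    outputs=gr.Label(num_top_classes=3),
    title="Multimodal Product Classification",
)

if __name__ == "__main__":
    demo.launch()
```

Returning a dictionary of labels to probabilities lets `gr.Label` display the top predictions with confidence bars.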
|
|
|
|
|
**Modeling & Training:** |
|
|
|
|
|
- [TensorFlow / Keras](https://www.tensorflow.org/) – used to train the final MLP classification model.
|
|
- [Sentence-Transformers](https://www.sbert.net/) – for generating text embeddings.
|
|
- [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) – for the image feature extractor (`TFConvNextV2Model`).
|
|
|
|
|
**Development Tools:** |
|
|
|
|
|
- [Ruff](https://github.com/charliermarsh/ruff) – Python linter and formatter.
|
|
- [uv](https://github.com/astral-sh/uv) – fast Python package installer and resolver.
|
|
|
|
|
## 4. Model Details |
|
|
|
|
|
The final classification is performed by a custom-trained **Multilayer Perceptron (MLP)** model that takes the extracted embeddings as input. |
|
|
|
|
|
- **Text Embedding Model:** `SentenceTransformer` (`all-MiniLM-L6-v2`) |
|
|
- **Image Embedding Model:** `TFConvNextV2Model` (`convnextv2-tiny-22k-224`) |
|
|
- **Classifier:** A custom MLP model trained on top of the embeddings (an illustrative architecture sketch follows this list).
|
|
- **Classes:** The model classifies products into a set of specific Best Buy product categories. |
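
For illustration, an MLP head of this kind can be defined in Keras as follows. The layer widths, dropout rates, and class count are assumptions, not the trained model's actual hyperparameters:

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_mlp(input_dim: int, num_classes: int) -> keras.Model:
    # Simple MLP head over frozen embeddings; widths and dropout rates
    # here are illustrative, not the trained model's actual values.
    model = keras.Sequential([
        keras.Input(shape=(input_dim,)),
        layers.Dense(512, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(256, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(
        optimizer="adam",
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Combined mode: text (384) + image (768) embeddings concatenated
mlp = build_mlp(input_dim=384 + 768, num_classes=20)  # class count is a placeholder
```

Since the encoders stay frozen, only this small head needs training, which keeps the pipeline fast to train and cheap to serve.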
|
|
|
|
|
| Model | Modality | Accuracy | Macro Avg F1-Score | Weighted Avg F1-Score | |
|
|
| :------------------ | :----------- | :------- | :----------------- | :-------------------- | |
|
|
| Random Forest | Text | 0.90 | 0.83 | 0.90 | |
|
|
| Logistic Regression | Text | 0.90 | 0.84 | 0.90 | |
|
|
| Random Forest | Image | 0.80 | 0.70 | 0.79 | |
|
|
| Random Forest | Combined | 0.89 | 0.79 | 0.89 | |
|
|
| Logistic Regression | Combined | 0.89 | 0.83 | 0.89 | |
|
|
| **MLP** | **Image** | **0.84** | **0.77** | **0.84** | |
|
|
| **MLP** | **Text** | **0.92** | **0.87** | **0.92** | |
|
|
| **MLP** | **Combined** | **0.92** | **0.85** | **0.92** | |
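
These columns correspond to scikit-learn's standard classification report. A quick, illustrative way to reproduce metrics in this form (with dummy labels standing in for the real test data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, classification_report

# Illustrative only: replace with the real test labels and predictions.
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0])

print("accuracy:", accuracy_score(y_true, y_pred))
# The report includes per-class F1 plus the macro and weighted
# averages shown in the table above.
print(classification_report(y_true, y_pred))
```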
|
|
|
|
|
> [!TIP] |
|
|
> |
|
|
> On the test set, the multimodal MLP achieved **92% accuracy** and a **0.92 weighted F1-score**, tying the text-only MLP for the best overall results and clearly outperforming every image-only and classical baseline.
|
|
|