---
title: Multimodal Product Classification
emoji: 🛍️
colorFrom: purple
colorTo: yellow
sdk: gradio
sdk_version: 5.44.0
app_file: app.py
pinned: true
license: mit
short_description: Product classification using image and text
---
# 🛍️ Multimodal Product Classification with Gradio
## Table of Contents
1. [Project Description](#1-project-description)
2. [Methodology & Key Features](#2-methodology--key-features)
3. [Technology Stack](#3-technology-stack)
4. [Model Details](#4-model-details)
## 1. Project Description
This project implements a **multimodal product classification system** for Best Buy products. The core objective is to categorize products using both their text descriptions and images. The system was trained on a dataset of nearly **50,000** items.
The entire system is deployed as a lightweight web application using **Gradio**. The app allows users to:
- Use both text and an image for the most accurate prediction.
- Run predictions using only text or only an image to understand the contribution of each data modality.
This project showcases the power of combining different data types to build a more robust and intelligent classification system.
> [!IMPORTANT]
>
> - Check out the deployed app here: 🛍️ [Multimodal Product Classification App](https://huggingface.co/spaces/iBrokeTheCode/Multimodal_Product_Classification) 🛍️
> - Check out the Jupyter Notebook for a detailed walkthrough of the project here: 🛍️ [Jupyter Notebook](https://huggingface.co/spaces/iBrokeTheCode/Multimodal_Product_Classification/blob/main/notebook_guide.ipynb) 🛍️

## 2. Methodology & Key Features
- **Core Task:** Multimodal Product Classification on a Best Buy dataset.
- **Pipeline:**
  - **Data:** A dataset of ~50,000 products, each with a text description and an image.
  - **Feature Extraction:** Pre-trained models convert raw text and image data into high-dimensional embedding vectors (see the sketch after this list).
  - **Classification:** A custom-trained **Multilayer Perceptron (MLP)** model performs the final classification based on the embeddings.
- **Key Features:**
  - **Multimodal:** Combines text and image data for a more accurate prediction.
  - **Single-Service Deployment:** The entire application runs as a single, deployable Gradio app.
  - **Flexible Inputs:** The app supports multimodal, text-only, and image-only prediction modes.
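
The feature-extraction step could look like the following minimal sketch. The model names come from the [Model Details](#4-model-details) section; the `facebook/` checkpoint prefix, the preprocessing, and the use of the pooled output are assumptions rather than the project's confirmed configuration:

```python
# Minimal feature-extraction sketch. Model names follow the "Model Details"
# section; the "facebook/" hub prefix and the pooling choice are assumptions.
import numpy as np
from PIL import Image
from sentence_transformers import SentenceTransformer
from transformers import AutoImageProcessor, TFConvNextV2Model

text_encoder = SentenceTransformer("all-MiniLM-L6-v2")
image_processor = AutoImageProcessor.from_pretrained("facebook/convnextv2-tiny-22k-224")
image_encoder = TFConvNextV2Model.from_pretrained("facebook/convnextv2-tiny-22k-224")


def embed_text(description: str) -> np.ndarray:
    # all-MiniLM-L6-v2 produces a 384-dimensional sentence embedding.
    return text_encoder.encode(description)


def embed_image(path: str) -> np.ndarray:
    # ConvNeXt V2 Tiny yields a 768-dimensional pooled image embedding.
    image = Image.open(path).convert("RGB")
    inputs = image_processor(images=image, return_tensors="tf")
    return image_encoder(**inputs).pooler_output.numpy().squeeze()
```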
## 3. Technology Stack
This project was built using the following technologies:
**Deployment & Hosting:**
- [Gradio](https://gradio.app/) – interactive web app frontend (wiring sketched at the end of this section).
- [Hugging Face Spaces](https://huggingface.co/docs/hub/spaces) – for cost-effective deployment.
**Modeling & Training:**
- [TensorFlow / Keras](https://www.tensorflow.org/) – used to train the final MLP classification model.
- [Sentence-Transformers](https://www.sbert.net/) – for generating text embeddings.
- [Hugging Face Transformers](https://huggingface.co/docs/transformers/index) – for the image feature extractor (`TFConvNextV2Model`).
**Development Tools:**
- [Ruff](https://github.com/astral-sh/ruff) – Python linter and formatter.
- [uv](https://github.com/astral-sh/uv) – fast Python package installer and resolver.
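
As a rough illustration of how the flexible input modes could be wired in Gradio (the actual `app.py` may differ; `classify` is a hypothetical stand-in for the embedding + MLP pipeline):

```python
# Hypothetical sketch of the Gradio wiring; the real app.py may differ.
import gradio as gr


def classify(description, image):
    # Placeholder: embed the available modalities and run the MLP here.
    return {"Example Category": 1.0}


def predict(description, image):
    # Support multimodal, text-only, and image-only modes by tolerating
    # whichever input is missing.
    if not description and image is None:
        raise gr.Error("Provide a description, an image, or both.")
    return classify(description or None, image)


demo = gr.Interface(
    fn=predict,
    inputs=[
        gr.Textbox(label="Product description"),
        gr.Image(type="pil", label="Product image"),
    ],
    outputs=gr.Label(label="Predicted category"),
    title="Multimodal Product Classification",
)

if __name__ == "__main__":
    demo.launch()
```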
## 4. Model Details
The final classification is performed by a custom-trained **Multilayer Perceptron (MLP)** model that takes the extracted embeddings as input.
- **Text Embedding Model:** `SentenceTransformer` (`all-MiniLM-L6-v2`)
- **Image Embedding Model:** `TFConvNextV2Model` (`convnextv2-tiny-22k-224`)
- **Classifier:** A custom MLP model trained on top of the concatenated embeddings (sketched below).
- **Classes:** The model classifies products into a set of specific Best Buy product categories.
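
A minimal Keras sketch of such a classifier is shown below; the hidden-layer sizes, dropout rate, and `NUM_CLASSES` are illustrative placeholders rather than the trained configuration (only the 384/768 embedding widths follow from the encoders above):

```python
# Illustrative MLP over concatenated text + image embeddings. Layer sizes,
# dropout, and NUM_CLASSES are placeholders, not the trained hyperparameters.
import tensorflow as tf

TEXT_DIM, IMAGE_DIM, NUM_CLASSES = 384, 768, 20  # NUM_CLASSES is hypothetical

text_in = tf.keras.Input(shape=(TEXT_DIM,), name="text_embedding")
image_in = tf.keras.Input(shape=(IMAGE_DIM,), name="image_embedding")
x = tf.keras.layers.Concatenate()([text_in, image_in])
x = tf.keras.layers.Dense(512, activation="relu")(x)
x = tf.keras.layers.Dropout(0.3)(x)
x = tf.keras.layers.Dense(256, activation="relu")(x)
out = tf.keras.layers.Dense(NUM_CLASSES, activation="softmax")(x)

mlp = tf.keras.Model(inputs=[text_in, image_in], outputs=out)
mlp.compile(optimizer="adam",
            loss="sparse_categorical_crossentropy",
            metrics=["accuracy"])
```

The table below compares the MLP against classical baselines across modalities: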
| Model | Modality | Accuracy | Macro Avg F1-Score | Weighted Avg F1-Score |
| :------------------ | :----------- | :------- | :----------------- | :-------------------- |
| Random Forest | Text | 0.90 | 0.83 | 0.90 |
| Logistic Regression | Text | 0.90 | 0.84 | 0.90 |
| Random Forest | Image | 0.80 | 0.70 | 0.79 |
| Random Forest | Combined | 0.89 | 0.79 | 0.89 |
| Logistic Regression | Combined | 0.89 | 0.83 | 0.89 |
| **MLP** | **Image** | **0.84** | **0.77** | **0.84** |
| **MLP** | **Text** | **0.92** | **0.87** | **0.92** |
| **MLP** | **Combined** | **0.92** | **0.85** | **0.92** |
> [!TIP]
>
> Based on the test-set evaluation, the multimodal MLP model achieved an excellent **92% accuracy** and a **92% weighted F1-score**, confirming the benefit of leveraging both text and image data.
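
The accuracy, macro-average, and weighted-average F1 figures above match the standard summary rows of scikit-learn's `classification_report`; assuming that is how they were produced, they can be reproduced like so:

```python
# Reproducing table-style metrics with scikit-learn (an assumption; the
# source does not state which tooling produced the reported numbers).
from sklearn.metrics import classification_report

y_true = ["TV", "Laptop", "TV", "Camera"]      # placeholder ground truth
y_pred = ["TV", "Laptop", "Camera", "Camera"]  # placeholder predictions
# Prints per-class precision/recall/F1 plus the accuracy, macro avg,
# and weighted avg rows used in the table above.
print(classification_report(y_true, y_pred))
```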