Aswin Raj R committed on
Commit
e51cd56
·
1 Parent(s): 2280846

Deploy multimodal search engine

Browse files
Files changed (3)
  1. README.md +45 -6
  2. app.py +380 -0
  3. requirements.txt +20 -0
README.md CHANGED
@@ -1,14 +1,53 @@
  ---
- title: Multimodal Ai Search Engine
- emoji: πŸƒ
  colorFrom: blue
- colorTo: pink
  sdk: gradio
- sdk_version: 5.42.0
  app_file: app.py
  pinned: false
  license: mit
- short_description: Advanced multimodal image search engine using CLIP and FAISS
  ---

- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

  ---
+ title: Multimodal AI Search Engine
+ emoji: πŸ”
  colorFrom: blue
+ colorTo: purple
  sdk: gradio
+ sdk_version: 4.0.0
  app_file: app.py
  pinned: false
  license: mit
  ---

+ # πŸ” Multimodal AI Search Engine
+
+ A sophisticated image search engine that enables both text-to-image and image-to-image similarity search using state-of-the-art deep learning models.
+
+ ## 🌟 Features
+
+ - **πŸ”€ Text-to-Image Search**: Find images using natural language descriptions
+ - **πŸ–ΌοΈ Image-to-Image Search**: Upload an image to find visually similar ones
+ - **⚑ Fast Search**: Sub-second query response times using FAISS indexing
+ - **🎯 High Accuracy**: Powered by OpenAI's CLIP-ViT-B-32 model
+ - **🎨 Modern UI**: Clean, responsive Gradio interface
+
+ ## πŸš€ How It Works
+
+ 1. **First Visit**: The app automatically downloads 500 images from the Caltech101 dataset
+ 2. **Embedding Generation**: Creates CLIP embeddings for all images using the ViT-B-32 model
+ 3. **Index Building**: Builds a FAISS index for fast similarity search
+ 4. **Ready to Search**: Use text descriptions or upload images to find similar content (see the sketch below)
+
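+ Under the hood, a text query is answered roughly as follows. This is a minimal sketch that reuses the index and filename files written during first-time setup; the query string is only an example:
+
+ ```python
+ import faiss
+ import numpy as np
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("clip-ViT-B-32")
+ index = faiss.read_index("dataset/faiss_index.bin")
+ filenames = np.load("dataset/image_filenames.npy")
+
+ # Encode the query into CLIP's shared text/image embedding space (L2-normalized).
+ query_emb = model.encode(["a red car on the road"], convert_to_numpy=True, normalize_embeddings=True)
+
+ # Inner-product search over normalized vectors is cosine similarity.
+ scores, idxs = index.search(query_emb, 5)
+ for score, idx in zip(scores[0], idxs[0]):
+     print(filenames[idx], float(score))
+ ```
+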
+ ## πŸ”§ Technology Stack
+
+ - **CLIP-ViT-B-32**: OpenAI's vision-language model
+ - **FAISS**: Facebook's similarity search library
+ - **Gradio**: Interactive web interface
+ - **Caltech101**: 500 diverse images across 101 categories
+
+ ## πŸ“Š Dataset
+
+ - **Source**: Caltech101 via HuggingFace
+ - **Size**: 500 randomly sampled images
+ - **Categories**: 101 different object classes
+ - **Auto-Setup**: Downloads and processes on first run
+
+ ## πŸ’‘ Usage Tips
+
+ - **Text Search**: Use descriptive phrases like "red car on road" or "cat sitting"
+ - **Image Search**: Upload any image to find visually similar ones
+ - **Results**: Adjust the number of results using the slider (1-20)
+ - **First Load**: May take 5-10 minutes to set up the dataset initially
+
+ *Note: First-time setup may take several minutes as the app downloads and processes the image dataset.*
app.py ADDED
@@ -0,0 +1,380 @@
+ import gradio as gr
+ import numpy as np
+ import faiss
+ from sentence_transformers import SentenceTransformer
+ import torch
+ from PIL import Image
+ import os
+ from typing import List, Tuple, Optional
+ import time
+
+ # ============= DATASET SETUP FUNCTION =============
+ def setup_dataset():
+     """Download and prepare the dataset if it does not already exist."""
+     if not os.path.exists("dataset/images"):
+         print("πŸ“₯ First-time setup: downloading dataset...")
+
+         # Import required modules for setup
+         from datasets import load_dataset
+         from tqdm import tqdm
+
+         # Create directories
+         os.makedirs("dataset/images", exist_ok=True)
+
+         # 1. Download images (from download_images_hf.py)
+         print("πŸ“₯ Loading Caltech101 dataset...")
+         dataset = load_dataset("flwrlabs/caltech101", split="train")
+         dataset = dataset.shuffle(seed=42).select(range(min(500, len(dataset))))
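+         # The fixed seed keeps this 500-image sample reproducible across rebuilds.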
+
+         print(f"πŸ’Ύ Saving {len(dataset)} images locally...")
+         for i, item in enumerate(tqdm(dataset)):
+             img = item["image"]
+             label = item["label"]
+             label_name = dataset.features["label"].int2str(label)
+             fname = f"{i:05d}_{label_name}.jpg"
+             img.save(os.path.join("dataset/images", fname))
+
+         # 2. Generate embeddings (from embed_images_clip.py)
+         print("πŸ” Generating image embeddings...")
+         device = "cuda" if torch.cuda.is_available() else "cpu"
+         model = SentenceTransformer("clip-ViT-B-32", device=device)
+
+         image_files = [f for f in os.listdir("dataset/images") if f.lower().endswith((".jpg", ".png"))]
+         embeddings = []
+
+         for fname in tqdm(image_files, desc="Encoding images"):
+             img_path = os.path.join("dataset/images", fname)
+             img = Image.open(img_path).convert("RGB")
+             emb = model.encode(img, convert_to_numpy=True, show_progress_bar=False, normalize_embeddings=True)
+             embeddings.append(emb)
+
+         embeddings = np.array(embeddings, dtype="float32")
+         np.save("dataset/image_embeddings.npy", embeddings)
+         np.save("dataset/image_filenames.npy", np.array(image_files))
+
+         # 3. Build FAISS index (from build_faiss_index.py)
+         print("πŸ“¦ Building FAISS index...")
+         dim = embeddings.shape[1]
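+         # With the L2-normalized embeddings saved above, inner-product search is equivalent to cosine similarity.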
+         index = faiss.IndexFlatIP(dim)
+         index.add(embeddings)
+         faiss.write_index(index, "dataset/faiss_index.bin")
+
+         print("βœ… Dataset setup complete!")
+     else:
+         print("βœ… Dataset found, ready to go!")
+
+ # Call setup before anything else
+ setup_dataset()
+
+ # Configuration
+ META_PATH = "dataset/image_filenames.npy"
+ INDEX_PATH = "dataset/faiss_index.bin"
+ IMG_DIR = "dataset/images"
+
+ class MultimodalSearchEngine:
+     def __init__(self):
+         """Initialize the search engine with pre-built index and model."""
+         self.device = "cuda" if torch.cuda.is_available() else "cpu"
+         print(f"πŸ” Using device: {self.device}")
+
+         # Load pre-built index and metadata
+         self.index = faiss.read_index(INDEX_PATH)
+         self.image_files = np.load(META_PATH)
+
+         # Load CLIP model
+         self.model = SentenceTransformer("clip-ViT-B-32", device=self.device)
+
+         print(f"βœ… Loaded index with {self.index.ntotal} images")
+
+     def search_by_text(self, query: str, k: int = 5) -> List[Tuple[str, float, float]]:
+         """Search for images matching a text query."""
+         if not query.strip():
+             return []
+
+         start_time = time.time()
+         query_emb = self.model.encode([query], convert_to_numpy=True, normalize_embeddings=True)
+         scores, idxs = self.index.search(query_emb, k)
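+         # scores holds cosine similarities (higher = more similar); idxs are row indices into self.image_files.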
+         search_time = time.time() - start_time
+
+         results = []
+         for j, i in enumerate(idxs[0]):
+             if i != -1:  # Valid index
+                 img_path = os.path.join(IMG_DIR, self.image_files[i])
+                 results.append((img_path, float(scores[0][j]), search_time))
+
+         return results
+
+     def search_by_image(self, image: Image.Image, k: int = 5) -> List[Tuple[str, float, float]]:
+         """Search for images visually similar to the given image."""
+         if image is None:
+             return []
+
+         start_time = time.time()
+         # Convert to RGB if necessary
+         if image.mode != 'RGB':
+             image = image.convert('RGB')
+
+         query_emb = self.model.encode(image, convert_to_numpy=True, normalize_embeddings=True)
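+         # encode() on a single PIL image returns a 1-D vector; FAISS expects a 2-D batch, hence the expand_dims below.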
+         query_emb = np.expand_dims(query_emb, axis=0)
+         scores, idxs = self.index.search(query_emb, k)
+         search_time = time.time() - start_time
+
+         results = []
+         for j, i in enumerate(idxs[0]):
+             if i != -1:  # Valid index
+                 img_path = os.path.join(IMG_DIR, self.image_files[i])
+                 results.append((img_path, float(scores[0][j]), search_time))
+
+         return results
+
+ # Initialize the search engine
+ try:
+     search_engine = MultimodalSearchEngine()
+     ENGINE_LOADED = True
+ except Exception as e:
+     print(f"❌ Error loading search engine: {e}")
+     ENGINE_LOADED = False
+
+ def format_results(results: List[Tuple[str, float, float]]) -> Tuple[List[str], str]:
+     """Format search results for Gradio display."""
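+     # Each result is an (image_path, similarity_score, search_time) tuple produced by the search methods above.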
+     if not results:
+         return [], "No results found."
+
+     image_paths = [result[0] for result in results]
+     search_time = results[0][2] if results else 0
+
+     # Create detailed results text
+     results_text = f"πŸ” **Search Results** (Search time: {search_time:.3f}s)\n\n"
+     for i, (path, score, _) in enumerate(results, 1):
+         filename = os.path.basename(path)
+         # Extract label from filename (format: 00000_label.jpg)
+         label = filename.split('_', 1)[1].rsplit('.', 1)[0] if '_' in filename else 'unknown'
+         results_text += f"**{i}.** {label} (similarity: {score:.3f})\n"
+
+     return image_paths, results_text
+
+ def text_search_interface(query: str, num_results: int) -> Tuple[List[str], str]:
+     """Interface function for text-based search."""
+     if not ENGINE_LOADED:
+         return [], "❌ Search engine not loaded. Please check if all files are available."
+
+     if not query.strip():
+         return [], "Please enter a search query."
+
+     try:
+         results = search_engine.search_by_text(query, k=num_results)
+         return format_results(results)
+     except Exception as e:
+         return [], f"❌ Error during search: {str(e)}"
+
+ def image_search_interface(image: Image.Image, num_results: int) -> Tuple[List[str], str]:
+     """Interface function for image-based search."""
+     if not ENGINE_LOADED:
+         return [], "❌ Search engine not loaded. Please check if all files are available."
+
+     if image is None:
+         return [], "Please upload an image."
+
+     try:
+         results = search_engine.search_by_image(image, k=num_results)
+         return format_results(results)
+     except Exception as e:
+         return [], f"❌ Error during search: {str(e)}"
+
+ def get_random_examples() -> List[str]:
+     """Get random example queries."""
+     examples = [
+         "a cat sitting on a chair",
+         "airplane in the sky",
+         "red car on the road",
+         "person playing guitar",
+         "dog running in the park",
+         "beautiful sunset landscape",
+         "computer on a desk",
+         "flowers in a garden"
+     ]
+     return examples
+
+ # Create the Gradio interface
+ with gr.Blocks(
+     title="πŸ” Multimodal AI Search Engine",
+     theme=gr.themes.Soft(),
+     css="""
+     .gradio-container {
+         max-width: 1200px !important;
+     }
+     .gallery img {
+         border-radius: 8px;
+     }
+     """
+ ) as demo:
+
+     gr.HTML("""
+     <div style="text-align: center; margin-bottom: 30px;">
+         <h1>πŸ” Multimodal AI Search Engine</h1>
+         <p style="font-size: 18px; color: #666;">
+             Search through 500 Caltech101 images using text descriptions or image similarity
+         </p>
+         <p style="font-size: 14px; color: #888;">
+             Powered by CLIP-ViT-B-32 and FAISS for fast similarity search
+         </p>
+     </div>
+     """)
+
+     with gr.Tabs() as tabs:
+
+         # Text Search Tab
+         with gr.Tab("πŸ“ Text Search", id="text_search"):
+             gr.Markdown("### Search images using natural language descriptions")
+
+             with gr.Row():
+                 with gr.Column(scale=2):
+                     text_query = gr.Textbox(
+                         label="Search Query",
+                         placeholder="Describe what you're looking for (e.g., 'a red car', 'person with guitar')",
+                         lines=2
+                     )
+
+                 with gr.Column(scale=1):
+                     text_num_results = gr.Slider(
+                         minimum=1, maximum=20, value=5, step=1,
+                         label="Number of Results"
+                     )
+
+             text_search_btn = gr.Button("πŸ” Search", variant="primary", size="lg")
+
+             # Examples
+             gr.Examples(
+                 examples=get_random_examples()[:4],
+                 inputs=text_query,
+                 label="Example Queries"
+             )
+
+             with gr.Row():
+                 text_results = gr.Gallery(
+                     label="Search Results",
+                     show_label=True,
+                     elem_id="text_gallery",
+                     columns=5,
+                     rows=1,
+                     height="auto",
+                     object_fit="contain"
+                 )
+             text_info = gr.Markdown(label="Details")
+
+         # Image Search Tab
+         with gr.Tab("πŸ–ΌοΈ Image Search", id="image_search"):
+             gr.Markdown("### Find visually similar images")
+
+             with gr.Row():
+                 with gr.Column(scale=2):
+                     image_query = gr.Image(
+                         label="Upload Query Image",
+                         type="pil",
+                         height=300
+                     )
+
+                 with gr.Column(scale=1):
+                     image_num_results = gr.Slider(
+                         minimum=1, maximum=20, value=5, step=1,
+                         label="Number of Results"
+                     )
+
+             image_search_btn = gr.Button("πŸ” Search Similar", variant="primary", size="lg")
+
+             with gr.Row():
+                 image_results = gr.Gallery(
+                     label="Similar Images",
+                     show_label=True,
+                     elem_id="image_gallery",
+                     columns=5,
+                     rows=1,
+                     height="auto",
+                     object_fit="contain"
+                 )
+             image_info = gr.Markdown(label="Details")
+
+         # About Tab
+         with gr.Tab("ℹ️ About", id="about"):
+             gr.Markdown("""
+             ### πŸ”¬ Technical Details
+
+             This multimodal search engine demonstrates advanced AI techniques for content-based image retrieval:
+
+             **🧠 Model Architecture:**
+             - **CLIP-ViT-B-32**: OpenAI's Contrastive Language-Image Pre-training model
+             - **Vision Transformer**: Processes images using attention mechanisms
+             - **Dual-encoder**: Separate encoders for text and images mapping to shared embedding space
+
+             **⚑ Search Infrastructure:**
+             - **FAISS**: Facebook AI Similarity Search for efficient vector operations
+             - **Cosine Similarity**: Measures semantic similarity in embedding space
+             - **Inner Product Index**: Optimized for normalized embeddings
+
+             **πŸ“Š Dataset:**
+             - **Caltech101**: 500 randomly sampled images from 101 object categories
+             - **Preprocessing**: RGB conversion, CLIP-compatible normalization
+             - **Embeddings**: 512-dimensional feature vectors per image
+
+             **πŸš€ Performance Features:**
+             - **GPU Acceleration**: CUDA support for faster inference
+             - **Batch Processing**: Efficient embedding computation
+             - **Real-time Search**: Sub-second query response times
+             - **Normalized Embeddings**: L2 normalization for consistent similarity scores
+
+             **🎯 Applications:**
+             - Content-based image retrieval
+             - Visual search engines
+             - Cross-modal similarity matching
+             - Dataset exploration and analysis
+
+             ### πŸ› οΈ Implementation Highlights
+             - Modular architecture with separate indexing and search components
+             - Error handling and graceful degradation
+             - Configurable result counts and similarity thresholds
+             - Professional UI with responsive design
+             """)
+
+     # Event handlers
+     text_search_btn.click(
+         fn=text_search_interface,
+         inputs=[text_query, text_num_results],
+         outputs=[text_results, text_info]
+     )
+
+     image_search_btn.click(
+         fn=image_search_interface,
+         inputs=[image_query, image_num_results],
+         outputs=[image_results, image_info]
+     )
+
+     # Auto-search on Enter key for text
+     text_query.submit(
+         fn=text_search_interface,
+         inputs=[text_query, text_num_results],
+         outputs=[text_results, text_info]
+     )
+
+ # Launch configuration
+ if __name__ == "__main__":
+     print("\n" + "="*50)
+     print("πŸš€ Starting Multimodal AI Search Engine")
+     print("="*50)
+
+     if ENGINE_LOADED:
+         print(f"βœ… Search engine ready with {search_engine.index.ntotal} images")
+         print(f"βœ… Using device: {search_engine.device}")
+     else:
+         print("❌ Search engine failed to load")
+
+     print("\nπŸ’‘ Usage Tips:")
+     print("- Text search: Use natural language descriptions")
+     print("- Image search: Upload any image to find similar ones")
+     print("- Adjust result count using the slider")
+
+     demo.launch(
+         server_name="0.0.0.0",
+         server_port=7860,
+         share=False,
+         show_error=True
+     )
requirements.txt ADDED
@@ -0,0 +1,20 @@
+ # Core ML dependencies
+ torch>=1.11.0
+ torchvision>=0.12.0
+ sentence-transformers>=2.2.0
+ faiss-cpu>=1.7.0
+
+ # Data processing
+ numpy>=1.21.0
+ Pillow>=9.0.0
+ datasets>=2.0.0
+
+ # UI and visualization
+ gradio>=4.0.0
+
+ # Utilities
+ tqdm>=4.64.0
+ requests>=2.28.0
+
+ # HuggingFace specific
+ huggingface_hub>=0.16.0