matinsn2000 committed
Commit d667f1f · 1 Parent(s): 1cb8b50

Used a better model for text embedding

AI_USAGE_REPORT.txt CHANGED
@@ -18,8 +18,8 @@ WHERE & HOW AI WAS USED:
   - Function: Generate images from text prompts

  3. Semantic Search (cloudzy/search_engine.py + cloudzy/routes/search.py)
- - Tool: FAISS (vector database) with embeddings
- - Function: Find visually similar photos via embedding vectors
+ - Tool: FAISS (vector database) with embeddings from Qwen/Qwen3-Embedding-8B (4096-dimensional)
+ - Function: Find visually similar photos via L2-normalized embedding vectors

  PROMPTS & MODEL INPUTS:
  Image Analysis Prompt #1 - Structured Metadata (image_analyzer.py):
@@ -39,8 +39,10 @@ Search Queries:
  - Album creation: Groups similar photos by distance threshold (randomized each call)

  MODEL OUTPUTS REFINED:
- ✓ JSON parsing: Extracted structured data from model text response
- ✓ Distance threshold tuning: Adjusted for FAISS L2 distance (default 0.3)
+ ✓ JSON parsing: Extracted structured data from model text response (with dict type-check for Gemini responses)
+ ✓ Embedding model upgrade: Migrated from multilingual-e5-large (1024-d) to Qwen3-Embedding-8B (4096-d)
+ ✓ L2 normalization: Added unit-vector normalization to embeddings for consistent distance calculations
+ ✓ Distance threshold tuning: Adjusted for normalized embeddings (0.5 → 1.5 for search, 0.3 → 1.5 for albums)
  ✓ Album randomization: Added random.shuffle() to prevent deterministic groupings
  ✓ Error handling: Wrapped API failures to graceful fallbacks
@@ -60,14 +62,22 @@ Manual Refinements (35%):
  - CORS middleware configuration

  KEY TECHNICAL DECISIONS:
- 1. Distance threshold = 0.3: Filters visually similar photos
- 2. Model choice: Qwen3-VL for balanced speed/quality
- 3. FLUX.1-dev: High-quality image generation over speed
- 4. Random album creation: Ensures different groupings per request
- 5. HuggingFace Hub: Leveraged pre-tuned models vs training custom
+ 1. Embedding model: Qwen3-Embedding-8B (4096-d) for better semantic understanding than smaller models
+ 2. L2 normalization: Ensures bounded distances (0-2 range for unit vectors) independent of embedding dimension
+ 3. Distance thresholds: search() ≤ 1.5, create_albums() ≤ 1.5 (tuned for normalized embeddings)
+ 4. Model choice: Qwen3-VL for balanced speed/quality in image analysis
+ 5. FLUX.1-dev: High-quality image generation over speed
+ 6. Random album creation: Ensures different groupings per request
+ 7. HuggingFace Hub: Leveraged pre-tuned models vs training custom

  FILES MODIFIED FOR IMPROVEMENTS:
- - search_engine.py: Added randomization + album count control
+ - ai_utils.py: Added L2 normalization to both generate_embedding() and _embed_text() methods
+ - search_engine.py: Updated distance thresholds (0.5 → 1.5 search, 0.3 → 1.5 albums) for normalized embeddings
  - image_analyzer.py: JSON error handling for vision model output
- - image_analyzer_2.py: Agentic image analysis with Gemini-2.0-Flash for aesthetic descriptions
- - text_to_image.py: Timestamp-based filename collision prevention
+ - image_analyzer_2.py: Dict type-check for Gemini responses + agentic image analysis with Gemini-2.0-Flash
+ - text_to_image.py: Timestamp-based filename collision prevention
+
+ EMBEDDING UPGRADE SUMMARY:
+ Old: multilingual-e5-large (1024-dimensional, unnormalized)
+ New: Qwen/Qwen3-Embedding-8B (4096-dimensional, L2-normalized)
+ Benefit: Better semantic understanding + consistent distance calculations across query types
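
A note on the distance math this report relies on: for unit-length vectors u and v, the squared L2 distance satisfies ||u - v||^2 = 2 - 2*cos(u, v), so plain L2 distance is bounded by 2 (and FAISS's IndexFlatL2, which reports squared distances, is bounded by 4). A minimal editorial sketch, not part of the commit, that checks the identity:

    import numpy as np

    # Verify that for unit vectors the squared L2 distance
    # equals 2 - 2 * cosine_similarity.
    rng = np.random.default_rng(0)
    u = rng.normal(size=4096)
    v = rng.normal(size=4096)
    u /= np.linalg.norm(u)
    v /= np.linalg.norm(v)

    l2_squared = np.sum((u - v) ** 2)   # what IndexFlatL2 would report
    cosine = float(np.dot(u, v))
    assert np.isclose(l2_squared, 2 - 2 * cosine)
    # cosine in [-1, 1]  =>  squared distance in [0, 4], plain distance in [0, 2]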
cloudzy/agents/image_analyzer_2.py CHANGED
@@ -97,29 +97,43 @@ result: {

          response = self.agent.run(prompt, images=[image])

-         # Ensure response is a string
-         response_text = str(response) if response is not None else ""
-
-         # Extract JSON part from response
-         # Look for the pattern: result: { ... } (or { ... if closing brace is missing)
-         match = re.search(r'result:\s*(\{[\s\S]*)', response_text)
-
-         if not match:
-             raise ValueError(f"Could not find JSON in response: {response_text}")
-
-         json_str = match.group(1)
-
-         # If the extracted JSON doesn't end with }, try adding it
-         if not json_str.rstrip().endswith("}"):
-             print(f"[Warning] No closing brace found in JSON, attempting to add closing brace...")
-             json_str = json_str + "}"
-
+         # If the response is already a dict (e.g. from Gemini), return it directly
+         if isinstance(response, dict):
+             return response
+
+         # Safely convert to string, handling non-string types
+         if response is None:
+             text_content = ""
+         else:
+             text_content = str(response).strip()
+
+         if not text_content:
+             raise ValueError("Model returned empty response")
+
+         # Try to extract a JSON-like dict from the model output
          try:
-             # Parse the JSON string into a dictionary
-             result_dict = json.loads(json_str)
-             return result_dict
-         except json.JSONDecodeError as e:
-             raise ValueError(f"Failed to parse JSON from response: {json_str}\nError: {str(e)}")
+             if "{" not in text_content:
+                 raise ValueError("Response does not contain valid JSON structure (missing opening brace)")
+
+             start = text_content.index("{")
+
+             # Try to find the closing brace
+             if "}" not in text_content[start:]:
+                 # No closing brace found, try adding one
+                 print("[Warning] No closing brace found in response, attempting to add closing brace...")
+                 json_str = text_content[start:] + "}"
+             else:
+                 end = text_content.rindex("}") + 1
+                 json_str = text_content[start:end]
+
+             result = json.loads(json_str)
+             return result
+         except json.JSONDecodeError as je:
+             # json.JSONDecodeError subclasses ValueError, so it must be caught first
+             raise ValueError(f"Invalid JSON in model output: {text_content}\nError: {je}")
+         except ValueError as ve:
+             raise ValueError(f"Failed to parse model output: {text_content}\nError: {ve}")
+         except Exception as e:
+             raise ValueError(f"Failed to parse model output: {text_content}\nError: {e}")


  # Test with sample images
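
The new parsing path above boils down to: return dicts as-is, otherwise slice from the first "{" to the last "}" (appending a brace if the output was truncated) and json.loads the result. A standalone sketch of that brace-extraction idea, using a hypothetical helper name, for quick testing outside the agent:

    import json

    def extract_json_dict(text: str) -> dict:
        # Hypothetical helper mirroring the logic in the diff above
        if "{" not in text:
            raise ValueError("no opening brace in model output")
        tail = text[text.index("{"):]
        # Append a closing brace if the model's output was cut off
        json_str = tail[: tail.rindex("}") + 1] if "}" in tail else tail + "}"
        return json.loads(json_str)

    print(extract_json_dict('result: {"tags": ["cat"], "caption": "a cat"}'))
    print(extract_json_dict('{"tags": ["dog"]'))  # truncated: brace gets appended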
cloudzy/ai_utils.py CHANGED
@@ -1,24 +1,28 @@
  import os
  import numpy as np
  from huggingface_hub import InferenceClient
+ from typing import List, Dict, Tuple
+ import re

  from dotenv import load_dotenv
  load_dotenv()

+
+
  class ImageEmbeddingGenerator:
-     def __init__(self, model_name: str = "intfloat/multilingual-e5-large"):
+     def __init__(self, model_name: str = "Qwen/Qwen3-Embedding-8B"):
          """
          Initialize the embedding generator with a Hugging Face model.
          """
          self.client = InferenceClient(
-             provider="hf-inference",
+             provider="nebius",
              api_key=os.environ["HF_TOKEN_1"],
          )
          self.model_name = model_name

      def generate_embedding(self, tags: list[str], description: str, caption: str) -> np.ndarray:
          """
-         Generate a 512-d embedding for an image using its tags, description, and caption.
+         Generate a 4096-d embedding for an image using its tags, description, and caption.

          Args:
              tags: List of tags related to the image
@@ -26,7 +30,7 @@ class ImageEmbeddingGenerator:
              caption: Short caption for the image

          Returns:
-             embedding: 1D numpy array of shape (512,)
+             embedding: 1D numpy array of shape (4096,), normalized to unit length
          """
          # Combine text fields into a single string
          text = " ".join(tags) + " " + description + " " + caption
@@ -40,9 +44,15 @@ class ImageEmbeddingGenerator:
          # Convert to numpy array
          embedding = np.array(result, dtype=np.float32).reshape(-1)

-         # Ensure shape is (512,)
-         if embedding.shape[0] != 1024:
-             raise ValueError(f"Expected embedding of size 512, got {embedding.shape[0]}")
+         # Ensure shape is (4096,)
+         if embedding.shape[0] != 4096:
+             raise ValueError(f"Expected embedding of size 4096, got {embedding.shape[0]}")
+
+         # Normalize to unit length (L2 normalization)
+         # This ensures distances stay consistent across models and dimensions
+         norm = np.linalg.norm(embedding)
+         if norm > 0:
+             embedding = embedding / norm

          return embedding

@@ -50,6 +60,7 @@ class ImageEmbeddingGenerator:
      def _embed_text(self, text: str) -> np.ndarray:
          """
          Internal helper to call Hugging Face feature_extraction and return a numpy array.
+         Embeddings are normalized to unit length for consistent distance calculations.
          """
          result = self.client.feature_extraction(
              text,
@@ -57,11 +68,19 @@ class ImageEmbeddingGenerator:
          )
          embedding = np.array(result, dtype=np.float32).reshape(-1)

-         if embedding.shape[0] != 1024:
-             raise ValueError(f"Expected embedding of size 1024, got {embedding.shape[0]}")
+         if embedding.shape[0] != 4096:
+             raise ValueError(f"Expected embedding of size 4096, got {embedding.shape[0]}")
+
+         # Normalize to unit length (L2 normalization)
+         norm = np.linalg.norm(embedding)
+         if norm > 0:
+             embedding = embedding / norm
+
          return embedding


+
+
  class TextSummarizer:
      def __init__(self, model_name: str = "facebook/bart-large-cnn"):
          """
cloudzy/routes/photo.py CHANGED
@@ -89,7 +89,7 @@ async def get_albums(
      """

      search_engine = SearchEngine()
-     albums_ids = search_engine.create_albums(top_k=top_k)
+     albums_ids = search_engine.create_albums_kmeans(top_k=top_k)
      APP_DOMAIN = os.getenv("APP_DOMAIN") or "http://127.0.0.1:8000/"
      summarizer = TextSummarizer()
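
The route now delegates grouping to k-means (defined in search_engine.py below). A hedged usage sketch of the new call and its return shape:

    from cloudzy.search_engine import SearchEngine  # import path assumed from the repo layout

    engine = SearchEngine()
    albums_ids = engine.create_albums_kmeans(top_k=5)  # list of albums, each a list of photo IDs
    for i, album in enumerate(albums_ids):
        print(f"Album {i}: {len(album)} photos -> {album}")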
cloudzy/routes/search.py CHANGED
@@ -21,56 +21,47 @@ async def search_photos(
      session: Session = Depends(get_session),
  ):
      """
-     Semantic search for similar photos using FAISS.
-
-     Converts query to embedding and finds most similar images.
-
+     Semantic search endpoint using FAISS.
+
      Args:
          q: Search query (used to generate embedding)
          top_k: Number of results to return (max 50)
-
-     Returns: List of similar photos with distance scores
+
+     Returns: List of similar photos
      """

      generator = ImageEmbeddingGenerator()
      query_embedding = generator._embed_text(q)

-
-
-     # Search in FAISS
      search_engine = SearchEngine()
      search_results = search_engine.search(query_embedding, top_k=top_k)
-
-
+
      if not search_results:
          return SearchResponse(
              query=q,
              results=[],
              total_results=0,
          )
-
-     APP_DOMAIN = os.getenv("APP_DOMAIN")
-
-
-
-     # Fetch photo details from database
+
+     APP_DOMAIN = os.getenv("APP_DOMAIN")
+
      result_objects = []
+
      for photo_id, distance in search_results:
          statement = select(Photo).where(Photo.id == photo_id)
          photo = session.exec(statement).first()
-
-         if photo:  # Only include if photo exists in DB
+
+         if photo:
              result_objects.append(
                  SearchResult(
                      photo_id=photo.id,
                      filename=photo.filename,
-                     image_url = f"{APP_DOMAIN}uploads/{photo.filename}",
+                     image_url=f"{APP_DOMAIN}uploads/{photo.filename}",
                      tags=photo.get_tags(),
                      caption=photo.caption,
                      distance=distance,
                  )
              )
-
+
      return SearchResponse(
          query=q,
          results=result_objects,
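
One fragility worth noting in the URL line above: f"{APP_DOMAIN}uploads/..." only yields a valid URL when APP_DOMAIN ends with a slash (the default in photo.py does; an env override may not). A slash-robust alternative, offered as an editorial suggestion rather than what the commit does:

    # Works whether or not APP_DOMAIN carries a trailing slash
    APP_DOMAIN = "http://127.0.0.1:8000"   # hypothetical env value without the slash
    filename = "photo.jpg"
    image_url = f"{APP_DOMAIN.rstrip('/')}/uploads/{filename}"
    print(image_url)  # http://127.0.0.1:8000/uploads/photo.jpg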
cloudzy/search_engine.py CHANGED
@@ -9,7 +9,7 @@ import random
  class SearchEngine:
      """FAISS-based search engine for image embeddings"""

-     def __init__(self, dim: int = 1024, index_path: str = "faiss_index.bin"):
+     def __init__(self, dim: int = 4096, index_path: str = "faiss_index.bin"):
          self.dim = dim
          self.index_path = index_path

@@ -20,7 +20,7 @@ class SearchEngine:
          base_index = faiss.IndexFlatL2(dim)
          self.index = faiss.IndexIDMap(base_index)

-     def create_albums(self, top_k: int = 5, distance_threshold: float = 0.3, album_size: int = 5) -> List[List[int]]:
+     def create_albums(self, top_k: int = 5, distance_threshold: float = 1.5, album_size: int = 5) -> List[List[int]]:
          """
          Group similar images into albums (clusters).

@@ -28,9 +28,14 @@ class SearchEngine:
          Photos are marked as visited to avoid duplicate albums.
          Only includes photos within the distance threshold.

+         OPTIMIZATIONS:
+         - Batch retrieves all photos in ONE database query (not per-photo)
+         - Caches embeddings in memory during execution
+         - Single session for all DB operations
+
          Args:
              top_k: Number of albums to return
-             distance_threshold: Maximum distance to consider photos as similar (default 0.3)
+             distance_threshold: Maximum distance to consider photos as similar (default 1.5 for normalized embeddings)
              album_size: How many similar photos to search for per album (default 5)

          Returns:
@@ -51,6 +56,20 @@ class SearchEngine:
          # Shuffle for randomization - different albums each call
          random.shuffle(all_ids)

+         # ✅ OPTIMIZATION 1: Batch retrieve all photos in ONE query
+         session = SessionLocal()
+         try:
+             # Fetch all photos at once, not in a loop
+             photos_query = session.exec(select(Photo).where(Photo.id.in_(all_ids))).all()
+             # ✅ OPTIMIZATION 2: Cache embeddings in memory
+             embedding_cache = {}
+             for photo in photos_query:
+                 embedding = photo.get_embedding()
+                 if embedding:
+                     embedding_cache[photo.id] = embedding
+         finally:
+             session.close()
+
          visited = set()
          albums = []

@@ -63,37 +82,80 @@ class SearchEngine:
              if photo_id in visited:
                  continue

-             # Get embedding from database
-             session = SessionLocal()
-             try:
-                 photo = session.exec(select(Photo).where(Photo.id == photo_id)).first()
-                 if not photo:
-                     continue
-
-                 embedding = photo.get_embedding()
-                 if not embedding:
-                     continue
-
-                 # Search for similar images
-                 query_embedding = np.array(embedding).reshape(1, -1).astype(np.float32)
-                 distances, ids = self.index.search(query_embedding, album_size)
-
-                 # Build album: collect similar photos that haven't been visited and are within threshold
-                 album = []
-                 for pid, distance in zip(ids[0], distances[0]):
-                     if pid != -1 and pid not in visited and distance <= distance_threshold:
-                         album.append(int(pid))
-                         visited.add(pid)
-
-                 # Add album if it has at least 1 photo
-                 if album:
-                     albums.append(album)
-
-             finally:
-                 session.close()
+             # Skip if no embedding cached
+             if photo_id not in embedding_cache:
+                 continue
+
+             # Get embedding from cache (not DB)
+             embedding = embedding_cache[photo_id]
+
+             # Search for similar images
+             query_embedding = np.array(embedding).reshape(1, -1).astype(np.float32)
+             distances, ids = self.index.search(query_embedding, album_size)
+
+             # Build album: collect similar photos that haven't been visited and are within threshold
+             album = []
+             for pid, distance in zip(ids[0], distances[0]):
+                 if pid != -1 and pid not in visited and distance <= distance_threshold:
+                     album.append(int(pid))
+                     visited.add(pid)
+
+             # Add album if it has at least 1 photo
+             if album:
+                 albums.append(album)

          return albums

+     def create_albums_kmeans(self, top_k: int = 5, seed: int = 42) -> List[List[int]]:
+         """
+         Group similar images into albums using FAISS k-means clustering.
+
+         This is a BETTER approach than nearest-neighbor grouping:
+         - Uses true k-means clustering instead of ad-hoc neighbor search
+         - All photos get assigned to a cluster (no "orphans")
+         - Deterministic results for the same seed
+         - Much faster for large datasets
+
+         Args:
+             top_k: Number of clusters (albums) to create
+             seed: Random seed for reproducibility
+
+         Returns:
+             List of up to top_k albums, each album a list of photo_ids
+         """
+         self.load()
+         if self.index.ntotal < top_k:
+             return []
+
+         # Get all photo IDs from the FAISS index
+         id_map = self.index.id_map
+         all_ids = np.array([id_map.at(i) for i in range(id_map.size())], dtype=np.int64)
+
+         # Get all embeddings from the underlying index (IndexIDMap wraps the actual index)
+         underlying_index = faiss.downcast_index(self.index.index)
+         all_embeddings = underlying_index.reconstruct_n(0, self.index.ntotal).astype(np.float32)
+
+         # Run k-means clustering
+         kmeans = faiss.Kmeans(
+             d=self.dim,
+             k=top_k,
+             niter=20,
+             verbose=False,
+             seed=seed
+         )
+         kmeans.train(all_embeddings)
+
+         # Assign each embedding to its nearest cluster centroid
+         _, cluster_assignments = kmeans.index.search(all_embeddings, 1)
+
+         # Group photos by cluster
+         albums = [[] for _ in range(top_k)]
+         for photo_id, cluster_id in zip(all_ids, cluster_assignments.flatten()):
+             albums[cluster_id].append(int(photo_id))
+
+         # Remove empty albums and return
+         return [album for album in albums if album]
+
      def add_embedding(self, photo_id: int, embedding: np.ndarray) -> None:
          """
          Add an embedding to the index.
@@ -120,7 +182,7 @@ class SearchEngine:
              top_k: Number of results to return

          Returns:
-             List of (photo_id, distance) tuples with distance <= 0.5
+             List of (photo_id, distance) tuples with distance <= 1.5 (normalized embeddings)
          """
          self.load()

@@ -133,11 +195,12 @@ class SearchEngine:
          # Search in FAISS index
          distances, ids = self.index.search(query_embedding, top_k)

          # Filter invalid and distant results
+         # With L2-normalized embeddings, IndexFlatL2 reports squared L2 distances (0-4 range);
+         # a threshold of 1.5 keeps matches with cosine similarity >= 0.25
          results = [
              (int(photo_id), float(distance))
              for photo_id, distance in zip(ids[0], distances[0])
-             if photo_id != -1 and distance <= 0.5
+             if photo_id != -1 and distance <= 1.5
          ]

          return results
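
The create_albums_kmeans path can be exercised without the photo database. A minimal sketch on synthetic unit vectors (dimension reduced from 4096 for speed; the faiss.Kmeans parameters match the diff):

    import numpy as np
    import faiss

    dim, n, k = 64, 200, 5
    rng = np.random.default_rng(42)
    x = rng.normal(size=(n, dim)).astype(np.float32)
    x /= np.linalg.norm(x, axis=1, keepdims=True)   # L2-normalize, as the pipeline does

    kmeans = faiss.Kmeans(d=dim, k=k, niter=20, verbose=False, seed=42)
    kmeans.train(x)

    # Every vector is assigned to its nearest centroid, so no photo is orphaned
    _, assignments = kmeans.index.search(x, 1)
    albums = [[] for _ in range(k)]
    for i, cluster_id in enumerate(assignments.flatten()):
        albums[cluster_id].append(i)
    print([len(a) for a in albums])   # cluster sizes; they sum to n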