---
license: mit
pipeline_tag: visual-document-retrieval
---
					
						
# EndpointHandler

`EndpointHandler` is a Python class that processes image and text data to generate embeddings and similarity scores using the ColQwen2 model, a visual retriever based on Qwen2-VL-2B-Instruct with the ColBERT strategy. This handler is optimized for retrieving documents and visual information based on their visual and textual features.

## Overview

- **Efficient Document Retrieval**: Uses the ColQwen2 model to produce embeddings for images and text for accurate document retrieval.
- **Multi-vector Representation**: Generates ColBERT-style multi-vector embeddings for improved similarity search.
- **Flexible Image Resolution**: Supports dynamic image resolution without altering the aspect ratio, capped at 768 patches for memory efficiency.
- **Device Compatibility**: Automatically utilizes available CUDA devices or defaults to CPU (a typical selection pattern is sketched below).

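As a minimal sketch of the device selection described above, here is the common PyTorch pattern; it is illustrative and not necessarily the handler's exact code:

```python
import torch

# Pick a CUDA device when one is available, otherwise fall back to CPU.
# Illustrative assumption: the handler's internal logic may differ.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(f"Running on {device}")
```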
					
						
## Model Details

The **ColQwen2** model extends Qwen2-VL-2B-Instruct with a focus on vision-language tasks, making it suitable for content indexing and retrieval. Key features include:

- **Training**: Pre-trained with a batch size of 256 over 5 epochs, with a modified pad token.
- **Input Flexibility**: Handles various image resolutions without resizing, ensuring accurate multi-vector representation.
- **Similarity Scoring**: Utilizes a ColBERT-style late-interaction scoring approach for efficient retrieval across image and text modalities (a minimal sketch follows below).

This base version is untrained, providing deterministic initialization of the projection layer for further customization.

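To illustrate the ColBERT-style scoring mentioned above, the following is a minimal sketch of late-interaction (MaxSim) scoring between one multi-vector query embedding and one multi-vector document embedding. The function name and tensor shapes are illustrative assumptions, not the handler's exact implementation:

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> float:
    """ColBERT-style late-interaction score for one query / document pair.

    query_emb: (num_query_tokens, dim) multi-vector embedding
    doc_emb:   (num_doc_tokens, dim) multi-vector embedding
    Shapes are assumptions for illustration; the handler may batch differently.
    """
    # Similarity of every query token against every document token.
    sim = query_emb @ doc_emb.T  # (num_query_tokens, num_doc_tokens)
    # For each query token, keep its best-matching document token, then sum.
    return sim.max(dim=1).values.sum().item()

# Toy usage with random embeddings.
q = torch.randn(16, 128)
d = torch.randn(700, 128)
print(maxsim_score(q, d))
```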
					
						
## How to Use

The following example demonstrates how to use `EndpointHandler` for processing PDF documents and text. PDF pages are converted to base64 images, which are then passed as input alongside text data to the handler.

### Example Script
					
						
```python
import torch
from pdf2image import convert_from_path
import base64
from io import BytesIO
import requests


# Function to convert a PIL Image to a base64 string
def pil_image_to_base64(image):
    """Converts a PIL Image to a base64 encoded string."""
    buffer = BytesIO()
    image.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode()


# Function to convert PDF pages to base64 images
def convert_pdf_to_base64_images(pdf_path):
    """Converts PDF pages to base64 encoded images."""
    pages = convert_from_path(pdf_path)
    return [pil_image_to_base64(page) for page in pages]


# Function to send the payload to the API and retrieve the response
def query_api(payload, api_url, headers):
    """Sends a POST request to the API and returns the response."""
    response = requests.post(api_url, headers=headers, json=payload)
    return response.json()


# Main execution
if __name__ == "__main__":
    # Convert PDF pages to base64 encoded images
    encoded_images = convert_pdf_to_base64_images("document.pdf")

    # Prepare payload
    payload = {
        "inputs": [],
        "image": encoded_images,
        "text": ["example query text"],
    }

    # API configuration
    API_URL = "https://your-api-url"
    headers = {
        "Accept": "application/json",
        "Authorization": "Bearer your_access_token",
        "Content-Type": "application/json",
    }

    # Query the API and get output
    output = query_api(payload=payload, api_url=API_URL, headers=headers)
    print(output)
```
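Note that `pdf2image` requires the Poppler utilities to be installed on the system in addition to the Python package (e.g., `pip install pdf2image requests`), and that `API_URL` and the bearer token are placeholders for your endpoint's URL and access token.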
					
						
## Inputs and Outputs

### Input Format

The `EndpointHandler` expects a dictionary containing:

- **image**: A list of base64-encoded strings for images (e.g., PDF pages converted to images).
- **text**: A list of text strings representing queries or document contents.
- **batch_size** (optional): The batch size for processing images and text. Defaults to `4`.

Example payload:
					
						
```json
{
  "image": ["base64_image_string_1", "base64_image_string_2"],
  "text": ["sample text 1", "sample text 2"],
  "batch_size": 4
}
```
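On the server side, each base64 string must be decoded back into an image before the model can process it. A minimal sketch of that step, assuming PNG-encoded strings like those produced by the client script above (the handler's actual decoding logic may differ):

```python
import base64
from io import BytesIO

from PIL import Image

def decode_base64_image(encoded: str) -> Image.Image:
    """Decode a base64 string (as produced by the client script) into a PIL image."""
    return Image.open(BytesIO(base64.b64decode(encoded))).convert("RGB")

# Usage sketch: images = [decode_base64_image(s) for s in payload["image"]]
```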
					
						
### Output Format

The handler returns a dictionary with the following keys:

- **image**: List of embeddings for each image.
- **text**: List of embeddings for each text entry.
- **scores**: List of similarity scores between the image and text embeddings.

Example output:
					
						
```json
{
  "image": [[0.12, 0.34, ...], [0.56, 0.78, ...]],
  "text": [[0.11, 0.22, ...], [0.33, 0.44, ...]],
  "scores": [[0.87, 0.45], [0.23, 0.67]]
}
```
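The scores can be used directly to rank images (e.g., PDF pages) against a query. A small sketch, assuming `scores[i][j]` holds the similarity between text entry `i` and image `j`; this orientation is an assumption and should be checked against the actual handler output:

```python
def rank_images_for_query(scores, query_index=0):
    """Return image indices sorted from most to least similar to one query.

    Assumes scores[query_index][j] is the similarity between the query and
    image j; verify this orientation against the actual handler output.
    """
    row = scores[query_index]
    return sorted(range(len(row)), key=lambda j: row[j], reverse=True)

# Example with the output shown above.
scores = [[0.87, 0.45], [0.23, 0.67]]
print(rank_images_for_query(scores, query_index=0))  # -> [0, 1]
```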
					
						
### Error Handling

If any issues occur during processing (e.g., decoding images or model inference), the handler logs the error and returns an error message in the output dictionary.