---
license: mit
pipeline_tag: visual-document-retrieval
---
					
						
# EndpointHandler

`EndpointHandler` is a Python class that processes image and text data to generate embeddings and similarity scores using the ColQwen2 model, a visual retriever based on Qwen2-VL-2B-Instruct with the ColBERT strategy. This handler is optimized for retrieving documents and visual information based on their visual and textual features.

## Overview

- **Efficient Document Retrieval**: Uses the ColQwen2 model to produce embeddings for images and text for accurate document retrieval.
- **Multi-vector Representation**: Generates ColBERT-style multi-vector embeddings for improved similarity search.
- **Flexible Image Resolution**: Supports dynamic image resolution without altering the aspect ratio, capped at 768 patches for memory efficiency.
- **Device Compatibility**: Automatically utilizes available CUDA devices or defaults to CPU (a typical selection pattern is sketched below).

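As a minimal sketch of the device selection described above, here is the common PyTorch pattern; it is illustrative and not necessarily the handler's exact code:

```python
import torch

# Pick a CUDA device when one is available, otherwise fall back to CPU.
# Illustrative assumption: the handler's internal logic may differ.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
print(f"Running on {device}")
```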
					
						
## Model Details

The **ColQwen2** model extends Qwen2-VL-2B-Instruct with a focus on vision-language tasks, making it suitable for content indexing and retrieval. Key features include:

- **Training**: Pre-trained with a batch size of 256 over 5 epochs, with a modified pad token.
- **Input Flexibility**: Handles various image resolutions without resizing, ensuring accurate multi-vector representation.
- **Similarity Scoring**: Utilizes a ColBERT-style late-interaction scoring approach for efficient retrieval across image and text modalities (a minimal sketch follows below).

This base version is untrained, providing deterministic initialization of the projection layer for further customization.

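To illustrate the ColBERT-style scoring mentioned above, the following is a minimal sketch of late-interaction (MaxSim) scoring between one multi-vector query embedding and one multi-vector document embedding. The function name and tensor shapes are illustrative assumptions, not the handler's exact implementation:

```python
import torch

def maxsim_score(query_emb: torch.Tensor, doc_emb: torch.Tensor) -> float:
    """ColBERT-style late-interaction score for one query / document pair.

    query_emb: (num_query_tokens, dim) multi-vector embedding
    doc_emb:   (num_doc_tokens, dim) multi-vector embedding
    Shapes are assumptions for illustration; the handler may batch differently.
    """
    # Similarity of every query token against every document token.
    sim = query_emb @ doc_emb.T  # (num_query_tokens, num_doc_tokens)
    # For each query token, keep its best-matching document token, then sum.
    return sim.max(dim=1).values.sum().item()

# Toy usage with random embeddings.
q = torch.randn(16, 128)
d = torch.randn(700, 128)
print(maxsim_score(q, d))
```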
					
						
## How to Use

The following example demonstrates how to use `EndpointHandler` for processing PDF documents and text. PDF pages are converted to base64 images, which are then passed as input alongside text data to the handler.

### Example Script
					
						
```python
import torch
from pdf2image import convert_from_path
import base64
from io import BytesIO
import requests


# Function to convert a PIL Image to a base64 string
def pil_image_to_base64(image):
    """Converts a PIL Image to a base64 encoded string."""
    buffer = BytesIO()
    image.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode()


# Function to convert PDF pages to base64 images
def convert_pdf_to_base64_images(pdf_path):
    """Converts PDF pages to base64 encoded images."""
    pages = convert_from_path(pdf_path)
    return [pil_image_to_base64(page) for page in pages]


# Function to send the payload to the API and retrieve the response
def query_api(payload, api_url, headers):
    """Sends a POST request to the API and returns the response."""
    response = requests.post(api_url, headers=headers, json=payload)
    return response.json()


# Main execution
if __name__ == "__main__":
    # Convert PDF pages to base64 encoded images
    encoded_images = convert_pdf_to_base64_images("document.pdf")

    # Prepare payload
    payload = {
        "inputs": [],
        "image": encoded_images,
        "text": ["example query text"],
    }

    # API configuration
    API_URL = "https://your-api-url"
    headers = {
        "Accept": "application/json",
        "Authorization": "Bearer your_access_token",
        "Content-Type": "application/json",
    }

    # Query the API and get output
    output = query_api(payload=payload, api_url=API_URL, headers=headers)
    print(output)
```
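Note that `pdf2image` requires the Poppler utilities to be installed on the system in addition to the Python package (e.g., `pip install pdf2image requests`), and that `API_URL` and the bearer token are placeholders for your endpoint's URL and access token.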
					
						
## Inputs and Outputs

### Input Format

The `EndpointHandler` expects a dictionary containing:

- **image**: A list of base64-encoded strings for images (e.g., PDF pages converted to images).
- **text**: A list of text strings representing queries or document contents.
- **batch_size** (optional): The batch size for processing images and text. Defaults to `4`.

Example payload:
					
						
```json
{
  "image": ["base64_image_string_1", "base64_image_string_2"],
  "text": ["sample text 1", "sample text 2"],
  "batch_size": 4
}
```
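On the server side, each base64 string must be decoded back into an image before the model can process it. A minimal sketch of that step, assuming PNG-encoded strings like those produced by the client script above (the handler's actual decoding logic may differ):

```python
import base64
from io import BytesIO

from PIL import Image

def decode_base64_image(encoded: str) -> Image.Image:
    """Decode a base64 string (as produced by the client script) into a PIL image."""
    return Image.open(BytesIO(base64.b64decode(encoded))).convert("RGB")

# Usage sketch: images = [decode_base64_image(s) for s in payload["image"]]
```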
					
						
### Output Format

The handler returns a dictionary with the following keys:

- **image**: List of embeddings for each image.
- **text**: List of embeddings for each text entry.
- **scores**: List of similarity scores between the image and text embeddings.

Example output:
					
						
```json
{
  "image": [[0.12, 0.34, ...], [0.56, 0.78, ...]],
  "text": [[0.11, 0.22, ...], [0.33, 0.44, ...]],
  "scores": [[0.87, 0.45], [0.23, 0.67]]
}
```
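The scores can be used directly to rank images (e.g., PDF pages) against a query. A small sketch, assuming `scores[i][j]` holds the similarity between text entry `i` and image `j`; this orientation is an assumption and should be checked against the actual handler output:

```python
def rank_images_for_query(scores, query_index=0):
    """Return image indices sorted from most to least similar to one query.

    Assumes scores[query_index][j] is the similarity between the query and
    image j; verify this orientation against the actual handler output.
    """
    row = scores[query_index]
    return sorted(range(len(row)), key=lambda j: row[j], reverse=True)

# Example with the output shown above.
scores = [[0.87, 0.45], [0.23, 0.67]]
print(rank_images_for_query(scores, query_index=0))  # -> [0, 1]
```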
					
						
### Error Handling

If any issues occur during processing (e.g., decoding images or model inference), the handler logs the error and returns an error message in the output dictionary.