---
title: OmniScience -- Agentic Imaging Analysis
emoji: 🔬🧫
colorFrom: indigo
colorTo: gray
sdk: gradio
sdk_version: 4.36.1
app_file: app.py
pinned: false
license: openrail
---
# Using Landing AI's Vision Agent to architect an app for brain tumor detection

- A quick overview of the inner workings of LandingAI's Vision Agent: how it breaks down an initial user requirement to identify candidate components in the application architecture.
- The diagram below captures what I had in mind for a multi-agent system implementation -- but LandingAI's Vision Agent starts this much earlier, taking a fresh approach to old-school architecture trade-off analysis.
- The design-time flow in the most recent version of the official [Vision Agent](https://va.landing.ai/agent) app has shifted somewhat, but the number of concepts it helped bring together for me was amazing.
- If you want a deeper understanding of the run-time flow of the application, I encourage you to instrument it with Weave (a minimal sketch follows this list). Additional information on how to instrument the app can be found in [this GitHub repo](https://github.com/donbr/vision-agent).
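
As a quick taste of that instrumentation, here is a minimal sketch using W&B Weave. The project name and the `detect_and_segment` wrapper are hypothetical stand-ins for the app's real entry points; see the linked repo for the actual integration.

```python
# Minimal Weave tracing sketch -- project name and function are illustrative.
import weave

weave.init("omniscience-vision-agent")  # begin logging traces to this project

@weave.op()  # records inputs, outputs, and latency for each call
def detect_and_segment(image_path: str):
    # The app's Vision Agent tool calls (detection, segmentation, overlays)
    # would run here; Weave captures the full call tree.
    ...
```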
# Summary of Vision Agent design-time workflow

MY PROMPT: use detection (bounding box) and segmentation (mask) techniques to detect brain tumors in the image.

## Creating instructions

- Load the image from the given file path '/home/user/xxxx.jpg'.
- Use the 'owl_v2' tool to detect brain tumors in the image. The prompt should be 'brain tumor'.
- Use the 'grounding_sam' tool to segment brain tumors in the image. The prompt should be 'brain tumor'.
- Overlay the bounding boxes from the detection results on the original image using the 'overlay_bounding_boxes' utility.
- Overlay the segmentation masks from the segmentation results on the original image using the 'overlay_segmentation_masks' utility.
- Save the final image with both bounding boxes and segmentation masks to a specified output path.
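
Putting those instructions together, here is a minimal sketch of the code the agent generates, assuming the tools documented below are importable from `vision_agent.tools`; the output path is illustrative.

```python
# Sketch of the planned pipeline: detect, segment, overlay, save.
from vision_agent.tools import (
    load_image,
    save_image,
    owl_v2,
    grounding_sam,
    overlay_bounding_boxes,
    overlay_segmentation_masks,
)

image = load_image("/home/user/xxxx.jpg")

# Bounding-box detection and mask segmentation, both prompted with 'brain tumor'.
detections = owl_v2("brain tumor", image)
segments = grounding_sam("brain tumor", image)

# Overlay both result types on the original image, then save.
image = overlay_bounding_boxes(image, detections)
image = overlay_segmentation_masks(image, segments)
save_image(image, "/home/user/output.jpg")  # illustrative output path
```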
## Retrieving Vision Agent tools

- 'load_image' is a utility function that loads an image from the given file path string.
- 'save_image' is a utility function that saves an image to a file path.
- 'owl_v2' is a tool that can detect and count multiple objects given a text prompt such as category names or referring expressions. The categories in text prompt are separated by commas. It returns a list of bounding boxes with normalized coordinates, label names and associated probability scores.
- 'florencev2_object_detection' is a tool that can detect common objects in an image without any text prompt or thresholding. It returns a list of detected objects as labels and their location as bounding boxes.
- 'grounding_sam' is a tool that can segment multiple objects given a text prompt such as category names or referring expressions. The categories in text prompt are separated by commas or periods. It returns a list of bounding boxes, label names, mask file names and associated probability scores.
- 'detr_segmentation' is a tool that can segment common objects in an image without any text prompt. It returns a list of detected objects as labels, their regions as masks and their scores.
- 'overlay_bounding_boxes' is a utility function that displays bounding boxes on an image.
- 'overlay_heat_map' is a utility function that displays a heat map on an image.
- 'overlay_segmentation_masks' is a utility function that displays segmentation masks.
### Retrieving tools - detailed notes from Vision Agent tool selection

```text
load_image(image_path: str) -> numpy.ndarray:
    'load_image' is a utility function that loads an image from the given file path string.
    Parameters:
        image_path (str): The path to the image.
    Returns:
        np.ndarray: The image as a NumPy array.
    Example
    -------
        >>> load_image("path/to/image.jpg")

save_image(image: numpy.ndarray, file_path: str) -> None:
    'save_image' is a utility function that saves an image to a file path.
    Parameters:
        image (np.ndarray): The image to save.
        file_path (str): The path to save the image file.
    Example
    -------
        >>> save_image(image, "path/to/image.jpg")
owl_v2(prompt: str, image: numpy.ndarray, box_threshold: float = 0.1, iou_threshold: float = 0.1) -> List[Dict[str, Any]]:
    'owl_v2' is a tool that can detect and count multiple objects given a text
    prompt such as category names or referring expressions. The categories in text prompt
    are separated by commas. It returns a list of bounding boxes with
    normalized coordinates, label names and associated probability scores.
    Parameters:
        prompt (str): The prompt to ground to the image.
        image (np.ndarray): The image to ground the prompt to.
        box_threshold (float, optional): The threshold for the box detection. Defaults
            to 0.10.
        iou_threshold (float, optional): The threshold for the Intersection over Union
            (IoU). Defaults to 0.10.
    Returns:
        List[Dict[str, Any]]: A list of dictionaries containing the score, label, and
            bounding box of the detected objects with normalized coordinates between 0
            and 1 (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the
            top-left and xmax and ymax are the coordinates of the bottom-right of the
            bounding box.
    Example
    -------
        >>> owl_v2("car. dinosaur", image)
        [
            {'score': 0.99, 'label': 'dinosaur', 'bbox': [0.1, 0.11, 0.35, 0.4]},
            {'score': 0.98, 'label': 'car', 'bbox': [0.2, 0.21, 0.45, 0.5]},
        ]
florencev2_object_detection(image: numpy.ndarray) -> List[Dict[str, Any]]:
    'florencev2_object_detection' is a tool that can detect common objects in an
    image without any text prompt or thresholding. It returns a list of detected objects
    as labels and their location as bounding boxes.
    Parameters:
        image (np.ndarray): The image used to detect objects.
    Returns:
        List[Dict[str, Any]]: A list of dictionaries containing the score, label, and
            bounding box of the detected objects with normalized coordinates between 0
            and 1 (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the
            top-left and xmax and ymax are the coordinates of the bottom-right of the
            bounding box. The scores are always 1.0 and cannot be thresholded.
    Example
    -------
        >>> florencev2_object_detection(image)
        [
            {'score': 1.0, 'label': 'window', 'bbox': [0.1, 0.11, 0.35, 0.4]},
            {'score': 1.0, 'label': 'car', 'bbox': [0.2, 0.21, 0.45, 0.5]},
            {'score': 1.0, 'label': 'person', 'bbox': [0.34, 0.21, 0.85, 0.5]},
        ]
grounding_sam(prompt: str, image: numpy.ndarray, box_threshold: float = 0.2, iou_threshold: float = 0.2) -> List[Dict[str, Any]]:
    'grounding_sam' is a tool that can segment multiple objects given a
    text prompt such as category names or referring expressions. The categories in text
    prompt are separated by commas or periods. It returns a list of bounding boxes,
    label names, mask file names and associated probability scores.
    Parameters:
        prompt (str): The prompt to ground to the image.
        image (np.ndarray): The image to ground the prompt to.
        box_threshold (float, optional): The threshold for the box detection. Defaults
            to 0.20.
        iou_threshold (float, optional): The threshold for the Intersection over Union
            (IoU). Defaults to 0.20.
    Returns:
        List[Dict[str, Any]]: A list of dictionaries containing the score, label,
            bounding box, and mask of the detected objects with normalized coordinates
            (xmin, ymin, xmax, ymax). xmin and ymin are the coordinates of the top-left
            and xmax and ymax are the coordinates of the bottom-right of the bounding box.
            The mask is a binary 2D numpy array where 1 indicates the object and 0 indicates
            the background.
    Example
    -------
        >>> grounding_sam("car. dinosaur", image)
        [
            {
                'score': 0.99,
                'label': 'dinosaur',
                'bbox': [0.1, 0.11, 0.35, 0.4],
                'mask': array([[0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0],
                    ...,
                    [0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
            },
        ]
detr_segmentation(image: numpy.ndarray) -> List[Dict[str, Any]]:
    'detr_segmentation' is a tool that can segment common objects in an
    image without any text prompt. It returns a list of detected objects
    as labels, their regions as masks and their scores.
    Parameters:
        image (np.ndarray): The image used to segment things and objects.
    Returns:
        List[Dict[str, Any]]: A list of dictionaries containing the score, label
            and mask of the detected objects. The mask is a binary 2D numpy array where 1
            indicates the object and 0 indicates the background.
    Example
    -------
        >>> detr_segmentation(image)
        [
            {
                'score': 0.45,
                'label': 'window',
                'mask': array([[0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0],
                    ...,
                    [0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
            },
            {
                'score': 0.70,
                'label': 'bird',
                'mask': array([[0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0],
                    ...,
                    [0, 0, 0, ..., 0, 0, 0],
                    [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
            },
        ]
overlay_bounding_boxes(image: numpy.ndarray, bboxes: List[Dict[str, Any]]) -> numpy.ndarray:
    'overlay_bounding_boxes' is a utility function that displays bounding boxes on
    an image.
    Parameters:
        image (np.ndarray): The image to display the bounding boxes on.
        bboxes (List[Dict[str, Any]]): A list of dictionaries containing the bounding
            boxes.
    Returns:
        np.ndarray: The image with the bounding boxes, labels and scores displayed.
    Example
    -------
        >>> image_with_bboxes = overlay_bounding_boxes(
                image, [{'score': 0.99, 'label': 'dinosaur', 'bbox': [0.1, 0.11, 0.35, 0.4]}],
            )
overlay_heat_map(image: numpy.ndarray, heat_map: Dict[str, Any], alpha: float = 0.8) -> numpy.ndarray:
    'overlay_heat_map' is a utility function that displays a heat map on an image.
    Parameters:
        image (np.ndarray): The image to display the heat map on.
        heat_map (Dict[str, Any]): A dictionary containing the heat map under the key
            'heat_map'.
        alpha (float, optional): The transparency of the overlay. Defaults to 0.8.
    Returns:
        np.ndarray: The image with the heat map displayed.
    Example
    -------
        >>> image_with_heat_map = overlay_heat_map(
                image,
                {
                    'heat_map': array([[0, 0, 0, ..., 0, 0, 0],
                        [0, 0, 0, ..., 0, 0, 0],
                        ...,
                        [0, 0, 0, ..., 0, 0, 0],
                        [0, 0, 0, ..., 125, 125, 125]], dtype=uint8),
                },
            )
overlay_segmentation_masks(image: numpy.ndarray, masks: List[Dict[str, Any]]) -> numpy.ndarray:
    'overlay_segmentation_masks' is a utility function that displays segmentation
    masks.
    Parameters:
        image (np.ndarray): The image to display the masks on.
        masks (List[Dict[str, Any]]): A list of dictionaries containing the masks.
    Returns:
        np.ndarray: The image with the masks displayed.
    Example
    -------
        >>> image_with_masks = overlay_segmentation_masks(
                image,
                [{
                    'score': 0.99,
                    'label': 'dinosaur',
                    'mask': array([[0, 0, 0, ..., 0, 0, 0],
                        [0, 0, 0, ..., 0, 0, 0],
                        ...,
                        [0, 0, 0, ..., 0, 0, 0],
                        [0, 0, 0, ..., 0, 0, 0]], dtype=uint8),
                }],
            )
```
## Vision Agent Tools - model summary

- Any mistakes in the following table are mine -- it reflects some quick reverse engineering to identify the target models.
| Model Name | Hugging Face Model | Primary Function | Use Cases |
|---|---|---|---|
| OWL-ViT v2 | google/owlv2-base-patch16-ensemble | Object detection and localization | - Open-world object detection<br>- Locating specific objects based on text prompts |
| Florence-2 | microsoft/florence-base | Multi-purpose vision tasks | - Image captioning<br>- Visual question answering<br>- Object detection |
| Depth Anything V2 | LiheYoung/depth-anything-v2-small | Depth estimation | - Estimating depth in images<br>- Generating depth maps |
| CLIP | openai/clip-vit-base-patch32 | Image-text similarity | - Zero-shot image classification<br>- Image-text matching |
| BLIP | Salesforce/blip-image-captioning-base | Image captioning | - Generating text descriptions of images |
| LOCA | Custom implementation | Object counting | - Zero-shot object counting<br>- Object counting with visual prompts |
| GIT v2 | microsoft/git-base-vqav2 | Visual question answering and image captioning | - Answering questions about image content<br>- Generating text descriptions of images |
| Grounding DINO | groundingdino/groundingdino-swint-ogc | Object detection and localization | - Detecting objects based on text prompts |
| SAM | facebook/sam-vit-huge | Instance segmentation | - Text-prompted instance segmentation |
| DETR | facebook/detr-resnet-50 | Object detection | - General object detection |
| ViT | google/vit-base-patch16-224 | Image classification | - General image classification<br>- NSFW content detection |
| DPT | Intel/dpt-hybrid-midas | Monocular depth estimation | - Estimating depth from single images |
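
The checkpoints above are my best guesses, but one way to spot-check an entry is to run it directly through the Hugging Face `transformers` pipeline. A minimal sketch for the OWL-ViT v2 row follows (the image path is illustrative):

```python
# Spot-check the OWL-ViT v2 guess with a zero-shot object detection pipeline.
from transformers import pipeline

detector = pipeline(
    "zero-shot-object-detection",
    model="google/owlv2-base-patch16-ensemble",
)
results = detector("path/to/scan.jpg", candidate_labels=["brain tumor"])
# Each result carries a score, label, and pixel-space (not normalized) box.
for r in results:
    print(r["score"], r["label"], r["box"])
```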