Wanli committed
Commit · 50fc340
1 Parent(s): 5de3983

Update handpose estimation model from MediaPipe (2023feb) (#133)

* update handpose model
* update quantize model
* fix quantize path
* update readme of quantization and benchmark result
* fix document

Files changed:
- README.md +2 -2
- benchmark/config/handpose_estimation_mediapipe.yaml +1 -1
- models/handpose_estimation_mediapipe/README.md +9 -4
- models/handpose_estimation_mediapipe/demo.py +101 -39
- models/handpose_estimation_mediapipe/mp_handpose.py +17 -7
- models/palm_detection_mediapipe/README.md +3 -0
- tools/quantize/README.md +1 -1
- tools/quantize/quantize-ort.py +7 -2
- tools/quantize/transform.py +70 -1
README.md CHANGED

@@ -35,7 +35,7 @@ Guidelines:
 | [DaSiamRPN](./models/object_tracking_dasiamrpn) | Object Tracking | 1280x720 | 36.15 | 705.48 | 76.82 | --- | --- |
 | [YoutuReID](./models/person_reid_youtureid) | Person Re-Identification | 128x256 | 35.81 | 521.98 | 90.07 | 44.61 | --- |
 | [MP-PalmDet](./models/palm_detection_mediapipe) | Palm Detection | 192x192 | 11.09 | 63.79 | 83.20 | 33.81 | --- |
-| [MP-HandPose](./models/handpose_estimation_mediapipe) | Hand Pose Estimation | …
+| [MP-HandPose](./models/handpose_estimation_mediapipe) | Hand Pose Estimation | 224x224 | 4.28 | 36.19 | 40.10 | 19.47 | --- |
 
 \*: Models are quantized in per-channel mode, which run slower than per-tensor quantized models on NPU.

@@ -91,7 +91,7 @@ Some examples are listed below. You can find more in the directory of each model
 
 ### Hand Pose Estimation with [MP-HandPose](models/handpose_estimation_mediapipe/)
 
-![…](…)
+![…](…)
 
 ### QR Code Detection and Parsing with [WeChatQRCode](./models/qrcode_wechatqrcode/)
 
benchmark/config/handpose_estimation_mediapipe.yaml CHANGED

@@ -5,7 +5,7 @@ Benchmark:
   path: "data/palm_detection_20230125"
   files: ["palm1.jpg", "palm2.jpg", "palm3.jpg"]
   sizes: # [[w1, h1], ...], Omit to run at original scale
-    - […]
+    - [224, 224]
 metric:
   warmup: 30
   repeat: 10
models/handpose_estimation_mediapipe/README.md CHANGED

@@ -4,11 +4,14 @@ This model estimates 21 hand keypoints per detected hand from [palm detector](…
 
 ![…](…)
 
-This model is converted from …
-…
-  - tf_saved_model to ONNX: https://github.com/onnx/tensorflow-onnx
+This model is converted from TFLite to ONNX using the following tools:
+- TFLite model to ONNX: https://github.com/onnx/tensorflow-onnx
 - simplified by [onnx-simplifier](https://github.com/daquexian/onnx-simplifier)
 
+**Note**:
+- The int8-quantized model may produce invalid results due to a significant drop in accuracy.
+- Visit https://google.github.io/mediapipe/solutions/models.html#hands for models of larger scale.
+
 ## Demo
 
 Run the following commands to try the demo:

@@ -21,7 +24,7 @@ python demo.py -i /path/to/image
 
 ### Example outputs
 
-![…](…)
+![…](…)
 
 ## License
 

@@ … @@
 ## Reference
 
 - MediaPipe Handpose: https://github.com/tensorflow/tfjs-models/tree/master/handpose
+- MediaPipe hands model and model card: https://google.github.io/mediapipe/solutions/models.html#hands
+- Int8 model quantized with rgb evaluation set of FreiHAND: https://lmb.informatik.uni-freiburg.de/resources/datasets/FreihandDataset.en.html
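For anyone reproducing the conversion named in the README above, here is a minimal sketch of the two steps. The `.tflite` file name, the intermediate file name, and the opset are assumptions for illustration; only the tools themselves (tf2onnx and onnx-simplifier) come from the README.

```python
# Sketch: TFLite -> ONNX with tf2onnx, then simplify with onnx-simplifier.
# Requires: pip install tf2onnx onnx onnxsim
import subprocess

import onnx
from onnxsim import simplify

# Step 1: convert the TFLite model to ONNX (file names and opset are assumptions).
subprocess.run([
    "python", "-m", "tf2onnx.convert",
    "--tflite", "hand_landmark.tflite",           # assumed input file name
    "--output", "handpose_estimation_raw.onnx",   # assumed intermediate name
    "--opset", "13",                              # assumed opset
], check=True)

# Step 2: simplify the exported graph and save it under the name used in this commit.
model = onnx.load("handpose_estimation_raw.onnx")
simplified, ok = simplify(model)
assert ok, "onnx-simplifier could not validate the simplified model"
onnx.save(simplified, "handpose_estimation_mediapipe_2023feb.onnx")
```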
models/handpose_estimation_mediapipe/demo.py CHANGED

@@ -31,69 +31,126 @@ except:
 
 parser = argparse.ArgumentParser(description='Hand Pose Estimation from MediaPipe')
 parser.add_argument('--input', '-i', type=str, help='Path to the input image. Omit for using default camera.')
-parser.add_argument('--model', '-m', type=str, default='./…
+parser.add_argument('--model', '-m', type=str, default='./handpose_estimation_mediapipe_2023feb.onnx', help='Path to the model.')
 parser.add_argument('--backend', '-b', type=int, default=backends[0], help=help_msg_backends.format(*backends))
 parser.add_argument('--target', '-t', type=int, default=targets[0], help=help_msg_targets.format(*targets))
-parser.add_argument('--conf_threshold', type=float, default=0.…
+parser.add_argument('--conf_threshold', type=float, default=0.9, help='Filter out hands of confidence < conf_threshold.')
 parser.add_argument('--save', '-s', type=str, default=False, help='Set true to save results. This flag is invalid when using camera.')
 parser.add_argument('--vis', '-v', type=str2bool, default=True, help='Set true to open a window for result visualization. This flag is invalid when using camera.')
 args = parser.parse_args()
 
 
 def visualize(image, hands, print_result=False):
-    …
+    display_screen = image.copy()
+    display_3d = np.zeros((400, 400, 3), np.uint8)
+    cv.line(display_3d, (200, 0), (200, 400), (255, 255, 255), 2)
+    cv.line(display_3d, (0, 200), (400, 200), (255, 255, 255), 2)
+    cv.putText(display_3d, 'Main View', (0, 12), cv.FONT_HERSHEY_DUPLEX, 0.5, (0, 0, 255))
+    cv.putText(display_3d, 'Top View', (200, 12), cv.FONT_HERSHEY_DUPLEX, 0.5, (0, 0, 255))
+    cv.putText(display_3d, 'Left View', (0, 212), cv.FONT_HERSHEY_DUPLEX, 0.5, (0, 0, 255))
+    cv.putText(display_3d, 'Right View', (200, 212), cv.FONT_HERSHEY_DUPLEX, 0.5, (0, 0, 255))
+    is_draw = False # ensure only one hand is drawn
+
+    def draw_lines(image, landmarks, is_draw_point=True, thickness=2):
+        cv.line(image, landmarks[0], landmarks[1], (255, 255, 255), thickness)
+        cv.line(image, landmarks[1], landmarks[2], (255, 255, 255), thickness)
+        cv.line(image, landmarks[2], landmarks[3], (255, 255, 255), thickness)
+        cv.line(image, landmarks[3], landmarks[4], (255, 255, 255), thickness)
+
+        cv.line(image, landmarks[0], landmarks[5], (255, 255, 255), thickness)
+        cv.line(image, landmarks[5], landmarks[6], (255, 255, 255), thickness)
+        cv.line(image, landmarks[6], landmarks[7], (255, 255, 255), thickness)
+        cv.line(image, landmarks[7], landmarks[8], (255, 255, 255), thickness)
+
+        cv.line(image, landmarks[0], landmarks[9], (255, 255, 255), thickness)
+        cv.line(image, landmarks[9], landmarks[10], (255, 255, 255), thickness)
+        cv.line(image, landmarks[10], landmarks[11], (255, 255, 255), thickness)
+        cv.line(image, landmarks[11], landmarks[12], (255, 255, 255), thickness)
+
+        cv.line(image, landmarks[0], landmarks[13], (255, 255, 255), thickness)
+        cv.line(image, landmarks[13], landmarks[14], (255, 255, 255), thickness)
+        cv.line(image, landmarks[14], landmarks[15], (255, 255, 255), thickness)
+        cv.line(image, landmarks[15], landmarks[16], (255, 255, 255), thickness)
+
+        cv.line(image, landmarks[0], landmarks[17], (255, 255, 255), thickness)
+        cv.line(image, landmarks[17], landmarks[18], (255, 255, 255), thickness)
+        cv.line(image, landmarks[18], landmarks[19], (255, 255, 255), thickness)
+        cv.line(image, landmarks[19], landmarks[20], (255, 255, 255), thickness)
+
+        if is_draw_point:
+            for p in landmarks:
+                cv.circle(image, p, thickness, (0, 0, 255), -1)
 
     for idx, handpose in enumerate(hands):
         conf = handpose[-1]
         bbox = handpose[0:4].astype(np.int32)
-        …
+        handedness = handpose[-2]
+        if handedness <= 0.5:
+            handedness_text = 'Left'
+        else:
+            handedness_text = 'Right'
+        landmarks_screen = handpose[4:67].reshape(21, 3).astype(np.int32)
+        landmarks_word = handpose[67:130].reshape(21, 3)
 
         # Print results
         if print_result:
             print('-----------hand {}-----------'.format(idx + 1))
             print('conf: {:.2f}'.format(conf))
+            print('handedness: {}'.format(handedness_text))
             print('hand box: {}'.format(bbox))
             print('hand landmarks: ')
-            for l in …
+            for l in landmarks_screen:
+                print('\t{}'.format(l))
+            print('hand world landmarks: ')
+            for l in landmarks_word:
                 print('\t{}'.format(l))
 
+        # draw box
+        cv.rectangle(display_screen, (bbox[0], bbox[1]), (bbox[2], bbox[3]), (0, 255, 0), 2)
+        # draw handedness
+        cv.putText(display_screen, '{}'.format(handedness_text), (bbox[0], bbox[1] + 12), cv.FONT_HERSHEY_DUPLEX, 0.5, (0, 0, 255))
         # Draw line between each key points
-        …
+        landmarks_xy = landmarks_screen[:, 0:2]
+        draw_lines(display_screen, landmarks_xy, is_draw_point=False)
+
+        # z value is relative to WRIST
+        for p in landmarks_screen:
+            r = max(5 - p[2] // 5, 0)
+            r = min(r, 14)
+            cv.circle(display_screen, np.array([p[0], p[1]]), r, (0, 0, 255), -1)
+
+        if is_draw is False:
+            is_draw = True
+            # Main view
+            landmarks_xy = landmarks_word[:, [0, 1]]
+            landmarks_xy = (landmarks_xy * 1000 + 100).astype(np.int32)
+            draw_lines(display_3d, landmarks_xy, thickness=5)
+
+            # Top view
+            landmarks_xz = landmarks_word[:, [0, 2]]
+            landmarks_xz[:, 1] = -landmarks_xz[:, 1]
+            landmarks_xz = (landmarks_xz * 1000 + np.array([300, 100])).astype(np.int32)
+            draw_lines(display_3d, landmarks_xz, thickness=5)
+
+            # Left view
+            landmarks_yz = landmarks_word[:, [2, 1]]
+            landmarks_yz[:, 0] = -landmarks_yz[:, 0]
+            landmarks_yz = (landmarks_yz * 1000 + np.array([100, 300])).astype(np.int32)
+            draw_lines(display_3d, landmarks_yz, thickness=5)
+
+            # Right view
+            landmarks_zy = landmarks_word[:, [2, 1]]
+            landmarks_zy = (landmarks_zy * 1000 + np.array([300, 300])).astype(np.int32)
+            draw_lines(display_3d, landmarks_zy, thickness=5)
+
+    return display_screen, display_3d
 
 
 if __name__ == '__main__':
     # palm detector
     palm_detector = MPPalmDet(modelPath='../palm_detection_mediapipe/palm_detection_mediapipe_2023feb.onnx',
                               nmsThreshold=0.3,
-                              scoreThreshold=0.…
+                              scoreThreshold=0.6,
                               backendId=args.backend,
                               targetId=args.target)
     # handpose detector

@@ -108,7 +165,7 @@ if __name__ == '__main__':
 
         # Palm detector inference
         palms = palm_detector.infer(image)
-        hands = np.empty(shape=(0, …
+        hands = np.empty(shape=(0, 132))
 
         # Estimate the pose of each hand
         for palm in palms:

@@ -117,10 +174,12 @@ if __name__ == '__main__':
             if handpose is not None:
                 hands = np.vstack((hands, handpose))
         # Draw results on the input image
-        image = visualize(image, hands, True)
+        image, view_3d = visualize(image, hands, True)
 
         if len(palms) == 0:
             print('No palm detected!')
+        else:
+            print('Palm detected!')
 
         # Save results
         if args.save:

@@ -131,6 +190,7 @@ if __name__ == '__main__':
         if args.vis:
             cv.namedWindow(args.input, cv.WINDOW_AUTOSIZE)
             cv.imshow(args.input, image)
+            cv.imshow('3D HandPose Demo', view_3d)
             cv.waitKey(0)
     else: # Omit input to call default camera
         deviceId = 0

@@ -145,7 +205,7 @@ if __name__ == '__main__':
 
             # Palm detector inference
             palms = palm_detector.infer(frame)
-            hands = np.empty(shape=(0, …
+            hands = np.empty(shape=(0, 132))
 
             tm.start()
             # Estimate the pose of each hand

@@ -156,12 +216,14 @@ if __name__ == '__main__':
                     hands = np.vstack((hands, handpose))
             tm.stop()
             # Draw results on the input image
-            frame = visualize(frame, hands)
+            frame, view_3d = visualize(frame, hands)
 
             if len(palms) == 0:
                 print('No palm detected!')
             else:
+                print('Palm detected!')
                 cv.putText(frame, 'FPS: {:.2f}'.format(tm.getFPS()), (0, 15), cv.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 255))
 
             cv.imshow('MediaPipe Handpose Detection Demo', frame)
+            cv.imshow('3D HandPose Demo', view_3d)
             tm.reset()
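A note on the four-quadrant view built in `visualize` above: the world landmarks are metric, so the demo scales them by 1000 and shifts each projection into its own 200x200 quadrant of the 400x400 canvas (offsets 100 and 300). A self-contained sketch of that mapping with synthetic values; the landmark numbers below are illustrative only, not model output.

```python
import numpy as np

# Synthetic 21x3 "world" landmarks in metres, centred near zero (illustrative only).
rng = np.random.default_rng(0)
landmarks_word = rng.uniform(-0.05, 0.05, size=(21, 3))

# Main view (x-y plane) -> top-left quadrant, mirroring the demo's `* 1000 + 100`.
main_view = (landmarks_word[:, [0, 1]] * 1000 + 100).astype(np.int32)

# Top view (x-z plane) -> top-right quadrant; z is negated before shifting, as in the demo.
top_view = landmarks_word[:, [0, 2]].copy()
top_view[:, 1] = -top_view[:, 1]
top_view = (top_view * 1000 + np.array([300, 100])).astype(np.int32)

# With inputs in +/-0.05 m, both projections land inside their 200x200 quadrants.
print(main_view.min(axis=0), main_view.max(axis=0))  # both columns stay within [50, 150]
print(top_view.min(axis=0), top_view.max(axis=0))    # x within [250, 350], y within [50, 150]
```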
models/handpose_estimation_mediapipe/mp_handpose.py CHANGED

@@ -9,7 +9,7 @@ class MPHandPose:
         self.backend_id = backendId
         self.target_id = targetId
 
-        self.input_size = np.array([…
+        self.input_size = np.array([224, 224]) # wh
         self.PALM_LANDMARK_IDS = [0, 5, 9, 13, 17, 1, 2]
         self.PALM_LANDMARKS_INDEX_OF_PALM_BASE = 0
         self.PALM_LANDMARKS_INDEX_OF_MIDDLE_FINGER_BASE = 2

@@ -115,20 +115,25 @@ class MPHandPose:
         return results # [bbox_coords, landmarks_coords, conf]
 
     def _postprocess(self, blob, rotated_palm_bbox, angle, rotation_matrix):
-        landmarks, conf = blob
+        landmarks, conf, handedness, landmarks_word = blob
 
+        conf = conf[0][0]
         if conf < self.conf_threshold:
             return None
 
-        landmarks = landmarks.reshape(-1, 3) # shape: (1, 63) -> (21, 3)
+        landmarks = landmarks[0].reshape(-1, 3) # shape: (1, 63) -> (21, 3)
+        landmarks_word = landmarks_word[0].reshape(-1, 3) # shape: (1, 63) -> (21, 3)
 
         # transform coords back to the input coords
         wh_rotated_palm_bbox = rotated_palm_bbox[1] - rotated_palm_bbox[0]
         scale_factor = wh_rotated_palm_bbox / self.input_size
         landmarks[:, :2] = (landmarks[:, :2] - self.input_size / 2) * scale_factor
+        landmarks[:, 2] = landmarks[:, 2] * max(scale_factor) # depth scaling
         coords_rotation_matrix = cv.getRotationMatrix2D((0, 0), angle, 1.0)
         rotated_landmarks = np.dot(landmarks[:, :2], coords_rotation_matrix[:, :2])
         rotated_landmarks = np.c_[rotated_landmarks, landmarks[:, 2]]
+        rotated_landmarks_world = np.dot(landmarks_word[:, :2], coords_rotation_matrix[:, :2])
+        rotated_landmarks_world = np.c_[rotated_landmarks_world, landmarks_word[:, 2]]
         # invert rotation
         rotation_component = np.array([
             [rotation_matrix[0][0], rotation_matrix[1][0]],

@@ -144,12 +149,12 @@ class MPHandPose:
         original_center = np.array([
             np.dot(center, inverse_rotation_matrix[0]),
             np.dot(center, inverse_rotation_matrix[1])])
-        landmarks = rotated_landmarks[:, :2] + original_center
+        landmarks[:, :2] = rotated_landmarks[:, :2] + original_center
 
         # get bounding box from rotated_landmarks
         bbox = np.array([
-            np.amin(landmarks, axis=0),
-            np.amax(landmarks, axis=0)]) # [top-left, bottom-right]
+            np.amin(landmarks[:, :2], axis=0),
+            np.amax(landmarks[:, :2], axis=0)]) # [top-left, bottom-right]
         # shift bounding box
         wh_bbox = bbox[1] - bbox[0]
         shift_vector = self.HAND_BOX_SHIFT_VECTOR * wh_bbox

@@ -162,4 +167,9 @@ class MPHandPose:
             center_bbox - new_half_size,
             center_bbox + new_half_size])
 
-        …
+        # [0: 4]: hand bounding box found in image of format [x1, y1, x2, y2] (top-left and bottom-right points)
+        # [4: 67]: screen landmarks with format [x1, y1, z1, x2, y2 ... x21, y21, z21], z value is relative to WRIST
+        # [67: 130]: world landmarks with format [x1, y1, z1, x2, y2 ... x21, y21, z21], 3D metric x, y, z coordinate
+        # [130]: handedness, (left)[0, 1](right) hand
+        # [131]: confidence
+        return np.r_[bbox.reshape(-1), landmarks.reshape(-1), rotated_landmarks_world.reshape(-1), handedness[0][0], conf]
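The comment block above documents the 132-element row returned by `_postprocess` and stacked into `hands` by demo.py. A small sketch of slicing such a row back into named fields; the helper below is illustrative, not part of this commit.

```python
import numpy as np

def unpack_hand(row):
    """Split one 132-element result row into named fields, following the layout
    documented at the end of MPHandPose._postprocess."""
    row = np.asarray(row, dtype=np.float32)
    assert row.shape == (132,)
    return {
        'bbox': row[0:4].reshape(2, 2),                # [[x1, y1], [x2, y2]]
        'landmarks_screen': row[4:67].reshape(21, 3),  # x, y in image pixels; z relative to WRIST
        'landmarks_world': row[67:130].reshape(21, 3), # metric 3D coordinates
        'handedness': float(row[130]),                 # <= 0.5 reads as Left, > 0.5 as Right in demo.py
        'conf': float(row[131]),
    }

# Dummy row just to show the shapes; real rows come from the estimator's inference output.
fields = unpack_hand(np.zeros(132, dtype=np.float32))
print(fields['bbox'].shape, fields['landmarks_screen'].shape, fields['landmarks_world'].shape)
```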
models/palm_detection_mediapipe/README.md CHANGED

@@ -7,6 +7,9 @@ This model detects palm bounding boxes and palm landmarks, and is converted from …
 - SSD Anchors are generated from [GenMediaPipePalmDectionSSDAnchors](https://github.com/VimalMollyn/GenMediaPipePalmDectionSSDAnchors)
 
 
+**Note**:
+- Visit https://google.github.io/mediapipe/solutions/models.html#hands for models of larger scale.
+
 ## Demo
 
 Run the following commands to try the demo:
tools/quantize/README.md CHANGED

@@ -54,4 +54,4 @@ python quantize-inc.py model1
 
 ## Dataset
 Some models are quantized with extra datasets.
-- [MP-PalmDet](../../models/palm_detection_mediapipe) …
+- [MP-PalmDet](../../models/palm_detection_mediapipe) and [MP-HandPose](../../models/handpose_estimation_mediapipe) are quantized with the evaluation set of [FreiHAND](https://lmb.informatik.uni-freiburg.de/resources/datasets/FreihandDataset.en.html). Download the dataset from [this link](https://lmb.informatik.uni-freiburg.de/data/freihand/FreiHAND_pub_v2_eval.zip). Unpack it and replace `path/to/dataset` with the path to `FreiHAND_pub_v2_eval/evaluation/rgb`.
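A quick way to check the unpacked dataset before pointing the quantization script at it; the directory below is a hypothetical unpack location, not part of the repository.

```python
# Sketch: sanity-check the FreiHAND calibration folder (hypothetical path).
import os

calib_dir = '/data/FreiHAND_pub_v2_eval/evaluation/rgb'  # replace with your unpack location
images = [f for f in os.listdir(calib_dir) if f.lower().endswith(('.jpg', '.png'))]
print('{} calibration images found in {}'.format(len(images), calib_dir))
# This is the directory that replaces 'path/to/dataset' in tools/quantize/quantize-ort.py.
```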
tools/quantize/quantize-ort.py CHANGED

@@ -14,7 +14,7 @@ from onnx import version_converter
 import onnxruntime
 from onnxruntime.quantization import quantize_static, CalibrationDataReader, QuantType, QuantFormat
 
-from transform import Compose, Resize, CenterCrop, Normalize, ColorConvert
+from transform import Compose, Resize, CenterCrop, Normalize, ColorConvert, HandAlign
 
 class DataReader(CalibrationDataReader):
     def __init__(self, model_path, image_dir, transforms, data_dim):

@@ -37,6 +37,8 @@ class DataReader(CalibrationDataReader):
                 continue
             img = cv.imread(os.path.join(image_dir, image_name))
             img = self.transforms(img)
+            if img is None:
+                continue
             blob = cv.dnn.blobFromImage(img)
             if self.data_dim == 'hwc':
                 blob = cv.transposeND(blob, [0, 2, 3, 1])

@@ -110,7 +112,10 @@ models=dict(
                        calibration_image_dir='path/to/dataset',
                        transforms=Compose([Resize(size=(192, 192)), Normalize(std=[255, 255, 255]),
                                            ColorConvert(ctype=cv.COLOR_BGR2RGB)]), data_dim='hwc'),
-    …
+    mp_handpose=Quantize(model_path='../../models/handpose_estimation_mediapipe/handpose_estimation_mediapipe_2023feb.onnx',
+                         calibration_image_dir='path/to/dataset',
+                         transforms=Compose([HandAlign("mp_handpose"), Resize(size=(224, 224)), Normalize(std=[255, 255, 255]),
+                                             ColorConvert(ctype=cv.COLOR_BGR2RGB)]), data_dim='hwc'),
 )
 
 if __name__ == '__main__':
tools/quantize/transform.py CHANGED

@@ -5,8 +5,9 @@
 # Third party copyrights are property of their respective owners.
 
 import collections
-import numpy as …
+import numpy as np
 import cv2 as cv
+import sys
 
 class Compose:
     def __init__(self, transforms=[]):

@@ -15,6 +16,8 @@ class Compose:
     def __call__(self, img):
         for t in self.transforms:
             img = t(img)
+            if img is None:
+                break
         return img
 
 class Resize:

@@ -58,3 +61,69 @@ class ColorConvert:
 
     def __call__(self, img):
         return cv.cvtColor(img, self.ctype)
+
+class HandAlign:
+    def __init__(self, model):
+        self.model = model
+        sys.path.append('../../models/palm_detection_mediapipe')
+        from mp_palmdet import MPPalmDet
+        self.palm_detector = MPPalmDet(modelPath='../../models/palm_detection_mediapipe/palm_detection_mediapipe_2023feb.onnx', nmsThreshold=0.3, scoreThreshold=0.9)
+
+    def __call__(self, img):
+        return self.mp_handpose_align(img)
+
+    def mp_handpose_align(self, img):
+        palms = self.palm_detector.infer(img)
+        if len(palms) == 0:
+            return None
+        palm = palms[0]
+        palm_bbox = palm[0:4].reshape(2, 2)
+        palm_landmarks = palm[4:18].reshape(7, 2)
+        p1 = palm_landmarks[0]
+        p2 = palm_landmarks[2]
+        radians = np.pi / 2 - np.arctan2(-(p2[1] - p1[1]), p2[0] - p1[0])
+        radians = radians - 2 * np.pi * np.floor((radians + np.pi) / (2 * np.pi))
+        angle = np.rad2deg(radians)
+        # get bbox center
+        center_palm_bbox = np.sum(palm_bbox, axis=0) / 2
+        # get rotation matrix
+        rotation_matrix = cv.getRotationMatrix2D(center_palm_bbox, angle, 1.0)
+        # get rotated image
+        rotated_image = cv.warpAffine(img, rotation_matrix, (img.shape[1], img.shape[0]))
+        # get bounding boxes from rotated palm landmarks
+        homogeneous_coord = np.c_[palm_landmarks, np.ones(palm_landmarks.shape[0])]
+        rotated_palm_landmarks = np.array([
+            np.dot(homogeneous_coord, rotation_matrix[0]),
+            np.dot(homogeneous_coord, rotation_matrix[1])])
+        # get landmark bounding box
+        rotated_palm_bbox = np.array([
+            np.amin(rotated_palm_landmarks, axis=1),
+            np.amax(rotated_palm_landmarks, axis=1)]) # [top-left, bottom-right]
+
+        # shift bounding box
+        wh_rotated_palm_bbox = rotated_palm_bbox[1] - rotated_palm_bbox[0]
+        shift_vector = [0, -0.1] * wh_rotated_palm_bbox
+        rotated_palm_bbox = rotated_palm_bbox + shift_vector
+        # squarify bounding box
+        center_rotated_plam_bbox = np.sum(rotated_palm_bbox, axis=0) / 2
+        wh_rotated_palm_bbox = rotated_palm_bbox[1] - rotated_palm_bbox[0]
+        new_half_size = np.amax(wh_rotated_palm_bbox) / 2
+        rotated_palm_bbox = np.array([
+            center_rotated_plam_bbox - new_half_size,
+            center_rotated_plam_bbox + new_half_size])
+
+        # enlarge bounding box
+        center_rotated_plam_bbox = np.sum(rotated_palm_bbox, axis=0) / 2
+        wh_rotated_palm_bbox = rotated_palm_bbox[1] - rotated_palm_bbox[0]
+        new_half_size = wh_rotated_palm_bbox * 1.5
+        rotated_palm_bbox = np.array([
+            center_rotated_plam_bbox - new_half_size,
+            center_rotated_plam_bbox + new_half_size])
+
+        # Crop the rotated image by the bounding box
+        [[x1, y1], [x2, y2]] = rotated_palm_bbox.astype(np.int32)
+        diff = np.maximum([-x1, -y1, x2 - rotated_image.shape[1], y2 - rotated_image.shape[0]], 0)
+        [x1, y1, x2, y2] = [x1, y1, x2, y2] + diff
+        crop = rotated_image[y1:y2, x1:x2, :]
+        crop = cv.copyMakeBorder(crop, diff[1], diff[3], diff[0], diff[2], cv.BORDER_CONSTANT, value=(0, 0, 0))
+        return crop
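For reference, a short usage sketch of the new transform chain, mirroring the `mp_handpose` entry in quantize-ort.py. It assumes the script is run from `tools/quantize` (so HandAlign's relative palm-detector paths resolve) and that the image path exists; both are assumptions, not part of this commit.

```python
# Sketch: run the calibration preprocessing by hand on a single image.
import cv2 as cv

from transform import ColorConvert, Compose, HandAlign, Normalize, Resize

transforms = Compose([HandAlign("mp_handpose"),
                      Resize(size=(224, 224)),
                      Normalize(std=[255, 255, 255]),
                      ColorConvert(ctype=cv.COLOR_BGR2RGB)])

img = cv.imread('/data/FreiHAND_pub_v2_eval/evaluation/rgb/00000000.jpg')  # hypothetical path
sample = transforms(img)
if sample is None:
    # HandAlign returns None when no palm is found; Compose then stops early,
    # and DataReader in quantize-ort.py skips the image.
    print('no palm detected, image skipped')
else:
    print('calibration sample shape:', sample.shape)  # expected (224, 224, 3)
```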