leonardlin committed on
Commit 867401e · 1 Parent(s): aeb3812

Gate ROCm grouped_gemm hipBLASLt behind env flag

_dev/TODO-gg-linter.md CHANGED
@@ -96,7 +96,7 @@ Both scripts consistently demonstrate:
  - ✅ **Fix implemented** — `_allocate_output` now returns a zeroed tensor
  - ✅ **Reproduction cases clean** — `_dev/debug-gg-small.py` and `_dev/debug-tensor-copy.py` match the Python reference
  - ✅ **hipify behavior understood** — edit `.cu`, not `.hip`, or adjust the build pipeline if we need custom HIP-only changes
- - ⚠️ **hipBLASLt path unsuitable** — re-enabling hipBLASLt caused HIP memory access faults on the large expert setups from `tests/ops_test.py`, so we reverted to the cleaned-up FP32 fallback for stability.
+ - ⚠️ **hipBLASLt path experimental** — enabling hipBLASLt via `MEGABLOCKS_GG_USE_HIPBLASLT=1` still triggers HIP memory access faults on the large expert setups from `tests/ops_test.py`. Leave the flag off for production; use the FP32 fallback until the hipBLASLt issues are resolved.
 
  ## Files Modified During Investigation
 
_dev/TODO-gg.md CHANGED
@@ -149,7 +149,7 @@ python debug-gg-step-by-step.py # Manual computation verification
  - **Misdiagnosed linter**: The perceived “linter” reverting our HIP edits was actually `hipify` regenerating `csrc/grouped_gemm/grouped_gemm.hip` from the CUDA source each time `build.sh` ran. Any HIP-only tweak has to live in `grouped_gemm.cu` (or we adjust the hipify step) to persist.
  - **Actual corruption cause**: The ROCm fallback path inside `hipblaslt_gmm_internal` accumulates into the output tensor passed from Python. `_allocate_output` in `torch-ext/megablocks/grouped_gemm/backend.py` created that buffer with `torch.empty`, so the accumulation mixed correct products with uninitialised memory, yielding the 10^17–10^25 explosions.
  - **Workaround**: Switching `_allocate_output` to use `torch.zeros` ensures the accumulation starts from a clean slate. After rebuilding, `_dev/debug-gg-small.py` and `_dev/debug-tensor-copy.py` now match the Python reference for all tested expert counts.
- - **hipBLASLt evaluation**: We briefly reinstated the hipBLASLt-backed path, but large expert batches triggered HIP memory access faults and the `run-tests.sh` suite aborted in `tests/ops_test.py`. We therefore kept the FP32 fallback in place for now, but stripped the debug prints and ensured it overwrites (rather than accumulates into) the destination tensor.
+ - **hipBLASLt evaluation**: We briefly reinstated the hipBLASLt-backed path, but large expert batches triggered HIP memory access faults and the `run-tests.sh` suite aborted in `tests/ops_test.py`. We therefore kept the stable FP32 fallback, which overwrites (rather than accumulates into) the destination tensor, as the default, and gated the hipBLASLt path behind the `MEGABLOCKS_GG_USE_HIPBLASLT` env var so we can experiment with it when desired.
  - **Next steps**: Leave the zero-initialisation in place while exploring a higher-performance HIP kernel; if we need HIP-specific logic, implement it in the `.cu` so hipify preserves the change.
 
  ```
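
For context on the `_allocate_output` fix referenced in the notes above, here is a minimal Python sketch. The real helper lives in `torch-ext/megablocks/grouped_gemm/backend.py`; its exact signature and the shape convention below are assumptions, not the committed code.

```python
import torch

def _allocate_output(a, b, batch_sizes, trans_b=False):
    """Sketch of the fix: allocate the output zero-filled, not with torch.empty.

    The ROCm fallback accumulates into this buffer, so handing it
    uninitialised memory is what produced the 10^17-10^25 blow-ups.
    """
    # Output columns come from the expert weights: b is (num_experts, k, n),
    # or (num_experts, n, k) when trans_b is set (shape convention assumed).
    n = b.shape[1] if trans_b else b.shape[2]
    # torch.zeros instead of torch.empty is the substance of the workaround.
    return torch.zeros(a.shape[0], n, device=a.device, dtype=a.dtype)
```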
_dev/TODO-hip.md ADDED
@@ -0,0 +1,19 @@
+ # HIP Grouped GEMM Status (2025-09-18)
+
+ ## Current toggle
+ - Set `MEGABLOCKS_GG_USE_HIPBLASLT=1` to force the ROCm build to run the hipBLASLt backend instead of the FP32 fallback in `hipblaslt_gmm_internal`.
+ - Without the flag the code uses the stable FP32 `torch::matmul` path that overwrites the destination buffer.
+
+ ## What works with hipBLASLt enabled
+ - `_dev/debug-gg-small.py`, `_dev/debug-tensor-copy.py`, and `_dev/debug-gg-detailed.py` finish with finite outputs (differences are within ~1e-3..1e-2 due to BF16).
+ - `python -m pytest tests/test_gg.py -q` passes with the flag set.
+
+ ## Known failures
+ - `PYTHONPATH=build/... MEGABLOCKS_GG_USE_HIPBLASLT=1 python -m pytest tests/ops_test.py -q` aborts with a HIP memory access fault (`Memory access fault by GPU node-2` during `OpsTest.testGroupedGemm_FixedSizes`).
+ - The same failure occurs early when the test suite is run via `run-tests.sh`, so hipBLASLt is not yet production-ready.
+
+ ## Next steps
+ - Reproduce the fault in isolation (likely the large `(z=16, m=128, k=128, n=128)` cases) and inspect the arguments passed into `hipblaslt_run_matmul` (leading dimensions/layout).
+ - Investigate whether hipBLASLt requires column-major layouts or a non-zero workspace to handle the grouped GEMM shapes.
+ - Consider a hybrid strategy: attempt hipBLASLt per expert and fall back to FP32 for shapes that exceed stability thresholds (e.g., by catching `hipblaslt_run_matmul` errors once we can reliably detect them).
+ - Once hipBLASLt is stable, tighten tolerances/grad checks in `tests/test_gg.py` and re-enable the high-performance path by default.
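
As a usage note for the toggle above: the C++ side reads and caches the flag once per process, so it must be set before the first grouped GEMM call. A hedged Python sketch follows; the `gg.gmm` entry point and import path are assumptions about the wrapper under `torch-ext/megablocks/grouped_gemm`, not confirmed API.

```python
import os

# Set the flag before the extension runs any grouped GEMM; the C++ side caches
# it on first use. (Equivalently: MEGABLOCKS_GG_USE_HIPBLASLT=1 python ...)
os.environ["MEGABLOCKS_GG_USE_HIPBLASLT"] = "1"

import torch
from megablocks.grouped_gemm import backend as gg  # hypothetical import path

a = torch.randn(256, 128, device="cuda", dtype=torch.bfloat16)
b = torch.randn(4, 128, 128, device="cuda", dtype=torch.bfloat16)
batch_sizes = torch.tensor([64, 64, 64, 64], dtype=torch.int64)

out = gg.gmm(a, b, batch_sizes)  # hypothetical entry point and signature

# FP32 reference; expect ~1e-3..1e-2 absolute differences from BF16.
ref = torch.cat([chunk.float() @ b[i].float()
                 for i, chunk in enumerate(torch.split(a, batch_sizes.tolist()))])
print((out.float() - ref).abs().max())
```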
csrc/grouped_gemm/grouped_gemm.cu CHANGED
@@ -7,10 +7,35 @@
 #include <hipblaslt/hipblaslt.h>
 #include <torch/autograd.h>
 #include <vector>
+ #include <algorithm>
+ #include <cctype>
+ #include <cstdlib>
+ #include <string>
 
 namespace grouped_gemm {
 namespace {
 
+ // Experimental: toggled via MEGABLOCKS_GG_USE_HIPBLASLT=1. This flag is
+ // intentionally off by default because the hipBLASLt path still fails on the
+ // largest `tests/ops_test.py` configurations.
+ bool use_hipblaslt_backend() {
+   static int cached = [] {
+     const char* raw = std::getenv("MEGABLOCKS_GG_USE_HIPBLASLT");
+     if (raw == nullptr) {
+       return 0;
+     }
+     std::string value(raw);
+     std::transform(value.begin(), value.end(), value.begin(), [](unsigned char c) {
+       return static_cast<char>(std::tolower(c));
+     });
+     if (value == "1" || value == "true" || value == "yes" || value == "on") {
+       return 1;
+     }
+     return 0;
+   }();
+   return cached == 1;
+ }
+
 inline void hipblaslt_check(hipblasStatus_t status, const char* expr) {
   TORCH_CHECK(status == HIPBLAS_STATUS_SUCCESS, "hipBLASLt call failed with status ", status, " when executing ", expr);
 }
@@ -152,6 +177,7 @@ torch::Tensor hipblaslt_gmm_internal(torch::Tensor a,
 
   auto device = a.device();
   auto dtype = a.scalar_type();
+  const bool use_hip = use_hipblaslt_backend();
 
   const auto counts_ptr = batch_sizes.data_ptr<int64_t>();
   const int64_t num_experts = batch_sizes.size(0);
@@ -174,28 +200,64 @@ torch::Tensor hipblaslt_gmm_internal(torch::Tensor a,
 
   auto b_contig = b.contiguous();
 
-  int64_t start = 0;
-  for (int64_t expert = 0; expert < num_experts; ++expert) {
-    const int64_t end = prefix[expert];
-    const int64_t rows = end - start;
-    auto out_chunk = out.select(0, expert);
-    if (rows == 0) {
-      out_chunk.zero_();
-      start = end;
-      continue;
-    }
-
-    auto a_slice = a.narrow(0, start, rows);
-    auto b_slice = b_contig.narrow(0, start, rows);
-
-    auto a_f32 = a_slice.contiguous().to(torch::kFloat32);
-    auto b_f32 = b_slice.contiguous().to(torch::kFloat32);
-
-    auto prod = torch::matmul(a_f32.transpose(0, 1), b_f32);
-    auto prod_bf16 = prod.to(dtype);
-
-    out_chunk.copy_(prod_bf16);
-    start = end;
-  }
+  if (use_hip) {
+    int64_t start = 0;
+    for (int64_t expert = 0; expert < num_experts; ++expert) {
+      const int64_t end = prefix[expert];
+      const int64_t rows = end - start;
+      auto out_chunk = out.select(0, expert);
+      if (rows == 0) {
+        out_chunk.zero_();
+        start = end;
+        continue;
+      }
+
+      auto a_chunk = a.narrow(0, start, rows).contiguous();
+      auto b_chunk = b_contig.narrow(0, start, rows).contiguous();
+
+      hipblaslt_run_matmul(a_chunk.data_ptr(),
+                           b_chunk.data_ptr(),
+                           out_chunk.data_ptr(),
+                           out_chunk.data_ptr(),
+                           rows,
+                           hidden_in,
+                           rows,
+                           hidden_out,
+                           hidden_in,
+                           hidden_out,
+                           hidden_in,
+                           hidden_out,
+                           hidden_out,
+                           hidden_out,
+                           HIPBLAS_OP_T,
+                           HIPBLAS_OP_N,
+                           /*accumulate=*/false);
+      start = end;
+    }
+  } else {
+    int64_t start = 0;
+    for (int64_t expert = 0; expert < num_experts; ++expert) {
+      const int64_t end = prefix[expert];
+      const int64_t rows = end - start;
+      auto out_chunk = out.select(0, expert);
+      if (rows == 0) {
+        out_chunk.zero_();
+        start = end;
+        continue;
+      }
+
+      auto a_slice = a.narrow(0, start, rows);
+      auto b_slice = b_contig.narrow(0, start, rows);
+
+      auto a_f32 = a_slice.contiguous().to(torch::kFloat32);
+      auto b_f32 = b_slice.contiguous().to(torch::kFloat32);
+
+      auto prod = torch::matmul(a_f32.transpose(0, 1), b_f32);
+      auto prod_bf16 = prod.to(dtype);
+
+      out_chunk.copy_(prod_bf16);
+      start = end;
+    }
+  }
   return out;
 }
@@ -208,6 +270,104 @@ torch::Tensor hipblaslt_gmm_internal(torch::Tensor a,
 
   auto b_contig = b.contiguous();
 
+  if (use_hip) {
+    int64_t start = 0;
+    for (int64_t expert = 0; expert < num_experts; ++expert) {
+      const int64_t end = prefix[expert];
+      const int64_t rows = end - start;
+      if (rows == 0) {
+        start = end;
+        continue;
+      }
+      auto a_chunk = a.narrow(0, start, rows).contiguous();
+      auto b_chunk = b_contig.select(0, expert).contiguous();
+      auto out_chunk = out.narrow(0, start, rows);
+
+      hipblaslt_run_matmul(a_chunk.data_ptr(),
+                           b_chunk.data_ptr(),
+                           out_chunk.data_ptr(),
+                           out_chunk.data_ptr(),
+                           rows,
+                           hidden_in,
+                           hidden_out,
+                           hidden_in,
+                           rows,
+                           hidden_out,
+                           hidden_in,
+                           hidden_in,
+                           hidden_out,
+                           hidden_out,
+                           HIPBLAS_OP_N,
+                           HIPBLAS_OP_T,
+                           /*accumulate=*/false);
+      start = end;
+    }
+  } else {
+    int64_t start = 0;
+    for (int64_t expert = 0; expert < num_experts; ++expert) {
+      const int64_t end = prefix[expert];
+      const int64_t rows = end - start;
+      if (rows == 0) {
+        start = end;
+        continue;
+      }
+      auto a_slice = a.narrow(0, start, rows);
+      auto b_slice = b_contig.select(0, expert);
+      auto out_chunk = out.narrow(0, start, rows);
+
+      auto a_f32 = a_slice.contiguous().to(torch::kFloat32);
+      auto b_f32 = b_slice.contiguous().to(torch::kFloat32);
+
+      auto prod = torch::matmul(a_f32, b_f32.transpose(0, 1));
+      auto prod_bf16 = prod.to(dtype);
+
+      out_chunk.copy_(prod_bf16);
+      start = end;
+    }
+  }
+  return out;
+  }
+
+  const int64_t hidden_out = a.size(1);
+  const int64_t hidden_in = b.size(2);
+  out = c_opt.value_or(torch::empty({tokens, hidden_in}, a.options()));
+  TORCH_CHECK(out.is_contiguous(), "Output tensor must be contiguous");
+
+  auto b_contig = b.contiguous();
+
+  if (use_hip) {
+    int64_t start = 0;
+    for (int64_t expert = 0; expert < num_experts; ++expert) {
+      const int64_t end = prefix[expert];
+      const int64_t rows = end - start;
+      if (rows == 0) {
+        start = end;
+        continue;
+      }
+      auto a_chunk = a.narrow(0, start, rows).contiguous();
+      auto b_chunk = b_contig.select(0, expert).contiguous();
+      auto out_chunk = out.narrow(0, start, rows);
+
+      hipblaslt_run_matmul(a_chunk.data_ptr(),
+                           b_chunk.data_ptr(),
+                           out_chunk.data_ptr(),
+                           out_chunk.data_ptr(),
+                           rows,
+                           hidden_out,
+                           hidden_out,
+                           hidden_in,
+                           rows,
+                           hidden_in,
+                           hidden_out,
+                           hidden_in,
+                           hidden_in,
+                           hidden_in,
+                           HIPBLAS_OP_N,
+                           HIPBLAS_OP_N,
+                           /*accumulate=*/false);
+      start = end;
+    }
+  } else {
   int64_t start = 0;
   for (int64_t expert = 0; expert < num_experts; ++expert) {
     const int64_t end = prefix[expert];
@@ -223,42 +383,12 @@ torch::Tensor hipblaslt_gmm_internal(torch::Tensor a,
     auto a_f32 = a_slice.contiguous().to(torch::kFloat32);
     auto b_f32 = b_slice.contiguous().to(torch::kFloat32);
 
-    auto prod = torch::matmul(a_f32, b_f32.transpose(0, 1));
+    auto prod = torch::matmul(a_f32, b_f32);
     auto prod_bf16 = prod.to(dtype);
 
     out_chunk.copy_(prod_bf16);
     start = end;
   }
-  return out;
-  }
-
-  const int64_t hidden_out = a.size(1);
-  const int64_t hidden_in = b.size(2);
-  out = c_opt.value_or(torch::empty({tokens, hidden_in}, a.options()));
-  TORCH_CHECK(out.is_contiguous(), "Output tensor must be contiguous");
-
-  auto b_contig = b.contiguous();
-
-  int64_t start = 0;
-  for (int64_t expert = 0; expert < num_experts; ++expert) {
-    const int64_t end = prefix[expert];
-    const int64_t rows = end - start;
-    if (rows == 0) {
-      start = end;
-      continue;
-    }
-    auto a_slice = a.narrow(0, start, rows);
-    auto b_slice = b_contig.select(0, expert);
-    auto out_chunk = out.narrow(0, start, rows);
-
-    auto a_f32 = a_slice.contiguous().to(torch::kFloat32);
-    auto b_f32 = b_slice.contiguous().to(torch::kFloat32);
-
-    auto prod = torch::matmul(a_f32, b_f32);
-    auto prod_bf16 = prod.to(dtype);
-
-    out_chunk.copy_(prod_bf16);
-    start = end;
   }
   return out;
 }
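
To make the three per-expert branches in `hipblaslt_gmm_internal` easier to follow, here is a hedged pure-PyTorch sketch of what the FP32 fallback computes in each case; the `trans_a`/`trans_b` flags are an assumption about how the branch is selected, and only the per-expert math is taken from the diff above.

```python
import torch

def grouped_gemm_fp32_reference(a, b, batch_sizes, trans_a=False, trans_b=False):
    """Reference for the three FP32 fallback loops (branch selection assumed)."""
    splits = batch_sizes.tolist()
    a_chunks = torch.split(a, splits, dim=0)
    if trans_a:
        # Branch 1: a_e^T @ b_e per expert, stacked into a 3D, weight-shaped
        # output (matches the out.select(0, expert) loop; empty experts give zeros).
        b_chunks = torch.split(b, splits, dim=0)
        return torch.stack([ac.t().float() @ bc.float()
                            for ac, bc in zip(a_chunks, b_chunks)]).to(a.dtype)
    if trans_b:
        # Branch 2: tokens @ b_e^T per expert, concatenated along the token dim.
        return torch.cat([ac.float() @ b[i].float().t()
                          for i, ac in enumerate(a_chunks)]).to(a.dtype)
    # Branch 3: tokens @ b_e per expert.
    return torch.cat([ac.float() @ b[i].float()
                      for i, ac in enumerate(a_chunks)]).to(a.dtype)
```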
csrc/grouped_gemm/grouped_gemm.hip CHANGED
@@ -9,10 +9,32 @@
 #include <hipblaslt/hipblaslt.h>
 #include <torch/autograd.h>
 #include <vector>
+ #include <algorithm>
+ #include <cctype>
+ #include <cstdlib>
+ #include <string>
 
 namespace grouped_gemm {
 namespace {
 
+ bool use_hipblaslt_backend() {
+   static int cached = [] {
+     const char* raw = std::getenv("MEGABLOCKS_GG_USE_HIPBLASLT");
+     if (raw == nullptr) {
+       return 0;
+     }
+     std::string value(raw);
+     std::transform(value.begin(), value.end(), value.begin(), [](unsigned char c) {
+       return static_cast<char>(std::tolower(c));
+     });
+     if (value == "1" || value == "true" || value == "yes" || value == "on") {
+       return 1;
+     }
+     return 0;
+   }();
+   return cached == 1;
+ }
+
 inline void hipblaslt_check(hipblasStatus_t status, const char* expr) {
   TORCH_CHECK(status == HIPBLAS_STATUS_SUCCESS, "hipBLASLt call failed with status ", status, " when executing ", expr);
 }
@@ -154,6 +176,7 @@ torch::Tensor hipblaslt_gmm_internal(torch::Tensor a,
 
   auto device = a.device();
   auto dtype = a.scalar_type();
+  const bool use_hip = use_hipblaslt_backend();
 
   const auto counts_ptr = batch_sizes.data_ptr<int64_t>();
   const int64_t num_experts = batch_sizes.size(0);
@@ -176,28 +199,64 @@ torch::Tensor hipblaslt_gmm_internal(torch::Tensor a,
 
   auto b_contig = b.contiguous();
 
-  int64_t start = 0;
-  for (int64_t expert = 0; expert < num_experts; ++expert) {
-    const int64_t end = prefix[expert];
-    const int64_t rows = end - start;
-    auto out_chunk = out.select(0, expert);
-    if (rows == 0) {
-      out_chunk.zero_();
-      start = end;
-      continue;
-    }
-
-    auto a_slice = a.narrow(0, start, rows);
-    auto b_slice = b_contig.narrow(0, start, rows);
-
-    auto a_f32 = a_slice.contiguous().to(torch::kFloat32);
-    auto b_f32 = b_slice.contiguous().to(torch::kFloat32);
-
-    auto prod = torch::matmul(a_f32.transpose(0, 1), b_f32);
-    auto prod_bf16 = prod.to(dtype);
-
-    out_chunk.copy_(prod_bf16);
-    start = end;
-  }
+  if (use_hip) {
+    int64_t start = 0;
+    for (int64_t expert = 0; expert < num_experts; ++expert) {
+      const int64_t end = prefix[expert];
+      const int64_t rows = end - start;
+      auto out_chunk = out.select(0, expert);
+      if (rows == 0) {
+        out_chunk.zero_();
+        start = end;
+        continue;
+      }
+
+      auto a_chunk = a.narrow(0, start, rows).contiguous();
+      auto b_chunk = b_contig.narrow(0, start, rows).contiguous();
+
+      hipblaslt_run_matmul(a_chunk.data_ptr(),
+                           b_chunk.data_ptr(),
+                           out_chunk.data_ptr(),
+                           out_chunk.data_ptr(),
+                           rows,
+                           hidden_in,
+                           rows,
+                           hidden_out,
+                           hidden_in,
+                           hidden_out,
+                           hidden_in,
+                           hidden_out,
+                           hidden_out,
+                           hidden_out,
+                           HIPBLAS_OP_T,
+                           HIPBLAS_OP_N,
+                           /*accumulate=*/false);
+      start = end;
+    }
+  } else {
+    int64_t start = 0;
+    for (int64_t expert = 0; expert < num_experts; ++expert) {
+      const int64_t end = prefix[expert];
+      const int64_t rows = end - start;
+      auto out_chunk = out.select(0, expert);
+      if (rows == 0) {
+        out_chunk.zero_();
+        start = end;
+        continue;
+      }
+
+      auto a_slice = a.narrow(0, start, rows);
+      auto b_slice = b_contig.narrow(0, start, rows);
+
+      auto a_f32 = a_slice.contiguous().to(torch::kFloat32);
+      auto b_f32 = b_slice.contiguous().to(torch::kFloat32);
+
+      auto prod = torch::matmul(a_f32.transpose(0, 1), b_f32);
+      auto prod_bf16 = prod.to(dtype);
+
+      out_chunk.copy_(prod_bf16);
+      start = end;
+    }
+  }
   return out;
 }
@@ -210,6 +269,104 @@ torch::Tensor hipblaslt_gmm_internal(torch::Tensor a,
 
   auto b_contig = b.contiguous();
 
+  if (use_hip) {
+    int64_t start = 0;
+    for (int64_t expert = 0; expert < num_experts; ++expert) {
+      const int64_t end = prefix[expert];
+      const int64_t rows = end - start;
+      if (rows == 0) {
+        start = end;
+        continue;
+      }
+      auto a_chunk = a.narrow(0, start, rows).contiguous();
+      auto b_chunk = b_contig.select(0, expert).contiguous();
+      auto out_chunk = out.narrow(0, start, rows);
+
+      hipblaslt_run_matmul(a_chunk.data_ptr(),
+                           b_chunk.data_ptr(),
+                           out_chunk.data_ptr(),
+                           out_chunk.data_ptr(),
+                           rows,
+                           hidden_in,
+                           hidden_out,
+                           hidden_in,
+                           rows,
+                           hidden_out,
+                           hidden_in,
+                           hidden_in,
+                           hidden_out,
+                           hidden_out,
+                           HIPBLAS_OP_N,
+                           HIPBLAS_OP_T,
+                           /*accumulate=*/false);
+      start = end;
+    }
+  } else {
+    int64_t start = 0;
+    for (int64_t expert = 0; expert < num_experts; ++expert) {
+      const int64_t end = prefix[expert];
+      const int64_t rows = end - start;
+      if (rows == 0) {
+        start = end;
+        continue;
+      }
+      auto a_slice = a.narrow(0, start, rows);
+      auto b_slice = b_contig.select(0, expert);
+      auto out_chunk = out.narrow(0, start, rows);
+
+      auto a_f32 = a_slice.contiguous().to(torch::kFloat32);
+      auto b_f32 = b_slice.contiguous().to(torch::kFloat32);
+
+      auto prod = torch::matmul(a_f32, b_f32.transpose(0, 1));
+      auto prod_bf16 = prod.to(dtype);
+
+      out_chunk.copy_(prod_bf16);
+      start = end;
+    }
+  }
+  return out;
+  }
+
+  const int64_t hidden_out = a.size(1);
+  const int64_t hidden_in = b.size(2);
+  out = c_opt.value_or(torch::empty({tokens, hidden_in}, a.options()));
+  TORCH_CHECK(out.is_contiguous(), "Output tensor must be contiguous");
+
+  auto b_contig = b.contiguous();
+
+  if (use_hip) {
+    int64_t start = 0;
+    for (int64_t expert = 0; expert < num_experts; ++expert) {
+      const int64_t end = prefix[expert];
+      const int64_t rows = end - start;
+      if (rows == 0) {
+        start = end;
+        continue;
+      }
+      auto a_chunk = a.narrow(0, start, rows).contiguous();
+      auto b_chunk = b_contig.select(0, expert).contiguous();
+      auto out_chunk = out.narrow(0, start, rows);
+
+      hipblaslt_run_matmul(a_chunk.data_ptr(),
+                           b_chunk.data_ptr(),
+                           out_chunk.data_ptr(),
+                           out_chunk.data_ptr(),
+                           rows,
+                           hidden_out,
+                           hidden_out,
+                           hidden_in,
+                           rows,
+                           hidden_in,
+                           hidden_out,
+                           hidden_in,
+                           hidden_in,
+                           hidden_in,
+                           HIPBLAS_OP_N,
+                           HIPBLAS_OP_N,
+                           /*accumulate=*/false);
+      start = end;
+    }
+  } else {
   int64_t start = 0;
   for (int64_t expert = 0; expert < num_experts; ++expert) {
     const int64_t end = prefix[expert];
@@ -225,42 +382,12 @@ torch::Tensor hipblaslt_gmm_internal(torch::Tensor a,
     auto a_f32 = a_slice.contiguous().to(torch::kFloat32);
     auto b_f32 = b_slice.contiguous().to(torch::kFloat32);
 
-    auto prod = torch::matmul(a_f32, b_f32.transpose(0, 1));
+    auto prod = torch::matmul(a_f32, b_f32);
    auto prod_bf16 = prod.to(dtype);
 
     out_chunk.copy_(prod_bf16);
     start = end;
   }
-  return out;
-  }
-
-  const int64_t hidden_out = a.size(1);
-  const int64_t hidden_in = b.size(2);
-  out = c_opt.value_or(torch::empty({tokens, hidden_in}, a.options()));
-  TORCH_CHECK(out.is_contiguous(), "Output tensor must be contiguous");
-
-  auto b_contig = b.contiguous();
-
-  int64_t start = 0;
-  for (int64_t expert = 0; expert < num_experts; ++expert) {
-    const int64_t end = prefix[expert];
-    const int64_t rows = end - start;
-    if (rows == 0) {
-      start = end;
-      continue;
-    }
-    auto a_slice = a.narrow(0, start, rows);
-    auto b_slice = b_contig.select(0, expert);
-    auto out_chunk = out.narrow(0, start, rows);
-
-    auto a_f32 = a_slice.contiguous().to(torch::kFloat32);
-    auto b_f32 = b_slice.contiguous().to(torch::kFloat32);
-
-    auto prod = torch::matmul(a_f32, b_f32);
-    auto prod_bf16 = prod.to(dtype);
-
-    out_chunk.copy_(prod_bf16);
-    start = end;
   }
   return out;
 }