fix source
Browse files
- content/article.md +19 -5
- src/fragments/memory-profiler.html +2 -2
- webpack.config.js +2 -2
content/article.md
CHANGED
@@ -82,7 +82,7 @@ We often read and understand that `kwargs` are criticized, and we are typing the
 
 It is a strength of the new attention interface, where it can be plugged into various backends, because most of the signature is not enforced. We INFORM but do not ENFORCE. That way, the current system is a [minimal user api](#minimal-user-api).
 
-For
+For better _information_, we plan to use Python features such as `Annotated` to inform users of what we typically expect in an argument. That way, higher-level information could be included directly in the type annotations.
 
 ## Community Kernels
 
@@ -94,7 +94,7 @@ class GlmRMSNorm(nn.Module):
     ...
 ```
 
-Plus, this opened another angle of contribution for the community. People who are GPU
+Plus, this opened another angle of contribution for the community. People who are GPU whisperers can now contribute optimized kernels. Check out the [kernel community blog post](https://huggingface.co/blog/hello-hf-kernels) to learn more!
 
 ## The good modularity
 
@@ -241,7 +241,21 @@ It just works with PyTorch models and is especially useful when aligning outputs
 
 Having all these models readily available allows using all of them with `transformers serve`, and enables interfacing with them through an OpenAI-like pattern.
 
-
+```bash
+# Start serving a model with transformers serve
+transformers serve microsoft/DialoGPT-medium --port 8000
+
+# Query the model using the OpenAI-compatible API
+curl -X POST http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d "{
+    \"model\": \"microsoft/DialoGPT-medium\",
+    \"messages\": [{\"role\": \"user\", \"content\": \"Hello, how are you?\"}],
+    \"max_tokens\": 50
+  }"
+```
+
+This provides an OpenAI-compatible API with features like continuous batching for better GPU utilization.
 ## Community reusability
 
 
@@ -249,9 +263,9 @@ Adding a model to transformers means:
 - having it immediately available to the community
 - usable in vLLM, SGLang, and so on without additional code.
 
-## Inner cooking:
+## Inner cooking: CUDA Warmup
 
-Having a clean _external_ API allows us to work on the true inner workings of transformers. One of the few recent additions was the
+Having a clean _external_ API allows us to work on the true inner workings of transformers. One recent addition was the _CUDA warmup_ via `caching_allocator_warmup`, which massively improved the loading footprint by pre-allocating GPU memory to avoid malloc bottlenecks during model loading.
 
 {{{fragment-memory-profiler}}}
 
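The `Annotated` plan mentioned in the article diff can be sketched as follows. This is a hedged illustration only, not the transformers API: the function name and metadata strings are made up for the example. The point is that the metadata informs tools and users without being enforced at call time.

```python
from typing import Annotated, get_type_hints

# Hypothetical signature: the Annotated metadata informs, but nothing enforces it.
def attention_scores(
    query: Annotated[list, "expected shape: (batch, heads, seq, head_dim)"],
    scale: Annotated[float, "typically 1 / sqrt(head_dim)"] = 1.0,
) -> list:
    return [q * scale for q in query]

# Tools (and users) can read the expectations back at runtime.
hints = get_type_hints(attention_scores, include_extras=True)
print(hints["scale"].__metadata__)  # ('typically 1 / sqrt(head_dim)',)
```

Because `get_type_hints(..., include_extras=True)` preserves the annotations, a library can surface these hints in error messages or docs while still accepting any backend that respects the loose contract.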
src/fragments/memory-profiler.html
CHANGED
@@ -1,8 +1,8 @@
 <div style="border: 1px solid #e2e8f0; border-radius: 8px; background: white; margin: 1.5rem 0;">
 <div style="padding: 1rem; border-bottom: 1px solid #e2e8f0; background: #f8f9fa;">
-<h4 style="margin: 0 0 0.5rem 0; color: #495057;">🚀
+<h4 style="margin: 0 0 0.5rem 0; color: #495057;">🚀 CUDA Warmup Efficiency Benchmark</h4>
 <p style="margin: 0; font-size: 0.9em; color: #6c757d;">
-Compare model loading with and without transformers'
+Compare model loading with and without transformers' CUDA warmup via `caching_allocator_warmup`. This demonstrates the loading time and memory efficiency improvements.
 </p>
 </div>
 
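The benchmark fragment compares loading with and without allocator warmup. As a rough, CPU-only sketch of the underlying idea — one large up-front allocation that later loads carve views out of, instead of many separate mallocs — and emphatically not the actual `caching_allocator_warmup` implementation:

```python
# Illustrative sketch only: the real warmup pre-allocates GPU memory through
# PyTorch's caching allocator; here a bytearray stands in for the device pool.
def warmup_pool(total_bytes: int) -> memoryview:
    # One big allocation up front, paid once.
    return memoryview(bytearray(total_bytes))

def load_weights(pool: memoryview, sizes: list[int]) -> list[memoryview]:
    # Each "tensor" becomes a view into the pre-allocated pool: no new mallocs.
    views, offset = [], 0
    for size in sizes:
        views.append(pool[offset:offset + size])
        offset += size
    return views

pool = warmup_pool(1 << 10)           # pretend 1 KiB is the model's footprint
tensors = load_weights(pool, [256, 256, 512])
print([len(t) for t in tensors])      # [256, 256, 512]
```

The design choice being illustrated: when the total footprint is known from the checkpoint metadata, paying one allocation up front removes the per-tensor allocation cost from the loading hot path.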
webpack.config.js
CHANGED
@@ -235,7 +235,7 @@ module.exports = {
 "title": "Transformers Feature Showcase",
 "description": "An interactive demonstration of transformers library features and design philosophy.",
 "published": "Aug 21, 2025",
-"authors": [{"author": "Pablo
+"authors": [{"author": "Pablo Montalvo", "authorURL": "https://huggingface.co/Molbap"}]
 }</script>
 </d-front-matter>
 <d-title>
@@ -330,4 +330,4 @@ module.exports = {
 },
 };
 
-console.log(process.env.NODE_ENV)
+console.log(process.env.NODE_ENV)