</code></pre>
<p>We often read, and understand, the criticism of <code>kwargs</code>. We type them wherever we can, but we cannot enforce them everywhere, because other libraries such as vLLM don’t use the same kwargs.</p>
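<p>As an illustration (a sketch only; the exact annotations in the library differ per attention backend), <code>kwargs</code> can be <em>described</em> with a <code>TypedDict</code> and <code>Unpack</code>: type checkers and readers learn what is expected, while callers that pass different keys are still accepted at runtime.</p>
<pre><code class="language-python">from typing import Optional
from typing_extensions import TypedDict, Unpack

class AttnKwargs(TypedDict, total=False):
    # Hypothetical keys, for illustration only.
    dropout: float
    scaling: Optional[float]
    sliding_window: Optional[int]

def attention_forward(query, key, value, **kwargs: Unpack[AttnKwargs]):
    # The annotation INFORMS tooling and readers; it does not ENFORCE anything,
    # so another library can still call this with its own extra kwargs.
    dropout = kwargs.get("dropout", 0.0)
    ...
</code></pre>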
<p>This is a strength of the new attention interface: it can be plugged into various backends precisely because most of the signature is not enforced. We INFORM but do not ENFORCE. That way, the current system remains a <a href="#minimal-user-api">minimal user API</a>.</p>
<p>For better <em>information</em>, we plan to use Python features such as <code>Annotated</code> to tell users what we typically expect in an argument. That way, higher-level information can be carried directly in the type annotations.</p>
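<p>A sketch of the idea (the aliases below are hypothetical, not an API we ship): <code>Annotated</code> attaches human-readable expectations to a type without changing runtime behaviour.</p>
<pre><code class="language-python">from typing import Annotated, Optional

# Hypothetical aliases, for illustration: the metadata travels with the type
# and can be surfaced by tooling or documentation, but is never enforced.
Temperature = Annotated[float, "sampling temperature, expected in (0.0, 2.0]"]
MaxNewTokens = Annotated[int, "upper bound on newly generated tokens, > 0"]

def generate(prompt: str,
             temperature: Temperature = 1.0,
             max_new_tokens: MaxNewTokens = 256,
             seed: Optional[int] = None) -> str:
    ...
</code></pre>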
<h2><a id="simpler-tensor-parallelism"></a> Simpler Tensor Parallelism</h2>
<p>We want to touch the modeling code as little as possible, and only modify it when <em>architectural changes</em> are involved. For tensor parallelism, for instance, we now specify a simple <code>tp_plan</code> instead of editing the model.</p>
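<p>A minimal sketch of what such a plan can look like (the layer names and sharding keywords here are indicative, not a definitive spec): a mapping from weight patterns to a partitioning strategy, kept alongside the configuration rather than inside the modeling code.</p>
<pre><code class="language-python"># Indicative tensor-parallel plan: module-name patterns mapped to a sharding
# strategy, so the forward code of the model does not change at all.
tp_plan = {
    "layers.*.self_attn.q_proj": "colwise",
    "layers.*.self_attn.k_proj": "colwise",
    "layers.*.self_attn.v_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    "layers.*.mlp.up_proj": "colwise",
    "layers.*.mlp.down_proj": "rowwise",
}
</code></pre>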
<h2><a id="layers-attentions-caches"></a> Layers, attentions and caches</h2>
<pre><code class="language-python">class GlmRMSNorm(nn.Module):
    ...
</code></pre>
<p>Plus, this opened another angle of contribution for the community: people who are GPU whisperers can now contribute optimized kernels. Check out the <a href="https://huggingface.co/blog/hello-hf-kernels">kernel community blog post</a> to learn more!</p>
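<p>As a sketch of the pattern (the decorator comes from the hub kernels integration; treat the import path and kernel name as indicative rather than definitive), a normalization layer can opt into a community kernel while keeping its plain PyTorch forward as the fallback:</p>
<pre><code class="language-python">import torch
from torch import nn
from kernels import use_kernel_forward_from_hub  # hub kernels integration (indicative import)

@use_kernel_forward_from_hub("RMSNorm")  # use an optimized hub kernel when one is available
class GlmRMSNorm(nn.Module):
    def __init__(self, hidden_size: int, eps: float = 1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(hidden_size))
        self.variance_epsilon = eps

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Plain PyTorch fallback, used when no kernel is plugged in.
        variance = hidden_states.pow(2).mean(-1, keepdim=True)
        hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
        return self.weight * hidden_states
</code></pre>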
<h2>The good modularity</h2>
<p>Now, we have a form of inheritance in our codebase. Some models become standards, and model contributors are given the opportunity to <em>define standards</em>. Pushing the boundaries of scientific knowledge can translate into pushing the boundaries of engineering, if this effort is made, and we’re striving for it.</p>
<p>My capacity for abstraction is not that great, compared to other computer scientists and engineers: I need to look at little doodles and drawings, especially when components pile up.</p>
<p><img src="static/model_debugger.png" alt="Model debugger interface"></p>
<h3>Transformers-serve</h3>
<p>Having all these models readily available allows us to use all of them with <code>transformers serve</code>, and to interface with them through an OpenAI-compatible API.</p>
<pre><code class="language-bash"># Start serving a model with transformers serve
|
| 485 |
+
transformers serve microsoft/DialoGPT-medium --port 8000
|
| 486 |
+
|
| 487 |
+
# Query the model using OpenAI-compatible API
|
| 488 |
+
curl -X POST http://localhost:8000/v1/chat/completions \
|
| 489 |
+
-H "Content-Type: application/json" \
|
| 490 |
+
-d "{
|
| 491 |
+
\"model\": \"microsoft/DialoGPT-medium\",
|
| 492 |
+
\"messages\": [{\"role\": \"user\", \"content\": \"Hello, how are you?\"}],
|
| 493 |
+
\"max_tokens\": 50
|
| 494 |
+
}"
|
| 495 |
+
</code></pre>
<p>This provides an OpenAI-compatible API with features like continuous batching for better GPU utilization.</p>
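<p>Because the routes follow the OpenAI schema, existing clients can talk to the local server as well. A minimal sketch with the <code>openai</code> Python client, reusing the model and port from the example above (the placeholder API key assumes the local server does not check it):</p>
<pre><code class="language-python">from openai import OpenAI

# Point the standard OpenAI client at the local `transformers serve` endpoint.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="microsoft/DialoGPT-medium",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    max_tokens=50,
)
print(response.choices[0].message.content)
</code></pre>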
<h2>Community reusability</h2>
<p>Adding a model to transformers means:</p>
<ul>
<li>usable in vLLM, SGLang, and so on without additional code.</li>
</ul>
<h2>Inner cooking: CUDA Warmup</h2>
<p>Having a clean <em>external</em> API allows us to work on the true inner workings of transformers. One of the recent additions was the <em>CUDA warmup</em> via <code>caching_allocator_warmup</code>, which massively improved loading by pre-allocating GPU memory, avoiding malloc bottlenecks while the weights are loaded.</p>
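<p>The idea, in a deliberately simplified sketch (the real <code>caching_allocator_warmup</code> lives in the loading code and accounts for devices, dtypes and sharding), is to make one large allocation up front so that PyTorch's caching allocator already owns the memory when the many small per-tensor allocations happen:</p>
<pre><code class="language-python">import torch

def warmup_caching_allocator(total_param_bytes: int, device: str = "cuda:0") -> None:
    """Simplified sketch: reserve the expected memory in one shot."""
    # One big allocation instead of thousands of small cudaMalloc calls...
    buffer = torch.empty(total_param_bytes, dtype=torch.uint8, device=device)
    # ...then release it: the memory stays in the caching allocator's pool,
    # so tensor allocations made while loading weights are served from cache.
    del buffer
</code></pre>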
<div style="border: 1px solid #e2e8f0; border-radius: 8px; background: white; margin: 1.5rem 0;">
<div style="padding: 1rem; border-bottom: 1px solid #e2e8f0; background: #f8f9fa;">
<h4 style="margin: 0 0 0.5rem 0; color: #495057;">🚀 CUDA Warmup Efficiency Benchmark</h4>