Update index.html
index.html +6 -2
index.html
CHANGED
|
@@ -78,8 +78,12 @@ Exploring Refusal Loss Landscapes </title>
|
|
| 78 |
</div>
|
| 79 |
|
| 80 |
<h3 id="refusal-loss">Refusal Loss</h3>
|
| 81 |
-
<p>
|
| 82 |
-
|
| 83 |
|
| 84 |
<div class="container jailbreak-intro-sec">
|
| 85 |
<div><img id="jailbreak-intro-img" src="images/metrics/intro-metric-example.png" /></div>
|
| 78 |
</div>
|
| 79 |
|
| 80 |
<h3 id="refusal-loss">Refusal Loss</h3>
|
| 81 |
+
<p>Current transformer-based LLMs can return different responses to the same query because of the randomness of
|
| 82 |
+
autoregressive sampling-based generation. Because of this randomness,
|
| 83 |
+
a malicious user query is sometimes rejected by the target LLM but
|
| 84 |
+
sometimes bypasses the safety guardrail. Based on this observation, for a given LLM $T_\theta$ parameterized by $\theta$, we
|
| 85 |
+
define the refusal loss function $\phi_\theta(x)$ for a given input user query $x$ as follows:
|
| 86 |
+
</p>
|
| 87 |
|
| 88 |
<div class="container jailbreak-intro-sec">
|
| 89 |
<div><img id="jailbreak-intro-img" src="images/metrics/intro-metric-example.png" /></div>
|
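The observation in the added paragraph, that the same query can be refused on one sample and answered on another, can be illustrated by repeated sampling. The following is a minimal sketch, not the page's method: `toy_model` is a hypothetical stand-in for a sampled LLM, and `is_refusal` with its keyword list is an assumed, simplistic refusal detector. It estimates only an empirical refusal rate; the refusal loss itself is defined in a formula that this diff does not show.

```python
import random
from typing import Callable

# Hypothetical refusal markers (an assumption for illustration only).
REFUSAL_MARKERS = ("i cannot", "i can't", "i'm sorry")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any assumed refusal marker."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def estimate_refusal_rate(
    sample_fn: Callable[[str, random.Random], str],
    query: str,
    n_samples: int = 100,
    seed: int = 0,
) -> float:
    """Sample n_samples responses to one query and return the refused fraction."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    rng = random.Random(seed)  # fixed seed makes the estimate reproducible
    refusals = sum(is_refusal(sample_fn(query, rng)) for _ in range(n_samples))
    return refusals / n_samples


def toy_model(query: str, rng: random.Random) -> str:
    """Stand-in for a sampled LLM: refuses with probability 0.7 (assumed)."""
    if rng.random() < 0.7:
        return "I'm sorry, but I can't help with that."
    return "Sure, here is a response."


if __name__ == "__main__":
    rate = estimate_refusal_rate(toy_model, "example query", n_samples=200)
    print(f"estimated refusal rate: {rate:.2f}")
```

With a real model, `sample_fn` would wrap a generation call with sampling enabled; a rate strictly between 0 and 1 for a single query is the phenomenon the paragraph describes.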