Update index.html
index.html CHANGED (+4 -2)
@@ -90,9 +90,11 @@ Exploring Refusal Loss Landscapes </title>
 </div>
 
 <p>
-
+From the above plot, we find that the loss landscape is more precipitous for malicious queries than for benign queries, which implies that
+the <strong>Refusal Loss</strong> tends to have a large gradient norm if the input represents a malicious query. This observation motivates our proposal of using
+the gradient norm of <strong>Refusal Loss</strong> to detect jailbreak attempts that pass the initial filtering of rejecting the input query when the function value is under 0.5.
 Below we present the definition of the <strong>Refusal Loss</strong> and how we approximate it's function value and gradient.
-See more details about the concept, approximation and
+See more details about the concept, approximation, gradient estimation and landscape drawing of it in our paper.
 </p>
 
 <div id="refusal-loss-formula" class="container">
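The paragraph added in this change describes a two-stage check: reject a query when the Refusal Loss value is under 0.5, then use the gradient norm to catch jailbreak attempts that pass this first filter. The following is a minimal Python sketch of that flow, not the authors' implementation. It assumes the loss is available as a callable `loss_fn` over a numeric vector. The names `estimate_grad_norm` and `two_stage_check`, the random-direction finite-difference estimator, and the `grad_threshold` value are all hypothetical choices made for illustration. Only the 0.5 value threshold comes from the text above.

```python
import math
import random


def estimate_grad_norm(loss_fn, x, num_directions=10, step=0.02, seed=0):
    """Estimate ||grad loss_fn(x)|| with random-direction finite differences.

    Assumption: this zeroth-order estimator stands in for whatever gradient
    approximation the paper actually uses.
    """
    rng = random.Random(seed)
    base = loss_fn(x)
    grad = [0.0] * len(x)
    for _ in range(num_directions):
        # Draw a random unit direction.
        u = [rng.gauss(0.0, 1.0) for _ in x]
        norm = math.sqrt(sum(c * c for c in u)) or 1.0
        u = [c / norm for c in u]
        # Forward difference along that direction.
        shifted = [xi + step * ui for xi, ui in zip(x, u)]
        slope = (loss_fn(shifted) - base) / step
        # Average the directional contributions.
        grad = [g + slope * ui / num_directions for g, ui in zip(grad, u)]
    return math.sqrt(sum(g * g for g in grad))


def two_stage_check(loss_fn, x, value_threshold=0.5, grad_threshold=1.0):
    """Stage 1: reject on a low loss value. Stage 2: reject on a large gradient norm.

    grad_threshold is a hypothetical placeholder, not a value from the source.
    """
    if loss_fn(x) < value_threshold:
        return "rejected: low function value"
    if estimate_grad_norm(loss_fn, x) > grad_threshold:
        return "rejected: large gradient norm"
    return "accepted"
```

With toy stand-ins for the loss (not real Refusal Loss values), a flat function such as `lambda v: 0.9` passes both stages, while a steep one such as `lambda v: 0.6 + 50.0 * sum(v)` at the origin passes the value filter but fails the gradient-norm check.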