Update index.html
index.html +6 -3
index.html
CHANGED
|
@@ -81,9 +81,12 @@ Exploring Refusal Loss Landscapes </title>
|
|
| 81 |
<p>Current transformer-based LLMs will return different responses to the same query due to the randomness of
|
| 82 |
autoregressive sampling-based generation. With this randomness, it is an
|
| 83 |
interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but
|
| 84 |
-
sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called <strong>Refusal Loss</strong> to
|
| 85 |
-
the LLM won't reject the input user query.
|
| 86 |
-
| 87 |
</p>
|
| 88 |
|
| 89 |
<div class="container jailbreak-intro-sec">
|
|
|
|
| 81 |
<p>Current transformer-based LLMs will return different responses to the same query due to the randomness of
|
| 82 |
autoregressive sampling-based generation. With this randomness, it is an
|
| 83 |
interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but
|
| 84 |
+
sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called <strong>Refusal Loss</strong> to
|
| 85 |
+
represent the probability that the LLM won't reject the input user query. Using 1 to denote a successful jailbreak and 0 to denote
|
| 86 |
+
the opposite, we compute the empirical refusal loss as the sample mean of the jailbreak results returned from the target LLM.
|
| 87 |
+
<!--Since the refusal loss is not computable, we query the target LLM multiple times using the same query and using the sample
|
| 88 |
+
mean of the Jailbroken results (1 indicates successful jailbreak, 0 indicates the opposite) to approximate the function value. -->
|
| 89 |
+
We visualize the 2-D landscape of the empirical Refusal Loss as below:
|
| 90 |
</p>
|
| 91 |
|
| 92 |
<div class="container jailbreak-intro-sec">
|
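The added paragraph defines the empirical Refusal Loss as the sample mean of 0/1 jailbreak outcomes over repeated queries. A minimal sketch of that computation follows. It is an illustration, not the page authors' implementation: the `is_jailbroken` judge, its `REFUSAL_MARKERS` list, and the function names are all hypothetical, and the responses are assumed to have already been sampled from the target LLM for one fixed query.

```python
from typing import Callable, Sequence

# Hypothetical refusal phrases, used only for this illustration.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't")


def is_jailbroken(response: str) -> int:
    """Assumed judge: 1 if no refusal phrase appears (jailbroken), else 0."""
    text = response.lower()
    return 0 if any(marker in text for marker in REFUSAL_MARKERS) else 1


def empirical_refusal_loss(
    responses: Sequence[str],
    judge: Callable[[str], int] = is_jailbroken,
) -> float:
    """Sample mean of the 0/1 jailbreak outcomes for one query."""
    if not responses:
        raise ValueError("need at least one sampled response")
    return sum(judge(r) for r in responses) / len(responses)


if __name__ == "__main__":
    # Stand-in strings; in practice these come from sampling the target LLM
    # several times with the same query.
    sampled = [
        "I'm sorry, I can't help with that.",
        "Sure, here is how...",
        "I cannot assist.",
        "Step 1: ...",
    ]
    print(empirical_refusal_loss(sampled))
```

With more samples per query the estimate becomes less noisy, at the cost of more calls to the target LLM.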