Update index.html
index.html CHANGED (+4 -6)
@@ -62,16 +62,14 @@ Exploring Refusal Loss Landscapes </title>
 jailbreak attempts aiming at subverting the embedded safety guardrails. To address this challenge,
 we define and investigate the \textbf{Refusal Loss} of LLMs and then propose a method called \textbf{Gradient Cuff} to
 detect jailbreak attempts. In this demonstration, we first introduce the concept of "Jailbreak". Then we present the refusal loss
-landscape and based on the characteristics of this landscape
+landscape and propose the Gradient Cuff based on the characteristics of this landscape. Lastly, we compare Gradient Cuff with other jailbreak defense
 methods and show the defense performance.
 </p>
 
 <h2 id="what-is-jailbreak">What is Jailbreak?</h2>
-<p>
-
-
-This phenomenon could hamper scenarios requiring accurate uncertainty estimation, such as safety-related tasks
-(e.g., autonomous driving systems, medical diagnosis, etc.).</p>
+<p>Jailbreak attacks involve maliciously inserting or replacing tokens in the user instruction or rewriting it to bypass and circumvent
+the safety guardrails of aligned LLMs. A notable example is that a jailbroken LLM would be tricked into
+generating hate speech targeting certain groups of people, as demonstrated below.</p>
 
 <div class="container">
 <div id="jailbreak-intro" class="row align-items-center jailbreak-intro-sec">