Update index.html
index.html +13 -9
index.html
CHANGED
|
@@ -88,15 +88,19 @@ Exploring Refusal Loss Landscapes </title>
|
|
| 88 |
<main id="content" class="main-content" role="main">
|
| 89 |
<h2 id="introduction">Introduction</h2>
|
| 90 |
|
| 91 |
-
<p>Large Language Models (LLMs) are
|
| 92 |
-
|
| 93 |
-
|
| 94 |
-
|
| 95 |
-
|
| 96 |
-
|
| 97 |
-
|
| 98 |
-
|
| 99 |
-
|
| 100 |
</p>
|
| 101 |
|
| 102 |
<h2 id="what-is-jailbreak">What is Jailbreak?</h2>
|
|
|
|
| 88 |
<main id="content" class="main-content" role="main">
|
| 89 |
<h2 id="introduction">Introduction</h2>
|
| 90 |
|
| 91 |
+
<p>Large Language Models (LLMs) are increasingly being integrated into services such as ChatGPT to provide responses to user queries.
|
| 92 |
+
To mitigate potential harm and prevent misuse, there have been concerted efforts to align LLMs with human values and legal requirements
|
| 93 |
+
by incorporating various techniques, such as Reinforcement Learning from Human Feedback (RLHF),
|
| 94 |
+
into LLM training. However, recent research has shown that even aligned
|
| 95 |
+
LLMs are susceptible to adversarial manipulations known as Jailbreak Attacks. To address this challenge, this paper proposes a method called
|
| 96 |
+
<strong>Token Highlighter</strong> to inspect and mitigate the potential jailbreak threats in the user query.
|
| 97 |
+
Token Highlighter introduces a concept called Affirmation Loss to measure the LLM's willingness to answer the user query.
|
| 98 |
+
It then uses the gradient of the Affirmation Loss with respect to each token in the user query to locate the jailbreak-critical tokens. Further,
|
| 99 |
+
Token Highlighter applies our proposed Soft Removal technique, which mitigates the jailbreak effect of the critical tokens by shrinking their
|
| 100 |
+
token embeddings. Experimental results on two aligned LLMs (LLaMA-2 and Vicuna-V1.5) demonstrate that the proposed method can effectively
|
| 101 |
+
defend against a variety of Jailbreak Attacks while maintaining competent performance on benign questions of the AlpacaEval benchmark. In
|
| 102 |
+
addition, Token Highlighter is a cost-effective and interpretable defense because it only needs to query the protected LLM once to compute
|
| 103 |
+
the Affirmation Loss and can highlight the critical tokens upon refusal.
|
| 104 |
</p>
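The pipeline described in the introduction (compute an Affirmation Loss, rank tokens by its gradient, then shrink the embeddings of the top-ranked tokens) can be sketched on a toy model. This is an illustrative sketch only, not the authors' implementation: the scoring function stands in for a real LLM, and the names `affirmation_loss`, `highlight_tokens`, `soft_removal`, the selection ratio `alpha`, and the shrink factor `beta` are all assumptions made for this example. Ranking tokens by the L2 norm of their gradient is likewise an assumed choice.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def affirmation_loss(embeddings, w):
    """Toy stand-in for -log p(affirmative reply | query).

    Assumption: the 'model' scores a query as sigmoid(sum_i (w . e_i)^2).
    A real defense would query the protected LLM instead.
    """
    z = sum(dot(w, e) ** 2 for e in embeddings)
    return -math.log(sigmoid(z))


def affirmation_loss_grads(embeddings, w):
    """Analytic gradient of the toy loss with respect to each token embedding."""
    z = sum(dot(w, e) ** 2 for e in embeddings)
    dloss_dz = -(1.0 - sigmoid(z))  # derivative of -log(sigmoid(z))
    # dz/de_i = 2 * (w . e_i) * w
    return [[dloss_dz * 2.0 * dot(w, e) * wj for wj in w] for e in embeddings]


def highlight_tokens(embeddings, w, alpha):
    """Return indices of the top alpha-fraction of tokens by gradient norm."""
    grads = affirmation_loss_grads(embeddings, w)
    scores = [math.sqrt(dot(g, g)) for g in grads]
    k = max(1, int(alpha * len(embeddings)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def soft_removal(embeddings, critical, beta):
    """Scale the embeddings of critical tokens by beta; copy the rest unchanged."""
    crit = set(critical)
    return [
        [beta * x for x in e] if i in crit else list(e)
        for i, e in enumerate(embeddings)
    ]


# Hypothetical 4-token query with 2-dimensional embeddings.
w = [1.0, 0.0]
query = [[0.1, 0.5], [0.9, 0.2], [0.0, 1.0], [0.2, 0.3]]

critical = highlight_tokens(query, w, alpha=0.25)
shielded = soft_removal(query, critical, beta=0.5)
loss_before = affirmation_loss(query, w)
loss_after = affirmation_loss(shielded, w)
```

In this toy setup the token with the largest gradient norm is the one whose embedding contributes most to the affirmative score, so shrinking it raises the Affirmation Loss, meaning the toy model becomes less inclined to comply. Shrinking rather than deleting the token keeps the rest of the query intact, which is the motivation the introduction gives for Soft Removal.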
|
| 105 |
|
| 106 |
<h2 id="what-is-jailbreak">What is Jailbreak?</h2>
|