Update index.html

index.html CHANGED (+17 -121)
@@ -114,119 +114,6 @@ Exploring Refusal Loss Landscapes </title>
  </div>
  </div>
 
- <p>We summarize some recent advances in <strong>Jailbreak Attack</strong> and <strong>Jailbreak Defense</strong> in the table below:</p>
- <div id="tabs">
-   <ul>
-     <li><a href="#jailbreak-attacks">Jailbreak Attack</a></li>
-     <li><a href="#jailbreak-defenses">Jailbreak Defense</a></li>
-   </ul>
-   <div id="jailbreak-attacks">
-     <div id="accordion-attacks">
-       <h3>GCG</h3>
-       <div>
-         <ul>
-           <li>Paper: <a href="https://arxiv.org/abs/2307.15043" target="_blank" rel="noopener noreferrer">Universal and Transferable Adversarial Attacks on Aligned Language Models</a></li>
-           <li>Brief Introduction: Given a (potentially harmful) user query, GCG optimizes and appends an adversarial suffix to the query that attempts to induce negative behavior from the target LLM (sketched below).</li>
-         </ul>
-       </div>
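
To make the suffix optimization concrete, below is a minimal sketch of a single GCG step, not the authors' reference implementation: `model` is assumed to be a HuggingFace-style causal LM, and `suffix_slice`/`target_slice` are placeholder slices marking where the adversarial suffix and the affirmative target continuation (e.g. "Sure, here is ...") sit inside `input_ids`. The real attack repeats this step many times, often over a batch of behaviors.

    # Illustrative single GCG step: use the gradient of the target loss w.r.t.
    # one-hot suffix tokens to propose swaps, then keep the best sampled swap.
    import torch
    import torch.nn.functional as F

    def gcg_step(model, input_ids, suffix_slice, target_slice, top_k=256, n_trials=64):
        embed = model.get_input_embeddings()
        one_hot = F.one_hot(input_ids[suffix_slice], embed.weight.shape[0]).to(embed.weight.dtype)
        one_hot.requires_grad_(True)
        full = embed(input_ids.unsqueeze(0)).detach()
        # Splice differentiable suffix embeddings into the frozen prompt embeddings.
        embeds = torch.cat([full[:, :suffix_slice.start],
                            (one_hot @ embed.weight).unsqueeze(0),
                            full[:, suffix_slice.stop:]], dim=1)
        logits = model(inputs_embeds=embeds).logits[0]
        # Cross-entropy of the affirmative target continuation.
        loss = F.cross_entropy(logits[target_slice.start - 1:target_slice.stop - 1],
                               input_ids[target_slice])
        loss.backward()
        # Most promising substitutions per suffix position (largest negative gradient).
        candidates = (-one_hot.grad).topk(top_k, dim=1).indices
        best_ids, best_loss = input_ids, loss.item()
        for _ in range(n_trials):  # evaluate a random subset of candidate swaps
            pos = torch.randint(candidates.shape[0], (1,)).item()
            trial = input_ids.clone()
            trial[suffix_slice.start + pos] = candidates[pos, torch.randint(top_k, (1,)).item()]
            with torch.no_grad():
                out = model(trial.unsqueeze(0)).logits[0]
                l = F.cross_entropy(out[target_slice.start - 1:target_slice.stop - 1],
                                    trial[target_slice]).item()
            if l < best_loss:
                best_ids, best_loss = trial, l
        return best_ids, best_loss

The key trick is treating the suffix as one-hot vectors, so a single backward pass ranks every possible token swap at every suffix position.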
-       <h3>AutoDAN</h3>
-       <div>
-         <ul>
-           <li>Paper: <a href="https://arxiv.org/abs/2310.04451" target="_blank" rel="noopener noreferrer">AutoDAN: Generating Stealthy Jailbreak Prompts on Aligned Large Language Models</a></li>
-           <li>Brief Introduction: AutoDAN is an automatic framework for generating stealthy jailbreak prompts, built on a carefully designed hierarchical genetic algorithm. AutoDAN preserves the meaningfulness and fluency (i.e., stealthiness) of jailbreak prompts, akin to handcrafted ones, while retaining the automated deployment introduced by prior token-level research such as GCG (sketched below).</li>
-         </ul>
-       </div>
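
As a toy illustration of the genetic-algorithm idea (a sketch, not the paper's code), the loop below evolves a population of prompt strings. `fitness` and `mutate` are hypothetical hooks: in AutoDAN, fitness comes from the target model's likelihood of an affirmative response, and mutation is LLM-based rephrasing that keeps prompts fluent; the paper additionally crosses over hierarchically at both paragraph and sentence level.

    import random

    def autodan_evolve(population, fitness, mutate, generations=50, elite=2, p_mut=0.3):
        for _ in range(generations):
            ranked = sorted(population, key=fitness, reverse=True)
            next_gen = ranked[:elite]                    # elitism: carry over the best prompts
            while len(next_gen) < len(population):
                a, b = random.sample(ranked[:max(2, len(ranked) // 2)], 2)
                # Sentence-level crossover: splice the two parents at a sentence boundary.
                sa, sb = a.split(". "), b.split(". ")
                cut = random.randint(1, min(len(sa), len(sb)))
                child = ". ".join(sa[:cut] + sb[cut:])
                if random.random() < p_mut:
                    child = mutate(child)                # e.g. LLM-based paraphrasing
                next_gen.append(child)
            population = next_gen
        return max(population, key=fitness)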
-       <h3>PAIR</h3>
-       <div>
-         <ul>
-           <li>Paper: <a href="https://arxiv.org/abs/2310.08419" target="_blank" rel="noopener noreferrer">Jailbreaking Black Box Large Language Models in Twenty Queries</a></li>
-           <li>Brief Introduction: PAIR uses an attacker LLM to automatically generate jailbreaks for a separate target LLM without human intervention. The attacker LLM iteratively queries the target LLM, updating and refining a candidate jailbreak based on the comments and score provided by a judge model. Empirically, PAIR often requires fewer than twenty queries to produce a successful jailbreak (sketched below).</li>
-         </ul>
-       </div>
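
The whole procedure fits in a short loop. In this sketch, `attacker`, `target`, and `judge` are placeholder wrappers around three chat-model calls; the judge's 1-10 scale follows the paper's description, and everything else is illustrative.

    def pair_loop(goal, attacker, target, judge, max_queries=20, success_score=10):
        history = []                                        # feedback for the attacker
        for _ in range(max_queries):
            prompt = attacker(goal=goal, feedback=history)  # refine from past attempts
            response = target(prompt)
            score = judge(goal, prompt, response)           # 1 (refused) .. 10 (jailbroken)
            if score >= success_score:
                return prompt, response
            history.append({"prompt": prompt, "response": response, "score": score})
        return None, None  # gave up within the query budget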
-       <h3>TAP</h3>
-       <div>
-         <ul>
-           <li>Paper: <a href="https://arxiv.org/abs/2312.02119" target="_blank" rel="noopener noreferrer">Tree of Attacks: Jailbreaking Black-Box LLMs Automatically</a></li>
-           <li>Brief Introduction: TAP is similar to PAIR; the main difference is that the attacker in TAP iteratively refines candidate (attack) prompts using tree-of-thought reasoning (sketched below).</li>
-         </ul>
-       </div>
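
A sketch of the tree-structured variant, assuming the same placeholder `attacker`/`target`/`judge` wrappers as in the PAIR sketch plus an `on_topic` pruning check; the branching factor, depth, and width below are arbitrary illustrative values.

    def tap_loop(goal, attacker, target, judge, on_topic, depth=10, branches=4, width=10):
        leaves = [attacker(goal=goal, feedback=[])]
        for _ in range(depth):
            # Branch: every surviving prompt spawns several refinements.
            children = [attacker(goal=goal, feedback=[leaf])
                        for leaf in leaves for _ in range(branches)]
            children = [c for c in children if on_topic(goal, c)]   # prune off-topic prompts
            scored = sorted(((judge(goal, c, target(c)), c) for c in children), reverse=True)
            if scored and scored[0][0] >= 10:
                return scored[0][1]                                 # successful jailbreak
            leaves = [c for _, c in scored[:width]]                 # keep only the best leaves
        return None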
-       <h3>Base64</h3>
-       <div>
-         <ul>
-           <li>Paper: <a href="https://arxiv.org/abs/2307.02483" target="_blank" rel="noopener noreferrer">Jailbroken: How Does LLM Safety Training Fail?</a></li>
-           <li>Brief Introduction: Encode the malicious user query in base64 before using it to query the model (sketched below).</li>
-         </ul>
-       </div>
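
The transformation itself is one line of standard-library Python; the instruction wrapped around the encoded query below is an illustrative choice, not a template from the paper.

    import base64

    query = "example harmful request goes here"  # placeholder query
    encoded = base64.b64encode(query.encode("utf-8")).decode("ascii")
    prompt = f"Respond to the following base64-encoded request:\n{encoded}"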
-       <h3>LRL</h3>
-       <div>
-         <ul>
-           <li>Paper: <a href="https://arxiv.org/abs/2310.02446" target="_blank" rel="noopener noreferrer">Low-Resource Languages Jailbreak GPT-4</a></li>
-           <li>Brief Introduction: Translate the malicious user query into a low-resource language before using it to query the model (sketched below).</li>
-         </ul>
-       </div>
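
The pipeline is simply translate, query, translate back. In this sketch, `translate` and `target` are hypothetical hooks (any machine-translation system works), and the language code is only one example of a low-resource choice.

    def lrl_attack(query, target, translate, lang="zu"):
        translated = translate(query, src="en", dst=lang)  # English -> low-resource (e.g. Zulu)
        response = target(translated)                      # query the victim model
        return translate(response, src=lang, dst="en")     # translate the answer back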
-     </div>
-   </div>
-
-   <div id="jailbreak-defenses">
-     <div id="accordion-defenses">
-       <h3>Perplexity Filter</h3>
-       <div>
-         <ul>
-           <li>Paper: <a href="https://arxiv.org/abs/2309.00614" target="_blank" rel="noopener noreferrer">Baseline Defenses for Adversarial Attacks Against Aligned Language Models</a></li>
-           <li>Brief Introduction: The Perplexity Filter uses an LLM to compute the perplexity of the input query and rejects queries with high perplexity (sketched below).</li>
-         </ul>
-       </div>
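
A minimal version of this filter, assuming GPT-2 as the scoring LM; the threshold is a free parameter that must be calibrated on benign prompts, so the value below is only a placeholder.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")
    lm = AutoModelForCausalLM.from_pretrained("gpt2")
    lm.eval()

    @torch.no_grad()
    def perplexity(text):
        ids = tok(text, return_tensors="pt").input_ids
        loss = lm(ids, labels=ids).loss       # mean next-token cross-entropy
        return torch.exp(loss).item()

    def passes_filter(query, threshold=1000.0):   # placeholder threshold
        return perplexity(query) < threshold      # reject high-perplexity queries

Optimized suffixes like GCG's tend to be gibberish, which is exactly what drives the measured perplexity up.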
-       <h3>SmoothLLM</h3>
-       <div>
-         <ul>
-           <li>Paper: <a href="https://arxiv.org/abs/2310.03684" target="_blank" rel="noopener noreferrer">SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks</a></li>
-           <li>Brief Introduction: SmoothLLM perturbs the original input query to obtain several copies, then aggregates the target LLM's intermediate responses to these perturbed queries to produce the final response to the original query (sketched below).</li>
-         </ul>
-       </div>
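
In sketch form, with `target` and `is_jailbroken` as placeholder hooks (the paper keys the majority vote on whether each perturbed response is jailbroken):

    import random
    import string

    def perturb(text, q=0.1):
        # Randomly swap ~q of the characters; this tends to break brittle
        # adversarial suffixes while leaving natural queries understandable.
        chars = list(text)
        for i in random.sample(range(len(chars)), max(1, int(q * len(chars)))):
            chars[i] = random.choice(string.ascii_letters + string.punctuation)
        return "".join(chars)

    def smooth_llm(query, target, is_jailbroken, n_copies=10):
        responses = [target(perturb(query)) for _ in range(n_copies)]
        flags = [is_jailbroken(r) for r in responses]
        majority = sum(flags) > len(flags) / 2    # vote on jailbroken-ness
        # Return a response consistent with the majority vote.
        return random.choice([r for r, f in zip(responses, flags) if f == majority])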
-       <h3>Erase-Check</h3>
-       <div>
-         <ul>
-           <li>Paper: <a href="https://arxiv.org/abs/2309.02705" target="_blank" rel="noopener noreferrer">Certifying LLM Safety against Adversarial Prompting</a></li>
-           <li>Brief Introduction: Erase-Check employs a model to check whether the original query or any of its erased sub-sentences is harmful. The query is rejected if the query itself or one of its sub-sentences is regarded as harmful by the safety checker (sketched below).</li>
-         </ul>
-       </div>
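
A sketch of the suffix-erasure mode (the paper also describes insertion and infusion modes); `is_harmful` stands in for its safety classifier, and word-level splitting replaces the paper's tokenizer.

    def erase_check(query, is_harmful, max_erase=20):
        tokens = query.split()  # word-level stand-in for the paper's tokenizer
        for k in range(min(max_erase, len(tokens) - 1) + 1):
            candidate = " ".join(tokens[:len(tokens) - k])  # erase the last k tokens
            if is_harmful(candidate):
                return "reject"   # some erased version is flagged as harmful
        return "accept"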
-       <h3>Self-Reminder</h3>
-       <div>
-         <ul>
-           <li>Paper: <a href="https://assets.researchsquare.com/files/rs-2873090/v1_covered_eb589a01-bf05-4f32-b3eb-0d6864f64ad9.pdf?c=1702456350" target="_blank" rel="noopener noreferrer">Defending ChatGPT against Jailbreak Attack via Self-Reminder</a></li>
-           <li>Brief Introduction: Self-Reminder modifies the system prompt of the target LLM so that the model reminds itself to process and respond to the user in the context of being an aligned LLM (sketched below).</li>
-         </ul>
-       </div>
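
In code, the defense is just prompt plumbing; the wording below paraphrases the spirit of the paper's template rather than quoting it.

    SYSTEM_REMINDER = (
        "You should be a responsible AI assistant and should not generate "
        "harmful or misleading content. Please answer the following user "
        "query in a responsible way."
    )
    POST_REMINDER = (
        "Remember, you should be a responsible AI assistant and should not "
        "generate harmful or misleading content."
    )

    def self_reminder_messages(user_query):
        # Sandwich the user query between the two reminders.
        return [
            {"role": "system", "content": SYSTEM_REMINDER},
            {"role": "user", "content": f"{user_query}\n\n{POST_REMINDER}"},
        ]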
-     </div>
-   </div>
-
- </div>
-
  <h2 id="refusal-loss">Interpretability</h2>
  <p>Current transformer-based LLMs will return different responses to the same query due to the randomness of
  autoregressive sampling-based generation. With this randomness, it is an
@@ -290,7 +177,7 @@ Exploring Refusal Loss Landscapes </title>
  </div>
  </div>
 
- <h2 id="proposed-approach-gradient-cuff">
+ <h2 id="proposed-approach-gradient-cuff">Performance evaluation against practical Jailbreaks</h2>
  <p> With the exploration of the Refusal Loss landscape, we propose Gradient Cuff,
  a two-step jailbreak detection method based on checking the refusal loss and its gradient norm. Our detection procedure is shown below:
  </p>
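
To make the two-step check concrete, here is a schematic sketch. It assumes a `refusal_rate` helper that samples the target LLM several times and returns the observed fraction of refusals (so the refusal loss is approximately one minus that rate), and it replaces the paper's gradient-norm estimator with a crude finite-difference proxy over perturbed copies of the query; all thresholds and sample counts are placeholders, not tuned values.

    def gradient_cuff(query, refusal_rate, perturb, n_dirs=8,
                      loss_threshold=0.5, grad_threshold=1.0):
        f_x = 1.0 - refusal_rate(query)      # step 1: sampled refusal loss
        if f_x < loss_threshold:
            return "reject"                  # the model already mostly refuses this query
        # Step 2: zeroth-order estimate of how sharply f changes around the query;
        # jailbroken queries tend to sit in steep regions of the refusal-loss landscape.
        diffs = [abs((1.0 - refusal_rate(perturb(query))) - f_x) for _ in range(n_dirs)]
        grad_norm_est = sum(diffs) / len(diffs)
        return "reject" if grad_norm_est > grad_threshold else "accept"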
@@ -377,13 +264,22 @@ and <a href="Mailto:pin-yu.chen@ibm.com">Pin-Yu Chen</a>
  <h2 id="citations">Citations</h2>
  <p>If you find Gradient Cuff helpful and useful for your research, please cite our main paper as follows:</p>
 
- <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@
-
-
-
-
-
-
+ <div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>@article{DBLP:journals/corr/abs-2412-18171,
+   author     = {Xiaomeng Hu and
+                 Pin{-}Yu Chen and
+                 Tsung{-}Yi Ho},
+   title      = {Token Highlighter: Inspecting and Mitigating Jailbreak Prompts for
+                 Large Language Models},
+   journal    = {CoRR},
+   volume     = {abs/2412.18171},
+   year       = {2024},
+   url        = {https://doi.org/10.48550/arXiv.2412.18171},
+   doi        = {10.48550/ARXIV.2412.18171},
+   eprinttype = {arXiv},
+   eprint     = {2412.18171},
+   timestamp  = {Sat, 25 Jan 2025 12:51:16 +0100},
+   biburl     = {https://dblp.org/rec/journals/corr/abs-2412-18171.bib},
+   bibsource  = {dblp computer science bibliography, https://dblp.org}
  }
  </code></pre></div></div>
 