Spaces: TrustSafeAI/GradientCuff-Jailbreak-Defense

gregH committed on Feb 29, 2024

Commit 582057e (verified) · 1 Parent(s): ea6ff45

Update index.html

Files changed (1)

index.html CHANGED

@@ -83,7 +83,7 @@ Exploring Refusal Loss Landscapes </title>
   interesting phenomenon that a malicious user query will sometimes be rejected by the target LLM, but
   sometimes be able to bypass the safety guardrail. Based on this observation, we propose a new concept called <strong>Refusal Loss</strong> to
   represent the probability with which the LLM won't reject the input user query. By using 1 to denote successful jailbroken and 0 to denote
-  the opposite, we compute the empirical refusal loss as the sample mean of the jailbroken results returned from the target LLM.
+  the opposite, we compute the empirical Refusal Loss as the sample mean of the jailbroken results returned from the target LLM.
   <!--Since the refusal loss is not computable, we query the target LLM multiple times using the same query and using the sample
   mean of the Jailbroken results (1 indicates successful jailbreak, 0 indicates the opposite) to approximate the function value. -->
   We visualize the 2-D landscape of the empirical Refusal Loss as below:
@@ -105,7 +105,7 @@ Exploring Refusal Loss Landscapes </title>
 <div id="refusal-loss-formula" class="container">
 <div id="refusal-loss-formula-list" class="row align-items-center formula-list">
   <a href="#Refusal-Loss" class="selected">Refusal Loss Definition</a>
-  <a href="#Refusal-Loss-Approximation">Refusal Loss Approximation</a>
+  <a href="#Refusal-Loss-Approximation">Refusal Loss Computation</a>
   <a href="#Gradient-Estimation">Gradient Estimation</a>
   <div style="clear: both"></div>
 </div>
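
For readers of this diff, the changed paragraph describes the empirical Refusal Loss as a sample mean of 0/1 jailbreak outcomes over repeated queries. A minimal Python sketch of that computation follows. It is not taken from the repository: the names `empirical_refusal_loss`, `is_jailbroken`, `generate`, and the `REFUSAL_MARKERS` keyword list are all hypothetical, and the keyword-matching judge is only an assumed stand-in for however the project actually decides whether a response is a refusal.

```python
from typing import Callable, Sequence

# Hypothetical refusal phrases; the real project may judge refusals differently.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "as an ai")


def is_jailbroken(response: str, markers: Sequence[str] = REFUSAL_MARKERS) -> int:
    """Return 1 if the response contains no refusal marker (jailbroken), else 0."""
    lowered = response.lower()
    return 0 if any(marker in lowered for marker in markers) else 1


def empirical_refusal_loss(
    query: str,
    generate: Callable[[str], str],  # assumed: samples one response from the target LLM
    n_samples: int = 10,
) -> float:
    """Sample mean of the 0/1 jailbreak outcomes over n_samples responses to one query."""
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    outcomes = [is_jailbroken(generate(query)) for _ in range(n_samples)]
    return sum(outcomes) / n_samples
```

Here `generate` is assumed to sample stochastically, so that repeated calls with the same query can return different responses; with deterministic decoding every sample would be identical and the mean would collapse to 0 or 1.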