Spaces: evaluate-metric/perplexity

lvwerra (HF Staff) committed on May 28, 2022

Commit 9ea7bd1, 1 parent: a508230

Update Space (evaluate main: b2a25b3f)

Files changed (3)

README.md +11 -9
app.py +1 -1
perplexity.py +3 -2

README.md

@@ -1,6 +1,6 @@
 ---
 title: Perplexity
-emoji: 🤗
 colorFrom: blue
 colorTo: red
 sdk: gradio
@@ -15,11 +15,12 @@ tags:
 # Metric Card for Perplexity
 ## Metric Description
-Given a model and an input text sequence, perplexity measures how likely the model is to generate the input text sequence. This can be used in two main ways:
-1. to evaluate how well the model has learned the distribution of the text it was trained on
-    - In this case, the model input should be the trained model to be evaluated, and the input texts should be the text that the model was trained on.
-2. to evaluate how well a selection of text matches the distribution of text that the input model was trained on
-    - In this case, the model input should be a trained model, and the input texts should be the text to be evaluated.
 ## Intended Uses
 Any language generation task.
@@ -30,7 +31,7 @@ The metric takes a list of text as input, as well as the name of the model used
 ```python
 from evaluate import load
-perplexity = load("perplexity")
 results = perplexity.compute(input_texts=input_texts, model_id='gpt2')
 ```
@@ -58,7 +59,7 @@ This metric's range is 0 and up. A lower score is better.
 ### Examples
 Calculating perplexity on input_texts defined here:
 ```python
-perplexity = evaluate.load("perplexity")
 input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]
 results = perplexity.compute(model_id='gpt2',
                              add_start_token=False,
@@ -72,7 +73,7 @@ print(round(results["perplexities"][0], 2))
 ```
 Calculating perplexity on input_texts loaded in from a dataset:
 ```python
-perplexity = evaluate.load("perplexity")
 input_texts = datasets.load_dataset("wikitext",
                                     "wikitext-2-raw-v1",
                                     split="test")["text"][:50]
@@ -90,6 +91,7 @@ print(round(results["perplexities"][0], 2))
 ## Limitations and Bias
 Note that the output value is based heavily on what text the model was trained on. This means that perplexity scores are not comparable between models or datasets.
 ## Citation

 ---
 title: Perplexity
+emoji: 🤗
 colorFrom: blue
 colorTo: red
 sdk: gradio
 # Metric Card for Perplexity
 ## Metric Description
+Given a model and an input text sequence, perplexity measures how likely the model is to generate the input text sequence.
+As a metric, it can be used to evaluate how well the model has learned the distribution of the text it was trained on.
+In this case, the model input should be the trained model to be evaluated, and the input texts should be the text that the model was trained on.
 ## Intended Uses
 Any language generation task.
 ```python
 from evaluate import load
+perplexity = load("perplexity", module_type="metric")
 results = perplexity.compute(input_texts=input_texts, model_id='gpt2')
 ```
 ### Examples
 Calculating perplexity on input_texts defined here:
 ```python
+perplexity = evaluate.load("perplexity", module_type="metric")
 input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]
 results = perplexity.compute(model_id='gpt2',
                              add_start_token=False,
 ```
 Calculating perplexity on input_texts loaded in from a dataset:
 ```python
+perplexity = evaluate.load("perplexity", module_type="metric")
 input_texts = datasets.load_dataset("wikitext",
                                     "wikitext-2-raw-v1",
                                     split="test")["text"][:50]
 ## Limitations and Bias
 Note that the output value is based heavily on what text the model was trained on. This means that perplexity scores are not comparable between models or datasets.
+See Meister and Cotterell, ["Language Model Evaluation Beyond Perplexity"]( https://arxiv.org/abs/2106.00085) (2021) for more information about alternative model evaluation strategies.
 ## Citation
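
Note for readers of this diff: the Metric Description in the README says perplexity measures how likely the model is to generate the input text, but does not spell out the arithmetic. The sketch below uses the standard textbook definition (the exponential of the mean negative log-likelihood per token, with natural logarithms). It is an illustration only, not the code in `perplexity.py`; the function name `perplexity_from_log_probs` and the example log-probabilities are assumptions made up for this note.

```python
import math


def perplexity_from_log_probs(token_log_probs):
    """Return exp(mean negative log-likelihood) for one sequence.

    token_log_probs: natural-log probabilities the model assigned to each
    observed token (hypothetical values here, not real model output).
    """
    if not token_log_probs:
        raise ValueError("need at least one token log-probability")
    mean_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(mean_nll)


# A model that gives every observed token probability 0.25 behaves like a
# uniform choice among four options, so its perplexity is close to 4.
uniform_four = [math.log(0.25)] * 4
print(perplexity_from_log_probs(uniform_four))

# Higher probability on the observed tokens gives lower perplexity.
confident = [math.log(0.9)] * 4
print(perplexity_from_log_probs(confident) < perplexity_from_log_probs(uniform_four))
```

Under this definition a lower value means the model found the text more predictable, which matches the README's "a lower score is better".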

app.py

@@ -2,5 +2,5 @@ import evaluate
 from evaluate.utils import launch_gradio_widget
-module = evaluate.load("perplexity")
 launch_gradio_widget(module)

 from evaluate.utils import launch_gradio_widget
+module = evaluate.load("perplexity", module_type="metric")
 launch_gradio_widget(module)

perplexity.py

@@ -56,7 +56,7 @@ Returns:
         max length for the perplexity computation.
 Examples:
     Example 1:
-        >>> perplexity = evaluate.load("perplexity")
         >>> input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]
         >>> results = perplexity.compute(model_id='gpt2',
         ...                              add_start_token=False,
@@ -70,7 +70,7 @@ Examples:
     Example 2:
         >>> from datasets import load_dataset
-        >>> perplexity = evaluate.load("perplexity")
         >>> input_texts = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"][:10] # doctest: +SKIP
         >>> input_texts = [s for s in input_texts if s!='']
         >>> results = perplexity.compute(model_id='gpt2',
@@ -88,6 +88,7 @@ Examples:
 class Perplexity(evaluate.EvaluationModule):
     def _info(self):
         return evaluate.EvaluationModuleInfo(
             description=_DESCRIPTION,
             citation=_CITATION,
             inputs_description=_KWARGS_DESCRIPTION,

         max length for the perplexity computation.
 Examples:
     Example 1:
+        >>> perplexity = evaluate.load("perplexity", module_type="metric")
         >>> input_texts = ["lorem ipsum", "Happy Birthday!", "Bienvenue"]
         >>> results = perplexity.compute(model_id='gpt2',
         ...                              add_start_token=False,
     Example 2:
         >>> from datasets import load_dataset
+        >>> perplexity = evaluate.load("perplexity", module_type="metric")
         >>> input_texts = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"][:10] # doctest: +SKIP
         >>> input_texts = [s for s in input_texts if s!='']
         >>> results = perplexity.compute(model_id='gpt2',
 class Perplexity(evaluate.EvaluationModule):
     def _info(self):
         return evaluate.EvaluationModuleInfo(
+            module_type="metric",
             description=_DESCRIPTION,
             citation=_CITATION,
             inputs_description=_KWARGS_DESCRIPTION,