Small Changes

Files changed:
- src/about.py: +2 -2
- src/tasks.py: +10 -10
src/about.py

@@ -93,8 +93,8 @@ TITLE = """<h1 align="center" id="space-title">🚀 EVALITA-LLM Leaderboard 🚀
 INTRODUCTION_TEXT = """
 Evalita-LLM is a benchmark designed to evaluate Large Language Models (LLMs) on Italian tasks. The distinguishing features of Evalita-LLM are the following: (i) **all tasks are native Italian**, avoiding translation issues and potential cultural biases; (ii) the benchmark includes **generative** tasks, enabling more natural interaction with LLMs; (iii) **all tasks are evaluated against multiple prompts**, thereby mitigating the model's sensitivity to specific prompts and allowing a fairer evaluation.
 
-**<small>Multiple Choice:</small>** <small> 📊TE (Textual Entailment), 😃SA (Sentiment Analysis), ⚠️HS (Hate Speech Detection), 🏥AT (Admission Test), 🔤WIC (Word in Context), ❓FAQ (Frequently Asked Questions) </small><br>
-**<small>Generative:</small>** <small>🔄LS (Lexical Substitution), 📝SU (Summarization), 🏷️NER (Named Entity Recognition), 🔗REL (Relation Extraction) </small>
+**<small>Multiple-choice tasks:</small>** <small> 📊TE (Textual Entailment), 😃SA (Sentiment Analysis), ⚠️HS (Hate Speech Detection), 🏥AT (Admission Test), 🔤WIC (Word in Context), ❓FAQ (Frequently Asked Questions) </small><br>
+**<small>Generative tasks:</small>** <small>🔄LS (Lexical Substitution), 📝SU (Summarization), 🏷️NER (Named Entity Recognition), 🔗REL (Relation Extraction) </small>
 """
 
 # Which evaluations are you running? how can people reproduce what you have?
src/tasks.py

@@ -23,7 +23,7 @@ Evalita-LLM is a benchmark designed to evaluate Large Language Models (LLMs) on
 MEASURE_DESCRIPTION = "**Combined Performance** = (1 - (**Best Prompt** - **Prompt Average**) / 100) * **Best Prompt**. **Prompt Average** = accuracy averaged over the assessed prompts. **Best Prompt** = accuracy of the best prompt. **Prompt ID** = ID of the best prompt (see legend above)."
 
 # Tasks Descriptions
-TE_DESCRIPTION = """### Textual Entailment (TE) *(Multiple Choice)*
+TE_DESCRIPTION = """### Textual Entailment (TE) --- *Multiple-choice task*
 The input consists of two sentences: the text (T) and the hypothesis (H). The model has to determine whether the meaning of the hypothesis is logically entailed by the text.
 
 | # | Prompt | Answer Choices |
@@ -39,7 +39,7 @@ TE_DESCRIPTION = """### Textual Entailment (TE) *(Multiple Choice)*
 
 """
 
-SA_DESCRIPTION = """### Sentiment Analysis (SA) *(Multiple Choice)*
+SA_DESCRIPTION = """### Sentiment Analysis (SA) --- *Multiple-choice task*
 The input is a tweet. The model has to determine the sentiment polarity of the text, categorizing it into one of four classes: positive, negative, neutral, or mixed.
 
 | # | Prompt | Answer Choices |

@@ -55,7 +55,7 @@ SA_DESCRIPTION = """### Sentiment Analysis (SA) *(Multiple Choice)*
 
 """
 
-HS_DESCRIPTION = """### Hate Speech (HS) *(Multiple Choice)*
+HS_DESCRIPTION = """### Hate Speech (HS) --- *Multiple-choice task*
 The input is a tweet. The model has to determine whether the text contains hateful content directed towards marginalized or minority groups. The output is a binary classification: hateful or not hateful.
 
 | # | Prompt | Answer Choices |

@@ -71,7 +71,7 @@ HS_DESCRIPTION = """### Hate Speech (HS) *(Multiple Choice)*
 
 """
 
-AT_DESCRIPTION = """### Admission Tests (AT) *(Multiple Choice)*
+AT_DESCRIPTION = """### Admission Tests (AT) --- *Multiple-choice task*
 The input is a multiple-choice question with five options (A-E) from Italian medical specialty entrance exams, and the model must identify the correct answer.
 
 | # | Prompt | Answer Choices |

@@ -87,7 +87,7 @@ AT_DESCRIPTION = """### Admission Tests (AT) *(Multiple Choice)*
 
 """
 
-WIC_DESCRIPTION = """### Word in Context (WIC) *(Multiple Choice)*
+WIC_DESCRIPTION = """### Word in Context (WIC) --- *Multiple-choice task*
 The input consists of a word (w) and two sentences. The model has to determine whether the word w has the same meaning in both sentences. The output is a binary classification: 1 (same meaning) or 0 (different meaning).
 
 | # | Prompt | Answer Choices |

@@ -103,7 +103,7 @@ WIC_DESCRIPTION = """### Word in Context (WIC) *(Multiple Choice)*
 
 """
 
-FAQ_DESCRIPTION = """### Frequently Asked Questions & Question Answering (FAQ) *(Multiple Choice)*
+FAQ_DESCRIPTION = """### Frequently Asked Questions & Question Answering (FAQ) --- *Multiple-choice task*
 The input is a user query regarding the water supply service. The model must identify the correct answer from the 4 available options.
 
 | # | Prompt | Answer Choices |

@@ -119,7 +119,7 @@ FAQ_DESCRIPTION = """### Frequently Asked Questions & Question Answering (FAQ) *(Multiple Choice)*
 
 """
 
-LS_DESCRIPTION = """### Lexical Substitution (LS) *(Generative)*
+LS_DESCRIPTION = """### Lexical Substitution (LS) --- *Generative task*
 The input is a sentence containing a target word (w). The model has to replace the target word w with the most suitable synonyms that are contextually relevant.
 
 | # | Prompt |

@@ -131,7 +131,7 @@ LS_DESCRIPTION = """### Lexical Substitution (LS) *(Generative)*
 
 """
 
-SU_DESCRIPTION = """### Summarization (SUM) *(Generative)*
+SU_DESCRIPTION = """### Summarization (SUM) --- *Generative task*
 The input is a news article. The model has to generate a concise summary of the input text, capturing the key information and main points.
 
 | # | Prompt |

@@ -143,7 +143,7 @@ SU_DESCRIPTION = """### Summarization (SUM) *(Generative)*
 
 """
 
-NER_DESCRIPTION = """### Named Entity Recognition (NER) *(Generative)*
+NER_DESCRIPTION = """### Named Entity Recognition (NER) --- *Generative task*
 The input is a sentence. The model has to identify and classify Named Entities into predefined categories such as person, organization, and location.
 
 | # | Prompt |

@@ -155,7 +155,7 @@ NER_DESCRIPTION = """### Named Entity Recognition (NER) *(Generative)*
 
 """
 
-REL_DESCRIPTION = """### Relation Extraction (REL) *(Generative)*
+REL_DESCRIPTION = """### Relation Extraction (REL) --- *Generative task*
 The input is a sentence from a clinical text. The model must identify and extract relationships between laboratory test results (e.g., blood pressure) and the corresponding tests or procedures that generated them (e.g., blood pressure test).
 
 | # | Prompt |
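
The Combined Performance formula quoted in MEASURE_DESCRIPTION above is easy to sanity-check. Below is a minimal sketch, assuming per-prompt accuracies on a 0-100 scale; the function name and the sample numbers are illustrative and not taken from the repository.

```python
# Minimal sketch of the Combined Performance measure from MEASURE_DESCRIPTION.
# Assumes per-prompt accuracies on a 0-100 scale; the function name and the
# sample numbers are illustrative, not taken from the repository.

def combined_performance(prompt_accuracies: list[float]) -> float:
    """(1 - (Best Prompt - Prompt Average) / 100) * Best Prompt."""
    prompt_average = sum(prompt_accuracies) / len(prompt_accuracies)
    best_prompt = max(prompt_accuracies)
    return (1 - (best_prompt - prompt_average) / 100) * best_prompt

# An unstable model: best prompt 70, prompt average 50 -> heavily discounted.
print(round(combined_performance([50, 45, 70, 40, 55, 40]), 2))  # 56.0
# A consistent model: best prompt 66, prompt average 65 -> barely discounted.
print(round(combined_performance([65, 64, 66, 65, 65, 65]), 2))  # 65.34
```

The gap between Best Prompt and Prompt Average acts as a consistency penalty, so the second, more stable model comes out ahead despite its lower best prompt. This is how the measure implements point (iii) of the introduction: evaluating against multiple prompts to mitigate prompt sensitivity.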