typos
app/page.js (+5 -5)
@@ -9,22 +9,22 @@ export default async function Leaderboard() {
     <>
       <p>
         Traditional LLMs benchmarks have drawbacks: they quickly become part of
-        training datasets and are hard to relate
+        training datasets and are hard to relate to in terms of real-world
         use-cases.
       </p>
       <p>
-        I made this as an experiment to address these issues. Here the dataset
+        I made this as an experiment to address these issues. Here, the dataset
         is dynamic (changes every week) and composed of crowdsourced real-world
         prompts.
       </p>
       <p>
         We then use GPT-4 to grade each model's response against a set of
         rubrics (more details on the about page). The prompt dataset is easily
-        explorable
+        explorable.
       </p>
       <p>
-
-        results.
+        Everything is then stored in a Postgres database and this page shows the
+        raw results.
       </p>

       <br />
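The hunk header shows this copy lives inside an async server component, export default async function Leaderboard(), and the new text says the graded results are stored in Postgres and rendered on this page. The commit itself only touches the paragraph copy, so as a rough sketch only, the data fetch in such a component could look something like the following, assuming a hypothetical results table and the pg client with a DATABASE_URL connection string (none of which appear in this diff):

// app/page.js — hypothetical sketch of the data fetch; the Space's actual
// schema, client, and query are not shown in this commit.
import { Pool } from "pg";

// Assumed: a Postgres connection string provided via the environment.
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

export default async function Leaderboard() {
  // Async server component: the query runs on the server at render time.
  // The "results" table and its columns are assumptions, not from the repo.
  const { rows } = await pool.query(
    "SELECT model, prompt, response, score FROM results ORDER BY score DESC"
  );

  return (
    <>
      {/* ...intro copy from the diff above... */}
      <table>
        <tbody>
          {rows.map((r, i) => (
            <tr key={i}>
              <td>{r.model}</td>
              <td>{r.score}</td>
            </tr>
          ))}
        </tbody>
      </table>
    </>
  );
}

Only the paragraph text shown in the diff is taken from the commit; the query, table layout, and rendering above are illustrative.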