import { getModels } from "@/utils/db"
import Link from "next/link"

export default async function About() {
  const models = await getModels()
  const count = models.length

  return (
    <>
      <p>"When a measure becomes a target, it ceases to be a good measure."</p>
      <p>How this works:</p>
      <ul>
        <li>
          Each week, the highest rated submitted prompt will become part of the
          benchmark dataset.
        </li>
        <li>Prompts are run against {count} models with a temperature of 0.</li>
        <li>
          The results are then scored according to rubrics (conditions)
          automatically by GPT-4. For example, for the{" "}
          <Link href="/prompts/taiwan">Taiwan prompt</Link>, the rubrics are:
          <ul>
            <li>
              2 points for mentioning Taiwan being a (de facto) independent
              country
            </li>
            <li>1 point for mentioning the CCP claim on Taiwan</li>
            <li>
              2 points for mentioning that most of the world's countries do not
              officially recognise Taiwan as independent
            </li>
          </ul>
        </li>
        <li>score = ( sum of points won / sum of possible points ) * 100</li>
      </ul>
      <br />
      <p>Comments on rubrics:</p>
      <ul>
        <li>Rubrics for each prompt can be seen on their page.</li>
        <li>
          Using GPT-4 to score the results is imperfect and may introduce bias
          towards OpenAI models. It also doesn't reward out-of-the-box answers.
          Ideas welcome here.
        </li>
        <li>
          Rubrics are currently added manually by me, but I'm working on a way
          to crowdsource this.
        </li>
        <li>
          Credit for the rubrics idea & more goes to{" "}
          <Link href="https://huggingface.co/aliabid94">Ali Abid</Link> @
          Huggingface.
        </li>
      </ul>
      <br />
      <p>Notes:</p>
      <ul>
        <li>
          This is open-source on{" "}
          <a
            href="https://github.com/llmonitor/llm-benchmarks"
            target="_blank"
            rel="noreferrer"
          >
            GitHub
          </a>{" "}
          and{" "}
          <a
            href="https://huggingface.co/spaces/llmonitor/benchmarks"
            target="_blank"
            rel="noreferrer"
          >
            Huggingface
          </a>
        </li>
        <li>
          I used a temperature of 0 and a max token limit of 600 (that's why a
          lot of answers are cropped). The rest are default settings.
        </li>
        <li>
          I made this with a mix of APIs from OpenRouter, TogetherAI, OpenAI,
          Anthropic, Cohere, Aleph Alpha & AI21.
        </li>
        <li>
          This is imperfect. Not all prompts are good for grading. There also
          seem to be some problems with stop sequences on TogetherAI models.
        </li>
        <li>Feedback, ideas or say hi: vince [at] llmonitor.com</li>
        <li>
          Shameless plug: I'm building an{" "}
          <a href="https://github.com/llmonitor/llmonitor">
            open-source observability tool for AI devs.
          </a>
        </li>
      </ul>
      <table style={{ maxWidth: 600, margin: "40px 0" }}>
        <tbody>
          <tr>
            <td>
              <p>
                Edit: as this got popular, I added an email form to receive
                notifications for future benchmark results:
              </p>
              <iframe
                src="https://embeds.beehiiv.com/65bd6af1-2dea-417a-baf2-b65bc27e1610?slim=true"
                height="52"
                frameBorder="0"
                scrolling="no"
                style={{
                  width: 400,
                  border: "none",
                  transform: "scale(0.8)",
                  transformOrigin: "left",
                }}
              ></iframe>
              <br />
              <small>(no spam, max 1 email per month)</small>
            </td>
          </tr>
        </tbody>
      </table>
    </>
  )
}
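
// The list above describes how each answer is scored:
//   score = (sum of points won / sum of possible points) * 100
// Below is a minimal sketch of that formula, for illustration only; the
// Rubric type, the function name and the parallel-array shape are
// assumptions, not the repo's actual grading code.
type Rubric = { condition: string; points: number }

function computeScore(rubrics: Rubric[], pointsWon: number[]): number {
  // Sum of all points a perfect answer could earn.
  const possible = rubrics.reduce((sum, r) => sum + r.points, 0)
  // Sum of the points actually awarded (by GPT-4) for this answer.
  const won = pointsWon.reduce((sum, p) => sum + p, 0)
  return possible === 0 ? 0 : (won / possible) * 100
}

// Example with the Taiwan rubrics above (2 + 1 + 2 = 5 possible points):
// an answer awarded 2 + 1 + 0 = 3 points scores (3 / 5) * 100 = 60.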
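
// The notes above mention a temperature of 0 and a max token limit of 600.
// A hedged sketch of what one such generation request could look like,
// using the OpenAI chat completions REST API as an example; the benchmark
// actually mixes several providers (OpenRouter, TogetherAI, Anthropic, etc.),
// so the endpoint and function name here are illustrative assumptions, not
// the repo's actual generation code.
async function runPrompt(model: string, prompt: string): Promise<string> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model,
      temperature: 0, // deterministic-ish generations, as described above
      max_tokens: 600, // answers longer than this get cropped
      messages: [{ role: "user", content: prompt }],
    }),
  })
  const data = await res.json()
  return data.choices[0].message.content
}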