OpenEvals

community

AI & ML interests

LLM evaluation

Recent Activity

thomwolf authored a paper 14 days ago

Robot Learning: A Tutorial

clefourrier updated a Space 19 days ago

OpenEvals/EvalsOnTheHub

clefourrier published a Space 20 days ago

OpenEvals/EvalsOnTheHub

Articles

Gaia2 and ARE: Empowering the community to study agents

Sep 22

Organization Card

Hi! Welcome to the org page of the Evaluation team at HuggingFace. We want to support the community in building and sharing quality evaluations for reproducible and fair model comparisons, to cut through the hype of releases and better understand actual model capabilities.

We're behind the:

lighteval LLM evaluation suite, fast and filled with the SOTA benchmarks you might want
evaluation guidebook, your reference for LLM evals
leaderboards on the hub initiative, to encourage people to build more leaderboards in the open for more reproducible evaluation. You'll find some documentation here to build your own, and you can look for the best leaderboard for your use case here!

Our archived projects:

Open LLM Leaderboard (over 11K models evaluated since 2023)

We're not behind the evaluate metrics guide, but if you want to understand metrics better, we really recommend checking it out!

Explore and discover all leaderboards from the HF community

Run your LLM evaluations on the hub

Generate a command to run model evaluations

models 0

None public yet

datasets 0

None public yet

Collections 5

GAIA: a benchmark for General AI Assistants

Zephyr: Direct Distillation of LM Alignment

SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

Global MMLU: Understanding and Addressing Cultural and Linguistic Biases in Multilingual Evaluation

Find a leaderboard

YourBench

Example Leaderboard Template

Run your LLM evaluations on the hub

spaces 3

Find a leaderboard

Run your LLM evaluations on the hub

Team members 6
