---
title: CodeReviewBench
emoji: π
colorFrom: gray
colorTo: indigo
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
short_description: A comprehensive benchmark for code review.
models:
  - openai/gpt-4o-mini
  - openai/gpt-4o
  - claude-3-7-sonnet
  - deepseek/deepseek-r1
---
# CodeReview Bench Leaderboard

A comprehensive benchmark and leaderboard for code review generation models, inspired by [CodeReviewBench](https://huggingface.co/spaces/your-org/CodeReviewBench).

## Features

- **Multi-Language Support**: Evaluates models across 17+ programming languages, including Python, JavaScript, Java, C++, TypeScript, Go, Rust, and more
- **Dual Comment Languages**: Supports review comments in both Russian and English
- **Comprehensive Metrics**:
  - LLM-based multimetric evaluation (readability, relevance, explanation clarity, problem identification, actionability, completeness, specificity, contextual adequacy, consistency, brevity)
  - Exact-match metrics (pass@1, pass@5, pass@10, BLEU@10)
- **Interactive Visualization**: Compare model performance across categories with radar plots
- **Easy Submission**: Submit your model results via the web interface
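
As an illustration of the radar-plot comparison, here is a minimal matplotlib sketch (the Space's actual plotting code is not shown here; the metric subset and scores below are hypothetical):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical per-dimension scores for one model (0-10 scale).
labels = ["readability", "relevance", "actionability", "completeness", "brevity"]
scores = [8.5, 9.0, 8.7, 8.0, 7.2]

# Evenly spaced angles; repeat the first point to close the polygon.
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
angles += angles[:1]
values = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={"polar": True})
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels)
ax.set_ylim(0, 10)
plt.show()
```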
## Metrics

### LLM-based Multimetric

- **Readability**: How easy the review is to understand
- **Relevance**: How relevant the review is to the code
- **Explanation Clarity**: How clear the explanations are
- **Problem Identification**: How well problems are identified
- **Actionability**: How actionable the suggestions are
- **Completeness**: How complete the review is
- **Specificity**: How specific the feedback is
- **Contextual Adequacy**: How well the review fits the context
- **Consistency**: How consistent the review style is
- **Brevity**: How concise the review is
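
As a rough illustration of how such judging can work, one dimension could be scored like this (a minimal sketch using the OpenAI SDK; the benchmark's actual judge model, prompt, and rubric are assumptions here):

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge_dimension(diff: str, review: str, dimension: str) -> float:
    """Score one quality dimension of a review on a 0-10 scale."""
    prompt = (
        f"Rate the following code review for {dimension} on a 0-10 scale. "
        "Reply with a single number only.\n\n"
        f"Code change:\n{diff}\n\nReview:\n{review}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return float(response.choices[0].message.content.strip())
```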
### Exact-Match Metrics

- **Pass@1**: Fraction of correct reviews on the first attempt (0.0-1.0)
- **Pass@5**: Fraction of correct reviews within the top 5 attempts (0.0-1.0)
- **Pass@10**: Fraction of correct reviews within the top 10 attempts (0.0-1.0)
- **BLEU@10**: BLEU score over the top 10 review candidates (0.0-1.0)
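
Pass@k is commonly computed with the unbiased estimator of Chen et al. (2021); the sketch below assumes that choice, since the exact estimator is not stated here:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples is correct, given n
    generated reviews of which c are correct (unbiased pass@k estimator)."""
    if n - c < k:
        return 1.0  # every size-k sample contains at least one correct review
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# Example: 10 candidate reviews, 3 judged correct.
print(pass_at_k(n=10, c=3, k=1))  # 0.3
print(pass_at_k(n=10, c=3, k=5))  # ~0.917
```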
## Programming Languages Supported

- Python
- JavaScript
- Java
- C++
- C#
- TypeScript
- Go
- Rust
- Swift
- Kotlin
- Ruby
- PHP
- C
- Scala
- R
- Dart
- Other

## Comment Languages

- Russian (ru)
- English (en)

## Example Categories

- Bug Fix
- Code Style
- Performance
- Security
- Refactoring
- Documentation
- Testing
- Architecture
- Other

## Installation

```bash
pip install -r requirements.txt
```

## Usage

```bash
python app.py
```
## Submission Format

Submit your results as a JSONL file where each line contains:

```json
{
  "model_name": "your-model-name",
  "programming_language": "python",
  "comment_language": "en",
  "readability": 8.5,
  "relevance": 9.0,
  "explanation_clarity": 7.8,
  "problem_identification": 8.2,
  "actionability": 8.7,
  "completeness": 8.0,
  "specificity": 7.5,
  "contextual_adequacy": 8.3,
  "consistency": 8.8,
  "brevity": 7.2,
  "pass_at_1": 0.75,
  "pass_at_5": 0.88,
  "pass_at_10": 0.92,
  "bleu_at_10": 0.65,
  "total_evaluations": 100
}
```
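
A submission file can be assembled with a few lines of Python (a sketch; the file name is illustrative and the record below is abbreviated, with most quality-dimension fields omitted):

```python
import json

results = [
    {
        "model_name": "your-model-name",
        "programming_language": "python",
        "comment_language": "en",
        "pass_at_1": 0.75,
        "pass_at_5": 0.88,
        "pass_at_10": 0.92,
        "bleu_at_10": 0.65,
        "total_evaluations": 100,
    },
]

# JSONL: exactly one JSON object per line, no enclosing array.
with open("submission.jsonl", "w", encoding="utf-8") as f:
    for row in results:
        f.write(json.dumps(row) + "\n")
```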
## Environment Variables

Set the required environment variables before launching the app.
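
Typical settings for a Space like this might look as follows (the variable names are assumptions, not configuration confirmed by this project):

```bash
# Assumed variable names; adjust to your deployment.
export OPENAI_API_KEY="..."   # only if LLM-based judging runs inside the Space
export HF_TOKEN="..."         # only if submissions are persisted to a Hugging Face dataset
```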
## Citation

## Additional Features

### Interface Features
- **Multi-tab Interface**: Organized navigation with dedicated sections
- **Advanced Filtering**: Real-time filtering by multiple criteria
- **Dark Theme**: Modern, GitHub-inspired dark interface
- **IP-based Submissions**: Secure submission tracking
- **Comprehensive Analytics**: Detailed performance insights
- **Data Export**: Multiple export formats
- **Rate Limiting**: Anti-spam protection

### 🔧 Technical Improvements

- **Modular Architecture**: Clean separation of concerns
- **Type Safety**: Full type annotations throughout
- **Error Handling**: Comprehensive error handling and logging
- **Data Validation**: Multi-layer validation with Pydantic
- **Performance**: Optimized data processing and display
## 🔒 Security Features

### Rate Limiting

- **5 submissions per IP per 24 hours**
- **Automatic IP tracking and logging**
- **Graceful error handling for rate limits**
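
One way the stated policy could be implemented is a sliding-window counter keyed by IP (a sketch; the app's actual mechanism is not shown here):

```python
import time
from collections import defaultdict, deque

MAX_SUBMISSIONS = 5
WINDOW_SECONDS = 24 * 60 * 60  # 24 hours

_history: defaultdict[str, deque] = defaultdict(deque)

def allow_submission(ip: str) -> bool:
    """Return True and record the attempt if `ip` is under the limit."""
    now = time.time()
    window = _history[ip]
    # Drop attempts that fell out of the 24-hour window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_SUBMISSIONS:
        return False
    window.append(now)
    return True
```

A deque per IP keeps each check amortized O(1), since stale timestamps are discarded as the window slides.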
### Data Validation

- **Model name format validation**
- **Score range validation (0.0-1.0 for performance, 0-10 for quality)**
- **Logical consistency checks (Pass@1 ≤ Pass@5 ≤ Pass@10)**
- **Required field validation**
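
Since the technical notes above mention Pydantic, these rules might be expressed roughly as follows (a sketch; the class name and the subset of fields shown are assumptions):

```python
from pydantic import BaseModel, ConfigDict, Field, model_validator

class Submission(BaseModel):
    # Allow field names starting with "model_" (Pydantic reserves that prefix).
    model_config = ConfigDict(protected_namespaces=())

    model_name: str = Field(pattern=r"^[\w./-]+$")
    pass_at_1: float = Field(ge=0.0, le=1.0)
    pass_at_5: float = Field(ge=0.0, le=1.0)
    pass_at_10: float = Field(ge=0.0, le=1.0)
    readability: float = Field(ge=0.0, le=10.0)  # quality scores use a 0-10 scale

    @model_validator(mode="after")
    def check_pass_monotonic(self) -> "Submission":
        if not (self.pass_at_1 <= self.pass_at_5 <= self.pass_at_10):
            raise ValueError("expected pass@1 <= pass@5 <= pass@10")
        return self
```

An incoming JSONL line can then be checked with `Submission.model_validate(json.loads(line))`.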
### Audit Trail

- **Complete submission logging**
- **IP address tracking (partially masked for privacy)**
- **Timestamp recording**
- **Data integrity checks**
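
Partial masking might look like this (a hypothetical helper; the app's exact scheme is not documented here):

```python
def mask_ip(ip: str) -> str:
    """Hide the last IPv4 octet, e.g. "203.0.113.42" -> "203.0.113.xxx"."""
    parts = ip.split(".")
    if len(parts) == 4:
        return ".".join(parts[:3] + ["xxx"])
    return "<masked>"  # fallback for non-IPv4 inputs
```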
## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## 📄 License

This project is licensed under the MIT License; see the LICENSE file for details.

## 🙏 Acknowledgments

- Inspired by [CodeReviewBench](https://huggingface.co/spaces/your-org/CodeReviewBench)
- Built with [Gradio](https://gradio.app/) for the web interface
- Thanks to the open-source community for tools and inspiration

## 📞 Support

For questions, issues, or contributions:

- Open an issue on GitHub
- Check the documentation
- Contact the maintainers

---

**Built with ❤️ for the code review research community**