---
title: CodeReviewBench
emoji: π
colorFrom: gray
colorTo: indigo
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
short_description: A comprehensive benchmark for code review.
models:
  - openai/gpt-4o-mini
  - openai/gpt-4o
  - claude-3-7-sonnet
  - deepseek/deepseek-r1
---
# CodeReview Bench Leaderboard

A comprehensive benchmark and leaderboard for code review generation models, inspired by [CodeReviewBench](https://huggingface.co/spaces/your-org/CodeReviewBench).

## Features

- **Multi-Language Support**: Evaluates models across 17+ programming languages, including Python, JavaScript, Java, C++, TypeScript, Go, Rust, and more
- **Dual Comment Languages**: Supports review comments in both Russian and English
- **Comprehensive Metrics**:
  - LLM-based multimetric evaluation (readability, relevance, explanation clarity, problem identification, actionability, completeness, specificity, contextual adequacy, consistency, brevity)
  - Exact-match metrics (pass@1, pass@5, pass@10, BLEU@10)
- **Interactive Visualization**: Compare model performance across categories with radar plots
- **Easy Submission**: Submit your model results via the web interface
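
As an illustration of the radar-plot comparison, here is a minimal matplotlib sketch (the Space's actual plotting code is not shown here; the metric subset and scores below are hypothetical):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical per-dimension scores for one model (0-10 scale).
labels = ["readability", "relevance", "actionability", "completeness", "brevity"]
scores = [8.5, 9.0, 8.7, 8.0, 7.2]

# Evenly spaced angles; repeat the first point to close the polygon.
angles = np.linspace(0, 2 * np.pi, len(labels), endpoint=False).tolist()
angles += angles[:1]
values = scores + scores[:1]

fig, ax = plt.subplots(subplot_kw={"polar": True})
ax.plot(angles, values)
ax.fill(angles, values, alpha=0.25)
ax.set_xticks(angles[:-1])
ax.set_xticklabels(labels)
ax.set_ylim(0, 10)
plt.show()
```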
## Metrics

### LLM-based Multimetric

- **Readability**: How easy the review is to understand
- **Relevance**: How relevant the review is to the code
- **Explanation Clarity**: How clear the explanations are
- **Problem Identification**: How well problems are identified
- **Actionability**: How actionable the suggestions are
- **Completeness**: How complete the review is
- **Specificity**: How specific the feedback is
- **Contextual Adequacy**: How well the review fits the context
- **Consistency**: How consistent the review style is
- **Brevity**: How concise the review is
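
As a rough illustration of how such judging can work, one dimension could be scored like this (a minimal sketch using the OpenAI SDK; the benchmark's actual judge model, prompt, and rubric are assumptions here):

```python
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def judge_dimension(diff: str, review: str, dimension: str) -> float:
    """Score one quality dimension of a review on a 0-10 scale."""
    prompt = (
        f"Rate the following code review for {dimension} on a 0-10 scale. "
        "Reply with a single number only.\n\n"
        f"Code change:\n{diff}\n\nReview:\n{review}"
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical judge model
        messages=[{"role": "user", "content": prompt}],
    )
    return float(response.choices[0].message.content.strip())
```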
### Exact-Match Metrics

- **Pass@1**: Fraction of correct reviews on the first attempt (0.0-1.0)
- **Pass@5**: Fraction of correct reviews within the top 5 attempts (0.0-1.0)
- **Pass@10**: Fraction of correct reviews within the top 10 attempts (0.0-1.0)
- **BLEU@10**: BLEU score over the top 10 review candidates (0.0-1.0)
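
Pass@k is commonly computed with the unbiased estimator of Chen et al. (2021); the sketch below assumes that choice, since the exact estimator is not stated here:

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples is correct, given n
    generated reviews of which c are correct (unbiased pass@k estimator)."""
    if n - c < k:
        return 1.0  # every size-k sample contains at least one correct review
    return 1.0 - math.prod((n - c - i) / (n - i) for i in range(k))

# Example: 10 candidate reviews, 3 judged correct.
print(pass_at_k(n=10, c=3, k=1))  # 0.3
print(pass_at_k(n=10, c=3, k=5))  # ~0.917
```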
## Programming Languages Supported

- Python
- JavaScript
- Java
- C++
- C#
- TypeScript
- Go
- Rust
- Swift
- Kotlin
- Ruby
- PHP
- C
- Scala
- R
- Dart
- Other

## Comment Languages

- Russian (ru)
- English (en)

## Example Categories

- Bug Fix
- Code Style
- Performance
- Security
- Refactoring
- Documentation
- Testing
- Architecture
- Other

## Installation

```bash
pip install -r requirements.txt
```

## Usage

```bash
python app.py
```
## Submission Format

Submit your results as a JSONL file where each line contains:

```json
{
  "model_name": "your-model-name",
  "programming_language": "python",
  "comment_language": "en",
  "readability": 8.5,
  "relevance": 9.0,
  "explanation_clarity": 7.8,
  "problem_identification": 8.2,
  "actionability": 8.7,
  "completeness": 8.0,
  "specificity": 7.5,
  "contextual_adequacy": 8.3,
  "consistency": 8.8,
  "brevity": 7.2,
  "pass_at_1": 0.75,
  "pass_at_5": 0.88,
  "pass_at_10": 0.92,
  "bleu_at_10": 0.65,
  "total_evaluations": 100
}
```
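
A submission file can be assembled with a few lines of Python (a sketch; the file name is illustrative and the record below is abbreviated, with most quality-dimension fields omitted):

```python
import json

results = [
    {
        "model_name": "your-model-name",
        "programming_language": "python",
        "comment_language": "en",
        "pass_at_1": 0.75,
        "pass_at_5": 0.88,
        "pass_at_10": 0.92,
        "bleu_at_10": 0.65,
        "total_evaluations": 100,
    },
]

# JSONL: exactly one JSON object per line, no enclosing array.
with open("submission.jsonl", "w", encoding="utf-8") as f:
    for row in results:
        f.write(json.dumps(row) + "\n")
```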
## Environment Variables

Set the required environment variables before launching the app.
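
Typical settings for a Space like this might look as follows (the variable names are assumptions, not configuration confirmed by this project):

```bash
# Assumed variable names; adjust to your deployment.
export OPENAI_API_KEY="..."   # only if LLM-based judging runs inside the Space
export HF_TOKEN="..."         # only if submissions are persisted to a Hugging Face dataset
```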
## Citation

## Additional Features

### Interface Features
- **Multi-tab Interface**: Organized navigation with dedicated sections
- **Advanced Filtering**: Real-time filtering by multiple criteria
- **Dark Theme**: Modern, GitHub-inspired dark interface
- **IP-based Submissions**: Secure submission tracking
- **Comprehensive Analytics**: Detailed performance insights
- **Data Export**: Multiple export formats
- **Rate Limiting**: Anti-spam protection

### 🔧 Technical Improvements

- **Modular Architecture**: Clean separation of concerns
- **Type Safety**: Full type annotations throughout
- **Error Handling**: Comprehensive error handling and logging
- **Data Validation**: Multi-layer validation with Pydantic
- **Performance**: Optimized data processing and display
## 🔒 Security Features

### Rate Limiting

- **5 submissions per IP per 24 hours**
- **Automatic IP tracking and logging**
- **Graceful error handling for rate limits**
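
One way the stated policy could be implemented is a sliding-window counter keyed by IP (a sketch; the app's actual mechanism is not shown here):

```python
import time
from collections import defaultdict, deque

MAX_SUBMISSIONS = 5
WINDOW_SECONDS = 24 * 60 * 60  # 24 hours

_history: defaultdict[str, deque] = defaultdict(deque)

def allow_submission(ip: str) -> bool:
    """Return True and record the attempt if `ip` is under the limit."""
    now = time.time()
    window = _history[ip]
    # Drop attempts that fell out of the 24-hour window.
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()
    if len(window) >= MAX_SUBMISSIONS:
        return False
    window.append(now)
    return True
```

A deque per IP keeps each check amortized O(1), since stale timestamps are discarded as the window slides.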
### Data Validation

- **Model name format validation**
- **Score range validation (0.0-1.0 for performance, 0-10 for quality)**
- **Logical consistency checks (Pass@1 ≤ Pass@5 ≤ Pass@10)**
- **Required field validation**
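
Since the technical notes above mention Pydantic, these rules might be expressed roughly as follows (a sketch; the class name and the subset of fields shown are assumptions):

```python
from pydantic import BaseModel, ConfigDict, Field, model_validator

class Submission(BaseModel):
    # Allow field names starting with "model_" (Pydantic reserves that prefix).
    model_config = ConfigDict(protected_namespaces=())

    model_name: str = Field(pattern=r"^[\w./-]+$")
    pass_at_1: float = Field(ge=0.0, le=1.0)
    pass_at_5: float = Field(ge=0.0, le=1.0)
    pass_at_10: float = Field(ge=0.0, le=1.0)
    readability: float = Field(ge=0.0, le=10.0)  # quality scores use a 0-10 scale

    @model_validator(mode="after")
    def check_pass_monotonic(self) -> "Submission":
        if not (self.pass_at_1 <= self.pass_at_5 <= self.pass_at_10):
            raise ValueError("expected pass@1 <= pass@5 <= pass@10")
        return self
```

An incoming JSONL line can then be checked with `Submission.model_validate(json.loads(line))`.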
### Audit Trail

- **Complete submission logging**
- **IP address tracking (partially masked for privacy)**
- **Timestamp recording**
- **Data integrity checks**
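
Partial masking might look like this (a hypothetical helper; the app's exact scheme is not documented here):

```python
def mask_ip(ip: str) -> str:
    """Hide the last IPv4 octet, e.g. "203.0.113.42" -> "203.0.113.xxx"."""
    parts = ip.split(".")
    if len(parts) == 4:
        return ".".join(parts[:3] + ["xxx"])
    return "<masked>"  # fallback for non-IPv4 inputs
```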
## 🤝 Contributing

1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests if applicable
5. Submit a pull request

## 📄 License

This project is licensed under the MIT License; see the LICENSE file for details.

## 🙏 Acknowledgments

- Inspired by [CodeReviewBench](https://huggingface.co/spaces/your-org/CodeReviewBench)
- Built with [Gradio](https://gradio.app/) for the web interface
- Thanks to the open-source community for tools and inspiration

## 📞 Support

For questions, issues, or contributions:

- Open an issue on GitHub
- Check the documentation
- Contact the maintainers

---

**Built with ❤️ for the code review research community**