Coding Dataset

Dataset for training AI coding agents. The current release is a small demo.

Dataset Summary

Total Examples: 6 (demo)
Languages: Python, JavaScript, Java
Task Types: Code Generation
License: CC0-1.0

Dataset Structure

Data Splits

train: 70% of data
validation: 15% of data
test: 15% of data
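
With only 6 demo examples, the 70/15/15 ratios cannot be met exactly. The sketch below shows one way per-split counts could be derived from the ratios. The helper name `split_sizes` and the rounding rule (round validation and test, give train the remainder) are assumptions for illustration, not a description of how this dataset was actually split.

```python
def split_sizes(total, val_ratio=0.15, test_ratio=0.15):
    """Return (train, validation, test) counts for `total` examples.

    Assumption: validation and test are rounded to the nearest integer
    and train receives whatever is left, so the counts always sum to total.
    """
    val = round(total * val_ratio)
    test = round(total * test_ratio)
    train = total - val - test
    return train, val, test


print(split_sizes(6))    # (4, 1, 1)
print(split_sizes(100))  # (70, 15, 15)
```

Under this rule the demo would hold roughly 4 training examples and 1 each for validation and test, so the effective ratios are closer to 67/17/17.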

Features

id (string): Unique identifier
code (string): Source code snippet
code_description (string): Natural language description
programming_language (string): Language (python, javascript, java, etc.)
task_type (string): Type of task
difficulty_level (string): Difficulty (beginner, intermediate, advanced, expert)
quality_score (float): Quality score from 0.0 to 1.0
is_tested (bool): Whether the code has been tested
has_bugs (bool): Whether the code has known bugs
lines_of_code (int): Number of lines in the code snippet
collected_at (string): Collection timestamp
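
The feature list above can be checked mechanically. The following is a minimal sketch of a record validator, assuming each example arrives as a plain Python dict. The helper `validate_record` and all values in the sample record are hypothetical, and the formats shown for `task_type` and `collected_at` are guesses.

```python
# Field names and types as listed in the Features section.
EXPECTED_TYPES = {
    "id": str,
    "code": str,
    "code_description": str,
    "programming_language": str,
    "task_type": str,
    "difficulty_level": str,
    "quality_score": float,
    "is_tested": bool,
    "has_bugs": bool,
    "lines_of_code": int,
    "collected_at": str,
}
DIFFICULTY_LEVELS = {"beginner", "intermediate", "advanced", "expert"}


def validate_record(record):
    """Return a list of problems; an empty list means the record matches."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it for int fields.
        if expected is int and isinstance(value, bool):
            problems.append(f"{name}: expected int, got bool")
        elif not isinstance(value, expected):
            problems.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    score = record.get("quality_score")
    if isinstance(score, float) and not 0.0 <= score <= 1.0:
        problems.append("quality_score: outside 0.0-1.0")
    level = record.get("difficulty_level")
    if isinstance(level, str) and level not in DIFFICULTY_LEVELS:
        problems.append(f"difficulty_level: unknown value {level!r}")
    return problems


# Made-up record for illustration only; not taken from the dataset.
sample = {
    "id": "ex-001",
    "code": "print('hi')",
    "code_description": "Print a greeting",
    "programming_language": "python",
    "task_type": "code_generation",
    "difficulty_level": "beginner",
    "quality_score": 0.9,
    "is_tested": True,
    "has_bugs": False,
    "lines_of_code": 1,
    "collected_at": "2025-10-25T00:00:00Z",
}

print(validate_record(sample))
print(validate_record(dict(sample, quality_score=1.5, lines_of_code=True)))
```

The language field is deliberately not checked against a closed set, since the feature description ends with "etc.".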

Usage

from datasets import load_dataset

# Load dataset
dataset = load_dataset("romcmu863/code-dataset")

# Access splits
train = dataset['train']
validation = dataset['validation']
test = dataset['test']

# Get first example
example = train[0]
print(example['code_description'])
print(example['code'])
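
The metadata fields also allow selecting a subset before training, for example keeping only tested snippets with no known bugs. The sketch below uses plain dicts with made-up values so it runs without downloading anything. The predicate name `is_clean` and the 0.8 threshold are arbitrary choices for illustration. The same predicate could be passed to `Dataset.filter` from the `datasets` library.

```python
def is_clean(example, min_quality=0.8):
    """True for tested examples without known bugs and a high enough score."""
    return (
        example["is_tested"]
        and not example["has_bugs"]
        and example["quality_score"] >= min_quality
    )


# Made-up rows carrying only the fields the predicate reads.
rows = [
    {"id": "a", "is_tested": True, "has_bugs": False, "quality_score": 0.95},
    {"id": "b", "is_tested": True, "has_bugs": True, "quality_score": 0.90},
    {"id": "c", "is_tested": False, "has_bugs": False, "quality_score": 0.85},
    {"id": "d", "is_tested": True, "has_bugs": False, "quality_score": 0.50},
]

clean = [row for row in rows if is_clean(row)]
print([row["id"] for row in clean])  # ['a']
```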

License

CC0-1.0

Created

2025-10-25
