Coding Dataset

A small demo dataset for training AI coding agents.

Dataset Summary

  • Total Examples: 6 (demo)
  • Languages: Python, JavaScript, Java
  • Task Types: Code Generation
  • License: CC0-1.0

Dataset Structure

Data Splits

  • train: 70% of data
  • validation: 15% of data
  • test: 15% of data
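
As a rough illustration of the proportions above, the sketch below splits a range of indices 70/15/15. The function name `split_70_15_15` and the fixed seed are hypothetical; the card does not say how the published splits were actually produced.

```python
import random


def split_70_15_15(n_items, seed=0):
    """Shuffle indices 0..n_items-1 and cut them into 70/15/15 parts."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # deterministic for a given seed

    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = n_items * 70 // 100
    n_val = n_items * 15 // 100

    train = indices[:n_train]
    validation = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]  # remainder goes to test
    return train, validation, test


train_idx, val_idx, test_idx = split_70_15_15(20)
print(len(train_idx), len(val_idx), len(test_idx))
```

Because the sizes are rounded down, very small datasets only approximate the stated shares, and any remainder lands in the test part.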

Features

  • id (string): Unique identifier
  • code (string): Source code snippet
  • code_description (string): Natural language description
  • programming_language (string): Language (python, javascript, java, etc.)
  • task_type (string): Type of task
  • difficulty_level (string): Difficulty (beginner, intermediate, advanced, expert)
  • quality_score (float): Quality score from 0.0 to 1.0
  • is_tested (bool): Whether the code has been tested
  • has_bugs (bool): Whether the code has known bugs
  • lines_of_code (int): Number of lines of code
  • collected_at (string): Collection timestamp
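
The field list above can be turned into a simple record check. This is a minimal sketch, assuming each example arrives as a plain Python dict with the listed names and types; `EXPECTED_TYPES`, `validate_record`, and the `sample` record are hypothetical and not part of the dataset.

```python
# Field names and Python types, taken from the Features list.
EXPECTED_TYPES = {
    "id": str,
    "code": str,
    "code_description": str,
    "programming_language": str,
    "task_type": str,
    "difficulty_level": str,
    "quality_score": float,
    "is_tested": bool,
    "has_bugs": bool,
    "lines_of_code": int,
    "collected_at": str,
}


def validate_record(record):
    """Return a list of problems found in one record (empty if none)."""
    problems = []
    for name, expected in EXPECTED_TYPES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it for int fields.
        wrong_bool = expected is int and isinstance(value, bool)
        if wrong_bool or not isinstance(value, expected):
            problems.append(f"{name}: expected {expected.__name__}")
    score = record.get("quality_score")
    if isinstance(score, float) and not 0.0 <= score <= 1.0:
        problems.append("quality_score outside 0.0-1.0")
    return problems


# Made-up record for illustration only; not taken from the dataset.
sample = {
    "id": "example-0",
    "code": "def add(a, b):\n    return a + b",
    "code_description": "Add two numbers.",
    "programming_language": "python",
    "task_type": "code_generation",
    "difficulty_level": "beginner",
    "quality_score": 0.9,
    "is_tested": True,
    "has_bugs": False,
    "lines_of_code": 2,
    "collected_at": "2025-10-25",
}
print(validate_record(sample))
```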

Usage

from datasets import load_dataset

# Load dataset
dataset = load_dataset("romcmu863/code-dataset")

# Access splits
train = dataset['train']
validation = dataset['validation']
test = dataset['test']

# Get first example
example = train[0]
print(example['code_description'])
print(example['code'])
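
A common next step is to keep only examples that are tested, have no known bugs, and meet a quality threshold. The predicate below is a sketch that relies on the `is_tested`, `has_bugs`, and `quality_score` fields; the name `is_clean` and the default threshold of 0.8 are assumptions chosen for illustration.

```python
def is_clean(example, min_quality=0.8):
    """True if the example is tested, bug-free, and scores high enough."""
    return bool(
        example["is_tested"]
        and not example["has_bugs"]
        and example["quality_score"] >= min_quality
    )


# With a split loaded through the datasets library, the predicate
# could be passed to Dataset.filter, for example:
#   clean_train = train.filter(is_clean)

# Stand-alone check on plain dicts (made-up values):
rows = [
    {"is_tested": True, "has_bugs": False, "quality_score": 0.95},
    {"is_tested": True, "has_bugs": True, "quality_score": 0.95},
    {"is_tested": False, "has_bugs": False, "quality_score": 0.95},
    {"is_tested": True, "has_bugs": False, "quality_score": 0.5},
]
print([is_clean(row) for row in rows])
```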

License

CC0-1.0

Created

2025-10-25
