The response to my first datasets has been insane - thank you!
Your support made these go viral, and they're still trending on the Hugging Face datasets homepage:
Proven Performers:
- GitHub Code 2025 (12k+ downloads, 83+ likes) - Top 10 on HF Datasets
- ArXiv Papers (8k+ downloads, 51+ likes) - Top 20 on HF Datasets
Now I'm expanding from scientific papers and code into hardware, maker culture, and engineering wisdom with three new domain-specific datasets:
New Datasets Dropped
1. Phoronix Articles
   - What is Phoronix? The definitive source for Linux, open-source, and hardware performance journalism since 2004. For more info visit: https://www.phoronix.com/
   - Dataset contains: articles with full text, metadata, and comment counts
   - Want a Linux & hardware news AI? Train models on 50K+ articles tracking 20 years of tech evolution
2. Hackaday Posts
   - What is Hackaday? The epicenter of maker culture - DIY projects, hardware hacks, and engineering creativity. For more info visit: https://hackaday.com/
   - Dataset contains: articles with nested comment threads and engagement metrics
   - Want a maker community AI? Build assistants that understand electronics projects, 3D printing, and hardware innovation
3. EEVblog Posts
   - What is EEVblog? The largest electronics engineering forum - a popular online platform and YouTube channel for electronics enthusiasts, hobbyists, and engineers. For more info visit: https://www.eevblog.com/forum/
   - Dataset contains: forum posts with author expertise levels and technical discussions
   - Want an electronics expert? Train AI mentors that explain circuits, troubleshoot designs, and guide hardware projects
Hey, I've just uploaded 2 new datasets for code and scientific reasoning models:
1. ArXiv Papers (4.6 TB): a massive scientific corpus with papers and metadata across all domains. Perfect for training models on academic reasoning, literature review, and scientific knowledge mining. Link: nick007x/arxiv-papers
2. GitHub Code 2025 (1 TB): a comprehensive code dataset for code generation and analysis tasks. It is mostly drawn from GitHub's top 1 million high-quality repos above 2 stars. Link: nick007x/github-code-2025
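At 4.6 TB and 1 TB, these two are not something most people will download in full, so here is a minimal sketch of peeking at a few records by streaming with the Hugging Face `datasets` library. The repo IDs come from the links above; the helper names `open_stream` and `peek` are made up for illustration, and the `"train"` split is an assumption, so check each dataset card for the actual splits, configs, and column names before relying on it.

```python
from itertools import islice

# Repo IDs as given in the post.
REPOS = {
    "arxiv": "nick007x/arxiv-papers",
    "github": "nick007x/github-code-2025",
}


def open_stream(repo_id, split="train"):
    """Open a lazily read stream over a Hub dataset (hypothetical helper)."""
    # Imported here so the rest of the snippet works without `datasets` installed.
    from datasets import load_dataset

    # ASSUMPTION: a "train" split exists; verify on the dataset card.
    # streaming=True reads records on demand instead of downloading everything.
    return load_dataset(repo_id, split=split, streaming=True)


def peek(records, n=3):
    """Return the first n items of any iterable without reading the rest."""
    return list(islice(records, n))
```

With network access and `datasets` installed, `peek(open_stream(REPOS["github"]), n=2)` would fetch just two records, which is usually enough to inspect the available fields before committing to a larger job.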