 
Submitted by guipenedo

We release large pre-training datasets to accelerate open LLM development. Part of the Hugging Face Science team (hf.co/science)

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale