Nurmukhamed's Collections: llm-performance

- QLoRA: Efficient Finetuning of Quantized LLMs — arXiv:2305.14314 — 56 upvotes
- Training Transformers with 4-bit Integers — arXiv:2306.11987 — 22 upvotes
- FasterViT: Fast Vision Transformers with Hierarchical Attention — arXiv:2306.06189 — 31 upvotes
- DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models — arXiv:2309.14509 — 19 upvotes
- VeRA: Vector-based Random Matrix Adaptation — arXiv:2310.11454 — 30 upvotes
- LoftQ: LoRA-Fine-Tuning-Aware Quantization for Large Language Models — arXiv:2310.08659 — 28 upvotes
- Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time — arXiv:2310.17157 — 14 upvotes
- TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones — arXiv:2312.16862 — 31 upvotes
- SparQ Attention: Bandwidth-Efficient LLM Inference — arXiv:2312.04985 — 40 upvotes
- PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU — arXiv:2312.12456 — 44 upvotes
- LLM in a flash: Efficient Large Language Model Inference with Limited Memory — arXiv:2312.11514 — 260 upvotes
- LLM Augmented LLMs: Expanding Capabilities through Composition — arXiv:2401.02412 — 38 upvotes
- SliceGPT: Compress Large Language Models by Deleting Rows and Columns — arXiv:2401.15024 — 74 upvotes
- Step-3 is Large yet Affordable: Model-system Co-design for Cost-effective Decoding — arXiv:2507.19427 — 18 upvotes