---
title: ZeroGPU-LLM-Inference
emoji: 🧠
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 5.49.1
app_file: app.py
pinned: false
license: apache-2.0
short_description: Streaming LLM chat with web search and controls
---
# 🧠 ZeroGPU LLM Inference

A modern, user-friendly Gradio interface for **token-streaming, chat-style inference** across a wide variety of Transformer models, powered by ZeroGPU for free GPU acceleration on Hugging Face Spaces.
## ✨ Key Features

### 🎨 Modern UI/UX

- **Clean, intuitive interface** with organized layout and visual hierarchy
- **Collapsible advanced settings** for both simple and power users
- **Smooth animations and transitions** for better user experience
- **Responsive design** that works on all screen sizes
- **Copy-to-clipboard** functionality for easy sharing of responses
### 🔍 Web Search Integration

- **Real-time DuckDuckGo search** with background threading (see the sketch below)
- **Configurable timeout** and result limits
- **Automatic context injection** into system prompts
- **Smart toggle** - search settings auto-hide when disabled
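A minimal sketch of that flow, assuming the `duckduckgo_search` package; the helper names (`fetch_snippets`, `search_with_timeout`) are illustrative, not the actual functions in `app.py`:

```python
import threading
from duckduckgo_search import DDGS  # assumed dependency

def fetch_snippets(query, max_results=4, max_chars=50):
    """Run a DuckDuckGo text search and trim each snippet."""
    with DDGS() as ddgs:
        results = ddgs.text(query, max_results=max_results)
    return [r["body"][:max_chars] for r in results]

def search_with_timeout(query, timeout=5.0, **kwargs):
    """Fetch snippets in a background thread; give up after `timeout` seconds."""
    snippets = []

    def worker():
        try:
            snippets.extend(fetch_snippets(query, **kwargs))
        except Exception:
            pass  # a failed search should never block generation

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout)   # wait at most `timeout` seconds
    return snippets   # possibly empty if the search timed out
```

Because the worker is a daemon thread, a hung search is simply abandoned at the timeout instead of blocking the UI.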
### 💡 Smart Features

- **Thought vs. answer streaming**: `<think>…</think>` blocks shown separately as "💭 Thought" (see the sketch after this list)
- **Working cancel button** - immediately stops generation without errors
- **Debug panel** for prompt-engineering insights
- **Duration estimates** based on model size and settings
- **Example prompts** to help users get started
- **Dynamic system prompts** with automatic date insertion
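For the thought/answer split, a minimal sketch of the tag handling; `split_thought` is an illustrative name, and the real streaming code tracks the tags incrementally rather than re-parsing the full text:

```python
def split_thought(text):
    """Separate a <think>...</think> block from the visible answer."""
    if "<think>" not in text:
        return "", text
    before, _, rest = text.partition("<think>")
    thought, closed, after = rest.partition("</think>")
    if not closed:  # closing tag hasn't streamed in yet
        return thought, before
    return thought, before + after
```

For example, `split_thought("<think>plan the reply</think>Hi!")` returns `("plan the reply", "Hi!")`.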
### 🎯 Model Variety

- **30+ LLM options** from leading providers (Qwen, Microsoft, Meta, Mistral, etc.)
- Models ranging from **135M to 32B+** parameters
- Specialized models for **reasoning, coding, and general chat**
- **Efficient model loading** - one at a time with automatic cache clearing
### ⚙️ Advanced Controls

- **Generation parameters**: max tokens, temperature, top-k, top-p, repetition penalty
- **Web search settings**: max results, chars per result, timeout
- **Custom system prompts** with dynamic date insertion (sketched below)
- **Organized in collapsible sections** to keep the interface clean
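Date insertion can be as simple as filling a placeholder before each request; the template text and helper name below are assumptions, not the exact code:

```python
from datetime import date

DEFAULT_SYSTEM_PROMPT = "You are a helpful assistant. Today's date: {today}."

def build_system_prompt(template=DEFAULT_SYSTEM_PROMPT):
    """Fill the date placeholder so the model knows the current day."""
    return template.format(today=date.today().isoformat())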
## 📋 Supported Models

### Compact Models (< 2B)

- **SmolLM2-135M-Instruct** - Tiny but capable
- **SmolLM2-360M-Instruct** - Lightweight conversation
- **Taiwan-ELM-270M/1.1B** - Multilingual support
- **Qwen3-0.6B/1.7B** - Fast inference

### Mid-Size Models (2B-8B)

- **Qwen3-4B/8B** - Balanced performance
- **Phi-4-mini** (4.3B) - Reasoning & Instruct variants
- **MiniCPM3-4B** - Efficient mid-size
- **Gemma-3-4B-IT** - Instruction-tuned
- **Llama-3.2-Taiwan-3B** - Regional optimization
- **Mistral-7B-Instruct** - Classic performer
- **DeepSeek-R1-Distill-Llama-8B** - Reasoning specialist

### Large Models (14B+)

- **Qwen3-14B** - Strong general purpose
- **Apriel-1.5-15b-Thinker** - Multimodal reasoning
- **gpt-oss-20b** - Open GPT-style
- **Qwen3-32B** - Top-tier performance
## 🚀 How It Works

1. **Select Model** - Choose from 30+ pre-configured models
2. **Configure Settings** - Adjust generation parameters or use defaults
3. **Enable Web Search** (optional) - Get real-time information
4. **Start Chatting** - Type your message or use example prompts
5. **Stream Response** - Watch as tokens are generated in real time
6. **Cancel Anytime** - Stop generation mid-stream if needed
### Technical Flow

1. The user message enters the chat history
2. If search is enabled, a background thread fetches DuckDuckGo results
3. Search snippets merge into the system prompt (within the timeout limit)
4. The selected model pipeline loads on ZeroGPU (bf16 → f16 → f32 fallback, sketched below)
5. The prompt is formatted with thinking-mode detection
6. Tokens stream to the UI with thought/answer separation
7. A cancel button is available for immediate interruption
8. Memory is cleared after generation for the next request
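Steps 4 and 6 might look roughly like this with `transformers`; the function names are illustrative, and details such as cancellation and ZeroGPU scheduling are simplified away:

```python
import torch
from threading import Thread
from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

def load_model(repo_id):
    """Load tokenizer + model, trying bf16, then f16, then f32 (step 4)."""
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    for dtype in (torch.bfloat16, torch.float16, torch.float32):
        try:
            model = AutoModelForCausalLM.from_pretrained(
                repo_id, torch_dtype=dtype, device_map="auto")
            return model, tokenizer
        except (RuntimeError, ValueError):
            continue  # dtype unsupported here; fall back to the next one
    raise RuntimeError(f"could not load {repo_id} in any supported dtype")

def stream_reply(model, tokenizer, prompt, **gen_kwargs):
    """Yield decoded text chunks as the model produces them (step 6)."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True,
                                    skip_special_tokens=True)
    Thread(target=model.generate,
           kwargs={**inputs, "streamer": streamer, **gen_kwargs}).start()
    for chunk in streamer:  # blocks until the next token arrives
        yield chunk
```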
## ⚙️ Generation Parameters

| Parameter | Range | Default | Description |
|-----------|-------|---------|-------------|
| Max Tokens | 64-16384 | 1024 | Maximum response length |
| Temperature | 0.1-2.0 | 0.7 | Creativity vs. focus |
| Top-K | 1-100 | 40 | Token sampling pool size |
| Top-P | 0.1-1.0 | 0.9 | Nucleus sampling threshold |
| Repetition Penalty | 1.0-2.0 | 1.2 | Reduces repetition |
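These controls map one-to-one onto standard `transformers` generation kwargs; the dict below shows the defaults from the table above:

```python
# Default generation settings (see table above) as `generate()` kwargs.
gen_kwargs = dict(
    max_new_tokens=1024,     # Max Tokens
    do_sample=True,          # required for temperature/top-k/top-p to apply
    temperature=0.7,         # Temperature
    top_k=40,                # Top-K
    top_p=0.9,               # Top-P
    repetition_penalty=1.2,  # Repetition Penalty
)
```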
## 🔍 Web Search Settings

| Setting | Range | Default | Description |
|---------|-------|---------|-------------|
| Max Results | Integer | 4 | Number of search results |
| Max Chars/Result | Integer | 50 | Character limit per result |
| Search Timeout | 0-30 s | 5 s | Maximum wait time |
## 💻 Local Development

```bash
# Clone the repository
git clone https://huggingface.co/spaces/Luigi/ZeroGPU-LLM-Inference
cd ZeroGPU-LLM-Inference

# Install dependencies
pip install -r requirements.txt

# Run the app
python app.py
```
## 🎨 UI Design Philosophy

The interface follows these principles:

1. **Simplicity First** - Core features immediately visible
2. **Progressive Disclosure** - Advanced options hidden but accessible
3. **Visual Hierarchy** - Clear organization with groups and sections
4. **Feedback** - Status indicators and helpful messages
5. **Accessibility** - Responsive, keyboard-friendly, with tooltips
## 🔧 Customization

### Adding New Models

Add an entry to the `MODELS` dictionary in `app.py`:

```python
MODELS = {
    # ... existing entries ...
    "Your-Model-Name": {
        "repo_id": "org/model-name",         # Hugging Face repo to load
        "description": "Model description",  # shown in the UI
        "params_b": 7.0,                     # size in billions (feeds the duration estimate)
    },
}
```
### Modifying UI Theme

Adjust the theme parameters passed to `gr.Blocks()`:

```python
with gr.Blocks(
    theme=gr.themes.Soft(
        primary_hue="indigo",
        secondary_hue="purple",
        # ... more options
    )
) as demo:
    ...
```
## 📊 Performance

- **Token streaming** for a responsive feel
- **Background search** that doesn't block the UI
- **Efficient memory management** with cache clearing (see the sketch below)
- **ZeroGPU acceleration** for fast inference
- **Optimized loading** with dtype fallbacks
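The cache clearing between model loads is standard PyTorch housekeeping; a sketch, with an illustrative function name:

```python
import gc
import torch

def unload_pipeline(pipe):
    """Drop the current pipeline and reclaim GPU memory before the next load."""
    del pipe
    gc.collect()                  # release Python-side references
    if torch.cuda.is_available():
        torch.cuda.empty_cache()  # return cached CUDA blocks to the driver
```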
## 🤝 Contributing

Contributions welcome! Areas for improvement:

- Additional model integrations
- UI/UX enhancements
- Performance optimizations
- Bug fixes and testing
- Documentation improvements

## 📄 License

Apache 2.0 - see the LICENSE file for details.
## 🙏 Acknowledgments

- Built with [Gradio](https://gradio.app)
- Powered by [Hugging Face Transformers](https://huggingface.co/transformers)
- Uses [ZeroGPU](https://huggingface.co/zero-gpu-explorers) for acceleration
- Search via [DuckDuckGo](https://duckduckgo.com)

---

**Made with ❤️ for the open source community**