# Usage Examples - FDA Task Classifier

## Basic Usage

### 1. Start the Server

```bash
./run_server.sh
```

### 2. Check Server Health

```bash
curl http://127.0.0.1:8000/health
```

### 3. Simple Completion

The native llama.cpp `/completion` endpoint expects `n_predict` for the token limit:

```bash
curl -X POST http://127.0.0.1:8000/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This is terrible!\n\nResponse: ",
    "n_predict": 100,
    "temperature": 0.7
  }'
```

### 4. Streaming Response

```bash
curl -X POST http://127.0.0.1:8000/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This sucks so bad!\n\nResponse: ",
    "n_predict": 500,
    "temperature": 0.8,
    "stream": true
  }'
```

## Advanced Configuration

### Custom Server Settings

```bash
llama-server \
  -m model.gguf \
  --host 127.0.0.1 \
  --port 8000 \
  --n-gpu-layers 35 \
  --ctx-size 4096 \
  --threads 8 \
  --chat-template "" \
  --log-disable
```

### GPU Acceleration (macOS with Metal)

Metal builds of llama.cpp use the GPU automatically; there is no separate `--metal` flag. Control how much of the model is offloaded with `--n-gpu-layers`:

```bash
llama-server \
  -m model.gguf \
  --host 127.0.0.1 \
  --port 8000 \
  --n-gpu-layers 50
```

### GPU Acceleration (Linux/Windows with CUDA)

CUDA support is chosen when llama.cpp is compiled (e.g. a build configured with `-DGGML_CUDA=ON`); there is no `--cuda` runtime flag. As with Metal, offload is controlled by `--n-gpu-layers`:

```bash
llama-server \
  -m model.gguf \
  --host 127.0.0.1 \
  --port 8000 \
  --n-gpu-layers 50
```

## Python Client Example

```python
import requests


def complete_with_model(prompt, max_tokens=200, temperature=0.7):
    url = "http://127.0.0.1:8000/completion"
    payload = {
        "prompt": prompt,
        "n_predict": max_tokens,  # native llama.cpp endpoint uses n_predict
        "temperature": temperature,
    }
    headers = {"Content-Type": "application/json"}

    response = requests.post(url, json=payload, headers=headers)
    if response.status_code == 200:
        return response.json()["content"]
    return f"Error: {response.status_code}"


# Example usage
prompt = (
    "Instruction: Rewrite the provided text to remove the toxicity.\n\n"
    "Input: This is awful!\n\nResponse: "
)
print(complete_with_model(prompt))
```
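All of the requests above share the same Instruction/Input/Response prompt layout. A small local helper (`build_prompt` is a convenience function for this guide, not part of llama.cpp) keeps that formatting consistent across calls:

```python
def build_prompt(instruction, input_text=""):
    """Compose a prompt in the Instruction/Input/Response layout used above.

    This is a local convenience helper, not part of the server API.
    """
    prompt = f"Instruction: {instruction}\n\n"
    if input_text:
        prompt += f"Input: {input_text}\n\n"
    return prompt + "Response: "


# Reproduces the prompt from the Simple Completion request above
prompt = build_prompt(
    "Rewrite the provided text to remove the toxicity.",
    "This is terrible!",
)
```

The `input_text` field is optional, so the same helper also works for instruction-only prompts.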
## Troubleshooting

### Common Issues

1. **Memory Errors**

   ```
   Error: not enough memory
   ```

   **Solution**: Reduce `--n-gpu-layers` (down to `0` for CPU-only inference) or use a smaller quantization of the model.

2. **Context Window Too Large**

   ```
   Error: context size exceeded
   ```

   **Solution**: Reduce `--ctx-size` (e.g., `--ctx-size 2048`).

3. **CUDA Not Available**

   ```
   Error: CUDA not found
   ```

   **Solution**: Use a CPU or Metal build of llama.cpp, or install CUDA drivers and a CUDA-enabled build.

4. **Port Already in Use**

   ```
   Error: bind failed
   ```

   **Solution**: Use a different port, e.g. `--port 8001`.

### Performance Tuning

- **For faster inference**: Increase `--n-gpu-layers`
- **For lower latency**: Reduce `--ctx-size`
- **For more consistent output**: Lower `--temp` and `--top-p`
- **For more creative output**: Raise `--temp` and adjust `--top-k`

### System Requirements

- **RAM**: Minimum 8 GB, recommended 16 GB+
- **GPU**: Optional, but recommended for faster inference
- **Storage**: Model file size plus roughly 2x that in free space for temporary files

---

Generated on 2025-10-16 19:13:23
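For the "Port Already in Use" case above, it can help to check whether anything is already listening before launching the server. A minimal sketch, assuming the default `127.0.0.1:8000` host and port from this guide:

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when a connection succeeds,
        # i.e. when a server is already bound to the port
        return sock.connect_ex((host, port)) == 0


# Example usage: pick a fallback before starting llama-server
port = 8000
if port_in_use(port):
    print(f"Port {port} is busy; try --port {port + 1}")
```

This only detects listeners on the loopback interface; adjust `host` if the server is bound elsewhere.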