# Usage Examples - FDA Task Classifier

## Basic Usage

### 1. Start the Server

```bash
./run_server.sh
```

### 2. Check Server Health

```bash
curl http://127.0.0.1:8000/health
```

### 3. Simple Completion

The native llama.cpp `/completion` endpoint expects `n_predict` for the token limit:

```bash
curl -X POST http://127.0.0.1:8000/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This is terrible!\n\nResponse: ",
    "n_predict": 100,
    "temperature": 0.7
  }'
```

### 4. Streaming Response

```bash
curl -X POST http://127.0.0.1:8000/completion \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "Instruction: Rewrite the provided text to remove the toxicity.\n\nInput: This sucks so bad!\n\nResponse: ",
    "n_predict": 500,
    "temperature": 0.8,
    "stream": true
  }'
```

## Advanced Configuration

### Custom Server Settings

```bash
llama-server \
  -m model.gguf \
  --host 127.0.0.1 \
  --port 8000 \
  --n-gpu-layers 35 \
  --ctx-size 4096 \
  --threads 8 \
  --chat-template "" \
  --log-disable
```

### GPU Acceleration (macOS with Metal)

Metal builds of llama.cpp use the GPU automatically; there is no separate `--metal` flag. Control how much of the model is offloaded with `--n-gpu-layers`:

```bash
llama-server \
  -m model.gguf \
  --host 127.0.0.1 \
  --port 8000 \
  --n-gpu-layers 50
```

### GPU Acceleration (Linux/Windows with CUDA)

CUDA support is chosen when llama.cpp is compiled (e.g. a build configured with `-DGGML_CUDA=ON`); there is no `--cuda` runtime flag. As with Metal, offload is controlled by `--n-gpu-layers`:

```bash
llama-server \
  -m model.gguf \
  --host 127.0.0.1 \
  --port 8000 \
  --n-gpu-layers 50
```

## Python Client Example

```python
import requests


def complete_with_model(prompt, max_tokens=200, temperature=0.7):
    url = "http://127.0.0.1:8000/completion"
    payload = {
        "prompt": prompt,
        "n_predict": max_tokens,  # native llama.cpp endpoint uses n_predict
        "temperature": temperature,
    }
    headers = {"Content-Type": "application/json"}

    response = requests.post(url, json=payload, headers=headers)
    if response.status_code == 200:
        return response.json()["content"]
    return f"Error: {response.status_code}"


# Example usage
prompt = (
    "Instruction: Rewrite the provided text to remove the toxicity.\n\n"
    "Input: This is awful!\n\nResponse: "
)
print(complete_with_model(prompt))
```
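All of the requests above share the same Instruction/Input/Response prompt layout. A small local helper (`build_prompt` is a convenience function for this guide, not part of llama.cpp) keeps that formatting consistent across calls:

```python
def build_prompt(instruction, input_text=""):
    """Compose a prompt in the Instruction/Input/Response layout used above.

    This is a local convenience helper, not part of the server API.
    """
    prompt = f"Instruction: {instruction}\n\n"
    if input_text:
        prompt += f"Input: {input_text}\n\n"
    return prompt + "Response: "


# Reproduces the prompt from the Simple Completion request above
prompt = build_prompt(
    "Rewrite the provided text to remove the toxicity.",
    "This is terrible!",
)
```

The `input_text` field is optional, so the same helper also works for instruction-only prompts.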
## Troubleshooting

### Common Issues

1. **Memory Errors**

   ```
   Error: not enough memory
   ```

   **Solution**: Reduce `--n-gpu-layers` (down to `0` for CPU-only inference) or use a smaller quantization of the model.

2. **Context Window Too Large**

   ```
   Error: context size exceeded
   ```

   **Solution**: Reduce `--ctx-size` (e.g., `--ctx-size 2048`).

3. **CUDA Not Available**

   ```
   Error: CUDA not found
   ```

   **Solution**: Use a CPU or Metal build of llama.cpp, or install CUDA drivers and a CUDA-enabled build.

4. **Port Already in Use**

   ```
   Error: bind failed
   ```

   **Solution**: Use a different port, e.g. `--port 8001`.

### Performance Tuning

- **For faster inference**: Increase `--n-gpu-layers`
- **For lower latency**: Reduce `--ctx-size`
- **For more consistent output**: Lower `--temp` and `--top-p`
- **For more creative output**: Raise `--temp` and adjust `--top-k`

### System Requirements

- **RAM**: Minimum 8 GB, recommended 16 GB+
- **GPU**: Optional, but recommended for faster inference
- **Storage**: Model file size plus roughly 2x that in free space for temporary files

---

Generated on 2025-10-16 19:13:23
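For the "Port Already in Use" case above, it can help to check whether anything is already listening before launching the server. A minimal sketch, assuming the default `127.0.0.1:8000` host and port from this guide:

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if something is already listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when a connection succeeds,
        # i.e. when a server is already bound to the port
        return sock.connect_ex((host, port)) == 0


# Example usage: pick a fallback before starting llama-server
port = 8000
if port_in_use(port):
    print(f"Port {port} is busy; try --port {port + 1}")
```

This only detects listeners on the loopback interface; adjust `host` if the server is bound elsewhere.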