# Local LLM Deployment

Deploy and run large language models on your local TopAiPC infrastructure.
## Model Requirements

The figures below are approximate minimums; actual usage depends on precision, context length, and framework overhead.
| Model | VRAM | RAM | Storage |
|---|---|---|---|
| Llama 2 7B | 16GB | 32GB | 20GB |
| Llama 2 13B | 24GB | 64GB | 40GB |
| Llama 2 70B | 80GB+ | 128GB | 140GB |
| Mistral 7B | 16GB | 32GB | 20GB |
| Falcon 40B | 48GB | 64GB | 80GB |
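As a rough rule of thumb, weight memory is parameter count times bytes per parameter, plus headroom for activations and the KV cache. A minimal sketch of that estimate (the 20% overhead factor is an illustrative assumption, not a measured value):

```python
def estimate_vram_gb(params_billions: float, bits_per_param: int = 16,
                     overhead: float = 1.2) -> float:
    """Rough VRAM estimate: weight memory plus a fudge factor for
    activations and KV cache (the 1.2x overhead is an assumption)."""
    weight_gb = params_billions * bits_per_param / 8  # 1e9 params * bytes/param ~ GB
    return weight_gb * overhead

print(round(estimate_vram_gb(7), 1))                     # 7B at 16-bit, ~16.8 GB
print(round(estimate_vram_gb(7, bits_per_param=4), 1))   # 7B at 4-bit, ~4.2 GB
```

This also shows why the quantization tip at the end of this guide matters: dropping from 16-bit to 4-bit cuts the weight footprint by roughly 4x.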
## Step-by-Step Guide

### 1. Install Required Software
**For Linux:**
```bash
# Install Python and pip
sudo apt update
sudo apt install python3 python3-pip
# Install PyTorch with CUDA 11.8 support
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```
**For Windows:**
- Download Python 3.10+ from python.org
- Install PyTorch with CUDA support using the selector at pytorch.org
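Since the guide assumes Python 3.10+, it's worth confirming the interpreter version before installing anything on top of it. A small check (the helper name is illustrative):

```python
import sys

def meets_requirement(version=sys.version_info, minimum=(3, 10)):
    """Return True if the given version tuple is at least `minimum`."""
    return tuple(version[:2]) >= minimum

print(meets_requirement())  # True on Python 3.10 or newer
```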
### 2. Choose Your Framework
**Option A: Hugging Face Transformers**
```bash
pip install transformers accelerate
```
**Option B: Ollama (Recommended for beginners)**
```bash
curl -fsSL https://ollama.ai/install.sh | sh
```
**Option C: vLLM (High performance)**
```bash
pip install vllm
```
### 3. Download Model
**Using Hugging Face:**
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Llama 2 is gated on Hugging Face: accept the license on the model
# page, then authenticate with `huggingface-cli login`
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.float16,  # halves memory vs. the default fp32
    device_map="auto",          # place layers on available GPU(s)
)
```
**Using Ollama:**
```bash
ollama pull llama2
```
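Note that the `-hf` checkpoint above is a plain completion model; the chat variants (e.g. `meta-llama/Llama-2-7b-chat-hf`) expect prompts wrapped in Llama 2's `[INST]` template. A minimal single-turn sketch of that formatting (the helper name is illustrative):

```python
def build_llama2_prompt(user_msg: str,
                        system_msg: str = "You are a helpful assistant.") -> str:
    """Wrap one user message in the Llama 2 chat template."""
    return (
        f"<s>[INST] <<SYS>>\n{system_msg}\n<</SYS>>\n\n"
        f"{user_msg} [/INST]"
    )

print(build_llama2_prompt("Hello, how are you?"))
```

Chat-tuned models respond much more reliably when the prompt follows this template than when given raw text.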
### 4. Run Inference
**Basic Inference:**
```python
from transformers import pipeline

# Reuse the model and tokenizer loaded in step 3
generator = pipeline("text-generation", model=model, tokenizer=tokenizer)
result = generator("Hello, how are you?", max_new_tokens=100)
print(result[0]["generated_text"])
```
**Using Ollama:**
```bash
ollama run llama2 "Hello, how are you?"
```
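Besides the CLI, a running Ollama server exposes a local REST API on port 11434, which is convenient for scripting. A minimal sketch using only the standard library (assumes `ollama serve` is running and `llama2` has been pulled):

```python
import json
import urllib.request

def generate_payload(model: str, prompt: str) -> bytes:
    """Build the JSON body for Ollama's /api/generate endpoint."""
    return json.dumps({"model": model, "prompt": prompt, "stream": False}).encode()

def ollama_generate(prompt: str, model: str = "llama2",
                    url: str = "http://localhost:11434/api/generate") -> str:
    """Send one non-streaming generation request and return the reply text."""
    req = urllib.request.Request(url, data=generate_payload(model, prompt),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]

# With the server running:
#   ollama_generate("Hello, how are you?")  -> the model's reply as a string
```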
## Performance Tips

- Use 4-bit or 8-bit quantization to reduce VRAM requirements
- Offload layers to CPU RAM (e.g. `device_map="auto"`) when a model doesn't fit entirely in VRAM
- Monitor VRAM usage with `nvidia-smi`
- Batch requests for better throughput
- Consider model pruning for production deployments
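The `nvidia-smi` monitoring tip above can be scripted: the tool's query mode emits machine-readable CSV. A sketch that parses it (the `subprocess` call assumes an NVIDIA driver is installed; the parser itself is demonstrated on a sample line):

```python
import subprocess

def parse_vram(csv_line: str) -> tuple[int, int]:
    """Parse '(used, total)' MiB from one line of nvidia-smi CSV output."""
    used, total = (int(x.strip()) for x in csv_line.split(","))
    return used, total

def gpu_vram() -> list[tuple[int, int]]:
    """Return (used, total) VRAM in MiB for each GPU on this machine."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used,memory.total",
         "--format=csv,noheader,nounits"], text=True)
    return [parse_vram(line) for line in out.strip().splitlines()]

print(parse_vram("10240, 24576"))  # (10240, 24576)
```

Polling `gpu_vram()` during a long generation run makes it easy to spot when a model is about to exhaust VRAM and needs quantization or offloading.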