Local LLM Deployment

Deploy and run large language models on your local TopAiPC infrastructure

Model Requirements

| Model | VRAM | RAM | Storage |
| --- | --- | --- | --- |
| Llama 2 7B | 16GB | 32GB | 20GB |
| Llama 2 13B | 24GB | 64GB | 40GB |
| Llama 2 70B | 80GB+ | 128GB | 140GB |
| Mistral 7B | 16GB | 32GB | 20GB |
| Falcon 40B | 48GB | 64GB | 80GB |
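
These figures track a simple rule of thumb: model weights need roughly parameter count × bytes per parameter (2 bytes at fp16, 1 at 8-bit, 0.5 at 4-bit), plus headroom for the KV cache and activations. A minimal sketch of that arithmetic (the ~25% overhead factor is an assumption, not a measured value):

```python
def estimate_vram_gb(params_billions: float,
                     bytes_per_param: float = 2.0,  # fp16; 1.0 for 8-bit, 0.5 for 4-bit
                     overhead: float = 1.25) -> float:  # assumed headroom for KV cache/activations
    """Back-of-envelope VRAM estimate for LLM inference."""
    return params_billions * bytes_per_param * overhead

# Llama 2 7B at fp16: ~17.5 GB, in the same ballpark as the 16GB row above
print(f"{estimate_vram_gb(7):.1f} GB")
```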

Step-by-Step Guide

1. Install Required Software

**For Linux:**

```bash
# Install Python and pip
sudo apt update
sudo apt install python3 python3-pip

# Install PyTorch with CUDA
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
```

**For Windows:**

- Download Python 3.10+ from python.org

- Install PyTorch from pytorch.org
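
Whichever platform you're on, it's worth confirming that PyTorch can actually see the GPU before continuing; a quick check using PyTorch's standard CUDA utilities:

```python
import torch

# True only if the CUDA build of PyTorch is installed and a GPU is visible
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    # Total VRAM in GB, to compare against the requirements table above
    print(f"VRAM: {torch.cuda.get_device_properties(0).total_memory / 1024**3:.1f} GB")
```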

2. Choose Your Framework

**Option A: Hugging Face Transformers**

```bash
pip install transformers accelerate
```

**Option B: Ollama (Recommended for beginners)**

```bash
curl -fsSL https://ollama.ai/install.sh | sh
```

**Option C: vLLM (High performance)**

```bash
pip install vllm
```
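
If you pick vLLM, note that inference goes through its own API rather than the Transformers pipeline shown in step 4. A minimal sketch (the model path is just an example; any Hugging Face causal LM identifier works):

```python
from vllm import LLM, SamplingParams

# vLLM loads the model and manages GPU memory and batching itself
llm = LLM(model="meta-llama/Llama-2-7b-hf")

params = SamplingParams(temperature=0.8, max_tokens=100)
outputs = llm.generate(["Hello, how are you?"], params)
print(outputs[0].outputs[0].text)
```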

3. Download Model

**Using Hugging Face:**

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# meta-llama repos are gated: accept the license on Hugging Face and
# authenticate (e.g. `huggingface-cli login`) before downloading
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
```

**Using Ollama:**

```bash
ollama pull llama2
```
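
Ollama model names also accept tags for other sizes and quantizations; for example (available tags depend on the current Ollama library listing):

```bash
# Pull the 13B variant instead of the default
ollama pull llama2:13b

# Show what has been downloaded locally
ollama list
```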

4. Run Inference

**Basic Inference:**

```python
from transformers import pipeline

# Reuses the model and tokenizer loaded in step 3
generator = pipeline('text-generation', model=model, tokenizer=tokenizer)

# max_length counts prompt tokens plus generated tokens
result = generator("Hello, how are you?", max_length=100)
print(result[0]['generated_text'])
```

**Using Ollama:**

```bash
ollama run llama2 "Hello, how are you?"
```
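
Ollama also serves a local REST API (on port 11434 by default), which makes it easy to call the model from your own scripts; a minimal sketch using the third-party requests library:

```python
import requests

# Non-streaming request to Ollama's generate endpoint; returns one JSON object
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama2", "prompt": "Hello, how are you?", "stream": False},
)
resp.raise_for_status()
print(resp.json()["response"])
```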

Performance Tips

  • Use quantization (4-bit or 8-bit) to reduce VRAM requirements; a 4-bit example follows this list
  • Offload layers to CPU RAM when a model doesn't quite fit in VRAM, at some latency cost
  • Monitor VRAM usage with nvidia-smi
  • Use batch processing for better throughput
  • Consider model pruning for production deployments
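
As a concrete example of the quantization tip, here is a minimal sketch of loading Llama 2 7B in 4-bit through Transformers' bitsandbytes integration (assumes the bitsandbytes package is installed and an NVIDIA GPU is present):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 weights cut the 7B footprint from ~14GB (fp16) to roughly 4GB
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.float16,
)

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",  # place layers on GPU, spilling to CPU RAM if needed
)
```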