Local LLM Setup Guide (v40)

Dev.to / 5/26/2026

Key Points

The article is a practical guide for setting up a local LLM on Linux with strong privacy, low latency, and fewer policy/infrastructure constraints.
It lists system requirements (Ubuntu/Debian versions, CPU cores, RAM, NVIDIA GPU with CUDA 11.8+, and ample storage) and shows commands to verify the environment.
It compares local LLM frameworks (llama.cpp, Ollama, vLLM, LocalAI) and recommends building directly with llama.cpp.
It provides step-by-step installation instructions for llama.cpp, including cloning the repository, building it, enabling NVIDIA CUDA support, and creating a run script.
The guide emphasizes GPU acceleration considerations by showing how to rebuild llama.cpp with CUDA enabled for better performance.

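1. Overview and Prerequisites

Running an LLM locally offers strong privacy, low latency, and freedom from policy and infrastructure constraints. This guide walks through a practical way to set up an optimized local LLM on a Linux machine.

Prerequisites:

OS: Ubuntu 20.04 or later, or Debian 11 or later
CPU: 4 cores minimum, 8 or more recommended
RAM: 16 GB minimum, 32 GB or more recommended
GPU: NVIDIA GTX 10xx or newer (CUDA 11.8 or later required)
Storage: at least 50 GB free (model files are very large)

# Check system information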
lscpu
free -h
nvidia-smi
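
The minimums listed above can also be checked from a script. The following is a minimal sketch, assuming a Linux host that exposes /proc/meminfo; the function names and default thresholds are illustrative choices that mirror the prerequisites list, not part of any tool used in this guide.

```python
import os
import shutil


def read_total_ram_gb(meminfo_path="/proc/meminfo"):
    """Return total RAM in GiB, parsed from the MemTotal line (Linux only)."""
    with open(meminfo_path) as f:
        for line in f:
            if line.startswith("MemTotal:"):
                kib = int(line.split()[1])  # the kernel reports this value in kB
                return kib / (1024 ** 2)
    raise RuntimeError("MemTotal not found in " + meminfo_path)


def check_requirements(cores, ram_gb, free_disk_gb,
                       min_cores=4, min_ram_gb=16, min_disk_gb=50):
    """Return a list of problems; an empty list means every minimum is met."""
    problems = []
    if cores < min_cores:
        problems.append(f"CPU: {cores} cores, need at least {min_cores}")
    if ram_gb < min_ram_gb:
        problems.append(f"RAM: {ram_gb:.1f} GB, need at least {min_ram_gb}")
    if free_disk_gb < min_disk_gb:
        problems.append(f"Disk: {free_disk_gb:.1f} GB free, need at least {min_disk_gb}")
    return problems


def check_this_machine(path="/"):
    """Gather the local values and compare them against the minimums."""
    cores = os.cpu_count() or 0
    ram_gb = read_total_ram_gb()
    free_gb = shutil.disk_usage(path).free / (1024 ** 3)
    return check_requirements(cores, ram_gb, free_gb)
```

Calling check_this_machine() returns the list of unmet minimums for the current host. A machine sold as "16 GB" usually reports slightly less than 16 GiB in MemTotal, so you may want to lower min_ram_gb a little when using this check.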

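2. Framework Comparison

Framework	Description	Pros	Cons
llama.cpp	Optimized runtime implemented in C++	High performance, minimal dependencies	Command-line based
Ollama	Model management tool with a Docker-like CLI	Simple installation, easy model management	Heavier resource use, slower model loading
vLLM	Python-based, high performance	High token throughput	Complex configuration, high memory requirements
LocalAI	REST API based, supports multiple engines	Broad model compatibility, API driven	Added complexity from multi-engine support

Recommendation: build directly with llama.cpp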

3. Installation Steps (llama.cpp)

3.1 Cloning and Building the Repository

# Install required packages
sudo apt update
sudo apt install git cmake build-essential python3-pip -y

# Clone the llama.cpp repository
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp

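# Build (current llama.cpp builds with CMake; the old Makefile build has been removed)
cmake -B build
cmake --build build --config Release
# Binaries land in build/bin as llama-cli and llama-server (older releases named them main and server)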

3.2 Enabling GPU Support (NVIDIA)

# Install CUDA (if needed)
wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-keyring_1.0-1_all.deb
sudo dpkg -i cuda-keyring_1.0-1_all.deb
sudo apt-get update
sudo apt-get install cuda-toolkit-11-8 -y

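# Rebuild llama.cpp with CUDA enabled (GGML_CUDA is the CMake option in current releases)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release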

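3.3 Creating a Run Script

# Create ~/llama_run.sh
cat > ~/llama_run.sh << 'EOF'
#!/bin/bash
# One-shot generation. llama-cli was named ./main in older Makefile builds.
# Point -m at the model file you actually downloaded (see section 4).
cd ~/llama.cpp
./build/bin/llama-cli -m ./models/llama-2-7b-chat.Q4_K_M.gguf \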
       -p "Qwen: " \
       -n 512 \
       -ngl 35 \
       --temp 0.7 \
       --repeat-penalty 1.1
EOF

chmod +x ~/llama_run.sh
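
The -ngl 35 flag in the script asks llama.cpp to offload up to 35 layers to the GPU. If the model does not fit in VRAM, a smaller value is needed. The sketch below is a rough estimate only: it assumes every layer takes an equal share of the model file and that a fixed reserve is kept free for the KV cache and scratch buffers. Both the function name and the 1 GiB default reserve are hypothetical and not derived from llama.cpp itself.

```python
GIB = 1024 ** 3


def layers_that_fit(model_bytes, n_layers, vram_bytes, reserve_bytes=1 * GIB):
    """Estimate how many layers fit in VRAM.

    Assumption: layers are equally sized and reserve_bytes stays free
    for the KV cache and scratch buffers.
    """
    if n_layers <= 0:
        raise ValueError("n_layers must be positive")
    per_layer = model_bytes / n_layers
    usable = max(0, vram_bytes - reserve_bytes)
    return min(n_layers, int(usable // per_layer))


# Example: a 4 GiB model file with 32 layers on a 3 GiB GPU.
print(layers_that_fit(4 * GIB, 32, 3 * GIB))
```

Treat the result as a starting point for -ngl and confirm real memory use with nvidia-smi, since actual per-layer sizes and buffer overheads differ from this simplification.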

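4. Model Selection Guide

Model	Description	Recommended use cases
Llama-2-7B	Base model, 7 billion parameters	General chatbots, grammar analysis
Llama-2-13B	More capable base model	Tasks that need higher accuracy
Mistral-7B	Optimized base model	Fast inference, high performance
Qwen-7B	Alibaba's Chinese-optimized model	Chinese content processing

Example model download:

# Create the model directory
mkdir -p ~/llama.cpp/models

# Download the Qwen-7B model (example; confirm the repository and file name on Hugging Face before running)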
wget https://huggingface.co/Qwen/Qwen-7B-Chat-GGUF/resolve/main/qwen-7b-chat-q4_k_m.gguf -O ~/llama.cpp/models/qwen-7b-chat-q4_k_m.gguf
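
A failed download can leave an HTML error page saved under the .gguf name, which llama.cpp will then refuse to load. GGUF files begin with the four ASCII bytes "GGUF", so a quick sanity check is possible before running anything. This is a minimal sketch; the helper name is made up for illustration, and the check only confirms the file header, not that the file is complete.

```python
GGUF_MAGIC = b"GGUF"  # the first four bytes of every GGUF file


def looks_like_gguf(path):
    """Return True if the file starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == GGUF_MAGIC
```

For example, looks_like_gguf(os.path.expanduser("~/llama.cpp/models/qwen-7b-chat-q4_k_m.gguf")) should return True after a successful download (with os imported).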

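5. Quantization Types Explained

Quantization	Description	Trade-off
Q4_K_M	4-bit quantization, optimized memory use	Fast, low memory
Q5_K_M	5-bit quantization, middle ground	Balanced speed and accuracy
Q8_0	8-bit quantization	Near-original accuracy, higher memory use
F16	Half precision (FP16)	Highest accuracy, highest memory use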

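# Compare quantization levels of the same model (llama-cli was named ./main in older builds)
./build/bin/llama-cli -m ./models/llama-2-7b-chat.Q4_K_M.gguf -n 512 -ngl 35 --temp 0.7
./build/bin/llama-cli -m ./models/llama-2-7b-chat.Q5_K_M.gguf -n 512 -ngl 35 --temp 0.7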
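
To get a feel for the memory side of the table, the weight size can be approximated as parameters times bits per weight. The sketch below uses the nominal bit widths from the table as an assumption. Real GGUF files are somewhat larger, because quantized formats also store per-block scales and keep some tensors at higher precision, so read the results as lower bounds.

```python
# Nominal bits per weight, taken from the table above (an approximation).
NOMINAL_BITS = {"Q4_K_M": 4, "Q5_K_M": 5, "Q8_0": 8, "F16": 16}


def nominal_size_gb(n_params, quant):
    """Lower-bound weight size in GB (10^9 bytes) for a quantization type."""
    bits = NOMINAL_BITS[quant]
    return n_params * bits / 8 / 1e9


for quant in NOMINAL_BITS:
    # 7 billion parameters, as in Llama-2-7B
    print(quant, nominal_size_gb(7_000_000_000, quant))
```

For a 7-billion-parameter model this gives 3.5 GB at 4 bits and 14 GB at FP16, which is why Q4_K_M is a common starting point on consumer GPUs.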

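6. API Setup and Integration with Existing Tools

6.1 OpenAI-Compatible API Server

# Start the API server (llama-server was named ./server in older builds).
# --host 0.0.0.0 listens on all interfaces; use 127.0.0.1 to keep the API local-only.
./build/bin/llama-server -m ./models/llama-2-7b-chat.Q4_K_M.gguf \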
         -n 512 \
         -ngl 35 \
         --host 0.0.0.0 \
         --port 8080 \
         --threads 8 \
         --temp 0.7 \
         --repeat-penalty 1.1

6.2 Python Client Example

# client.py
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="EMPTY"
)

response = client.chat.completions.create(
    model="llama2",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)

print(response.choices[0].message.content)
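
If installing the openai package is not an option, the same endpoint can be called with the standard library alone. The sketch below assumes the server from section 6.1 is listening on localhost:8080 and exposes the OpenAI-style /v1/chat/completions route. The helper names are illustrative, and the model value is only a label here, as in the client above.

```python
import json
import urllib.request


def build_chat_request(base_url, messages, model="llama2", max_tokens=128):
    """Build a POST request for an OpenAI-style chat completions endpoint."""
    payload = {"model": model, "messages": messages, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat_request(request, timeout=60):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, send_chat_request(build_chat_request("http://localhost:8080/v1", [{"role": "user", "content": "Hello!"}])) returns the reply as a string. Keeping request construction separate from sending makes the payload easy to inspect or log before it goes out.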

7. Systemd Service Setup (24/7 Operation)

# Create the service file
sudo nano /etc/systemd/system/llama.service

# Add the following contents:
[Unit]
Description=Local LLM Server
After=network.target

[Service]
Type=simple
User=your_username
WorkingDirectory=/home/your_username/llama.cpp
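# Run the long-lived API server here. The one-shot llama_run.sh exits after generating, so Restart=always would relaunch it in a loop.
ExecStart=/home/your_username/llama.cpp/build/bin/llama-server -m /home/your_username/llama.cpp/models/llama-2-7b-chat.Q4_K_M.gguf -ngl 35 --host 0.0.0.0 --port 8080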
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

# Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable llama.service
sudo systemctl start llama.service
sudo systemctl status llama.service
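
Editing your_username by hand in several places is easy to get wrong, so the unit file can also be rendered from a small script. This is a sketch under the assumption that the unit keeps the same fields as the one shown above; render_unit and its parameters are hypothetical names, and the command to run is passed in rather than assumed.

```python
def render_unit(user, working_dir, exec_start, description="Local LLM Server"):
    """Return the text of a systemd unit with the same fields as the guide's example."""
    lines = [
        "[Unit]",
        f"Description={description}",
        "After=network.target",
        "",
        "[Service]",
        "Type=simple",
        f"User={user}",
        f"WorkingDirectory={working_dir}",
        f"ExecStart={exec_start}",
        "Restart=always",
        "RestartSec=10",
        "",
        "[Install]",
        "WantedBy=multi-user.target",
        "",
    ]
    return "\n".join(lines)
```

The current account and home directory can be supplied with getpass.getuser() and pathlib.Path.home(), and the rendered text written to a temporary file before copying it to /etc/systemd/system/llama.service with sudo.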

8. Monitoring and Performance Optimization

8.1 Performance Monitoring

# Monitor GPU usage
nvidia-smi -l 1

# Monitor CPU usage
htop

# View logs
journalctl -u llama.service -f
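
For logging or alerting, GPU figures are easier to consume as numbers than as the interactive nvidia-smi table. The sketch below uses nvidia-smi's CSV query mode and parses one line per GPU. The dictionary keys are names chosen for this example, and the sample line in the usage note is made up for illustration.

```python
import subprocess

# nvidia-smi query mode: one CSV line per GPU, no header, no unit suffixes.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total,utilization.gpu",
    "--format=csv,noheader,nounits",
]


def parse_gpu_line(line):
    """Parse 'used, total, util' into a dict (memory in MiB, utilization in %)."""
    used, total, util = (int(field.strip()) for field in line.split(","))
    return {"used_mib": used, "total_mib": total, "util_pct": util}


def read_gpus():
    """Run nvidia-smi and return one dict per GPU."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return [parse_gpu_line(line) for line in result.stdout.splitlines() if line.strip()]
```

A line such as "1234, 8192, 45" parses to 1234 MiB used out of 8192 MiB at 45% utilization. Calling read_gpus() in a loop with a sleep gives a simple way to record memory pressure while the server handles requests.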

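8.2 Performance Testing

# Inference performance test (llama-cli, formerly ./main, prints timing statistics when it exits)
./build/bin/llama-cli -m ./models/llama-2-7b-chat.Q4_K_M.gguf \
       -p "Qwen: " \
       -n 100 \
       -ngl 35 \
       --temp 0.7 \
       --repeat-penalty 1.1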
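
Throughput can also be measured from the client side, which captures what an application actually experiences. The sketch below times a single generation call and divides the token count by the elapsed wall-clock time. Here generate is a hypothetical callable that performs one request and returns the number of generated tokens, for example the completion token count from the API response if the server reports it.

```python
import time


def measure_throughput(generate, clock=time.perf_counter):
    """Return tokens per second for one call to generate().

    generate: callable that runs one generation and returns the token count.
    clock: injectable timer, mainly so the function can be tested.
    """
    start = clock()
    n_tokens = generate()
    elapsed = clock() - start
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")
    return n_tokens / elapsed
```

Because the measurement includes prompt processing and network overhead, it will read lower than the generation speed reported by the command-line tool. Comparing several runs with the same prompt gives a fairer picture than a single call.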

8.3 Memory Optimization

# GPU memory optimization (llama-cli was named ./main in older builds)
./build/bin/llama-cli -m ./models/llama-2-7b-chat.Q4_K_M.gguf \
       -n 512 \
       -ngl 35 \
       --temp 0.7 \
       --repeat-penalty 1.1 \
       --ctx-size 4096 \
       --batch-size 512
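
The --ctx-size value matters for memory because the key/value cache grows linearly with context length. A worked estimate, assuming an FP16 cache and the Llama-2-7B shape (32 layers, 32 KV heads, head dimension 128), is sketched below. The function name is illustrative, and other models, especially those using grouped-query attention, have fewer KV heads and need less.

```python
def kv_cache_bytes(n_layers, n_ctx, n_kv_heads, head_dim, bytes_per_elem=2):
    """Size of the key/value cache: two tensors (K and V) per layer."""
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


GIB = 1024 ** 3

# Llama-2-7B shape with a 4096-token context and 2-byte (FP16) cache entries.
full = kv_cache_bytes(n_layers=32, n_ctx=4096, n_kv_heads=32, head_dim=128)
half = kv_cache_bytes(n_layers=32, n_ctx=2048, n_kv_heads=32, head_dim=128)
print(full / GIB, half / GIB)
```

Under these assumptions the cache takes 2 GiB at a 4096-token context and 1 GiB at 2048, on top of the model weights, so halving --ctx-size is a direct way to recover VRAM when the model barely fits.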

9. Real-World Usage Examples

9.1 Chatbot Environment Setup


---

📥 **Get the full guide on Gumroad**: https://gumroad.com/l/auto ($7)
