Spaces: alex4cip/simple-chat
alex4cip and Claude committed 18 days ago

Commit f1ac66c

1 Parent(s): 09e4bc2

docs: Update README with multi-environment support and remove redundant footer

README.md changes:
- Add comprehensive multi-environment support documentation
- Local environments: GPU (CUDA/MPS), CPU
- HF Spaces: ZeroGPU, CPU Upgrade, CPU Basic
- Add hardware auto-detection explanation with code examples
- Add CUDA compatibility testing for RTX 5080+ support
- Expand performance comparison table (2 → 5 environments)
- Add known issues and troubleshooting section
- Simplify installation instructions (single requirements.txt)
- Add PyTorch version requirements per environment

app.py changes:
- Move useful content (test examples, loading time note) to header
- Remove redundant footer section (60+ lines)
- Cleaner UI with all important info at the top
- Better user experience with reduced scrolling

Benefits:
- Eliminates duplicate information (header vs footer)
- Improves information accessibility
- Clearer environment setup guidance
- Better troubleshooting documentation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

Files changed (2)

README.md +255 -105
app.py +6 -56

README.md CHANGED

@@ -12,15 +12,16 @@ license: mit
 # 🤖 Multi-Model Korean LLM Chatbot
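-A multi-model chatbot that lets you chat with any of 13 different Korean LLM models. It supports both **ZeroGPU** and **CPU Upgrade** hardware.
 ## ✨ Key Features
 - **🎯 13 selectable models**: LLMs of varying sizes and characteristics
 - **🇰🇷 Optimized for Korean**: a lineup of models with strong Korean performance
-- **⚡ Flexible hardware**: automatic ZeroGPU/CPU Upgrade detection
 - **💾 Cache system**: avoids re-downloading models, faster loading
 - **🔄 Lazy Loading**: loads only the selected model to save resources
 ## 🎯 Supported Models (13)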
@@ -53,113 +54,221 @@ license: mit
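 > **Note**: Gated models require separate approval on Hugging Face
-## 🚀 Hardware Options
-### Option 1: ZeroGPU (recommended)
-**Pros**:
-- ⚡ Fast responses (3-10 s)
-- 💰 Low cost ($9/month)
-- 🔋 Automatic GPU allocation/release
-**Limitations**:
-- 25 minutes of free use per day (PRO subscription required)
-- Possible queueing (when many users are active)
-**Cost**: $9/month (PRO subscription)
-### Option 2: CPU Upgrade
-**Pros**:
-- ⏰ Unlimited use
-- 📊 Predictable performance
-- 🔧 Simple setup
-**Limitations**:
-- 🐢 Slow responses (15 s to 2 min)
-- 💵 Relatively high cost
-**Cost**: $0.03/hour (about $22 per month)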
-## ⚙️ Hardware Setup
-### Switching to ZeroGPU
 1. Space Settings → Hardware
 2. Select **ZeroGPU**
-3. Confirm
-4. Wait for the build to finish (1-2 min)
-→ Confirm that the UI shows "ZeroGPU"
-### Switching to CPU Upgrade
 1. Space Settings → Hardware
 2. Select **CPU Upgrade (8 vCPU / 32 GB)**
-3. Confirm
-4. Wait for the build to finish (1-2 min)
-→ Confirm that the UI shows "CPU Upgrade"
 ## 📊 Performance Comparison
-| Item | ZeroGPU | CPU Upgrade |
-|------|---------|-------------|
-| **First response** | 10-20 s | 1-3 min |
-| **Subsequent responses** | 3-10 s | 15 s to 2 min |
-| **Daily limit** | 25 min | Unlimited |
-| **Monthly cost** | $9 | $22 |
-| **GPU** | H200 (70GB) | None |
-| **RAM** | - | 32GB |
 ## 🔧 Technical Architecture
-### Automatic Hardware Detection
 ```python
-# Auto-detect whether ZeroGPU is available
 try:
     import spaces
     ZEROGPU_AVAILABLE = True
 except ImportError:
     ZEROGPU_AVAILABLE = False
-# Apply the decorator conditionally
 if ZEROGPU_AVAILABLE:
     @spaces.GPU(duration=120)
-    def generate_response(messages):
-        return generate_response_impl(messages)
 else:
-    def generate_response(messages):
-        return generate_response_impl(messages)
 ```
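-### Lazy Loading System
-- Only the selected model is loaded into memory
-- The previous model is unloaded automatically when switching models
-- A cache check prevents re-downloading
-- Fast loading from disk (when cached)
-### Cache Management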
 ```python
-def check_model_cached(model_name):
-    """Check if model is already downloaded in HF cache"""
-    from huggingface_hub import scan_cache_dir
-    cache_info = scan_cache_dir()
-    for repo in cache_info.repos:
-        if repo.repo_id == model_name:
-            return True
-    return False
 ```
 ## 📝 Usage
 ### 1. Open the Space
-https://huggingface.co/spaces/catchitplay/simple-chatbot-gradio
 ### 2. Select a Model
@@ -207,25 +316,14 @@ cd simple-chatbot-gradio
 python -m venv venv
 source venv/bin/activate  # Windows: venv\Scripts\activate
-# Install dependencies (3 options)
-```
-**Option 1: Local-only requirements (recommended)**
-```bash
-pip install -r requirements-local.txt
-# Uses the latest PyTorch version (no ZeroGPU constraints)
-```
-**Option 2: Install with automatic environment detection**
-```bash
-python setup.py
-# Detects the environment and installs the appropriate versions
 ```
-**Option 3: requirements for HF Spaces**
 ```bash
-pip install -r requirements.txt
-# PyTorch 2.2.0 (ZeroGPU compatible)
 ```
 ### Configure the .env File
@@ -247,12 +345,28 @@ echo "HF_TOKEN=your_hugging_face_token" > .env
 python app.py
 ```
-Open http://0.0.0.0:7860 in your browser (or http://localhost:7860)
 **Notes**:
-- CPU/GPU is auto-detected locally
-- GPU recommended (requires CUDA)
-- Models are downloaded on first run (this takes time)
 ### Install as a Linux System Service (auto-start)
@@ -403,25 +517,35 @@ pip install -r requirements-local.txt
 ## 🛠️ Tech Stack
 - **Framework**: Gradio 5.49.1
-- **ML libraries**: Transformers 4.57.1, PyTorch 2.2.0 (ZeroGPU compatible)
-- **GPU infrastructure**: Hugging Face ZeroGPU (optional)
 - **Language**: Python 3.10+
 ## 📚 Dependencies
 ```txt
 gradio==5.49.1
 transformers==4.57.1
-torch==2.2.0  # ZeroGPU compatible (supports 2.0.0-2.2.0)
 safetensors==0.6.2
 accelerate==0.26.1
 sentencepiece==0.2.0
 protobuf==4.25.1
 huggingface-hub>=0.19.0
 python-dotenv==1.0.0
-spaces  # ZeroGPU support
 ```
 ## 🔒 Using Gated Models
 ### 1. Request Model Access
@@ -441,22 +565,48 @@ Space Settings → Repository secrets:
 - Name: `HF_TOKEN`
 - Value: `your_token_here`
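-## ⚠️ Limitations
 ### Common
 - **Model size**: 2-70GB (loading takes time)
-- **Context**: conversation history is retained
 - **Memory**: large models need a GPU or high-capacity RAM
-### ZeroGPU only
-- **Daily limit**: 25 min (PRO subscription)
-- **Queueing**: waits when many users are active
-- **PRO required**: $9/month subscription
-### CPU Upgrade only
-- **Slow**: 10-30x slower than GPU
-- **Cost**: $0.03 per hour ($22/month)
-- **Memory limit**: 32GB RAM (restricts large models)
 ## 🔗 Related Resources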

 # 🤖 Multi-Model Korean LLM Chatbot
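+A multi-model chatbot that lets you chat with any of 13 different Korean LLM models. It auto-detects **local environments (CPU/GPU)** and **Hugging Face Spaces (CPU Basic/Upgrade, ZeroGPU)** and applies the best settings for each.
 ## ✨ Key Features
 - **🎯 13 selectable models**: LLMs of varying sizes and characteristics
 - **🇰🇷 Optimized for Korean**: a lineup of models with strong Korean performance
+- **🖥️ Multi-environment support**: auto-detects local (CPU/GPU) and HF Spaces (CPU Basic/Upgrade, ZeroGPU)
 - **💾 Cache system**: avoids re-downloading models, faster loading
 - **🔄 Lazy Loading**: loads only the selected model to save resources
+- **🛡️ Stability**: supports recent GPUs such as the RTX 5080, with automatic CUDA compatibility testing
 ## 🎯 Supported Models (13)
 > **Note**: Gated models require separate approval on Hugging Face
+## 🚀 Supported Environments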
+### Local Environments (development/personal use)
+**1. Local GPU (recommended)**
+- **Pros**:
+  - ⚡ Fast responses (5-10 s, GPU-accelerated)
+  - 🔓 Unlimited use
+  - 💰 No cost
+- **Supported GPUs**:
+  - NVIDIA CUDA-capable GPUs (RTX series, A100, etc.)
+  - Apple Silicon GPUs (M1/M2/M3, MPS acceleration)
+  - Recent Blackwell GPUs such as the RTX 5080 (requires PyTorch nightly)
+- **Requirements**: CUDA 12.0+ or Apple Silicon
+**2. Local CPU**
+- **Pros**:
+  - 🖥️ Runs without a GPU
+  - 🔧 Simple setup
+- **Limitations**:
+  - ⏳ Slow responses (1-3 min)
+  - 🔒 Lightweight models recommended (EXAONE 2.4B, Mistral 7B)
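+### Hugging Face Spaces (cloud deployment)
+**1. ZeroGPU (recommended)**
+- **Pros**:
+  - ⚡ Fast responses (3-10 s, accelerated by an NVIDIA H200 GPU)
+  - 💰 Low cost ($9/month)
+  - 🔋 Automatic GPU allocation/release
+- **Limitations**:
+  - 25 minutes of free use per day (PRO subscription required)
+  - Possible queueing (when many users are active)
+- **Cost**: $9/month (PRO subscription)
+**2. CPU Upgrade**
+- **Pros**:
+  - ⏰ Unlimited use
+  - 📊 Predictable performance
+  - 🔧 Simple setup
+- **Limitations**:
+  - 🐢 Slow responses (30 s to 1 min)
+  - 💵 Relatively high cost
+- **Cost**: $0.03/hour (about $22 per month)
+**3. CPU Basic (free)**
+- **Pros**:
+  - 💡 Free tier
+  - 🧪 For testing/learning
+- **Limitations**:
+  - ⏳ Very slow responses (1-2 min)
+  - 🔒 Only lightweight models recommended
+  - ⚠️ Limited use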
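+## ⚙️ Setup by Environment
+### Running Locally (auto-detected)
+The app automatically detects the local environment and applies the best settings:
+```bash
+python app.py
+```
+**Auto-detection logic**:
+1. **GPU detection**: checks whether CUDA/MPS is available
+2. **CUDA compatibility test**: verifies that the GPU actually works by running a tensor operation
+3. **CPU fallback**: switches to CPU mode automatically on GPU errors
+4. **Environment report**: prints the detected environment at startup
+### Deploying to HF Spaces (auto-detected)
+When you change the hardware in Space Settings, the app detects it automatically:
+**Switching to ZeroGPU**:
 1. Space Settings → Hardware
 2. Select **ZeroGPU**
+3. Confirm → wait for the build to finish (1-2 min)
+4. Confirm that the UI shows "🚀 HF Spaces - ZeroGPU"
+**Switching to CPU Upgrade**:
 1. Space Settings → Hardware
 2. Select **CPU Upgrade (8 vCPU / 32 GB)**
+3. Confirm → wait for the build to finish (1-2 min)
+4. Confirm that the UI shows "⚙️ HF Spaces - CPU Upgrade"
+**CPU Basic (free)**:
+- Default setting; no change needed
+- The UI shows "💻 HF Spaces - CPU Basic"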
 ## 📊 Performance Comparison
+| Item | Local GPU | Local CPU | ZeroGPU | CPU Upgrade | CPU Basic |
+|------|-----------|-----------|---------|-------------|-----------|
+| **First response** | 10-20 s | 2-5 min | 10-20 s | 1-2 min | 2-3 min |
+| **Subsequent responses** | 5-10 s | 1-3 min | 3-10 s | 30 s to 1 min | 1-2 min |
+| **Daily limit** | Unlimited | Unlimited | 25 min | Unlimited | Limited |
+| **Monthly cost** | $0 | $0 | $9 | $22 | $0 |
+| **GPU** | User's GPU | None | H200 (70GB) | None | None |
+| **Recommended models** | All | Lightweight | All | All | Lightweight |
 ## 🔧 Technical Architecture
+### Multi-Environment Auto-Detection System
 ```python
+# 1. Avoid CUDA initialization errors: import spaces first
 try:
     import spaces
     ZEROGPU_AVAILABLE = True
 except ImportError:
     ZEROGPU_AVAILABLE = False
+# 2. Then import the CUDA-related packages
+import torch
+from transformers import AutoModelForCausalLM, AutoTokenizer
+# 3. Detect the hardware environment
+def detect_hardware_environment():
+    """
+    Returns: {
+        'platform': 'hf_spaces' | 'local',
+        'hardware': 'zerogpu' | 'cpu_upgrade' | 'cpu_basic' | 'local_gpu' | 'local_cpu',
+        'gpu_available': bool,
+        'gpu_name': str or None,
+        'cuda_compatible': bool
+    }
+    """
+    # Detect HF Spaces
+    if os.environ.get('SPACE_ID'):
+        if ZEROGPU_AVAILABLE:
+            return 'zerogpu'
+        elif cpu_count >= 8:
+            return 'cpu_upgrade'
+        else:
+            return 'cpu_basic'
+    # Detect the local environment
+    if torch.cuda.is_available():
+        # CUDA compatibility test (supports recent GPUs such as the RTX 5080)
+        if test_cuda_compatibility():
+            return 'local_gpu'
+        else:
+            return 'local_cpu'  # CUDA error → CPU fallback
+    elif torch.backends.mps.is_available():
+        return 'local_gpu'  # Apple Silicon
+    else:
+        return 'local_cpu'
+# 4. Apply the GPU decorator conditionally
 if ZEROGPU_AVAILABLE:
     @spaces.GPU(duration=120)
+    def generate_response(message, history):
+        return generate_response_impl(message, history)
 else:
+    def generate_response(message, history):
+        return generate_response_impl(message, history)
 ```
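
The README snippet above documents a dict return value, while the branches it shows return bare strings and rely on `os` and `cpu_count` that the excerpt does not define. As a minimal sketch only, assuming the dict shape in the docstring is what `HW_ENV` holds, the same branching could be written as below. The injected `cuda_probe` and `mps_probe` callables are hypothetical stand-ins for "CUDA is available and passed the compatibility test" and `torch.backends.mps.is_available()`, and marking ZeroGPU as `cuda_compatible` is an assumption.

```python
import os
from typing import Callable, Mapping, Optional


def detect_hardware_environment(
    env: Mapping[str, str] = os.environ,
    cpu_count: Optional[int] = None,
    zerogpu_available: bool = False,
    cuda_probe: Callable[[], bool] = lambda: False,  # hypothetical: CUDA usable
    mps_probe: Callable[[], bool] = lambda: False,   # hypothetical: MPS usable
) -> dict:
    """Classify the runtime and return the dict shape the README documents."""
    cores = cpu_count if cpu_count is not None else (os.cpu_count() or 1)
    info = {
        "platform": "local",
        "hardware": "local_cpu",
        "gpu_available": False,
        "gpu_name": None,
        "cuda_compatible": False,
        "cpu_count": cores,
    }

    # SPACE_ID is the variable the README snippet checks to detect HF Spaces.
    if env.get("SPACE_ID"):
        info["platform"] = "hf_spaces"
        if zerogpu_available:
            # Assumption: ZeroGPU is treated as a working CUDA device.
            info.update(hardware="zerogpu", gpu_available=True, cuda_compatible=True)
        elif cores >= 8:
            info["hardware"] = "cpu_upgrade"
        else:
            info["hardware"] = "cpu_basic"
        return info

    if cuda_probe():
        info.update(hardware="local_gpu", gpu_available=True, cuda_compatible=True)
    elif mps_probe():
        info.update(hardware="local_gpu", gpu_available=True)
    return info
```

Passing the probes in keeps the branching testable on machines without PyTorch, for example `detect_hardware_environment(env={"SPACE_ID": "user/space"}, cpu_count=2)` exercises the CPU Basic branch.
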
+### Lazy Loading & Cache System
+**Smart model loading**:
 ```python
+def load_model_once(model_index=None):
+    """Load only when the model changes (Lazy Loading)"""
+    global model, tokenizer, loaded_model_name
+    model_name = MODEL_CONFIGS[model_index]["MODEL_NAME"]
+    # 1. Reuse the model if it is already loaded
+    if loaded_model_name == model_name:
+        print(f"ℹ️ Model {model_name} already loaded, reusing...")
+        return model, tokenizer
+    # 2. Check the cache → show a "downloading" vs. "loading" message in the UI
+    is_cached = check_model_cached(model_name)
+    if is_cached:
+        print(f"✅ Model found in cache, loading from disk...")
+    else:
+        print(f"📥 Model not in cache, downloading (~4-14GB)...")
+    # 3. Free the previous model's memory
+    if model is not None:
+        del model, tokenizer
+        if HW_ENV['cuda_compatible']:
+            torch.cuda.empty_cache()
+    # 4. Load the new model (optimized per environment)
+    device = "cuda" if HW_ENV['gpu_available'] and HW_ENV['cuda_compatible'] else "cpu"
+    if device == "cuda":
+        model = AutoModelForCausalLM.from_pretrained(
+            model_name,
+            dtype=torch.float16,  # GPU: float16
+            device_map="auto",
+        )
+    else:
+        model = AutoModelForCausalLM.from_pretrained(
+            model_name,
+            dtype=torch.float32,  # CPU: float32
+        )
+    loaded_model_name = model_name
+    return model, tokenizer
 ```
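
The reuse-or-reload pattern in `load_model_once` can be isolated from the model code. The following is a minimal sketch, not the app's implementation: `LazyModelCache`, its `loader` callable (assumed to return a `(model, tokenizer)` pair), and the optional `on_unload` hook (where something like `torch.cuda.empty_cache` could go) are all hypothetical names introduced for illustration.

```python
from typing import Any, Callable, Optional, Tuple


class LazyModelCache:
    """Keep at most one (model, tokenizer) pair loaded; reload only on change."""

    def __init__(
        self,
        loader: Callable[[str], Tuple[Any, Any]],
        on_unload: Optional[Callable[[], None]] = None,
    ) -> None:
        self._loader = loader        # hypothetical: returns (model, tokenizer)
        self._on_unload = on_unload  # hypothetical: frees device memory
        self._name: Optional[str] = None
        self._pair: Optional[Tuple[Any, Any]] = None

    def get(self, model_name: str) -> Tuple[Any, Any]:
        # 1. Same model requested again: reuse what is already in memory.
        if self._name == model_name and self._pair is not None:
            return self._pair
        # 2. Drop references to the previous model before loading the next one.
        if self._pair is not None:
            self._pair = None
            if self._on_unload is not None:
                self._on_unload()
        # 3. Load and remember the new model.
        self._pair = self._loader(model_name)
        self._name = model_name
        return self._pair
```

Keeping the state on an object instead of module-level globals avoids the `global` and `del` bookkeeping, and lets the loader be swapped for a stub in tests.
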
+**Cache status check**:
+- Shows the user "💾 Loading cached model" vs. "📥 Downloading model" in real time
+- Provides a download-time estimate (5-20 min on first use)
 ## 📝 Usage
 ### 1. Open the Space
+https://huggingface.co/spaces/catchitplay/simple-chat
 ### 2. Select a Model
 python -m venv venv
 source venv/bin/activate  # Windows: venv\Scripts\activate
+# Install dependencies
+pip install -r requirements.txt
 ```
+**For recent GPUs such as the RTX 5080**:
 ```bash
+# Install PyTorch nightly (supports CUDA 12.8+)
+pip install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/cu128
 ```
 ### Configure the .env File
 python app.py
 ```
+Open http://localhost:7860 in your browser
+**Automatic environment detection output at startup**:
+```
+============================================================
+Hardware Environment Detection
+============================================================
+Platform: local
+Hardware: local_gpu
+GPU Available: True
+GPU Name: NVIDIA GeForce RTX 5080
+CPU Cores: 16
+OS: Linux
+Description: 🖥️ Local - GPU (NVIDIA GeForce RTX 5080)
+============================================================
+```
 **Notes**:
+- Local environment auto-detection: CPU/GPU/Apple Silicon MPS
+- Automatic CUDA compatibility test (falls back to CPU on GPU errors)
+- Models are downloaded on first run (4-14GB, takes 5-20 min)
+- GPU recommended (RTX series, A100, Apple Silicon, etc.)
 ### Install as a Linux System Service (auto-start)
 ## 🛠️ Tech Stack
 - **Framework**: Gradio 5.49.1
+- **ML libraries**: Transformers 4.57.1, PyTorch 2.2.0+
+- **GPU support**:
+  - HF Spaces: ZeroGPU (NVIDIA H200)
+  - Local: CUDA 12.0+, Apple Silicon MPS
+  - Recent GPUs: supported via PyTorch nightly (CUDA 12.8+)
 - **Language**: Python 3.10+
 ## 📚 Dependencies
 ```txt
+# Core
 gradio==5.49.1
 transformers==4.57.1
+torch>=2.2.0  # HF Spaces: 2.2.0 (ZeroGPU), Local: 2.2.0+ or nightly
 safetensors==0.6.2
 accelerate==0.26.1
 sentencepiece==0.2.0
 protobuf==4.25.1
 huggingface-hub>=0.19.0
 python-dotenv==1.0.0
+spaces  # ZeroGPU support (HF Spaces only)
 ```
+**PyTorch version by environment**:
+- **HF Spaces**: PyTorch 2.2.0 (ZeroGPU compatible)
+- **Local, typical GPU**: PyTorch 2.2.0+ (CUDA 12.0+)
+- **Local, recent GPU (RTX 5080, etc.)**: PyTorch nightly (CUDA 12.8+)
+- **Local CPU**: PyTorch 2.2.0+ (CPU-only build)
 ## 🔒 Using Gated Models
 ### 1. Request Model Access
 - Name: `HF_TOKEN`
 - Value: `your_token_here`
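+## ⚠️ Limitations and Known Issues
 ### Common
 - **Model size**: 2-70GB (loading takes time)
+- **Context**: conversation history is retained (last 3 turns)
 - **Memory**: large models need a GPU or high-capacity RAM
+### Per-Environment Constraints
+**HF Spaces - ZeroGPU**:
+- Daily limit: 25 min (PRO subscription required)
+- Queueing: waits when many users are active
+- Cost: $9/month
+**HF Spaces - CPU Upgrade**:
+- Slow: 10-30x slower than GPU
+- Cost: $0.03 per hour ($22/month)
+- Memory: 32GB RAM (restricts large models)
+**HF Spaces - CPU Basic**:
+- Very slow: 1-2 min responses
+- Limited use
+- Lightweight models recommended
+**Local environments**:
+- GPU memory: large models may run out of VRAM
+- Recent GPUs: PyTorch nightly required (RTX 5080, etc.)
+- CPU mode: very slow (1-3 min responses)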
+### Known Issues and Fixes
+**"CUDA has been initialized" error (ZeroGPU)**:
+- **Cause**: spaces must be imported before torch
+- **Fix**: import spaces first in app.py (already applied)
+**CUDA errors on Blackwell GPUs such as the RTX 5080**:
+- **Cause**: CUDA 12.8+ is required (not supported by PyTorch 2.2.0)
+- **Fix**: install PyTorch nightly (see the installation section above)
+**GPU detected but the app runs in CPU mode**:
+- **Cause**: the CUDA compatibility test failed
+- **Fix**: check the PyTorch version and update the CUDA driver
 ## 🔗 Related Resources
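
The README refers to a CUDA compatibility test that "verifies the GPU by running a tensor operation" but the diff does not show its body. A minimal sketch of what such a check might look like follows; `cuda_is_usable` is a hypothetical name standing in for the README's `test_cuda_compatibility()`, and the real function in app.py may differ.

```python
def cuda_is_usable() -> bool:
    """Return True only if a small tensor op actually runs on the CUDA device."""
    try:
        import torch  # imported lazily so the check degrades to False without torch
    except ImportError:
        return False
    if not torch.cuda.is_available():
        return False
    try:
        x = torch.ones(8, device="cuda")
        # .item() copies the result to the host, which forces the kernel to run.
        total = (x * 2).sum().item()
        return total == 16.0
    except Exception:
        # Any failure here (for example, a PyTorch build without kernels for
        # the installed GPU architecture) is treated as "fall back to CPU".
        return False
```

When this returns `False` on a machine where `torch.cuda.is_available()` is `True`, the app would land in the "GPU detected but runs in CPU mode" case listed under known issues.
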

app.py CHANGED

@@ -501,6 +501,12 @@ with gr.Blocks(title="🤖 Multi-Model Chatbot", css=custom_css) as demo:
     **Model selection**:
     - 🎯 {TOTAL_MODEL_COUNT} Korean-optimized models ({PUBLIC_MODEL_COUNT} Public + {GATED_MODEL_COUNT} Gated)
     - 🔄 Switching models reloads automatically (chat history is reset)
     """
     # Add hardware-specific features
@@ -609,61 +615,5 @@ with gr.Blocks(title="🤖 Multi-Model Chatbot", css=custom_css) as demo:
     msg.submit(submit, [msg, chatbot], [chatbot, msg])
     clear.click(lambda: [], outputs=chatbot)
-    # Dynamic footer based on hardware environment
-    footer = f"""
-    ---
-    **Current environment**: {HW_ENV['description']}
-    **Notes**:
-    - 🤖 Choose from {TOTAL_MODEL_COUNT} models
-    - 🔄 Changing the model resets the conversation
-    - ⏱️ The first response includes model loading time
-    """
-    # Add environment-specific notes
-    if HW_ENV['hardware'] == 'zerogpu':
-        footer += """
-    - ⚡ ZeroGPU automatic GPU allocation (3-5 s responses)
-    - 💰 25 free minutes per day for PRO subscribers
-    - ⏱️ First load: ~10-15 s
-    """
-    elif HW_ENV['hardware'] == 'cpu_upgrade':
-        footer += """
-    - ⏰ Unlimited 24-hour use
-    - ⏳ CPU environment (30 s to 1 min responses)
-    - 💰 $0.03 per hour
-    - ⏱️ First load: ~1-2 min
-    """
-    elif HW_ENV['hardware'] == 'cpu_basic':
-        footer += """
-    - 💡 Free tier (limited)
-    - ⏳ CPU environment (1-2 min responses)
-    - 🔒 Lightweight models recommended
-    - ⏱️ First load: ~2-3 min
-    """
-    elif HW_ENV['hardware'] == 'local_gpu':
-        footer += f"""
-    - 🖥️  GPU acceleration: {HW_ENV['gpu_name']}
-    - ⚡ Fast responses (5-10 s)
-    - 🔓 Unlimited use
-    - ⏱️ First load: ~10-20 s
-    """
-    else:  # local_cpu
-        footer += f"""
-    - 💻 Local CPU ({HW_ENV['cpu_count']} cores)
-    - ⏳ Slow responses (1-3 min)
-    - 🔒 Lightweight models recommended
-    - ⏱️ First load: ~2-5 min
-    """
-    footer += """
-    **Test examples**:
-    - "안녕하세요" (Hello)
-    - "인공지능에 대해 설명해주세요" (Please explain artificial intelligence)
-    - "한국의 수도는 어디인가요?" (What is the capital of Korea?)
-    """
-    gr.Markdown(footer)
 if __name__ == "__main__":
     demo.launch(server_name="0.0.0.0", server_port=7860)

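     **Model selection**:
     - 🎯 {TOTAL_MODEL_COUNT} Korean-optimized models ({PUBLIC_MODEL_COUNT} Public + {GATED_MODEL_COUNT} Gated)
     - 🔄 Switching models reloads automatically (chat history is reset)
+    - ⏱️ The first response includes model loading time
+    **Test examples**:
+    - "안녕하세요" (Hello)
+    - "인공지능에 대해 설명해주세요" (Please explain artificial intelligence)
+    - "한국의 수도는 어디인가요?" (What is the capital of Korea?)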
     """
     # Add hardware-specific features
     msg.submit(submit, [msg, chatbot], [chatbot, msg])
     clear.click(lambda: [], outputs=chatbot)
 if __name__ == "__main__":
     demo.launch(server_name="0.0.0.0", server_port=7860)
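
The removed footer selected its per-hardware notes with an `if`/`elif` chain on `HW_ENV['hardware']`. If those notes were ever needed in the header, one possible shape, shown purely as an illustrative sketch, is a lookup table. `HARDWARE_NOTES` and `hardware_notes_markdown` are hypothetical names; the note texts are English renderings of the removed footer strings, and the fallback to `local_cpu` mirrors the footer's final `else` branch.

```python
# Per-hardware notes, taken from the footer text that this commit removes.
HARDWARE_NOTES = {
    "zerogpu": [
        "⚡ ZeroGPU automatic GPU allocation (3-5 s responses)",
        "💰 25 free minutes per day for PRO subscribers",
        "⏱️ First load: ~10-15 s",
    ],
    "cpu_upgrade": [
        "⏰ Unlimited 24-hour use",
        "⏳ CPU environment (30 s to 1 min responses)",
        "💰 $0.03 per hour",
        "⏱️ First load: ~1-2 min",
    ],
    "cpu_basic": [
        "💡 Free tier (limited)",
        "⏳ CPU environment (1-2 min responses)",
        "🔒 Lightweight models recommended",
        "⏱️ First load: ~2-3 min",
    ],
    "local_gpu": [
        "🖥️ GPU acceleration: {gpu_name}",
        "⚡ Fast responses (5-10 s)",
        "🔓 Unlimited use",
        "⏱️ First load: ~10-20 s",
    ],
    "local_cpu": [
        "💻 Local CPU ({cpu_count} cores)",
        "⏳ Slow responses (1-3 min)",
        "🔒 Lightweight models recommended",
        "⏱️ First load: ~2-5 min",
    ],
}


def hardware_notes_markdown(hw_env: dict) -> str:
    """Render the notes for hw_env['hardware'] as a Markdown bullet list."""
    # Unknown hardware keys fall back to local_cpu, like the removed `else` branch.
    notes = HARDWARE_NOTES.get(hw_env.get("hardware"), HARDWARE_NOTES["local_cpu"])
    fields = {
        "gpu_name": hw_env.get("gpu_name") or "unknown GPU",
        "cpu_count": hw_env.get("cpu_count", "?"),
    }
    return "\n".join("- " + note.format(**fields) for note in notes)
```

A table keeps the wording for each environment in one place, so adding a new hardware tier means adding one entry instead of another branch.
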