wangkanai committed
Commit 04f367f · verified · 1 Parent(s): 2b73dab

Upload folder using huggingface_hub

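The commit message notes that the folder was pushed with the `huggingface_hub` Python client. A minimal sketch of that kind of upload (the repo id, local folder path, and token handling below are illustrative assumptions, not details recorded in this commit):

```python
from huggingface_hub import HfApi

api = HfApi()  # picks up the token from `huggingface-cli login` or HF_TOKEN

# Upload every file in a local folder as a single commit; files matched by the
# LFS rules in .gitattributes (such as the GGUF weights added below) are stored via Git LFS.
api.upload_folder(
    repo_id="wangkanai/qwen3-vl-8b-instruct",            # assumed repo id
    repo_type="model",
    folder_path="E:/huggingface/qwen3-vl-8b-instruct",   # assumed local path
    commit_message="Upload folder using huggingface_hub",
)
```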
.gitattributes CHANGED
@@ -33,3 +33,6 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
  *.zip filter=lfs diff=lfs merge=lfs -text
  *.zst filter=lfs diff=lfs merge=lfs -text
  *tfevents* filter=lfs diff=lfs merge=lfs -text
+ qwen3-vl-8b-instruct-abliterated-f16.gguf filter=lfs diff=lfs merge=lfs -text
+ qwen3-vl-8b-instruct-abliterated-q4-k-m.gguf filter=lfs diff=lfs merge=lfs -text
+ qwen3-vl-8b-instruct-abliterated-q8-0.gguf filter=lfs diff=lfs merge=lfs -text
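
Files matched by these rules are stored as Git LFS objects, so a clone made without `git lfs` installed ends up with small text pointer stubs instead of the multi-gigabyte GGUF files. A quick way to check what is actually on disk (the path is illustrative):

```python
from pathlib import Path

def is_lfs_pointer(path: str) -> bool:
    """Return True if the file is a Git LFS pointer stub rather than the real payload."""
    with Path(path).open("rb") as f:
        head = f.read(64)
    return head.startswith(b"version https://git-lfs.github.com/spec/v1")

print(is_lfs_pointer("qwen3-vl-8b-instruct-abliterated-q4-k-m.gguf"))
```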
README.md CHANGED
@@ -11,7 +11,7 @@ tags:
  - image-text-to-text
  ---

- <!-- README Version: v1.1 -->
+ <!-- README Version: v1.2 -->

  # Qwen3-VL-8B-Instruct (Abliterated)

@@ -40,37 +40,87 @@ This is an **abliterated** (uncensored) version of the Qwen3-VL-8B-Instruct mult

  ```
  qwen3-vl-8b-instruct/
- ├── qwen3-vl-8b-instruct-abliterated.safetensors # Complete model weights (16.33 GB)
- └── README.md # This file
+ ├── qwen3-vl-8b-instruct-abliterated.safetensors # Complete model weights (17 GB)
+ ├── qwen3-vl-8b-instruct-abliterated-f16.gguf # FP16 GGUF format (16 GB)
+ ├── qwen3-vl-8b-instruct-abliterated-q4-k-m.gguf # Q4_K_M quantized (4.7 GB)
+ ├── qwen3-vl-8b-instruct-abliterated-q8-0.gguf # Q8_0 quantized (8.2 GB)
+ └── README.md # This file
  ```

- **Total Repository Size**: 16.33 GB (FP16 precision, single-file format)
+ **Total Repository Size**: ~46 GB (multiple formats for different use cases)

  **File Details**:
+
  - **qwen3-vl-8b-instruct-abliterated.safetensors**: Complete merged model in safetensors format
- - Size: 16.33 GB
+ - Size: 17 GB
  - Precision: FP16 (half precision)
  - Format: Single-file merged weights (not sharded)
- - Contains: Full vision encoder + language model + abliteration modifications
+ - Use with: Transformers library, standard PyTorch inference
+ - Best for: GPU inference with 20GB+ VRAM
+
+ - **qwen3-vl-8b-instruct-abliterated-f16.gguf**: FP16 GGUF format
+ - Size: 16 GB
+ - Precision: FP16 (half precision)
+ - Format: GGUF (GPT-Generated Unified Format)
+ - Use with: llama.cpp, Ollama, LM Studio
+ - Best for: CPU/GPU inference with llama.cpp ecosystem
+
+ - **qwen3-vl-8b-instruct-abliterated-q4-k-m.gguf**: Q4_K_M quantized GGUF
+ - Size: 4.7 GB
+ - Precision: 4-bit K-quant (medium quality)
+ - Format: GGUF quantized
+ - Use with: llama.cpp, Ollama, LM Studio
+ - Best for: Lower VRAM systems (8-12 GB), good quality/size balance
+
+ - **qwen3-vl-8b-instruct-abliterated-q8-0.gguf**: Q8_0 quantized GGUF
+ - Size: 8.2 GB
+ - Precision: 8-bit quantization
+ - Format: GGUF quantized
+ - Use with: llama.cpp, Ollama, LM Studio
+ - Best for: 12-16 GB VRAM, minimal quality loss from FP16
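
Only one of these files is needed for a given setup, so it is usually easier to fetch that single file than to clone the whole ~46 GB repository. A sketch using `huggingface_hub` (the repo id below is an assumption for illustration):

```python
from huggingface_hub import hf_hub_download

# Download just the Q4_K_M GGUF; the returned path points into the local HF cache.
gguf_path = hf_hub_download(
    repo_id="wangkanai/qwen3-vl-8b-instruct",                    # assumed repo id
    filename="qwen3-vl-8b-instruct-abliterated-q4-k-m.gguf",
)
print(gguf_path)
```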
 
  ## Hardware Requirements

- ### Minimum Requirements
+ ### SafeTensors Format (FP16)
+ **Minimum Requirements**:
  - **VRAM**: 20 GB (FP16 inference)
  - **RAM**: 32 GB system memory
  - **Disk Space**: 20 GB free space
  - **GPU**: NVIDIA GPU with Compute Capability 7.0+ (V100, RTX 20/30/40 series, A100, etc.)

- ### Recommended Requirements
+ **Recommended Requirements**:
  - **VRAM**: 24 GB+ (RTX 4090, A6000, A100 for longer sequences)
  - **RAM**: 64 GB system memory
  - **Disk Space**: 30 GB+ (for model caching and optimization)
  - **GPU**: NVIDIA RTX 4090, A100, or H100 for optimal performance

- ### Optimization Options
- - **INT8 Quantization**: ~10 GB VRAM (with minor quality loss)
- - **INT4 Quantization**: ~6 GB VRAM (with moderate quality loss)
- - **CPU Inference**: Possible but very slow (not recommended)
+ ### GGUF Formats (Multiple Options)
+
+ **F16 GGUF** (qwen3-vl-8b-instruct-abliterated-f16.gguf):
+ - **VRAM**: 18-20 GB GPU VRAM recommended
+ - **RAM**: 32 GB for GPU offloading, 64 GB for CPU inference
+ - **Disk Space**: 20 GB
+ - **Use Case**: GPU inference with llama.cpp ecosystem
+
+ **Q8_0 GGUF** (qwen3-vl-8b-instruct-abliterated-q8-0.gguf):
+ - **VRAM**: 12-16 GB GPU VRAM
+ - **RAM**: 16 GB for GPU offloading, 32 GB for CPU inference
+ - **Disk Space**: 10 GB
+ - **Quality**: Minimal quality loss from FP16, excellent balance
+ - **Use Case**: Mid-range GPUs (RTX 3060 12GB, RTX 4060 Ti 16GB, etc.)
+
+ **Q4_K_M GGUF** (qwen3-vl-8b-instruct-abliterated-q4-k-m.gguf):
+ - **VRAM**: 8-12 GB GPU VRAM
+ - **RAM**: 8 GB for GPU offloading, 16 GB for CPU inference
+ - **Disk Space**: 6 GB
+ - **Quality**: Good quality/size balance, suitable for most tasks
+ - **Use Case**: Consumer GPUs (RTX 3060, RTX 4060, etc.)
+
+ ### CPU-Only Inference (GGUF formats)
+ - **RAM**: 32-64 GB system memory
+ - **CPU**: Modern CPU with AVX2 support (Intel Core i5/i7/i9, AMD Ryzen)
+ - **Performance**: Much slower than GPU, but functional
+ - **Recommended**: Q4_K_M format for best performance/quality balance
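
The VRAM figures above track the file sizes closely: resident weights cost roughly the file size, and the KV cache, vision-encoder activations, and runtime buffers add a few extra gigabytes that grow with context length and image resolution. A rough sanity check using the sizes published in this repository (the 2 GB overhead figure is an assumed ballpark, not a measurement):

```python
# Back-of-envelope VRAM estimate: weights (file size) + assumed runtime overhead.
file_size_gb = {
    "f16":    16.4,  # 16,388,044,928 bytes
    "q8_0":    8.7,  #  8,709,519,488 bytes
    "q4_k_m":  5.0,  #  5,027,784,832 bytes
}
overhead_gb = 2.0  # assumed headroom for KV cache, vision tower, and buffers

for name, weights_gb in file_size_gb.items():
    print(f"{name:>7}: ~{weights_gb + overhead_gb:.1f} GB VRAM "
          f"({weights_gb:.1f} GB weights + ~{overhead_gb:.0f} GB overhead)")
```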
 
  ## Usage Examples
 
 
@@ -221,6 +271,148 @@ print("Model layers:", list(weights.keys())[:10]) # First 10 keys
  print(f"Total parameters: {sum(w.numel() for w in weights.values()):,}")
  ```

+ ## GGUF Format Usage
+
+ The GGUF formats are designed for use with llama.cpp, Ollama, LM Studio, and other GGUF-compatible inference engines. These formats are optimized for flexible deployment across CPU and GPU systems.
+
+ ### Using with llama.cpp
+
+ **Installation**:
+ ```bash
+ git clone https://github.com/ggerganov/llama.cpp
+ cd llama.cpp
+ make
+ ```
+
+ **Basic Inference** (Q4_K_M for best balance):
+ ```bash
+ ./main -m "E:\\huggingface\\qwen3-vl-8b-instruct\\qwen3-vl-8b-instruct-abliterated-q4-k-m.gguf" \
+ --image "path/to/image.jpg" \
+ -p "Describe this image in detail." \
+ -n 512 \
+ --temp 0.7 \
+ --top-p 0.9
+ ```
+
+ **GPU Acceleration** (offload layers to GPU):
+ ```bash
+ ./main -m "E:\\huggingface\\qwen3-vl-8b-instruct\\qwen3-vl-8b-instruct-abliterated-q8-0.gguf" \
+ --image "path/to/image.jpg" \
+ -p "What objects are in this image?" \
+ -n 512 \
+ --n-gpu-layers 35 \
+ --temp 0.7
+ ```
+
+ **CPU-Only Inference** (optimized threading):
+ ```bash
+ ./main -m "E:\\huggingface\\qwen3-vl-8b-instruct\\qwen3-vl-8b-instruct-abliterated-q4-k-m.gguf" \
+ --image "path/to/image.jpg" \
+ -p "Analyze this chart." \
+ -n 512 \
+ --threads 8
+ ```
+
+ ### Using with Ollama
+
+ **Create Modelfile**:
+ ```dockerfile
+ FROM E:\huggingface\qwen3-vl-8b-instruct\qwen3-vl-8b-instruct-abliterated-q4-k-m.gguf
+
+ PARAMETER temperature 0.7
+ PARAMETER top_p 0.9
+ PARAMETER top_k 40
+
+ SYSTEM """You are an uncensored vision-language AI assistant capable of analyzing images and answering questions without content filtering."""
+ ```
+
+ **Create and run model**:
+ ```bash
+ ollama create qwen3-vl-abliterated -f ./Modelfile
+ ollama run qwen3-vl-abliterated
+ ```
+
+ **Interactive use**:
+ ```bash
+ >>> What's in this image? /path/to/image.jpg
+ ```
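
A model created this way can also be called programmatically through Ollama's local HTTP API rather than the interactive prompt; a minimal sketch (assuming the default port 11434, the model name created above, and an Ollama build that accepts image inputs for this model):

```python
import base64
import json
import urllib.request

# Encode one image and send it with a prompt to the locally running Ollama server.
with open("path/to/image.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "qwen3-vl-abliterated",   # name used with `ollama create` above
    "prompt": "What's in this image?",
    "images": [image_b64],
    "stream": False,
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])
```
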
+
+ ### Using with LM Studio
+
+ 1. Open LM Studio
+ 2. Go to "Local Models" → "Import Model"
+ 3. Select one of the GGUF files:
+ - Use Q4_K_M for best performance on consumer hardware
+ - Use Q8_0 for better quality with more VRAM
+ - Use F16 for maximum quality
+ 4. Load the model and configure:
+ - Context Length: 32768
+ - GPU Offload: Adjust based on your VRAM
+ - Temperature: 0.7 (adjust for your use case)
+ 5. Use the image upload feature to analyze images
+
+ ### Python with llama-cpp-python
+
+ **Installation**:
+ ```bash
+ pip install llama-cpp-python
+ ```
+
+ **Basic Usage**:
+ ```python
+ from llama_cpp import Llama
+ from llama_cpp.llama_chat_format import Llava15ChatHandler
+
+ # Initialize chat handler for vision model
+ chat_handler = Llava15ChatHandler(clip_model_path="path/to/clip/model")
+
+ # Load model
+ llm = Llama(
+     model_path="E:\\huggingface\\qwen3-vl-8b-instruct\\qwen3-vl-8b-instruct-abliterated-q4-k-m.gguf",
+     chat_handler=chat_handler,
+     n_ctx=32768,
+     n_gpu_layers=35,  # Adjust based on VRAM
+     verbose=False
+ )
+
+ # Analyze image
+ response = llm.create_chat_completion(
+     messages=[
+         {
+             "role": "user",
+             "content": [
+                 {"type": "image_url", "image_url": {"url": "file:///path/to/image.jpg"}},
+                 {"type": "text", "text": "What is in this image?"}
+             ]
+         }
+     ],
+     temperature=0.7,
+     max_tokens=512
+ )
+
+ print(response["choices"][0]["message"]["content"])
+ ```
+
+ ### Format Selection Guide
+
+ **Choose Q4_K_M** if:
+ - You have 8-12 GB VRAM
+ - You want fast inference with good quality
+ - Storage space is a concern
+ - Most consumer hardware scenarios
+
+ **Choose Q8_0** if:
+ - You have 12-16 GB VRAM
+ - You want minimal quality loss from FP16
+ - You can spare the extra storage
+ - Professional or high-quality output needs
+
+ **Choose F16 GGUF** if:
+ - You have 20+ GB VRAM
+ - You want maximum quality
+ - You prefer GGUF ecosystem over PyTorch
+ - You need llama.cpp compatibility with full precision
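
The guide above boils down to a VRAM threshold rule; a small helper that encodes those thresholds (the function name is illustrative, the cut-offs are taken from the bullets above):

```python
def pick_gguf_file(vram_gb: float) -> str:
    """Map available GPU VRAM to the GGUF file suggested by the selection guide."""
    if vram_gb >= 20:
        return "qwen3-vl-8b-instruct-abliterated-f16.gguf"    # maximum quality
    if vram_gb >= 12:
        return "qwen3-vl-8b-instruct-abliterated-q8-0.gguf"   # minimal quality loss
    return "qwen3-vl-8b-instruct-abliterated-q4-k-m.gguf"     # 8-12 GB VRAM or CPU

print(pick_gguf_file(16))  # -> qwen3-vl-8b-instruct-abliterated-q8-0.gguf
```
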
+
  ## Model Specifications

  ### Architecture Details
 
@@ -518,7 +710,16 @@ processor = Qwen2VLProcessor(image_processor=image_processor, tokenizer=tokenize

  ## Changelog

- **v1.1** (Current)
+ **v1.2** (Current - November 2025)
+ - Added GGUF format files (F16, Q8_0, Q4_K_M)
+ - Comprehensive GGUF usage documentation (llama.cpp, Ollama, LM Studio)
+ - Detailed hardware requirements for each format
+ - Format selection guide for different use cases
+ - Updated total repository size to ~46 GB
+ - Added Python llama-cpp-python examples
+ - Enhanced deployment flexibility across CPU/GPU systems
+
+ **v1.1**
  - Updated README with accurate file information
  - Added abliteration details and safety warnings
  - Documented single-file merged format
qwen3-vl-8b-instruct-abliterated-f16.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:198b11e5bf72366e17c26fd2ef7acffb7b521e0520b5d888adb3caf7ba1df5ae
+ size 16388044928

qwen3-vl-8b-instruct-abliterated-q4-k-m.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:a12b15fc5631cd42bd67739d8bbb5ac2e18b958c879d16e6bcd3d86879f0117d
+ size 5027784832

qwen3-vl-8b-instruct-abliterated-q8-0.gguf ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:39579e349e291e70a6391eebfb8a93046d5798e018df0a029cc6408c35ecbb80
+ size 8709519488
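
Each pointer records the byte size and SHA-256 of the object it stands in for, so a downloaded file can be checked against the values above; a small verification sketch (the local filename is assumed to match the repository filename):

```python
import hashlib
from pathlib import Path

# Expected values copied from the Q8_0 pointer above.
EXPECTED_SIZE = 8709519488
EXPECTED_SHA256 = "39579e349e291e70a6391eebfb8a93046d5798e018df0a029cc6408c35ecbb80"

path = Path("qwen3-vl-8b-instruct-abliterated-q8-0.gguf")
assert path.stat().st_size == EXPECTED_SIZE, "size mismatch"

h = hashlib.sha256()
with path.open("rb") as f:
    for chunk in iter(lambda: f.read(1 << 20), b""):
        h.update(chunk)
assert h.hexdigest() == EXPECTED_SHA256, "hash mismatch"
print("OK: file matches its LFS pointer")
```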