kevin510/friday-4bit

kevin510 committed on Jun 18

Commit 25f40c2 · verified · 1 Parent(s): 081dce9

Create README.md

Files changed (1)

README.md ADDED +108 -0

	@@ -0,0 +1,108 @@

+---
+license: apache-2.0
+datasets:
+- liuhaotian/LLaVA-Instruct-150K
+- liuhaotian/LLaVA-Pretrain
+base_model:
+- microsoft/Phi-4-mini-reasoning
+- kevin510/fast-vit-hd
+library_name: transformers
+tags:
+- vision-language
+- multimodal
+- friday
+- custom_code
+- 4bit
+- quantization
+---
+# Friday-VLM
+Friday-VLM is a multimodal (image + text) LLM fine-tuned on image and text instruction data.
+The architecture and config live in this repo, so callers must load the model with
+`trust_remote_code=True`.
+
+---
+# Model variants
+| Repo ID | Precision | File format | Typical VRAM (% of bf16) | Size on disk (% of bf16) |
+|---------|-----------|-------------|--------------------------|--------------------------|
+| `kevin510/friday`      | **bf16** (full) | `safetensors` | 100 % | 100 % |
+| `kevin510/friday-4bit` | **fp4** (bitsandbytes 4-bit) | `safetensors` | ≈ 30 % | ≈ 25 % |
+---
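
The relative sizes in the table can be sanity-checked with back-of-the-envelope arithmetic. The sketch below is illustrative only: `weight_bytes` is a hypothetical helper, the parameter count is a round placeholder rather than the real model size, and the block size of 64 with one 32-bit scale per block is an assumption about the quantization layout, not something stated in this README.

```python
from typing import Optional


def weight_bytes(n_params: int, bits_per_weight: float,
                 block_size: Optional[int] = None, scale_bits: int = 32) -> float:
    """Approximate storage for the weights alone (no activations, no KV cache)."""
    bits = n_params * bits_per_weight
    if block_size is not None:
        # Assumption: one scale value is stored per block of quantized weights.
        bits += (n_params / block_size) * scale_bits
    return bits / 8


n_params = 1_000_000_000  # hypothetical round number, NOT the real parameter count

bf16 = weight_bytes(n_params, 16)
q4 = weight_bytes(n_params, 4)                        # ideal 4-bit storage
q4_scaled = weight_bytes(n_params, 4, block_size=64)  # plus assumed per-block scales

print(f"ideal 4-bit / bf16:       {q4 / bf16:.1%}")
print(f"with block scales / bf16: {q4_scaled / bf16:.1%}")
```

Pure 4-bit storage is a quarter of bf16, which is consistent with the ≈ 25 % on-disk figure. Per-block scales, and any layers left unquantized, would push the runtime figure somewhat higher, which may explain part of the gap to ≈ 30 % VRAM.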
+# Dependencies
+```bash
+conda create --name friday python=3.12 -y
+conda activate friday
+pip install transformers torch torchvision deepspeed accelerate pillow einops timm bitsandbytes
+```
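
After installing, it can be useful to confirm that every package is importable before loading the model. The following is a minimal sketch using only the standard library. The `REQUIRED` mapping and `missing_packages` are hypothetical names introduced here; the mapping pairs each pip name from the command above with its usual import name (for example, `pillow` is imported as `PIL`).

```python
from importlib.util import find_spec

# pip package name -> top-level module name it is normally imported as
REQUIRED = {
    "transformers": "transformers",
    "torch": "torch",
    "torchvision": "torchvision",
    "deepspeed": "deepspeed",
    "accelerate": "accelerate",
    "pillow": "PIL",
    "einops": "einops",
    "timm": "timm",
    "bitsandbytes": "bitsandbytes",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose module cannot be located (nothing is imported)."""
    return sorted(pip for pip, module in required.items() if find_spec(module) is None)


if __name__ == "__main__":
    missing = missing_packages()
    print("All dependencies found." if not missing else f"Missing: {', '.join(missing)}")
```

Because `find_spec` only locates modules, the check is fast and does not trigger heavy imports such as `torch`.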
+# Quick start
+```python
+import torch
+from PIL import Image
+from transformers import AutoTokenizer, AutoModelForCausalLM
+
+# The tokenizer comes from the base repo; the weights come from the 4-bit repo.
+tok = AutoTokenizer.from_pretrained("kevin510/friday", trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained(
+    "kevin510/friday-4bit",
+    trust_remote_code=True,  # the model class is defined by code in the repo
+    load_in_4bit=True,       # 4-bit weights via bitsandbytes
+    device_map="auto",
+)
+model.eval()
+
+# Build the chat-style prompt; <image> is the placeholder for the image.
+prompt = "Describe this image."
+user_prompt = f"<|user|><image>\n{prompt}\n<|assistant|>"
+inputs = tok(user_prompt, return_tensors="pt").to(model.device)
+image = Image.open("my_image.jpg").convert("RGB")
+
+# Greedy decoding; the image is passed through the custom `images` argument.
+with torch.no_grad():
+    out = model.generate(
+        **inputs,
+        max_new_tokens=256,
+        do_sample=False,
+        images=[image],
+    )
+
+# Special tokens are kept, so the chat markers remain visible in the output.
+print(tok.decode(out[0], skip_special_tokens=False))
+```
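
The prompt template in the quick start can be wrapped in small helpers so that it is not retyped for every call. This is a sketch, not part of the repo: `build_prompt` and `extract_reply` are hypothetical names, and the only things taken from the README are the `<|user|>`, `<image>`, and `<|assistant|>` markers and their order.

```python
USER_TAG = "<|user|>"
IMAGE_TAG = "<image>"
ASSISTANT_TAG = "<|assistant|>"


def build_prompt(question: str) -> str:
    """Format a single-image question using the template from the quick start."""
    return f"{USER_TAG}{IMAGE_TAG}\n{question}\n{ASSISTANT_TAG}"


def extract_reply(decoded: str) -> str:
    """Return the text after the last assistant marker in a decoded sequence.

    If the marker is absent, the whole string is returned unchanged (stripped).
    """
    _, _, reply = decoded.rpartition(ASSISTANT_TAG)
    return reply.strip()


if __name__ == "__main__":
    print(repr(build_prompt("Describe this image.")))
```

Since the quick start decodes with `skip_special_tokens=False`, the decoded string still contains the prompt markers, and `extract_reply` is one way to isolate the generated part. Any end-of-turn token the model emits would still need to be stripped separately.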
+# Architecture at a glance
+```
+FastViT-HD ─▶ 3072-d patch embeddings ─▶ S2 6144-d patch embeddings ─▶ 2-layer MLP vision adapter (6144 → 3072)
+
+(vision tokens, 3072 d) ─┐
+                         ├─▶ Phi-4-mini-reasoning (3.8 B params, hidden = 3072)
+<text tokens, 3072 d> ───┘     │
+                               │ (standard self-attention only;
+                               │  language tower is frozen at finetune)
+```
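
The dimension flow in the diagram can be traced with a few lines of shape bookkeeping. The sketch below makes two assumptions that the README does not state: that the S2 step concatenates features from two image scales along the channel axis (which would account for 2 × 3072 = 6144), and that vision and text tokens are joined along the sequence axis before entering the language model. The function names and the patch and token counts are hypothetical.

```python
def adapter_shapes(num_patches: int, encoder_dim: int = 3072,
                   num_scales: int = 2, llm_hidden: int = 3072):
    """Return (stage name, (tokens, width)) pairs for the vision path."""
    s2_dim = encoder_dim * num_scales  # assumed channel-wise concat over scales
    return [
        ("FastViT-HD patch embeddings", (num_patches, encoder_dim)),
        ("S2 patch embeddings", (num_patches, s2_dim)),
        ("vision adapter output", (num_patches, llm_hidden)),
    ]


def llm_input_shape(num_patches: int, num_text_tokens: int, llm_hidden: int = 3072):
    """Sequence shape if vision and text tokens are concatenated (assumption)."""
    return (num_patches + num_text_tokens, llm_hidden)


if __name__ == "__main__":
    for name, shape in adapter_shapes(num_patches=256):  # 256 is a placeholder
        print(f"{name:30s} {shape}")
    print("language model input:", llm_input_shape(256, 20))
```

The adapter's job is to map the 6144-wide S2 features down to the language model's hidden width, so that vision and text tokens can be processed by the same self-attention layers.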
+# Limitations & Responsible AI
+Friday-VLM may hallucinate objects, invent facts, or reproduce societal biases.
+All variants share the same behaviour profile; quantisation does not filter or sanitise model outputs. Users must apply their own content-safety layer before deployment.
+# Citation
+```bibtex
+@misc{friday2025,
+  title   = {Friday VLM: Efficient Instruction-Tuned Vision–Language Modelling},
+  author  = {Your Name et al.},
+  year    = {2025},
+  url     = {https://huggingface.co/kevin510/friday}
+}
+```