kevin510 committed · Commit 25f40c2 · verified · 1 Parent(s): 081dce9

Create README.md

Files changed (1): README.md (+108, -0)

---
license: apache-2.0
datasets:
- liuhaotian/LLaVA-Instruct-150K
- liuhaotian/LLaVA-Pretrain
base_model:
- microsoft/Phi-4-mini-reasoning
- kevin510/fast-vit-hd
library_name: transformers
tags:
- vision-language
- multimodal
- friday
- custom_code
- 4bit
- quantization
---

# Friday-VLM

Friday-VLM is a multimodal (image + text) LLM fine-tuned on image and text instruction data.
The architecture and config live in this repo, so callers must load the model with
`trust_remote_code=True`.

---

# Model variants

| Repo ID | Precision | File format | Typical VRAM* | Size on disk |
|---------|-----------|-------------|---------------|--------------|
| `kevin510/friday` | **bf16** (full) | `safetensors` | 100 % | 100 % |
| `kevin510/friday-fp4` | **fp4** (bitsandbytes 4-bit) | `safetensors` | ≈ 30 % | ≈ 25 % |

\* Relative to the full-precision bf16 checkpoint (`kevin510/friday`).
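
The two variants differ only in weight precision. As a rough sketch (standard 🤗 Transformers arguments, not a Friday-specific loading path), the full-precision repo can be loaded in bf16, or quantised to 4-bit (FP4) on the fly with `BitsAndBytesConfig`; loading the pre-quantised `kevin510/friday-fp4` checkpoint is shown in the Quick start below.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Full-precision variant: load the bf16 weights as shipped.
model_bf16 = AutoModelForCausalLM.from_pretrained(
    "kevin510/friday",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Alternative: quantise the full checkpoint to 4-bit (FP4) at load time,
# roughly what the fp4 variant bakes into its weights.
bnb_cfg = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="fp4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model_4bit = AutoModelForCausalLM.from_pretrained(
    "kevin510/friday",
    quantization_config=bnb_cfg,
    device_map="auto",
    trust_remote_code=True,
)
```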

---

# Dependencies

```bash
conda create --name friday python=3.12 -y
conda activate friday
pip install transformers torch torchvision deepspeed accelerate pillow einops timm bitsandbytes
```
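
Optionally, a quick sanity check (illustrative, not part of the official setup) that the environment sees a GPU and that `bitsandbytes` imports cleanly before any weights are downloaded:

```python
import torch
import bitsandbytes as bnb  # noqa: F401 - fails fast if the 4-bit backend is missing

print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("device:", torch.cuda.get_device_name(0))
```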

# Quick start

```python
import torch
from PIL import Image
from transformers import AutoTokenizer, AutoModelForCausalLM

# The repo ships custom modelling code, so trust_remote_code=True is required.
tok = AutoTokenizer.from_pretrained("kevin510/friday", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "kevin510/friday-fp4",
    trust_remote_code=True,
    load_in_4bit=True,
    device_map="auto"
)
model.eval()

# Friday uses <|user|> / <|assistant|> markers and an <image> placeholder token.
prompt = "Describe this image."
user_prompt = f"<|user|><image>\n{prompt}\n<|assistant|>"
inputs = tok(user_prompt, return_tensors="pt").to(model.device)

image = Image.open("my_image.jpg").convert("RGB")

with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=256,
        do_sample=False,
        images=[image]
    )

print(tok.decode(out[0], skip_special_tokens=False))
```
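
If you want tokens printed as they are produced rather than all at once, the stock `TextStreamer` from `transformers` can be attached to the same `generate` call. A minimal sketch, assuming `tok`, `model`, `inputs`, and `image` from the snippet above:

```python
from transformers import TextStreamer

# Stream decoded tokens to stdout as they are generated, skipping the prompt.
streamer = TextStreamer(tok, skip_prompt=True, skip_special_tokens=True)

with torch.no_grad():
    model.generate(
        **inputs,
        images=[image],
        max_new_tokens=256,
        do_sample=False,
        streamer=streamer,
    )
```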

# Architecture at a glance

```
FastViT-HD ─▶ 3072-d patch embeddings ─▶ S2 6144-d patch embeddings ─▶ 2-layer MLP vision adapter (6144 → 3072)

(vision tokens, 3072 d) ─┐
                         ├─► Φ-4-mini-reasoning (2.7 B params, hidden = 3072)
<text tokens, 3072 d> ───┘        │
                                  │ (standard self-attention only;
                                  │  language tower is frozen during fine-tuning)
```
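
Conceptually, the vision adapter in the diagram is a small projection from the 6144-d S2 patch embeddings into the 3072-d language embedding space. The sketch below illustrates such a module; the layer layout and activation are assumptions for illustration, not the repo's actual implementation.

```python
import torch
import torch.nn as nn


class VisionAdapter(nn.Module):
    """2-layer MLP mapping 6144-d S2 patch embeddings to the 3072-d LM space."""

    def __init__(self, in_dim: int = 6144, out_dim: int = 3072):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),  # activation choice is an assumption
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, patch_embeddings: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, 6144) -> (batch, num_patches, 3072)
        return self.proj(patch_embeddings)


# Example: project a dummy batch of 576 patch tokens.
tokens = torch.randn(1, 576, 6144)
adapter = VisionAdapter()
print(adapter(tokens).shape)  # torch.Size([1, 576, 3072])
```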

# Limitations & Responsible AI

Friday-VLM may hallucinate objects, invent facts, or reproduce societal biases.
All variants share the same behaviour profile; quantisation does not filter or sanitise model outputs. Users must apply their own content-safety layer before deployment.

# Citation

```bibtex
@misc{friday2025,
  title  = {Friday VLM: Efficient Instruction-Tuned Vision–Language Modelling},
  author = {Your Name et al.},
  year   = {2025},
  url    = {https://huggingface.co/kevin510/friday}
}
```