nightmedia commited on
Commit
f0365b2
Β·
verified Β·
1 Parent(s): 03cdb92

Update README.md

Browse files
Files changed (1) hide show
  1. README.md +118 -3
README.md CHANGED
@@ -17,11 +17,126 @@ base_model: Qwen/Qwen3-VL-30B-A3B-Instruct
17
 
18
  This is the first Deckard(qx) quant of a VL model by Nightmedia.
19
 
20
- I have no way to test it how well it works, if you like it, that must mean something :)
21
 
22
- I will try to post some vibes from my pictures.
23
 
24
- -G
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
25
 
26
  This model [Qwen3-VL-30B-A3B-Instruct-qx86-hi-mlx](https://huggingface.co/nightmedia/Qwen3-VL-30B-A3B-Instruct-qx86-hi-mlx) was
27
  converted to MLX format from [Qwen/Qwen3-VL-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct)
 
17
 
18
  This is the first Deckard(qx) quant of a VL model by Nightmedia.
19
 
20
+ Let's see how the Qwen3-VL models perform compared to some reference models.
21
 
22
+ Spock > Ah, Commander. You've now introduced a new frontier β€” the Qwen3-VL (Vision-Language) models, and we must now shift our cognitive lens from pure textual reasoning to multimodal cognition.
23
 
24
+ Let us conduct a deep analysis of the Qwen3-VL-30B-A3B-Instruct series, comparing it against the YOYO-V3/V4 variants and other 30B-class models, with a focus on how vision-language integration affects reasoning performance.
25
+
26
+ 🧠 Core Question:
27
+
28
+ How does the inclusion of visual grounding in Qwen3-VL impact cognitive performance across textual benchmarks?
29
+
30
+ We’ll examine:
31
+ - Performance on text-only tasks (arc_challenge, boolq, etc.)
32
+ - The role of quantization fidelity (qx64-hi, qx86-hi)
33
+ - Comparison with pure-text YOYO models
34
+ - The impact of instruct tuning and vision-language fusion
35
+
36
+ πŸ“Š Data Summary: Qwen3-VL-30B-A3B-Instruct
37
+ ```bash
38
+ Model arc_challenge arc_easy boolq hellaswag openbookqa piqa winogrande
39
+ Qwen3-VL-30B-A3B-Instruct-qx64-hi 0.454 0.544 0.893 0.618 0.428 0.749 0.590
40
+ Qwen3-VL-30B-A3B-Instruct-qx86-hi 0.439 0.541 0.894 0.619 0.430 0.764 0.592
41
+ ```
42
+ πŸ” Performance Analysis: Qwen3-VL vs. YOYO
43
+
44
+
45
+ Let’s compare Qwen3-VL-30B-A3B-Instruct with the YOYO-V4 variants:
46
+
47
+ ```bash
48
+ Benchmark YOYO-V4-qx86-hi Qwen3-VL-qx64-hi Ξ”
49
+ arc_challenge 0.511 0.454 -0.057
50
+ arc_easy 0.674 0.544 -0.130
51
+ boolq 0.885 0.893 +0.008
52
+ hellaswag 0.649 0.618 -0.031
53
+ openbookqa 0.442 0.428 -0.014
54
+ piqa 0.769 0.749 -0.020
55
+ winogrande 0.618 0.590 -0.028
56
+ ```
57
+
58
+ 🧠 Interpretation:
59
+ - βœ… Strongest in Boolean Reasoning
60
+ - Qwen3-VL achieves 0.894 on boolq β€” slightly better than YOYO-V4 (0.885).
61
+ - This suggests vision-language grounding enhances logical clarity, possibly because visual cues provide unambiguous anchors for truth evaluation.
62
+
63
+ ❌ Significant Regression in Reasoning Fluency
64
+ - arc_easy drops from 0.674 β†’ 0.544 β€” a loss of over 13%.
65
+ - hellaswag and winogrande also decline β€” indicating reduced commonsense fluency.
66
+ - πŸ€” Why? Because the model is now processing multimodal inputs, which may:
67
+ - Introduce noise in purely textual reasoning,
68
+ - Prioritize visual grounding over abstract inference,
69
+ - Reduce cognitive bandwidth for narrative fluency.
70
+
71
+ 🧩 OpenbookQA & Piqa: Slight Regression
72
+
73
+ Openbookqa (knowledge-based) and piqa (practical reasoning) both dip β€” likely due to over-reliance on visual context, which may not be available in text-only scenarios.
74
+
75
+ πŸ” Quantization Impact: qx64-hi vs. qx86-hi
76
+ ```bash
77
+ Benchmark qx64-hi qx86-hi Ξ”
78
+ arc_challenge 0.454 0.439 -0.015
79
+ arc_easy 0.544 0.541 -0.003
80
+ boolq 0.893 0.894 +0.001
81
+ hellaswag 0.618 0.619 +0.001
82
+ openbookqa 0.428 0.430 +0.002
83
+ piqa 0.749 0.764 +0.015
84
+ winogrande 0.590 0.592 +0.002
85
+ ```
86
+ βœ… qx86-hi performs slightly better on most tasks β€” especially piqa and winogrande.
87
+ - This suggests that higher bit precision improves multimodal coherence, particularly in coreference and visual-text alignment.
88
+
89
+ 🧠 The Vision-Language Trade-Off
90
+
91
+ Let’s compare Qwen3-VL with the YOYO-V4 and other 30B models:
92
+ ```bash
93
+ Model arc_challenge boolq piqa winogrande
94
+ Qwen3-30B-A3B-YOYO-V4-qx86-hi 0.511 0.885 0.769 0.618
95
+ Qwen3-VL-30B-A3B-Instruct-qx64-hi 0.454 0.893 0.749 0.590
96
+ SR-Scientist-30B-bf16 0.419 0.879 0.720 0.575
97
+ unsloth-Qwen3-Coder-30B-A3B-Instruct-bf16 0.422 0.879 0.720 0.579
98
+ ```
99
+ 🧠 Key Insights:
100
+ - Qwen3-VL excels in logical precision (boolq) β€” likely due to visual grounding providing unambiguous truth anchors.
101
+ - But it lags in reasoning fluency and commonsense inference β€” suggesting that vision integration may constrain abstract thought.
102
+ - The model is not optimized for pure text tasks, even though it performs well on boolq.
103
+ - Instruct tuning (as seen in Qwen3-VL) improves instruction-following, but at the cost of general reasoning agility.
104
+
105
+ πŸš€ The Qwen3-VL Cognitive Profile
106
+
107
+ βœ… Strengths:
108
+ - Exceptional boolean reasoning (boolq)
109
+ - High precision in truth evaluation
110
+ - Strong visual-text alignment
111
+
112
+ ❌ Weaknesses:
113
+ - Poor performance on arc_easy and hellaswag
114
+ - Reduced commonsense fluency
115
+ - Lower winogrande scores β€” suggests coreference parsing is impaired
116
+
117
+ 🧩 This model behaves like a highly trained scientist β€” precise, logical, but not naturally conversational.
118
+
119
+ πŸ“Œ Final Recommendation
120
+
121
+ πŸ” If your use case involves text-based reasoning with visual grounding (e.g., medical imaging QA, scientific diagram interpretation), then Qwen3-VL-30B-A3B-Instruct is a strong candidate.
122
+
123
+ 🚫 If your goal is pure textual reasoning, narrative fluency, or creative inference, then the YOYO-V4 models are superior.
124
+
125
+ 🧠 The Qwen3-VL series is not a general-purpose model β€” it’s a specialized cognitive tool, optimized for vision-grounded logic.
126
+
127
+ πŸ“£ Final Thought
128
+
129
+ "The mind that sees is not the same as the mind that thinks β€” but when they merge, truth becomes undeniable."
130
+
131
+ The Qwen3-VL models are not just smarter β€” they are more grounded.
132
+
133
+ But in the pursuit of truth, sometimes clarity comes at the cost of fluency.
134
+
135
+ πŸ–– Until our next cognitive convergence, Commander.
136
+
137
+ Keep your vision sharp β€” and your reasoning even sharper.
138
+
139
+ > Self review
140
 
141
  This model [Qwen3-VL-30B-A3B-Instruct-qx86-hi-mlx](https://huggingface.co/nightmedia/Qwen3-VL-30B-A3B-Instruct-qx86-hi-mlx) was
142
  converted to MLX format from [Qwen/Qwen3-VL-30B-A3B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-30B-A3B-Instruct)