jongwooko/Flex-Omni-7B

@@ -1,197 +1,116 @@
 ---
 library_name: transformers
-tags: []
 ---
 # Model Card for Model ID
 <!-- Provide a quick summary of what the model is/does. -->
-[Flex-Judge](https://arxiv.org/abs/2505.18601)
-## Model Details
-### Model Description
-<!-- Provide a longer summary of what this model is. -->
-This is the model card of a 🤗 transformers model that has been pushed on the Hub. This model card has been automatically generated.
-- **Developed by:** [More Information Needed]
-- **Funded by [optional]:** [More Information Needed]
-- **Shared by [optional]:** [More Information Needed]
-- **Model type:** [More Information Needed]
-- **Language(s) (NLP):** [More Information Needed]
-- **License:** [More Information Needed]
-- **Finetuned from model [optional]:** [More Information Needed]
-### Model Sources [optional]
 <!-- Provide the basic links for the model. -->
-- **Repository:** [More Information Needed]
-- **Paper [optional]:** [More Information Needed]
-- **Demo [optional]:** [More Information Needed]
 ## Uses
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
-### Direct Use
 <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
-[More Information Needed]
-### Downstream Use [optional]
-<!-- This section is for the model use when fine-tuned for a task, or when plugged into a larger ecosystem/app -->
-[More Information Needed]
-### Out-of-Scope Use
-<!-- This section addresses misuse, malicious use, and uses that the model will not work well for. -->
-[More Information Needed]
-## Bias, Risks, and Limitations
-<!-- This section is meant to convey both technical and sociotechnical limitations. -->
-[More Information Needed]
-### Recommendations
-<!-- This section is meant to convey recommendations with respect to the bias, risk, and technical limitations. -->
-Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
-## How to Get Started with the Model
-Use the code below to get started with the model.
-[More Information Needed]
-## Training Details
-### Training Data
-<!-- This should link to a Dataset Card, perhaps with a short stub of information on what the training data is all about as well as documentation related to data pre-processing or additional filtering. -->
-[More Information Needed]
-### Training Procedure
-<!-- This relates heavily to the Technical Specifications. Content here should link to that section when it is relevant to the training procedure. -->
-#### Preprocessing [optional]
-[More Information Needed]
-#### Training Hyperparameters
-- **Training regime:** [More Information Needed] <!--fp32, fp16 mixed precision, bf16 mixed precision, bf16 non-mixed precision, fp16 non-mixed precision, fp8 mixed precision -->
-#### Speeds, Sizes, Times [optional]
-<!-- This section provides information about throughput, start/end time, checkpoint size if relevant, etc. -->
-[More Information Needed]
-## Evaluation
-<!-- This section describes the evaluation protocols and provides the results. -->
-### Testing Data, Factors & Metrics
-#### Testing Data
-<!-- This should link to a Dataset Card if possible. -->
-[More Information Needed]
-#### Factors
-<!-- These are the things the evaluation is disaggregating by, e.g., subpopulations or domains. -->
-[More Information Needed]
-#### Metrics
-<!-- These are the evaluation metrics being used, ideally with a description of why. -->
-[More Information Needed]
-### Results
-[More Information Needed]
-#### Summary
-## Model Examination [optional]
-<!-- Relevant interpretability work for the model goes here -->
-[More Information Needed]
-## Environmental Impact
-<!-- Total emissions (in grams of CO2eq) and additional considerations, such as electricity usage, go here. Edit the suggested text below accordingly -->
-- **Hardware Type:** [More Information Needed]
-- **Hours used:** [More Information Needed]
-- **Cloud Provider:** [More Information Needed]
-- **Compute Region:** [More Information Needed]
-- **Carbon Emitted:** [More Information Needed]
-## Technical Specifications [optional]
-### Model Architecture and Objective
-[More Information Needed]
-### Compute Infrastructure
-[More Information Needed]
-#### Hardware
-[More Information Needed]
-#### Software
-[More Information Needed]
-## Citation [optional]
 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 **BibTeX:**
-[More Information Needed]
-**APA:**
-[More Information Needed]
-## Glossary [optional]
-<!-- If relevant, include terms and calculations in this section that can help readers understand the model or model card. -->
-[More Information Needed]
-## More Information [optional]
-[More Information Needed]
-## Model Card Authors [optional]
-[More Information Needed]
-## Model Card Contact
-[More Information Needed]

 ---
 library_name: transformers
+base_model:
+- Qwen/Qwen2.5-Omni-7B
 ---
 # Model Card for Model ID
 <!-- Provide a quick summary of what the model is/does. -->
+[Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators](https://arxiv.org/abs/2505.18601)
+**Flex-Omni-7B** is an 11B-parameter multimodal evaluator that handles audio-based evaluation in addition to vision-language tasks, which vision-language-only judges cannot do. It inherits the reasoning-by-text paradigm of Flex-Judge, which gives it strong performance across modalities, and it even outperforms models such as Gemini-2.0-Flash on audio benchmarks such as MOS and speech scoring. Vision, language, and audio reasoning are all handled within a single framework.
+### Model Description
+- We propose **Flex-Judge**, a reasoning-guided multimodal evaluator that leverages minimal textual reasoning data to robustly generalize across multiple modalities and evaluation formats.
+- Our framework highlights reasoning-based text supervision as a powerful, cost-effective alternative to traditional annotation-intensive approaches, substantially advancing scalable, multimodal model-as-a-judge.
+### Model Sources
 <!-- Provide the basic links for the model. -->
+- **Repository:** https://github.com/jongwooko/flex-judge
+- **Paper:** [Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators](https://arxiv.org/abs/2505.18601)
 ## Uses
 <!-- Address questions around how the model is intended to be used, including the foreseeable users of the model and those affected by the model. -->
+For more comprehensive usage examples and implementation details, please refer to our official repository.
+### Requirements
 <!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem/app. -->
+```
+pip install git+https://github.com/huggingface/[email protected]
+pip install accelerate
+pip install qwen-omni-utils[decord] -U
+pip install vllm
+pip install datasets
+```
+### Using vLLM
+We recommend `vllm` over `transformers` for faster inference. The results in our paper were obtained with the `vllm` library.
+```
+from datasets import load_dataset
+from vllm import LLM, SamplingParams
+# Load the model; tensor_parallel_size=4 shards it across 4 GPUs, so adjust it to your hardware
+llm = LLM(
+    "jongwooko/Flex-Omni-7B",
+    tensor_parallel_size=4,
+    limit_mm_per_prompt={"image": 1},  # Accept at most one image per prompt
+)
+sampling_params = SamplingParams(
+    max_tokens=4096,
+    temperature=0.2,
+    top_p=0.95,
+)
+# Example
+example = load_dataset('MMInstruction/VL-RewardBench', split='test')[0]
+question, image = example["query"], example["image"]
+answer1, answer2 = example["response"]
+# System prompt for Flex-Judge
+SYSTEM_PROMPT = (
+    "You are a helpful assistant. The assistant first performs a detailed, "
+    "step-by-step reasoning process in its mind and then provides the user with "
+    "the answer. The reasoning process and answer are enclosed within <think> "
+    "reasoning process here, explaining each step of your evaluation for both "
+    "assistants </think><answer> answer here </answer>. Now the user asks you "
+    "to judge the performance of two AI assistants in response to the question. "
+    "Score assistants 1-10 (higher=better). Criteria includes helpfulness, "
+    "relevance, accuracy, and level of detail. Avoid order, length, style or "
+    "other bias. After thinking, when you finally reach a conclusion, clearly "
+    "provide your evaluation scores within <answer> </answer> tags, i.e., for "
+    "example, <answer>3</answer><answer>5</answer>"
+)
+instruction = (
+    f"<|vision_start|><|IMAGE|><|vision_end|>\n\n[Question]\n{question}\n\n"
+    f"[Assistant 1's Answer]\n{answer1}\n\n[Assistant 2's Answer]\n{answer2}"
+)
+prompt = (
+    f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
+    f"<|im_start|>user\n{instruction}<|im_end|>\n"
+    "<|im_start|>assistant\n<think>\n\n"
+)
+inputs = {"prompt": prompt, "multi_modal_data": {"image": [image]}}
+# Run inference; the output contains the reasoning trace followed by the scores
+outputs = llm.generate([inputs], sampling_params=sampling_params)
+output_text = outputs[0].outputs[0].text
+print(output_text)
+```
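The system prompt above asks the judge to close its reasoning with `</think>` and then report one score per assistant in `<answer>` tags, for example `<answer>3</answer><answer>5</answer>`. The following is a minimal sketch of how such output could be turned into numbers. It is not part of the official repository: the helper names `parse_scores` and `preferred_assistant` are hypothetical, and the sketch assumes the scores are plain integers or decimals.

```python
import re
from typing import List, Optional

# Assumption: each score is an integer or decimal wrapped in <answer>...</answer>,
# as requested by the system prompt.
ANSWER_PATTERN = re.compile(r"<answer>\s*(\d+(?:\.\d+)?)\s*</answer>")


def parse_scores(output_text: str, expected: int = 2) -> Optional[List[float]]:
    """Return the scores found after the reasoning, or None if the count is off."""
    # Only look at the text after the last </think>, so that scores mentioned
    # while reasoning are not mistaken for the final verdict.
    final_part = output_text.rsplit("</think>", 1)[-1]
    scores = [float(s) for s in ANSWER_PATTERN.findall(final_part)]
    return scores if len(scores) == expected else None


def preferred_assistant(scores: List[float]) -> int:
    """Return 1 or 2 for the higher-scored assistant, or 0 for a tie."""
    if scores[0] == scores[1]:
        return 0
    return 1 if scores[0] > scores[1] else 2
```

With the `output_text` from the example above, `parse_scores(output_text)` would return a two-element list when the judge followed the requested format and `None` otherwise, which makes malformed generations easy to filter or retry.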
+## Citation
 <!-- If there is a paper or blog post introducing the model, the APA and Bibtex information for that should go in this section. -->
 **BibTeX:**
+```
+@article{ko2025flex,
+  title={Flex-Judge: Text-Only Reasoning Unleashes Zero-Shot Multimodal Evaluators},
+  author={Ko, Jongwoo and Kim, Sungnyun and Cho, Sungwoo and Yun, Se-Young},
+  journal={arXiv preprint arXiv:2505.18601},
+  year={2025}
+}
+```