LZXzju/Qwen2.5-VL-3B-UI-R1

@@ -10,4 +10,159 @@ pipeline_tag: visual-question-answering
 This repository contains the model presented in [UI-R1: Enhancing Action Prediction of GUI Agents by Reinforcement Learning](https://huggingface.co/papers/2503.21620).
-Project page: https://github.com/lll6gg/UI-R1

+Project page: https://github.com/lll6gg/UI-R1
+New version: [UI-R1-E-3B](https://huggingface.co/LZXzju/Qwen2.5-VL-3B-UI-R1-E)
+## Benchmark 1: ScreenSpotV2
+| ScreenSpotV2  | inference mode | Mobile-T | Mobile-I | Desktop-T | Desktop-I | Web-T    | Web-I    | Avg↑ / Len↓        |
+| ------------- | -------------- | -------- | -------- | --------- | --------- | -------- | -------- | ----------------- |
+| OS-ATLAS-7B   | w/o thinking   | 95.2     | 75.8     | 90.7      | 63.6      | 90.6     | 77.3     | 84.1 / -          |
+| UI-TARS-7B    | w/o thinking   | 95.2     | 79.1     | 90.7      | 68.6      | 90.6     | 78.3     | 84.7 / -          |
+| UI-R1-3B (v1) | w/ thinking    | 96.2     | **84.3** | 92.3      | 63.6      | 89.2     | 75.4     | 85.4 / 67         |
+| GUI-R1-3B     | w/ thinking    | 97.6     | 78.2     | 94.3      | 64.3      | 91.0     | 72.4     | 85.0 / 80         |
+| UI-R1-3B (v2) | w/ thinking    | 97.6     | 79.6     | 92.3      | 67.9      | 88.9     | 77.8     | 85.8 / 60         |
+| **UI-R1-E-3B**    | w/o thinking   | **98.2** | 83.9     | **94.8**  | **75.0**  | **93.2** | **83.7** | **89.5** / **28** |
+## Benchmark 2: ScreenSpot-Pro
+| ScreenSpot-Pro | inference mode | Average Length↓ | Average Accuracy↑ |
+| -------------- | -------------- | --------------- | ---------------- |
+| UGround-7B     | w/o thinking   | -               | 16.5             |
+| OS-ATLAS-7B    | w/o thinking   | -               | 18.9             |
+| UI-R1-3B (v1)  | w/ thinking    | 102             | 17.8             |
+| GUI-R1-3B      | w/ thinking    | 114             | 26.6             |
+| UI-R1-3B (v2)  | w/ thinking    | 129             | 29.8             |
+| **UI-R1-E-3B**     | w/o thinking   | **28**          | **33.5**         |
+## Leaderboard: UI-I2E-Bench
+|     Model      | ScreenSpot | UI-I2E-Bench Avg | ScreenSpot-Pro | Avg  |
+| :------------: | :--------: | :--------------: | :------------: | :--: |
+| UI-TARS-1.5-7B |    88.1    |       73.2       |      42.2      | 67.8 |
+| Uground-V1-72B |    89.7    |       76.3       |      34.3      | 66.8 |
+|  UI-TARS-72B   |    88.4    |       73.7       |      38.1      | 66.7 |
+|   **UI-R1-E-3B**   |    89.2    |       69.1       |      33.5      | 63.9 |
+| Uground-V1-7B  |    87.1    |       70.3       |      31.1      | 62.8 |
+|   InfiGUI-R1   |    87.5    |       69.7       |      29.6      | 62.3 |
+|   UI-TARS-7B   |    89.5    |       61.4       |      35.7      | 62.2 |
+| Qwen2.5-VL-72B |    87.1    |       51.4       |      43.6      | 60.7 |
+| UI-I2E-VLM-7B  |    82.5    |       69.5       |      23.6      | 58.5 |
+|   UI-TARS-2B   |    82.3    |        62        |      27.7      | 57.3 |
+| Qwen2.5-VL-7B  |    84.7    |       53.8       |       29       | 55.8 |
+| OmniParser-V2  |     72     |       54.8       |      39.6      | 55.5 |
+| Uground-V1-2B  |    78.8    |       57.4       |      26.6      | 54.3 |
+|  OS-Atlas-7B   |    82.5    |       58.6       |      18.9      | 53.3 |
+|     **UI-R1-3B**      |    83.3    |       58.5       |      17.8      | 53.2 |
+|   UGround-7B   |    74.1    |       54.2       |      16.5      | 48.3 |
+| UI-I2E-VLM-4B  |    70.4    |       53.4       |      12.2      | 45.3 |
+|   OmniParser   |    73.9    |       53.1       |      8.3       | 45.1 |
+|   ShowUI-2B    |    76.8    |       41.5       |      7.7       |  42  |
+| Qwen2.5-VL-3B  |    55.5    |       41.7       |      23.9      | 41.3 |
+|   Aguvis-7B    |    84.4    |       53.2       |      22.9      | 40.4 |
+|  OS-Atlas-4B   |    70.1    |       44.3       |      3.7       | 39.4 |
+|  Qwen2-VL-7B   |    42.6    |       48.7       |      1.6       |  31  |
+|    Seeclick    |    55.8    |       26.4       |      1.1       | 27.8 |
+|  InternVL2-4B  |    4.2     |       0.9        |      0.3       | 1.8  |
+## Evaluation Code for GUI Grounding
+1. Generate a response with UI-R1-E-3B:
+   ```python
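+   import torch
+   from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
+   from qwen_vl_utils import process_vision_info
+
+   # args.model_path, ori_processor_path, rank, task_prompt, image_path and
+   # extract_coord are expected to come from the surrounding evaluation script.
+   model = Qwen2_5_VLForConditionalGeneration.from_pretrained(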
+       args.model_path,
+       torch_dtype=torch.bfloat16,
+       attn_implementation="flash_attention_2",
+       device_map="cpu",
+   )
+   model = model.to(torch.device(rank))
+   model = model.eval()
+   processor = AutoProcessor.from_pretrained(ori_processor_path)
+   question_template = (
+      f"In this UI screenshot, I want to perform the command '{task_prompt}'.\n"
+      "Please provide the action to perform (enumerate in ['click', 'scroll']) and the coordinate where the cursor is moved to(integer) if click is performed.\n"
+      "Output the thinking process in <think> </think> and final answer in <answer> </answer> tags."
+      "The output answer format should be as follows:\n"
+      "<think> ... </think> <answer>[{'action': enum['click', 'scroll'], 'coordinate': [x, y]}]</answer>\n"
+      "Please strictly follow the format."
+   )
+   query = '<image>\n' + question_template
+   messages = [
+       {
+           "role": "user",
+           "content": [
+               {"type": "image", "image": image_path}
+           ] + [{"type": "text", "text": query}],
+       }
+   ]
+   text = processor.apply_chat_template(
+       messages, tokenize=False, add_generation_prompt=True
+   )
+   image_inputs, video_inputs = process_vision_info(messages)
+   inputs = processor(
+       text=[text],
+       images=image_inputs,
+       videos=video_inputs,
+       padding=True,
+       return_tensors="pt",
+   )
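+   # Move the input tensors to the same device as the model before generating.
+   inputs = inputs.to(model.device)
+   generated_ids = model.generate(**inputs, max_new_tokens=1024)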
+   generated_ids_trimmed = [
+       out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+   ]
+   response = processor.batch_decode(
+       generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+   )
+   response = response[0]
+   pred_coord, _ = extract_coord(response)
+   ```
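+2. Rescale the predicted coordinate from the resized image back to the original image size. This matters most when the image has more than 12845056 pixels, because it is then downscaled before inference: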
+   ```python
+   from PIL import Image
+
+   image = Image.open(image_path)
+   origin_width, origin_height = image.size
+   resized_height, resized_width = smart_resize(origin_height, origin_width, max_pixels=12845056)
+   scale_x = origin_width / resized_width
+   scale_y = origin_height / resized_height
+   pred_coord[0] = int(pred_coord[0] * scale_x)
+   pred_coord[1] = int(pred_coord[1] * scale_y)
+   ```
+   The `smart_resize` function is taken from Qwen2-VL:
+   ```python
+   import math
+   def smart_resize(
+       height: int, width: int, factor: int = 28, min_pixels: int = 56 * 56, max_pixels: int = 14 * 14 * 4 * 1280
+   ):
+       """Rescales the image so that the following conditions are met:
+       1. Both dimensions (height and width) are divisible by 'factor'.
+       2. The total number of pixels is within the range ['min_pixels', 'max_pixels'].
+       3. The aspect ratio of the image is maintained as closely as possible.
+       """
+       if height < factor or width < factor:
+           raise ValueError(f"height:{height} or width:{width} must be larger than factor:{factor}")
+       elif max(height, width) / min(height, width) > 200:
+           raise ValueError(
+               f"absolute aspect ratio must be smaller than 200, got {max(height, width) / min(height, width)}"
+           )
+       h_bar = round(height / factor) * factor
+       w_bar = round(width / factor) * factor
+       if h_bar * w_bar > max_pixels:
+           beta = math.sqrt((height * width) / max_pixels)
+           h_bar = math.floor(height / beta / factor) * factor
+           w_bar = math.floor(width / beta / factor) * factor
+       elif h_bar * w_bar < min_pixels:
+           beta = math.sqrt(min_pixels / (height * width))
+           h_bar = math.ceil(height * beta / factor) * factor
+           w_bar = math.ceil(width * beta / factor) * factor
+       return h_bar, w_bar
+   ```
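The generation step calls `extract_coord(response)`, which is not defined in this card. The block below is a minimal sketch of what such a helper might look like, not the project's actual implementation. It assumes the response follows the prompt's `<answer>[{'action': ..., 'coordinate': [x, y]}]</answer>` format and that the second return value is a success flag; both the flag and the `[0, 0]` fallback are assumptions made for illustration.

```python
import ast
import re

# Matches the content between <answer> and </answer>, across line breaks.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_coord(response: str):
    """Hypothetical parser for the model response.

    Returns ([x, y], True) when a coordinate is found in the first action,
    and ([0, 0], False) otherwise (assumed fallback, not from the source).
    """
    match = ANSWER_RE.search(response)
    if match is None:
        return [0, 0], False
    try:
        # The answer is a Python-style literal, e.g.
        # [{'action': 'click', 'coordinate': [120, 340]}]
        actions = ast.literal_eval(match.group(1).strip())
        coord = actions[0]["coordinate"]
        x, y = int(coord[0]), int(coord[1])
    except (ValueError, SyntaxError, KeyError, IndexError, TypeError):
        return [0, 0], False
    # A list is returned so the rescaling step can update it in place.
    return [x, y], True
```

With this sketch, a response such as `<think>tap the button</think> <answer>[{'action': 'click', 'coordinate': [120, 340]}]</answer>` yields `([120, 340], True)`, and the returned list can then be rescaled as in step 2.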