fanqiNO1 committed
Commit 96ea080 · verified · 1 parent: 7709819

Upload folder using huggingface_hub

Browse files
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ examples/image.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,3 +1,355 @@
1
- ---
2
- license: mit
3
- ---
1
+ ---
2
+ license: mit
3
+ ---
4
+
5
+ # LEGION-8B-replicate
6
+
7
+ ## Overview
8
+
9
+ The project [LEGION: Learning to Ground and Explain for Synthetic Image Detection](https://arxiv.org/abs/2503.15264) open-sourced its code repository but did not release pre-trained weights, so we replicated the model from the open-source code and the paper and are releasing our replicated weights here.
10
+
11
+ > [!NOTE]
12
+ > Due to potential discrepancies in the replication process, the released weights may achieve lower scores than officially reported results on certain benchmarks.
13
+
14
+ ### Training Details
15
+
16
+ We trained on 4× A100 40 GB GPUs.
17
+
18
+ For the first training stage, the official configuration uses 8 GPUs with a global batch size of 16 (batch size per device = 2). To maintain the same global batch size, we used 4 GPUs with a per-device batch size of 4.
19
+
20
+ For the second training stage, the official configuration uses 8 GPUs with a global batch size of 512 (batch size per device = 64). We used 4 GPUs with a per-device batch size of 8 and 16 gradient accumulation steps, giving an effective per-device batch size of 128 and the same global batch size of 512, as the sketch below verifies.
21
+
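+ The batch-size equivalence can be checked with a quick calculation (a minimal sketch; the GPU counts and batch sizes are those stated above, and the helper function is purely illustrative):
+
+ ```python
+ # Global batch size = number of GPUs x per-device batch size x gradient accumulation steps.
+ def global_batch_size(num_gpus: int, per_device_bs: int, grad_accum: int = 1) -> int:
+     return num_gpus * per_device_bs * grad_accum
+
+ # Stage 1: official 8 GPUs x 2 per device vs. ours 4 GPUs x 4 per device
+ assert global_batch_size(8, 2) == global_batch_size(4, 4) == 16
+
+ # Stage 2: official 8 GPUs x 64 per device vs. ours 4 GPUs x 8 per device with 16 accumulation steps
+ assert global_batch_size(8, 64) == global_batch_size(4, 8, grad_accum=16) == 512
+ ```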
22
+ ### Inference Usage
23
+
24
+ A simple inference script is provided at [infer.py](./infer.py).
25
+
26
+ Usage instructions are as follows:
27
+
28
+ ```bash
29
+ cp infer.py /path/to/LEGION
30
+ cd /path/to/LEGION
+ python infer.py --model_path /path/to/LEGION-8B-replicate --image_root /path/to/images --save_root /path/to/results
31
+ ```
32
+
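+ After copying `infer.py` into the LEGION repository as shown above, the `LEGION` wrapper class can also be used directly from Python (a minimal sketch; it must be run from the LEGION repository root so the `model`, `eval`, and `tools` packages are importable, a CUDA GPU is required, and the paths are placeholders):
+
+ ```python
+ from infer import LEGION
+
+ # Load the replicated checkpoint with the default image size (1024) and max sequence length (512).
+ legion = LEGION(model_path="/path/to/LEGION-8B-replicate")
+
+ # Run detection, localization, and explanation on a single image.
+ result = legion.infer("/path/to/images/example.png")
+ print(result["detection"])    # "real" or "fake"
+ print(result["explanation"])  # artifact explanation text
+ result["localization"].save("example_mask.png")  # binary artifact mask (PIL Image, mode "L")
+ ```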
33
+ ### Examples
34
+
35
+ <table>
36
+ <tr>
37
+ <td><img src="./examples/image.png" alt="Original Image" style="max-width:100%;"></td>
38
+ <td><img src="./examples/image_mask.png" alt="Mask generated by LEGION-8B-replicate" style="max-width:100%;"></td>
39
+ </tr>
40
+ </table>
41
+
42
+ The explanation generated for the example image above: "Upon examining the image, I have found: A cat sits on a rooftop at sunset, with its right front paw missing and the left front paw appearing deformed. To elaborate, I have found the following artifacts. Cat's right front paw: The cat's right front paw is missing. Cat's left front paw: The cat's left front paw is deformed."
43
+
44
+ ## Performance
45
+
46
+ > [!NOTE]
47
+ > Because the evaluation and metric code has not been open-sourced, the test results below may be inaccurate.
48
+ > The mask IoU metric may also be affected by the mask post-processing applied during inference (see the sketch below), which can lower the reported scores.
49
+
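+ For reference, the inference script merges the per-phrase masks into a single binary mask before saving. A sketch of that post-processing step (here `pred_masks` stands for the stacked per-phrase masks returned by the model):
+
+ ```python
+ import torch
+
+ def merge_masks(pred_masks: torch.Tensor) -> torch.Tensor:
+     """Union of per-phrase masks -> one binary artifact mask of shape (H, W)."""
+     binary = pred_masks > 0                 # threshold each predicted mask at zero
+     return torch.any(binary, dim=0).int()   # union over the mask dimension
+ ```
+
+ Because all region masks are collapsed into one union mask, the resulting IoU may differ from an evaluation that scores each predicted region separately.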
50
+ ### Localization
51
+
52
+ <table>
53
+ <tr>
54
+ <th rowspan="2">Method</th>
55
+ <th colspan="2">SynthScars</th>
56
+ <th colspan="2">LOKI</th>
57
+ <th colspan="2">RichHF-18K</th>
58
+ </tr>
59
+ <tr>
60
+ <th>mIoU</th>
61
+ <th>F1</th>
62
+ <th>mIoU</th>
63
+ <th>F1</th>
64
+ <th>mIoU</th>
65
+ <th>F1</th>
66
+ </tr>
67
+ <tr>
68
+ <td>HiFi-Net</td>
69
+ <td>45.65</td>
70
+ <td>0.57</td>
71
+ <td>39.60</td>
72
+ <td>2.41</td>
73
+ <td>44.96</td>
74
+ <td>0.39</td>
75
+ </tr>
76
+ <tr>
77
+ <td>TruFor</td>
78
+ <td>48.60</td>
79
+ <td>15.29</td>
80
+ <td>46.55</td>
81
+ <td>16.70</td>
82
+ <td>48.41</td>
83
+ <td>18.03</td>
84
+ </tr>
85
+ <tr>
86
+ <td>PAL4VST</td>
87
+ <td>56.10</td>
88
+ <td>29.21</td>
89
+ <td>47.34</td>
90
+ <td>11.58</td>
91
+ <td>49.88</td>
92
+ <td>14.78</td>
93
+ </tr>
94
+ <tr>
95
+ <td>Ferret</td>
96
+ <td>27.09</td>
97
+ <td>15.24</td>
98
+ <td>24.50</td>
99
+ <td>18.88</td>
100
+ <td>26.52</td>
101
+ <td>16.22</td>
102
+ </tr>
103
+ <tr>
104
+ <td>Griffon</td>
105
+ <td>27.68</td>
106
+ <td>16.67</td>
107
+ <td>21.96</td>
108
+ <td>20.41</td>
109
+ <td>28.13</td>
110
+ <td>18.19</td>
111
+ </tr>
112
+ <tr>
113
+ <td>LISA-v1-7B</td>
114
+ <td>34.51</td>
115
+ <td>18.77</td>
116
+ <td>31.10</td>
117
+ <td>9.29</td>
118
+ <td>35.90</td>
119
+ <td>21.94</td>
120
+ </tr>
121
+ <tr>
122
+ <td>InternVL2-8B</td>
123
+ <td>41.25</td>
124
+ <td>6.39</td>
125
+ <td>42.03</td>
126
+ <td>10.06</td>
127
+ <td>39.90</td>
128
+ <td>9.58</td>
129
+ </tr>
130
+ <tr>
131
+ <td>Qwen2-VL-72B</td>
132
+ <td>30.20</td>
133
+ <td>17.50</td>
134
+ <td>26.62</td>
135
+ <td>20.99</td>
136
+ <td>27.58</td>
137
+ <td>19.02</td>
138
+ </tr>
139
+ <tr style="background-color: #e6ffe6;">
140
+ <td>LEGION (Official)</td>
141
+ <td>58.13</td>
142
+ <td>34.54</td>
143
+ <td>48.66</td>
144
+ <td>16.71</td>
145
+ <td>50.07</td>
146
+ <td>17.41</td>
147
+ </tr>
148
+ <tr style="background-color: #e6ffe6;">
149
+ <td>LEGION (Replicate)</td>
150
+ <td>23.92</td>
151
+ <td>33.47</td>
152
+ <td>-</td>
153
+ <td>-</td>
154
+ <td>-</td>
155
+ <td>-</td>
156
+ </tr>
157
+ </table>
158
+
159
+ ### Explanation
160
+
161
+ <table>
162
+ <tr>
163
+ <th rowspan="2">Method</th>
164
+ <th rowspan="2">Params</th>
165
+ <th colspan="2">SynthScars</th>
166
+ <th colspan="2">LOKI</th>
167
+ </tr>
168
+ <tr>
169
+ <th>ROUGE-L ↑</th>
170
+ <th>CSS ↑</th>
171
+ <th>ROUGE-L ↑</th>
172
+ <th>CSS ↑</th>
173
+ </tr>
174
+ <tr>
175
+ <td>Qwen2-VL</td>
176
+ <td>72B</td>
177
+ <td>25.84</td>
178
+ <td>58.15</td>
179
+ <td>11.80</td>
180
+ <td>37.64</td>
181
+ </tr>
182
+ <tr>
183
+ <td>LLaVA-v1.6</td>
184
+ <td>7B</td>
185
+ <td>29.61</td>
186
+ <td>61.75</td>
187
+ <td>16.07</td>
188
+ <td>41.07</td>
189
+ </tr>
190
+ <tr>
191
+ <td>InternVL2</td>
192
+ <td>8B</td>
193
+ <td>25.93</td>
194
+ <td>56.89</td>
195
+ <td>10.10</td>
196
+ <td>39.62</td>
197
+ </tr>
198
+ <tr>
199
+ <td>Deepseek-VL2</td>
200
+ <td>27B</td>
201
+ <td>25.50</td>
202
+ <td>47.77</td>
203
+ <td>6.70</td>
204
+ <td>28.76</td>
205
+ </tr>
206
+ <tr>
207
+ <td>GPT-4o</td>
208
+ <td>-</td>
209
+ <td>22.43</td>
210
+ <td>53.55</td>
211
+ <td>9.61</td>
212
+ <td>38.98</td>
213
+ </tr>
214
+ <tr style="background-color: #e6ffe6;">
215
+ <td>LEGION (Official)</td>
216
+ <td>8B</td>
217
+ <td>39.50</td>
218
+ <td>72.60</td>
219
+ <td>18.55</td>
220
+ <td>45.96</td>
221
+ </tr>
222
+ <tr style="background-color: #e6ffe6;">
223
+ <td>LEGION (Replicate)</td>
224
+ <td>8B</td>
225
+ <td>50.57</td>
226
+ <td>-</td>
227
+ <td>-</td>
228
+ <td>-</td>
229
+ </tr>
230
+ </table>
231
+
232
+ ### Detection
233
+
234
+ <table>
235
+ <tr>
236
+ <th rowspan="2">Method</th>
237
+ <th rowspan="2">GANs</th>
238
+ <th rowspan="2">Deepfakes</th>
239
+ <th colspan="2">Perceptual Loss</th>
240
+ <th colspan="2">Low Level Vision</th>
241
+ <th rowspan="2">Diffusion</th>
242
+ </tr>
243
+ <tr>
244
+ <th>CRN</th>
245
+ <th>IMLE</th>
246
+ <th>SITD</th>
247
+ <th>SAN</th>
248
+ </tr>
249
+ <tr>
250
+ <td>Co-occurence</td>
251
+ <td>75.17</td>
252
+ <td>59.14</td>
253
+ <td>73.06</td>
254
+ <td>87.21</td>
255
+ <td>68.98</td>
256
+ <td>60.42</td>
257
+ <td>85.53</td>
258
+ </tr>
259
+ <tr>
260
+ <td>Freq-spec</td>
261
+ <td>75.28</td>
262
+ <td>45.18</td>
263
+ <td>53.61</td>
264
+ <td>50.98</td>
265
+ <td>47.46</td>
266
+ <td>57.12</td>
267
+ <td>69.00</td>
268
+ </tr>
269
+ <tr>
270
+ <td>CNNSpot</td>
271
+ <td>85.29</td>
272
+ <td>53.47</td>
273
+ <td>86.31</td>
274
+ <td>86.26</td>
275
+ <td>66.67</td>
276
+ <td>48.69</td>
277
+ <td>58.63</td>
278
+ </tr>
279
+ <tr>
280
+ <td>Patchfor</td>
281
+ <td>69.97</td>
282
+ <td>75.54</td>
283
+ <td>72.33</td>
284
+ <td>55.30</td>
285
+ <td>75.14</td>
286
+ <td>75.28</td>
287
+ <td>72.54</td>
288
+ </tr>
289
+ <tr>
290
+ <td>UniFD</td>
291
+ <td>95.25</td>
292
+ <td>66.60</td>
293
+ <td>59.50</td>
294
+ <td>72.00</td>
295
+ <td>63.00</td>
296
+ <td>57.50</td>
297
+ <td>82.02</td>
298
+ </tr>
299
+ <tr>
300
+ <td>LDGard</td>
301
+ <td>89.17</td>
302
+ <td>58.00</td>
303
+ <td>50.74</td>
304
+ <td>50.78</td>
305
+ <td>62.50</td>
306
+ <td>50.00</td>
307
+ <td>89.79</td>
308
+ </tr>
309
+ <tr>
310
+ <td>FreqNet</td>
311
+ <td>94.23</td>
312
+ <td>97.40</td>
313
+ <td>71.92</td>
314
+ <td>67.35</td>
315
+ <td>88.92</td>
316
+ <td>59.04</td>
317
+ <td>83.34</td>
318
+ </tr>
319
+ <tr>
320
+ <td>NPR</td>
321
+ <td>94.16</td>
322
+ <td>76.89</td>
323
+ <td>50.00</td>
324
+ <td>50.00</td>
325
+ <td>66.94</td>
326
+ <td>98.63</td>
327
+ <td>94.54</td>
328
+ </tr>
329
+ <tr style="background-color: #e6ffe6;">
330
+ <td>LEGION (Official)</td>
331
+ <td>97.01</td>
332
+ <td>63.37</td>
333
+ <td>90.78</td>
334
+ <td>98.93</td>
335
+ <td>79.44</td>
336
+ <td>57.76</td>
337
+ <td>83.10</td>
338
+ </tr>
339
+ <tr style="background-color: #e6ffe6;">
340
+ <td>LEGION (Replicate)</td>
341
+ <td>91.48</td>
342
+ <td>79.16</td>
343
+ <td>84.73</td>
344
+ <td>96.71</td>
345
+ <td>78.06</td>
346
+ <td>53.70</td>
347
+ <td>-</td>
348
+ </tr>
349
+ </table>
350
+
351
+ ## Acknowledgements
352
+
353
+ Thanks to [Gennadiyev](https://github.com/Gennadiyev) for providing computational resources and moral support, and for helping us complete the reproduction.
354
+
355
+ Thanks to [draw-your-dream/LEGION](https://github.com/draw-your-dream/LEGION/tree/main) for fixing bugs in the first-stage training.
added_tokens.json ADDED
@@ -0,0 +1,9 @@
1
+ {
2
+ "</p>": 32006,
3
+ "<bbox>": 32002,
4
+ "<im_end>": 32001,
5
+ "<im_start>": 32000,
6
+ "<p>": 32005,
7
+ "<point>": 32003,
8
+ "[SEG]": 32004
9
+ }
config.json ADDED
@@ -0,0 +1,59 @@
1
+ {
2
+ "architectures": [
3
+ "LegionForCls"
4
+ ],
5
+ "bbox_token_idx": 32002,
6
+ "bos_token_id": 1,
7
+ "eos_token_id": 2,
8
+ "freeze_mlp_adapter": true,
9
+ "freeze_mm_mlp_adapter": false,
10
+ "freeze_mm_vision_resampler": false,
11
+ "hidden_act": "silu",
12
+ "hidden_size": 4096,
13
+ "image_aspect": "square",
14
+ "image_aspect_ratio": "square",
15
+ "image_grid_pinpoints": null,
16
+ "image_grid_points": null,
17
+ "initializer_range": 0.02,
18
+ "intermediate_size": 11008,
19
+ "max_length": 4096,
20
+ "max_position_embeddings": 4096,
21
+ "mm_hidden_size": 1024,
22
+ "mm_projector_type": "mlp2x_gelu",
23
+ "mm_resampler_type": null,
24
+ "mm_use_im_patch_token": false,
25
+ "mm_use_im_start_end": true,
26
+ "mm_use_image_start_end": true,
27
+ "mm_vision_module": "openai/clip-vit-large-patch14-336",
28
+ "mm_vision_select_feature": "patch",
29
+ "mm_vision_select_layer": -2,
30
+ "mm_vision_tower": "openai/clip-vit-large-patch14-336",
31
+ "model_type": "llava",
32
+ "num_attention_heads": 32,
33
+ "num_hidden_layers": 32,
34
+ "num_key_value_heads": 32,
35
+ "num_level_reg_features": 4,
36
+ "num_reg_features": 4,
37
+ "out_dim": 256,
38
+ "pad_token_id": 0,
39
+ "pretrain_mm_mlp_adapter": null,
40
+ "pretraining_tp": 1,
41
+ "rms_norm_eps": 1e-05,
42
+ "rope_scaling": null,
43
+ "select_feature_type": "patch",
44
+ "tie_word_embeddings": false,
45
+ "torch_dtype": "bfloat16",
46
+ "train_mask_decoder": true,
47
+ "transformers_version": "4.28.0",
48
+ "tune_mlp_adapter": false,
49
+ "tune_mm_mlp_adapter": false,
50
+ "tune_mm_vision_resampler": false,
51
+ "unfreeze_mm_vision_tower": false,
52
+ "use_cache": false,
53
+ "use_image_patch_token": false,
54
+ "use_mm_proj": true,
55
+ "vision_module": "openai/clip-vit-large-patch14-336",
56
+ "vision_tower": "openai/clip-vit-large-patch14-336",
57
+ "vocab_size": 32007,
58
+ "with_region": true
59
+ }
examples/image.png ADDED

Git LFS Details

  • SHA256: 779932f4595b0795ae02a113f363f78b0ce07e7a85a4a0b9c53adad6981ff7ae
  • Pointer size: 131 Bytes
  • Size of remote file: 372 kB
examples/image_mask.png ADDED
generation_config.json ADDED
@@ -0,0 +1,9 @@
1
+ {
2
+ "_from_model_config": true,
3
+ "bos_token_id": 1,
4
+ "eos_token_id": 2,
5
+ "max_length": 4096,
6
+ "pad_token_id": 0,
7
+ "transformers_version": "4.28.0",
8
+ "use_cache": false
9
+ }
infer.py ADDED
@@ -0,0 +1,235 @@
1
+ import argparse
2
+ import os
3
+ import re
4
+
5
+ import bleach
6
+ import cv2
7
+ import jsonlines
8
+ import numpy as np
9
+ import torch
10
+ from loguru import logger
11
+ from PIL import Image
12
+ from tqdm import tqdm
13
+ from transformers import AutoTokenizer, CLIPImageProcessor, PreTrainedTokenizer
14
+
15
+ from eval.utils import grounding_image_ecoder_preprocess
16
+ from model.Legion import LegionForCls
17
+ from model.llava import conversation as conversation_lib
18
+ from model.llava.mm_utils import tokenizer_image_token
19
+ from model.SAM.utils.transforms import ResizeLongestSide
20
+ from tools.utils import DEFAULT_IM_END_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
21
+
22
+
23
+ def parse_args():
24
+ parser = argparse.ArgumentParser(description="LEGION Inference")
25
+ # model related
26
+ parser.add_argument("--model_path", required=True, help="The directory to your legion ckpt")
27
+ parser.add_argument("--image_size", default=1024, type=int, help="image size")
28
+ parser.add_argument("--model_max_length", default=512, type=int)
29
+ # data related
30
+ parser.add_argument("--image_root", required=True, help="The directory containing images to run inference.")
31
+ parser.add_argument("--save_root", required=True, help="The directory to store the inference result.")
32
+
33
+ args = parser.parse_args()
34
+ return args
35
+
36
+
37
+ class LEGION:
38
+ """A simple wrapper for LEGION model loading and inference.
39
+
40
+ Args:
41
+ model_path (str): Path to the model checkpoint.
42
+ image_size (int): Size of the input images.
43
+ model_max_length (int): Maximum length of the model input sequence.
44
+ """
45
+
46
+ INSTRUCTION = (
47
+ "Please provide a detailed analysis of artifacts in this photo, considering "
48
+ "physical artifacts (e.g., optical display issues, violations of physical laws, "
49
+ "and spatial/perspective errors), structural artifacts (e.g., deformed objects, asymmetry, or distorted text), "
50
+ "and distortion artifacts (e.g., color/texture distortion, noise/blur, artistic style errors, and material misrepresentation). "
51
+ "Output with interleaved segmentation masks for the corresponding parts of the answer."
52
+ )
53
+
54
+ def __init__(self, model_path: str, image_size: int = 1024, model_max_length: int = 512):
55
+ self.model_path = model_path
56
+ self.image_size = image_size
57
+ self.model_max_length = model_max_length
58
+
59
+ # load tokenizer
60
+ self.tokenizer: PreTrainedTokenizer = AutoTokenizer.from_pretrained(
61
+ self.model_path,
62
+ cache_dir=None,
63
+ model_max_length=self.model_max_length,
64
+ padding_side="right",
65
+ use_fast=False
66
+ )
67
+ self.tokenizer.pad_token = self.tokenizer.unk_token
68
+ seg_token_idx = self.tokenizer("[SEG]", add_special_tokens=False).input_ids[0]
69
+ logger.info("Tokenizer loaded successfully.")
70
+
71
+ # load model
72
+ self.model: LegionForCls = LegionForCls.from_pretrained(
73
+ self.model_path,
74
+ low_cpu_mem_usage=True,
75
+ seg_token_idx=seg_token_idx,
76
+ torch_dtype=torch.bfloat16
77
+ )
78
+ # update model config
79
+ self.model.config.eos_token_id = self.tokenizer.eos_token_id
80
+ self.model.config.bos_token_id = self.tokenizer.bos_token_id
81
+ self.model.config.pad_token_id = self.tokenizer.pad_token_id
82
+ # init global image encoder (CLIP)
83
+ self.model.get_model().initialize_vision_modules(self.model.get_model().config)
84
+ vision_tower = self.model.get_model().get_vision_tower()
85
+ vision_tower.to(dtype=torch.bfloat16)
86
+ # transfer the model to GPU
87
+ self.model = self.model.bfloat16().cuda()
88
+ vision_tower.to(device="cuda")
89
+ self.model.eval()
90
+ logger.info("Model loaded successfully.")
91
+
92
+ # init image processor for global image encoder (CLIP)
93
+ self.image_processor = CLIPImageProcessor.from_pretrained(self.model.config.vision_tower)
94
+ self.transform = ResizeLongestSide(self.image_size)
95
+ logger.info("Image processor initialized successfully.")
96
+
97
+ @torch.inference_mode()
98
+ def _infer(self, raw_image: np.ndarray):
99
+ """Run inference on a single image.
100
+
101
+ Args:
102
+ raw_image (np.ndarray): The input image in numpy array format.
103
+
104
+ Returns:
105
+ tuple: A tuple containing the explanation string, the predicted masks, and the classification result.
106
+ """
107
+ # clean instructions
108
+ instructions = bleach.clean(LEGION.INSTRUCTION)
109
+ instructions = instructions.replace('&lt;', '<').replace('&gt;', '>')
110
+
111
+ # prepare prompt
112
+ conv = conversation_lib.conv_templates["llava_v1"].copy()
113
+ conv.messages = []
114
+ prompt = f"The {DEFAULT_IM_START_TOKEN}{DEFAULT_IMAGE_TOKEN}{DEFAULT_IM_END_TOKEN} provides an overview of the picture.\n" + instructions
115
+ conv.append_message(conv.roles[0], prompt)
116
+ conv.append_message(conv.roles[1], "")
117
+ prompt = conv.get_prompt()
118
+
119
+ # preprocess image (CLIP)
120
+ image_np = cv2.cvtColor(raw_image, cv2.COLOR_BGR2RGB)
121
+ original_size_list = [image_np.shape[:2]]
122
+ image_clip = (self.image_processor.preprocess(image_np, return_tensors="pt")["pixel_values"][0].unsqueeze(0).cuda())
123
+ image_clip = image_clip.bfloat16()
124
+
125
+ # preprocess image (Grounding image encoder)
126
+ image = self.transform.apply_image(image_np)
127
+ resize_list = [image.shape[:2]]
128
+ image = (grounding_image_ecoder_preprocess(torch.from_numpy(image).permute(2, 0, 1).contiguous()).unsqueeze(0).cuda())
129
+ image = image.bfloat16()
130
+
131
+ # prepare inputs for inference
132
+ input_ids = tokenizer_image_token(prompt, self.tokenizer, return_tensors="pt")
133
+ input_ids = input_ids.unsqueeze(0).cuda()
134
+
135
+ # generate output
136
+ output_ids, pred_masks = self.model.evaluate(
137
+ image_clip,
138
+ image,
139
+ input_ids,
140
+ resize_list,
141
+ original_size_list,
142
+ max_tokens_new=512,
143
+ bboxes=None # No box/region is input in GCG task
144
+ )
145
+ output_ids = output_ids[0][output_ids[0] != IMAGE_TOKEN_INDEX]
146
+
147
+ # post-processing
148
+ text_output = self.tokenizer.decode(output_ids, skip_special_tokens=False)
149
+ text_output = text_output.replace("\n", "").replace("  ", " ")
150
+ text_output = text_output.split("ASSISTANT: ")[-1]
151
+ cleaned_str = re.sub(r'<.*?>', '', text_output)
152
+ # remove [SEG] token and unnecessary spaces
153
+ cleaned_str = cleaned_str.replace('[SEG]', '')
154
+ # strip unnecessary spaces
155
+ cleaned_str = ' '.join(cleaned_str.split()).strip("'")
156
+ cleaned_str = cleaned_str.strip()
157
+
158
+ # infer detection head
159
+ logits = self.model(global_enc_images=image_clip, inference_cls=True)['logits'].cpu()
160
+ _, pred_cls = torch.max(logits, dim=1)
161
+ pred_cls = int(pred_cls)
162
+ return cleaned_str, pred_masks, pred_cls
163
+
164
+ @torch.inference_mode()
165
+ def infer(self, image_path: str):
166
+ """Run inference on a single image.
167
+
168
+ Args:
169
+ image_path (str): Path to the input image.
170
+
171
+ Returns:
172
+ dict: A dictionary containing the explanation, the localization mask (PIL Image), and the detection result.
173
+ """
174
+ raw_image = cv2.imread(image_path)
175
+ explanation, localization, detection = self._infer(raw_image.astype(np.uint8))
176
+
177
+ # post-process localization mask
178
+ localization = localization[0].cpu()
179
+ binary_localization = localization > 0
180
+ binary_localization = torch.any(binary_localization, dim=0).int()
181
+ localization = (binary_localization.numpy() * 255).astype(np.uint8)
182
+ localization = Image.fromarray(localization, mode="L")
183
+
184
+ # post-process detection
185
+ detection = "real" if detection == 1 else "fake"
186
+
187
+ return {
188
+ "explanation": explanation,
189
+ "localization": localization,
190
+ "detection": detection
191
+ }
192
+
193
+
194
+ def main(args):
195
+ # get images
196
+ suffixes = [".jpg", ".jpeg", ".png"]
197
+ image_paths = sorted(os.listdir(args.image_root))
198
+ image_paths = [p for p in image_paths if os.path.splitext(p)[-1].lower() in suffixes]
199
+ logger.info(f"Found {len(image_paths)} images for inference.")
200
+
201
+ # init legion
202
+ legion = LEGION(args.model_path, args.image_size, args.model_max_length)
203
+
204
+ # check save root
205
+ os.makedirs(args.save_root, exist_ok=True)
206
+ localization_save_dir = os.path.join(args.save_root, "localization")
207
+ os.makedirs(localization_save_dir, exist_ok=True)
208
+ explanation_save_path = os.path.join(args.save_root, "explanations.jsonl")
209
+
210
+ # prepare resume
211
+ num_processed_images = 0
212
+ if os.path.exists(explanation_save_path):
213
+ num_processed_images = len(list(jsonlines.open(explanation_save_path)))
214
+ logger.info(f"Resuming from {num_processed_images} processed images.")
215
+ image_paths = image_paths[num_processed_images:]
216
+
217
+ # run inference
218
+ with jsonlines.open(explanation_save_path, mode="a", flush=True) as writer:
219
+ for image_path in tqdm(image_paths):
220
+ image_name = os.path.splitext(image_path)[0]
221
+ full_image_path = os.path.join(args.image_root, image_path)
222
+ result = legion.infer(full_image_path)
223
+ # save localization
224
+ this_localization_save_path = os.path.join(localization_save_dir, f"{image_name}_mask.png")
225
+ result["localization"].save(this_localization_save_path)
226
+ result["localization"] = this_localization_save_path
227
+ # add original image path
228
+ result["image_path"] = full_image_path
229
+ # write to jsonl
230
+ writer.write(result)
231
+
232
+
233
+ if __name__ == "__main__":
234
+ args = parse_args()
235
+ main(args)
pytorch_model-00001-of-00002.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:52bedf3c0f9c51c46511a732449dc08dfa36241639bd21e261fee7030a108be4
3
+ size 9976695294
pytorch_model-00002-of-00002.bin ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:8aa2a634e2e0667569d2f607b9038eda50fb970c2388932c4a9975941094220c
3
+ size 6070091263
pytorch_model.bin.index.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.model ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:9e556afd44213b6bd1be2b850ebbbd98f5481437a8021afaf58ee7fb1818d347
3
+ size 499723
tokenizer_config.json ADDED
@@ -0,0 +1,35 @@
1
+ {
2
+ "add_bos_token": true,
3
+ "add_eos_token": false,
4
+ "bos_token": {
5
+ "__type": "AddedToken",
6
+ "content": "<s>",
7
+ "lstrip": false,
8
+ "normalized": false,
9
+ "rstrip": false,
10
+ "single_word": false
11
+ },
12
+ "clean_up_tokenization_spaces": false,
13
+ "eos_token": {
14
+ "__type": "AddedToken",
15
+ "content": "</s>",
16
+ "lstrip": false,
17
+ "normalized": false,
18
+ "rstrip": false,
19
+ "single_word": false
20
+ },
21
+ "legacy": false,
22
+ "model_max_length": 1536,
23
+ "pad_token": null,
24
+ "padding_side": "right",
25
+ "sp_model_kwargs": {},
26
+ "tokenizer_class": "LlamaTokenizer",
27
+ "unk_token": {
28
+ "__type": "AddedToken",
29
+ "content": "<unk>",
30
+ "lstrip": false,
31
+ "normalized": false,
32
+ "rstrip": false,
33
+ "single_word": false
34
+ }
35
+ }