Qwen 2.5VL Abliterated?

#135
by ScramboSplinergy - opened

It doesn't look like you're using an abliterated model. There's a lot of evidence of it improving NSFW content.

Qwen CLIP for Wan? Are you looking at the correct model? Maybe you meant to post this on his Qwen model?

Owner

I am looking into it for the Qwen model!

Owner

I've struggled to get the abliterated model working (I can't seem to find any prepackaged version of it, so I've been trying to load GGUFs with the mmproj alongside them and getting errors). However, I haven't found any signs that it's any better. For example:

https://www.reddit.com/r/comfyui/comments/1mjvtnm/comment/n7jp8t6/

I've been able to prompt NSFW stuff with the typical text encoder, so I'm not really sure what the abliterated version does better (if anything) to justify continuing to invest time in it. If anyone has any other resources or tips, please share!

Correct me if I'm wrong, but I think the theory behind abliteration supports the claim that abliteration doesn't matter when VL models are used this way.
Basically, abliteration aims to remove the "refusal" direction in embedding space, which is mostly trained in manually during post-training stages where the model is fed "harmful" prompts paired with artificial refusal responses. This is achieved by running the censored model on refused prompts and comparing the activation vectors to find the common refusal direction, then silencing that direction by masking it out and propagating the change to the weights.
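If it helps, here's roughly what that procedure looks like in code. This is a minimal sketch of the difference-of-means flavor of directional ablation; the tensors and shapes are made up for illustration and aren't taken from any real abliteration script:

```python
import torch

def refusal_direction(harmful_acts: torch.Tensor,
                      harmless_acts: torch.Tensor) -> torch.Tensor:
    """Difference of mean activations, normalized to a unit vector."""
    direction = harmful_acts.mean(dim=0) - harmless_acts.mean(dim=0)
    return direction / direction.norm()

def ablate_direction(weight: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project the refusal direction out of a weight matrix that writes to the
    residual stream: W' = (I - d d^T) W, so the layer can no longer output
    anything along d."""
    d = direction.unsqueeze(1)          # (hidden, 1)
    return weight - d @ (d.T @ weight)

# Toy shapes: 8 prompts per set, hidden size 16.
harmful_acts = torch.randn(8, 16)       # activations captured on refused prompts
harmless_acts = torch.randn(8, 16)      # activations captured on benign prompts
d = refusal_direction(harmful_acts, harmless_acts)

W = torch.randn(16, 16)                 # stand-in for an output-projection matrix
W_ablated = ablate_direction(W, d)
print((W_ablated.T @ d).norm())         # ~0: the ablated layer can't write along d
```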

However, the "refusal" direction in the embedding vector only makes sense if the vector is used to predict the next token; those directions are then expected to produce refusal tokens in the response. In an image-gen application the vector is used to condition the diffusion model, and what the diffusion model decides to do with that direction depends entirely on how it was trained. You would basically need a separate fine-tuning stage to make the diffusion model exhibit refusal behavior upon seeing these directions, and as far as I know qwen-image did not have such a stage. Furthermore, even if there was any such fine-tuning, it has probably already been overwritten by the NSFW fine-tuning process, since that tunes the same networks.
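To make that concrete, here's a toy sketch of the two usage modes. The modules and shapes are stand-ins, not the actual Qwen-Image pipeline; the point is just that the image-gen path never touches the LM head that would decode a refusal:

```python
import torch
import torch.nn as nn

hidden_size, vocab_size, seq_len = 16, 32, 4
# Stand-in for the VL model's final hidden states for one prompt: (seq, batch, hidden).
encoder_states = torch.randn(seq_len, 1, hidden_size)

# Chat use: an LM head decodes the last hidden state into next-token logits.
# A "refusal" direction in the states matters here because the head was
# trained to map it onto refusal tokens.
lm_head = nn.Linear(hidden_size, vocab_size)
logits = lm_head(encoder_states[-1])

# Image-gen use: the same states serve only as cross-attention context for the
# diffusion transformer. The LM head never runs, so the refusal direction is
# inert unless the diffusion model was separately trained to react to it.
cross_attn = nn.MultiheadAttention(embed_dim=hidden_size, num_heads=4)
image_latents = torch.randn(10, 1, hidden_size)   # stand-in for noisy latents
conditioned, _ = cross_attn(image_latents, encoder_states, encoder_states)

print(logits.shape)       # torch.Size([1, 32])
print(conditioned.shape)  # torch.Size([10, 1, 16])
```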

tldr: a censored VL model used as the text encoder of an image-gen application only "tags" to-be-censored content as a special category; what to do with that category is up to the diffusion model's behavior, and whatever censorship may exist in the qwen-image diffusion model has probably already been overwritten by the NSFW fine-tuning process.

Interesting. I'm a total noob, so I wasn't sure whether what I read in another forum had any credence. Glad to know how the abliteration process works!