AI & ML interests

The Fellowship is a network of exceptional people from different backgrounds who contribute to open-source machine learning 🧙‍♂️🦸‍♀️🦹🧝‍♂️

Recent Activity

prithivMLmods posted an update about 19 hours ago
Introducing Photo-Mate-v2, based on FLUX.1-Kontext-dev, for advanced image manipulation tasks. It supports transforming scenes into top-down/bottom-up perspectives, CAM right/left views and their reverses, as well as general Kontext-specified object removal. Below is the list of demos and adapters, followed by a minimal usage sketch. 🔥🤗

➤ Spaces [Demo]: prithivMLmods/Kontext-Photo-Mate-v2

Kontext Adapters:
✦ Kontext-Bottom-Up-View: prithivMLmods/Kontext-Bottom-Up-View
✦ Kontext-CAM-Right-View: prithivMLmods/Kontext-CAM-Right-View
✦ Kontext-Top-Down-View: prithivMLmods/Kontext-Top-Down-View
✦ Kontext-CAM-Left-View: prithivMLmods/Kontext-CAM-Left-View
✦ Kontext-Unblur-Upscale: prithivMLmods/Kontext-Unblur-Upscale
✦ Kontext-0811-exp: prithivMLmods/Kontext-0811-exp

Photo-Mate Collection:
✦ Kontext CAM Angles: https://huggingface.co/collections/prithivMLmods/kontext-cam-angles
✦ i2i - Kontext (exp): https://huggingface.co/collections/prithivMLmods/i2i-kontext-exp
✦ LZO-1 (Lossless Zoom Operator): https://huggingface.co/collections/prithivMLmods/lzo-1-lossless-zoom-operator

Related-Apps:
✦ Photo-Mate [Version 1.0]: prithivMLmods/Photo-Mate-i2i
✦ Image Generation Apps [Collection]: https://huggingface.co/collections/prithivMLmods/image-generation-apps-collection
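
Not from the original post: a minimal sketch of how one of these adapters could be applied on top of FLUX.1-Kontext-dev with diffusers, assuming they are standard LoRA weights; the trigger prompt and file names are illustrative.

```python
# Hedged sketch: applying a Kontext adapter on FLUX.1-Kontext-dev with
# diffusers, assuming standard LoRA weights; prompt/files are illustrative.
import torch
from diffusers import FluxKontextPipeline
from diffusers.utils import load_image

pipe = FluxKontextPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("prithivMLmods/Kontext-Top-Down-View")  # any adapter above

source = load_image("scene.png")
edited = pipe(
    image=source,
    prompt="Transform the scene into a top-down view.",  # illustrative trigger
    guidance_scale=2.5,
).images[0]
edited.save("scene-top-down.png")
```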

To learn more, visit the app page or the respective model page!
prithivMLmods posted an update 5 days ago
A week ago, I shared a post about the latest transformers test implementation of DeepSeek-OCR compatibility (https://tinyurl.com/ykc4mm66). Now I'm releasing the most compatible version of it, supporting the model with the latest transformers. 🤗🔥

➠ DeepSeek-OCR-Latest-BF16.I64: prithivMLmods/DeepSeek-OCR-Latest-BF16.I64
➠ DeepSeek OCR [exp]: prithivMLmods/DeepSeek-OCR-experimental

✅ Supports the latest transformers v4.57.1
✅ torch 2.6.0+cu124 or the latest version (e.g., torch 2.9.0)
✅ CUDA version: 12.4
✅ Users can also opt out of specific attention implementations if desired (see the sketch below).
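
As a rough guide, here is a loading sketch that mirrors the upstream deepseek-ai/DeepSeek-OCR model card; the custom `infer` API and prompt format are assumptions carried over from there, so defer to this repo's model card for exact usage.

```python
# Hedged sketch mirroring the upstream deepseek-ai/DeepSeek-OCR usage; the
# custom `infer` API and prompt format are assumptions for this checkpoint.
import torch
from transformers import AutoModel, AutoTokenizer

repo = "prithivMLmods/DeepSeek-OCR-Latest-BF16.I64"
tokenizer = AutoTokenizer.from_pretrained(repo, trust_remote_code=True)
model = AutoModel.from_pretrained(
    repo,
    trust_remote_code=True,
    use_safetensors=True,
    _attn_implementation="eager",  # opting out of flash-attention
)
model = model.eval().cuda().to(torch.bfloat16)

result = model.infer(
    tokenizer,
    prompt="<image>\n<|grounding|>Convert the document to markdown.",
    image_file="page.png",   # illustrative input
    output_path="./output",
)
```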

✨ Previous version: strangervisionhf/deepseek-ocr-latest-transformers
↗️ Related blog: https://huggingface.co/blog/prithivMLmods/multimodal-ocr-vlms
✨ Community page: strangervisionhf
✨ Original model page: deepseek-ai/DeepSeek-OCR

To learn more, visit the app page or the respective model page!
prithivMLmods posted an update 9 days ago
A short blog post titled "Hall of Multimodal OCR VLMs and Demonstrations" has been published on behalf of strangervisionhf: ↗️ https://huggingface.co/blog/prithivMLmods/multimodal-ocr-vlms

It discusses the latest trends in OCR models, the multilingual support offered by modern OCR systems, their unique capabilities, OCR benchmark model comparisons, transformer-based implementations, and strategies for streamlining transformers compatibility.
prithivMLmods posted an update 11 days ago
I've implemented DeepSeek-OCR to support the latest transformers on the strangervisionhf page. The page includes the model weights and a corrected configuration, which fix the issues and allow transformers inference to run smoothly. 🤗🔥

> Model: strangervisionhf/deepseek-ocr-latest-transformers
> Demo Space: prithivMLmods/DeepSeek-OCR-experimental

✅ Supports the latest transformers
✅ You can also opt out of the attention implementation if needed
✅ Supports torch version 2.6.0 or higher
✅ CUDA version: 12.4

If you are interested in experimenting with new things and streamlining compatibility, the strangervisionhf organization is open to you; feel free to join the community.

> Multimodal Collection: prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0, https://huggingface.co/collections/strangervisionhf/october-2025-models

> Thank you, @merve, for granting the blazing-fast Zero GPU support!

> Notebook : https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/DeepSeek-OCR-Demo/deepseek_ocr_demo.ipynb

To learn more, visit the app page or the respective model page!
prithivMLmods posted an update 12 days ago
Introducing Gliese-OCR-7B-Post2.0-final, a document content-structure retrieval VLM designed for content extraction (OCR), summarization, and document visual question answering. This is the fourth and final model in the Camel Doc OCR VLM series, following Gliese-OCR-7B-Post1.0. The model delivers superior accuracy across a wide range of document types, including scanned PDFs, handwritten pages, structured forms, and analytical reports. 🚀🤗 A 4-bit loading sketch follows the links below.

> Gliese-OCR-7B-Post2.0-final : prithivMLmods/Gliese-OCR-7B-Post2.0-final
> Gliese-OCR-7B-Post1.0 (previous) : prithivMLmods/Gliese-OCR-7B-Post1.0
> Gliese OCR Post-x.0 (collection) : https://huggingface.co/collections/prithivMLmods/gliese-ocr-post-x0
> Multimodal Implementations (collection) : https://huggingface.co/collections/prithivMLmods/multimodal-implementations
> Qwen VL Captions (other-collection) : https://huggingface.co/collections/prithivMLmods/qwen-vl-captions
> Run Demo Here : prithivMLmods/Gliese-OCR-7B-Post2.0-final
> GitHub (4bit) : https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/Gliese-OCR-7B-Post2.0-final(4bit)/Gliese_OCR_7B_Post2_0_final.ipynb
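
Since a 4-bit notebook is linked above, here is a hedged sketch of 4-bit loading with bitsandbytes, assuming the checkpoint works with the generic image-text-to-text interface in recent transformers; the prompt and image are illustrative.

```python
# Hedged 4-bit loading sketch, assuming the model works with the generic
# image-text-to-text auto classes; prompt and image are illustrative.
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor, BitsAndBytesConfig

repo = "prithivMLmods/Gliese-OCR-7B-Post2.0-final"
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
processor = AutoProcessor.from_pretrained(repo)
model = AutoModelForImageTextToText.from_pretrained(
    repo, quantization_config=bnb, device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "invoice.png"},
        {"type": "text", "text": "Extract the document content as markdown."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(out[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```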

> To learn more, visit the app page or the respective model page!
prithivMLmods posted an update 13 days ago
Here is the official Florence-2 Transformers-converted demo for the following vision models: florence-community/Florence-2-large, florence-community/Florence-2-large-ft, florence-community/Florence-2-base, and florence-community/Florence-2-base-ft. These models support tasks such as object detection, captioning, detailed captioning, more detailed captioning, dense region captioning, region proposal, OCR, and OCR with region. Try the official demo at the link below; a minimal task-prompt sketch follows the links.

> Space: prithivMLmods/florence2-vision-models
> Collection: prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0
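
A minimal task-prompt sketch, mirroring the original microsoft/Florence-2 usage; since these are Transformers-converted checkpoints, trust_remote_code should not be needed, but the auto class and post-processing helper are assumptions, so defer to the model cards.

```python
# Hedged sketch mirroring the original microsoft/Florence-2 task-prompt usage;
# auto class and post-processing helper are assumptions for the converted repos.
import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

repo = "florence-community/Florence-2-base"
processor = AutoProcessor.from_pretrained(repo)
model = AutoModelForImageTextToText.from_pretrained(repo, torch_dtype=torch.float16).to("cuda")

image = Image.open("street.jpg")  # illustrative input
task = "<OD>"  # also: <CAPTION>, <DETAILED_CAPTION>, <MORE_DETAILED_CAPTION>, <OCR>, <OCR_WITH_REGION>
inputs = processor(text=task, images=image, return_tensors="pt").to("cuda", torch.float16)
ids = model.generate(**inputs, max_new_tokens=1024, num_beams=3)
raw = processor.batch_decode(ids, skip_special_tokens=False)[0]
# Convert the raw string into task-specific output (boxes/labels for <OD>).
print(processor.post_process_generation(raw, task=task, image_size=image.size))
```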

> To learn more, visit the app page or the respective model page!
tomaarsen posted an update 18 days ago
🤗 Sentence Transformers is joining Hugging Face! 🤗 This formalizes the existing maintenance structure, as I've personally led the project for the past two years on behalf of Hugging Face! Details:

Today, the Ubiquitous Knowledge Processing (UKP) Lab is transferring the project to Hugging Face. Sentence Transformers will remain a community-driven, open-source project, with the same open-source license (Apache 2.0) as before. Contributions from researchers, developers, and enthusiasts are welcome and encouraged. The project will continue to prioritize transparency, collaboration, and broad accessibility.

Read our full announcement for more details and quotes from UKP and Hugging Face leadership: https://huggingface.co/blog/sentence-transformers-joins-hf

We see an increasing desire among companies to move from large LLM APIs to local models for better control and privacy, reflected in the library's growth: in just the last 30 days, Sentence Transformer models have been downloaded more than 270 million times, second only to transformers.

I would like to thank the UKP Lab, and especially Nils Reimers and Iryna Gurevych, both for their dedication to the project and for their trust in me, both now and two years ago. Back then, neither of you knew me well, yet you trusted me to take the project to new heights. That choice proved very valuable for the embedding & information retrieval community, and I think the choice of granting Hugging Face stewardship will be similarly successful.

I'm very excited about the future of the project, and for the world of embeddings and retrieval at large!
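
For anyone new to the library, the core API is unchanged by the transfer; a minimal example (the model choice is illustrative):

```python
# Minimal Sentence Transformers example; the model choice is illustrative.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
embeddings = model.encode([
    "Local embedding models offer better control and privacy.",
    "Sentence Transformers is joining Hugging Face.",
])
print(embeddings.shape)                          # (2, 384) for this model
print(model.similarity(embeddings, embeddings))  # pairwise cosine similarities
```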
merve posted an update 20 days ago
deepseek-ai/DeepSeek-OCR is out! 🔥 My take ⬇️
> pretty insane that it can parse and re-render charts in HTML
> it uses CLIP and SAM features concatenated, so better grounding (conceptual sketch below)
> very efficient vision-token-to-performance ratio
> covers 100 languages
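
To make the second point concrete, here is a purely conceptual sketch of concatenating two vision towers' patch features; the shapes and names are invented, and this is not DeepSeek-OCR's actual code.

```python
# Conceptual sketch only, not DeepSeek-OCR's actual code: patch features from a
# semantic (CLIP-style) tower and a fine-grained (SAM-style) tower are
# concatenated per patch, then projected into the LLM's embedding space.
import torch
import torch.nn as nn

class DualTowerFusion(nn.Module):
    def __init__(self, clip_dim=1024, sam_dim=768, llm_dim=2048):
        super().__init__()
        self.proj = nn.Linear(clip_dim + sam_dim, llm_dim)

    def forward(self, clip_feats, sam_feats):
        # clip_feats: (B, N, clip_dim); sam_feats: (B, N, sam_dim), N patches each
        fused = torch.cat([clip_feats, sam_feats], dim=-1)
        return self.proj(fused)  # (B, N, llm_dim) vision tokens for the decoder

fusion = DualTowerFusion()
tokens = fusion(torch.randn(1, 256, 1024), torch.randn(1, 256, 768))
print(tokens.shape)  # torch.Size([1, 256, 2048])
```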
prithivMLmods posted an update 24 days ago
Now you can try all the latest state-of-the-art multimodal vision-language models from the Qwen3-VL series on Hugging Face Spaces, including the 4B, 8B, and 30B (Instruct, 4B-Thinking) variants. I've also uploaded the weights for the Abliterated variants of these models, up to 30B parameters. Check out the Spaces and model links below, followed by a minimal inference sketch! 🤗🔥

✨ Qwen3-VL[4B,8B]: prithivMLmods/Qwen3-VL-Outpost
✨ Qwen3-VL-30B-A3B-Demo: prithivMLmods/Qwen3-VL-HF-Demo
✨ Collection: prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

Qwen3-VL Abliterated Model Collection [Version 1.0]

✨ Qwen3-VL-8B-Instruct-abliterated: prithivMLmods/Qwen3-VL-8B-Instruct-abliterated
✨ Qwen3-VL-4B-Instruct-abliterated: prithivMLmods/Qwen3-VL-4B-Instruct-abliterated
✨ Qwen3-VL-8B-Thinking-abliterated: prithivMLmods/Qwen3-VL-8B-Thinking-abliterated
✨ Qwen3-VL-4B-Thinking-abliterated: prithivMLmods/Qwen3-VL-4B-Thinking-abliterated
✨ Qwen3-VL-30B-A3B-Instruct-abliterated: prithivMLmods/Qwen3-VL-30B-A3B-Instruct-abliterated
✨ Qwen3-VL-30B-A3B-Thinking-abliterated: prithivMLmods/Qwen3-VL-30B-A3B-Thinking-abliterated

⚡ Collection: prithivMLmods/qwen3-vl-abliteration-oct-1625-68f0e3e567ef076594605fac

Note: This is version 1.0 of the abliteration of the Qwen3-VL series of models. It may perform sub-optimally in some cases; if you encounter any issues, please open a discussion.
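
A minimal inference sketch for one of the checkpoints above, assuming a transformers version recent enough to ship Qwen3-VL support; the image and prompt are illustrative.

```python
# Hedged sketch, assuming a transformers version with Qwen3-VL support;
# the image and prompt are illustrative.
from transformers import pipeline

pipe = pipeline(
    "image-text-to-text",
    model="prithivMLmods/Qwen3-VL-4B-Instruct-abliterated",
    device_map="auto",
    torch_dtype="auto",
)
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "url": "chart.png"},
        {"type": "text", "text": "Describe this chart."},
    ],
}]
out = pipe(text=messages, max_new_tokens=256)
print(out[0]["generated_text"][-1]["content"])
```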
prithivMLmods posted an update 26 days ago
Introducing Image-Guard-2.0, an experimental, lightweight vision-language encoder model with roughly 0.1B (<100M) parameters, trained on top of SigLIP2 (siglip2-base-patch16-224). Designed for multi-label image classification, the model functions as an image safety system, serving as an image guard or moderator across a wide range of categories, from anime to realistic imagery.

⚡ Blog article: https://huggingface.co/blog/prithivMLmods/image-guard-models

It also performs strict moderation and filtering of artificially synthesized content, demonstrating strong detection and handling of explicit images. Image-Guard-2.0 delivers robust performance in streamlined scenarios, ensuring reliable and effective classification across diverse visual inputs.
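
A hedged inference sketch: multi-label moderation typically means applying a sigmoid per label rather than a softmax. The repo id is inferred from the post title, and the generic auto classes are assumptions, so check the model card.

```python
# Hedged multi-label moderation sketch; repo id inferred from the post title,
# and the generic image-classification auto classes are assumptions.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

repo = "prithivMLmods/Image-Guard-2.0"  # assumed repo id
processor = AutoImageProcessor.from_pretrained(repo)
model = AutoModelForImageClassification.from_pretrained(repo)

image = Image.open("upload.jpg")  # illustrative input
with torch.no_grad():
    logits = model(**processor(images=image, return_tensors="pt")).logits
# Multi-label: sigmoid per category, since labels are not mutually exclusive.
probs = torch.sigmoid(logits)[0]
for idx, p in sorted(enumerate(probs.tolist()), key=lambda x: -x[1]):
    print(f"{model.config.id2label[idx]}: {p:.3f}")
```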
prithivMLmods posted an update 29 days ago
The demo of Qwen3-VL-30B-A3B-Instruct, the next-generation, powerful vision-language model in the Qwen series, delivers comprehensive upgrades across the board, including superior text understanding and generation, deeper visual perception and reasoning, extended context length, enhanced spatial and video dynamics comprehension, and stronger agent interaction capabilities. 🤗🔥

⚡ Space / App: prithivMLmods/Qwen3-VL-HF-Demo

The model's demo supports a wide range of tasks, including:
Image Inference, Video Inference, PDF Inference, Image Captioning (VLA), and GIF Inference.

⚡ Collection: prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

Thanks for granting the blazing-fast Zero GPU access, @merve 🙏

⚡ Other Pages

> GitHub: https://github.com/prithivsakthiur/qwen3-vl-hf-demo
> Multimodal VLMs July'25: prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027
> VL caption (< Sep 15 '25): prithivMLmods/vl-caption-sep-15-25-68c7f6d737985c63c13e2391
> Multimodal VLMs Aug'25: prithivMLmods/multimodal-vlms-aug25-68a56aac39fe8084f3c168bd

To learn more, visit the app page or the respective model page!
BramVanroy posted an update about 1 month ago
What are currently the best multilingual models with at most 72B parameters? Are Llama 3.3 70B and Qwen 2.5 72B still king?
prithivMLmods posted an update about 1 month ago
Introducing the next-gen version of DeepCaption-VLA (v2.0), an advanced multimodal model based on Qwen2.5-VL, specialized for Image Captioning and Vision Language Attribution (VLA). This enhanced release focuses on generating precise, attribute-rich captions that capture visual properties, object attributes, and scene details across diverse image types and aspect ratios. Version 2.0 introduces significant improvements in multilingual inference, delivering higher captioning quality and attribution accuracy in languages including Chinese (zh), Thai (th), and more.

🤗 DeepCaption-VLA (v2.0): prithivMLmods/DeepCaption-VLA-V2.0-7B
🫱 Collection: prithivMLmods/vlm-20-oct-0825-68e606aa6e3993be8a3b1d51
⭐ GitHub (notebook): https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/DeepCaption_VLA_V2_0_7B/DeepCaption_VLA_V2_0_7Bipynb.ipynb

Other Pages ⚡

➥ Multimodal VLMs July'25: prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027
➥ VL caption (< Sep 15 '25): prithivMLmods/vl-caption-sep-15-25-68c7f6d737985c63c13e2391
➥ Multimodal VLMs Aug'25: prithivMLmods/multimodal-vlms-aug25-68a56aac39fe8084f3c168bd

To learn more, visit the app page or the respective model page!
christopher posted an update about 1 month ago
Something very cool is cooking at Lichess
prithivMLmods posted an update about 1 month ago
I've built the new Image Studio with the Gemini image generation models for the following tasks: imagen-4.0-fast-generate-001 for Image Generation (Text-to-Image) and Multi-Image Editing (Image-to-Image), and Draw-to-Image powered by gemini-2.5-flash-image (aka Nano Banana).

⭐ Gemini-Image-Studio: prithivMLmods/Gemini-Image-Studio (latest)
🤞 Old app: prithivMLmods/Nano-Banana-AIO
🥊 GitHub: https://github.com/prithivsakthiur/gemini-image-studio-hf

To proceed, you need to add your Gemini API key. Your API key is stored only for the duration of your session and will be lost when you reload or exit the page. It will not be shared or exposed anywhere.
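
A hedged sketch of the text-to-image path with the google-genai SDK, using the Imagen model id named above; the Space's actual internals may differ.

```python
# Hedged sketch with the google-genai SDK, using the Imagen model id named in
# the post; the Space's actual internals may differ.
from io import BytesIO
from PIL import Image
from google import genai

client = genai.Client(api_key="YOUR_GEMINI_API_KEY")  # key stays in your session

response = client.models.generate_images(
    model="imagen-4.0-fast-generate-001",
    prompt="a watercolor lighthouse at dusk",  # illustrative prompt
)
Image.open(BytesIO(response.generated_images[0].image.image_bytes)).save("out.png")
```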
prithivMLmods posted an update about 1 month ago
Try the Hugging Face Space demo for Logics-MLLM/Logics-Parsing, the latest multimodal VLM from the Logics Team at Alibaba Group. It enables end-to-end document parsing with precise content extraction in markdown format, and it also generates a clean HTML representation of the document while preserving its logical structure. 🤗🔥

Additionally, I've integrated one of my recent works, prithivMLmods/Gliese-OCR-7B-Post1.0, which also excels at document comprehension.

⭐ Space / App: prithivMLmods/VLM-Parsing
📄 Technical Report by the Logics Team, Alibaba Group: Logics-Parsing Technical Report (2509.19760)
🖖 MM: VLM-Parsing: prithivMLmods/mm-vlm-parsing-68e33e52bfb9ae60b50602dc
⚡ Collections: prithivMLmods/multimodal-implementations-67c9982ea04b39f0608badb0

Other Pages:

➔ Multimodal VLMs July'25: prithivMLmods/multimodal-vlms-until-july25-688312e6b840e1e156f13027
➔ Multimodal VLMs Aug'25: prithivMLmods/multimodal-vlms-aug25-68a56aac39fe8084f3c168bd
➔ VL caption (< Sep 15 '25): prithivMLmods/vl-caption-sep-15-25-68c7f6d737985c63c13e2391

To learn more, visit the app page or the respective model page!