AI & ML interests

None defined yet.

Recent Activity

mrfakename posted an update 13 days ago
Trained a model for emotion-controllable TTS, based on MiMo-Audio, on LAION's dataset.

Still very early, and it does have an issue with hallucinating, but results seem pretty good so far given how early it is in the training run.

Will probably kick off a new run later with some settings tweaked.

Put up a demo here: mrfakename/EmoAct-MiMo

(Turn 🔊 on to hear audio samples)
tomaarsen posted an update 18 days ago
🤗 Sentence Transformers is joining Hugging Face! 🤗 This formalizes the existing maintenance structure, as I've personally led the project for the past two years on behalf of Hugging Face! Details:

Today, the Ubiquitous Knowledge Processing (UKP) Lab is transferring the project to Hugging Face. Sentence Transformers will remain a community-driven, open-source project, with the same open-source license (Apache 2.0) as before. Contributions from researchers, developers, and enthusiasts are welcome and encouraged. The project will continue to prioritize transparency, collaboration, and broad accessibility.

Read our full announcement for more details and quotes from UKP and Hugging Face leadership: https://huggingface.co/blog/sentence-transformers-joins-hf

We see an increasing wish from companies to move from large LLM APIs to local models for better control and privacy, reflected in the library's growth: in just the last 30 days, Sentence Transformer models have been downloaded >270 million times, second only to transformers.

I would like to thank the UKP Lab, and especially Nils Reimers and Iryna Gurevych, for their dedication to the project and for their trust in me, both now and two years ago. Back then, neither of you knew me well, yet you trusted me to take the project to new heights. That choice ended up being very valuable for the embedding & Information Retrieval community, and I think the choice of granting Hugging Face stewardship will be similarly successful.

I'm very excited about the future of the project, and for the world of embeddings and retrieval at large!
multimodalart posted an update 24 days ago
Want to iterate on a Hugging Face Space with an LLM?

Now you can easily convert an entire HF repo (Model, Dataset, or Space) to a text file and feed it to a language model!

multimodalart/repo2txt
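
For context, a minimal sketch of the same idea using huggingface_hub. This is not the Space's actual implementation, and the file-extension filter is an assumption:

```python
# Hedged sketch: flatten a Hugging Face repo into one text blob for an LLM.
# Not the Space's actual code; the extension filter is an assumption.
from huggingface_hub import hf_hub_download, list_repo_files

def repo_to_txt(repo_id: str, repo_type: str = "model") -> str:
    parts = []
    for path in list_repo_files(repo_id, repo_type=repo_type):
        # Keep readable sources and docs, skip binary artifacts.
        if path.endswith((".md", ".py", ".json", ".txt", ".yaml", ".yml")):
            local = hf_hub_download(repo_id, filename=path, repo_type=repo_type)
            with open(local, encoding="utf-8") as f:
                parts.append(f"### FILE: {path}\n{f.read()}")
    return "\n\n".join(parts)

text = repo_to_txt("multimodalart/repo2txt", repo_type="space")
```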
christopher posted an update about 1 month ago
Something very cool is cooking at Lichess
tomaarsen posted an update 2 months ago
ModernBERT goes MULTILINGUAL! In one of the most requested releases I've seen, Johns Hopkins University's CLSP has trained state-of-the-art massively multilingual encoders using the ModernBERT architecture: mmBERT.

Model details:
- 2 model sizes:
  - jhu-clsp/mmBERT-small
  - jhu-clsp/mmBERT-base
- Uses the ModernBERT architecture, but with the Gemma2 multilingual tokenizer (so: flash attention, alternating global/local attention, unpadding/sequence packing, etc.)
- Maximum sequence length of 8192 tokens, on the high end for encoders
- Trained on 1833 languages using DCLM, FineWeb2, and many more sources
- 3 training phases: 2.3T tokens pretraining on 60 languages, 600B tokens mid-training on 110 languages, and 100B tokens decay training on all 1833 languages.
- Both models are MIT Licensed, and the full datasets and intermediary checkpoints are also publicly released

Evaluation details:
- Very competitive with ModernBERT at equivalent sizes on English (GLUE, MTEB v2 English after finetuning)
- Consistently outperforms equivalently sized models on all Multilingual tasks (XTREME, classification, MTEB v2 Multilingual after finetuning)
- In short: beats commonly used multilingual base models like mDistilBERT, XLM-R (multilingual RoBERTa), multilingual MiniLM, etc.
- Additionally: the ModernBERT-based mmBERT is much faster than the alternatives due to its architectural benefits. Easily up to 2x throughput in common scenarios.

Check out the full blogpost with more details. It's super dense & gets straight to the point: https://huggingface.co/blog/mmbert

Based on these results, mmBERT should be the new go-to multilingual base encoder at 300M parameters and below. Do note that the mmBERT models are "base" models, i.e. they're currently only trained to perform mask filling; they'll need to be finetuned for downstream tasks like semantic search, classification, clustering, etc.
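
As a quick illustration, a minimal mask-filling sketch via the transformers pipeline, assuming the checkpoints load through the standard fill-mask pipeline:

```python
# Minimal sketch: mask filling with mmBERT-base via transformers.
from transformers import pipeline

fill = pipeline("fill-mask", model="jhu-clsp/mmBERT-base")
mask = fill.tokenizer.mask_token  # avoids hard-coding the tokenizer's mask string
print(fill(f"Paris is the {mask} of France."))
```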
LPX55 posted an update 3 months ago
Hey all, just a friendly PSA for those using our POC space, made to showcase the various novel ways @qwen-llm's Image Editing model can be used:

- We do not collect or store any data from user inferences. BUT (pun not intended) there's been a minor logic error in how we store session data for custom prompting in our presets feature.
- While we can't deny that we all like nice buttocks as natural human beings, for the time being please reset the prompts to the default state after use, as a courtesy to the next user.
- As long as you are staying within the law and HF's terms of service, we do not judge curious minds and the inevitable occasional ... urges driven by human hormones, which our future AGI overlords will never have.

The session storage logic error is completely on me and will be fixed tomorrow, after I get some sleep; I am severely sleep-deprived (unrelated reasons) and my brain is pretty much mush. But thank you for the chuckles, as it was admittedly a pretty funny way to be made aware of my mistake.

To the anon that left the custom prompts in this state, I hope the "buttocks" were satisfactory. On a more serious note, please do not mistake this light-hearted note for complacency or disregard for obvious lines that should not be crossed. Buttocks, I suppose, are okay. But let's be responsible and keep everything ethical and legal. As long as common sense is used (i.e. you are not harming anyone or breaking any TOS or laws), enjoy all the AI-generated buttocks your heart desires.

The attached image is what everyone currently sees when a user tries out the preset batch prompting feature. Though, as mentioned, we generally like a nice butt too, we are clarifying to fellow users who may have come across this that... we are not into butts of Russian nuns. Butt we don't judge. The original preset prompts can be seen in presets.py, but the possibilities are endless.
LPX55 posted an update 3 months ago
Qwen's latest Image Edit model has been implemented with lightx2v's LoRA for lightning-fast 8-step inference. Still a WIP, so YMMV.

LPX55/Qwen-Image-Edit_Fast-Presets
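
For reference, a hedged sketch of the general pattern in diffusers; the LoRA repo id and generation settings here are assumptions, not necessarily what the Space uses:

```python
# Hedged sketch: Qwen-Image-Edit with a Lightning LoRA for few-step inference.
# The LoRA repo id and settings are assumptions; check the Space for its config.
import torch
from diffusers import QwenImageEditPipeline
from diffusers.utils import load_image

pipe = QwenImageEditPipeline.from_pretrained(
    "Qwen/Qwen-Image-Edit", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("lightx2v/Qwen-Image-Lightning")  # assumed repo id

image = load_image("input.png")
edited = pipe(image=image, prompt="make it golden hour lighting",
              num_inference_steps=8).images[0]
edited.save("edited.png")
```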

tomaarsen posted an update 3 months ago
😎 I just published Sentence Transformers v5.1.0, and it's a big one: 2x-3x speedups for SparseEncoder models via the ONNX and/or OpenVINO backends, easier distillation data preparation with hard negatives mining, and more:

1️⃣ Faster ONNX and OpenVINO backends for SparseEncoder models
Usage is as simple as backend="onnx" or backend="openvino" when initializing a SparseEncoder to get started, but I also included utility functions for optimization, dynamic quantization, and static quantization, plus benchmarks.
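
For example, a minimal sketch of the backend switch (model id illustrative):

```python
# Minimal sketch: SparseEncoder with the ONNX backend (model id illustrative).
from sentence_transformers import SparseEncoder

model = SparseEncoder("naver/splade-cocondenser-ensembledistil", backend="onnx")
embeddings = model.encode(["ONNX can speed up sparse embedding inference."])
```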

2️⃣ New n-tuple-scores output format from mine_hard_negatives
This new output format is immediately compatible with MarginMSELoss and SparseMarginMSELoss for training SentenceTransformer, CrossEncoder, and SparseEncoder models.
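
A hedged sketch of that mining flow (dataset and model ids are illustrative; column names follow the dataset's own schema):

```python
# Hedged sketch: mine hard negatives with scores for MarginMSE-style training.
# Dataset/model ids are illustrative.
from datasets import load_dataset
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import mine_hard_negatives

dataset = load_dataset("sentence-transformers/natural-questions", split="train[:1000]")
model = SentenceTransformer("all-MiniLM-L6-v2")
tuples = mine_hard_negatives(
    dataset,
    model,
    num_negatives=5,
    output_format="n-tuple-scores",  # the new format from this release
)
```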

3️⃣ Gathering across devices
When doing multi-GPU training using a loss that has in-batch negatives (e.g. MultipleNegativesRankingLoss), you can now set gather_across_devices=True to load in-batch negatives from the other devices too! Essentially a free lunch, with pretty big impact potential in my evals.
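
A minimal sketch (model id illustrative):

```python
# Minimal sketch: gather in-batch negatives across GPUs during training.
from sentence_transformers import SentenceTransformer
from sentence_transformers.losses import MultipleNegativesRankingLoss

model = SentenceTransformer("all-MiniLM-L6-v2")
loss = MultipleNegativesRankingLoss(model, gather_across_devices=True)
```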

4️⃣ Trackio support
If you also upgrade transformers and install trackio with pip install trackio, your experiments will automatically be tracked locally with trackio. Just open up localhost and have a look at your losses/evals: no logins, no metric uploading.

5️⃣ MTEB Documentation
We've added some documentation on evaluating SentenceTransformer models properly with MTEB. It's rudimentary as the documentation on the MTEB side is already great, but it should get you started.

Plus many more smaller features & fixes (crash fixes, compatibility with datasets v4, FIPS compatibility, etc.).

See the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/tag/v5.1.0

Big thanks to all of the contributors for helping with the release; many of the features in this release were proposed by others. I have a big list of potential future features that I'd love to add.
jasoncorkill posted an update 4 months ago
"Why did the bee get married?"

"Because he found his honey!"

This was the "funniest" of the 10'000 jokes we generated with LLMs, with 68% of respondents rating it as "funny".

Original jokes are particularly hard for LLMs, as jokes are very nuanced and a lot of context is needed to understand whether something is "funny", something that can only reliably be measured using humans.

LLMs are not equally good at generating jokes in every language. The generated English jokes turned out to be way funnier than the Japanese ones: on average, 46% of English-speaking voters found the generated jokes funny. The same statistic for other languages:

Vietnamese: 44%
Portuguese: 40%
Arabic: 37%
Japanese: 28%

There is not much variance in generation quality among models for any fixed language, but Claude Sonnet 4 slightly outperforms the others in Vietnamese, Arabic, and Japanese, and Gemini 2.5 Flash in Portuguese and English.

We have released the 1 million (!) native-speaker ratings and the 10'000 jokes as a dataset for anyone to use:
Rapidata/multilingual-llm-jokes-4o-claude-gemini
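
A minimal sketch for loading it (the split name is an assumption):

```python
# Minimal sketch: load the released jokes + ratings dataset (split assumed).
from datasets import load_dataset

jokes = load_dataset("Rapidata/multilingual-llm-jokes-4o-claude-gemini", split="train")
print(jokes[0])
```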
chansung posted an update 4 months ago
YAML engineering is becoming more important than ever, from infra provisioning to model training (recipes).

As a first step, I built a simple editor for @dstackai, and I will share the live endpoint this week. Let me know what you think about this approach.

If people find this approach useful, I am going to do the same thing for LLM training recipes for popular frameworks such as Hugging Face open-r1, Axolotl, and so on. Let me hear your thoughts.
tomaarsen posted an update 4 months ago
‼️Sentence Transformers v5.0 is out! The biggest update yet introduces Sparse Embedding models, improvements to the encode methods, a Router module for asymmetric models & much more. Sparse + Dense = 🔥 hybrid search performance! Details:

1️⃣ Sparse Encoder Models
Brand new support for sparse embedding models that generate high-dimensional embeddings (30,000+ dims) where <1% are non-zero:

- Full SPLADE, Inference-free SPLADE, and CSR architecture support
- 4 new modules, 12 new losses, 9 new evaluators
- Integration with @elastic-co, @opensearch-project, @NAVER LABS Europe, @qdrant, @IBM, etc.
- Decode interpretable embeddings to understand token importance (see the sketch below)
- Hybrid search integration to get the best of both worlds
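
A minimal sketch of encoding with a SparseEncoder and decoding back to weighted tokens (model id illustrative):

```python
# Minimal sketch: sparse encoding and decoding token importance.
from sentence_transformers import SparseEncoder

model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
embeddings = model.encode(["The weather is lovely today."])
# decode() maps the non-zero dimensions back to (token, weight) pairs.
print(model.decode(embeddings[0], top_k=10))
```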

2️⃣ Enhanced Encode Methods & Multi-Processing
- New encode_query & encode_document methods that automatically use predefined prompts
- No more manual pool management: just pass a device list directly to encode()
- Much cleaner and easier to use than the old multi-process approach (sketch below)
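
A minimal sketch of both changes (model id and device list illustrative):

```python
# Minimal sketch: prompt-aware encoding plus multi-GPU encode without pools.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
query_emb = model.encode_query("What is the capital of France?")
doc_embs = model.encode_document(["Paris is the capital of France."])
# Multi-process/multi-GPU: pass the device list straight to encode().
embs = model.encode(["some text"] * 1_000, device=["cuda:0", "cuda:1"])
```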

3️⃣ Router Module & Advanced Training
- Router module with different processing paths for queries vs documents
- Custom learning rates for different parameter groups
- Composite loss logging - see individual loss components
- Perfect for two-tower architectures (see the sketch below)
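
A hedged sketch of a two-tower setup with the Router module; the module layout and base model are illustrative:

```python
# Hedged sketch: asymmetric query/document paths via the Router module.
from sentence_transformers import SentenceTransformer, models
from sentence_transformers.models import Router

query = models.Transformer("distilbert-base-uncased")
doc = models.Transformer("distilbert-base-uncased")
router = Router.for_query_document(
    query_modules=[query, models.Pooling(query.get_word_embedding_dimension())],
    document_modules=[doc, models.Pooling(doc.get_word_embedding_dimension())],
)
model = SentenceTransformer(modules=[router])
```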

4️⃣ Comprehensive Documentation & Training
- New Training Overview, Loss Overview, API Reference docs
- 6 new training example documentation pages
- Full integration examples with major search engines
- Extensive blogpost on training sparse models

Read the comprehensive blogpost about training sparse embedding models: https://huggingface.co/blog/train-sparse-encoder

See the full release notes here: https://github.com/UKPLab/sentence-transformers/releases/v5.0.0

What's next? We would love to hear from the community! What sparse encoder models would you like to see? And what new capabilities should Sentence Transformers handle - multimodal embeddings, late interaction models, or something else? Your feedback shapes our roadmap!