---
title: README
emoji: π’
colorFrom: yellow
colorTo: green
sdk: static
pinned: false
---
- Data-Juicer is a one-stop system to process text and multimodal data for and with foundation models (typically LLMs).
- You can try it in the playground with a managed JupyterLab.
- More details can be found on our homepage or in the documentation; a minimal processing-recipe sketch is shown right after this list.
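To make the workflow concrete, here is a minimal sketch of a Data-Juicer recipe: a YAML config that points to a dataset and lists the operators to apply in order. The operator names and global keys follow the public OP list and config examples in the Data-Juicer documentation, but the file name, paths, and thresholds below are illustrative placeholders; check the docs for the current operator set and arguments.

```yaml
# demo-recipe.yaml -- a minimal, illustrative Data-Juicer recipe (hypothetical values)
# Global arguments: input dataset, parallelism, and output path.
project_name: 'demo-recipe'
dataset_path: './data/demo-dataset.jsonl'      # path to the raw dataset
np: 4                                           # number of worker processes
export_path: './outputs/demo-processed.jsonl'   # where the processed data is written

# Process schedule: an ordered list of operators and their arguments.
process:
  - whitespace_normalization_mapper:            # normalize unusual whitespace characters
  - language_id_score_filter:                   # keep samples confidently identified as English
      lang: 'en'
      min_score: 0.8
  - text_length_filter:                         # drop extremely short or long texts
      min_len: 10
      max_len: 20000
  - document_deduplicator:                      # remove exact duplicates (case-insensitive)
      lowercase: true
```

With such a recipe in place, the documented `dj-process` entry point (e.g., `dj-process --config demo-recipe.yaml`) runs the operators in order and exports the result to `export_path`; `dj-analyze` can be run first to profile the raw data and pick sensible thresholds.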
News
- 🎉 [2025-09-19] Our work Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models has been accepted as a NeurIPS'25 Spotlight (top 3.1% of all submissions)!
- 🎉 [2025-09-19] Our two works on data mixture/selection/synthesis, Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data and MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning?, have been accepted by NeurIPS'25!
- 🛠️ [2025-06-04] How to process feedback data in the "era of experience"? We propose Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of LLMs, which leverages Data-Juicer for its data pipelines tailored for RFT scenarios.
- 🎉 [2025-06-04] Our Data-Model Co-development Survey has been accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)! Welcome to explore and contribute to the awesome-list.
- 🎉 [2025-06-04] We introduce DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?, a synthetic benchmark revealing notable performance drops on long prompts despite large models' proficiency with short descriptions.
- 🎉 [2025-05-06] Our work Data-Juicer Sandbox has been accepted as an ICML'25 Spotlight (top 2.6% of all submissions)!
- 💡 [2025-03-13] We propose MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning?, a new data synthesis method that enables large models to self-synthesize high-quality, low-variance data for efficient fine-tuning (e.g., a 16% gain on MathVision using only 400 samples).
- 🤝 [2025-02-28] DJ has been integrated into Ray's official Ecosystem and Example Gallery. In addition, our DJ 2.0 patch for the streaming JSON reader has been officially integrated into Apache Arrow.
- 🎉 [2025-02-27] Our work on contrastive data synthesis, ImgDiff, has been accepted by CVPR'25!
- 💡 [2025-02-05] We propose a new data selection method, Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data. It is theoretically informed and, by treating diversity as a reward, achieves better overall performance across 7 benchmarks when post-training SOTA LLMs.