---
title: README
emoji: π’
colorFrom: yellow
colorTo: green
sdk: static
pinned: false
---
- Data-Juicer is a one-stop system to process text and multimodal data for and with foundation models (typically LLMs).
- You can try it in the playground with a managed JupyterLab.
- More details can be found on our homepage or in the documentation; a minimal processing-recipe sketch is shown right after this list.
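To make the workflow concrete, here is a minimal sketch of a Data-Juicer recipe: a YAML config that points to a dataset and lists the operators to apply in order. The operator names and global keys follow the public OP list and config examples in the Data-Juicer documentation, but the file name, paths, and thresholds below are illustrative placeholders; check the docs for the current operator set and arguments.

```yaml
# demo-recipe.yaml -- a minimal, illustrative Data-Juicer recipe (hypothetical values)
# Global arguments: input dataset, parallelism, and output path.
project_name: 'demo-recipe'
dataset_path: './data/demo-dataset.jsonl'      # path to the raw dataset
np: 4                                           # number of worker processes
export_path: './outputs/demo-processed.jsonl'   # where the processed data is written

# Process schedule: an ordered list of operators and their arguments.
process:
  - whitespace_normalization_mapper:            # normalize unusual whitespace characters
  - language_id_score_filter:                   # keep samples confidently identified as English
      lang: 'en'
      min_score: 0.8
  - text_length_filter:                         # drop extremely short or long texts
      min_len: 10
      max_len: 20000
  - document_deduplicator:                      # remove exact duplicates (case-insensitive)
      lowercase: true
```

With such a recipe in place, the documented `dj-process` entry point (e.g., `dj-process --config demo-recipe.yaml`) runs the operators in order and exports the result to `export_path`; `dj-analyze` can be run first to profile the raw data and pick sensible thresholds.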
News
- 🎉 [2025-09-19] Our work Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models has been accepted as a NeurIPS'25 Spotlight (top 3.1% of all submissions)!
- 🎉 [2025-09-19] Our two works on data mixture/selection/synthesis, Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data and MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning?, have been accepted by NeurIPS'25!
- 🛠️ [2025-06-04] How to process feedback data in the "era of experience"? We propose Trinity-RFT: A General-Purpose and Unified Framework for Reinforcement Fine-Tuning of LLMs, which leverages Data-Juicer for its data pipelines tailored for RFT scenarios.
- 🎉 [2025-06-04] Our Data-Model Co-development Survey has been accepted by IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)! Welcome to explore and contribute to the awesome-list.
- 🎉 [2025-06-04] We introduce DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?, a synthetic benchmark revealing notable performance drops on long prompts despite large models' proficiency with short descriptions.
- 🎉 [2025-05-06] Our work Data-Juicer Sandbox has been accepted as an ICML'25 Spotlight (top 2.6% of all submissions)!
- 💡 [2025-03-13] We propose MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning?, a new data synthesis method that enables large models to self-synthesize high-quality, low-variance data for efficient fine-tuning (e.g., a 16% gain on MathVision using only 400 samples).
- 🤝 [2025-02-28] DJ has been integrated into Ray's official Ecosystem and Example Gallery. In addition, our DJ 2.0 patch for the streaming JSON reader has been officially integrated into Apache Arrow.
- 🎉 [2025-02-27] Our work on contrastive data synthesis, ImgDiff, has been accepted by CVPR'25!
- 💡 [2025-02-05] We propose a new data selection method, Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data. It is theoretically informed and, by treating diversity as a reward, achieves better overall performance across 7 benchmarks when post-training SOTA LLMs.