Improve model card: Add pipeline tag, library name, paper abstract, and detailed sections
#1
by
nielsr
HF Staff
- opened
README.md
CHANGED
|
@@ -1,14 +1,119 @@
|
|
| 1 |
---
|
| 2 |
license: apache-2.0
|
| 3 |
---
|
| 4 |
<h1 align="center">
|
| 5 |
MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems
|
| 6 |
</h1>
|
| 7 |
<p align="center">
|
| 8 |
-
<a href="https://
|
| 9 |
-
<img alt="
|
| 10 |
</a>
|
| 11 |
<a href="https://opensource.org/license/apache-2-0">
|
| 12 |
<img alt="Apache 2.0 License" src="https://img.shields.io/badge/License-Apache_2.0-green.svg?logo=apache">
|
| 13 |
</a>
|
| 14 |
-
|
| 1 |
---
|
| 2 |
license: apache-2.0
|
| 3 |
+
pipeline_tag: text-generation
|
| 4 |
+
library_name: transformers
|
| 5 |
---
|
| 6 |
+
|
| 7 |
<h1 align="center">
|
| 8 |
MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems
|
| 9 |
</h1>
|
| 10 |
<p align="center">
|
| 11 |
+
<a href="https://arxiv.org/abs/2510.14252">
|
| 12 |
+
<img alt="arXiv Paper" src="https://img.shields.io/badge/arXiv-Paper-b31b1b.svg?logo=arxiv">
|
| 13 |
+
</a>
|
| 14 |
+
<a href="https://huggingface.co/papers/2510.14252">
|
| 15 |
+
<img src="https://img.shields.io/badge/Huggingface-Paper-yellow?style=flat-square&logo=huggingface">
|
| 16 |
</a>
|
| 17 |
<a href="https://opensource.org/license/apache-2-0">
|
| 18 |
<img alt="Apache 2.0 License" src="https://img.shields.io/badge/License-Apache_2.0-green.svg?logo=apache">
|
| 19 |
</a>
|
| 20 |
+
<br>
|
| 21 |
+
<a href="https://huggingface.co/datasets/Robot2050/MoM">
|
| 22 |
+
<img src="https://img.shields.io/badge/Huggingface-Dataset-FF6F00?style=flat-square&logo=huggingface">
|
| 23 |
+
</a>
|
| 24 |
+
<a href="https://huggingface.co/Robot2050/MoM/tree/main/scenario_cot_ratio_1.5B">
|
| 25 |
+
<img src="https://img.shields.io/badge/Model-MemReader 1.5B-FF6F00?style=flat-square&logo=huggingface">
|
| 26 |
+
</a>
|
| 27 |
+
<a href="https://huggingface.co/Robot2050/MoM/tree/main/scenario_cot_ratio_3B">
|
| 28 |
+
<img src="https://img.shields.io/badge/Model-MemReader 3B-FF6F00?style=flat-square&logo=huggingface">
|
| 29 |
+
</a>
|
| 30 |
+
<a href="https://huggingface.co/Robot2050/MoM/tree/main/scenario_ratio_7B">
|
| 31 |
+
<img src="https://img.shields.io/badge/Model-MemReader 7B-FF6F00?style=flat-square&logo=huggingface">
|
| 32 |
+
</a>
|
| 33 |
+
</p>
|
| 34 |
+
|
| 35 |
+
This repository contains the model associated with the paper [MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems](https://huggingface.co/papers/2510.14252).
|
| 36 |
+
|
| 37 |
+
## Abstract
|
| 38 |
+
The traditional RAG paradigm, which typically engages in the comprehension of relevant text chunks in response to received queries, inherently restricts both the depth of knowledge internalization and reasoning capabilities. To address this limitation, our research transforms the text processing in RAG from passive chunking to proactive understanding, defining this process as document memory extraction with the objective of simulating human cognitive processes during reading. Building upon this, we propose the Mixtures of scenario-aware document Memories (MoM) framework, engineered to efficiently handle documents from multiple domains and train small language models (SLMs) to acquire the ability to proactively explore and construct document memories. The MoM initially instructs large language models (LLMs) to simulate domain experts in generating document logical outlines, thereby directing structured chunking and core content extraction. It employs a multi-path sampling and multi-perspective evaluation mechanism, specifically designing comprehensive metrics that represent chunk clarity and extraction completeness to select the optimal document memories. Additionally, to infuse deeper human-like reading abilities during the training of SLMs, we incorporate a reverse reasoning strategy, which deduces refined expert thinking paths from high-quality outcomes. Finally, leveraging diverse forms of content generated by MoM, we develop a three-layer document memory retrieval mechanism, which is grounded in our theoretical proof from the perspective of probabilistic modeling. Extensive experimental results across three distinct domains demonstrate that the MoM framework not only resolves text chunking challenges in existing RAG systems, providing LLMs with semantically complete document memories, but also paves the way for SLMs to achieve human-centric intelligent text processing.
|
| 39 |
+
|
| 40 |
+
### 🎯 Who Should Pay Attention to Our Work?
|
| 41 |
+
|
| 42 |
+
This study proposes a framework aimed at overcoming the cognitive bottlenecks of traditional RAG systems. It should be a useful reference for researchers and engineers working to improve the depth and breadth of information processing in LLMs. In particular, professionals in the following fields may benefit from our work:
|
| 43 |
+
|
| 44 |
+
**Researchers in NLP and Information Retrieval**: The active memory extraction paradigm proposed in this paper challenges the traditional text processing workflow of "chunk first, then understand", providing a novel research perspective for fields such as document understanding, semantic segmentation, and knowledge representation.
|
| 45 |
+
|
| 46 |
+
**Developers of LLM Applications**: Our work directly addresses the core challenges faced by RAG systems in constructing knowledge-intensive applications, such as semantic incompleteness and logical fragmentation of text chunks. It offers a systematic approach to generating high-quality, structured document memories.
|
| 47 |
+
|
| 48 |
+
**Researchers in SLMs**: Facing the limitations of SLMs in complex cognitive tasks, we demonstrate, through the reverse construction strategy of the **C**hain reasoning **o**f **M**emory extraction (CoM), how to efficiently transfer the deep reasoning capabilities of LLMs to SLMs, opening up new pathways for building lightweight, high-performance intelligent systems.
|
| 49 |
+
|
| 50 |
+
**Scholars in the Interdisciplinary Field of Cognitive Science and AI**: The core of this study lies in simulating the cognitive processes of human experts by transforming unstructured text into hierarchical memories. This provides robust support for exploring human-like cognition, knowledge construction, and reasoning mechanisms in machines.
|
| 51 |
+
|
| 52 |
+
### ✨ Core Contributions
|
| 53 |
+
|
| 54 |
+
**Proposing Active Memory Extraction**: We advocate transforming text processing in RAG from passive text chunking to active memory extraction. By simulating domain experts, we first achieve a holistic and macroscopic understanding of documents and then construct structured document memories.
|
| 55 |
+
|
| 56 |
+
**Defining Structured Document Memories**: We formally define document memories as a triplet composed of a macroscopic logical outline, highly condensed core content, and semantically coherent atomic chunks.
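As a rough illustration of the triplet described above, the following Python sketch models one document memory as a small data class. The class name `DocumentMemory`, its field names, and the `is_complete` check are hypothetical and are not taken from the MoM codebase.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DocumentMemory:
    """Hypothetical container for one document memory triplet."""

    outline: List[str]  # macroscopic logical outline, e.g. section titles
    core_content: str  # highly condensed core content of the document
    atomic_chunks: List[str] = field(default_factory=list)  # coherent chunks

    def is_complete(self) -> bool:
        # Treat a memory as usable only if all three parts are non-empty.
        return (
            bool(self.outline)
            and bool(self.core_content.strip())
            and bool(self.atomic_chunks)
        )


memory = DocumentMemory(
    outline=["Background", "Method"],
    core_content="MoM extracts structured memories from documents.",
    atomic_chunks=[
        "Background: RAG relies on text chunks.",
        "Method: MoM builds document memories.",
    ],
)
print(memory.is_complete())
```

Keeping the three parts as separate fields, rather than one concatenated string, leaves each of them available for its own index later on.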
|
| 57 |
+
|
| 58 |
+
**Constructing the MoM Framework and CoM**: We design the MoM framework, which generates high-quality memories through a multi-path sampling and multi-dimensional evaluation mechanism. Furthermore, we employ a reverse reasoning strategy to construct the CoM, thereby endowing SLMs with complex cognitive capabilities.
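The selection step in multi-path sampling can be pictured as scoring several candidate memories and keeping the best one. The sketch below is only a toy version under stated assumptions: `select_best_memory`, `toy_clarity`, and `make_toy_completeness` are hypothetical names, and the two scoring functions are simple stand-ins, not the chunk clarity and extraction completeness metrics defined in the paper.

```python
from typing import Callable, List, Sequence

Candidate = List[str]  # one candidate memory, represented as a list of chunks


def select_best_memory(
    candidates: Sequence[Candidate],
    metrics: List[Callable[[Candidate], float]],
    weights: List[float],
) -> Candidate:
    """Return the candidate with the highest weighted sum of metric scores."""
    if not candidates:
        raise ValueError("need at least one candidate")
    if len(metrics) != len(weights):
        raise ValueError("expected one weight per metric")

    def score(candidate: Candidate) -> float:
        return sum(w * m(candidate) for m, w in zip(metrics, weights))

    return max(candidates, key=score)


def toy_clarity(chunks: Candidate) -> float:
    # Stand-in metric: share of chunks ending with sentence-final punctuation.
    return sum(c.rstrip().endswith((".", "!", "?")) for c in chunks) / len(chunks)


def make_toy_completeness(keywords: List[str]) -> Callable[[Candidate], float]:
    # Stand-in metric: share of expected keywords that appear in the chunks.
    def completeness(chunks: Candidate) -> float:
        text = " ".join(chunks).lower()
        return sum(k in text for k in keywords) / len(keywords)

    return completeness


path_a = ["RAG uses chunks", "MoM builds"]
path_b = ["RAG uses chunks.", "MoM builds memories."]
best = select_best_memory(
    [path_a, path_b],
    metrics=[toy_clarity, make_toy_completeness(["rag", "memories"])],
    weights=[0.5, 0.5],
)
print(best)
```

In a real pipeline each candidate would come from a separate sampled generation, and the metrics would be model-based rather than string heuristics.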
|
| 59 |
+
|
| 60 |
+
**Designing a Three-Layer Retrieval Mechanism and Providing Theoretical Proof**: We develop a three-layer document memory retrieval mechanism encompassing logical outlines, core content, and original text. From a probabilistic modeling perspective, we theoretically demonstrate that this strategy can more effectively reduce information loss and achieve more precise knowledge localization compared to fusing information before retrieval.
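To make the layer-wise idea concrete, the sketch below queries three separate collections (outlines, core content, original chunks) and returns the top hits of each layer, instead of fusing the layers into one text before retrieval. It is a minimal illustration, assuming bag-of-words cosine similarity in place of the embedding model and vector database a real deployment would use; all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def bow(text: str) -> Counter:
    # Bag-of-words vector: lowercase whitespace tokens with their counts.
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve_per_layer(
    query: str, layers: Dict[str, List[str]], top_k: int = 1
) -> Dict[str, List[Tuple[float, str]]]:
    """Score the query against each layer independently and keep top_k hits."""
    q = bow(query)
    results = {}
    for name, entries in layers.items():
        scored = sorted(((cosine(q, bow(e)), e) for e in entries), reverse=True)
        results[name] = scored[:top_k]
    return results


layers = {
    "outline": ["introduction to retrieval", "memory extraction method"],
    "core_content": [
        "the framework extracts document memory",
        "results on three domains",
    ],
    "chunks": [
        "we train small models to extract memory from text",
        "the weather was sunny",
    ],
}
hits = retrieve_per_layer("memory extraction", layers, top_k=1)
for layer_name, layer_hits in hits.items():
    print(layer_name, layer_hits[0][1])
```

Because each layer is ranked on its own, a strong match in the outline is not diluted by unrelated text in the chunks, which is the intuition behind retrieving before fusing.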
|
| 61 |
+
|
| 62 |
+
## 🛠️ Quick Start
|
| 63 |
+
|
| 64 |
+
For full code and detailed instructions, please refer to the [GitHub repository](https://github.com/MemTensor/MoM).
|
| 65 |
+
|
| 66 |
+
- Install the dependencies.
|
| 67 |
+
|
| 68 |
+
```bash
|
| 69 |
+
pip install -r requirements.txt
|
| 70 |
+
```
|
| 71 |
+
|
| 72 |
+
- Start the milvus-lite service (vector database)
|
| 73 |
+
|
| 74 |
+
```bash
|
| 75 |
+
milvus-server --data /Storage/path/of/the/database
|
| 76 |
+
```
|
| 77 |
+
|
| 78 |
+
- Download the models to their corresponding directories.
|
| 79 |
+
- Adjust the configuration settings to suit your needs.
|
| 80 |
+
- Run `chunk_*.py` and `mom_*.py` to chunk the domain documents.
|
| 81 |
+
|
| 82 |
+
```bash
|
| 83 |
+
CUDA_VISIBLE_DEVICES=0 nohup python chunk_gpt.py >> multifiled/qwen3_14B_set.log 2>&1 &
|
| 84 |
+
```
|
| 85 |
+
|
| 86 |
+
- Then run `quick_start.py` and `retrieval.py` to perform retrieval and question answering.
|
| 87 |
+
|
| 88 |
+
```bash
|
| 89 |
+
CUDA_VISIBLE_DEVICES=1 nohup python quick_start.py \
|
| 90 |
+
--docs_path 'crud_qwen3_14B_set.json' \
|
| 91 |
+
--collection_name 'crud_qwen3_14B_set' \
|
| 92 |
+
--retrieve_top_k 8 \
|
| 93 |
+
--task 'quest_answer' \
|
| 94 |
+
--construct_index \
|
| 95 |
+
>> log/mom_crud_qwen3_14B_set.log 2>&1 &
|
| 96 |
+
|
| 97 |
+
CUDA_VISIBLE_DEVICES=2 nohup python retrieval.py \
|
| 98 |
+
--data_path 'evaldata/multifieldqa_zh.json' \
|
| 99 |
+
--save_file 'eval/mom_multifieldqa_zh_qwen3_14B_set.json' \
|
| 100 |
+
--docs_path 'multifieldqa_zh_qwen3_14B_set.json' \
|
| 101 |
+
--collection_name 'multifieldqa_zh_qwen3_14B_set' \
|
| 102 |
+
--retrieve_top_k 8 \
|
| 103 |
+
--construct_index \
|
| 104 |
+
>> log/mom_multifieldqa_zh_huagong_qwen3_14B_set.log 2>&1 &
|
| 105 |
+
```
|
| 106 |
+
|
| 107 |
+
- Open and run `chunk.ipynb` to assess the quality of the results produced by the different chunking strategies.
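For the model download step in the list above, one possible approach is to fetch a single MemReader subfolder with `huggingface_hub`. The sketch below is an assumption about how this could be done, not the repository's own procedure: the subfolder names are taken from the badge links in this model card, while the helper names and the target directory are hypothetical.

```python
from typing import List

REPO_ID = "Robot2050/MoM"

# Subfolder names as they appear in the badge links of this model card.
MEMREADER_SUBFOLDERS = {
    "1.5B": "scenario_cot_ratio_1.5B",
    "3B": "scenario_cot_ratio_3B",
    "7B": "scenario_ratio_7B",
}


def allow_patterns_for(size: str) -> List[str]:
    # Restrict the download to the files of one model size.
    return [f"{MEMREADER_SUBFOLDERS[size]}/*"]


def download_memreader(size: str, local_dir: str) -> str:
    """Download one MemReader subfolder and return the local path."""
    # Imported lazily so the helpers above work without the package installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=allow_patterns_for(size),
        local_dir=local_dir,
    )


print(allow_patterns_for("3B"))
```

Calling `download_memreader("1.5B", "models")` would place the files under a hypothetical `models/` directory; the directory the scripts actually expect should be checked against the GitHub repository's configuration.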
|
| 108 |
+
|
| 109 |
+
### 📊 Results
|
| 110 |
+
|
| 111 |
+
We conduct extensive experiments on three QA datasets from different domains, including news and finance.
|
| 112 |
+
|
| 113 |
+
**Performance Across Domains**: Our MemReader performs strongly on pure-text QA tasks.
|
| 114 |
+
|
| 115 |
+
**Effectiveness of Evaluation Metrics**: The memory evaluation metrics we propose effectively assess the quality of memory chunks, providing a reliable basis for automatically screening high-quality document memories.
|
| 116 |
+
|
| 117 |
+
**Information Supportiveness of Retrieved Content**: The results indicate that the memories extracted and organized by MoM can provide more comprehensive information for downstream tasks.
|
| 118 |
+
|
| 119 |
+

|