Improve model card: Add metadata, paper details, and enhanced structure
#2, opened by nielsr (HF Staff)

README.md CHANGED
---
license: apache-2.0
library_name: transformers
pipeline_tag: text-generation
---

<h1 align="center">
MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems
</h1>
<p align="center">
  <a href="https://arxiv.org/abs/2510.14252">
    <img alt="arXiv Paper" src="https://img.shields.io/badge/arXiv-Paper-b31b1b.svg?logo=arxiv">
  </a>
  <a href="https://huggingface.co/papers/2510.14252">
    <img alt="Hugging Face Paper" src="https://img.shields.io/badge/Huggingface-Paper-yellow?style=flat-square&logo=huggingface">
  </a>
  <a href="https://opensource.org/license/apache-2-0">
    <img alt="Apache 2.0 License" src="https://img.shields.io/badge/License-Apache_2.0-green.svg?logo=apache">
  </a>
  <br>
  <a href="https://huggingface.co/datasets/Robot2050/MoM">
    <img alt="Hugging Face Dataset" src="https://img.shields.io/badge/Huggingface-Dataset-FF6F00?style=flat-square&logo=huggingface">
  </a>
  <a href="https://huggingface.co/Robot2050/MoM/tree/main/scenario_cot_ratio_1.5B">
    <img alt="MemReader 1.5B" src="https://img.shields.io/badge/Model-MemReader_1.5B-FF6F00?style=flat-square&logo=huggingface">
  </a>
  <a href="https://huggingface.co/Robot2050/MoM/tree/main/scenario_cot_ratio_3B">
    <img alt="MemReader 3B" src="https://img.shields.io/badge/Model-MemReader_3B-FF6F00?style=flat-square&logo=huggingface">
  </a>
  <a href="https://huggingface.co/Robot2050/MoM/tree/main/scenario_ratio_7B">
    <img alt="MemReader 7B" src="https://img.shields.io/badge/Model-MemReader_7B-FF6F00?style=flat-square&logo=huggingface">
  </a>
</p>

This repository presents the **MoM** (Mixtures of Scenario-Aware Document Memories) framework, introduced in the paper [MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems](https://huggingface.co/papers/2510.14252).

## Abstract

The traditional RAG paradigm, which typically engages in the comprehension of relevant text chunks in response to received queries, inherently restricts both the depth of knowledge internalization and reasoning capabilities. To address this limitation, our research transforms the text processing in RAG from passive chunking to proactive understanding, defining this process as document memory extraction with the objective of simulating human cognitive processes during reading. Building upon this, we propose the Mixtures of scenario-aware document Memories (MoM) framework, engineered to efficiently handle documents from multiple domains and train small language models (SLMs) to acquire the ability to proactively explore and construct document memories. The MoM initially instructs large language models (LLMs) to simulate domain experts in generating document logical outlines, thereby directing structured chunking and core content extraction. It employs a multi-path sampling and multi-perspective evaluation mechanism, specifically designing comprehensive metrics that represent chunk clarity and extraction completeness to select the optimal document memories. Additionally, to infuse deeper human-like reading abilities during the training of SLMs, we incorporate a reverse reasoning strategy, which deduces refined expert thinking paths from high-quality outcomes. Finally, leveraging diverse forms of content generated by MoM, we develop a three-layer document memory retrieval mechanism, which is grounded in our theoretical proof from the perspective of probabilistic modeling. Extensive experimental results across three distinct domains demonstrate that the MoM framework not only resolves text chunking challenges in existing RAG systems, providing LLMs with semantically complete document memories, but also paves the way for SLMs to achieve human-centric intelligent text processing.

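The multi-path sampling and multi-perspective evaluation step described above can be illustrated with a toy sketch. This is not the paper's implementation: the two metrics below (`clarity` and `completeness`) are invented word-level proxies for the paper's chunk-clarity and extraction-completeness scores, and the candidate chunkings are made up for demonstration.

```python
def clarity(chunks):
    # Invented proxy for chunk clarity: reward chunks that end on
    # sentence boundaries rather than cutting mid-sentence.
    return sum(c.rstrip().endswith(".") for c in chunks) / len(chunks)

def completeness(chunks, document):
    # Invented proxy for extraction completeness: fraction of the
    # document's words preserved somewhere across the chunks.
    doc_words = set(document.split())
    kept = {w for c in chunks for w in c.split()}
    return len(doc_words & kept) / len(doc_words)

def select_best(candidates, document):
    # Multi-path sampling yields several candidate chunkings; a combined
    # score over both metrics selects the optimal document memory.
    return max(candidates, key=lambda ch: clarity(ch) + completeness(ch, document))

doc = "MoM builds document memories. It chunks text along a logical outline."
candidates = [
    ["MoM builds document memories.", "It chunks text along a logical outline."],
    ["MoM builds document", "memories. It chunks text along a logical"],
]
best = select_best(candidates, doc)
```

Here the first candidate wins on both axes: every chunk ends at a sentence boundary, and no document content is dropped.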
### 🎯 Who Should Pay Attention to Our Work?

This study proposes an innovative framework aimed at breaking through the cognitive bottlenecks of traditional RAG systems, and should interest researchers and engineers working to deepen and broaden information processing with LLMs. In particular, professionals in the following fields may benefit from our work:

**Researchers in NLP and information retrieval**: The active memory extraction paradigm proposed in this paper challenges the traditional "chunk first, then understand" text processing workflow, providing a novel research perspective for fields such as document understanding, semantic segmentation, and knowledge representation.

**Developers of LLM applications**: Our work directly addresses core challenges that RAG systems face in knowledge-intensive applications, such as semantic incompleteness and logical fragmentation of text chunks, and offers a systematic approach to generating high-quality, structured document memories.

**Researchers in SLMs**: Given the limitations of SLMs on complex cognitive tasks, we demonstrate, through the reverse construction strategy of the **C**hain reasoning **o**f **M**emory extraction (CoM), how to efficiently transfer the deep reasoning capabilities of LLMs to SLMs, opening new pathways for building lightweight, high-performance intelligent systems.

**Scholars at the intersection of cognitive science and AI**: The core of this study lies in simulating the cognitive processes of human experts by transforming unstructured text into hierarchical memories, providing robust support for exploring human-like cognition, knowledge construction, and reasoning mechanisms in machines.

## ✨ Core Contributions

**MoM** offers several key innovations:
* **Active memory extraction**: We advocate transforming text processing in RAG from passive text chunking to active memory extraction: by simulating domain experts, we first build a holistic, macroscopic understanding of a document and then construct structured document memories.
* **Structured document memories**: We formally define a document memory as a triplet composed of a macroscopic logical outline, highly condensed core content, and semantically coherent atomic chunks.
* **The MoM framework and CoM**: We design the MoM framework, which generates high-quality memories through a multi-path sampling and multi-dimensional evaluation mechanism, and employ a reverse reasoning strategy to construct the CoM, endowing SLMs with complex cognitive capabilities.
* **A three-layer retrieval mechanism with theoretical proof**: We develop a three-layer document memory retrieval mechanism encompassing logical outlines, core content, and original text, and demonstrate from a probabilistic modeling perspective that this strategy reduces information loss and localizes knowledge more precisely than fusing information before retrieval.

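The three-layer idea, scoring each memory's outline, core content, and atomic chunks separately instead of fusing them into one blob before retrieval, can be sketched with a toy example. Bag-of-words cosine similarity stands in for the real embedding model, and the two "document memories" are invented for illustration; this is not the paper's implementation.

```python
from collections import Counter
import math

def cosine(a, b):
    # Bag-of-words cosine similarity between two strings.
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

# A document memory as defined in the paper: a triplet of logical
# outline, condensed core content, and semantically coherent chunks.
memories = [
    {
        "outline": "heart anatomy chambers valves",
        "core": "the heart pumps blood through four chambers",
        "chunks": ["the left ventricle pumps oxygenated blood into the aorta"],
    },
    {
        "outline": "contract law offer acceptance",
        "core": "a contract requires offer acceptance and consideration",
        "chunks": ["an offer lapses after a reasonable time if not accepted"],
    },
]

def retrieve(query, memories):
    # Score every memory on all three layers and keep the best layer
    # score, so a strong match on any single layer surfaces the memory.
    scored = []
    for m in memories:
        layer_scores = [cosine(query, m["outline"]), cosine(query, m["core"])]
        layer_scores += [cosine(query, c) for c in m["chunks"]]
        scored.append((max(layer_scores), m))
    scored.sort(key=lambda s: s[0], reverse=True)
    return scored[0][1]

best = retrieve("which chamber pumps blood into the aorta", memories)
```

Because each layer is matched independently, a query phrased at chunk granularity still finds the right memory even when its outline shares no vocabulary with the query, which is the intuition behind the information-loss argument above.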
## 🛠️ Quick Start

For installation instructions and details on running the MoM framework, please refer to the [official GitHub repository](https://github.com/MemTensor/MoM). The repository provides scripts for setting up the environment, starting the Milvus-lite service, and running the text chunking, retrieval, and question-answering pipelines.

## Checkpoints

Models from the MoM family can be found on Hugging Face:
* [MemReader 1.5B](https://huggingface.co/Robot2050/MoM/tree/main/scenario_cot_ratio_1.5B)
* [MemReader 3B](https://huggingface.co/Robot2050/MoM/tree/main/scenario_cot_ratio_3B)
* [MemReader 7B](https://huggingface.co/Robot2050/MoM/tree/main/scenario_ratio_7B)

## Citation

If you find this work helpful or inspiring, please cite the paper:

```bibtex
@article{liu2025mom,
  title={MoM: Mixtures of Scenario-Aware Document Memories for Retrieval-Augmented Generation Systems},
  author={Liu, Dongyang and Zhao, Shitian and Zhuo, Le and Lin, Weifeng and Qiao, Yu and Li, Hongsheng and Gao, Peng},
  year={2025},
  eprint={2510.14252},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```