Pretraining Context Compressor for Large Language Models with Embedding-Based Memory
- Yuhong Dai,
- Jianxun Lian,
- Yitian Huang,
- Wei Zhang,
- Mingyang Zhou,
- Mingqi Wu,
- Xing Xie,
- Hao Liao
ACL 2025
Efficient processing of long contexts in large language models (LLMs) is essential for real-world applications like retrieval-augmented generation and in-context learning, especially in resource-constrained environments such as edge computing. This paper explores embedding-based context compression to reduce inference costs while preserving the downstream LLM configuration. We propose a decoupled compressor-LLM framework, pretrained on text reconstruction and completion tasks, designed to effectively preserve essential contextual information within condensed embedding representations. Our extensive experiments investigate pretraining, model configurations, compression rates, efficiency across tasks, and adaptability to various LLMs. Results demonstrate that our approach outperforms competitive baselines in three domains and across eight datasets while being adaptable to different downstream LLMs. We find that thorough pretraining and carefully selected compression rates, such as 4x and 16x, enable a lightweight compressor to achieve a good balance between accuracy and speed. These findings underscore the potential of embedding-based compression to enhance LLM efficiency and motivate further research in this area.
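To make the core idea concrete, below is a minimal PyTorch sketch of embedding-based context compression. It is not the authors' implementation: the class name `ContextCompressor`, the pooling-based compression, and all hyperparameters are illustrative assumptions. A small trainable encoder maps a long context into a shorter sequence of memory embeddings (e.g., 4x fewer vectors), which a frozen downstream LLM would consume as soft-prompt inputs during pretraining on reconstruction/completion objectives.

```python
# Minimal sketch (not the paper's code): compress context tokens into a shorter
# sequence of embeddings projected into a downstream LLM's embedding space.
import torch
import torch.nn as nn

class ContextCompressor(nn.Module):
    def __init__(self, vocab_size=32000, d_model=512, d_llm=4096,
                 n_layers=4, n_heads=8, compression_rate=4):
        super().__init__()
        self.compression_rate = compression_rate
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        # Project pooled states into the (frozen) downstream LLM's embedding space.
        self.proj = nn.Linear(d_model, d_llm)

    def forward(self, context_ids):                # (batch, seq_len)
        h = self.encoder(self.embed(context_ids))  # (batch, seq_len, d_model)
        b, t, d = h.shape
        t = t - t % self.compression_rate          # drop any ragged tail
        # Mean-pool every `compression_rate` hidden states into one memory slot.
        mem = h[:, :t].reshape(b, t // self.compression_rate,
                               self.compression_rate, d).mean(dim=2)
        return self.proj(mem)                      # (batch, t / rate, d_llm)

# Usage: 512 context tokens become 128 memory embeddings at a 4x rate; these
# would be prepended to the downstream LLM's input embeddings.
compressor = ContextCompressor()
memory = compressor(torch.randint(0, 32000, (2, 512)))
print(memory.shape)  # torch.Size([2, 128, 4096])
```

This sketch illustrates the decoupled design described in the abstract: only the lightweight compressor is trained, while the downstream LLM stays unchanged, so the same compressed embeddings can in principle be adapted to different LLMs.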