AutoChunker: Structured text chunking and its evaluation

Arihant Jain; Purav Aggarwal; Anoop S V K K Saladi

2025

Text chunking is fundamental to modern retrieval-augmented systems, yet existing methods often struggle to maintain semantic coherence, both within and across chunks, while dealing with document structure and noise. We present AutoChunker, a bottom-up approach for text chunking that combines document structure awareness with noise elimination. AutoChunker leverages language models to identify and segregate logical units of information (chunks) while preserving document hierarchy through a tree-based representation. To evaluate the chunking operator, we introduce a comprehensive evaluation framework based on five core tenets: noise reduction, completeness, context coherence, task relevance, and retrieval performance. Experimental results on Support and Wikipedia articles demonstrate that AutoChunker significantly outperforms existing methods, reducing noise while improving chunk completeness compared to state-of-the-art baselines. When integrated with an online product support system, our approach led to improvements in retrieval performance and customer return rates. Our work not only advances the state of text chunking but also provides a standardized framework for evaluating chunking strategies, addressing a critical gap in the field.
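
The abstract does not spell out the algorithm, so the following is only a toy illustration of what bottom-up chunking over a document tree could look like. It is not the authors' implementation. The `Node` class, the `chunk_bottom_up` function, the `max_words` budget, and the `is_noise` predicate are all hypothetical names introduced here; in particular, `is_noise` is a simple stand-in for the language-model judgment the abstract mentions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical document-tree node: a heading, its body text, and sub-sections."""
    title: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def chunk_bottom_up(
    node: Node,
    is_noise: Callable[[str], bool],  # assumption: stand-in for an LM-based noise check
    max_words: int = 50,              # assumption: a simple size budget per chunk
    path: tuple = (),
) -> List[dict]:
    """Return chunks for the subtree rooted at `node`.

    Children are chunked first. If the node's own text plus all of its
    children's chunks fit within `max_words`, they are merged into a single
    chunk labelled with the node's heading path; otherwise the children's
    chunks are kept separate.
    """
    here = path + (node.title,)
    # Drop the node's own text if the predicate flags it as noise.
    own = "" if is_noise(node.text) else node.text.strip()

    child_chunks: List[dict] = []
    for child in node.children:
        child_chunks.extend(chunk_bottom_up(child, is_noise, max_words, here))

    merged = " ".join(t for t in [own] + [c["text"] for c in child_chunks] if t)
    if len(merged.split()) <= max_words:
        # Small enough: collapse the whole subtree into one chunk (or nothing).
        return [{"path": here, "text": merged}] if merged else []

    # Too large: keep this node's text and each child chunk as separate chunks.
    own_chunk = [{"path": here, "text": own}] if own else []
    return own_chunk + child_chunks
```

In this sketch each chunk keeps the heading path from the root, which is one simple way to carry document hierarchy along with the text. A real system would replace both the word budget and the noise predicate with model-driven decisions.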
