Semantic-Aware Chunking Strategies for RAG Knowledge-Base Question Answering
Improving retrieval relevance and answer quality by optimizing how documents are chunked
We build an experimental RAG knowledge-base system and analyze how different chunking strategies affect three key dimensions (retrieval quality, answer accuracy, and response latency), revealing how the choice of chunking strategy shapes end-to-end RAG performance.
Core idea: we propose a chunking method based on Semantic Boundary Detection, which uses shifts in the semantic similarity between adjacent sentences to identify topic boundaries and split text along natural semantic lines. The method is evaluated against two conventional baselines, fixed-size and recursive splitting, in a controlled comparison.
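For reference, the fixed-size baseline mentioned above is the standard sliding-window splitter (the recursive baseline differs only in that it splits on a hierarchy of separators first). A minimal sketch, with illustrative parameter values not taken from the experiments:

```python
def fixed_size_chunks(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into windows of chunk_size characters, each sharing
    `overlap` characters with the previous window (overlap < chunk_size)."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_size]
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(text):
            break
    return chunks
```

Because the window boundaries ignore content, a fixed-size splitter routinely cuts through the middle of a sentence or topic, which is exactly the failure mode the semantic chunker targets.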
```python
# ═══ Semantic boundary detection chunker — core contribution ═══
import re

import numpy as np
from sentence_transformers import SentenceTransformer


class SemanticBoundaryChunker:
    """Chunker that detects semantic boundaries from the cosine similarity
    of adjacent sentence embeddings: splitting where similarity drops
    sharply yields natural, topic-aligned chunks."""

    def __init__(self, model_name="paraphrase-multilingual-MiniLM-L12-v2",
                 percentile=25, min_chunk=100, max_chunk=800):
        self.encoder = SentenceTransformer(model_name)
        self.percentile = percentile  # similarity percentile used as the split threshold
        self.min_chunk = min_chunk    # merge segments shorter than this (characters)
        self.max_chunk = max_chunk    # upper bound on merged chunk length (characters)

    def split(self, text: str) -> list[str]:
        # Step 1: sentence segmentation
        sentences = self._split_sentences(text)
        if len(sentences) < 2:
            return sentences
        # Step 2: sentence embeddings (unit-normalized)
        embeddings = self.encoder.encode(sentences, normalize_embeddings=True)
        # Step 3: cosine similarity between adjacent sentences
        # (row-wise dot product; avoids building the full n x n matrix)
        similarities = np.sum(embeddings[:-1] * embeddings[1:], axis=1)
        # Step 4: dynamic threshold θ = percentile(sim); split where similarity dips below it
        threshold = np.percentile(similarities, self.percentile)
        boundaries = np.where(similarities < threshold)[0] + 1
        # Step 5: cut at the boundaries, then adaptively merge short segments
        return self._merge_chunks(sentences, boundaries)

    def _split_sentences(self, text: str) -> list[str]:
        # Minimal splitter on Chinese/English sentence-ending punctuation.
        parts = re.split(r"(?<=[。!?.!?])\s*", text)
        return [p for p in parts if p.strip()]

    def _merge_chunks(self, sentences, boundaries) -> list[str]:
        # Cut at the detected boundaries, then fold segments shorter than
        # min_chunk into their neighbor as long as max_chunk is respected.
        segments, prev = [], 0
        for b in list(boundaries) + [len(sentences)]:
            segments.append("".join(sentences[prev:b]))
            prev = b
        merged = []
        for seg in segments:
            if merged and len(merged[-1]) < self.min_chunk \
                    and len(merged[-1]) + len(seg) <= self.max_chunk:
                merged[-1] += seg
            else:
                merged.append(seg)
        return merged
```
| Metric | Fixed-size | Recursive | Semantic (Ours) | vs Fixed | vs Recursive |
|---|---|---|---|---|---|
| Precision@5 | 0.624 | 0.713 | 0.837 | +34.1% | +17.4% |
| Recall@5 | 0.581 | 0.672 | 0.794 | +36.7% | +18.2% |
| MRR | 0.543 | 0.635 | 0.762 | +40.3% | +20.0% |
| F1 Score | 0.602 | 0.692 | 0.815 | +35.4% | +17.8% |
| Answer Accuracy | 0.651 | 0.724 | 0.853 | +31.0% | +17.8% |
| Avg Latency (ms) | 118 | 125 | 156 | +32.2% | +24.8% |

Note: in the latency row the percentage deltas denote added latency (higher is worse), not improvement; the semantic chunker trades extra per-query time for the retrieval-quality gains above.
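The retrieval metrics in the table are the standard definitions; a minimal sketch of how Precision@k, Recall@k, and MRR are computed over ranked retrieval results (function names are illustrative, not from the evaluation code):

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the top-k retrieved chunks that are relevant."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / k

def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of all relevant chunks that appear in the top k."""
    hits = sum(1 for doc in retrieved[:k] if doc in relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(all_retrieved: list[list[str]], all_relevant: list[set[str]]) -> float:
    """Mean reciprocal rank of the first relevant chunk per query."""
    total = 0.0
    for retrieved, relevant in zip(all_retrieved, all_relevant):
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(all_retrieved)
```

For example, if a query retrieves ["a", "b", "c", "d", "e"] and the relevant set is {"b", "e", "x"}, then Precision@5 = 2/5, Recall@5 = 2/3, and the reciprocal rank is 1/2 (first hit at rank 2).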