Appropriateness Filtering of Claims for Corporate RAG Knowledge Base Storage

Fall 2025 · CSCI 5541 NLP · University of Minnesota

Team: NetWatch

A demonstration of a methodology for improving enterprise RAG quality through pre-embedding filtering of claim-level content drawn from informal communications such as email or Slack.

Alex Berg (ber00221@umn.edu)
Zephaniah Johnson (joh15514@umn.edu)
Alex Slinger (sling031@umn.edu)
Sunder Subramanian (subra287@umn.edu)



Abstract

Retrieval-Augmented Generation (RAG) quality depends on what gets embedded. Informal business channels (email/Slack) often mix valuable facts with personal, toxic, speculative, or sarcastic content. We propose a pre-embedding filtering framework that decomposes messages into claim-level units and scores each with a modular mixture-of-experts (MoE) of fine-tuned RoBERTa-large classifiers (relevance, tone/sarcasm, confidentiality/PII, toxicity, speculation/opinion, inconsistency).

Our demo shows a pipeline that downloads emails received at an address, converts each email into specific claims, and then scores the claims on several attributes to decide automatically which claims are kept and which are dropped; retained claims keep their scores for downstream weighting. We will compare pre-filtered RAG against vanilla RAG and LLM-prompted filtering, studying the compounding effects of layered filters on QA quality and safety within an enterprise domain.


Teaser Figure

High-level NetWatch flow: ingestion → claim extraction → modular filtering → embedding/storage → retrieval & QA.

[Figure: project pipeline diagram]

Notes

Classifier scores are stored as metadata for retrieval-time re-weighting, allowing ambiguous claims to be down-ranked instead of hard-dropped.
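
A minimal sketch of that re-weighting idea, assuming each retrieved match carries per-head scores (0 = clearly inappropriate, 1 = clearly appropriate) in its metadata; the head names and exponent weights below are illustrative, not the deployed configuration.

```python
def reweight(matches, weights=None):
    """Down-rank ambiguous claims instead of dropping them: scale the vector
    similarity by the classifier scores stored as metadata at ingestion time."""
    weights = weights or {"relevance": 1.0, "tone_sarcasm": 0.5, "speculation": 0.5}

    def adjusted(match):
        penalty = 1.0
        for head, exponent in weights.items():
            penalty *= match["metadata"].get(head, 1.0) ** exponent  # missing score = no penalty
        return match["score"] * penalty

    return sorted(matches, key=adjusted, reverse=True)
```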


Introduction / Background / Motivation

Problem. Enterprises rely on RAG over large internal corpora, but unstructured email/chat blends professional and personal content. Embedding inappropriate or unreliable text degrades retrieval and raises compliance risk.

Gap. Prior work emphasizes post-retrieval techniques (graph fusion, reciprocal rank fusion (RRF)); little work evaluates source-level filtering before embedding, when full context (headers/threads) is still available.

Hypothesis. Appropriateness filtering at ingestion improves QA relevance and reduces PII/leak risk versus post-hoc ranking alone. Ambiguous claims (e.g., sarcasm intertwined with facts) are retained with metadata rather than blindly dropped.


Approach

  1. Ingestion. Gmail for the demo (IMAP via imaplib) plus ENRON emails for testing (from the Berkeley annotated archive). Each message is parsed into text, headers, and thread context.
  2. Claim decomposition. A claim creation, decomposition, and verification module, adapted from Microsoft’s “Claimify” methodology, splits messages into atomic claims; ambiguous spans are preserved with their context.
  3. MoE filtering. The current prototype uses an LLM pipeline (“Claim 7-step CoT + 2-step ReAct classifier.ipynb”) implementing per-claim decision heads (Relevance, PII/Confidentiality, Tone/Sarcasm, Toxicity, Speculation/Opinion, Inconsistency). Fine-tuned RoBERTa-large heads are planned; LLM classifiers are the present default.
  4. Storage & retrieval. Pinecone index netwatch-claims, with embeddings from OpenAI text-embedding-3-large (3072 dims). Retrieval uses ANN search; semantic RRF ranking is under construction (a condensed pipeline sketch follows this list).
  5. QA agent. LangChain + LangGraph tool calls (lookup_in_rag) wired to OpenAI gpt-5-mini. DSPy is used for more complex prompt/program structuring.
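
A condensed sketch of steps 1, 3, and 4 under stated assumptions: Gmail app-password IMAP access, a separate extraction/classification stage that yields per-claim score dictionaries, and the index/embedding names quoted above. The helper names, credentials handling, and toxicity threshold are illustrative, not the team's exact implementation.

```python
import email
import imaplib
import os
from email.policy import default

from openai import OpenAI
from pinecone import Pinecone

oa = OpenAI()  # reads OPENAI_API_KEY from the environment
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("netwatch-claims")

def fetch_inbox_texts(user: str, app_password: str, n: int = 10) -> list[str]:
    """Step 1: pull the n most recent messages over IMAP and keep plain-text bodies."""
    imap = imaplib.IMAP4_SSL("imap.gmail.com")
    imap.login(user, app_password)
    imap.select("INBOX")
    _, data = imap.search(None, "ALL")
    texts = []
    for uid in data[0].split()[-n:]:
        _, msg_data = imap.fetch(uid, "(RFC822)")
        msg = email.message_from_bytes(msg_data[0][1], policy=default)
        body = msg.get_body(preferencelist=("plain",))
        if body is not None:
            texts.append(body.get_content())
    imap.logout()
    return texts

def store_claims(claims: list[str], scores: list[dict], source_id: str) -> None:
    """Steps 3-4: hard-gate only clear violations, embed the rest, and upsert them
    with per-head scores kept as metadata for retrieval-time re-weighting."""
    resp = oa.embeddings.create(model="text-embedding-3-large", input=claims)  # 3072-dim
    vectors = []
    for i, (claim, score, item) in enumerate(zip(claims, scores, resp.data)):
        if score.get("toxicity", 0.0) > 0.9:  # illustrative threshold for a hard gate
            continue
        vectors.append({
            "id": f"{source_id}-{i}",
            "values": item.embedding,
            "metadata": {"text": claim, **score},
        })
    if vectors:
        index.upsert(vectors=vectors)
```

At query time, the same metadata feeds the re-weighting sketch shown in the Notes above, so ambiguous claims are down-ranked rather than lost.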

Baselines. (A) Vanilla RAG (no pre-filter). (B) RAG + LLM re-rank/filters (zero-/few-shot). (C) Optional: post-filter RRF only.

Novelty. We evaluate ingestion-time, claim-level filtering in an informal-communications domain where appropriate and inappropriate content co-exist within the same messages, and run end-to-end ablations to quantify compounding effects.

Overall Implementation Snapshot


Claim Extraction Module

Our current claim creation prompt architecture is a multi-step method: selection, where candidate claims are first curated by paraphrasing the email; disambiguation, where the claims are augmented with as much context as possible from the email; and decomposition, where the claims are broken down into independent, atomic claims.
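
One way to express that selection → disambiguation → decomposition chain is with DSPy, which is already in the stack per the Approach section. This is a minimal sketch, not the notebook's actual prompts: the signature names, field descriptions, and the model string (taken from the QA-agent description) are illustrative.

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-5-mini"))  # model name as cited in the Approach section

class SelectClaims(dspy.Signature):
    """Selection: paraphrase the email into candidate verifiable claims."""
    email_text: str = dspy.InputField()
    candidate_claims: list[str] = dspy.OutputField(desc="verifiable statements found in the email")

class DisambiguateClaim(dspy.Signature):
    """Disambiguation: rewrite a claim so it is interpretable without the email."""
    email_text: str = dspy.InputField()
    claim: str = dspy.InputField()
    contextualized_claim: str = dspy.OutputField(desc="claim with referents, dates, and names resolved")

class DecomposeClaim(dspy.Signature):
    """Decomposition: split a contextualized claim into independent atomic claims."""
    contextualized_claim: str = dspy.InputField()
    atomic_claims: list[str] = dspy.OutputField()

select = dspy.ChainOfThought(SelectClaims)
disambiguate = dspy.ChainOfThought(DisambiguateClaim)
decompose = dspy.ChainOfThought(DecomposeClaim)

def extract_claims(email_text: str) -> list[str]:
    """Run the three stages in sequence and return the atomic claims."""
    atomic: list[str] = []
    for claim in select(email_text=email_text).candidate_claims:
        ctx = disambiguate(email_text=email_text, claim=claim).contextualized_claim
        atomic.extend(decompose(contextualized_claim=ctx).atomic_claims)
    return atomic
```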


Data

Primary. ~1,700 ENRON emails with tone/topic labels (email-level), sourced from the Berkeley annotated archive. For claim-level training/eval we are also considering constructing a synthetic composite set by inserting professional facts/QA pairs (from MeetingBank-QA-Summary) into non-professional ENRON emails, yielding mixed-context messages with groundable QA.
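
A minimal sketch of that composite construction, assuming lists of Enron email bodies and MeetingBank-style QA records are already loaded; the "fact"/"question"/"answer" field names are assumptions about the record layout, not the dataset's schema.

```python
import random

def make_composite(enron_emails: list[str], qa_pairs: list[dict], seed: int = 0) -> list[dict]:
    """Insert a groundable professional fact into a non-professional email, keeping
    the QA pair so retrieval/QA can later be scored against a known answer."""
    rng = random.Random(seed)
    composites = []
    for body in enron_emails:
        qa = rng.choice(qa_pairs)                 # e.g. {"fact": ..., "question": ..., "answer": ...}
        paragraphs = body.split("\n\n")
        insert_at = rng.randrange(len(paragraphs) + 1)
        paragraphs.insert(insert_at, qa["fact"])  # professional fact embedded mid-email
        composites.append({
            "email": "\n\n".join(paragraphs),
            "question": qa["question"],
            "answer": qa["answer"],
            "gold_claim": qa["fact"],             # claim-level supervision target
        })
    return composites
```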

Claim extraction eval. For claim extraction, we ultimately hope to benchmark against Claimify-style datasets to ensure accurate splitting independent of the specific LLM prompting.


Preliminary Results & Evaluation

Qualitative results (initial testing)

Based on initial testing we observed the following trends:

Live Demo (local auth): the Gradio chat is protected with the credentials teamNetwatch / 5541FinalProject. A share link will be available during the mentor meeting.
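
A minimal sketch of the demo wiring, assuming a chat callback that wraps the LangGraph QA agent; only the auth tuple and the share link come from the description above, the rest is illustrative.

```python
import gradio as gr

def answer_fn(message, history):
    # Placeholder: the real demo routes this through the LangGraph agent's lookup_in_rag tool.
    return f"(stub) would answer: {message}"

demo = gr.ChatInterface(fn=answer_fn, title="NetWatch RAG QA")
demo.launch(auth=("teamNetwatch", "5541FinalProject"), share=True)  # share link for the mentor meeting
```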

Two tracks. (1) Classifier gating quality on claims. (2) Retrieval/QA utility after pre-filtering. We drop ROUGE-L (not appropriate for this task) and use metrics aligned to accept/reject behavior and retrieval ranking quality.

Classifier Metrics (per-head, claim level)

| Classifier (current) | Source | Accept Acc ↑ | Reject Acc ↑ | Balanced Acc ↑ | Precision (accept) ↑ | Precision (reject) ↑ | AUPRC / ROC-AUC ↑ | Notes |
|---|---|---|---|---|---|---|---|---|
| Relevance | LLM CoT+ReAct | TBD | TBD | TBD | TBD | TBD | TBD | Fine-tune planned (RoBERTa-large) |
| PII / Confidentiality | LLM CoT+ReAct | TBD | TBD | TBD | TBD | TBD | TBD | Redaction candidates emitted |
| Tone / Sarcasm | LLM CoT+ReAct | TBD | TBD | TBD | TBD | TBD | TBD | Down-rank if ambiguous |
| Toxicity | LLM CoT+ReAct | TBD | TBD | TBD | TBD | TBD | TBD | Gate hard above threshold |
| Speculation / Opinion | LLM CoT+ReAct | TBD | TBD | TBD | TBD | TBD | TBD | Store with flag vs. drop |
| Inconsistency | LLM CoT+ReAct | TBD | TBD | TBD | TBD | TBD | TBD | Cross-claim check (planned) |
Table A. Per-classifier gating metrics focused on accept/reject behavior.
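
A minimal sketch of how the per-head gating columns could be computed with scikit-learn, assuming binary labels (1 = accept, 0 = reject) and an accept probability per claim; reading "Accept/Reject Acc" as per-class recall is our assumption.

```python
from sklearn.metrics import (average_precision_score, balanced_accuracy_score,
                             precision_score, recall_score, roc_auc_score)

def gating_metrics(y_true, y_pred, y_score):
    """Accept/reject metrics for one classifier head.
    y_true, y_pred: 1 = accept, 0 = reject; y_score: predicted probability of accept."""
    return {
        "accept_acc": recall_score(y_true, y_pred, pos_label=1),
        "reject_acc": recall_score(y_true, y_pred, pos_label=0),
        "balanced_acc": balanced_accuracy_score(y_true, y_pred),
        "precision_accept": precision_score(y_true, y_pred, pos_label=1),
        "precision_reject": precision_score(y_true, y_pred, pos_label=0),
        "auprc": average_precision_score(y_true, y_score),
        "roc_auc": roc_auc_score(y_true, y_score),
    }
```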

Retrieval/QA Metrics (system level)

| Setting | nDCG@10 ↑ | P@10 ↑ | PII Leak ↓ | Faithfulness ↑ | Notes |
|---|---|---|---|---|---|
| Vanilla RAG | TBD | TBD | TBD | TBD | No filtering |
| Pre-filtered RAG (ours) | TBD | TBD | TBD | TBD | Claim-level MoE; metadata re-weighting |
| LLM-filtered RAG | TBD | TBD | TBD | TBD | Zero-/few-shot filters |
Table B. Retrieval/QA metrics (no ROUGE-L); populate with midterm numbers.
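
Minimal reference implementations for two of the system-level columns, assuming graded relevance judgments per query and a list of injected PII strings for the leak check; P@10 and faithfulness would come from standard IR and faithfulness tooling.

```python
import math

def ndcg_at_k(relevances, k=10):
    """nDCG@k for one query, given graded relevance of the ranked results."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def pii_leak_rate(answers, pii_strings):
    """Fraction of generated answers that surface any injected PII string."""
    leaks = sum(any(p in a for p in pii_strings) for a in answers)
    return leaks / len(answers) if answers else 0.0
```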

Status (midterm):


Plan & Risks

Risks. Email-level labels → create synthetic claim-level supervision; class imbalance → weighted sampling/focal loss; privacy eval → synthetic PII injections; ambiguity → keep with metadata; engineering → unify index dimension (avoid 768 vs 3072 mismatch), replace placeholder RNG scoring with RRF, guard optional Gemini code paths.
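
For the RRF replacement mentioned above, a minimal sketch of reciprocal rank fusion over multiple ranked candidate lists (e.g., the dense ANN order and a metadata re-weighted order); k = 60 is the conventional constant, not a tuned value.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: combine several ranked lists of claim ids.
    rankings: list of lists, each ordered best-first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```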


Conclusion & Future Work

We introduce ingestion-time appropriateness filtering for corporate RAG: claim-level MoE classifiers decide what to store and how to weight retrieval. Next, we will broaden datasets (Avocado), expand filters (consistency checking against KB), and evaluate redaction policies vs. hard drops.