Appropriateness Filtering of Claims for Corporate RAG Knowledge Base Storage

Fall 2025 · CSCI 5541 NLP · University of Minnesota

Team: NetWatch

A demonstration of a methodology for improving enterprise RAG quality through pre-embedding filtering of claim-level content drawn from informal communications such as email or Slack.

Alex Berg (ber00221@umn.edu)
Zephaniah Johnson (joh15514@umn.edu)
Alex Slinger (sling031@umn.edu)
Sunder Subramanian (subra287@umn.edu)



Abstract

Retrieval-Augmented Generation (RAG) quality depends on what gets embedded. Informal business channels (email/Slack) often mix valuable facts with personal, toxic, speculative, or sarcastic content. We propose a pre-embedding filtering framework that decomposes messages into claim-level units and scores each with a modular mixture-of-experts (MoE) of fine-tuned RoBERTa-large classifiers (relevance, tone/sarcasm, confidentiality/PII, toxicity, speculation/opinion, inconsistency).

Our demo shows a pipeline that downloads emails received at an address, converts each email into specific claims, and then scores the claims on several attributes to decide automatically which claims are kept and which are dropped; retained claims keep their scores for downstream weighting. We will compare pre-filtered RAG against vanilla RAG and LLM-prompted filtering, studying the compounding effects of layered filters on QA quality and safety within an enterprise domain.


Teaser Figure

High-level NetWatch flow: ingestion → claim extraction → modular filtering → embedding/storage → retrieval & QA.

[Figure: project pipeline diagram]

Notes

Classifier scores are stored as metadata for retrieval-time re-weighting, allowing ambiguous claims to be down-ranked instead of hard-dropped.
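
A minimal sketch of that re-weighting idea, assuming each retrieved match carries per-head scores (0 = clearly inappropriate, 1 = clearly appropriate) in its metadata; the head names and exponent weights below are illustrative, not the deployed configuration.

```python
def reweight(matches, weights=None):
    """Down-rank ambiguous claims instead of dropping them: scale the vector
    similarity by the classifier scores stored as metadata at ingestion time."""
    weights = weights or {"relevance": 1.0, "tone_sarcasm": 0.5, "speculation": 0.5}

    def adjusted(match):
        penalty = 1.0
        for head, exponent in weights.items():
            penalty *= match["metadata"].get(head, 1.0) ** exponent  # missing score = no penalty
        return match["score"] * penalty

    return sorted(matches, key=adjusted, reverse=True)
```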


Introduction / Background / Motivation

Problem. Enterprises rely on RAG over large internal corpora, but unstructured email/chat blends professional and personal content. Embedding inappropriate or unreliable text degrades retrieval and raises compliance risk.

Gap. Prior work emphasizes post-retrieval techniques (graph fusion, reciprocal rank fusion (RRF)); little work evaluates source-level filtering before embedding, when full context (headers/threads) is still available.

Hypothesis. Appropriateness filtering at ingestion improves QA relevance and reduces PII/leak risk versus post-hoc ranking alone. Ambiguous claims (e.g., sarcasm intertwined with facts) are retained with metadata rather than blindly dropped.


Approach

  1. Ingestion. Gmail for the demo (IMAP via imaplib) plus ENRON emails for testing (from the Berkeley annotated archive). Each message is parsed into text, headers, and thread context.
  2. Claim decomposition. A claim creation, decomposition, and verification module, adapted from Microsoft’s “Claimify” methodology, splits messages into atomic claims; ambiguous spans are preserved with their context.
  3. MoE filtering. The current prototype uses an LLM pipeline (“Claim 7-step CoT + 2-step ReAct classifier.ipynb”) implementing per-claim decision heads (Relevance, PII/Confidentiality, Tone/Sarcasm, Toxicity, Speculation/Opinion, Inconsistency). Fine-tuned RoBERTa-large heads are planned; LLM classifiers are the present default.
  4. Storage & retrieval. Pinecone index netwatch-claims, with embeddings from OpenAI text-embedding-3-large (3072 dims). Retrieval uses ANN search; semantic RRF ranking is under construction (a condensed pipeline sketch follows this list).
  5. QA agent. LangChain + LangGraph tool calls (lookup_in_rag) wired to OpenAI gpt-5-mini. DSPy is used for more complex prompt/program structuring.
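
A condensed sketch of steps 1, 3, and 4 under stated assumptions: Gmail app-password IMAP access, a separate extraction/classification stage that yields per-claim score dictionaries, and the index/embedding names quoted above. The helper names, credentials handling, and toxicity threshold are illustrative, not the team's exact implementation.

```python
import email
import imaplib
import os
from email.policy import default

from openai import OpenAI
from pinecone import Pinecone

oa = OpenAI()  # reads OPENAI_API_KEY from the environment
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("netwatch-claims")

def fetch_inbox_texts(user: str, app_password: str, n: int = 10) -> list[str]:
    """Step 1: pull the n most recent messages over IMAP and keep plain-text bodies."""
    imap = imaplib.IMAP4_SSL("imap.gmail.com")
    imap.login(user, app_password)
    imap.select("INBOX")
    _, data = imap.search(None, "ALL")
    texts = []
    for uid in data[0].split()[-n:]:
        _, msg_data = imap.fetch(uid, "(RFC822)")
        msg = email.message_from_bytes(msg_data[0][1], policy=default)
        body = msg.get_body(preferencelist=("plain",))
        if body is not None:
            texts.append(body.get_content())
    imap.logout()
    return texts

def store_claims(claims: list[str], scores: list[dict], source_id: str) -> None:
    """Steps 3-4: hard-gate only clear violations, embed the rest, and upsert them
    with per-head scores kept as metadata for retrieval-time re-weighting."""
    resp = oa.embeddings.create(model="text-embedding-3-large", input=claims)  # 3072-dim
    vectors = []
    for i, (claim, score, item) in enumerate(zip(claims, scores, resp.data)):
        if score.get("toxicity", 0.0) > 0.9:  # illustrative threshold for a hard gate
            continue
        vectors.append({
            "id": f"{source_id}-{i}",
            "values": item.embedding,
            "metadata": {"text": claim, **score},
        })
    if vectors:
        index.upsert(vectors=vectors)
```

At query time, the same metadata feeds the re-weighting sketch shown in the Notes above, so ambiguous claims are down-ranked rather than lost.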

Baselines. (A) Vanilla RAG (no pre-filter). (B) RAG + LLM re-rank/filters (zero-/few-shot). (C) Optional: post-filter RRF only.

Novelty. We evaluate ingestion-time, claim-level filtering in an informal-communications domain where appropriate and inappropriate content co-exist within the same messages, and run end-to-end ablations to quantify compounding effects.

Overall Implementation Snapshot


Claim Extraction Module

Our current claim creation prompt architecture is a multi-step method: selection, where candidate claims are first curated by paraphrasing the email; disambiguation, where the claims are augmented with as much context as possible from the email; and decomposition, where the claims are broken down into independent, atomic claims.
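
One way to express that selection → disambiguation → decomposition chain is with DSPy, which is already in the stack per the Approach section. This is a minimal sketch, not the notebook's actual prompts: the signature names, field descriptions, and the model string (taken from the QA-agent description) are illustrative.

```python
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-5-mini"))  # model name as cited in the Approach section

class SelectClaims(dspy.Signature):
    """Selection: paraphrase the email into candidate verifiable claims."""
    email_text: str = dspy.InputField()
    candidate_claims: list[str] = dspy.OutputField(desc="verifiable statements found in the email")

class DisambiguateClaim(dspy.Signature):
    """Disambiguation: rewrite a claim so it is interpretable without the email."""
    email_text: str = dspy.InputField()
    claim: str = dspy.InputField()
    contextualized_claim: str = dspy.OutputField(desc="claim with referents, dates, and names resolved")

class DecomposeClaim(dspy.Signature):
    """Decomposition: split a contextualized claim into independent atomic claims."""
    contextualized_claim: str = dspy.InputField()
    atomic_claims: list[str] = dspy.OutputField()

select = dspy.ChainOfThought(SelectClaims)
disambiguate = dspy.ChainOfThought(DisambiguateClaim)
decompose = dspy.ChainOfThought(DecomposeClaim)

def extract_claims(email_text: str) -> list[str]:
    """Run the three stages in sequence and return the atomic claims."""
    atomic: list[str] = []
    for claim in select(email_text=email_text).candidate_claims:
        ctx = disambiguate(email_text=email_text, claim=claim).contextualized_claim
        atomic.extend(decompose(contextualized_claim=ctx).atomic_claims)
    return atomic
```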


Data

Primary. ~1,700 ENRON emails with tone/topic labels (email-level), sourced from the Berkeley annotated archive. For claim-level training/eval we are also considering constructing a synthetic composite set by inserting professional facts/QA pairs (from MeetingBank-QA-Summary) into non-professional ENRON emails, yielding mixed-context messages with groundable QA.
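
A minimal sketch of that composite construction, assuming lists of Enron email bodies and MeetingBank-style QA records are already loaded; the "fact"/"question"/"answer" field names are assumptions about the record layout, not the dataset's schema.

```python
import random

def make_composite(enron_emails: list[str], qa_pairs: list[dict], seed: int = 0) -> list[dict]:
    """Insert a groundable professional fact into a non-professional email, keeping
    the QA pair so retrieval/QA can later be scored against a known answer."""
    rng = random.Random(seed)
    composites = []
    for body in enron_emails:
        qa = rng.choice(qa_pairs)                 # e.g. {"fact": ..., "question": ..., "answer": ...}
        paragraphs = body.split("\n\n")
        insert_at = rng.randrange(len(paragraphs) + 1)
        paragraphs.insert(insert_at, qa["fact"])  # professional fact embedded mid-email
        composites.append({
            "email": "\n\n".join(paragraphs),
            "question": qa["question"],
            "answer": qa["answer"],
            "gold_claim": qa["fact"],             # claim-level supervision target
        })
    return composites
```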

Claim extraction eval. For claim extraction, we ultimately hope to benchmark against Claimify-style datasets to ensure accurate splitting independent of the specific LLM prompting.


Preliminary Results & Evaluation

Qualitative results (initial testing)

Based on initial testing we observed the following trends:

Live Demo (local auth): the Gradio chat is protected with the credentials teamNetwatch / 5541FinalProject. A share link will be available during the mentor meeting.
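
A minimal sketch of the demo wiring, assuming a chat callback that wraps the LangGraph QA agent; only the auth tuple and the share link come from the description above, the rest is illustrative.

```python
import gradio as gr

def answer_fn(message, history):
    # Placeholder: the real demo routes this through the LangGraph agent's lookup_in_rag tool.
    return f"(stub) would answer: {message}"

demo = gr.ChatInterface(fn=answer_fn, title="NetWatch RAG QA")
demo.launch(auth=("teamNetwatch", "5541FinalProject"), share=True)  # share link for the mentor meeting
```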

Two tracks. (1) Classifier gating quality on claims. (2) Retrieval/QA utility after pre-filtering. We drop ROUGE-L (not appropriate for this task) and use metrics aligned to accept/reject behavior and retrieval ranking quality.

Classifier Metrics (per-head, claim level)

| Classifier (current) | Source | Accept Acc ↑ | Reject Acc ↑ | Balanced Acc ↑ | Precision (accept) ↑ | Precision (reject) ↑ | AUPRC / ROC-AUC ↑ | Notes |
|---|---|---|---|---|---|---|---|---|
| Relevance | LLM CoT+ReAct | TBD | TBD | TBD | TBD | TBD | TBD | Fine-tune planned (RoBERTa-large) |
| PII / Confidentiality | LLM CoT+ReAct | TBD | TBD | TBD | TBD | TBD | TBD | Redaction candidates emitted |
| Tone / Sarcasm | LLM CoT+ReAct | TBD | TBD | TBD | TBD | TBD | TBD | Down-rank if ambiguous |
| Toxicity | LLM CoT+ReAct | TBD | TBD | TBD | TBD | TBD | TBD | Gate hard above threshold |
| Speculation / Opinion | LLM CoT+ReAct | TBD | TBD | TBD | TBD | TBD | TBD | Store with flag vs. drop |
| Inconsistency | LLM CoT+ReAct | TBD | TBD | TBD | TBD | TBD | TBD | Cross-claim check (planned) |
Table A. Per-classifier gating metrics focused on accept/reject behavior.
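
A minimal sketch of how the per-head gating columns could be computed with scikit-learn, assuming binary labels (1 = accept, 0 = reject) and an accept probability per claim; reading "Accept/Reject Acc" as per-class recall is our assumption.

```python
from sklearn.metrics import (average_precision_score, balanced_accuracy_score,
                             precision_score, recall_score, roc_auc_score)

def gating_metrics(y_true, y_pred, y_score):
    """Accept/reject metrics for one classifier head.
    y_true, y_pred: 1 = accept, 0 = reject; y_score: predicted probability of accept."""
    return {
        "accept_acc": recall_score(y_true, y_pred, pos_label=1),
        "reject_acc": recall_score(y_true, y_pred, pos_label=0),
        "balanced_acc": balanced_accuracy_score(y_true, y_pred),
        "precision_accept": precision_score(y_true, y_pred, pos_label=1),
        "precision_reject": precision_score(y_true, y_pred, pos_label=0),
        "auprc": average_precision_score(y_true, y_score),
        "roc_auc": roc_auc_score(y_true, y_score),
    }
```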

Retrieval/QA Metrics (system level)

| Setting | nDCG@10 ↑ | P@10 ↑ | PII Leak ↓ | Faithfulness ↑ | Notes |
|---|---|---|---|---|---|
| Vanilla RAG | TBD | TBD | TBD | TBD | No filtering |
| Pre-filtered RAG (ours) | TBD | TBD | TBD | TBD | Claim-level MoE; metadata re-weighting |
| LLM-filtered RAG | TBD | TBD | TBD | TBD | Zero-/few-shot filters |
Table B. Retrieval/QA metrics (no ROUGE-L); populate with midterm numbers.
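
Minimal reference implementations for two of the system-level columns, assuming graded relevance judgments per query and a list of injected PII strings for the leak check; P@10 and faithfulness would come from standard IR and faithfulness tooling.

```python
import math

def ndcg_at_k(relevances, k=10):
    """nDCG@k for one query, given graded relevance of the ranked results."""
    def dcg(rels):
        return sum(r / math.log2(i + 2) for i, r in enumerate(rels[:k]))
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

def pii_leak_rate(answers, pii_strings):
    """Fraction of generated answers that surface any injected PII string."""
    leaks = sum(any(p in a for p in pii_strings) for a in answers)
    return leaks / len(answers) if answers else 0.0
```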

Status (midterm):


Plan & Risks

Risks. Email-level labels → create synthetic claim-level supervision; class imbalance → weighted sampling/focal loss; privacy eval → synthetic PII injections; ambiguity → keep with metadata; engineering → unify index dimension (avoid 768 vs 3072 mismatch), replace placeholder RNG scoring with RRF, guard optional Gemini code paths.
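
For the RRF replacement mentioned above, a minimal sketch of reciprocal rank fusion over multiple ranked candidate lists (e.g., the dense ANN order and a metadata re-weighted order); k = 60 is the conventional constant, not a tuned value.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: combine several ranked lists of claim ids.
    rankings: list of lists, each ordered best-first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```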


Conclusion & Future Work

We introduce ingestion-time appropriateness filtering for corporate RAG: claim-level MoE classifiers decide what to store and how to weight retrieval. Next, we will broaden datasets (Avocado), expand filters (consistency checking against KB), and evaluate redaction policies vs. hard drops.