Demonstration of a methodology for increasing enterprise RAG quality through pre-embedding filtering of claim-level content extracted from informal communications such as email or Slack.
Retrieval-Augmented Generation (RAG) quality depends on what gets embedded. Informal business channels (email/Slack) often mix valuable facts with personal, toxic, speculative, or sarcastic content. We propose a pre-embedding filtering framework that decomposes messages into claim-level units and scores each with a modular mixture-of-experts (MoE) of fine-tuned RoBERTa-large classifiers (relevance, tone/sarcasm, confidentiality/PII, toxicity, speculation/opinion, inconsistency).
Our demo implements a pipeline that downloads emails received at an address, converts each email into specific claims, and scores the claims on various attributes to automatically decide which claims are kept and which are dropped; retained claims keep their scores for downstream weighting. We will compare pre-filtered RAG against vanilla RAG and LLM-prompted filtering, studying the compounding effects of layered filters on QA quality and safety within an enterprise domain.
High-level NetWatch flow: ingestion → claim extraction → modular filtering → embedding/storage → retrieval & QA.
Classifier scores are stored as metadata for retrieval-time re-weighting, allowing ambiguous claims to be down-ranked instead of hard-dropped.
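As a concrete illustration of this re-weighting step, the sketch below adjusts ANN similarity scores with per-claim classifier scores stored as metadata. The field names (relevance, sarcasm, toxicity) and the linear penalty scheme are illustrative assumptions, not the project's final design.

```python
# Sketch of retrieval-time re-weighting using classifier scores stored as
# claim metadata. Score names and penalty weights are placeholders.
from typing import Any

def reweight_matches(matches: list[dict[str, Any]],
                     sarcasm_penalty: float = 0.5,
                     toxicity_penalty: float = 1.0) -> list[dict[str, Any]]:
    """Down-rank ambiguous claims instead of hard-dropping them.

    Each match is assumed to look like:
        {"id": ..., "score": <ANN similarity>,
         "metadata": {"relevance": 0.9, "sarcasm": 0.2, "toxicity": 0.0, ...}}
    """
    reweighted = []
    for m in matches:
        meta = m.get("metadata", {})
        penalty = (sarcasm_penalty * meta.get("sarcasm", 0.0)
                   + toxicity_penalty * meta.get("toxicity", 0.0))
        adjusted = m["score"] * meta.get("relevance", 1.0) * max(0.0, 1.0 - penalty)
        reweighted.append({**m, "adjusted_score": adjusted})
    return sorted(reweighted, key=lambda m: m["adjusted_score"], reverse=True)
```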
Problem. Enterprises rely on RAG over large internal corpora, but unstructured email/chat blends professional and personal content. Embedding inappropriate or unreliable text degrades retrieval and raises compliance risk.
Gap. Prior work emphasizes post-retrieval techniques (graph fusion, RRF); little work evaluates source-level filtering before embedding, when full context (headers/threads) is still available.
Hypothesis. Appropriateness filtering at ingestion improves QA relevance and reduces PII/leak risk versus post-hoc ranking alone. Ambiguous claims (e.g., sarcasm intertwined with facts) are retained with metadata rather than blindly dropped.
Ingestion. Gmail (imaplib) + ENRON emails for testing (from the Berkeley annotated archive). Each email message is parsed into text, headers, and thread context.
Vector store. Pinecone index netwatch-claims, embeddings via OpenAI text-embedding-3-large (3072 dims). Retrieve via ANN; semantic RRF ranking under construction.
QA. RAG tool (lookup_in_rag) wired to OpenAI gpt-5-mini. DSPy is used for more complex prompt/program structuring.
Baselines. (A) Vanilla RAG (no pre-filter). (B) RAG + LLM re-rank/filters (zero-/few-shot). (C) Optional: post-filter RRF only.
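A minimal sketch of the embed-and-index path, assuming the current OpenAI and Pinecone Python clients and an existing 3072-dimensional netwatch-claims index; the claim record layout and metadata fields are illustrative.

```python
# Embed claims with text-embedding-3-large and index/query them in Pinecone.
import os
from openai import OpenAI
from pinecone import Pinecone

EMBED_MODEL = "text-embedding-3-large"  # 3072-dim embeddings
oai = OpenAI()
index = Pinecone(api_key=os.environ["PINECONE_API_KEY"]).Index("netwatch-claims")

def embed(texts: list[str]) -> list[list[float]]:
    resp = oai.embeddings.create(model=EMBED_MODEL, input=texts)
    return [d.embedding for d in resp.data]

def index_claims(claims: list[dict]) -> None:
    # Assumed claim record: {"id": ..., "text": ..., "scores": {classifier -> score}}
    vectors = [
        {"id": c["id"], "values": v, "metadata": {"text": c["text"], **c["scores"]}}
        for c, v in zip(claims, embed([c["text"] for c in claims]))
    ]
    index.upsert(vectors=vectors)

def retrieve(query: str, top_k: int = 10) -> list:
    # Plain ANN retrieval; RRF-based fusion is still under construction.
    res = index.query(vector=embed([query])[0], top_k=top_k, include_metadata=True)
    return res.matches
```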
Novelty. We evaluate ingestion-time, claim-level filtering in an informal-communications domain where appropriate and inappropriate content co-exist within the same message, and run end-to-end ablations to quantify compounding effects.
Infrastructure. Pinecone (serverless, us-east-1), orchestrated via LangGraph tool calls.
Embeddings. text-embedding-3-large with EMBED_DIMS = 3072.
UI. ChatInterface.
Email ingestion. Polls netwatch5541@gmail.com; appends rows to gmail_msgs_df.
Claim extraction. Converts gmail_msgs_df into gmail_claims_df.
Indexing. build_db() embeds and indexes claims from gmail_claims_df.
Retrieval. db_lookup() and tool lookup_in_rag (currently with an RRF placeholder).
Our current claim-creation prompt architecture is a four-step method involving selection, where claims are first curated by paraphrasing the email; disambiguation, where claims are augmented with as much context as possible from the email; and decomposition, where claims are broken down into independent claims.
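These claim-creation steps could be structured as DSPy modules along the following lines. This is a sketch assuming a recent DSPy release with typed signatures; the signature names and field wording are placeholders, not the exact prompts used in the pipeline.

```python
# Selection -> disambiguation -> decomposition, expressed as DSPy predictors.
import dspy

dspy.configure(lm=dspy.LM("openai/gpt-5-mini"))

class SelectClaims(dspy.Signature):
    """Paraphrase the email into candidate factual claims, dropping chit-chat."""
    email: str = dspy.InputField()
    candidate_claims: list[str] = dspy.OutputField()

class Disambiguate(dspy.Signature):
    """Rewrite a claim with enough context from the email to stand alone."""
    email: str = dspy.InputField()
    claim: str = dspy.InputField()
    contextualized_claim: str = dspy.OutputField()

class Decompose(dspy.Signature):
    """Split a contextualized claim into independent atomic claims."""
    contextualized_claim: str = dspy.InputField()
    atomic_claims: list[str] = dspy.OutputField()

def extract_claims(email_text: str) -> list[str]:
    select = dspy.Predict(SelectClaims)
    disambig = dspy.Predict(Disambiguate)
    decomp = dspy.Predict(Decompose)
    atomic: list[str] = []
    for claim in select(email=email_text).candidate_claims:
        ctx = disambig(email=email_text, claim=claim).contextualized_claim
        atomic.extend(decomp(contextualized_claim=ctx).atomic_claims)
    return atomic
```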
Primary. ~1,700 ENRON emails with tone/topic labels (email-level), sourced from the Berkeley annotated archive. For claim-level training/eval, we are also considering constructing a synthetic composite set by inserting professional facts/QA pairs (from MeetingBank-QA-Summary) into non-professional ENRON emails, yielding mixed-context messages with groundable QA.
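One possible construction for the synthetic composite set, sketched under assumed record layouts for the ENRON emails and the MeetingBank-QA-Summary items:

```python
# Splice a groundable professional fact (with its QA pair) into an otherwise
# non-professional ENRON email. Record layouts are illustrative assumptions.
import random

def make_composite(enron_email: dict, qa_item: dict, seed: int | None = None) -> dict:
    """enron_email: {"body": str, "labels": {...}}
    qa_item: {"fact": str, "question": str, "answer": str}
    """
    rng = random.Random(seed)
    paragraphs = [p for p in enron_email["body"].split("\n\n") if p.strip()]
    insert_at = rng.randint(0, len(paragraphs))   # splice point inside the email
    paragraphs.insert(insert_at, qa_item["fact"])  # inject the groundable fact
    return {
        "body": "\n\n".join(paragraphs),
        "question": qa_item["question"],           # groundable QA for retrieval eval
        "answer": qa_item["answer"],
        "gold_claim": qa_item["fact"],             # claim-level supervision target
        "source_labels": enron_email.get("labels", {}),
    }
```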
Claim extraction eval. For claim extraction, we ultimately hope to benchmark against Claimify-style datasets to ensure accurate splitting independent of the LLM prompting.
Based on initial testing, we observed the following trends:
Repo. teamNetwatch / 5541FinalProject. Share link available during mentor meeting.
Metrics. Two tracks: (1) classifier gating quality on claims; (2) retrieval/QA utility after pre-filtering. We drop ROUGE-L (not appropriate for this task) and use metrics aligned to accept/reject behavior and retrieval ranking quality.
| Classifier (current) | Source | Accept Acc↑ | Reject Acc↑ | Balanced Acc↑ | Precision (accept)↑ | Precision (reject)↑ | AUPRC / ROC-AUC↑ | Notes |
|---|---|---|---|---|---|---|---|---|
| Relevance | LLM CoT+ReAct | TBD | TBD | TBD | TBD | TBD | TBD | Finetune planned (RoBERTa-large) |
| PII / Confidentiality | LLM CoT+ReAct | TBD | TBD | TBD | TBD | TBD | TBD | Redaction candidates emitted |
| Tone / Sarcasm | LLM CoT+ReAct | TBD | TBD | TBD | TBD | TBD | TBD | Down-rank if ambiguous |
| Toxicity | LLM CoT+ReAct | TBD | TBD | TBD | TBD | TBD | TBD | Gate hard above threshold |
| Speculation / Opinion | LLM CoT+ReAct | TBD | TBD | TBD | TBD | TBD | TBD | Store w/ flag vs. drop |
| Inconsistency | LLM CoT+ReAct | TBD | TBD | TBD | TBD | TBD | TBD | Cross-claim check (planned) |
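To make the gating behavior in the Notes column concrete, the sketch below combines per-claim classifier scores into an accept / down-rank / drop decision. Thresholds and score names are placeholders, not tuned values.

```python
# Combine per-claim classifier scores into a gating decision.
from dataclasses import dataclass

@dataclass
class ClaimScores:
    relevance: float
    pii: float
    tone_sarcasm: float
    toxicity: float
    speculation: float
    inconsistency: float

def gate_claim(s: ClaimScores,
               tox_threshold: float = 0.8,
               rel_threshold: float = 0.3) -> dict:
    if s.toxicity >= tox_threshold:          # gate hard above threshold
        return {"action": "drop", "reason": "toxicity"}
    if s.relevance < rel_threshold:          # clearly off-topic content
        return {"action": "drop", "reason": "irrelevant"}
    flags = []
    if s.pii >= 0.5:
        flags.append("redaction_candidate")  # emit for downstream redaction
    if s.speculation >= 0.5:
        flags.append("speculative")          # store with flag rather than drop
    action = "downrank" if s.tone_sarcasm >= 0.5 else "keep"  # ambiguous sarcasm
    return {"action": action, "flags": flags, "scores": vars(s)}
```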
| Setting | nDCG@10↑ | P@10↑ | PII Leak↓ | Faithfulness↑ | Notes |
|---|---|---|---|---|---|
| Vanilla RAG | TBD | TBD | TBD | TBD | No filtering |
| Pre-filtered RAG (ours) | TBD | TBD | TBD | TBD | Claim-level MoE; metadata re-weighting |
| LLM-filtered RAG | TBD | TBD | TBD | TBD | Zero/few-shot filters |
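The retrieval-side metrics could be computed as in the sketch below, which assumes binary relevance labels and a per-claim PII flag and scores nDCG over the retrieved list only; the final report may use different definitions.

```python
# Simple implementations of nDCG@k, P@k, and a PII leak rate over retrieved claims.
import math

def ndcg_at_k(relevances: list[int], k: int = 10) -> float:
    """relevances: relevance of each retrieved item, in ranked order."""
    dcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))
    ideal = sorted(relevances, reverse=True)
    idcg = sum(rel / math.log2(i + 2) for i, rel in enumerate(ideal[:k]))
    return dcg / idcg if idcg > 0 else 0.0

def precision_at_k(relevances: list[int], k: int = 10) -> float:
    return sum(1 for r in relevances[:k] if r > 0) / k

def pii_leak_rate(retrieved_pii_flags: list[bool]) -> float:
    """Fraction of retrieved claims flagged as containing PII."""
    return sum(retrieved_pii_flags) / len(retrieved_pii_flags) if retrieved_pii_flags else 0.0
```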
Status (midterm):
Risks and mitigations:
- Email-level labels → create synthetic claim-level supervision.
- Class imbalance → weighted sampling / focal loss.
- Privacy eval → synthetic PII injections.
- Ambiguity → keep claims with metadata rather than dropping them.
- Engineering → unify the index dimension (avoid the 768 vs. 3072 mismatch), replace the placeholder RNG scoring with RRF (see the sketch below), and guard optional Gemini code paths.
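For reference, a minimal reciprocal rank fusion (RRF) implementation that could replace the placeholder scoring in db_lookup(); the choice of ranking sources to fuse (e.g., dense ANN order vs. keyword order) is an assumption.

```python
# Standard reciprocal rank fusion over multiple rankings of document ids.
from collections import defaultdict

def rrf(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """rankings: each inner list is doc ids ordered best-first by one retriever."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # standard RRF contribution
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```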
We introduce ingestion-time appropriateness filtering for corporate RAG: claim-level MoE classifiers decide what to store and how to weight retrieval. Next, we will broaden datasets (Avocado), expand filters (consistency checking against a knowledge base), and evaluate redaction policies vs. hard drops.