Text Processing

NLP, embeddings, classification, RAG

Text processing is the foundation of modern NLP and AI applications, encompassing everything from basic tokenization to sophisticated RAG pipelines. It bridges raw text data and machine understanding through embedding models, vector databases, and retrieval systems. In 2025, text processing powers search engines, chatbots, document analysis tools, and enterprise knowledge systems.

Key Formulas

TF-IDF = TF(term, doc) × IDF(term) = (count(term in doc) / total terms in doc) × log(N / df(term))
Cosine Similarity = (A · B) / (||A|| × ||B||) = Σ(Ai × Bi) / (√ΣAi² × √ΣBi²)
Euclidean Distance = √Σ(Ai - Bi)² - used for comparing embedding vectors
BM25 Score = Σ(IDF(qi) × (f(qi, D) × (k1 + 1)) / (f(qi, D) + k1 × (1 - b + b × |D|/avgdl)))
Chunk Overlap Ratio = overlap_tokens / chunk_tokens - typically 10-20% for RAG
Recall@K = (relevant items in top K) / (total relevant items) - for retrieval evaluation
MRR (Mean Reciprocal Rank) = (1/|Q|) × Σ(1 / rank of first relevant item)
Chunk Size (tokens) ≈ input_tokens / (overlap_ratio + 1) - for equal-sized chunks
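The first two formulas can be checked directly in Python (using a base-10 logarithm for IDF, as in the worked TF-IDF example later on this page):

```python
import math

def tf_idf(term_count, doc_len, n_docs, doc_freq):
    """TF-IDF = (count / doc length) × log10(N / document frequency)."""
    tf = term_count / doc_len
    idf = math.log10(n_docs / doc_freq)
    return tf * idf

def cosine_similarity(a, b):
    """Cosine similarity = dot(A, B) / (||A|| × ||B||)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Note that IDF is sometimes defined with a natural log instead; the choice only rescales scores, so be consistent within one system.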

Key Concepts

Tokenization Strategies

Tokenization converts text into discrete units (tokens) that models process. Modern approaches include: (1) BPE (Byte Pair Encoding) - used by GPT-4 and LLaMA, splits words into subword units (WordPiece, used by the BERT family, is a closely related variant); (2) SentencePiece - language-agnostic, handles whitespace-free languages; (3) Character-level - used by some specialized models, preserves exact text; (4) Tiktoken - OpenAI's efficient BPE implementation for GPT models. In 2025, tokenization overhead matters: GPT-4o averages ~4 characters/token for English and ~1-2 characters/token for code. Token counts directly drive API pricing.
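The core of BPE training is simple enough to sketch in a few lines: start from characters and repeatedly merge the most frequent adjacent symbol pair. This is a toy illustration; production tokenizers like tiktoken use optimized byte-level implementations.

```python
from collections import Counter

def learn_bpe_merges(words, num_merges):
    """Learn BPE merge rules from a word list (toy illustration).

    Each word starts as a tuple of characters; each step merges the
    most frequent adjacent symbol pair across the corpus.
    """
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word with the chosen pair merged into one symbol
        new_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges, vocab
```

Applying the learned merge rules in order to new text is what the tokenizer does at inference time.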

Embedding Models

Embeddings map text to dense vectors (typically 384-3072 dimensions) capturing semantic meaning. Leading models in 2025: (1) OpenAI text-embedding-3-large (3072d) - highest quality, supports dimension reduction; (2) Cohere embed-v3 - strong multilingual, supports compression; (3) Voyage AI voyage-2 (1024d) - excellent for long documents; (4) Google textembedding-gecko (768d) - integrated with Vertex AI; (5) E5-large-v2 (1024d) - open-source, strong benchmark performance; (6) BGE-large-en - open-source, competitive with commercial models. Model selection depends on language coverage, dimension constraints, cost, and domain specialization.

Vector Databases

Vector databases store and retrieve embeddings efficiently. Top options in 2025: (1) Pinecone - managed, serverless or pod-based, excellent for production; (2) Chroma - lightweight, runs locally, great for development; (3) pgvector - PostgreSQL extension, leverages existing infrastructure; (4) Weaviate - open-source with GraphQL, strong hybrid search; (5) Qdrant - Rust-based, high performance, open-source; (6) Milvus - distributed, handles billions of vectors. Key features to evaluate: metadata filtering, hybrid (vector + keyword) search, multi-tenancy, replication, backup/restore, and pricing model. Pinecone's serverless tier charges per query, while pgvector adds minimal overhead to existing Postgres deployments.
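To make the retrieval mechanics concrete, here is a toy brute-force store; the class and its API are invented for illustration. Real databases add approximate-nearest-neighbor indexes, persistence, and replication, but the core contract (add vectors with metadata, query top-K by similarity, optionally pre-filter) is the same.

```python
import math

class TinyVectorStore:
    """Toy in-memory vector store with metadata filtering (illustration only)."""

    def __init__(self):
        self.items = []  # list of (id, vector, metadata)

    def add(self, item_id, vector, metadata=None):
        self.items.append((item_id, vector, metadata or {}))

    def query(self, vector, top_k=5, where=None):
        """Return up to top_k (id, cosine_similarity, metadata) tuples,
        optionally keeping only items whose metadata matches `where`."""
        def cosine(a, b):
            dot = sum(x * y for x, y in zip(a, b))
            na = math.sqrt(sum(x * x for x in a))
            nb = math.sqrt(sum(y * y for y in b))
            return dot / (na * nb)

        candidates = [
            (item_id, cosine(vector, vec), meta)
            for item_id, vec, meta in self.items
            if where is None or all(meta.get(k) == v for k, v in where.items())
        ]
        return sorted(candidates, key=lambda t: t[1], reverse=True)[:top_k]
```

Brute-force scan is O(n) per query, which is fine up to a few hundred thousand vectors; beyond that, ANN indexes (HNSW, IVF) are what the databases above provide.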

RAG Pipeline Architecture

Retrieval-Augmented Generation (RAG) combines retrieval with generation. Core components: (1) Document Processing - extract text, chunk, embed; (2) Vector Store - persist embeddings with metadata; (3) Query Processing - embed query, retrieve top-K chunks; (4) Context Assembly - format retrieved chunks with query; (5) Generation - LLM synthesizes answer from context. Advanced patterns: Hybrid RAG (vector + BM25), Reranking (Cross-encoder after retrieval), Query Expansion (generate multiple query variations), Self-Querying (LLM generates metadata filters), Multi-Vector Store (separate indexes by document type). In 2025, production RAG systems use 100-500 token chunks with 10-20% overlap and retrieve 5-20 documents per query.
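The five components above can be wired together in a few lines. In this sketch, `retrieve` and `generate` are placeholder callables standing in for a vector-store query and an LLM call; the prompt template is an assumption, not any specific framework's format.

```python
def rag_answer(query, retrieve, generate, top_k=5):
    """Minimal RAG loop: retrieve chunks, assemble context, generate.

    retrieve(query, top_k) -> list of chunk strings   (steps 2-3)
    generate(prompt) -> answer string                 (step 5)
    """
    chunks = retrieve(query, top_k)
    # Step 4: context assembly - number chunks so the model can cite them
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    prompt = (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )
    return generate(prompt)
```

Keeping retrieval and generation behind plain callables like this makes it easy to swap in hybrid retrieval or a reranker without touching the loop.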

Chunking Strategies

Chunking splits documents into retrievable units. Common approaches: (1) Fixed-size - simple, predictable token counts, risks mid-sentence splits; (2) Recursive character - splits on paragraphs, then sentences, then words; (3) Semantic - respects document structure (headings, sections); (4) Sentence-based - NLTK/spaCy sentence boundaries; (5) Sliding window - overlapping chunks for context continuity; (6) Parent-child - retrieve small chunks, return larger parent context. Best practices: 256-512 tokens for dense text, 512-1024 for code/technical docs, 10-20% overlap preserves context across boundaries. LangChain's RecursiveCharacterTextSplitter with separators ['\n\n', '\n', '. ', ' ', ''] handles most documents well.
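A sliding-window chunker with overlap (the fixed-size approach above) can be sketched as follows; the tokens can be tiktoken ids, word pieces, or plain words as a stand-in.

```python
def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token list into overlapping fixed-size chunks."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each chunk
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```

With chunk_size=512 and overlap=64 (12.5%), each boundary region appears in two chunks, which is what preserves context for sentences that straddle a split.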

Named Entity Recognition (NER)

NER identifies and classifies entities in text (people, organizations, locations, dates, etc.). Modern approaches: (1) Transformer-based - spaCy's transformer pipeline, Hugging Face models (xlm-roberta-ner); (2) LLM extraction - prompt GPT-4/Claude with structured output; (3) Fine-tuned models - domain-specific (biomedical, legal, financial). In 2025, spaCy's en_core_web_trf achieves ~92% F1 on standard benchmarks. For production: define entity schema upfront, handle entity disambiguation (Apple the company vs. fruit), validate extracted entities, and handle edge cases (multi-word entities, nested entities). LLM-based NER offers flexibility but higher latency and cost.
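For pattern-friendly entity types, a rule-based extractor is a reasonable baseline; this toy version handles only dates and emails (the patterns are illustrative, not exhaustive). People, organizations, and locations genuinely need the statistical models above, since surface patterns cannot disambiguate them.

```python
import re

# Toy rule-based extractor for two regex-friendly entity types.
# Real NER for PERSON/ORG/LOC requires spaCy or transformer models.
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),       # ISO dates only
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def extract_entities(text):
    """Return (label, matched_text) pairs found in the input."""
    return [
        (label, match.group())
        for label, pattern in PATTERNS.items()
        for match in pattern.finditer(text)
    ]
```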

Sentiment Analysis

Sentiment analysis classifies text emotional tone. Approaches: (1) Lexicon-based - VADER, AFINN, fast but limited nuance; (2) ML classifiers - Naive Bayes, SVM, trained on labeled data; (3) Deep learning - BERT-based models (cardiffnlp/twitter-roberta-base-sentiment); (4) LLM prompting - GPT-4/Claude with structured output. Modern production systems often combine: fast lexicon for real-time, transformer for batch, LLM for ambiguous cases. Output formats: binary (positive/negative), ternary (pos/neg/neutral), fine-grained (5-star), aspect-based (sentiment per feature). Key challenge: sarcasm, negation, and domain-specific language require contextual understanding.
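A minimal sketch of the lexicon-based approach, with naive one-word negation handling; the word scores here are made up for the example (real lexicons like VADER and AFINN score thousands of words and handle intensifiers and punctuation).

```python
# Tiny made-up lexicon; positive scores for positive words, negative
# for negative. Real systems load VADER/AFINN word lists instead.
LEXICON = {"good": 1.0, "great": 2.0, "love": 2.0,
           "bad": -1.0, "terrible": -2.0, "hate": -2.0}
NEGATORS = {"not", "never", "no"}

def lexicon_sentiment(text):
    """Sum word scores, flipping the sign after a negator ('not good')."""
    score, negate = 0.0, False
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in NEGATORS:
            negate = True
            continue
        if word in LEXICON:
            score += -LEXICON[word] if negate else LEXICON[word]
        negate = False  # negation only carries to the next word
    return score
```

This illustrates exactly why the section flags negation and sarcasm as the hard cases: one-word negation scope is a crude approximation of how negation actually works in language.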

Text Classification

Text classification assigns predefined categories to text. Methods: (1) Traditional ML - TF-IDF + SVM/Naive Bayes, still effective for simple tasks; (2) Fine-tuned transformers - BERT/RoBERTa with classification head; (3) Zero-shot - classify without training data using NLI models or LLMs; (4) Few-shot - LLM with example prompts. In 2025, zero-shot classification via CLIP-style models or LLM prompting handles many use cases without training data. For high-volume production: fine-tune smaller models (DistilBERT, MiniLM) for speed. Evaluate using precision, recall, F1 per class - macro-averaged for balanced class importance, weighted for imbalanced datasets.
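The traditional-ML baseline is compact enough to sketch from scratch. This is a minimal multinomial Naive Bayes over raw word counts with Laplace smoothing (in practice you would use scikit-learn's `MultinomialNB` with TF-IDF features rather than hand-rolling it).

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Minimal multinomial Naive Bayes over bag-of-words counts."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, text):
        words = text.lower().split()
        total_docs = sum(self.class_counts.values())
        best, best_lp = None, float("-inf")
        for label in self.class_counts:
            # log prior + sum of smoothed log likelihoods
            lp = math.log(self.class_counts[label] / total_docs)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in words:
                lp += math.log((self.word_counts[label][word] + 1) / denom)
            if lp > best_lp:
                best, best_lp = label, lp
        return best
```

Even this tiny version shows why the baseline persists: it trains instantly, needs no GPU, and is easy to inspect when a prediction looks wrong.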

Solved Examples

Problem 1:

Calculate the TF-IDF score for the term 'machine' in a document where 'machine' appears 5 times, the document has 100 total words, and the term appears in 50 documents out of a corpus of 10,000 documents.

Solution:

Step 1: Calculate Term Frequency (TF)
TF = count(term in doc) / total terms in doc
TF = 5 / 100 = 0.05

Step 2: Calculate Inverse Document Frequency (IDF)
IDF = log(N / df(term))
IDF = log(10000 / 50) = log(200) ≈ 2.301

Step 3: Calculate TF-IDF
TF-IDF = TF × IDF
TF-IDF = 0.05 × 2.301 = 0.115

Answer: The TF-IDF score for 'machine' is approximately 0.115. This indicates moderate importance - the term appears somewhat frequently in this document but is also present in many other documents.

Problem 2:

A RAG system retrieves 10 chunks from a vector database. The query embedding has cosine similarity scores: [0.89, 0.85, 0.82, 0.78, 0.75, 0.72, 0.68, 0.65, 0.61, 0.58]. Calculate the Mean Reciprocal Rank (MRR) if the relevant answer appears in chunks at positions 1, 3, and 7. Then explain why MRR matters for RAG evaluation.

Solution:

Step 1: Identify the first relevant result
Relevant chunks sit at ranks 1, 3, and 7 (ranks are 1-based); the first relevant chunk is at rank 1
Reciprocal rank = 1 / rank = 1/1 = 1

Step 2: Calculate MRR for this single query
For a single query, MRR reduces to the reciprocal rank of the first relevant result
MRR = 1/1 = 1.0

Step 3: Interpret the result
MRR = 1.0 is the best possible score - the top result was relevant.

Step 4: Why MRR matters for RAG
MRR measures how early relevant results appear, not just whether they appear. For RAG, this is critical because: (1) LLMs have limited context windows - early results get more attention; (2) Users prefer concise answers - finding relevance quickly reduces context bloat; (3) Reranking models can improve MRR by promoting relevant chunks.

Answer: MRR = 1.0 for this query. For multi-query evaluation, average the reciprocal ranks across all queries. A production RAG system should target MRR > 0.7 for top-10 retrieval.
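The multi-query averaging described in the answer is a one-liner:

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """MRR across queries.

    Each entry is the 1-based rank of the first relevant result for
    that query, or None if nothing relevant was retrieved (counts as 0).
    """
    reciprocal = [1.0 / r if r else 0.0 for r in first_relevant_ranks]
    return sum(reciprocal) / len(reciprocal)
```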

Problem 3:

Design a chunking strategy for a legal document repository containing 500 PDFs averaging 50 pages each. Documents contain sections, subsections, and clause numbering. The goal is RAG for legal Q&A. Specify chunk size, overlap, splitting strategy, and justify each choice.

Solution:

Step 1: Analyze document characteristics
- Legal documents have hierarchical structure (sections, subsections, clauses)
- Cross-references between sections are common
- Each clause may be 100-500 tokens
- Context window target: 8000 tokens for generation

Step 2: Choose splitting strategy
Use semantic chunking respecting document structure:
- Primary separator: section headings (regex match 'Section \d+')
- Secondary separator: subsection headings
- Tertiary: clause boundaries (numbered lists like '1.1', '1.2')

Step 3: Determine chunk parameters
- Target chunk size: 512 tokens (covers typical legal clause + context)
- Overlap: 100 tokens (20%) - captures cross-references
- Maximum chunk: 1024 tokens - merge short adjacent clauses
- Minimum chunk: 128 tokens - avoid over-fragmentation

Step 4: Add metadata for filtering
Store with each chunk:
- Document title, date, jurisdiction
- Section/subsection path
- Clause numbers referenced

Step 5: Justification
- 512 tokens balances retrieval precision (smaller = more specific) with context preservation
- 20% overlap captures references that span clause boundaries
- Semantic splitting respects legal document structure
- Metadata enables jurisdiction-based filtering before vector search

Answer: Semantic chunking with 512-token target, 100-token overlap, splitting on section/clause boundaries, with legal metadata (jurisdiction, document type, date) for pre-filtering.

Problem 4:

Compare embedding models for a multilingual RAG system serving English, Spanish, and Mandarin documents. The system processes 1M queries/month with average 5 chunks retrieved per query. Compare: OpenAI text-embedding-3-large, Cohere embed-multilingual-v3, and a self-hosted E5 model on dimensions, cost, latency, and quality.

Solution:

Step 1: Define comparison criteria
- Dimensions: storage and memory footprint
- Cost: API cost for 1M queries/month, each embedding ~3000 tokens (the query plus 5 chunk-sized passages) = 3B tokens/month
- Latency: response time for embedding operations
- Quality: multilingual benchmark performance (MTEB)

Step 2: Analyze each model

OpenAI text-embedding-3-large:
- Dimensions: 3072 (reducible to 256+)
- Cost: $0.13/1M tokens input. Assume 500 tokens/query + 2500 tokens of chunks = 3000 tokens total × 1M queries = 3B tokens = $390/month
- Latency: ~50-100ms via API
- Quality: Excellent English, good multilingual (not top tier)

Cohere embed-multilingual-v3:
- Dimensions: 1024
- Cost: $0.10/1M tokens × 3B tokens = $300/month
- Latency: ~30-60ms via API
- Quality: Top-tier multilingual, specifically trained for 100+ languages

Self-hosted E5-mistral-7b-instruct:
- Dimensions: 4096
- Cost: GPU instance ~$0.50-1/hour × 720 hours = $360-720/month
- Latency: ~10-20ms (local, no network)
- Quality: Strong but requires maintenance, model management

Step 3: Recommendation
For multilingual (English/Spanish/Mandarin): Cohere embed-multilingual-v3 is optimal.
- Best multilingual quality by design
- Reasonable cost (~$300/month)
- Good latency
- No infrastructure management

Step 4: Alternative scenarios
- If budget critical: Self-hosted E5 (if you have GPU infrastructure)
- If English-only: OpenAI text-embedding-3-small for lower cost
- If quality paramount: Cohere for multilingual, OpenAI large for English

Answer: Cohere embed-multilingual-v3 at ~$300/month, 1024 dimensions, 30-60ms latency, best-in-class multilingual quality. Choose OpenAI for English-only at scale, or self-host E5 if you have existing GPU infrastructure and want no per-query costs.
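The back-of-envelope arithmetic behind the monthly figures, using the per-token prices quoted in this comparison (vendor pricing changes, so treat the rates as assumptions):

```python
# Volume assumption from the problem: 1M queries/month, each embedding
# ~3000 tokens (500-token query + 5 × 500-token chunks).
tokens_per_query = 500 + 5 * 500
monthly_tokens = tokens_per_query * 1_000_000   # 3B tokens/month

# Prices as quoted above, in $ per 1M tokens
openai_cost = monthly_tokens / 1_000_000 * 0.13   # ≈ $390/month
cohere_cost = monthly_tokens / 1_000_000 * 0.10   # ≈ $300/month
```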

Tips & Tricks

  • Start with off-the-shelf embedding models (text-embedding-3-small, Cohere embed) before fine-tuning. Fine-tuning embeddings requires 10K+ labeled query-document pairs and often provides marginal gains.
  • Use dimension reduction (Matryoshka embeddings) to cut storage 4x with minimal quality loss. OpenAI text-embedding-3-large supports 256, 512, 1024, or 3072 dimensions.
  • Always evaluate retrieval with domain-specific test queries. Generic benchmarks (MTEB) don't reflect your actual use case. Build a test set of 100+ real queries with ground-truth relevant documents.
  • Hybrid search (vector + BM25) typically outperforms pure vector search by 5-15% on recall. Use vector for semantic matching, BM25 for exact term matching. Most vector databases (Pinecone, Weaviate, pgvector) support hybrid.
  • Chunk overlap is your friend for context preservation. 10-20% overlap prevents information loss at chunk boundaries. For technical documents, consider 25% overlap.
  • Pre-filter by metadata before vector search when possible. Filter on date ranges, document types, or categories to reduce search space and improve both speed and precision.
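The dimension-reduction tip above amounts to truncating the vector and re-normalizing it; here is a sketch. This only preserves quality for models trained with a Matryoshka objective (such as the text-embedding-3 family); naively truncating other models' embeddings degrades retrieval.

```python
import math

def truncate_embedding(vec, dims):
    """Matryoshka-style reduction: keep the first `dims` components
    and rescale the result back to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head]
```

Re-normalizing matters because cosine similarity assumes comparable vector magnitudes; without it, truncated and full-length vectors would not score consistently.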
