Introduction
Building NLP systems that work across India's linguistic diversity is challenging. With 22 official languages and hundreds of dialects, creating a single question-answering system that serves all users requires careful architecture and multilingual expertise.
In this post, I'll walk through IndicRAG, a production-ready Retrieval-Augmented Generation (RAG) pipeline I built to handle document Q&A across 12+ Indian languages, including Hindi, Bengali, Tamil, Telugu, and more.
The Challenge: Multilingual Document Understanding
Traditional NLP systems face several obstacles in Indian language contexts:
- Script diversity: Devanagari, the scripts of the Dravidian languages (Tamil, Telugu, Kannada, Malayalam), and Perso-Arabic each require different processing
- Limited training data: Most models are English-centric
- Code-mixing: Users often switch between languages mid-sentence
- Domain-specific vocabulary: Technical and legal terms vary by language
- OCR challenges: Scanned documents have varied quality
RAG Architecture Overview
Retrieval-Augmented Generation combines an information-retrieval step with a generative model, so answers stay grounded in the source documents. The pipeline runs in five stages (sketched in code after this list):
- Document Ingestion: Process and chunk documents with context preservation
- Vector Encoding: Convert chunks to embeddings using multilingual models
- Semantic Search: Retrieve relevant chunks using FAISS vector similarity
- Reranking: Cross-encoder models improve retrieval precision
- Generation: Contextualized answer generation with mT5
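A minimal sketch of how these stages compose end to end. The helper names here (retrieve, rerank, generate_answer) stand in for the components described in the rest of this post; retrieve in particular is a placeholder for the FAISS retrieval code shown later.

def answer_question(query: str, top_k: int = 50, final_k: int = 5) -> str:
    candidates = retrieve(query, k=top_k)           # Stage 3: FAISS dense retrieval
    ranking = rerank(query, candidates)             # Stage 4: cross-encoder, best first
    best = [candidates[i] for i in ranking[:final_k]]
    context = "\n\n".join(best)                     # chunks assumed to be plain strings
    return generate_answer(query, context)          # Stage 5: mT5 generation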
Component 1: Multilingual Document Processing
OCR Integration for Scanned PDFs
Many Indian language documents exist only as scanned images, so I built a two-stage OCR pipeline:
- Tesseract: Lightweight OCR engine supporting 100+ languages, including the major Indian scripts
- LayoutParser: Deep learning-based layout detection that segments complex pages into blocks before recognition
import numpy as np
from PIL import Image
from pytesseract import image_to_string
import layoutparser as lp

# Layout detection model (Detectron2 backend; PubLayNet weights shown as an example config)
layout_model = lp.Detectron2LayoutModel('lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config')

# Multi-stage OCR pipeline: detect layout blocks, then run Tesseract on each block
def extract_text_multilingual(image_path, lang='hin+eng'):
    image = np.asarray(Image.open(image_path).convert('RGB'))
    layout = layout_model.detect(image)
    text_blocks = []
    for block in layout:
        text = image_to_string(
            block.crop_image(image),
            lang=lang,
            config='--psm 6'  # assume a uniform block of text
        )
        text_blocks.append(text)
    return '\n'.join(text_blocks)
Semantic Chunking
Instead of naive fixed-size chunking, I implemented semantic boundary detection (a simplified sketch follows this list):
- Preserve paragraph and section boundaries
- Keep tables and lists intact
- Add overlapping context windows (256 tokens overlap)
- Metadata tagging for document structure
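A simplified version of the chunker: split on paragraph boundaries and carry an overlap window forward. Token counts are approximated by whitespace words in this sketch; the production pipeline also handles tables, lists, and structural metadata.

def semantic_chunks(text, max_tokens=512, overlap=256):
    # Split on blank lines so paragraph boundaries are preserved
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        current.extend(para.split())
        if len(current) >= max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:]   # overlapping context window
    if current:
        chunks.append(" ".join(current))
    return chunks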
Component 2: Multilingual Dense Embeddings
The key to cross-lingual retrieval is using models trained on parallel multilingual data:
Model Selection: LaBSE vs mBERT
I evaluated multiple embedding models:
- LaBSE (Language-agnostic BERT Sentence Embeddings): Trained on 109 languages with translation pairs, excellent for semantic similarity across languages
- mBERT (Multilingual BERT): Good general-purpose embeddings but weaker cross-lingual alignment
- XLM-RoBERTa: Strong performance but larger model size
Winner: LaBSE for its superior cross-lingual retrieval performance despite slightly higher inference cost.
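For reference, encoding chunks with LaBSE through the sentence-transformers library looks roughly like this. The batch size is illustrative, and chunk_texts stands for the list of chunk strings from the previous step; embeddings are normalized later, just before indexing.

from sentence_transformers import SentenceTransformer

# LaBSE produces 768-dimensional sentence embeddings
model = SentenceTransformer("sentence-transformers/LaBSE")

chunk_embeddings = model.encode(
    chunk_texts,            # list of chunk strings, any supported language
    batch_size=32,
    convert_to_numpy=True   # float32 array, ready for FAISS
)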
Vector Storage with FAISS
For efficient similarity search over millions of document chunks, I used FAISS (Facebook AI Similarity Search):
import faiss
import numpy as np
# Create FAISS index with inner product similarity
dimension = 768 # LaBSE embedding size
index = faiss.IndexFlatIP(dimension)
# Normalize embeddings for cosine similarity
faiss.normalize_L2(embeddings)
index.add(embeddings)
# Search with GPU acceleration (optional)
gpu_index = faiss.index_cpu_to_gpu(
faiss.StandardGpuResources(), 0, index
)
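Querying follows the same pattern: encode and normalize the question, then take the top inner products. Here chunk_texts again stands for the indexed chunk strings.

# Retrieve the 50 most similar chunks for a query
query_emb = model.encode([query], convert_to_numpy=True)
faiss.normalize_L2(query_emb)
scores, indices = index.search(query_emb, 50)
candidates = [chunk_texts[i] for i in indices[0]]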
Component 3: Cross-Encoder Reranking
Initial retrieval using dense embeddings can miss nuanced matches. A two-stage approach improves precision:
- Stage 1 (Fast): FAISS retrieves top 50 candidates (~10ms)
- Stage 2 (Accurate): Cross-encoder reranks to top 5 (~100ms)
Why Cross-Encoders?
Unlike bi-encoders (which encode query and document separately), cross-encoders process them jointly, capturing fine-grained interaction signals:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cross-encoder/mmarco-mMiniLMv2-L12-H384-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
reranker = AutoModelForSequenceClassification.from_pretrained(model_name)

def rerank(query, candidates):
    # Score each (query, candidate) pair jointly with the cross-encoder
    pairs = [[query, doc] for doc in candidates]
    inputs = tokenizer(pairs, padding=True, truncation=True,
                       return_tensors='pt', max_length=512)
    with torch.no_grad():
        scores = reranker(**inputs).logits.squeeze(-1)
    return scores.argsort(descending=True)  # candidate indices, most relevant first
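Applied to the FAISS shortlist from the previous section, the reranker's output indexes back into candidates (a sketch):

order = rerank(query, candidates)                        # indices, most relevant first
top_chunks = [candidates[i] for i in order[:5].tolist()]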
Component 4: Contextual Answer Generation
With the most relevant chunks retrieved, I use sequence-to-sequence models for answer generation:
Model: mT5 (Multilingual Text-to-Text Transfer Transformer)
- Covers 101 languages including all major Indian languages
- Trained on mC4 corpus (multilingual Common Crawl)
- Supports abstractive and extractive QA
from transformers import MT5ForConditionalGeneration, MT5Tokenizer
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-large")
tokenizer = MT5Tokenizer.from_pretrained("google/mt5-large")
def generate_answer(query, context):
    input_text = f"question: {query} context: {context}"
    inputs = tokenizer(input_text, return_tensors="pt",
                       max_length=1024, truncation=True)
    outputs = model.generate(
        **inputs,
        max_length=256,
        num_beams=4,
        early_stopping=True,
        no_repeat_ngram_size=3
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
Deployment: FastAPI Service
For production deployment, I built a REST API with FastAPI:
Key Features
- Async processing: Non-blocking I/O for concurrent requests
- Response streaming: Progressive answer generation
- Caching: Redis for frequent queries
- Rate limiting: Protect against abuse
- Monitoring: Prometheus metrics and logging
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="IndicRAG API")

class Query(BaseModel):
    question: str
    language: str
    top_k: int = 5

@app.post("/ask")
async def ask_question(query: Query):
    # Retrieve relevant chunks
    chunks = await retrieve_chunks(
        query.question,
        query.language,
        top_k=query.top_k
    )
    # Rerank
    reranked = await rerank_chunks(query.question, chunks)
    # Generate answer
    answer = await generate_answer(
        query.question,
        reranked[:3]
    )
    return {
        "answer": answer,
        "sources": [chunk.metadata for chunk in reranked[:3]],
        "confidence": calculate_confidence(reranked)
    }
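The Redis caching layer from the feature list can be sketched as a thin wrapper around the endpoint logic: hash the normalized question plus language into a key and store the serialized response with a TTL. The key scheme, TTL, and helper names here are illustrative, and the synchronous redis-py client is used for brevity.

import hashlib
import json
import redis

cache = redis.Redis(host="localhost", port=6379, db=0)
CACHE_TTL = 3600  # one hour; illustrative value

def cache_key(question: str, language: str) -> str:
    raw = f"{language}:{question.strip().lower()}"
    return "indicrag:" + hashlib.sha256(raw.encode("utf-8")).hexdigest()

async def ask_with_cache(query: Query):
    key = cache_key(query.question, query.language)
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    response = await ask_question(query)               # full retrieval + generation path
    cache.setex(key, CACHE_TTL, json.dumps(response))  # assumes response is JSON-serializable
    return response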
Performance Optimizations
1. Model Quantization
Reduced model size by 4x using INT8 quantization, with negligible accuracy loss (a sketch follows this list):
- FP32 mT5-large: 2.4 GB → INT8: 600 MB
- Inference speed: 2.3x faster on CPU
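One minimal way to get this kind of reduction is PyTorch dynamic quantization, which converts the Linear layers to INT8 weights while keeping activations in FP32. This is a sketch aimed at CPU inference; the exact export path used in production may differ.

import torch
from transformers import MT5ForConditionalGeneration

model = MT5ForConditionalGeneration.from_pretrained("google/mt5-large")

# Quantize the Linear layers to INT8 weights for CPU inference
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)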
2. Batch Processing
Process multiple queries together for better GPU utilization:
# Batch encoding with the SentenceTransformer (LaBSE) model from the embedding stage
embeddings = model.encode(
    sentences,
    batch_size=32,
    show_progress_bar=False,
    convert_to_numpy=True
)
3. Approximate Nearest Neighbors
For very large document collections (>10M chunks), use HNSW indexing:
# FAISS HNSW index for fast approximate search
index = faiss.IndexHNSWFlat(dimension, 32)   # 32 = graph connectivity (M)
index.hnsw.efConstruction = 200              # build-time accuracy/speed trade-off
index.hnsw.efSearch = 128                    # query-time accuracy/speed trade-off
Evaluation Metrics
I evaluated IndicRAG on multiple benchmarks:
- XQuAD (Hindi/Telugu): F1 Score 78.3%
- AI2 Reasoning Challenge (translated): 71.2% accuracy
- Custom domain QA: 82.5% exact match on legal documents
- Retrieval precision@5: 89.7%
Real-World Challenges & Solutions
Challenge 1: Code-Mixed Queries
Users often ask "What is property tax in मुंबई?" (mixing English and Hindi; मुंबई is "Mumbai" written in Devanagari)
Solution: Language detection with fallback to multi-script embedding models
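A lightweight first pass is to look at which Unicode scripts appear in the query before choosing a processing path; anything mixing scripts goes straight to the script-agnostic embedding route. This is a sketch, not the exact production detector:

import unicodedata

def scripts_in(text: str) -> set:
    # Approximate each letter's script from its Unicode character name,
    # e.g. "LATIN SMALL LETTER A" -> "LATIN", "DEVANAGARI LETTER MA" -> "DEVANAGARI"
    scripts = set()
    for ch in text:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts

def is_code_mixed(text: str) -> bool:
    return len(scripts_in(text)) > 1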
Challenge 2: Domain Terminology
Legal and medical terms often lack good translations
Solution: Custom terminology dictionaries and domain-specific fine-tuning
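In its simplest form, the terminology layer is a per-language, per-domain lookup applied before embedding, so both the source term and a canonical gloss end up in the vector. The entries below are purely illustrative:

# Illustrative entries; real dictionaries are curated per domain and language
LEGAL_TERMS_HI = {
    "वसीयत": "will (testament)",
    "पट्टा": "lease deed",
}

def expand_terminology(text: str, term_map: dict) -> str:
    # Append the canonical form so both surface forms are embedded together
    for term, canonical in term_map.items():
        if term in text:
            text += f" ({canonical})"
    return text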
Challenge 3: Context Window Limitations
Long documents exceed model context limits (512-1024 tokens)
Solution: Hierarchical retrieval with document-level and chunk-level search
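One way to sketch the hierarchical scheme is with two FAISS indexes: a document-level index over pooled chunk embeddings to shortlist documents, then a chunk-level search filtered to that shortlist. The index and mapping names here are illustrative.

def hierarchical_search(query_emb, doc_index, chunk_index, chunk_doc_ids,
                        k_docs=10, k_chunks=50):
    # Level 1: shortlist documents by their pooled embeddings
    _, doc_ids = doc_index.search(query_emb, k_docs)
    allowed = set(doc_ids[0].tolist())

    # Level 2: over-retrieve chunks, then keep only those from shortlisted documents
    scores, idx = chunk_index.search(query_emb, k_chunks * 4)
    hits = [(score, i) for score, i in zip(scores[0], idx[0])
            if chunk_doc_ids[i] in allowed]
    return hits[:k_chunks]   # FAISS already returns results best-first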
Lessons Learned
- Language-specific preprocessing matters: Hindi text requires different tokenization than Tamil
- Evaluation is hard: Translation-based evaluation misses cultural context
- User feedback loops: Implicit feedback (click-through) beats explicit ratings
- Fallback strategies: Always have a plan when models fail (e.g., keyword search)
Open Source & GitHub
The complete IndicRAG codebase is available on GitHub:
github.com/DNSdecoded/IndicRAG
Includes:
- Document preprocessing pipeline
- Multi-stage retrieval implementation
- FastAPI deployment code
- Docker containerization
- Evaluation scripts and benchmarks
Conclusion
Building production-grade multilingual NLP systems requires balancing model performance, inference speed, and engineering complexity. IndicRAG demonstrates that with careful architecture and optimization, it's possible to serve accurate, fast question-answering across diverse Indian languages.
The key takeaways:
- Use strong multilingual models: LaBSE (trained on parallel data) for retrieval, mT5 for generation
- Implement two-stage retrieval (fast dense + accurate reranking)
- Optimize for production (quantization, batching, caching)
- Plan for edge cases (code-mixing, domain terms, long contexts)
Questions about multilingual RAG or want to collaborate? Get in touch!