Introduction
In a country as linguistically diverse as India, with 22 officially recognized languages and hundreds of dialects, building AI systems that can understand and process multilingual content is not just a technical challenge but a necessity. IndicRAG is a production-ready Retrieval-Augmented Generation (RAG) system designed specifically for Indian languages.
This project emerged from a simple observation: while RAG systems have revolutionized document-based question answering in English, most existing solutions fall short when dealing with Indic languages. IndicRAG addresses this gap by combining state-of-the-art multilingual embeddings, cross-encoder reranking, and OCR capabilities into a single, scalable pipeline.
The Challenge: Why Multilingual RAG is Hard
Building NLP systems that work across India's linguistic diversity is challenging. Traditional English-centric RAG systems face several obstacles:
- Script Diversity: Devanagari, the Dravidian-family scripts (Tamil, Telugu, Kannada, Malayalam), and Perso-Arabic script each require specialized processing and tokenization.
- Limited Training Data: Most open-source embedding models are English-centric and weak in cross-lingual alignment.
- Code-Mixing: Users often switch between languages mid-sentence, such as mixing English and Hindi.
- OCR Quality: Scanned administrative and legal documents often have varied quality and complex layouts.
- Semantic Nuances: Direct translation often fails to capture semantic meaning, requiring language-aware embeddings.
The IndicRAG Architecture
To solve these challenges, IndicRAG utilizes a multi-stage retrieval architecture that balances speed with semantic nuance. The system consists of several integrated components:
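Stripped of the model specifics, the stages compose roughly like this. All function names here are illustrative stand-ins, not IndicRAG's actual API; the point is the control flow, with each stage passed in as a callable:

```python
def answer_query(question, lang, embed, search, rerank, generate, top_k=5):
    """Minimal orchestration of the multi-stage pipeline (illustrative names)."""
    query_vec = embed(question, lang)          # Stage 1: dense query embedding
    candidates = search(query_vec, top_k)      # Stage 1: vector retrieval
    best = rerank(question, candidates, top_k=3)  # Stage 2: cross-encoder rerank
    return generate(question, best, lang)      # Stage 3: contextual generation
```

Each component below plugs into one of these slots.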
1. Document Ingestion & OCR
Many Indian language documents are scanned images. IndicRAG integrates a dual-stage OCR engine:
- LayoutParser: Deep learning-based layout detection to identify complex structures like tables and headers.
- Tesseract: Lightweight OCR engine supporting 100+ languages, including all major Indian scripts.
- Semantic Chunking: Preserves paragraph boundaries with overlapping context windows instead of naive fixed-size chunking.
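A minimal sketch of the semantic chunking step, splitting on paragraph boundaries and carrying a configurable number of trailing paragraphs into the next chunk as overlap. The function name and parameters are illustrative, not IndicRAG's actual API:

```python
def semantic_chunk(text, max_chars=800, overlap=1):
    """Split text on paragraph boundaries, carrying `overlap` trailing
    paragraphs into the next chunk as an overlapping context window."""
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        # Flush the current chunk once adding this paragraph would exceed the budget
        if current and sum(len(p) for p in current) + len(para) > max_chars:
            chunks.append('\n\n'.join(current))
            current = current[-overlap:]  # keep trailing paragraphs as overlap
        current.append(para)
    if current:
        chunks.append('\n\n'.join(current))
    return chunks
```

Unlike fixed-size splitting, a question about the end of one paragraph can still be answered from the chunk that also contains the following paragraph.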
```python
from PIL import Image
import numpy as np
from pytesseract import image_to_string
from layoutparser import Detectron2LayoutModel

def extract_text_from_image(image_path, lang='hin+eng'):
    # Load the image and detect its layout
    image = np.array(Image.open(image_path))
    model = Detectron2LayoutModel(
        'lp://PubLayNet/mask_rcnn_X_101_32x8d_FPN_3x',
        label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}
    )
    layout = model.detect(image)
    # Extract text regions top-to-bottom, approximating reading order
    text_blocks = []
    for block in sorted(layout, key=lambda b: b.coordinates[1]):
        if block.type == 'Text':
            region = block.crop_image(image)  # crop the detected text region
            text = image_to_string(region, lang=lang)
            text_blocks.append(text)
    return '\n\n'.join(text_blocks)
```

2. Stage 1: Dense Retrieval with LaBSE
IndicRAG utilizes LaBSE (Language-agnostic BERT Sentence Embeddings). Unlike standard mBERT, LaBSE is specifically trained on parallel corpora across 109 languages, ensuring that a query in Telugu can effectively retrieve relevant context even if the source document is in English or Hindi.
```python
from sentence_transformers import SentenceTransformer

# Load the multilingual embedding model
embedder = SentenceTransformer('sentence-transformers/LaBSE')

def create_embeddings(chunks, lang='hindi'):
    # Prefix each chunk with a language tag
    tagged_texts = [f"[{lang}] {chunk}" for chunk in chunks]
    embeddings = embedder.encode(
        tagged_texts,
        normalize_embeddings=True  # unit vectors, so dot product = cosine similarity
    )
    return embeddings
```

3. Stage 2: Cross-Encoder Reranking
Dense retrieval is fast but can miss nuanced matches. A cross-encoder reranker scores each top candidate jointly with the query, surfacing the most contextually relevant passages before final generation.
```python
import numpy as np
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query, retrieved_chunks, top_k=3):
    # Score each (query, chunk) pair jointly with the cross-encoder
    pairs = [[query, chunk['chunk']] for chunk in retrieved_chunks]
    scores = reranker.predict(pairs)
    # Sort by score (descending) and return the top-k chunks
    ranked_indices = np.argsort(scores)[::-1][:top_k]
    return [retrieved_chunks[idx] for idx in ranked_indices]
```

4. Contextual Generation with mT5
For the generation phase, the system leverages mT5 (Multilingual T5). Pre-trained on the mC4 corpus across 101 languages, mT5 is exceptionally robust at generating natural-sounding responses in languages like Marathi, Tamil, and Hindi.
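A sketch of what this stage could look like with the Hugging Face `transformers` library. The helper names are hypothetical, and note that the base `google/mt5-base` checkpoint is pretrained only on span corruption, so in practice a QA fine-tuned mT5 checkpoint is assumed here:

```python
def build_prompt(question, chunks):
    # Concatenate the reranked chunks into a generation context
    context = "\n".join(c['chunk'] for c in chunks)
    return f"Question: {question}\nContext: {context}\nAnswer:"

def generate_answer(question, chunks, model_name="google/mt5-base", max_length=150):
    # Imported lazily so the prompt helper stays dependency-free
    from transformers import AutoTokenizer, MT5ForConditionalGeneration
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = MT5ForConditionalGeneration.from_pretrained(model_name)
    inputs = tokenizer(build_prompt(question, chunks),
                       return_tensors="pt", truncation=True, max_length=512)
    outputs = model.generate(**inputs, max_length=max_length, num_beams=4)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

In production the tokenizer and model would be loaded once at startup rather than per call.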
Performance & Optimization
IndicRAG is optimized for production environments, focusing on both accuracy and latency:
- INT8 Quantization: Reduced model size from 2.4GB to 600MB, resulting in 2.3x faster inference on CPU.
- FastAPI Service: Asynchronous API with Redis caching and Prometheus monitoring.
- FAISS Optimization: GPU-accelerated search enables sub-200ms retrieval times.
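As a rough illustration of the INT8 step, PyTorch dynamic quantization stores `nn.Linear` weights as int8, which typically shrinks linear-heavy transformer models by roughly 4x. The two-layer network below is a stand-in, not one of IndicRAG's actual models:

```python
import io
import torch
import torch.nn as nn

def state_dict_bytes(model):
    # Serialize the state dict in memory to measure its on-disk size
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    return buf.getbuffer().nbytes

# Stand-in for a transformer's linear-heavy layers
model = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768))

# Dynamic quantization: weights stored as int8, activations quantized on the fly
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

out = quantized(torch.randn(1, 768))  # inference works as before, on CPU
print(state_dict_bytes(model) // 1024, "KB ->", state_dict_bytes(quantized) // 1024, "KB")
```

Dynamic quantization needs no calibration data, which makes it the lowest-effort route to the CPU speedups reported above.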
Production API Design
IndicRAG is deployed as a production-ready FastAPI service with async request handling and caching:
FastAPI Implementation
```python
import asyncio
import json
import time

import redis
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="IndicRAG API", version="1.0.0")

# Redis for caching
redis_client = redis.Redis(host='localhost', port=6379, decode_responses=True)

class QueryRequest(BaseModel):
    question: str
    language: str = "hi"  # Default: Hindi
    top_k: int = 5
    use_reranking: bool = True

class QueryResponse(BaseModel):
    answer: str
    source_chunks: list
    latency_ms: float
    language: str

@app.post("/query", response_model=QueryResponse)
async def query_documents(request: QueryRequest):
    """Process a multilingual document query."""
    start = time.time()
    # Check the cache first
    cache_key = f"query:{request.question}:{request.language}"
    cached = redis_client.get(cache_key)
    if cached:
        result = json.loads(cached)
        result['latency_ms'] = (time.time() - start) * 1000
        return result
    # Retrieve relevant chunks
    retrieved = await retrieve_chunks(
        request.question,
        request.language,
        request.top_k
    )
    # Optional reranking
    if request.use_reranking:
        retrieved = rerank_results(request.question, retrieved, top_k=3)
    # Generate the answer
    answer = await generate_answer(request.question, retrieved, request.language)
    result = {
        "answer": answer,
        "source_chunks": [
            {"text": chunk['chunk'], "score": float(chunk['score'])}
            for chunk in retrieved
        ],
        "language": request.language
    }
    # Cache for 1 hour
    redis_client.setex(cache_key, 3600, json.dumps(result))
    result['latency_ms'] = (time.time() - start) * 1000
    return result

# faiss_index and mt5_model are initialized at application startup (not shown)

async def retrieve_chunks(question, language, top_k):
    """Run the blocking FAISS search in a thread pool."""
    loop = asyncio.get_event_loop()
    return await loop.run_in_executor(
        None,
        lambda: faiss_index.search(question, language, top_k)
    )

async def generate_answer(question, chunks, language):
    """Run the blocking mT5 generation in a thread pool."""
    context = "\n".join(c['chunk'] for c in chunks)
    prompt = f"Question: {question}\nContext: {context}\nAnswer:"
    loop = asyncio.get_event_loop()
    return await loop.run_in_executor(
        None,
        lambda: mt5_model.generate(prompt, max_length=150)
    )
```

Redis Caching Strategy
Caching frequently asked questions significantly reduces latency and GPU load:
```python
import hashlib
import json

class CacheManager:
    def __init__(self, redis_client):
        self.redis = redis_client
        self.cache_ttl = 3600  # 1 hour

    def get_cache_key(self, question, language, filters=None):
        """Generate a deterministic cache key."""
        key_data = {
            "q": question.lower().strip(),
            "lang": language,
            "filters": filters or {}
        }
        key_str = json.dumps(key_data, sort_keys=True)
        return f"query:{hashlib.sha256(key_str.encode()).hexdigest()[:16]}"

    def get_cached_result(self, cache_key):
        """Retrieve a cached result, tracking hit counts."""
        cached = self.redis.get(cache_key)
        if cached:
            self.redis.incr(f"{cache_key}:hits")  # Track hit count
            return json.loads(cached)
        return None

    def cache_result(self, cache_key, result):
        """Store a result with TTL."""
        self.redis.setex(cache_key, self.cache_ttl, json.dumps(result))

    def get_cache_stats(self):
        """Get cache performance metrics."""
        # SCAN instead of KEYS so we don't block Redis on large keyspaces
        total_keys = sum(1 for _ in self.redis.scan_iter("query:*"))
        return {
            "total_cached_queries": total_keys,
            "cache_size_mb": self.redis.info()['used_memory'] / (1024 ** 2)
        }
```

Docker Deployment
Containerizing IndicRAG for scalable deployment:
```dockerfile
# Dockerfile
FROM python:3.10-slim
WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    tesseract-ocr \
    tesseract-ocr-hin tesseract-ocr-tam tesseract-ocr-tel \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY src/ ./src/
COPY models/ ./models/

# Download models (cached in this layer)
RUN python -c "from sentence_transformers import SentenceTransformer; \
    SentenceTransformer('sentence-transformers/LaBSE')"

EXPOSE 8000
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
```

Docker Compose Orchestration
```yaml
version: '3.8'

services:
  indicrag-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - REDIS_URL=redis://redis:6379
      - FAISS_INDEX_PATH=/data/faiss_index
      - MODEL_DEVICE=cuda  # or cpu
    volumes:
      - ./data:/data
      - ./models:/app/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    depends_on:
      - redis
      - prometheus
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana
    depends_on:
      - prometheus

volumes:
  redis_data:
  prometheus_data:
  grafana_data:
```

Load Balancing and Scaling
For high-traffic production environments, deploy multiple API instances behind a load balancer:
```nginx
# nginx.conf for load balancing
upstream indicrag_backend {
    least_conn;  # Route to the instance with the fewest active connections
    server indicrag-api-1:8000 weight=1 max_fails=3 fail_timeout=30s;
    server indicrag-api-2:8000 weight=1 max_fails=3 fail_timeout=30s;
    server indicrag-api-3:8000 weight=1 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name api.indicrag.example.com;

    location / {
        proxy_pass http://indicrag_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # Timeouts for long-running queries
        proxy_connect_timeout 60s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;

        # Request buffering
        proxy_buffering on;
        proxy_buffer_size 4k;
        proxy_buffers 8 4k;
    }

    # Health check endpoint
    location /health {
        access_log off;
        proxy_pass http://indicrag_backend/health;
    }
}
```

Monitoring and Metrics
Comprehensive observability with Prometheus and custom metrics:
```python
import time

from fastapi import Response
from prometheus_client import Counter, Histogram, Gauge, generate_latest

# Metrics
query_counter = Counter('indicrag_queries_total', 'Total queries', ['language', 'status'])
query_latency = Histogram('indicrag_query_latency_seconds', 'Query latency', ['language'])
cache_hit_rate = Gauge('indicrag_cache_hit_rate', 'Cache hit rate')
active_connections = Gauge('indicrag_active_connections', 'Active connections')

@app.middleware("http")
async def metrics_middleware(request, call_next):
    """Track per-request metrics."""
    start = time.time()
    active_connections.inc()
    try:
        response = await call_next(request)
        # Record latency and status, labelled by language
        latency = time.time() - start
        lang = request.query_params.get('lang', 'unknown')
        query_latency.labels(language=lang).observe(latency)
        query_counter.labels(language=lang, status=str(response.status_code)).inc()
        return response
    finally:
        active_connections.dec()

@app.get("/metrics")
async def get_metrics():
    """Prometheus metrics endpoint."""
    return Response(generate_latest(), media_type="text/plain")
```

Performance Under Load
Load testing with 100 concurrent users shows robust performance:
- Horizontal Scaling: Near-linear scaling up to 5 API instances with Redis cache
- GPU Utilization: 85% average GPU utilization with batched inference
- Memory Footprint: 2.8GB per instance (quantized models + FAISS index)
- Cold Start Time: ~12 seconds (model loading + FAISS index mmap)
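One way to reproduce a 100-concurrent-user run against the `/query` endpoint is a small stdlib-only asyncio driver like the one below. It is illustrative only (a dedicated tool such as Locust works equally well, and the `fake_query` stub stands in for a real HTTP client call):

```python
import asyncio
import random
import time

async def load_test(make_request, concurrency=100, requests_per_user=10):
    """Drive `make_request` (an async callable) with `concurrency` concurrent
    workers and return latency percentiles in milliseconds."""
    latencies = []

    async def user():
        for _ in range(requests_per_user):
            start = time.perf_counter()
            await make_request()
            latencies.append((time.perf_counter() - start) * 1000)

    await asyncio.gather(*(user() for _ in range(concurrency)))
    latencies.sort()

    def pct(p):
        return latencies[int(p / 100 * (len(latencies) - 1))]

    return {"p50": pct(50), "p95": pct(95), "p99": pct(99)}

async def fake_query():
    # Stand-in for an HTTP POST to /query; replace with a real client call
    await asyncio.sleep(random.uniform(0.05, 0.2))

if __name__ == "__main__":
    print(asyncio.run(load_test(fake_query, concurrency=20, requests_per_user=5)))
```

Swapping `fake_query` for an `aiohttp` or `httpx` POST turns this into an end-to-end latency probe.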
Key Takeaways
Building production-grade multilingual NLP systems requires balancing model performance, inference speed, and engineering complexity. By combining script-aware OCR, cross-lingual embeddings, and quantized transformer models, IndicRAG bridges the digital divide for millions of non-English speakers.
👉 github.com/DNSdecoded/IndicRAG
Interested in the intersection of multilingual NLP and production ML systems? Let's connect!