Introduction
In a country as linguistically diverse as India, with 22 officially recognized languages and hundreds of dialects, building AI systems that can understand and process multilingual content is not just a technical challenge but a necessity. IndicRAG is a production-ready Retrieval-Augmented Generation (RAG) system designed specifically for Indian languages.
This project emerged from a simple observation: while RAG systems have revolutionized document-based question answering in English, most existing solutions fall short when dealing with Indic languages. IndicRAG addresses this gap by combining state-of-the-art multilingual embeddings, cross-encoder reranking, and OCR capabilities into a single, scalable pipeline.
The Challenge: Why Multilingual RAG is Hard
Building NLP systems that work across India's linguistic diversity is challenging. Traditional English-centric RAG systems face several obstacles:
- Script Diversity: Devanagari, the Dravidian-family scripts (Tamil, Telugu, Kannada, Malayalam), and Perso-Arabic script each require specialized processing and tokenization.
- Limited Training Data: Most open-source embedding models are English-centric and weak in cross-lingual alignment.
- Code-Mixing: Users often switch between languages mid-sentence, such as mixing English and Hindi.
- OCR Quality: Scanned administrative and legal documents often have varied quality and complex layouts.
- Semantic Nuances: Direct translation often fails to capture semantic meaning, requiring language-aware embeddings.
The IndicRAG Architecture
To solve these challenges, IndicRAG utilizes a multi-stage retrieval architecture that balances speed with semantic nuance. The system consists of several integrated components:
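Stripped of the model specifics, the stages compose roughly like this. All function names here are illustrative stand-ins, not IndicRAG's actual API; the point is the control flow, with each stage passed in as a callable:

```python
def answer_query(question, lang, embed, search, rerank, generate, top_k=5):
    """Minimal orchestration of the multi-stage pipeline (illustrative names)."""
    query_vec = embed(question, lang)          # Stage 1: dense query embedding
    candidates = search(query_vec, top_k)      # Stage 1: vector retrieval
    best = rerank(question, candidates, top_k=3)  # Stage 2: cross-encoder rerank
    return generate(question, best, lang)      # Stage 3: contextual generation
```

Each component below plugs into one of these slots.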
1. Document Ingestion & OCR
Many Indian language documents are scanned images. IndicRAG integrates a dual-stage OCR engine:
- LayoutParser: Deep learning-based layout detection to identify complex structures like tables and headers.
- Tesseract: Lightweight OCR engine supporting 100+ languages, including all major Indian scripts.
- Semantic Chunking: Preserves paragraph boundaries with overlapping context windows instead of naive fixed-size chunking.
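A minimal sketch of the semantic chunking step, splitting on paragraph boundaries and carrying a configurable number of trailing paragraphs into the next chunk as overlap. The function name and parameters are illustrative, not IndicRAG's actual API:

```python
def semantic_chunk(text, max_chars=800, overlap=1):
    """Split text on paragraph boundaries, carrying `overlap` trailing
    paragraphs into the next chunk as an overlapping context window."""
    paragraphs = [p.strip() for p in text.split('\n\n') if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        # Flush the current chunk once adding this paragraph would exceed the budget
        if current and sum(len(p) for p in current) + len(para) > max_chars:
            chunks.append('\n\n'.join(current))
            current = current[-overlap:]  # keep trailing paragraphs as overlap
        current.append(para)
    if current:
        chunks.append('\n\n'.join(current))
    return chunks
```

Unlike fixed-size splitting, a question about the end of one paragraph can still be answered from the chunk that also contains the following paragraph.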
```python
from PIL import Image
import numpy as np
from pytesseract import image_to_string
from layoutparser import Detectron2LayoutModel

def extract_text_from_image(image_path, lang='hin+eng'):
    # Load the image and detect its layout
    image = np.array(Image.open(image_path))
    model = Detectron2LayoutModel(
        'lp://PubLayNet/mask_rcnn_X_101_32x8d_FPN_3x',
        label_map={0: "Text", 1: "Title", 2: "List", 3: "Table", 4: "Figure"}
    )
    layout = model.detect(image)
    # Extract text regions top-to-bottom, approximating reading order
    text_blocks = []
    for block in sorted(layout, key=lambda b: b.coordinates[1]):
        if block.type == 'Text':
            region = block.crop_image(image)  # crop the detected text region
            text = image_to_string(region, lang=lang)
            text_blocks.append(text)
    return '\n\n'.join(text_blocks)
```

2. Stage 1: Dense Retrieval with LaBSE
IndicRAG utilizes LaBSE (Language-agnostic BERT Sentence Embeddings). Unlike standard mBERT, LaBSE is specifically trained on parallel corpora across 109 languages, ensuring that a query in Telugu can effectively retrieve relevant context even if the source document is in English or Hindi.
```python
from sentence_transformers import SentenceTransformer

# Load the multilingual embedding model
embedder = SentenceTransformer('sentence-transformers/LaBSE')

def create_embeddings(chunks, lang='hindi'):
    # Prefix each chunk with a language tag
    tagged_texts = [f"[{lang}] {chunk}" for chunk in chunks]
    embeddings = embedder.encode(
        tagged_texts,
        normalize_embeddings=True  # unit vectors, so dot product = cosine similarity
    )
    return embeddings
```

3. Stage 2: Cross-Encoder Reranking
Dense retrieval is fast but can miss nuanced matches. A cross-encoder reranker scores each top candidate jointly with the query, surfacing the most contextually relevant passages before final generation.
```python
import numpy as np
from sentence_transformers import CrossEncoder

reranker = CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')

def rerank_results(query, retrieved_chunks, top_k=3):
    # Score each (query, chunk) pair jointly with the cross-encoder
    pairs = [[query, chunk['chunk']] for chunk in retrieved_chunks]
    scores = reranker.predict(pairs)
    # Sort by score (descending) and return the top-k chunks
    ranked_indices = np.argsort(scores)[::-1][:top_k]
    return [retrieved_chunks[idx] for idx in ranked_indices]
```

4. Contextual Generation with mT5
For the generation phase, the system leverages mT5 (Multilingual T5). Pre-trained on the mC4 corpus across 101 languages, mT5 is exceptionally robust at generating natural-sounding responses in languages like Marathi, Tamil, and Hindi.
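A sketch of what this stage could look like with the Hugging Face `transformers` library. The helper names are hypothetical, and note that the base `google/mt5-base` checkpoint is pretrained only on span corruption, so in practice a QA fine-tuned mT5 checkpoint is assumed here:

```python
def build_prompt(question, chunks):
    # Concatenate the reranked chunks into a generation context
    context = "\n".join(c['chunk'] for c in chunks)
    return f"Question: {question}\nContext: {context}\nAnswer:"

def generate_answer(question, chunks, model_name="google/mt5-base", max_length=150):
    # Imported lazily so the prompt helper stays dependency-free
    from transformers import AutoTokenizer, MT5ForConditionalGeneration
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = MT5ForConditionalGeneration.from_pretrained(model_name)
    inputs = tokenizer(build_prompt(question, chunks),
                       return_tensors="pt", truncation=True, max_length=512)
    outputs = model.generate(**inputs, max_length=max_length, num_beams=4)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
```

In production the tokenizer and model would be loaded once at startup rather than per call.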
Performance & Optimization
IndicRAG is optimized for production environments, focusing on both accuracy and latency:
- INT8 Quantization: Reduced model size from 2.4GB to 600MB, resulting in 2.3x faster inference on CPU.
- FastAPI Service: Asynchronous API with Redis caching and Prometheus monitoring.
- FAISS Optimization: GPU-accelerated search enables sub-200ms retrieval times.
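As a rough illustration of the INT8 step, PyTorch dynamic quantization stores `nn.Linear` weights as int8, which typically shrinks linear-heavy transformer models by roughly 4x. The two-layer network below is a stand-in, not one of IndicRAG's actual models:

```python
import io
import torch
import torch.nn as nn

def state_dict_bytes(model):
    # Serialize the state dict in memory to measure its on-disk size
    buf = io.BytesIO()
    torch.save(model.state_dict(), buf)
    return buf.getbuffer().nbytes

# Stand-in for a transformer's linear-heavy layers
model = nn.Sequential(nn.Linear(768, 3072), nn.ReLU(), nn.Linear(3072, 768))

# Dynamic quantization: weights stored as int8, activations quantized on the fly
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

out = quantized(torch.randn(1, 768))  # inference works as before, on CPU
print(state_dict_bytes(model) // 1024, "KB ->", state_dict_bytes(quantized) // 1024, "KB")
```

Dynamic quantization needs no calibration data, which makes it the lowest-effort route to the CPU speedups reported above.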
Production API Design
IndicRAG is deployed as a production-ready FastAPI service with async request handling and caching:
FastAPI Implementation
```python
import asyncio
import json
import time

import redis
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="IndicRAG API", version="1.0.0")

# Redis for caching
redis_client = redis.Redis(host='localhost', port=6379, decode_responses=True)

class QueryRequest(BaseModel):
    question: str
    language: str = "hi"  # Default: Hindi
    top_k: int = 5
    use_reranking: bool = True

class QueryResponse(BaseModel):
    answer: str
    source_chunks: list
    latency_ms: float
    language: str

@app.post("/query", response_model=QueryResponse)
async def query_documents(request: QueryRequest):
    """Process a multilingual document query."""
    start = time.time()
    # Check the cache first
    cache_key = f"query:{request.question}:{request.language}"
    cached = redis_client.get(cache_key)
    if cached:
        result = json.loads(cached)
        result['latency_ms'] = (time.time() - start) * 1000
        return result
    # Retrieve relevant chunks
    retrieved = await retrieve_chunks(
        request.question,
        request.language,
        request.top_k
    )
    # Optional reranking
    if request.use_reranking:
        retrieved = rerank_results(request.question, retrieved, top_k=3)
    # Generate the answer
    answer = await generate_answer(request.question, retrieved, request.language)
    result = {
        "answer": answer,
        "source_chunks": [
            {"text": chunk['chunk'], "score": float(chunk['score'])}
            for chunk in retrieved
        ],
        "language": request.language
    }
    # Cache for 1 hour
    redis_client.setex(cache_key, 3600, json.dumps(result))
    result['latency_ms'] = (time.time() - start) * 1000
    return result

# faiss_index and mt5_model are initialized at application startup (not shown)

async def retrieve_chunks(question, language, top_k):
    """Run the blocking FAISS search in a thread pool."""
    loop = asyncio.get_event_loop()
    return await loop.run_in_executor(
        None,
        lambda: faiss_index.search(question, language, top_k)
    )

async def generate_answer(question, chunks, language):
    """Run the blocking mT5 generation in a thread pool."""
    context = "\n".join(c['chunk'] for c in chunks)
    prompt = f"Question: {question}\nContext: {context}\nAnswer:"
    loop = asyncio.get_event_loop()
    return await loop.run_in_executor(
        None,
        lambda: mt5_model.generate(prompt, max_length=150)
    )
```

Redis Caching Strategy
Caching frequently asked questions significantly reduces latency and GPU load:
```python
import hashlib
import json

class CacheManager:
    def __init__(self, redis_client):
        self.redis = redis_client
        self.cache_ttl = 3600  # 1 hour

    def get_cache_key(self, question, language, filters=None):
        """Generate a deterministic cache key."""
        key_data = {
            "q": question.lower().strip(),
            "lang": language,
            "filters": filters or {}
        }
        key_str = json.dumps(key_data, sort_keys=True)
        return f"query:{hashlib.sha256(key_str.encode()).hexdigest()[:16]}"

    def get_cached_result(self, cache_key):
        """Retrieve a cached result, tracking hit counts."""
        cached = self.redis.get(cache_key)
        if cached:
            self.redis.incr(f"{cache_key}:hits")  # Track hit count
            return json.loads(cached)
        return None

    def cache_result(self, cache_key, result):
        """Store a result with TTL."""
        self.redis.setex(cache_key, self.cache_ttl, json.dumps(result))

    def get_cache_stats(self):
        """Get cache performance metrics."""
        # SCAN instead of KEYS so we don't block Redis on large keyspaces
        total_keys = sum(1 for _ in self.redis.scan_iter("query:*"))
        return {
            "total_cached_queries": total_keys,
            "cache_size_mb": self.redis.info()['used_memory'] / (1024 ** 2)
        }
```

Docker Deployment
Containerizing IndicRAG for scalable deployment:
```dockerfile
# Dockerfile
FROM python:3.10-slim
WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    tesseract-ocr \
    tesseract-ocr-hin tesseract-ocr-tam tesseract-ocr-tel \
    && rm -rf /var/lib/apt/lists/*

# Install Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY src/ ./src/
COPY models/ ./models/

# Download models (cached in this layer)
RUN python -c "from sentence_transformers import SentenceTransformer; \
    SentenceTransformer('sentence-transformers/LaBSE')"

EXPOSE 8000
CMD ["uvicorn", "src.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
```

Docker Compose Orchestration
```yaml
version: '3.8'

services:
  indicrag-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - REDIS_URL=redis://redis:6379
      - FAISS_INDEX_PATH=/data/faiss_index
      - MODEL_DEVICE=cuda  # or cpu
    volumes:
      - ./data:/data
      - ./models:/app/models
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    depends_on:
      - redis
      - prometheus
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data
    command: redis-server --maxmemory 2gb --maxmemory-policy allkeys-lru

  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus

  grafana:
    image: grafana/grafana:latest
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
    volumes:
      - grafana_data:/var/lib/grafana
    depends_on:
      - prometheus

volumes:
  redis_data:
  prometheus_data:
  grafana_data:
```

Load Balancing and Scaling
For high-traffic production environments, deploy multiple API instances behind a load balancer:
```nginx
# nginx.conf for load balancing
upstream indicrag_backend {
    least_conn;  # Route to the instance with the fewest active connections
    server indicrag-api-1:8000 weight=1 max_fails=3 fail_timeout=30s;
    server indicrag-api-2:8000 weight=1 max_fails=3 fail_timeout=30s;
    server indicrag-api-3:8000 weight=1 max_fails=3 fail_timeout=30s;
}

server {
    listen 80;
    server_name api.indicrag.example.com;

    location / {
        proxy_pass http://indicrag_backend;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;

        # Timeouts for long-running queries
        proxy_connect_timeout 60s;
        proxy_send_timeout 60s;
        proxy_read_timeout 60s;

        # Request buffering
        proxy_buffering on;
        proxy_buffer_size 4k;
        proxy_buffers 8 4k;
    }

    # Health check endpoint
    location /health {
        access_log off;
        proxy_pass http://indicrag_backend/health;
    }
}
```

Monitoring and Metrics
Comprehensive observability with Prometheus and custom metrics:
```python
import time

from fastapi import Response
from prometheus_client import Counter, Histogram, Gauge, generate_latest

# Metrics
query_counter = Counter('indicrag_queries_total', 'Total queries', ['language', 'status'])
query_latency = Histogram('indicrag_query_latency_seconds', 'Query latency', ['language'])
cache_hit_rate = Gauge('indicrag_cache_hit_rate', 'Cache hit rate')
active_connections = Gauge('indicrag_active_connections', 'Active connections')

@app.middleware("http")
async def metrics_middleware(request, call_next):
    """Track per-request metrics."""
    start = time.time()
    active_connections.inc()
    try:
        response = await call_next(request)
        # Record latency and status, labelled by language
        latency = time.time() - start
        lang = request.query_params.get('lang', 'unknown')
        query_latency.labels(language=lang).observe(latency)
        query_counter.labels(language=lang, status=str(response.status_code)).inc()
        return response
    finally:
        active_connections.dec()

@app.get("/metrics")
async def get_metrics():
    """Prometheus metrics endpoint."""
    return Response(generate_latest(), media_type="text/plain")
```

Performance Under Load
Load testing with 100 concurrent users shows robust performance:
- Horizontal Scaling: Near-linear scaling up to 5 API instances with Redis cache
- GPU Utilization: 85% average GPU utilization with batched inference
- Memory Footprint: 2.8GB per instance (quantized models + FAISS index)
- Cold Start Time: ~12 seconds (model loading + FAISS index mmap)
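One way to reproduce a 100-concurrent-user run against the `/query` endpoint is a small stdlib-only asyncio driver like the one below. It is illustrative only (a dedicated tool such as Locust works equally well, and the `fake_query` stub stands in for a real HTTP client call):

```python
import asyncio
import random
import time

async def load_test(make_request, concurrency=100, requests_per_user=10):
    """Drive `make_request` (an async callable) with `concurrency` concurrent
    workers and return latency percentiles in milliseconds."""
    latencies = []

    async def user():
        for _ in range(requests_per_user):
            start = time.perf_counter()
            await make_request()
            latencies.append((time.perf_counter() - start) * 1000)

    await asyncio.gather(*(user() for _ in range(concurrency)))
    latencies.sort()

    def pct(p):
        return latencies[int(p / 100 * (len(latencies) - 1))]

    return {"p50": pct(50), "p95": pct(95), "p99": pct(99)}

async def fake_query():
    # Stand-in for an HTTP POST to /query; replace with a real client call
    await asyncio.sleep(random.uniform(0.05, 0.2))

if __name__ == "__main__":
    print(asyncio.run(load_test(fake_query, concurrency=20, requests_per_user=5)))
```

Swapping `fake_query` for an `aiohttp` or `httpx` POST turns this into an end-to-end latency probe.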
Key Takeaways
Building production-grade multilingual NLP systems requires balancing model performance, inference speed, and engineering complexity. By combining script-aware OCR, cross-lingual embeddings, and quantized transformer models, IndicRAG bridges the digital divide for millions of non-English speakers.
👉 github.com/DNSdecoded/IndicRAG
Interested in the intersection of multilingual NLP and production ML systems? Let's connect!