Introduction
Building NLP systems that work across India's linguistic diversity is challenging. With 22 official languages and hundreds of dialects, creating a single question-answering system that serves all users requires careful architecture and multilingual expertise.
In this post, I'll walk through IndicRAG, a production-ready Retrieval-Augmented Generation (RAG) pipeline I built to handle document Q&A across 12+ Indian languages, including Hindi, Bengali, Tamil, Telugu, and more.
The Challenge: Multilingual Document Understanding
Traditional NLP systems face several obstacles in Indian language contexts:
- Script diversity: Devanagari, the scripts of the Dravidian languages (Tamil, Telugu, Kannada, Malayalam), and Perso-Arabic each require different processing
- Limited training data: Most models are English-centric
- Code-mixing: Users often switch between languages mid-sentence
- Domain-specific vocabulary: Technical and legal terms vary by language
- OCR challenges: Scanned documents have varied quality
RAG Architecture Overview
Retrieval-Augmented Generation combines an information-retrieval step with a generative model, so answers stay grounded in the source documents. The pipeline runs in five stages (sketched in code after this list):
- Document Ingestion: Process and chunk documents with context preservation
- Vector Encoding: Convert chunks to embeddings using multilingual models
- Semantic Search: Retrieve relevant chunks using FAISS vector similarity
- Reranking: Cross-encoder models improve retrieval precision
- Generation: Contextualized answer generation with mT5
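A minimal sketch of how these stages compose end to end. The helper names here (retrieve, rerank, generate_answer) stand in for the components described in the rest of this post; retrieve in particular is a placeholder for the FAISS retrieval code shown later.

def answer_question(query: str, top_k: int = 50, final_k: int = 5) -> str:
    candidates = retrieve(query, k=top_k)           # Stage 3: FAISS dense retrieval
    ranking = rerank(query, candidates)             # Stage 4: cross-encoder, best first
    best = [candidates[i] for i in ranking[:final_k]]
    context = "\n\n".join(best)                     # chunks assumed to be plain strings
    return generate_answer(query, context)          # Stage 5: mT5 generation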
Component 1: Multilingual Document Processing
OCR Integration for Scanned PDFs
Many Indian language documents exist only as scanned images, so I built a two-stage OCR pipeline:
- Tesseract: Lightweight OCR engine supporting 100+ languages, including the major Indian scripts
- LayoutParser: Deep learning-based layout detection that segments complex pages into blocks before recognition
import numpy as np
from PIL import Image
from pytesseract import image_to_string
import layoutparser as lp

# Layout detection model (Detectron2 backend; PubLayNet weights shown as an example config)
layout_model = lp.Detectron2LayoutModel('lp://PubLayNet/faster_rcnn_R_50_FPN_3x/config')

# Multi-stage OCR pipeline: detect layout blocks, then run Tesseract on each block
def extract_text_multilingual(image_path, lang='hin+eng'):
    image = np.asarray(Image.open(image_path).convert('RGB'))
    layout = layout_model.detect(image)
    text_blocks = []
    for block in layout:
        text = image_to_string(
            block.crop_image(image),
            lang=lang,
            config='--psm 6'  # assume a uniform block of text
        )
        text_blocks.append(text)
    return '\n'.join(text_blocks)
Semantic Chunking
Instead of naive fixed-size chunking, I implemented semantic boundary detection (a simplified sketch follows this list):
- Preserve paragraph and section boundaries
- Keep tables and lists intact
- Add overlapping context windows (256 tokens overlap)
- Metadata tagging for document structure
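A simplified version of the chunker: split on paragraph boundaries and carry an overlap window forward. Token counts are approximated by whitespace words in this sketch; the production pipeline also handles tables, lists, and structural metadata.

def semantic_chunks(text, max_tokens=512, overlap=256):
    # Split on blank lines so paragraph boundaries are preserved
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        current.extend(para.split())
        if len(current) >= max_tokens:
            chunks.append(" ".join(current))
            current = current[-overlap:]   # overlapping context window
    if current:
        chunks.append(" ".join(current))
    return chunks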
Component 2: Multilingual Dense Embeddings
The key to cross-lingual retrieval is using models trained on parallel multilingual data:
Model Selection: LaBSE vs mBERT
I evaluated multiple embedding models:
- LaBSE (Language-agnostic BERT Sentence Embeddings): Trained on 109 languages with translation pairs, excellent for semantic similarity across languages
- mBERT (Multilingual BERT): Good general-purpose embeddings but weaker cross-lingual alignment
- XLM-RoBERTa: Strong performance but larger model size
Winner: LaBSE for its superior cross-lingual retrieval performance despite slightly higher inference cost.
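For reference, encoding chunks with LaBSE through the sentence-transformers library looks roughly like this. The batch size is illustrative, and chunk_texts stands for the list of chunk strings from the previous step; embeddings are normalized later, just before indexing.

from sentence_transformers import SentenceTransformer

# LaBSE produces 768-dimensional sentence embeddings
model = SentenceTransformer("sentence-transformers/LaBSE")

chunk_embeddings = model.encode(
    chunk_texts,            # list of chunk strings, any supported language
    batch_size=32,
    convert_to_numpy=True   # float32 array, ready for FAISS
)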
Vector Storage with FAISS
For efficient similarity search over millions of document chunks, I used FAISS (Facebook AI Similarity Search):
import faiss
import numpy as np
# Create FAISS index with inner product similarity
dimension = 768 # LaBSE embedding size
index = faiss.IndexFlatIP(dimension)
# Normalize embeddings for cosine similarity
faiss.normalize_L2(embeddings)
index.add(embeddings)
# Search with GPU acceleration (optional)
gpu_index = faiss.index_cpu_to_gpu(
faiss.StandardGpuResources(), 0, index
)
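Querying follows the same pattern: encode and normalize the question, then take the top inner products. Here chunk_texts again stands for the indexed chunk strings.

# Retrieve the 50 most similar chunks for a query
query_emb = model.encode([query], convert_to_numpy=True)
faiss.normalize_L2(query_emb)
scores, indices = index.search(query_emb, 50)
candidates = [chunk_texts[i] for i in indices[0]]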
Component 3: Cross-Encoder Reranking
Initial retrieval using dense embeddings can miss nuanced matches. A two-stage approach improves precision:
- Stage 1 (Fast): FAISS retrieves top 50 candidates (~10ms)
- Stage 2 (Accurate): Cross-encoder reranks to top 5 (~100ms)
Why Cross-Encoders?
Unlike bi-encoders (which encode query and document separately), cross-encoders process them jointly, capturing fine-grained interaction signals:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "cross-encoder/mmarco-mMiniLMv2-L12-H384-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name)
reranker = AutoModelForSequenceClassification.from_pretrained(model_name)

def rerank(query, candidates):
    # Score each (query, candidate) pair jointly with the cross-encoder
    pairs = [[query, doc] for doc in candidates]
    inputs = tokenizer(pairs, padding=True, truncation=True,
                       return_tensors='pt', max_length=512)
    with torch.no_grad():
        scores = reranker(**inputs).logits.squeeze(-1)
    return scores.argsort(descending=True)  # candidate indices, most relevant first
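Applied to the FAISS shortlist from the previous section, the reranker's output indexes back into candidates (a sketch):

order = rerank(query, candidates)                        # indices, most relevant first
top_chunks = [candidates[i] for i in order[:5].tolist()]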
Component 4: Contextual Answer Generation
With the most relevant chunks retrieved, I use sequence-to-sequence models for answer generation:
Model: mT5 (Multilingual Text-to-Text Transfer Transformer)
- Covers 101 languages including all major Indian languages
- Trained on mC4 corpus (multilingual Common Crawl)
- Supports abstractive and extractive QA
from transformers import MT5ForConditionalGeneration, MT5Tokenizer
model = MT5ForConditionalGeneration.from_pretrained("google/mt5-large")
tokenizer = MT5Tokenizer.from_pretrained("google/mt5-large")
def generate_answer(query, context):
    input_text = f"question: {query} context: {context}"
    inputs = tokenizer(input_text, return_tensors="pt",
                       max_length=1024, truncation=True)
    outputs = model.generate(
        **inputs,
        max_length=256,
        num_beams=4,
        early_stopping=True,
        no_repeat_ngram_size=3
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)
Deployment: FastAPI Service
For production deployment, I built a REST API with FastAPI:
Key Features
- Async processing: Non-blocking I/O for concurrent requests
- Response streaming: Progressive answer generation
- Caching: Redis for frequent queries
- Rate limiting: Protect against abuse
- Monitoring: Prometheus metrics and logging
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="IndicRAG API")

class Query(BaseModel):
    question: str
    language: str
    top_k: int = 5

@app.post("/ask")
async def ask_question(query: Query):
    # Retrieve relevant chunks
    chunks = await retrieve_chunks(
        query.question,
        query.language,
        top_k=query.top_k
    )
    # Rerank
    reranked = await rerank_chunks(query.question, chunks)
    # Generate answer
    answer = await generate_answer(
        query.question,
        reranked[:3]
    )
    return {
        "answer": answer,
        "sources": [chunk.metadata for chunk in reranked[:3]],
        "confidence": calculate_confidence(reranked)
    }
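The Redis caching layer from the feature list can be sketched as a thin wrapper around the endpoint logic: hash the normalized question plus language into a key and store the serialized response with a TTL. The key scheme, TTL, and helper names here are illustrative, and the synchronous redis-py client is used for brevity.

import hashlib
import json
import redis

cache = redis.Redis(host="localhost", port=6379, db=0)
CACHE_TTL = 3600  # one hour; illustrative value

def cache_key(question: str, language: str) -> str:
    raw = f"{language}:{question.strip().lower()}"
    return "indicrag:" + hashlib.sha256(raw.encode("utf-8")).hexdigest()

async def ask_with_cache(query: Query):
    key = cache_key(query.question, query.language)
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    response = await ask_question(query)               # full retrieval + generation path
    cache.setex(key, CACHE_TTL, json.dumps(response))  # assumes response is JSON-serializable
    return response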
Performance Optimizations
1. Model Quantization
Reduced model size by 4x using INT8 quantization, with negligible accuracy loss (a sketch follows this list):
- FP32 mT5-large: 2.4 GB → INT8: 600 MB
- Inference speed: 2.3x faster on CPU
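One minimal way to get this kind of reduction is PyTorch dynamic quantization, which converts the Linear layers to INT8 weights while keeping activations in FP32. This is a sketch aimed at CPU inference; the exact export path used in production may differ.

import torch
from transformers import MT5ForConditionalGeneration

model = MT5ForConditionalGeneration.from_pretrained("google/mt5-large")

# Quantize the Linear layers to INT8 weights for CPU inference
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)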
2. Batch Processing
Process multiple queries together for better GPU utilization:
# Batch encoding with the SentenceTransformer (LaBSE) model from the embedding stage
embeddings = model.encode(
    sentences,
    batch_size=32,
    show_progress_bar=False,
    convert_to_numpy=True
)
3. Approximate Nearest Neighbors
For very large document collections (>10M chunks), use HNSW indexing:
# FAISS HNSW index for fast approximate search
index = faiss.IndexHNSWFlat(dimension, 32)   # 32 = graph connectivity (M)
index.hnsw.efConstruction = 200              # build-time accuracy/speed trade-off
index.hnsw.efSearch = 128                    # query-time accuracy/speed trade-off
Evaluation Metrics
I evaluated IndicRAG on multiple benchmarks:
- XQuAD (Hindi/Telugu): F1 Score 78.3%
- AI2 Reasoning Challenge (translated): 71.2% accuracy
- Custom domain QA: 82.5% exact match on legal documents
- Retrieval precision@5: 89.7%
Real-World Challenges & Solutions
Challenge 1: Code-Mixed Queries
Users often ask "What is property tax in मुंबई?" (mixing English and Hindi; मुंबई is "Mumbai" written in Devanagari)
Solution: Language detection with fallback to multi-script embedding models
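A lightweight first pass is to look at which Unicode scripts appear in the query before choosing a processing path; anything mixing scripts goes straight to the script-agnostic embedding route. This is a sketch, not the exact production detector:

import unicodedata

def scripts_in(text: str) -> set:
    # Approximate each letter's script from its Unicode character name,
    # e.g. "LATIN SMALL LETTER A" -> "LATIN", "DEVANAGARI LETTER MA" -> "DEVANAGARI"
    scripts = set()
    for ch in text:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split(" ")[0])
    return scripts

def is_code_mixed(text: str) -> bool:
    return len(scripts_in(text)) > 1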
Challenge 2: Domain Terminology
Legal and medical terms often lack good translations
Solution: Custom terminology dictionaries and domain-specific fine-tuning
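In its simplest form, the terminology layer is a per-language, per-domain lookup applied before embedding, so both the source term and a canonical gloss end up in the vector. The entries below are purely illustrative:

# Illustrative entries; real dictionaries are curated per domain and language
LEGAL_TERMS_HI = {
    "वसीयत": "will (testament)",
    "पट्टा": "lease deed",
}

def expand_terminology(text: str, term_map: dict) -> str:
    # Append the canonical form so both surface forms are embedded together
    for term, canonical in term_map.items():
        if term in text:
            text += f" ({canonical})"
    return text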
Challenge 3: Context Window Limitations
Long documents exceed model context limits (512-1024 tokens)
Solution: Hierarchical retrieval with document-level and chunk-level search
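One way to sketch the hierarchical scheme is with two FAISS indexes: a document-level index over pooled chunk embeddings to shortlist documents, then a chunk-level search filtered to that shortlist. The index and mapping names here are illustrative.

def hierarchical_search(query_emb, doc_index, chunk_index, chunk_doc_ids,
                        k_docs=10, k_chunks=50):
    # Level 1: shortlist documents by their pooled embeddings
    _, doc_ids = doc_index.search(query_emb, k_docs)
    allowed = set(doc_ids[0].tolist())

    # Level 2: over-retrieve chunks, then keep only those from shortlisted documents
    scores, idx = chunk_index.search(query_emb, k_chunks * 4)
    hits = [(score, i) for score, i in zip(scores[0], idx[0])
            if chunk_doc_ids[i] in allowed]
    return hits[:k_chunks]   # FAISS already returns results best-first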
Lessons Learned
- Language-specific preprocessing matters: Hindi text requires different tokenization than Tamil
- Evaluation is hard: Translation-based evaluation misses cultural context
- User feedback loops: Implicit feedback (click-through) beats explicit ratings
- Fallback strategies: Always have a plan when models fail (e.g., keyword search)
Open Source & GitHub
The complete IndicRAG codebase is available on GitHub:
github.com/DNSdecoded/IndicRAG
Includes:
- Document preprocessing pipeline
- Multi-stage retrieval implementation
- FastAPI deployment code
- Docker containerization
- Evaluation scripts and benchmarks
Conclusion
Building production-grade multilingual NLP systems requires balancing model performance, inference speed, and engineering complexity. IndicRAG demonstrates that with careful architecture and optimization, it's possible to serve accurate, fast question-answering across diverse Indian languages.
The key takeaways:
- Use strong multilingual models: LaBSE (trained on parallel data) for retrieval, mT5 for generation
- Implement two-stage retrieval (fast dense + accurate reranking)
- Optimize for production (quantization, batching, caching)
- Plan for edge cases (code-mixing, domain terms, long contexts)
Questions about multilingual RAG or want to collaborate? Get in touch!