I've been working on structuring India's legal corpus for the past two years and wanted to share what I've built, and to hear from people working on legal NLP or low-resource Indian language models.
The dataset is 20M+ Indian court cases from the Supreme Court, all 25 High Courts, and 14 Tribunals. Each case has structured metadata (court, bench, date, parties, judges, sections cited, acts referenced, case type). There's a citation graph across the full corpus where I've classified relationships as followed, distinguished, overruled, or mentioned.
Every case is embedded with Voyage AI (1024-dimensional dense vectors) plus BM25 sparse vectors. I have also cross-referenced 23,122 Acts and Statutes with the cases that interpret them.
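To make the dense-plus-sparse setup concrete, here's a minimal sketch of one common way to combine two rankings (reciprocal rank fusion). The case IDs and hit lists are made up for illustration; this isn't the actual retrieval stack behind the dataset.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of case IDs into one ranking.

    rankings: list of lists, each ordered best-first.
    k: damping constant; 60 is the commonly used default.
    """
    scores = {}
    for ranked in rankings:
        for rank, case_id in enumerate(ranked, start=1):
            scores[case_id] = scores.get(case_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# hypothetical top results from a dense (embedding) query and a sparse (BM25) query
dense_hits = ["case_102", "case_007", "case_550"]
bm25_hits = ["case_102", "case_007", "case_999"]
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
```

Cases ranked highly by both retrievers float to the top without needing to calibrate dense cosine scores against BM25 scores.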
Some things that might be interesting to this community:
The citation network across 20M+ cases is, as far as I know, the first machine-readable one for Indian law.
It could be useful for graph neural network research, legal outcome prediction, or influence analysis on which judgments are most cited and which are being overruled.
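As a flavor of what influence analysis on the citation graph looks like, here's a tiny power-iteration PageRank over a toy edge list. The case IDs are invented; on the real graph you'd use a proper graph library, but the idea is the same: influence flows from citing cases to cited ones.

```python
def pagerank(edges, damping=0.85, iters=50):
    """Power-iteration PageRank over a citation edge list.

    edges: list of (citing, cited) pairs.
    """
    nodes = {n for e in edges for n in e}
    out = {n: [] for n in nodes}
    for citing, cited in edges:
        out[citing].append(cited)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for n in nodes:
            targets = out[n]
            if targets:
                share = damping * rank[n] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # dangling node: spread its rank uniformly
                for t in nodes:
                    new[t] += damping * rank[n] / len(nodes)
        rank = new
    return rank

# toy graph: three later cases all cite one landmark judgment
edges = [("case_B", "case_A"), ("case_C", "case_A"), ("case_D", "case_A")]
ranks = pagerank(edges)
most_influential = max(ranks, key=ranks.get)
```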
Most Indian language NLP corpora are conversational or news text. Legal text is a completely different register: formal, precise, domain-specific. The bilingual pairs from our translation service could be useful for fine-tuning Indian language models on formal and legal domains.
The metadata extraction pipeline identifies judges, advocates, parties, sections, acts, and dates from unstructured judgment text. It's built with a mix of regex, heuristics, and LLM-based extraction. The structured outputs could serve as training data for legal NER models.
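To illustrate the regex layer of such a pipeline, here's a hypothetical pattern for statute references like "Section 302 of the Indian Penal Code". This is a simplified sketch, not the actual pipeline code; real judgment text needs many more patterns plus the heuristic and LLM passes mentioned above.

```python
import re

# hypothetical pattern: "Section <num> of the <Act/Code name>[, <year>]"
SECTION_RE = re.compile(
    r"[Ss]ection\s+(\d+[A-Z]?)\s+of\s+the\s+"
    r"([A-Z][A-Za-z ]+?(?:Act|Code)(?:,\s*\d{4})?)"
)

def extract_sections(text):
    """Return (section_number, act_name) pairs found in judgment text."""
    return [(m.group(1), m.group(2)) for m in SECTION_RE.finditer(text)]

sample = ("The appellant was convicted under Section 302 of the Indian Penal Code "
          "and Section 25 of the Arms Act, 1959.")
hits = extract_sections(sample)
```

Pairs like these, anchored back to character offsets, are exactly the kind of silver-standard spans a legal NER model could be trained on.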
Indian court judgments are long: the median is around 3,000 words, and some exceed 50,000 words.
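That length matters in practice: 50,000-word judgments blow past most embedding-model context limits, so anyone embedding or indexing this corpus will need some chunking strategy. A minimal sketch of overlapping word-window chunking (window sizes are arbitrary here; tune them to your model's token limit):

```python
def chunk_words(text, size=512, overlap=64):
    """Split a long judgment into overlapping word-window chunks.

    size and overlap are in words, not tokens.
    """
    words = text.split()
    if len(words) <= size:
        return [text]
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return chunks

doc = ("word " * 1200).strip()  # stand-in for a ~1,200-word judgment
chunks = chunk_words(doc)
```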
If anyone is benchmarking retrieval-augmented generation on legal domains, this corpus plus the citation graph could work as an evaluation bed. Ground truth exists in the citation relationships: if Case A cites Case B, a good retriever should surface B when asked about the legal question in A.
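Using citations as ground truth, the evaluation boils down to something like recall@k over cited precedents. A small self-contained sketch with invented case IDs:

```python
def recall_at_k(retrieved, cited, k=10):
    """Fraction of a case's cited precedents found in the top-k retrieved IDs.

    retrieved: ranked list of case IDs from the retriever.
    cited: set of case IDs the query case actually cites (ground truth).
    """
    if not cited:
        return 0.0
    hits = cited & set(retrieved[:k])
    return len(hits) / len(cited)

# hypothetical: the query case cites B and C; the retriever returned this ranking
ground_truth = {"case_B", "case_C"}
ranking = ["case_B", "case_X", "case_C", "case_Y"]
score = recall_at_k(ranking, ground_truth, k=3)
```

One caveat worth handling in a real benchmark: citations are a noisy lower bound on relevance, since a retriever can surface a genuinely relevant case the judgment happened not to cite.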
Data is available via API and bulk export in JSON and Parquet. Indian court judgments are in the public domain under Indian law, so there are no copyright issues for research use.
Being upfront about limitations: coverage is primarily English text (except the Supreme Court, which publishes 3-4 translated language copies per judgment), since Indian High Courts issue orders in English. The regional language data comes from our translation service, not from original regional language judgments.
Metadata extraction accuracy varies by court: the SC and major HCs are cleaner, while smaller tribunals have messier inputs. The citation graph is extracted heuristically with LLM assistance; I estimate around 90-95% precision on citation extraction, and lower on treatment classification. Not all 20M cases have complete metadata; coverage is best for post-2007 judgments.
would love to hear from anyone working on legal NLP, Indian language models, or graph-based legal analysis. What would be most useful to you from a dataset like this?
deets at vaquill


