Lugha Datasets

Multilingual datasets for frontier AI teams.

Parallel pairs, domain corpora, and custom collections across African and Asian languages — engineered for training, fine-tuning, and evaluation at production scale.

Built to the standard that frontier model labs, localization platforms, and enterprise AI teams expect.

Catalog

Production-ready datasets, licensed for commercial AI.

African language pairs

Volume: 12M+ aligned segments
Languages: Swahili, Yoruba, Hausa, Amharic, Zulu, Kinyarwanda, Lingala, +18 more
Format: JSONL · Parquet · TMX
Licence: Commercial · per-seat or perpetual

Asian language pairs

Volume: 8M+ aligned segments
Languages: Hindi, Urdu, Bengali, Tamil, Tagalog, Vietnamese, Burmese, +14 more
Format: JSONL · Parquet · TMX
Licence: Commercial · per-seat or perpetual

Domain corpora

Volume: 3M+ specialist segments
Domains: Legal, Medical, Financial, Public-sector, Technical
Format: JSONL with glossaries
Licence: Per-domain commercial licence

Custom collection

Volume: Built to spec
Languages: Any pair we cover · sourced & QA'd to your guidelines
Format: Your schema
Licence: Negotiated per engagement

What you get

Audit-ready data, delivered the way enterprise AI teams ship.

Provenance & licensing

Every segment is sourced under documented agreements with native contributors. You receive a chain-of-custody manifest with each delivery.

Quality guarantees

Per-pair COMET and BLEU benchmarks, fluency scoring, terminology validation, and dual-pass native review on premium tiers.

Enterprise delivery

Delivery via S3, GCS, or signed URL. SOC 2-aligned controls, encryption at rest, and SLA-backed support for production accounts.

Request access

Tell us about your use case.

We'll send samples, a licensing proposal, and pricing within one business day.

Submissions are routed to our partnerships team, who reply within one business day.