Lugha Datasets

Multilingual datasets for frontier AI teams.

Parallel pairs, domain corpora, and custom collections across African and Asian languages — engineered for training, fine-tuning, and evaluation at production scale.

Built to the standard that frontier model labs, localization platforms, and enterprise AI teams expect.

Catalog

Production-ready datasets, licensed for commercial AI.

African language pairs

Volume: 12M+ aligned segments
Languages: Swahili, Yoruba, Hausa, Amharic, Zulu, Kinyarwanda, Lingala, +18 more
Format: JSONL · Parquet · TMX
Licence: Commercial · per-seat or perpetual

Asian language pairs

Volume: 8M+ aligned segments
Languages: Hindi, Urdu, Bengali, Tamil, Tagalog, Vietnamese, Burmese, +14 more
Format: JSONL · Parquet · TMX
Licence: Commercial · per-seat or perpetual

Domain corpora

Volume: 3M+ specialist segments
Domains: Legal, Medical, Financial, Public-sector, Technical
Format: JSONL with glossaries
Licence: Per-domain commercial licence

Custom collection

Volume: Built to spec
Languages: Any pair we cover · sourced & QA'd to your guidelines
Format: Your schema
Licence: Negotiated per engagement

What you get

Audit-ready data, delivered the way enterprise AI teams ship.

Provenance & licensing

Every segment is sourced under documented agreements with native contributors. You receive a chain-of-custody manifest with each delivery.

Quality guarantees

Per-pair COMET and BLEU benchmarks, fluency scoring, terminology validation, and dual-pass native review on premium tiers.

Enterprise delivery

Delivery via S3, GCS, or signed URL. SOC 2-aligned controls, encryption at rest, and SLA-backed support for production accounts.

Request access

Tell us about your use case.

We'll send samples, a licensing proposal, and pricing within one business day.

Prefer general contact? Reach our team