
GitHub Opened a Multilingual Repositories Dataset for AI

80M+ classification rows across 40M public repos — README, issue, and PR language signals under CC0 for evaluating multilingual developer AI tools.

In brief

GitHub released the GitHub Multilingual Repositories Dataset — repository-level metadata, not a dump of repo text. It spans 80+ million classification rows across 40 million public projects: language signals for READMEs, the most-commented issue, and the most-commented pull request. License: CC0-1.0. The goal is to help researchers and developers find non-English collaboration in open source and build better evaluation for multilingual AI tooling.

What happened

Software is written in programming languages, but collaboration happens in human languages. READMEs explain setup, issues ask for help, pull requests debate design — often in English, but not always. As AI plays a larger role in how teams build software, knowing where multilingual developer content lives matters.

The dataset does not ship full repository text. For each public repo it provides:

  • language classification for the README, the most-commented issue, and the most-commented PR, each based on the first 150 characters of text (texts under 20 characters are excluded);
  • three independent classifiers — fastText, gcld3, and lingua-py — each with a confidence score; only labels above 0.5 confidence are included;
  • repository metadata: creation time, disk usage, stars, forks, primary language, SPDX license, issue/PR counts, snapshot date.
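
The snippet and confidence rules in the list can be sketched in a few lines. This is only an illustration of the rules as the post states them, not GitHub's pipeline: the function names are hypothetical, and whether length is measured before or after any whitespace trimming is an assumption.

```python
from typing import Optional

SNIPPET_CHARS = 150    # classifiers see only the first 150 characters
MIN_CHARS = 20         # shorter texts are excluded entirely
MIN_CONFIDENCE = 0.5   # only labels above this confidence are included


def snippet(text: str) -> Optional[str]:
    """Return the slice a classifier would see, or None if the text is too short.

    Assumption: length is measured on the raw text, without trimming.
    """
    if len(text) < MIN_CHARS:
        return None
    return text[:SNIPPET_CHARS]


def keep_label(label: str, confidence: float) -> Optional[str]:
    """Keep a label only when its confidence is strictly above the cutoff."""
    return label if confidence > MIN_CONFIDENCE else None
```

Because only 150 characters are classified, a README that opens with badges or an English boilerplate line can be labeled differently from the language of its body.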

GitHub deliberately does not collapse the three classifiers into one label. Coverage and calibration differ between them, especially for lower-resource languages. Researchers choose their own precision/recall tradeoff: for example, require all three classifiers to agree for a high-precision Greek subset, or relax that requirement for exploratory Romance-language studies.
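
A minimal sketch of that tradeoff, assuming each row carries one optional label per classifier. The field names and the use of ISO codes such as "el" are hypothetical; the real dataset's columns may differ.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ReadmeSignals:
    """Hypothetical row: one README label per classifier, None when absent."""
    repo: str
    fasttext: Optional[str]
    gcld3: Optional[str]
    lingua: Optional[str]


def agreement_count(row: ReadmeSignals, lang: str) -> int:
    """Count how many of the three classifiers returned `lang`."""
    return sum(label == lang for label in (row.fasttext, row.gcld3, row.lingua))


def select(rows: List[ReadmeSignals], lang: str, min_agree: int = 3) -> List[str]:
    """Return repos where at least `min_agree` classifiers agree on `lang`."""
    return [r.repo for r in rows if agreement_count(r, lang) >= min_agree]


# Invented example rows, not taken from the dataset.
rows = [
    ReadmeSignals("a/one", "el", "el", "el"),
    ReadmeSignals("b/two", "el", None, "el"),
    ReadmeSignals("c/three", "pt", "es", "pt"),
]
strict = select(rows, "el", min_agree=3)    # high precision
relaxed = select(rows, "el", min_agree=2)   # higher recall
```

Here `strict` keeps only the repo where all three classifiers say Greek, while `relaxed` also admits the repo where one classifier returned nothing.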

Language distribution varies by content type. Portuguese leads non-English READMEs (3M+ repos). Korean is most common in issue text but only fifth in READMEs among non-English languages.
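
Rankings like these come from counting labels separately for each content type. A toy sketch of that comparison follows; the pairs are invented for illustration and are not dataset figures, and the `(content_type, language)` layout is an assumption.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Invented (content_type, language) pairs, purely illustrative.
labels = [
    ("readme", "pt"), ("readme", "pt"), ("readme", "pt"),
    ("readme", "zh"), ("readme", "zh"), ("readme", "ko"),
    ("issue", "ko"), ("issue", "ko"), ("issue", "pt"),
    ("readme", "en"), ("issue", "en"),
]


def rank_non_english(pairs: Iterable[Tuple[str, str]], content_type: str) -> List[str]:
    """Rank non-English languages by frequency within one content type."""
    counts = Counter(
        lang for kind, lang in pairs
        if kind == content_type and lang != "en"
    )
    return [lang for lang, _ in counts.most_common()]


readme_rank = rank_non_english(labels, "readme")
issue_rank = rank_non_english(labels, "issue")
```

Running the same ranking per content type is what exposes cases where a language ranks high in issues but lower in READMEs.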

Why it matters

Many European and other languages remain underrepresented in web corpora used to train and evaluate LLMs. Coding assistants, doc generators, and review bots may work well for some communities and lag for others.

Repository text is not generic web prose: it is made up of install steps, bug templates, review comments, and community norms. The dataset is a discovery tool, not ground-truth language ID, because short snippets, badges, mixed languages, and code can all mislead classifiers. GitHub warns against using it as a language-ID benchmark and against inferring sensitive attributes about people; signals are repo-level only.

The release follows the 2025 Microsoft European Digital Commitments on open multilingual data. The team discussed it at the Open Innovation Dialogue Hub in Strasbourg on June 16, 2026.

In practice

  1. Corpus discovery — filter repos with high confidence for a target language, then fetch actual README/issue/PR text via the GitHub API (not from the dataset itself).
  2. AI tool evaluation — build test sets for assistants and doc tools in target languages; compare quality before/after fine-tuning or prompt changes.
  3. Community research — study how non-English communities use issues vs READMEs for support and onboarding.
  4. Product arguments — language share stats support localization and model coverage decisions with data.
  5. Filter strictness — combine confidence thresholds and classifier agreement for production pipelines; loosen for exploration.
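
For step 1, the text itself has to come from the GitHub API. A minimal sketch using the REST endpoint for repository READMEs, which returns base64-encoded content; the helper names are hypothetical, and rate limiting, retries, and missing READMEs are left out.

```python
import base64
import json
import urllib.request
from typing import Optional

API_ROOT = "https://api.github.com"


def readme_url(full_name: str) -> str:
    """Build the README endpoint URL for an "owner/repo" name."""
    return f"{API_ROOT}/repos/{full_name}/readme"


def decode_readme(payload: dict) -> str:
    """Decode the base64 `content` field of a README API response."""
    if payload.get("encoding") != "base64":
        raise ValueError("unexpected encoding: %r" % payload.get("encoding"))
    # b64decode ignores the newlines GitHub inserts into the encoded content.
    return base64.b64decode(payload["content"]).decode("utf-8", errors="replace")


def fetch_readme(full_name: str, token: Optional[str] = None) -> str:
    """Fetch and decode one README. Performs a network call."""
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    request = urllib.request.Request(readme_url(full_name), headers=headers)
    with urllib.request.urlopen(request, timeout=30) as response:
        return decode_readme(json.load(response))
```

In a real pipeline, `fetch_readme` would run only over repos that already passed the confidence and agreement filters, which keeps API usage proportional to the subset of interest.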

The dataset is on GitHub under CC0 — critique, extend, and ship tools on top without licensing friction.

Takeaway

This is not another web crawl — it is a map of multilingual collaboration in open source. For builders of developer AI and language-representation researchers, it is a practical starting point with explicit caveats. GitHub invites the community to share interesting builds on top of it.