Data - page 2

The data science behind topic suggestions

Add topics to repositories

Earlier this year, we launched topics, a new feature that lets you tag repositories with descriptive words or phrases. Topics help you create connections between similar GitHub projects and explore them by type, technology, and other characteristics they have in common.

All public repositories show topic suggestions, so you can quickly tag repositories with relevant words and phrases. These suggestions are the result of some exciting data science work—in particular, a topic extraction framework based on text mining, natural language processing, and machine learning called repo-topix.

Learn more about repo-topix from the Engineering Blog

Topic suggestions close up

Now when you add or reject topics, you’re doing more than keeping projects organized. Every topic will contribute to surfacing connections and inspiring discovery across GitHub. Repository names, descriptions, and READMEs from millions of public projects serve as the very start of an ever-evolving knowledge graph of concepts. Eventually, the graph will map how these concepts relate to each other and to the code, people, and projects on GitHub.

Topics is part of a greater effort to use our public data to make meaningful improvements to how people discover, interact, and build on GitHub. At Universe, our flagship product and community conference, we’ll be sharing more ways that data can improve the way you work.

Get tickets to GitHub Universe

Announcing an open data set on the open source community

We just released an open data set for the open source community, researchers, and curious data wonks to study.

The data includes responses from 5,500 open source participants randomly sampled from over 3,800 projects on GitHub, plus over 500 responses sourced from communities that work on other platforms. Altogether, this is some of the most comprehensive and high-quality data on the open source community to date.

header from the survey website

The Open Source Survey covers a broad set of topics, including:

  • What people value in the software they use and in open source projects
  • How and where people find and provide help
  • Privacy preferences and practices
  • Employer policies around using and contributing to open source
  • Negative experiences and their consequences
  • Personal backgrounds of community members

We hope you’ll use the data to inform decisions about community, tooling, and prioritization of work; understand the needs and experiences of different parts of the community; and do new and interesting research on a remarkable system of peer production that powers so much of modern life.

In the meantime, we’ve started using the findings to help us understand what makes a healthy community and how we can improve GitHub for maintainers, contributors, and end users.

plot of importance of various attributes to project use and contribution

Huge thanks to all of our collaborators in academia, industry, and the open source community who contributed topic ideas and questions, helped with translations, and took the survey. You can find the data, and an analysis of the key findings, on the survey website. Let us know how you use the data or write to us with questions or comments.

GitHub data, ready for you to explore with BigQuery

GitHub data is available for public analysis using Google BigQuery, and we’d like to help you take it for a spin.

If you’d like to find out more about what data is available and how it’s been used so far, watch this conversation between GitHub Data Analyst Alyson La and Google Developer Advocate Felipe Hoffa. You’ll learn the story behind the datasets and what types of analysis they make possible. You’ll also see how we’ve visualized data with Tableau and Looker.

There’s a lot of data out there, but it’s all available through BigQuery in two large data sets. The original, community-led GitHub Archive project launched in 2012 and captures almost 30 million events monthly, including issues, commits, and pushes. Last year, we worked with Google to release The GitHub Public Data Set, separate tables with information on all projects that have open source licenses, including commits, file contents, and file paths.

You can also use the GHTorrent project to complement the existing datasets with additional metadata.

We ran a list of queries on the datasets above to create the open source section of our Octoverse report, but anyone can run an analysis. Here are the results of some of the queries run so far.

  • “This should never have happened” has appeared in code comments more than a million times (hear this data point for yourself in this Changelog episode)
  • Where does open source happen? GitHub top countries shares which countries have the most open source developers per capita
  • How reliable is GitHub? Felipe runs a query to find out in GitHub reliability with BigQuery
  • There are a lot of feels in open source. Geeksta examines how emotions are expressed in GitHub commit messages
  • Are bigger pull requests better? Jessie Frazelle analyzed the top 15 projects on GitHub in terms of pull requests opened vs. pull requests closed
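To give a feel for the first item above, here is a minimal local sketch of a phrase count, written in plain Python rather than as a BigQuery query. The `count_phrase` helper and the sample strings are hypothetical stand-ins for file contents; they are not part of BigQuery or any GitHub API.

```python
import re

def count_phrase(file_contents, phrase):
    """Count case-insensitive occurrences of `phrase` across file contents."""
    # Allow any run of whitespace between the words of the phrase.
    words = [re.escape(word) for word in phrase.split()]
    pattern = re.compile(r"\s+".join(words), re.IGNORECASE)
    return sum(len(pattern.findall(text)) for text in file_contents)

# Hypothetical sample data standing in for file contents from the dataset.
sample_files = [
    "// This should never have happened\nreturn nil",
    "# THIS SHOULD NEVER  HAVE HAPPENED, but just in case\nraise RuntimeError()",
    "print('all good')",
]

total = count_phrase(sample_files, "this should never have happened")
```

On the real datasets the same idea would be expressed as a query over file contents; the local version is only meant to show the matching logic.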

Happy exploring!

Making open source data more available

Data gives us insight into how people build software, and the activities of open source communities on GitHub represent one of the richest datasets ever created of people working together at scale.

In 2012, the community-led project GitHub Archive launched, providing a glimpse into the ways people build software on GitHub. Today, we’re delighted to announce that, in collaboration with Google, we are releasing a collection of additional BigQuery tables to expand on the data from that project [1].

This 3TB+ dataset comprises the largest released source of GitHub activity to date. It contains activity data for more than 2.8 million open source GitHub repositories including more than 145 million unique commits, over 2 billion different file paths and the contents of the latest revision for 163 million files, all of which are searchable with regular expressions.

With this new dataset, it takes only a simple query to find the most commonly used Go packages, see which US schools have the most open source contributors, or find all of the things that should never happen.
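As an illustration of the Go package question, the sketch below counts imported packages across a handful of Go source files using regular expressions. It runs locally in Python on made-up sample files; the helper names `go_imports` and `most_common_packages` are hypothetical, and the regexes are a simplification that ignores comments inside import blocks.

```python
import re
from collections import Counter

# Parenthesized import block: import ( ... )
IMPORT_BLOCK = re.compile(r'^import\s*\((.*?)\)', re.MULTILINE | re.DOTALL)
# Single-line import, optionally aliased: import "fmt" or import f "fmt"
IMPORT_SINGLE = re.compile(r'^import\s+(?:[\w.]+\s+)?"([^"]+)"', re.MULTILINE)
# Any quoted path inside an import block
QUOTED_PATH = re.compile(r'"([^"]+)"')

def go_imports(source):
    """Return the import paths found in one Go source file."""
    paths = IMPORT_SINGLE.findall(source)
    for block in IMPORT_BLOCK.findall(source):
        paths.extend(QUOTED_PATH.findall(block))
    return paths

def most_common_packages(sources, n=3):
    """Rank packages by the number of files that import them."""
    counts = Counter()
    for source in sources:
        # A set counts each package at most once per file.
        counts.update(set(go_imports(source)))
    return counts.most_common(n)

# Hypothetical sample files standing in for file contents from the dataset.
sample_sources = [
    'package main\n\nimport (\n\t"fmt"\n\t"os"\n)\n',
    'package util\n\nimport "fmt"\n',
    'package web\n\nimport (\n\t"fmt"\n\t"net/http"\n)\n',
]

top = most_common_packages(sample_sources, n=1)
```

Counting each package once per file is a design choice made here for simplicity; counting per repository would need repository identifiers alongside the contents.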

Just as books capture thoughts and ideas, software encodes human knowledge in a machine-readable form. This dataset is a great start toward the pursuit of documenting the open source community’s vast repository of knowledge—but there’s more to be done. Over the coming months, you can expect to hear from us on how we hope to make open source data even more available, portable, and useful.

Whether you’re a researcher studying open source communities, an organization looking to monitor the health of your open source projects, or curious about the latest trends in software development, go check out the new dataset hosted on Google Cloud to analyze one of the largest datasets of people collaborating on the planet.

1. If you’d like to hear more about the data release then check out this episode of The Changelog.

Language Trends on GitHub

Recently we took a look at the popularity of programming languages used on GitHub.

Below is a graph that shows the change in rank of languages since GitHub launched in 2008.

graph of the change in rank of languages on GitHub since 2008, excluding forks

The rank represents languages used in public & private repositories, excluding forks, as detected by Linguist.

It should be noted that this graph represents each language’s relative popularity on GitHub. For example, Ruby on Rails has been on GitHub since 2008, which may explain Ruby’s early popularity.

Between 2008 and 2015 GitHub gained the most traction in the Java community, which changed in rank from 7th to 2nd. Possible contributing factors to this growth could be the growing popularity of Android and the increasing demand for version control platforms at businesses and enterprises.
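A change in rank like the one described for Java can be computed from per-language totals at two points in time. The sketch below assumes the rank is derived from a simple repository count per language, which the post does not spell out; the function names and all numbers are made up for illustration and are not GitHub's real figures.

```python
def rank_languages(repo_counts):
    """Return {language: rank}, where rank 1 has the most repositories."""
    # Ties are broken alphabetically so the result is deterministic.
    ordered = sorted(repo_counts, key=lambda lang: (-repo_counts[lang], lang))
    return {lang: position for position, lang in enumerate(ordered, start=1)}

def rank_changes(counts_start, counts_end):
    """Return {language: start_rank - end_rank}; positive means it moved up."""
    start = rank_languages(counts_start)
    end = rank_languages(counts_end)
    return {lang: start[lang] - end[lang] for lang in start if lang in end}

# Made-up repository counts, for illustration only.
counts_2008 = {"Ruby": 500, "JavaScript": 300, "Java": 100}
counts_2015 = {"JavaScript": 900, "Java": 700, "Ruby": 600}

changes = rank_changes(counts_2008, counts_2015)
```

Because ranks are relative, a language can lose rank even while its own repository count grows, as Ruby does in this toy data.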



