Measuring the many sizes of a Git repository

Is your Git repository bursting at the seams? git-sizer is a new open source tool that can tell you when your repo is getting too big. git-sizer computes various Git repository size metrics and alerts you to any that might cause problems or inconvenience.

What is “big”?

When people talk about the size of a Git repository, they often talk about the total size needed by Git to store the project’s history in its internal, highly-compressed format—basically, the amount of disk space used by the .git directory. This number is easy to measure. It’s also useful, because it indicates how long it takes to clone the repository and how much disk space it will use.
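As a rough illustration of how easy this number is to get, Git's own `git count-objects -v` command reports the loose and packed object counts along with their on-disk sizes in KiB. The sketch below is not part of git-sizer; the helper names (`parse_count_objects`, `object_summary`, `count_objects`) are made up for this example, and it assumes `git` is on your PATH.

```python
import subprocess


def parse_count_objects(text):
    """Parse the `key: value` lines printed by `git count-objects -v`."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip()
        # Keep only numeric fields such as count, size, in-pack, size-pack.
        if sep and value.isdigit():
            stats[key.strip()] = int(value)
    return stats


def object_summary(stats):
    """Combine loose and packed figures into two headline numbers."""
    return {
        # Loose objects plus objects stored in packfiles.
        "objects": stats.get("count", 0) + stats.get("in-pack", 0),
        # Disk usage of loose objects plus packfiles, in KiB.
        "disk_kib": stats.get("size", 0) + stats.get("size-pack", 0),
    }


def count_objects(repo_path="."):
    """Run git in repo_path and return the parsed statistics."""
    result = subprocess.run(
        ["git", "-C", repo_path, "count-objects", "-v"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_count_objects(result.stdout)
```

Calling `object_summary(count_objects("path/to/repo"))` gives only the compressed, on-disk view of a repository, which is exactly the measurement that can look harmless while other properties are out of proportion.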

At GitHub we host over 78 million Git repositories, so we’ve seen it all. What we find is that many of the repositories that tax our servers the most are not unusually big. The most challenging repositories to host are often those that have an unusual internal layout that Git is not optimized for.

Many properties aside from overall size can make a Git repository unwieldy. For example:

  • It could contain an astronomical number of Git objects (which are used to store the repository’s history)

  • The total size of the Git objects could be huge when uncompressed (even though their size is reasonable when compressed)

  • When the repository is checked out, the size of the working copy might be gigantic

  • The repository could have an unreasonable number of commits in its history

  • It could include enormous individual files or directories

  • It could contain large files/directories that have been modified very many times

  • It could contain too many references (branches, tags, etc.)

Any of these properties, if taken to an extreme, can cause certain Git operations to perform poorly. And surprisingly, a repository can be grossly oversized in almost any of these ways without using a worrying amount of disk space.

It also makes sense to consider whether the size of your repository is commensurate with the type and scope of your project. The Linux kernel has been developed over 25 years by thousands of contributors, so it is not at all alarming that it has grown to 1.5 GB. But if your weekend class assignment is already 1.5 GB, that’s probably a strong hint that you could be using Git more effectively!

Sizing up your repository

You can use git-sizer to measure many size-related properties of your repository, including all of those listed above. To do so, you’ll need a local clone of the repository and a copy of the Git command-line client installed and in your execution PATH. Then:

  1. Install git-sizer
  2. Change to the directory containing your repository
  3. Run git-sizer. You can learn about its command-line options by running git-sizer --help, but no options are required

git-sizer will gather statistics about all of the references and reachable Git objects in your repository and output a report. For example, here is the verbose output for the Linux kernel repository:

$ git-sizer --verbose
Processing blobs: 1652370
Processing trees: 3396199
Processing commits: 722647
Matching commits to trees: 722647
Processing annotated tags: 534
Processing references: 539
| Name                         | Value     | Level of concern               |
| ---------------------------- | --------- | ------------------------------ |
| Overall repository size      |           |                                |
| * Commits                    |           |                                |
|   * Count                    |   723 k   | *                              |
|   * Total size               |   525 MiB | **                             |
| * Trees                      |           |                                |
|   * Count                    |  3.40 M   | **                             |
|   * Total size               |  9.00 GiB | ****                           |
|   * Total tree entries       |   264 M   | *****                          |
| * Blobs                      |           |                                |
|   * Count                    |  1.65 M   | *                              |
|   * Total size               |  55.8 GiB | *****                          |
| * Annotated tags             |           |                                |
|   * Count                    |   534     |                                |
| * References                 |           |                                |
|   * Count                    |   539     |                                |
|                              |           |                                |
| Biggest objects              |           |                                |
| * Commits                    |           |                                |
|   * Maximum size         [1] |  72.7 KiB | *                              |
|   * Maximum parents      [2] |    66     | ******                         |
| * Trees                      |           |                                |
|   * Maximum entries      [3] |  1.68 k   |                                |
| * Blobs                      |           |                                |
|   * Maximum size         [4] |  13.5 MiB | *                              |
|                              |           |                                |
| History structure            |           |                                |
| * Maximum history depth      |   136 k   |                                |
| * Maximum tag depth      [5] |     1     | *                              |
|                              |           |                                |
| Biggest checkouts            |           |                                |
| * Number of directories  [6] |  4.38 k   | **                             |
| * Maximum path depth     [7] |    14     | *                              |
| * Maximum path length    [8] |   134 B   | *                              |
| * Number of files        [9] |  62.3 k   | *                              |
| * Total size of files    [9] |   747 MiB |                                |
| * Number of symlinks    [10] |    40     |                                |
| * Number of submodules       |     0     |                                |

[1]  91cc53b0c78596a73fa708cceb7313e7168bb146
[2]  2cde51fbd0f310c8a2c5f977e665c0ac3945b46d
[3]  4f86eed5893207aca2c2da86b35b38f2e1ec1fc8 (refs/heads/master:arch/arm/boot/dts)
[4]  a02b6794337286bc12c907c33d5d75537c240bd0 (refs/heads/master:drivers/gpu/drm/amd/include/asic_reg/vega10/NBIO/nbio_6_1_sh_mask.h)
[5]  5dc01c595e6c6ec9ccda4f6f69c131c0dd945f8c (refs/tags/v2.6.11)
[6]  1459754b9d9acc2ffac8525bed6691e15913c6e2 (589b754df3f37ca0a1f96fccde7f91c59266f38a^{tree})
[7]  78a269635e76ed927e17d7883f2d90313570fdbc (dae09011115133666e47c35673c0564b0a702db7^{tree})
[8]  ce5f2e31d3bdc1186041fdfd27a5ac96e728f2c5 (refs/heads/master^{tree})
[9]  532bdadc08402b7a72a4b45a2e02e5c710b7d626 (e9ef1fe312b533592e39cddc1327463c30b0ed8d^{tree})
[10] f29a5ea76884ac37e1197bef1941f62fda3f7b99 (f5308d1b83eba20e69df5e0926ba7257c8dd9074^{tree})

The git-sizer project page explains the output in detail. The most interesting thing to look at is the “level of concern” column, which gives a rough indication of which parameters are high compared with a typical, modest-sized Git repository. A lot of asterisks would suggest that your repository is stretching Git beyond its sweet spot, and that some Git operations might be noticeably slower than usual. If you see exclamation marks instead of asterisks in this column, then you likely have a problem that needs addressing.
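If you want to scan a report like the one above in a script, the table is regular enough to parse by hand. The following is only a sketch based on the layout shown in this post, not an official git-sizer interface: the function names and the `min_stars=4` threshold are assumptions chosen for illustration.

```python
import re


def parse_report(text):
    """Return (path, value, concern) for every table row that has a value."""
    rows, stack = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # progress lines, footnotes, blank lines
        cells = line.strip("|").split("|")
        if len(cells) != 3:
            continue
        raw_name, value, concern = cells
        name = raw_name.strip()
        # Skip the header row, the dashed separator, and spacer rows.
        if not name or name == "Name" or set(name) == {"-"}:
            continue
        if name.startswith("*"):
            # Nesting is shown by two extra spaces per level before the "*".
            indent = len(raw_name) - len(raw_name.lstrip())
            depth = 1 + (indent - 1) // 2
            label = name[1:].strip()
        else:
            depth, label = 0, name
        label = re.sub(r"\s*\[\d+\]$", "", label)  # drop footnote markers
        stack = stack[:depth] + [label]
        if value.strip():
            rows.append((" / ".join(stack), value.strip(), concern.strip()))
    return rows


def worrying(rows, min_stars=4):
    """Rows with exclamation marks, or at least min_stars asterisks."""
    return [r for r in rows if "!" in r[2] or len(r[2]) >= min_stars]
```

Passing the captured output of `git-sizer --verbose` to `parse_report` and then to `worrying` yields the rows worth a closer look. Where to draw the line is a judgment call, so adjust `min_stars` to suit your project.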

As you can see from the output, even though the Linux kernel is a big project by most standards, it is fairly well-balanced and none of its parameters have extreme values. Some Git operations will certainly take longer than they would in a small repository, but not unreasonably, and not out of proportion to the scope of the project. The kernel project is comfortably manageable in Git.

If the git-sizer analysis flags up any problems in your repository, we suggest referring again to the git-sizer project page, where you will find many suggestions and resources for improving the structure of your Git repository. Please note that by far the easiest time to improve your repository structure is when you are just beginning to use Git, for example when migrating a repository from another version control system, before a lot of developers have started cloning and contributing to the repository. And keep in mind that repositories only grow over time, so it is preferable to establish good practices early.

Summary

Git is famous for its speed and ability to deal with even quite large development projects. But every system has its limits, and if you push its limits too hard, your experience might suffer. git-sizer can help you evaluate whether your Git repository will live happily within Git, or whether it would be advisable to slim it down to make your Git experience as delightful as it can be.

Getting involved: git-sizer is open source! If you’d like to report bugs or contribute new features, head over to the project page.

Weak cryptographic standards removed

Earlier today we permanently removed support for the following weak cryptographic standards on github.com and api.github.com:

  • TLSv1/TLSv1.1: This applies to all HTTPS connections, including web, API, and Git connections to https://github.com and https://api.github.com.
  • diffie-hellman-group1-sha1: This applies to all SSH connections to github.com
  • diffie-hellman-group14-sha1: This applies to all SSH connections to github.com
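
If you maintain your own HTTPS client, one way to check that it is unaffected is to refuse anything older than TLS 1.2 and see whether the connection still succeeds. The sketch below uses Python's standard `ssl` module; the function names are illustrative, and it is a minimal check under those assumptions rather than an official GitHub diagnostic.

```python
import socket
import ssl


def strict_tls_context():
    """Client context that rejects TLSv1 and TLSv1.1 handshakes."""
    ctx = ssl.create_default_context()  # verifies certificates and hostnames
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


def negotiated_tls_version(host, port=443, timeout=10):
    """Connect to host and return the protocol version that was negotiated."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with strict_tls_context().wrap_socket(sock, server_hostname=host) as tls:
            return tls.version()
```

Calling `negotiated_tls_version("api.github.com")` requires network access. If the local TLS library cannot speak TLS 1.2 or newer, the handshake raises `ssl.SSLError` instead of returning a version string.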

This change was originally announced last year, with the final timeline for the removal posted three weeks ago. If you run into any issues or have any questions, please don’t hesitate to let us know.

Weak cryptographic standards removal notice

Last year we announced the deprecation of several weak cryptographic standards. Then we provided a status update toward the end of last year outlining some changes we’d made to make the transition easier for clients. We quickly approached the February 1, 2018 cutoff date we mentioned in previous posts and, as a result, pushed back our schedule by one week. On February 8, 2018 we’ll start disabling the following:

  • TLSv1/TLSv1.1: This applies to all HTTPS connections, including web, API, and Git connections to https://github.com and https://api.github.com.
  • diffie-hellman-group1-sha1: This applies to all SSH connections to github.com
  • diffie-hellman-group14-sha1: This applies to all SSH connections to github.com

We’ll disable the algorithms in two stages:

  • February 8, 2018 19:00 UTC (11:00 am PST): Disable deprecated algorithms for one hour
  • February 22, 2018 19:00 UTC (11:00 am PST): Permanently disable deprecated algorithms

For more details, head to the Engineering Blog.

Game Off, our annual game jam, returns in November

GitHub Game Off 2017

Game Off, our fifth annual game jam, returns in just two weeks!

A game jam is a hackathon for creating video games. Although most game jams run for 24-72 hours, the Game Off runs for the entire month of November. You’ll have 30 days to create a game inspired by or loosely based on a theme that we’ll announce Wednesday, November 1, at 13:37 PDT.

As always, you’re encouraged to use open source game engines, libraries, and tools, but you’re free to use any technology you want. It’s a perfect excuse to experiment with something new, too.

This year, the Game Off will take place on itch.io, an open marketplace for indie game developers and a platform for running game jams, among other things. Best of all, this year, you’ll be judging the entries.

We’ll announce all the latest updates on our blog and Twitter account. Stay tuned and follow along with the #GitHubGameOff hashtag!

Join the jam on itch.io today

GLHF! We can’t wait to play what you make <3

Doubling Bug Bounty rewards

Hack the World 2017

We’re coming up on four years since the Bug Bounty program was first announced. A lot has changed in that time, and we constantly try to keep our reward structure in line with top security bug bounty programs. We’re excited to announce that starting today we’re doubling our payout amounts, bringing the minimum and maximum payouts to $555 and $20,000, respectively. This means that any report eligible for a bounty will be met with at least a $555 reward. This doesn’t mean we’re raising the bar for what is considered a valid report; we’re simply raising the payouts.

This bump to our payouts aligns with Hack the World, an annual hacking competition by HackerOne, which kicked off this morning and runs until November 18th. During this time, participants compete against each other to find the most security vulnerabilities across all sites on HackerOne’s platform. We’re one of the sponsors, which means hackers will be rewarded with twice the reputation points on HackerOne when finding bugs on GitHub over the next month! As an additional incentive, we will also be rewarding all valid submissions with free unlimited private repositories for life. The increased bounty payouts are here to stay, but unlimited private repositories will only be rewarded for reports submitted on or before November 18th!

Ready to compete? Submit all reports to our Bug Bounty program. For more details on the competition, please visit the Hack the World website.
