Since launching the new GitHub Desktop in 2017, we’ve focused on improving collaboration in the app, laying the foundation how you can work with Desktop today.
Today, we’re releasing GitHub Desktop 1.5, representing a culmination of the work we’ve been doing this past year. This release completes the merge collaboration cycle by providing a way to initiate a merge in the branch dropdown, guiding you through resolving merge conflicts, and informing you when a merge is complete. It also includes our first step toward improving onboarding onto GitHub Desktop with the option to clone and add new repositories in the repository dropdown.
With today’s GitHub Desktop release, you can merge with confidence knowing that even if conflicts occur, we’ll help you through it so you can keep shipping. Merge conflicts can be intimidating for new developers, especially those working in teams. In our usability tests, the audible “NOOOOO” when encountering a conflict became predictable.
In our previous release, we reduced some of that anxiety by informing you whether or not you would encounter merge conflicts before merging, but you still needed to actually resolve the conflicts on your own. With more than 10 percent of all merges in the app resulting in merge conflicts, we knew we could do better. And with GitHub Desktop 1.5, you’re no longer on your own. The app will now inform you which files have conflicts, route you to your preferred editor to resolve them, list the conflicts that you still need to address, and show you when everything is resolved and ready to merge.
As we’ve released features related to merging over the past several months, we’ve also had an opportunity to listen to lots of users. We care about your feedback, and this release incorporates several changes based on what we’ve learned from you. With GitHub Desktop 1.5, you can now initiate a merge from the branch dropdown, and you’ll receive feedback in the app to let you know when a merge is completed successfully.
We’ve also seen that the core function of adding a repository to Desktop has been difficult to find and use. We solved this by adding a simple way to create, add, or clone a repository right from the repository dropdown.
These changes are subtle, but together they represent our commitment to listen and learn from people using Desktop every day. We conduct user interviews and usability testing on a regular basis—if you’d like to participate and help make Desktop even more useful, please sign up.
Finally, we want to call out that this release is the first time we’ve shipped a feature iteration built almost entirely by community contributors outside of GitHub. The improved merge flow was a combined effort from @JQuinnie and @bruncun, and there were more than 30 merged pull requests from the community since our last release.
We continue to be blown away by the community that has grown around GitHub Desktop as an open source product. There were more community pull requests merged in September and October than in any previous months, and there’s no sign of slowing down. We’re grateful for the community’s participation in improving GitHub Desktop, and if you feel inspired to build something awesome together, we’d love to see you in our open source repository.
Text editors like Atom have many features that make code easier to read and write—syntax highlighting and code folding are two of the most important examples. For decades, all major text editors have implemented these kinds of features based on a very crude understanding of code, obtained by searching for simple, regular expression patterns. This approach has severely limited how helpful text editors can be.
At GitHub, we want to explore new ways of making programming intuitive and delightful, so we’ve developed a parsing system called Tree-sitter that will serve as a new foundation for code analysis in Atom. Tree-sitter makes it possible for Atom to parse your code while you type—maintaining a syntax tree at all times that precisely describes the structure of your code. We’ve enabled the new system by default in Atom, bringing a number of improvements.
Atom’s syntax highlighting is now based on the syntax trees provided by Tree-sitter. This lets us use color to outline your code’s structure more clearly than before. Notice the consistency with which fields, functions, keywords, types, and variables are highlighted across a variety of languages:
In most text editors, code folding is based on indentation: lines with greater indentation are considered to be nested more deeply than lines with less indentation. But this doesn’t always match the structure of our code and can make code folding useless in some files. With Tree-sitter, Atom folds code based on its syntax, which allows folding to work as expected, even for code like this:
Atom also uses syntax trees as the basis for two new editing commands:
Select Larger Syntax Node and
Select Smaller Syntax Node, bound to Alt+Up and Alt+Down. These commands can make many editing tasks more efficient and fun, especially when used in combination with multiple cursors.
Parsing an entire source file can be time-consuming. This is why most IDEs wait to parse your code until you stop typing for a moment, and there is often a delay before syntax highlighting updates. We want to avoid these delays, so we designed Tree-sitter to parse your code incrementally: it keeps the syntax tree up to date as you edit your code without ever having to re-parse the entire file from scratch.
If you write code and you’re interested in trying our new system, give the new version of Atom a try. We’d love to hear your feedback—tweet us at @AtomEditor or if you’ve run into a bug, open an issue.
Last week, GitHub experienced an incident that resulted in degraded service for 24 hours and 11 minutes. While portions of our platform were not affected by this incident, multiple internal systems were affected which resulted in our displaying of information that was out of date and inconsistent. Ultimately, no user data was lost; however manual reconciliation for a few seconds of database writes is still in progress. For the majority of the incident, GitHub was also unable to serve webhook events or build and publish GitHub Pages sites.
All of us at GitHub would like to sincerely apologize for the impact this caused to each and every one of you. We’re aware of the trust you place in GitHub and take pride in building resilient systems that enable our platform to remain highly available. With this incident, we failed you, and we are deeply sorry. While we cannot undo the problems that were created by GitHub’s platform being unusable for an extended period of time, we can explain the events that led to this incident, the lessons we’ve learned, and the steps we’re taking as a company to better ensure this doesn’t happen again.
The majority of user-facing GitHub services are run within our own data center facilities. The data center topology is designed to provide a robust and expandable edge network that operates in front of several regional data centers that power our compute and storage workloads. Despite the layers of redundancy built into the physical and logical components in this design, it is still possible that sites will be unable to communicate with each other for some amount of time.
At 22:52 UTC on October 21, routine maintenance work to replace failing 100G optical equipment resulted in the loss of connectivity between our US East Coast network hub and our primary US East Coast data center. Connectivity between these locations was restored in 43 seconds, but this brief outage triggered a chain of events that led to 24 hours and 11 minutes of service degradation.
In the past, we’ve discussed how we use MySQL to store GitHub metadata as well as our approach to MySQL High Availability. GitHub operates multiple MySQL clusters varying in size from hundreds of gigabytes to nearly five terabytes, each with up to dozens of read replicas per cluster to store non-Git metadata, so our applications can provide pull requests and issues, manage authentication, coordinate background processing, and serve additional functionality beyond raw Git object storage. Different data across different parts of the application is stored on various clusters through functional sharding.
To improve performance at scale, our applications will direct writes to the relevant primary for each cluster, but delegate read requests to a subset of replica servers in the vast majority of cases. We use Orchestrator to manage our MySQL cluster topologies and handle automated failover. Orchestrator considers a number of variables during this process and is built on top of Raft for consensus. It’s possible for Orchestrator to implement topologies that applications are unable to support, therefore care must be taken to align Orchestrator’s configuration with application-level expectations.
During the network partition described above, Orchestrator, which had been active in our primary data center, began a process of leadership deselection, according to Raft consensus. The US West Coast data center and US East Coast public cloud Orchestrator nodes were able to establish a quorum and start failing over clusters to direct writes to the US West Coast data center. Orchestrator proceeded to organize the US West Coast database cluster topologies. When connectivity was restored, our application tier immediately began directing write traffic to the new primaries in the West Coast site.
The database servers in the US East Coast data center contained a brief period of writes that had not been replicated to the US West Coast facility. Because the database clusters in both data centers now contained writes that were not present in the other data center, we were unable to fail the primary back over to the US East Coast data center safely.
Our internal monitoring systems began generating alerts indicating that our systems were experiencing numerous faults. At this time there were several engineers responding and working to triage the incoming notifications. By 23:02 UTC, engineers in our first responder team had determined that topologies for numerous database clusters were in an unexpected state. Querying the Orchestrator API displayed a database replication topology that only included servers from our US West Coast data center.
By this point the responding team decided to manually lock our internal deployment tooling to prevent any additional changes from being introduced. At 23:09 UTC, the responding team placed the site into yellow status. This action automatically escalated the situation into an active incident and sent an alert to the incident coordinator. At 23:11 UTC the incident coordinator joined and two minutes later made the decision change to status red.
It was understood at this time that the problem affected multiple database clusters. Additional engineers from GitHub’s database engineering team were paged. They began investigating the current state in order to determine what actions needed to be taken to manually configure a US East Coast database as the primary for each cluster and rebuild the replication topology. This effort was challenging because by this point the West Coast database cluster had ingested writes from our application tier for nearly 40 minutes. Additionally, there were the several seconds of writes that existed in the East Coast cluster that had not been replicated to the West Coast and prevented replication of new writes back to the East Coast.
Guarding the confidentiality and integrity of user data is GitHub’s highest priority. In an effort to preserve this data, we decided that the 30+ minutes of data written to the US West Coast data center prevented us from considering options other than failing-forward in order to keep user data safe. However, applications running in the East Coast that depend on writing information to a West Coast MySQL cluster are currently unable to cope with the additional latency introduced by a cross-country round trip for the majority of their database calls. This decision would result in our service being unusable for many users. We believe that the extended degradation of service was worth ensuring the consistency of our users’ data.
It was clear through querying the state of the database clusters that we needed to stop running jobs that write metadata about things like pushes. We made an explicit choice to partially degrade site usability by pausing webhook delivery and GitHub Pages builds instead of jeopardizing data we had already received from users. In other words, our strategy was to prioritize data integrity over site usability and time to recovery.
Engineers involved in the incident response team began developing a plan to resolve data inconsistencies and implement our failover procedures for MySQL. Our plan was to restore from backups, synchronize the replicas in both sites, fall back to a stable serving topology, and then resume processing queued jobs. We updated our status to inform users that we were going to be executing a controlled failover of an internal data storage system.
While MySQL data backups occur every four hours and are retained for many years, the backups are stored remotely in a public cloud blob storage service. The time required to restore multiple terabytes of backup data caused the process to take hours. A significant portion of the time was consumed transferring the data from the remote backup service. The process to decompress, checksum, prepare, and load large backup files onto newly provisioned MySQL servers took the majority of time. This procedure is tested daily at minimum, so the recovery time frame was well understood, however until this incident we have never needed to fully rebuild an entire cluster from backup and had instead been able to rely on other strategies such as delayed replicas.
A backup process for all affected MySQL clusters had been initiated by this time and engineers were monitoring progress. Concurrently, multiple teams of engineers were investigating ways to speed up the transfer and recovery time without further degrading site usability or risking data corruption.
Several clusters had completed restoration from backups in our US East Coast data center and begun replicating new data from the West Coast. This resulted in slow site load times for pages that had to execute a write operation over a cross-country link, but pages reading from those database clusters would return up-to-date results if the read request landed on the newly restored replica. Other larger database clusters were still restoring.
Our teams had identified ways to restore directly from the West Coast to overcome throughput restrictions caused by downloading from off-site storage and were increasingly confident that restoration was imminent, and the time left to establishing a healthy replication topology was dependent on how long it would take replication to catch up. This estimate was linearly interpolated from the replication telemetry we had available and the status page was updated to set an expectation of two hours as our estimated time of recovery.
GitHub published a blog post to provide more context. We use GitHub Pages internally and all builds had been paused several hours earlier, so publishing this took additional effort. We apologize for the delay. We intended to send this communication out much sooner and will be ensuring we can publish updates in the future under these constraints.
All database primaries established in US East Coast again. This resulted in the site becoming far more responsive as writes were now directed to a database server that was co-located in the same physical data center as our application tier. While this improved performance substantially, there were still dozens of database read replicas that were multiple hours delayed behind the primary. These delayed replicas resulted in users seeing inconsistent data as they interacted with our services. We spread the read load across a large pool of read replicas and each request to our services had a good chance of hitting a read replica that was multiple hours delayed.
In reality, the time required for replication to catch up had adhered to a power decay function instead of a linear trajectory. Due to increased write load on our database clusters as users woke up and began their workday in Europe and the US, the recovery process took longer than originally estimated.
By now, we were approaching peak traffic load on GitHub.com. A discussion was had by the incident response team on how to proceed. It was clear that replication delays were increasing instead of decreasing towards a consistent state. We’d begun provisioning additional MySQL read replicas in the US East Coast public cloud earlier in the incident. Once these became available it became easier to spread read request volume across more servers. Reducing the utilization in aggregate across the read replicas allowed replication to catch up.
Once the replicas were in sync, we conducted a failover to the original topology, addressing the immediate latency/availability concerns. As part of a conscious decision to prioritize data integrity over a shorter incident window, we kept the service status red while we began processing the backlog of data we had accumulated.
During this phase of the recovery, we had to balance the increased load represented by the backlog, potentially overloading our ecosystem partners with notifications, and getting our services back to 100% as quickly as possible. There were over five million hook events and 80 thousand Pages builds queued.
As we re-enabled processing of this data, we processed ~200,000 webhook payloads that had outlived an internal TTL and were dropped. Upon discovering this, we paused that processing and pushed a change to increase that TTL for the time being.
To avoid further eroding the reliability of our status updates, we remained in degraded status until we had completed processing the entire backlog of data and ensured that our services had clearly settled back into normal performance levels.
All pending webhooks and Pages builds had been processed and the integrity and proper operation of all systems had been confirmed. The site status was updated to green.
During our recovery, we captured the MySQL binary logs containing the writes we took in our primary site that were not replicated to our West Coast site from each affected cluster. The total number of writes that were not replicated to the West Coast was relatively small. For example, one of our busiest clusters had 954 writes in the affected window. We are currently performing an analysis on these logs and determining which writes can be automatically reconciled and which will require outreach to users. We have multiple teams engaged in this effort, and our analysis has already determined a category of writes that have since been repeated by the user and successfully persisted. As stated in this analysis, our primary goal is preserving the integrity and accuracy of the data you store on GitHub.
In our desire to communicate meaningful information to you during the incident, we made several public estimates on time to repair based on the rate of processing of the backlog of data. In retrospect, our estimates did not factor in all variables. We are sorry for the confusion this caused and will strive to provide more accurate information in the future.
There are a number of technical initiatives that have been identified during this analysis. As we continue to work through an extensive post-incident analysis process internally, we expect to identify even more work that needs to happen.
Adjust the configuration of Orchestrator to prevent the promotion of database primaries across regional boundaries. Orchestrator’s actions behaved as configured, despite our application tier being unable to support this topology change. Leader-election within a region is generally safe, but the sudden introduction of cross-country latency was a major contributing factor during this incident. This was emergent behavior of the system given that we hadn’t previously seen an internal network partition of this magnitude.
We have accelerated our migration to a new status reporting mechanism that will provide a richer forum for us to talk about active incidents in crisper and clearer language. While many portions of GitHub were available throughout the incident, we were only able to set our status to green, yellow, and red. We recognize that this doesn’t give you an accurate picture of what is working and what is not, and in the future will be displaying the different components of the platform so you know the status of each service.
In the weeks prior to this incident, we had started a company-wide engineering initiative to support serving GitHub traffic from multiple data centers in an active/active/active design. This project has the goal of supporting N+1 redundancy at the facility level. The goal of that work is to tolerate the full failure of a single data center failure without user impact. This is a major effort and will take some time, but we believe that multiple well-connected sites in a geography provides a good set of trade-offs. This incident has added urgency to the initiative.
We will take a more proactive stance in testing our assumptions. GitHub is a fast growing company and has built up its fair share of complexity over the last decade. As we continue to grow, it becomes increasingly difficult to capture and transfer the historical context of trade-offs and decisions made to newer generations of Hubbers.
This incident has led to a shift in our mindset around site reliability. We have learned that tighter operational controls or improved response times are insufficient safeguards for site reliability within a system of services as complicated as ours. To bolster those efforts, we will also begin a systemic practice of validating failure scenarios before they have a chance to affect you. This work will involve future investment in fault injection and chaos engineering tooling at GitHub.
We know how much you rely on GitHub for your projects and businesses to succeed. No one is more passionate about the availability of our services and the correctness of your data. We will continue to analyze this event for opportunities to serve you better and earn the trust you place in us.
Several years ago we started scanning all pushes to public repositories for GitHub OAuth tokens and personal access tokens. Now we’re extending this capability to include tokens from cloud service providers and additional credentials, such as unencrypted SSH private keys associated with a user’s GitHub account.
We live in amazing times for software development. Capabilities that were once only available to large technology companies are now accessible to the smallest of startups. Developers can leverage cloud services to quickly perform continuous integration testing, deploy their code to fully scalable infrastructure, accept credit card payments from customers, and nearly anything else you can imagine.
Composing cloud services like this is the norm going forward, but it comes with inherent security complexities. Each cloud service a developer typically uses requires one or more credentials, often in the form of API tokens. In the wrong hands, they can be used to access sensitive customer data—or vast computing resources for mining cryptocurrency, presenting significant risks to both users and cloud service providers.
diff --git a/config.yml b/config.yml index 1110370..da7b5de 100644 --- a/config.yml +++ b/config.yml @@ -1,2 +1,2 @@ production: - github_token: 4e862a8ec12ef0ad7c1337e8b16d98f4d764b8f6 + github_token: <%= ENV["GITHUB_TOKEN"] %>
Fig 1: Your master branch might look innocent, but exposed credentials in your project’s history could prove costly.
GitHub, and our users, aren’t immune to this problem. That is why we started scanning for GitHub OAuth tokens in public repositories several years ago. But, our users are not just GitHub customers. They are also customers of cloud infrastructure providers, cloud payment processing providers, and other cloud service providers that have become commonplace in modern development.
Our existing solution focused on GitHub OAuth tokens exclusively and was never designed for extensibility. The existing code leveraged hand-tuned assembly that was extremely fast at finding 40-hex character strings (the format of GitHub OAuth tokens). This bit of code was patched into Git and run inline whenever code was pushed to GitHub. It was an amazing piece of work, but could not support multiple credential formats. Our vision was to support all of the popular cloud service providers.
So began our journey into the next generation of GitHub Token Scanning. The obvious path to a more extensible scanner is some form of regular expression support. However, scanning for credentials using typical regular expression libraries doesn’t scale from a performance perspective, as they optimize for a slightly different problem than the one we have.
The vast majority of regular expression libraries are designed to return the first match in a set of patterns. Given we have at least one pattern for each cloud service provider, we require all matches are returned and not just the first. The only way to ensure this with traditional libraries is to scan a given input once for each pattern. However, this increases the scan time dramatically for large repositories or large sets of patterns. Fortunately, scanning Git data for credentials is just a specific case of a general problem. For example, high-performance application-level firewalls similarly need to scan high-volume network traffic for sets of patterns to identify known viruses or malware. If you squint, scanning high-volume Git push data for credentials is a very similar problem.
Our research eventually lead us to a GitHub repository hosting the amazing Hyperscan library by Intel. This library is incredibly performant and provides exactly what we need. We will explore the technical details in more depth in a follow-up engineering post. But, in short, Hyperscan let us replace all of the assembly code patches to Git with a new standalone scanner, written in Go, that has scaled nicely.
In parallel with working on the implementation, we reached out to several cloud service providers we thought would be interested in testing out Token Scanning in a private beta. They were all enthusiastic to participate, as many of them had contacted us in the past looking for a solution to this widespread problem.
Since April, we’ve worked with cloud service providers in private beta to scan all changes to public repositories and public Gists for credentials (GitHub doesn’t scan private code). Each candidate credential is sent to the provider, including some basic metadata such as the repository name and the commit that introduced the credential. The provider can then validate the credential and decide if the credential should be revoked depending on the associated risks to the user or provider. Either way, the provider typically contacts the owner of the credential, letting them know what occurred and what action was taken.
We have received amazing feedback from both providers and users during the private beta. Cloud service providers have told us that GitHub Token Scanning has been tremendously effective in helping them identify credentials before malicious users. And, while GitHub users were not aware that Token Scanning was in beta, we did take notice of a tweet from a GitHub user that included a number of enthusiastic exclamation marks and 🔥 emojis. This user was extremely grateful for having received a notification from a participating cloud service provider less than a minute after they had accidentally pushed a highly sensitive credential to a public repository.
During the beta we have scanned millions of public repository changes and identified millions of candidate credentials. As announced yesterday at GitHub Universe, Token Scanning is now in public beta, and supports an increasing number of cloud providers. We’re excited by the impact of Token Scanning today and have lots of ideas about how to make it even more powerful in the future. Dealing with credentials is an unavoidable part of modern development. With GitHub by your side, we hope to minimize the security impact of such accidents.
There are many places a project can publicize security fixes within a new version: the CVE feed, various mailing lists and open source groups, or even within its release notes or changelog. Regardless of how projects share this information, some developers within the GitHub community will see the advisory and immediately bump their required versions of the dependency to a known safe version. If detected, we can use the information in these commits to generate security alerts for vulnerabilities which may not have been published in the CVE feed.
On an average day, the dependency graph can track around 10,000 commits to dependency files for any of our supported languages. We can’t manually process this many commits. Instead, we depend on machine intelligence to sift through them and extract those that might be related to a security release.
For this purpose, we created a machine learning model that scans text associated with public commits (the commit message and linked issues or pull requests) to filter out those related to possible security upgrades. With this smaller batch of commits, the model uses the diff to understand how required version ranges have changed. Then it aggregates across a specific timeframe to get a holistic view of all dependencies that a security release might affect. Finally, the model outputs a list of packages and version ranges it thinks require an alert and currently aren’t covered by any known CVE in our system.
No machine learning model is perfect. While machine intelligence can sift through thousands of commits in an instant, this anomaly-detection algorithm will still generate false positives for packages where no security patch was released. Security alert quality is a focus for us, so we review all model output before the community receives an alert.
Interested in learning more? Join us at GitHub Universe next week to explore the connections that push technology forward and keep projects secure through talks, trainings, and workshops. Tune in to the blog October 16-17 for more updates and announcements.