Protect Your Enterprise from GitHub® Data Leaks

5 min read • July 23, 2020

GitHub has become one of the most popular development platforms for both enterprises and individuals over the past ten years, as the software is relatively easy to use and new features are constantly being added. Unfortunately, it has also become a rich source of sensitive data leaks. According to ZDNet in early 2019, “A scan of billions of files from 13 percent of all GitHub public repositories over a period of six months has revealed that over 100,000 repos have leaked API tokens and cryptographic keys, with thousands of new repositories leaking new secrets on a daily basis.” Reports detailing GitHub data leaks are not new. Articles date back as far as 2014, when ITNews reported that “AWS urge[d] developers to scrub GitHub of secret keys.” These publications and many others have alerted enterprises that digital risk protection goes beyond addressing malicious actors. Businesses must be equally vigilant in detecting and remediating breaches resulting from negligence.

GitHub data leak sources

Despite the popularity of this platform, there are drawbacks. One is that GitHub has become a source of exposed sensitive data that can lead to major breaches. CybelAngel analysts regularly see a variety of GitHub leaks, and many of them stem from accidental publication. One of the most common issues arises when users realize they have inadvertently published sensitive data. The offending user may subsequently delete the data, but GitHub is designed to keep track of historical modifications of published code, so the sensitive data remains publicly accessible in the repository’s history. While there are ways to permanently delete it, simply deleting the data is not sufficient to stop it from leaking. GitHub data leaks are as varied as the people who use the software. Examples of how these data leaks occur and the data exposed include:

An employee uploads a production database to solicit enhancement feedback and accidentally exposes PII test data for a new pharmaceutical drug
Another employee shares a newly developed script for recurrent tasks, for example extracting statistics from a database
A supplier’s developer shares sensitive data related to a new chip prototype
A client’s employee publishes confidential information regarding a new product or service their company is testing

Unless properly secured, all of these scenarios are potential causes of data leaks that could result in a major breach.
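
To illustrate why deleting a file is not enough, here is a minimal Python sketch that looks for secret-looking lines that were ever added anywhere in a repository's history. It assumes the input is the text produced by `git log -p --all`; the function name, the pattern, and the sample history are hypothetical and chosen only for illustration, and real scanners rely on much larger rule sets.

```python
import re

# Hypothetical pattern for illustration; not an exhaustive rule set.
SECRET_PATTERN = re.compile(
    r"(password|secret|api[_-]?key|token)\s*[:=]\s*\S+", re.IGNORECASE
)


def secrets_in_history(patch_text):
    """Return (commit, line) pairs for secret-looking lines ever added.

    `patch_text` is assumed to be the output of `git log -p --all`.
    """
    findings = []
    commit = None
    for line in patch_text.splitlines():
        if line.startswith("commit "):
            # Remember which commit the following diff lines belong to.
            commit = line.split()[1]
        elif line.startswith("+") and not line.startswith("+++"):
            # A "+" line is content that some commit added at some point.
            added = line[1:]
            if SECRET_PATTERN.search(added):
                findings.append((commit, added.strip()))
    return findings


# A made-up history: the password is added in one commit and deleted in the next.
SAMPLE_LOG = """\
commit bbb2222
    Remove password

--- a/config.py
+++ b/config.py
@@ -1,2 +1 @@
 host = "db.internal"
-password = "hunter2"
commit aaa1111
    Add config

--- /dev/null
+++ b/config.py
@@ -0,0 +1,2 @@
+host = "db.internal"
+password = "hunter2"
"""

if __name__ == "__main__":
    for commit, line in secrets_in_history(SAMPLE_LOG):
        print(commit, line)
```

Even though the latest commit in the sample removes the password, the earlier commit still carries it, which is why a later deletion alone does not remove the exposure.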

The magnitude of GitHub data leaks

When we talk about data leaks on development platforms, we often think first of sensitive code. That is exactly what turns up when searching for “password”, or any other word commonly used to name credentials, in the GitHub search bar. In 2015, researchers from an IBM institute produced a research paper that delved into the detection and mitigation of these secret-lock credentials. They investigated search words designed to expose credentials, simulating the way attackers would and could search for this information in order to learn how to better prevent leaks.

The IBM institute’s study was relevant in 2015, and it is even more relevant today. Credentials that provide access to the internal development environments of companies all around the world are discovered on a daily basis. These leaked passwords make it possible to access sensitive data, devices, databases, AWS S3 buckets, or even users’ corporate or personal emails. Cybercriminals can retrieve and weaponize these compromised credentials.

It might be surprising to learn that an internal database can leak online even if it is correctly configured and should not be accessible. When this happens, not only are the database configuration and requests exposed, but part of the database itself can be as well. For example, a production database sample may be downloaded and then uploaded to the platform to show how it is supposed to work or to solicit comments from collaborators for enhancements. Once this is done, the database and associated data can be publicly accessible. CybelAngel analysts find PII (personally identifiable information) of employees, lists of clients, or even lists of projects leaking from such GitHub incidents.

CybelAngel analysts have also detected uploaded .csv files, which are usually part of a database export, similar to the example above and leaking the same types of data. These files might be used by an employee or a supplier who works on code testing and uses real data for their tests. These .csv files often contain lists of an application’s users, clients’ reviews of a product or service, or even online shopping purchases (with buyers’ PII).

Software development firms have learned that source code they believed was confidential can be found on GitHub. The painful lesson is that using GitHub to collaborate can cost a company its competitive edge should its source code be exposed. Worse yet, there is the risk of a significant leak of credentials and the associated consequences.

The ultimate irony is that malicious actors often use GitHub to share news about their activities with the public. Recently, a group of hackers published a list of leaked emails on GitHub. They also publish vulnerability discoveries on the platform, which other hackers then typically use to attack the targeted company.
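
The keyword-style search for credentials described earlier in this section can be approximated with a short Python sketch. This is not the method from the IBM study; it is a simplified illustration. The rule names and the keyword list are assumptions, and the sample key is the placeholder value AWS uses in its own documentation.

```python
import re

# Illustrative rules only. The AWS access key ID shape ("AKIA" followed by
# 16 uppercase letters or digits) is widely documented; the keyword rule is
# a deliberate simplification for this sketch.
RULES = {
    "keyword_assignment": re.compile(
        r"(password|passwd|secret|token)\w*\s*[:=]\s*\S+", re.IGNORECASE
    ),
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
}


def find_credentials(text):
    """Return (line_number, rule_name) for every line that matches a rule."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for rule, pattern in RULES.items():
            if pattern.search(line):
                hits.append((lineno, rule))
    return hits


# Made-up file content; the key is AWS's documented example value.
SAMPLE = '''\
db_host = "10.0.0.5"
password = "s3cr3t!"
aws_key = "AKIAIOSFODNN7EXAMPLE"
'''

if __name__ == "__main__":
    for lineno, rule in find_credentials(SAMPLE):
        print(f"line {lineno}: {rule}")
```

Running the same kind of search against your own repositories before an attacker does is the defensive counterpart of the search technique described above.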

Avoiding GitHub data leaks

GitHub is a widely used and valuable software development platform; however, enterprises need to implement processes and procedures to ensure the safety of their data.

Train users about the limitations of private mode and the potential for credentials to be compromised. Making this training part of the company’s overall security program is key to stopping data leaks.
Limit the types of data, such as product test data, PII, and secret keys, that can be checked into GitHub. This will lower the potential risk of accidental leaks of sensitive information.
Invest in tools that stop the publication of code containing sensitive information or detect sensitive data that may have been inadvertently published on the platform. Any number of such tools are available, and they can stop or mitigate a data leak.
CybelAngel analysts recommend using GitHub private mode to avoid exposing sensitive data, as it is one of the most efficient methods of keeping data confidential. However, users often prefer public mode, as the objective of the platform is to collaborate. If a user realizes that sensitive data has been exposed, they can switch to private mode to avoid further dissemination of the information.
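
As a rough illustration of limiting the types of data that can be checked in, the following Python sketch flags files whose type suggests a database export or key material. The blocked extensions and file names are a hypothetical policy, not a CybelAngel or GitHub recommendation; in a pre-commit hook, the file list would typically come from `git diff --cached --name-only`.

```python
from pathlib import PurePosixPath

# Hypothetical policy: extensions and file names that should not be committed.
BLOCKED_SUFFIXES = {".csv", ".sql", ".pem", ".key"}
BLOCKED_NAMES = {".env", "id_rsa"}


def blocked_files(paths):
    """Return the subset of `paths` that the policy above forbids."""
    flagged = []
    for raw in paths:
        path = PurePosixPath(raw)
        # Compare the extension case-insensitively and the file name exactly.
        if path.suffix.lower() in BLOCKED_SUFFIXES or path.name in BLOCKED_NAMES:
            flagged.append(raw)
    return flagged


if __name__ == "__main__":
    # Made-up list of staged files for demonstration.
    staged = ["src/app.py", "exports/users.CSV", ".env", "keys/id_rsa"]
    for name in blocked_files(staged):
        print(f"blocked: {name}")
```

A check like this only looks at file names, so it complements rather than replaces tools that inspect file contents.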

Lastly, enterprises choose CybelAngel for digital risk protection across all layers of the internet, including their teams’ and their third parties’ use of GitHub. By detecting data leaks quickly and working with clients to remediate critical data leaks, especially on code-sharing platforms such as GitHub, CybelAngel helps keep its clients’ data leaks from becoming major breaches.

About the author

Benedicte Matran

Bénédicte Matran is Head of Marketing at CybelAngel, a cybersecurity SaaS company specializing in external attack surface management, with over 10 years of B2B marketing leadership across account-based marketing, growth, and field marketing. Based in Paris, she is an expert in enterprise go-to-market strategy, demand generation, and scaling marketing functions in high-growth tech environments.
