13M developers write 14K lines of code each per year, touching sensitive data 16,847,298 times per year.
If you need to understand how important, but also how difficult, it is to pinpoint sensitive data risks in a modern application stack, that is the number to keep in mind.
In an effort to better explain the urgency of data security, we went in search of tangible numbers and came up with those above. But how did we arrive at them? Let’s take a look.
Step 1: Analyzing two billion lines of code
We started by selecting roughly the top 15K Open Source projects on GitHub (16,323 to be precise), covering a mix of technology stacks including JavaScript/TS, Ruby, Go, Python, PHP, Java, and C#. On each project, we ran our static code analysis tool, which detects sensitive data throughout source code. This tool is able to detect 120+ different data types including Personal Data (PD), Personally Identifiable Information (PII), Protected Health Information (PHI), and Financial Data.
In total, the 15K repositories analyzed represent over two billion lines of code.
On average, the projects analyzed contain 132 KLOC (thousand lines of code) and process sensitive data 102 times. That means sensitive data is processed roughly once every 1.3 KLOC.
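To make the arithmetic explicit, here is a minimal sketch of that density calculation. The variable names are ours, chosen for illustration; only the two averages come from the analysis.

```python
# Averages reported for the full dataset.
avg_kloc_per_project = 132          # thousand lines of code per project
avg_occurrences_per_project = 102   # sensitive data occurrences per project

# Thousands of lines of code written per sensitive data occurrence.
kloc_per_occurrence = avg_kloc_per_project / avg_occurrences_per_project

print(f"One occurrence every {kloc_per_occurrence:.1f} KLOC")
```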
With that in mind, we acknowledge that the standard deviation is large, both in terms of lines of code and sensitive data occurrences. It’s not unusual for applications to include code unrelated to data processing, such as external libraries and boilerplate code.
To better reflect reality, we excluded 5% at both the high and low ends of the dataset. Under these conditions, the projects analyzed have an average of 60 KLOC and process sensitive data 70 times. That is sensitive data processed roughly once every 0.9 KLOC.
This number is pretty conservative. Here are some caveats that help explain why it is likely even higher in production applications:
- This data set is composed only of Open Source projects, which usually process less sensitive data than private projects such as a CRM, an eCommerce site, or a medical appointment application.
- A certain number of lines of code in every project is “boilerplate”, most of the time generated by default by whichever framework the developers chose. This is not “business logic” code and usually doesn’t touch sensitive data. We haven’t excluded every one of these lines from the numbers provided, even though we’ve eliminated the 5% outliers.
Ultimately, we believe that most organizations will find higher numbers across their private applications’ code.
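The outlier exclusion in Step 1 can be sketched as a trimmed average. This is a rough illustration rather than the exact procedure we ran: the helper name `trimmed`, the choice to cut 5% from each end after sorting, and the sample values are all assumptions made up for the example.

```python
from statistics import mean

def trimmed(values, percent=5):
    """Sort the values and drop `percent`% of them from each end."""
    ordered = sorted(values)
    k = len(ordered) * percent // 100   # items to cut from each end
    return ordered[k:len(ordered) - k] if k else ordered

# Hypothetical per-project KLOC counts, invented for illustration only.
sample_kloc = [1, 8, 12, 15, 20, 22, 25, 30, 31, 35,
               40, 42, 48, 55, 60, 75, 90, 120, 300, 5000]

kept = trimmed(sample_kloc)
print("mean of all projects:", mean(sample_kloc))
print("mean after trimming: ", mean(kept))

# Density from the trimmed averages reported in Step 1 (60 KLOC, 70 occurrences).
trimmed_kloc_per_occurrence = 60 / 70
print(f"One occurrence every {trimmed_kloc_per_occurrence:.1f} KLOC")
```

A single very large project can dominate a plain average, which is why the trimmed figure is much lower than the untrimmed one in this made-up sample.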
Step 2: How many developers write code?
According to GitHub, more than 20.5M new developers joined the platform this year alone, bringing the total to 94M registered developers. Not every GitHub developer is a professional, though.
According to Developer Nation, there are about 13M software professionals:
“We estimate that there were 24.3M active software developers in the world at the start of 2021, out of which 13M are software professionals. There is an increase in the developer population of 3M developers since mid-2020, or an annual growth that hovers around 20%. Out of those 24.3M, two in three are below 35 years of age. We can expect that the developer population will more than double in the next decade to about 45M in 2030.”
We chose the lower number of 13M developers as it represents professionals, even though security should not be a concern for professionals only.
Step 3: How many lines of code are written every year?
Let’s be clear: there is no official data here. This is a heavily debated topic where we often hear quotes like “Language X is more verbose than Y” and “Good developers don’t write a lot of lines of code.” Ultimately it’s the law of large numbers that interests us, and we can make educated assumptions about it.
Sage McEnery ran an interesting analysis of this number in 2020. They started from the estimate that a junior developer writes about 100 LoC per day and that this number tends to diminish with each year of experience, resulting in the distribution below:
Sage mapped the percentage of developers to each year of experience. Applying that distribution to the 13M developers we arrived at earlier, we calculated that developers write 14 KLOC per year on average, for a total of 18,719,220,000 lines of code.
We count a year as the number of working days, about 253 days in 2022 (depending on where you live!), minus 20 days of vacation.
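The shape of that calculation is a weighted average over experience levels, multiplied by the number of days worked. The sketch below shows the idea only. The 100 LoC per day junior figure and the 253 minus 20 working days come from the text; every other share and rate in `buckets` is a made-up placeholder, not the distribution from Sage’s analysis.

```python
WORKING_DAYS = 253 - 20  # working days in 2022 minus vacation days

# (share of developers in %, lines of code per day) per experience bucket.
# Placeholder values for illustration, apart from the 100 LoC/day junior rate.
buckets = [
    (30, 100),  # junior developers
    (40, 60),   # hypothetical mid-level rate
    (30, 30),   # hypothetical senior rate
]

def average_loc_per_year(buckets, working_days):
    """Weighted average of daily output, scaled to a working year."""
    total_share = sum(share for share, _ in buckets)
    weighted_daily = sum(share * loc for share, loc in buckets) / total_share
    return weighted_daily * working_days

print("working days per year:", WORKING_DAYS)
print("average LoC per developer per year:",
      round(average_loc_per_year(buckets, WORKING_DAYS)))
```

Swapping in a different distribution changes the result directly, which is why this step relies on educated assumptions rather than official data.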
Step 4: Putting everything together
Our previous steps provide everything we need to finalize our estimate:
- Sensitive data is found every 0.9 KLOC.
- Developers write a total of 18,719,220,000 lines of code every year, 14 KLOC each.
- There are 13M professional developers worldwide.
Sensitive data occurrences = (Number of lines of code written every year / 1,000) × 0.9
(18,719,220,000 / 1,000) × 0.9 = 16,847,298 sensitive data occurrences
As a result, sensitive data is invoked directly in lines of code 16,847,298 times per year.
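For readers who want to check the figure, this short sketch reproduces the arithmetic above from the Step 1 and Step 3 inputs. The variable names are ours. It also computes, purely as an assumption about how the 0.9 figure could be read, what the count would be if 0.9 KLOC were treated strictly as the spacing between occurrences.

```python
total_loc_per_year = 18_719_220_000   # total lines of code per year, from Step 3
density_figure = 0.9                  # KLOC figure from Step 1

total_kloc = total_loc_per_year // 1000

# The calculation as written above: KLOC multiplied by 0.9.
occurrences_as_published = round(total_kloc * density_figure)
print(f"{occurrences_as_published:,} sensitive data occurrences")

# Alternative reading (an assumption, not our published figure): one occurrence
# per 0.9 KLOC would mean dividing by 0.9, which yields a higher count.
occurrences_strict_rate = round(total_kloc / density_figure)
print(f"{occurrences_strict_rate:,} under the strict one-per-0.9-KLOC reading")
```

Under either reading the order of magnitude is the same, and the published number is the lower of the two.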
The impact is massive
Sensitive data being invoked in developer-written code nearly 17 million times per year is an extremely high figure, since each occurrence is an opportunity for a leak or breach if handled incorrectly. It is also very small compared to the 19 billion lines of code written every year, but it certainly opened our eyes to the scope of the problem.
This is the big challenge of security today: the most sensitive asset to protect is also the most difficult to monitor. It almost feels like finding a needle in a haystack.
At Bearer, we believe that only data-first security solutions, which start from sensitive data and pull the thread to the associated risks and vulnerabilities, can strike the balance between securing sensitive data and not burying developers in a pile of equal-priority issues.
We’re close to announcing a solution that makes this approach even easier. Sign up today to be one of the first to try it.