BLOG POST

kolega.dev - OWASP Benchmark Results

We scored +87.4% on OWASP's industry-standard security benchmark, more than double the next-best published score. Here's the full breakdown with methodology and raw results.
February 2026 | 12 min read
Faizan
Tags: OWASP Benchmark, Kolega.dev, SAST Comparison, AI Security, Vulnerability Detection

Executive Summary

kolega.dev was evaluated against the OWASP Benchmark Project, the industry-standard test suite for measuring the accuracy of software vulnerability detection tools. kolega.dev achieved a score of +87.4% on the Java Benchmark v1.2 and +90.9% on the Python Benchmark v0.1 - more than double the highest published score from any other tool on the Java benchmark.

Because kolega.dev is an AI-agent-based platform, its analysis is non-deterministic. Each benchmark was run three independent times to demonstrate consistency. The best result from each set is reported as the headline score, with all individual runs disclosed below. Scores varied by less than two percentage points between runs (+85.7% to +87.4% on Java, +89.6% to +90.9% on Python), which is expected given the inherent nature of AI-driven analysis.


About the OWASP Benchmark

The OWASP Benchmark is an open-source test suite, maintained by the OWASP Foundation, for evaluating the accuracy of security vulnerability detection tools. It contains thousands of intentionally crafted test cases - each either a real, exploitable vulnerability or a deliberate false positive designed to trick scanners.

Java Benchmark v1.2 - 2,740 test cases across 11 vulnerability categories: Command Injection, Weak Cryptography, Weak Hashing, LDAP Injection, Path Traversal, Insecure Cookie, SQL Injection, Trust Boundary Violation, Weak Randomness, XPath Injection, Cross-Site Scripting (XSS)

Python Benchmark v0.1 - 1,230 test cases across 14 vulnerability categories: Command Injection, Code Injection, Insecure Deserialization, Weak Hashing, LDAP Injection, Path Traversal, Open Redirect, Insecure Cookie, SQL Injection, Trust Boundary Violation, Weak Randomness, XPath Injection, Cross-Site Scripting (XSS), XML External Entity (XXE)

The benchmark is language-agnostic in methodology and has been used by both open-source and commercial vendors to evaluate their tools. The official scorecard publishes results for open-source tools, while commercial tool results are anonymized (SAST-01 through SAST-06).

The Python Benchmark v0.1 is a preliminary release. To date, no other tool has published results against it.
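
Since every test case is labeled as either a real vulnerability or a deliberate false positive, a tool's findings can be sorted into four buckets (TP, FN, FP, TN). The snippet below is a minimal sketch of that bookkeeping, not the benchmark's own scoring code. The test-case names, labels, and scanner output are made up for illustration, and `classify` is a hypothetical helper.

```python
from collections import Counter

# Hypothetical ground truth: test case name -> True if it is a real vulnerability,
# False if it is a deliberate false positive (safe code that looks suspicious).
expected = {
    "BenchmarkTest00001": True,
    "BenchmarkTest00002": False,
    "BenchmarkTest00003": True,
    "BenchmarkTest00004": False,
}

# Hypothetical scanner output: the test cases the tool reported as vulnerable.
flagged = {"BenchmarkTest00001", "BenchmarkTest00002"}


def classify(expected: dict[str, bool], flagged: set[str]) -> Counter:
    """Count true/false positives and negatives for one tool run."""
    counts = Counter(TP=0, FN=0, FP=0, TN=0)
    for name, is_real in expected.items():
        reported = name in flagged
        if is_real:
            # Real vulnerability: caught (TP) or missed (FN).
            counts["TP" if reported else "FN"] += 1
        else:
            # Safe code: wrongly flagged (FP) or correctly ignored (TN).
            counts["FP" if reported else "TN"] += 1
    return counts


counts = classify(expected, flagged)
print(dict(counts))
```

In this toy input the tool catches one of two real vulnerabilities and wrongly flags one of two safe cases, so each bucket ends up with a count of one.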


Scoring Methodology

All scores use the standard OWASP Benchmark methodology:

| Metric | Formula | Meaning |
|---|---|---|
| TPR (True Positive Rate) | TP / (TP + FN) | Rate of correctly detected real vulnerabilities |
| FPR (False Positive Rate) | FP / (FP + TN) | Rate of incorrectly flagged safe code |
| Score (Youden's Index) | TPR - FPR | Normalized distance from random guess line |


A perfect tool scores +100% (catches every vulnerability, flags nothing safe). A random guess scores 0%. A tool that flags everything scores 0% (100% TPR but also 100% FPR).
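
The three formulas and the edge cases above can be checked with a few lines of Python. This is a small sketch rather than the official scorecard generator: the `Confusion` class is a hypothetical helper, and the counts of 100 are arbitrary illustrative numbers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # real vulnerabilities the tool flagged
    fn: int  # real vulnerabilities the tool missed
    fp: int  # safe test cases the tool flagged
    tn: int  # safe test cases the tool left alone

    @property
    def tpr(self) -> float:
        # True Positive Rate: TP / (TP + FN)
        return self.tp / (self.tp + self.fn)

    @property
    def fpr(self) -> float:
        # False Positive Rate: FP / (FP + TN)
        return self.fp / (self.fp + self.tn)

    @property
    def score(self) -> float:
        # Youden's Index: TPR - FPR
        return self.tpr - self.fpr


# Illustrative tools run against 100 vulnerable and 100 safe test cases.
perfect = Confusion(tp=100, fn=0, fp=0, tn=100)
flag_everything = Confusion(tp=100, fn=0, fp=100, tn=0)
flag_nothing = Confusion(tp=0, fn=100, fp=0, tn=100)

for name, c in [
    ("perfect", perfect),
    ("flag everything", flag_everything),
    ("flag nothing", flag_nothing),
]:
    print(f"{name}: TPR={c.tpr:.1%} FPR={c.fpr:.1%} score={c.score:+.1%}")
```

Flagging everything and flagging nothing both land at a score of zero, which is why the score rewards discrimination between vulnerable and safe code rather than raw detection volume.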


Results: Java Benchmark v1.2 (2,740 test cases)

[Figure: Java chart]

| Tool | TPR | FPR | Score |
|---|---|---|---|
| kolega.dev (best of 3) | 96.9% | 9.5% | +87.4% |
| FBwFindSecBugs v1.4.6 | 96.84% | 57.74% | +39.10% |
| FBwFindSecBugs v1.4.5 | 95.20% | 57.74% | +37.46% |
| FBwFindSecBugs v1.4.4 | 78.77% | 44.64% | +34.13% |
| SonarQube Java Plugin v3.14 | 50.36% | 17.02% | +33.34% |
| SAST-06 (Commercial) | 85.02% | 52.09% | +32.93% |
| SAST-04 (Commercial) | 61.45% | 28.81% | +32.64% |
| FBwFindSecBugs v1.4.3 | 77.60% | 45.21% | +32.39% |
| SAST-02 (Commercial) | 56.13% | 25.53% | +30.60% |
| SAST-03 (Commercial) | 46.33% | 21.44% | +24.89% |
| OWASP ZAP vD-2016-09-05 | 19.95% | 0.12% | +19.84% |
| SAST-05 (Commercial) | 47.74% | 29.03% | +18.71% |
| OWASP ZAP vD-2015-08-24 | 18.03% | 0.04% | +17.99% |
| SAST-01 (Commercial) | 28.96% | 12.22% | +16.74% |
| FBwFindSecBugs v1.4.0 | 47.64% | 35.99% | +11.65% |
| FindBugs v3.0.1 | 5.12% | 5.19% | -0.07% |
| PMD v5.2.3 | 0.00% | 0.00% | +0.00% |


Industry tool results sourced from the official OWASP Benchmark scorecard. SAST-01 through SAST-06 are anonymized commercial tools. Commercial results date from Benchmark v1.1; open-source results from v1.2.


Results: Python Benchmark v0.1 (1,230 test cases)

| Tool | TPR | FPR | Score |
|---|---|---|---|
| kolega.dev (best of 3) | 98.2% | 7.3% | +90.9% |

No other tool has published results against the OWASP Python Benchmark to date.


All Individual Runs

Java Benchmark v1.2

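| Run | TP | FN | FP | TN | TPR | FPR | Score | Raw Results |
|---|---|---|---|---|---|---|---|---|
| 1 | 1371 | 44 | 126 | 1199 | 96.9% | 9.5% | +87.4% | results_1.json |
| 2 | 1360 | 55 | 117 | 1208 | 96.1% | 8.8% | +87.3% | results_2.json |
| 3 | 1334 | 81 | 113 | 1212 | 94.3% | 8.5% | +85.7% | results_3.json |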

Python Benchmark v0.1

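| Run | TP | FN | FP | TN | TPR | FPR | Score | Raw Results |
|---|---|---|---|---|---|---|---|---|
| 1 | 450 | 2 | 76 | 702 | 99.6% | 9.8% | +89.8% | results_1.json |
| 2 | 444 | 8 | 57 | 721 | 98.2% | 7.3% | +90.9% | results_2.json |
| 3 | 452 | 0 | 81 | 697 | 100.0% | 10.4% | +89.6% | results_3.json |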
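
Readers who want to check the arithmetic can recompute every score from the TP/FN/FP/TN counts listed for each run. The sketch below does that and summarizes the spread between runs. The `score` and `summarize` helpers are hypothetical names, and the mean is included only as an extra sanity check; the headline figures in this post are the best-of-3 scores.

```python
from statistics import mean

# (TP, FN, FP, TN) per run, copied from the individual-run tables above.
java_runs = [(1371, 44, 126, 1199), (1360, 55, 117, 1208), (1334, 81, 113, 1212)]
python_runs = [(450, 2, 76, 702), (444, 8, 57, 721), (452, 0, 81, 697)]


def score(tp: int, fn: int, fp: int, tn: int) -> float:
    """Youden's Index: TPR - FPR."""
    return tp / (tp + fn) - fp / (fp + tn)


def summarize(runs: list[tuple[int, int, int, int]]) -> dict[str, float]:
    """Best, worst, and mean score across a set of runs."""
    scores = [score(*run) for run in runs]
    return {"best": max(scores), "worst": min(scores), "mean": mean(scores)}


for label, runs in [("Java v1.2", java_runs), ("Python v0.1", python_runs)]:
    s = summarize(runs)
    print(
        f"{label}: best={s['best']:+.1%} "
        f"worst={s['worst']:+.1%} mean={s['mean']:+.1%}"
    )
```

Each run's four counts also sum to the benchmark's test-case total (2,740 for Java, 1,230 for Python), which is a quick way to confirm that no test case was dropped or double-counted.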


About kolega.dev

kolega.dev is a next-generation AI-powered security analysis platform that performs whole-codebase vulnerability detection. Unlike traditional SAST tools that rely on pattern matching or rule-based engines, kolega.dev uses AI agents to understand code semantics, trace data flow, and reason about vulnerability conditions.

The version of the kolega.dev platform used in this evaluation is not yet publicly released.

If you would like to reproduce these results or evaluate kolega.dev against your own codebase, please reach out to us.



