Executive Summary

kolega.dev was evaluated against the OWASP Benchmark Project, the industry-standard test suite for measuring the accuracy of software vulnerability detection tools. It scored +87.4% on the Java Benchmark v1.2 and +90.9% on the Python Benchmark v0.1. The Java score is more than double the highest published score from any other tool on that benchmark (+39.10%, FBwFindSecBugs v1.4.6).

Because kolega.dev is an AI-agent-based platform, its analysis is non-deterministic. Each benchmark was therefore run three times independently to demonstrate consistency. The best of the three runs is reported as the headline score, and all individual runs are disclosed below. Minor variation between runs is expected and reflects the non-deterministic nature of AI-driven analysis.

About the OWASP Benchmark

The OWASP Benchmark is an open-source test suite, maintained by the OWASP Foundation, that evaluates the accuracy of security vulnerability detection tools. It contains thousands of intentionally crafted test cases, each of which is either a real, exploitable vulnerability or safe code deliberately written to look vulnerable and draw a false positive from scanners.

Java Benchmark v1.2 - 2,740 test cases across 11 vulnerability categories: Command Injection, Weak Cryptography, Weak Hashing, LDAP Injection, Path Traversal, Insecure Cookie, SQL Injection, Trust Boundary Violation, Weak Randomness, XPath Injection, Cross-Site Scripting (XSS)

Python Benchmark v0.1 - 1,230 test cases across 14 vulnerability categories: Command Injection, Code Injection, Insecure Deserialization, Weak Hashing, LDAP Injection, Path Traversal, Open Redirect, Insecure Cookie, SQL Injection, Trust Boundary Violation, Weak Randomness, XPath Injection, Cross-Site Scripting (XSS), XML External Entity (XXE)

The benchmark's methodology is language-agnostic, and both open-source projects and commercial vendors have used it to evaluate their tools. The official scorecard names the open-source tools, while commercial tool results are anonymized (SAST-01 through SAST-06).

The Python Benchmark v0.1 is a preliminary release. To date, no other tool has published results against it.

Scoring Methodology

All scores use the standard OWASP Benchmark methodology:

Metric	Formula	Meaning
TPR (True Positive Rate)	TP / (TP + FN)	Rate of correctly detected real vulnerabilities
FPR (False Positive Rate)	FP / (FP + TN)	Rate of incorrectly flagged safe code
Score (Youden's Index)	TPR - FPR	Normalized distance from random guess line

A perfect tool scores +100% (catches every vulnerability, flags nothing safe). A random guess scores 0%. A tool that flags everything scores 0% (100% TPR but also 100% FPR).
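The formulas in the table can be sketched in a few lines of Python. This is a minimal illustration rather than the official OWASP scoring code: the `Confusion` class and the 100/100 test suite are hypothetical, chosen only to reproduce the three reference points described above (perfect tool, flag-everything tool, and, as an extra case, a tool that flags nothing).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts for one benchmark run (hypothetical helper)."""
    tp: int  # real vulnerabilities that were reported
    fn: int  # real vulnerabilities that were missed
    fp: int  # safe test cases that were flagged
    tn: int  # safe test cases that were correctly left alone

    @property
    def tpr(self) -> float:
        # True Positive Rate: TP / (TP + FN)
        return self.tp / (self.tp + self.fn)

    @property
    def fpr(self) -> float:
        # False Positive Rate: FP / (FP + TN)
        return self.fp / (self.fp + self.tn)

    @property
    def score(self) -> float:
        # Youden's Index: TPR - FPR
        return self.tpr - self.fpr


# Assumed toy suite: 100 vulnerable and 100 safe test cases.
perfect = Confusion(tp=100, fn=0, fp=0, tn=100)
flag_everything = Confusion(tp=100, fn=0, fp=100, tn=0)
flag_nothing = Confusion(tp=0, fn=100, fp=0, tn=100)

for name, c in [
    ("perfect", perfect),
    ("flag everything", flag_everything),
    ("flag nothing", flag_nothing),
]:
    print(f"{name:16s} TPR={c.tpr:.0%}  FPR={c.fpr:.0%}  score={c.score:+.0%}")
```

Both degenerate strategies land on a score of 0%, which is why the metric rewards a tool only for separating vulnerable code from safe code, not for reporting more or fewer findings.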

Results: Java Benchmark v1.2 (2,740 test cases)

Tool	TPR	FPR	Score
kolega.dev (best of 3)	96.9%	9.5%	+87.4%
FBwFindSecBugs v1.4.6	96.84%	57.74%	+39.10%
FBwFindSecBugs v1.4.5	95.20%	57.74%	+37.46%
FBwFindSecBugs v1.4.4	78.77%	44.64%	+34.13%
SonarQube Java Plugin v3.14	50.36%	17.02%	+33.34%
SAST-06 (Commercial)	85.02%	52.09%	+32.93%
SAST-04 (Commercial)	61.45%	28.81%	+32.64%
FBwFindSecBugs v1.4.3	77.60%	45.21%	+32.39%
SAST-02 (Commercial)	56.13%	25.53%	+30.60%
SAST-03 (Commercial)	46.33%	21.44%	+24.89%
OWASP ZAP vD-2016-09-05	19.95%	0.12%	+19.84%
SAST-05 (Commercial)	47.74%	29.03%	+18.71%
OWASP ZAP vD-2015-08-24	18.03%	0.04%	+17.99%
SAST-01 (Commercial)	28.96%	12.22%	+16.74%
FBwFindSecBugs v1.4.0	47.64%	35.99%	+11.65%
FindBugs v3.0.1	5.12%	5.19%	-0.07%
PMD v5.2.3	0.00%	0.00%	+0.00%

Results for the other tools are taken from the official OWASP Benchmark scorecard. SAST-01 through SAST-06 are anonymized commercial tools. Commercial results date from Benchmark v1.1; open-source results are from v1.2.

Results: Python Benchmark v0.1 (1,230 test cases)

Tool	TPR	FPR	Score
kolega.dev (best of 3)	98.2%	7.3%	+90.9%

No other tool has published results against the OWASP Python Benchmark to date.

All Individual Runs

Java Benchmark v1.2

Run	TP	FN	FP	TN	TPR	FPR	Score	Raw Results
1	1371	44	126	1199	96.9%	9.5%	+87.4%	results_1.json
2	1360	55	117	1208	96.1%	8.8%	+87.3%	results_2.json
3	1334	81	113	1212	94.3%	8.5%	+85.7%	results_3.json

Python Benchmark v0.1

Run	TP	FN	FP	TN	TPR	FPR	Score	Raw Results
1	450	2	76	702	99.6%	9.8%	+89.8%	results_1.json
2	444	8	57	721	98.2%	7.3%	+90.9%	results_2.json
3	452	0	81	697	100.0%	10.4%	+89.6%	results_3.json
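The per-run figures can be rechecked directly from the TP/FN/FP/TN counts in the two tables above. The sketch below assumes nothing beyond those counts. The `youden` helper and the `summary` layout are illustrative names, not part of any kolega.dev or OWASP tooling, and the mean and spread are extra summaries that are not reported in the tables.

```python
from statistics import mean

# (TP, FN, FP, TN) for each run, copied from the tables above.
RUNS = {
    "Java v1.2": [
        (1371, 44, 126, 1199),
        (1360, 55, 117, 1208),
        (1334, 81, 113, 1212),
    ],
    "Python v0.1": [
        (450, 2, 76, 702),
        (444, 8, 57, 721),
        (452, 0, 81, 697),
    ],
}


def youden(tp: int, fn: int, fp: int, tn: int) -> float:
    """Return the benchmark score (TPR - FPR) as a percentage."""
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    return 100.0 * (tpr - fpr)


summary = {}
for name, runs in RUNS.items():
    scores = [youden(*run) for run in runs]
    summary[name] = {
        "scores": [round(s, 1) for s in scores],         # one score per run
        "best": round(max(scores), 1),                   # headline (best of 3)
        "mean": round(mean(scores), 1),                  # average over the runs
        "spread": round(max(scores) - min(scores), 1),   # max minus min
    }
    print(name, summary[name])
```

Reporting the mean and the max-minus-min spread next to the best run is one simple way to convey how much a non-deterministic tool varies between runs.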

About kolega.dev

kolega.dev is a next-generation AI-powered security analysis platform that performs whole-codebase vulnerability detection. Unlike traditional SAST tools that rely on pattern matching or rule-based engines, kolega.dev uses AI agents to understand code semantics, trace data flow, and reason about vulnerability conditions.

The version of the kolega.dev platform used in this evaluation has not yet been publicly released.

If you would like to reproduce these results or evaluate kolega.dev against your own codebase, please reach out to us.
