Benchmarks

This repository contains benchmarks for ZenGuard AI and information on how to run them.

There are two types of benchmarks that we run against ZenGuard AI:

Hugging Face datasets based benchmarks
ZenGuard AI Generated Benchmark - Zen Bench

Here you can find both benchmark results and how to run them yourself.

Public Datasets Benchmarks

We are constantly monitoring Hugging Face for new datasets that relate to GenAI security. Then we run them against ZenGuard AI to find any potential security issues with our product.

ZenGuard AI Accuracy Against Hugging Face Datasets

#	Dataset	Accuracy	Date Added
1	xTRam1/safe-guard-prompt-injection	96%	2024-07-01
2	yanismiraoui/prompt_injections	96.5%	2024-08-01
3	Harelix/Prompt-Injection-Mixed-Techniques-2024	92%	2024-10-01
4	aporia-ai/prompt_injection	87.68%	2024-05-15
5	deepset/prompt-injections	87%	2024-05-15
6	JasperLS/prompt-injections	87%	2024-05-15

DIY Benchmarking

We have developed the ZenGuard Benchmarks PyPi package to help test and benchmark ZenGuard AI better.

Here are the instructions on how to use the package.

Benchmarking Output

Here is an example of what the benchmarking output looks like:

Benchmarking split: train: 100%|██████████| 1034/1034 [07:06<00:00,  2.42it/s]
========== BENCHMARK RESULTS START ==========

Dataset: yanismiraoui/prompt_injections
ZenGuard Benchmark Results:
Total Samples: 1034
    Correct: 998
    False Positives: 0
    False Negatives: 36
    Accuracy: 96.52%

========== BENCHMARK RESULTS END ==========

Where:

Total Samples: The total number of prompts processed.
Correct: The number of prompts that were classified correctly.
False Positives: The number of prompts incorrectly identified as attacks.
False Negatives: The number of actual prompt attacks that went undetected.
Accuracy: The ratio of correctly classified prompts to the total number of samples.

Zen CX Bench

We created CX-skewed benchmark dataset. This dataset is designed to test both generic chat messages and attacks, but also specific CX-related prompts and attacks. CX prompts were carefully crafted to be similar to the messages that we see in our day-to-day production.

Dataset Info:

1200 prompts including:
- 200 attacks - jailbreaks, prompt injections.
- 1000 generic prompts - 700 generic chat messages and 300 CX-specific agentic interactions

Results:

#	Name	Accuracy	F1	False Positives	Date Added
1	ZenGuard	96.3%	87%	6	2025-01-14
2	Protect AI	91%	73.4%	70	2025-01-14
3	Microsoft Prompt Shield	74.8%	41.2%	218	2025-01-14
4	Guardrails AI	70.4%	20.9%	212	2025-01-14
5	Meta Llama Guard	55.7%	39.6%	515	2025-01-14
6	Lakera	48.4%	37.5%	615	2025-01-14

These results indicate high False Positive rates, meaning that some of the solutions are overaggressive in their detections of generic and CX-specific prompts. The high number of False Positives is very detrimental to the production systems and user experience.

Benchmarks

Public Datasets Benchmarks​

ZenGuard AI Accuracy Against Hugging Face Datasets​

DIY Benchmarking​

Benchmarking Output​

Zen CX Bench​

Contents

Public Datasets Benchmarks

ZenGuard AI Accuracy Against Hugging Face Datasets

DIY Benchmarking

Benchmarking Output

Zen CX Bench