# Security Scoring

## What the security scorer detects
The security scorer analyzes both the request (prompt) and response (completion) for three classes of concern:
| Class | What it finds | Example |
|---|---|---|
| PII Exposure | Personal information in responses | SSN, credit card numbers, email addresses in AI output |
| Prompt Injection | Attempts to override model instructions | "Ignore previous instructions and…" |
| Credential Leakage | API keys, passwords, tokens in prompts or responses | AWS keys, GitHub tokens, database passwords |
## PII detection
The PII detector scans model responses for personally identifiable information using a combination of regex patterns and learned classifiers.
**Detected PII types:**
- Social Security Numbers (SSN): `\d{3}-\d{2}-\d{4}`
- Credit card numbers (Luhn-validated)
- Email addresses
- Phone numbers (US and international formats)
- Physical addresses
- Date of birth patterns
- Passport and driver’s license patterns
- Bank account and routing numbers
- IP addresses (when contextually sensitive)
**Score interpretation:**
| Score range | Meaning |
|---|---|
| 0.00 – 0.20 | No PII detected |
| 0.21 – 0.50 | Low-confidence PII signal (partial match) |
| 0.51 – 0.70 | Probable PII in response |
| 0.71 – 1.00 | High-confidence PII detected |
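As an illustrative sketch of the regex side of the detector described above (the learned classifiers are omitted, and all names here are hypothetical, not the scorer's actual API):

```python
import re

# Hypothetical patterns for two of the PII types listed above.
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,15}\d\b")  # 13-16 digit candidates

def luhn_valid(number: str) -> bool:
    """Luhn checksum, used to confirm candidate credit card numbers."""
    digits = [int(d) for d in re.sub(r"\D", "", number)]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) >= 13 and total % 10 == 0

def pii_score(text: str) -> float:
    """Toy score: 1.0 on a confirmed match, else 0.0. The real scorer
    also emits intermediate values for partial, low-confidence matches."""
    if SSN_RE.search(text):
        return 1.0
    if any(luhn_valid(m.group()) for m in CARD_RE.finditer(text)):
        return 1.0
    return 0.0
```

Luhn validation is what keeps arbitrary 16-digit numbers (order IDs, timestamps) from scoring as credit cards: only candidates whose checksum passes are confirmed.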
## Prompt injection detection
Prompt injection attacks attempt to hijack model behavior by embedding instructions that override the system prompt or application context.
**Common attack patterns detected:**
"Ignore previous instructions and...""Disregard your system prompt...""You are now in developer mode...""[SYSTEM OVERRIDE]...""<!--ADMIN INSTRUCTION:..."The detector scores both direct injection (in the user message) and indirect injection (in content the model is asked to process, such as a web page or document).
**Configuration:**

```yaml
scoring:
  security:
    check_injection: true
    injection_sensitivity: medium  # low | medium | high
```

Higher sensitivity catches more injection patterns but increases false positives on legitimate prompts that use phrases like "ignore the following" in a non-malicious context.
## Credential detection
The credential detector looks for secrets and keys in both prompts and responses.
**Detected credential types:**
| Credential | Pattern |
|---|---|
| Anthropic API key | sk-ant-api03-... |
| OpenAI API key | sk-... (51 chars) |
| AWS access key | AKIA[A-Z0-9]{16} |
| AWS secret key | 40-char base64-like string adjacent to "aws" |
| GitHub token | ghp_, ghs_, gho_ prefixes |
| Stripe key | sk_live_, pk_live_ |
| Generic bearer token | High-entropy strings > 32 chars in auth contexts |
| Database connection strings | postgresql://, mongodb://, etc. |
When credentials appear in prompts (not responses), the security scorer still flags them. Users occasionally paste connection strings or API keys into prompts — this should be flagged even if the response is benign.
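To make the pattern-plus-entropy approach concrete, here is a sketch using a subset of the patterns from the table above; the function names, regexes, and entropy threshold are illustrative assumptions, not the scorer's real implementation:

```python
import math
import re
from collections import Counter

# Illustrative subset of the credential patterns from the table.
CREDENTIAL_PATTERNS = {
    "aws_access_key": re.compile(r"\bAKIA[A-Z0-9]{16}\b"),
    "github_token": re.compile(r"\b(?:ghp|ghs|gho)_[A-Za-z0-9]{36}\b"),
    "connection_string": re.compile(r"\b(?:postgresql|mongodb)://\S+"),
}

def shannon_entropy(s: str) -> float:
    """Bits per character; random API tokens score high, prose scores low."""
    counts = Counter(s)
    return -sum((c / len(s)) * math.log2(c / len(s)) for c in counts.values())

def credential_score(text: str) -> float:
    """Return 1.0 on any credential finding. Apply to prompts and
    responses alike, since pasted secrets in prompts are also leaks."""
    for pattern in CREDENTIAL_PATTERNS.values():
        if pattern.search(text):
            return 1.0
    # Generic bearer tokens: long, high-entropy strings in auth contexts.
    for token in re.findall(r"Bearer\s+(\S{33,})", text):
        if shannon_entropy(token) > 4.0:  # assumed cutoff, in bits/char
            return 1.0
    return 0.0
```

The entropy check is what catches opaque tokens that match no known vendor prefix: a 40-character random string near an `Authorization` header is suspicious even if its format is unrecognized.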
## Composite security score
The final security score is the maximum of the three component scores:
```
security_score = max(pii_score, injection_score, credential_score)
```

This means any single high-severity finding drives the overall score, regardless of the other scores.
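The max-aggregation rule is simple enough to state directly in code; this is a sketch of the stated formula, not the product's API:

```python
def security_score(pii: float, injection: float, credential: float) -> float:
    """One high-severity finding dominates the composite score;
    clean results on the other checks cannot dilute it."""
    return max(pii, injection, credential)

# Example: a benign PII and injection result cannot mask a credential leak.
score = security_score(0.05, 0.10, 0.92)
flagged = score >= 0.70  # compare against the configured threshold
```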
## Configuration
```yaml
scoring:
  security:
    enabled: true
    threshold: 0.70
    check_pii: true
    check_injection: true
    check_credentials: true
    injection_sensitivity: medium
    pii_types:
      - ssn
      - credit_card
      - email
      - phone
      - address
```

## Excluding false positives
If your application legitimately discusses security topics (for example, a security training app or a documentation generator), you can raise thresholds or disable specific checks:
```yaml
scoring:
  security:
    threshold: 0.85         # Raise threshold for security-topic apps
    check_injection: false  # Disable injection check if prompts discuss injections
```