Security Scoring

What the security scorer detects

The security scorer analyzes both the request (prompt) and response (completion) for three classes of concern:

| Class | What it finds | Example |
|---|---|---|
| PII Exposure | Personal information in responses | SSN, credit card numbers, email addresses in AI output |
| Prompt Injection | Attempts to override model instructions | "Ignore previous instructions and..." |
| Credential Leakage | API keys, passwords, tokens in prompts or responses | AWS keys, GitHub tokens, database passwords |

PII detection

The PII detector scans model responses for personally identifiable information using a combination of regex patterns and learned classifiers.

Detected PII types:

  • Social Security Numbers (SSN): \d{3}-\d{2}-\d{4}
  • Credit card numbers (Luhn-validated)
  • Email addresses
  • Phone numbers (US and international formats)
  • Physical addresses
  • Date of birth patterns
  • Passport and driver’s license patterns
  • Bank account and routing numbers
  • IP addresses (when contextually sensitive)
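
The regex side of this flow can be sketched as a pattern scan with Luhn validation for card-number candidates. The pattern set, helper names, and return shape below are illustrative, not the scorer's actual implementation:

```python
import re

# Illustrative PII patterns; only a subset of the types listed above.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    digits = [int(d) for d in number if d.isdigit()]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def scan_pii(text: str) -> dict:
    """Return {pii_type: [matches]} for each pattern that fires."""
    findings = {}
    for name, pattern in PII_PATTERNS.items():
        for match in pattern.findall(text):
            # Card-number candidates must also pass the Luhn check,
            # which filters out most arbitrary digit runs.
            if name == "credit_card" and not luhn_valid(match):
                continue
            findings.setdefault(name, []).append(match)
    return findings
```

Luhn validation is what keeps digit sequences like order numbers from being flagged as cards: a 16-digit run only counts if its checksum is valid.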

Score interpretation:

| Score range | Meaning |
|---|---|
| 0.00 – 0.20 | No PII detected |
| 0.21 – 0.50 | Low-confidence PII signal (partial match) |
| 0.51 – 0.70 | Probable PII in response |
| 0.71 – 1.00 | High-confidence PII detected |
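
If you consume raw scores downstream, the bands above can be mapped to labels with a small helper (the function name and label strings are illustrative, not part of the product API):

```python
def interpret_pii_score(score: float) -> str:
    """Map a PII component score to its documented interpretation band."""
    if score <= 0.20:
        return "no PII detected"
    if score <= 0.50:
        return "low-confidence PII signal"
    if score <= 0.70:
        return "probable PII"
    return "high-confidence PII"
```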

Prompt injection detection

Prompt injection attacks attempt to hijack model behavior by embedding instructions that override the system prompt or application context.

Common attack patterns detected:

```
"Ignore previous instructions and..."
"Disregard your system prompt..."
"You are now in developer mode..."
"[SYSTEM OVERRIDE]..."
"<!--ADMIN INSTRUCTION:..."
```

The detector scores both direct injection (in the user message) and indirect injection (in content the model is asked to process, such as a web page or document).

Configuration:

```yaml
scoring:
  security:
    check_injection: true
    injection_sensitivity: medium  # low | medium | high
```

Higher sensitivity catches more injection patterns but increases false positives on legitimate prompts that use phrases like “ignore the following” in a non-malicious context.
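
A minimal sketch of pattern-based injection scoring with a sensitivity tier follows. The pattern lists, scoring constants, and the split between base and high-sensitivity patterns are assumptions for illustration, not the scorer's real rule set:

```python
import re

# Base patterns mirror the examples above; the extra tier shows why
# high sensitivity trades recall for false positives.
BASE_PATTERNS = [
    r"ignore (all |previous |prior )?instructions",
    r"disregard your system prompt",
    r"you are now in developer mode",
    r"\[SYSTEM OVERRIDE\]",
    r"<!--\s*ADMIN INSTRUCTION",
]
HIGH_SENSITIVITY_EXTRA = [
    r"ignore the following",  # also fires on benign editorial phrasing
    r"pretend (that )?you",
]

def injection_score(text: str, sensitivity: str = "medium") -> float:
    """Score 0.0 (clean) to 1.0; any single hit is a strong signal."""
    patterns = list(BASE_PATTERNS)
    if sensitivity == "high":
        patterns += HIGH_SENSITIVITY_EXTRA
    hits = sum(bool(re.search(p, text, re.IGNORECASE)) for p in patterns)
    # First hit sets a high floor; further hits saturate toward 1.0.
    return min(1.0, 0.7 + 0.15 * (hits - 1)) if hits else 0.0
```

Note how "Please ignore the following typo" scores 0.0 at medium sensitivity but triggers the high tier, which is exactly the false-positive trade-off described above.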

Credential detection

The credential detector looks for secrets and keys in both prompts and responses.

Detected credential types:

| Credential | Pattern |
|---|---|
| Anthropic API key | `sk-ant-api03-...` |
| OpenAI API key | `sk-...` (51 chars) |
| AWS access key | `AKIA[A-Z0-9]{16}` |
| AWS secret key | 40-char hex adjacent to "aws" |
| GitHub token | `ghp_`, `ghs_`, `gho_` prefixes |
| Stripe keys | `sk_live_`, `pk_live_` |
| Generic bearer token | High-entropy strings > 32 chars in auth contexts |
| Database connection strings | `postgresql://`, `mongodb://`, etc. |

The security scorer flags credentials wherever they appear, including in prompts rather than responses. Users occasionally paste connection strings or API keys into prompts, and these are flagged even when the response itself is benign.
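
A few of the table's checks, plus the entropy fallback for bearer tokens, might be sketched as follows. The exact regexes and the 4.0 bits-per-character entropy cutoff are illustrative assumptions, not the detector's real thresholds:

```python
import math
import re

# A subset of the credential patterns listed above, for illustration.
CREDENTIAL_PATTERNS = {
    "anthropic_api_key": re.compile(r"\bsk-ant-api03-[\w-]+"),
    "aws_access_key": re.compile(r"\bAKIA[A-Z0-9]{16}\b"),
    "github_token": re.compile(r"\b(?:ghp|ghs|gho)_[A-Za-z0-9]{20,}\b"),
    "stripe_key": re.compile(r"\b[sp]k_live_[A-Za-z0-9]+"),
    "db_connection": re.compile(r"\b(?:postgresql|mongodb)://\S+"),
}

def shannon_entropy(s: str) -> float:
    """Bits per character; generated secrets score high, prose scores low."""
    freq = {c: s.count(c) / len(s) for c in set(s)}
    return -sum(p * math.log2(p) for p in freq.values())

def scan_credentials(text: str) -> list:
    """Return the names of all credential types found in the text."""
    found = [name for name, p in CREDENTIAL_PATTERNS.items() if p.search(text)]
    # Fallback: long high-entropy tokens in an auth-like context.
    for token in re.findall(r"(?i)bearer\s+(\S{33,})", text):
        if shannon_entropy(token) > 4.0:
            found.append("generic_bearer_token")
    return found
```

The entropy fallback is what catches secrets with no recognizable prefix: a random 33+ character token after "Bearer" has far more bits per character than English text.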

Composite security score

The final security score is the maximum of the three component scores:

```
security_score = max(pii_score, injection_score, credential_score)
```

This means any single high-severity finding drives the overall score, regardless of the other scores.
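
Expressed as a function, mirroring the formula above:

```python
def composite_security_score(pii: float, injection: float, credential: float) -> float:
    """The overall score is the worst (highest) component finding."""
    return max(pii, injection, credential)
```

A max (rather than an average) guarantees that one leaked credential cannot be diluted by two clean component scores.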

Configuration

```yaml
scoring:
  security:
    enabled: true
    threshold: 0.70
    check_pii: true
    check_injection: true
    check_credentials: true
    injection_sensitivity: medium
    pii_types:
      - ssn
      - credit_card
      - email
      - phone
      - address
```

Excluding false positives

If your application legitimately discusses security topics (a security training app, a documentation generator), you can adjust thresholds or disable specific checks:

```yaml
scoring:
  security:
    threshold: 0.85         # Raise threshold for security-topic apps
    check_injection: false  # Disable injection check if prompts discuss injections
```