Security Scoring

What the security scorer detects

The security scorer analyzes both the request (prompt) and response (completion) for three classes of concern:

| Class | What it finds | Example |
|---|---|---|
| PII Exposure | Personal information in responses | SSN, credit card numbers, email addresses in AI output |
| Prompt Injection | Attempts to override model instructions | "Ignore previous instructions and..." |
| Credential Leakage | API keys, passwords, tokens in prompts or responses | AWS keys, GitHub tokens, database passwords |

PII detection

The PII detector scans model responses for personally identifiable information using a combination of regex patterns and learned classifiers.

Detected PII types:

  • Social Security Numbers (SSN): \d{3}-\d{2}-\d{4}
  • Credit card numbers (Luhn-validated)
  • Email addresses
  • Phone numbers (US and international formats)
  • Physical addresses
  • Date of birth patterns
  • Passport and driver’s license patterns
  • Bank account and routing numbers
  • IP addresses (when contextually sensitive)
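
The regex side of this flow can be sketched as a pattern scan with Luhn validation for card-number candidates. The pattern set, helper names, and return shape below are illustrative, not the scorer's actual implementation:

```python
import re

# Illustrative PII patterns; only a subset of the types listed above.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    digits = [int(d) for d in number if d.isdigit()]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def scan_pii(text: str) -> dict:
    """Return {pii_type: [matches]} for each pattern that fires."""
    findings = {}
    for name, pattern in PII_PATTERNS.items():
        for match in pattern.findall(text):
            # Card-number candidates must also pass the Luhn check,
            # which filters out most arbitrary digit runs.
            if name == "credit_card" and not luhn_valid(match):
                continue
            findings.setdefault(name, []).append(match)
    return findings
```

Luhn validation is what keeps digit sequences like order numbers from being flagged as cards: a 16-digit run only counts if its checksum is valid.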

Score interpretation:

| Score range | Meaning |
|---|---|
| 0.00 – 0.20 | No PII detected |
| 0.21 – 0.50 | Low-confidence PII signal (partial match) |
| 0.51 – 0.70 | Probable PII in response |
| 0.71 – 1.00 | High-confidence PII detected |
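
If you consume raw scores downstream, the bands above can be mapped to labels with a small helper (the function name and label strings are illustrative, not part of the product API):

```python
def interpret_pii_score(score: float) -> str:
    """Map a PII component score to its documented interpretation band."""
    if score <= 0.20:
        return "no PII detected"
    if score <= 0.50:
        return "low-confidence PII signal"
    if score <= 0.70:
        return "probable PII"
    return "high-confidence PII"
```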

Prompt injection detection

Prompt injection attacks attempt to hijack model behavior by embedding instructions that override the system prompt or application context.

Common attack patterns detected:

```
"Ignore previous instructions and..."
"Disregard your system prompt..."
"You are now in developer mode..."
"[SYSTEM OVERRIDE]..."
"<!--ADMIN INSTRUCTION:..."
```

The detector scores both direct injection (in the user message) and indirect injection (in content the model is asked to process, such as a web page or document).

Configuration:

```yaml
scoring:
  security:
    check_injection: true
    injection_sensitivity: medium  # low | medium | high
```

Higher sensitivity catches more injection patterns but increases false positives on legitimate prompts that use phrases like “ignore the following” in a non-malicious context.
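
A minimal sketch of pattern-based injection scoring with a sensitivity tier follows. The pattern lists, scoring constants, and the split between base and high-sensitivity patterns are assumptions for illustration, not the scorer's real rule set:

```python
import re

# Base patterns mirror the examples above; the extra tier shows why
# high sensitivity trades recall for false positives.
BASE_PATTERNS = [
    r"ignore (all |previous |prior )?instructions",
    r"disregard your system prompt",
    r"you are now in developer mode",
    r"\[SYSTEM OVERRIDE\]",
    r"<!--\s*ADMIN INSTRUCTION",
]
HIGH_SENSITIVITY_EXTRA = [
    r"ignore the following",  # also fires on benign editorial phrasing
    r"pretend (that )?you",
]

def injection_score(text: str, sensitivity: str = "medium") -> float:
    """Score 0.0 (clean) to 1.0; any single hit is a strong signal."""
    patterns = list(BASE_PATTERNS)
    if sensitivity == "high":
        patterns += HIGH_SENSITIVITY_EXTRA
    hits = sum(bool(re.search(p, text, re.IGNORECASE)) for p in patterns)
    # First hit sets a high floor; further hits saturate toward 1.0.
    return min(1.0, 0.7 + 0.15 * (hits - 1)) if hits else 0.0
```

Note how "Please ignore the following typo" scores 0.0 at medium sensitivity but triggers the high tier, which is exactly the false-positive trade-off described above.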

Credential detection

The credential detector looks for secrets and keys in both prompts and responses.

Detected credential types:

| Credential | Pattern |
|---|---|
| Anthropic API key | `sk-ant-api03-...` |
| OpenAI API key | `sk-...` (51 chars) |
| AWS access key | `AKIA[A-Z0-9]{16}` |
| AWS secret key | 40-char hex adjacent to "aws" |
| GitHub token | `ghp_`, `ghs_`, `gho_` prefixes |
| Stripe keys | `sk_live_`, `pk_live_` |
| Generic bearer token | High-entropy strings > 32 chars in auth contexts |
| Database connection strings | `postgresql://`, `mongodb://`, etc. |

The security scorer flags credentials wherever they appear, including in prompts rather than responses. Users occasionally paste connection strings or API keys into prompts, and these are flagged even when the response itself is benign.
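
A few of the table's checks, plus the entropy fallback for bearer tokens, might be sketched as follows. The exact regexes and the 4.0 bits-per-character entropy cutoff are illustrative assumptions, not the detector's real thresholds:

```python
import math
import re

# A subset of the credential patterns listed above, for illustration.
CREDENTIAL_PATTERNS = {
    "anthropic_api_key": re.compile(r"\bsk-ant-api03-[\w-]+"),
    "aws_access_key": re.compile(r"\bAKIA[A-Z0-9]{16}\b"),
    "github_token": re.compile(r"\b(?:ghp|ghs|gho)_[A-Za-z0-9]{20,}\b"),
    "stripe_key": re.compile(r"\b[sp]k_live_[A-Za-z0-9]+"),
    "db_connection": re.compile(r"\b(?:postgresql|mongodb)://\S+"),
}

def shannon_entropy(s: str) -> float:
    """Bits per character; generated secrets score high, prose scores low."""
    freq = {c: s.count(c) / len(s) for c in set(s)}
    return -sum(p * math.log2(p) for p in freq.values())

def scan_credentials(text: str) -> list:
    """Return the names of all credential types found in the text."""
    found = [name for name, p in CREDENTIAL_PATTERNS.items() if p.search(text)]
    # Fallback: long high-entropy tokens in an auth-like context.
    for token in re.findall(r"(?i)bearer\s+(\S{33,})", text):
        if shannon_entropy(token) > 4.0:
            found.append("generic_bearer_token")
    return found
```

The entropy fallback is what catches secrets with no recognizable prefix: a random 33+ character token after "Bearer" has far more bits per character than English text.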

Composite security score

The final security score is the maximum of the three component scores:

```
security_score = max(pii_score, injection_score, credential_score)
```

This means any single high-severity finding drives the overall score, regardless of the other scores.
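
Expressed as a function, mirroring the formula above:

```python
def composite_security_score(pii: float, injection: float, credential: float) -> float:
    """The overall score is the worst (highest) component finding."""
    return max(pii, injection, credential)
```

A max (rather than an average) guarantees that one leaked credential cannot be diluted by two clean component scores.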

Configuration

```yaml
scoring:
  security:
    enabled: true
    threshold: 0.70
    check_pii: true
    check_injection: true
    check_credentials: true
    injection_sensitivity: medium
    pii_types:
      - ssn
      - credit_card
      - email
      - phone
      - address
```

Excluding false positives

If your application legitimately discusses security topics (a security training app, a documentation generator), you can adjust thresholds or disable specific checks:

```yaml
scoring:
  security:
    threshold: 0.85         # Raise threshold for security-topic apps
    check_injection: false  # Disable injection check if prompts discuss injections
```