
Custom Scanners

Custom scanners let you define organization-specific detection patterns using regular expressions. Use them when built-in scanners don’t cover your proprietary identifiers like employee IDs, project codes, or internal account numbers.

Key features:

  • Regex-based pattern matching with bounded quantifiers
  • Confidence tuning via keyword proximity (boost/reduce)
  • Validation rules with checksums and invalid patterns
  • Multi-capture group redaction

Custom scanners integrate automatically with policies using the custom: prefix (e.g., custom:employee_id).

For integrating custom scanners with policies, SIEM systems, and fleet deployment, see Custom Scanner Integration.

Quick Start

Add a custom scanner to your configuration file:

[[scanners]]
name = "employee_id"
regex = "EMP-([0-9]{6})"
redaction_pattern = "EMP-XXXXXX"
base_confidence = 0.85
description = "ACME Corp employee IDs"

Test your scanner:

# Validate configuration
sudo aquilon-dlp --config /etc/aquilon/config.toml --validate-config

# Scan a test file
echo "Employee ID: EMP-123456" > /tmp/test.txt
sudo aquilon-dlp --config /etc/aquilon/config.toml --scan /tmp/test.txt
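You can also sanity-check the pattern itself offline before touching the daemon. A quick Python check (illustrative only; regex dialects differ in edge cases, so treat --validate-config as authoritative):

```python
import re

# The quick-start pattern, exactly as configured above.
pattern = re.compile(r"EMP-([0-9]{6})")

match = pattern.search("Employee ID: EMP-123456")
print(match.group(0))  # whole match: EMP-123456
print(match.group(1))  # captured digits: 123456
```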

Discovering Built-in Scanners

Before creating a custom scanner, check if a built-in scanner already covers your use case. Aquilon DLP includes 30+ built-in scanners for common sensitive data types.

List Available Scanners

Use the CLI to see all available scanners (built-in and custom):

aquilon-dlp --list-scanners

Example output:

Built-in Scanners:
  ssn              - US Social Security Numbers
  credit_card      - Credit/debit card numbers (Visa, MC, Amex, etc.)
  email            - Email addresses
  phone            - Phone numbers (US, international)
  iban             - International Bank Account Numbers
  passport         - Passport numbers
  drivers_license  - Driver's license numbers
  ...

Custom Scanners:
  custom:employee_id   - ACME Corp employee IDs
  custom:project_code  - Internal project codes

Built-in Scanner Categories

Built-in scanners are organized by data type:

| Category | Scanners | Use Case |
| --- | --- | --- |
| PII | ssn, email, phone, address, date_of_birth | Personal data protection |
| Financial | credit_card, iban, bank_account, aba_routing | PCI DSS, financial compliance |
| Healthcare | medical_record_number, npi, health_plan_id | HIPAA compliance |
| Government | passport, drivers_license, ein | Identity documents |
| Technical | api_key, private_key, database_connection | Secret detection |

For complete scanner-to-compliance mappings, see Policy Frameworks.

When to Create Custom Scanners

Create custom scanners when:

  • Organization-specific identifiers: Employee IDs, project codes, internal account numbers
  • Industry-specific formats: Your company’s unique document numbering scheme
  • Regional identifiers not built-in: Some EU national IDs require custom patterns

Configuration Reference

All fields for [[scanners]] entries:

| Field | Required | Type | Description |
| --- | --- | --- | --- |
| name | Yes | String | Unique identifier (alphanumeric + underscore, max 64 chars). Referenced as custom:{name} in policies. |
| regex | Yes | String | Pattern to match. Must use bounded quantifiers (see Pattern Safety). |
| redaction_pattern | Yes | String | Template for redacting matches. X sequences map to capture group lengths. |
| base_confidence | Yes | Float | Base confidence score (0.0 - 1.0). Higher values = more confident the match is real. |
| description | No | String | Human-readable description for documentation. |
| context_signals | No | Array | Keywords attached to findings for classification (e.g., ["hr", "confidential"]). |
| confidence_boost | No | Object | Boost confidence when positive keywords are found nearby. See Confidence Tuning. |
| confidence_reduce | No | Object | Reduce confidence when negative keywords are found nearby. See Confidence Tuning. |
| validation | No | Object | Additional validation rules. See Validation Rules. |

Pattern Safety

All regex patterns must be bounded to prevent performance issues. Unbounded patterns like \d+, .*, or [A-Z]+ will be rejected.

# SAFE - bounded patterns
[[scanners]]
name = "fixed_length"
regex = "EMP-([0-9]{6})"           # Fixed length: exactly 6 digits
redaction_pattern = "EMP-XXXXXX"
base_confidence = 0.85

[[scanners]]
name = "max_length"
regex = "ID-([0-9]{1,20})"         # Maximum 20 digits
redaction_pattern = "ID-XXXX"
base_confidence = 0.85

[[scanners]]
name = "range_length"
regex = "CODE-([A-Z]{3,6})"        # 3 to 6 uppercase letters
redaction_pattern = "CODE-XXXX"
base_confidence = 0.85

Unsafe patterns that will be rejected:

  • \d+ (unbounded digits)
  • .* (unbounded anything)
  • [A-Z]+ (unbounded letters)
  • (.*) (unbounded capture)
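The safety check can be approximated with a small heuristic. This sketch is not Aquilon's actual validator (which also catches subtler cases such as nested repetition); it only illustrates the rule being enforced:

```python
import re

# Flags +, * (unless escaped as a literal) and open-ended {n,} quantifiers.
UNBOUNDED = re.compile(r"(?<!\\)[+*]|\{\d+,\}")

def looks_unbounded(pattern: str) -> bool:
    return UNBOUNDED.search(pattern) is not None

for p in [r"\d+", r".*", r"[A-Z]+", r"EMP-([0-9]{6})", r"ID-([0-9]{1,20})"]:
    status = "REJECTED" if looks_unbounded(p) else "ok"
    print(f"{p:20} {status}")
```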

Regex Escaping in TOML

TOML strings require backslash escaping. Use one of these approaches:

# Option 1: Escape backslashes (double them)
[[scanners]]
name = "escaped_digits"
regex = "ID-(\\d{6})"              # \d becomes \\d in double quotes
redaction_pattern = "ID-XXXXXX"
base_confidence = 0.85

# Option 2: Use literal strings (single quotes)
[[scanners]]
name = "literal_digits"
regex = 'ID-(\d{6})'               # No escaping needed in single quotes
redaction_pattern = "ID-XXXXXX"
base_confidence = 0.85

# Option 3: Use character classes (no escaping)
[[scanners]]
name = "char_class"
regex = "ID-([0-9]{6})"            # [0-9] instead of \d
redaction_pattern = "ID-XXXXXX"
base_confidence = 0.85

Confidence Tuning

Adjust confidence scores based on nearby keywords to reduce false positives and improve accuracy.

Boosting Confidence

Increase confidence when positive keywords appear near a match:

[[scanners]]
name = "employee_id"
regex = "EMP-([0-9]{6})"
redaction_pattern = "EMP-XXXXXX"
base_confidence = 0.70

[scanners.confidence_boost]
keywords = ["employee", "badge", "payroll", "personnel", "HR"]
boost_amount = 0.20
proximity = 200

Effect: When “employee” or “payroll” appears within 200 bytes, confidence increases from 0.70 to 0.90.

Reducing Confidence

Decrease confidence when negative keywords appear near a match:

[[scanners]]
name = "account_number"
regex = "ACC-([0-9]{8})"
redaction_pattern = "ACC-XXXXXXXX"
base_confidence = 0.80

[scanners.confidence_reduce]
keywords = ["example", "test", "fake", "sample", "demo"]
boost_amount = 0.50
proximity = 100

Effect: When “example” or “test” appears within 100 bytes, confidence decreases from 0.80 to 0.30.

Combining Boost and Reduce

Use both on the same scanner for nuanced confidence:

[[scanners]]
name = "project_code"
regex = "PROJ-([A-Z]{3})-([0-9]{4})"
redaction_pattern = "PROJ-XXX-XXXX"
base_confidence = 0.65

[scanners.confidence_boost]
keywords = ["confidential", "restricted", "internal"]
boost_amount = 0.25
proximity = 150

[scanners.confidence_reduce]
keywords = ["example", "documentation", "template"]
boost_amount = 0.35
proximity = 100

Confidence calculation:

| Context | Calculation | Result |
| --- | --- | --- |
| No keywords nearby | 0.65 (base) | 0.65 |
| "confidential" nearby | 0.65 + 0.25 | 0.90 |
| "template" nearby | 0.65 - 0.35 | 0.30 |
| Both nearby | Applied independently | Varies |
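The adjustment logic is straightforward to model. The sketch below assumes a simple substring search within a byte window and clamping to [0.0, 1.0]; Aquilon's exact keyword matching (case handling, word boundaries) may differ:

```python
def adjusted_confidence(text, match_pos, base, boost=None, reduce=None):
    """Apply keyword boost/reduce rules within a proximity window."""
    def keyword_nearby(rule):
        lo = max(0, match_pos - rule["proximity"])
        window = text[lo:match_pos + rule["proximity"]].lower()
        return any(kw.lower() in window for kw in rule["keywords"])

    score = base
    if boost and keyword_nearby(boost):
        score += boost["boost_amount"]
    if reduce and keyword_nearby(reduce):
        score -= reduce["boost_amount"]
    return min(1.0, max(0.0, score))  # clamp to the valid range

boost = {"keywords": ["confidential"], "boost_amount": 0.25, "proximity": 150}
reduce = {"keywords": ["template"], "boost_amount": 0.35, "proximity": 100}

text = "confidential project PROJ-ABC-1234"
print(round(adjusted_confidence(text, text.index("PROJ"), 0.65, boost, reduce), 2))
```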

Confidence Adjustment Fields

| Field | Type | Description |
| --- | --- | --- |
| keywords | Array | Words to search for in proximity to the match |
| boost_amount | Float | Amount to add (boost) or subtract (reduce) from confidence (0.0 - 1.0) |
| proximity | Integer | Maximum distance in bytes to search for keywords (1 - 10000) |

Validation Rules

Add validation rules to filter out false positives with checksums and pattern exclusions.

[[scanners]]
name = "company_account"
regex = "ACCT-([0-9]{10})"
redaction_pattern = "ACCT-XXXXXXXXXX"
base_confidence = 0.85

[scanners.validation]
min_confidence = 0.70
invalid_patterns = ["^ACCT-0{10}$", "^ACCT-1234567890$"]
validator = "luhn"

Validation Fields

| Field | Type | Description |
| --- | --- | --- |
| min_confidence | Float | Minimum confidence threshold. Matches below this are discarded. |
| invalid_patterns | Array | Regex patterns to reject (e.g., all zeros, test sequences). |
| validator | String | Checksum validator to apply: luhn, mod10, mod11, or iban. |

Available Validators

| Validator | Algorithm | Use Case |
| --- | --- | --- |
| luhn | Luhn (mod 10) | Credit cards, IMEI numbers, some account numbers |
| mod10 | Modulo 10 | Various identifiers with check digits |
| mod11 | Modulo 11 | ISBN-10, some national IDs |
| iban | IBAN checksum | International Bank Account Numbers |

Example: Filtering Test Data

[[scanners]]
name = "customer_id"
regex = "CUST-([0-9]{8})"
redaction_pattern = "CUST-XXXXXXXX"
base_confidence = 0.80

[scanners.validation]
# Reject common test patterns
invalid_patterns = [
    "^CUST-0{8}$",         # All zeros
    "^CUST-1{8}$",         # All ones
    "^CUST-12345678$",     # Sequential
    "^CUST-99999999$"      # All nines
]

Example: Luhn Checksum Validation

[[scanners]]
name = "loyalty_card"
regex = "([0-9]{4})([0-9]{4})([0-9]{4})([0-9]{4})"
redaction_pattern = "XXXX-XXXX-XXXX-XXXX"
base_confidence = 0.80
description = "16-digit loyalty card numbers with Luhn check"

[scanners.validation]
validator = "luhn"
invalid_patterns = ["^0{16}$", "^1{16}$"]

This configuration:

  1. Matches any 16-digit number formatted as 4 groups
  2. Validates it passes the Luhn checksum
  3. Rejects all-zeros and all-ones patterns
  4. Reports only valid matches
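For reference, the check performed by the luhn validator is the standard mod-10 algorithm. You configure the validator by name and never implement it yourself, but a sketch helps when preparing offline test data:

```python
def luhn_valid(digits: str) -> bool:
    """Luhn mod-10 check: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:          # every second digit, counting from the check digit
            d *= 2
            if d > 9:
                d -= 9          # same as summing the two digits of d
        total += d
    return total % 10 == 0

print(luhn_valid("4111111111111111"))  # well-known Luhn-valid test number -> True
print(luhn_valid("4111111111111112"))  # wrong check digit -> False
print(luhn_valid("0000000000000000"))  # Luhn-valid! hence the invalid_patterns above
```

The all-zeros case shows why checksum validation and invalid_patterns complement each other: a string of zeros passes Luhn but is never real data.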

Redaction Patterns

Redaction patterns control how matched text appears in alerts and logs. X sequences map to capture groups.

Single Capture Group

[[scanners]]
name = "employee_id"
regex = "EMP-([0-9]{6})"           # One capture group
redaction_pattern = "EMP-XXXXXX"   # 6 X's for the 6-digit capture
base_confidence = 0.85

| Input | Redacted Output |
| --- | --- |
| EMP-123456 | EMP-XXXXXX |
| EMP-987654 | EMP-XXXXXX |

Multiple Capture Groups

[[scanners]]
name = "project_code"
regex = "PROJ-([A-Z]{3})-([0-9]{4})"     # Two capture groups
redaction_pattern = "PROJ-XXX-XXXX"       # 3 X's, then 4 X's
base_confidence = 0.90

| Input | Redacted Output |
| --- | --- |
| PROJ-ABC-1234 | PROJ-XXX-XXXX |
| PROJ-XYZ-9999 | PROJ-XXX-XXXX |

Variable Length Captures

For variable-length captures, use a fixed number of X’s as a placeholder:

[[scanners]]
name = "order_number"
regex = "ORD-([0-9]{4,10})"        # 4 to 10 digits
redaction_pattern = "ORD-XXXX"     # Fixed placeholder
base_confidence = 0.85

| Input | Redacted Output |
| --- | --- |
| ORD-1234 | ORD-XXXX |
| ORD-1234567890 | ORD-XXXX |
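Conceptually, redaction replaces each whole match with the fixed template string, which is why every input yields the same output. A minimal model of that behavior (an assumption about the mechanism, not Aquilon's source):

```python
import re

def redact(text: str, regex: str, redaction_pattern: str) -> str:
    """Replace every scanner match with its redaction template."""
    return re.sub(regex, lambda m: redaction_pattern, text)

print(redact("IDs EMP-123456 and EMP-987654", r"EMP-([0-9]{6})", "EMP-XXXXXX"))
print(redact("Order ORD-1234567890 shipped", r"ORD-([0-9]{4,10})", "ORD-XXXX"))
```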

Redaction Best Practices

  1. Match X count to expected capture length when possible
  2. Use fixed placeholders for variable-length captures
  3. Keep redaction patterns recognizable (preserve prefixes/formatting)
  4. Don’t include actual data in the pattern string

Real-World Examples

Complete, production-ready configurations for common use cases.

Healthcare: Patient ID Detection

Detect patient identifiers with healthcare context boosting:

[[scanners]]
name = "patient_id"
regex = "PAT-([0-9]{8})"
redaction_pattern = "PAT-XXXXXXXX"
base_confidence = 0.60
description = "Healthcare patient identifiers"
context_signals = ["healthcare", "phi", "hipaa"]

[scanners.confidence_boost]
keywords = ["patient", "medical", "diagnosis", "treatment", "healthcare", "hospital", "clinic"]
boost_amount = 0.30
proximity = 250

[scanners.confidence_reduce]
keywords = ["example", "test", "sample", "demo", "mock"]
boost_amount = 0.40
proximity = 100

Why this works:

  • Low base confidence (0.60) prevents false positives on similar numeric patterns
  • Healthcare keywords boost confidence significantly when in medical context
  • Test/sample keywords reduce confidence to filter documentation examples
  • Context signals (phi, hipaa) integrate with SIEM for compliance workflows

Financial: Account Number with Validation

Detect account numbers using Luhn checksum validation:

[[scanners]]
name = "financial_account"
regex = "FA-([0-9]{12})"
redaction_pattern = "FA-XXXXXXXXXXXX"
base_confidence = 0.75
description = "Financial account numbers with check digit"
context_signals = ["financial", "pci", "account"]

[scanners.confidence_boost]
keywords = ["account", "balance", "transaction", "payment", "transfer", "deposit"]
boost_amount = 0.20
proximity = 200

[scanners.validation]
validator = "luhn"
min_confidence = 0.60
invalid_patterns = [
    "^FA-0{12}$",
    "^FA-123456789012$",
    "^FA-9{12}$"
]

Why this works:

  • Luhn validator rejects numbers that fail checksum (random digit sequences)
  • Invalid patterns filter known test data
  • Minimum confidence threshold adds another layer of filtering
  • Financial keywords boost real occurrences in transaction contexts

Engineering: Multi-Part Project Code

Detect complex identifiers with multiple capture groups:

[[scanners]]
name = "internal_project"
regex = "IPROJ-([A-Z]{2})-([0-9]{4})-([A-Z]{1})"
redaction_pattern = "IPROJ-XX-XXXX-X"
base_confidence = 0.80
description = "Internal project codes (region-number-phase)"
context_signals = ["internal", "project", "confidential"]

[scanners.confidence_boost]
keywords = ["confidential", "restricted", "internal", "proprietary"]
boost_amount = 0.15
proximity = 150

[scanners.confidence_reduce]
keywords = ["template", "example", "placeholder", "documentation"]
boost_amount = 0.30
proximity = 100

Pattern breakdown:

  • ([A-Z]{2}) - Two-letter region code (e.g., US, EU, AP)
  • ([0-9]{4}) - Four-digit project number
  • ([A-Z]{1}) - Single-letter phase indicator (A-Z)

Redaction mapping:

| Input | Output |
| --- | --- |
| IPROJ-US-1234-A | IPROJ-XX-XXXX-X |
| IPROJ-EU-9999-C | IPROJ-XX-XXXX-X |

Legal: Document Identifier Detection

A comprehensive scanner combining all advanced features:

[[scanners]]
name = "legal_document"
regex = "DOC-([A-Z]{3})-([0-9]{6})"
redaction_pattern = "DOC-XXX-XXXXXX"
base_confidence = 0.55
description = "Legal document identifiers"
context_signals = ["legal", "confidential", "privileged"]

[scanners.confidence_boost]
keywords = ["attorney", "legal", "privileged", "confidential", "counsel", "litigation"]
boost_amount = 0.35
proximity = 300

[scanners.confidence_reduce]
keywords = ["example", "sample", "test", "template", "draft"]
boost_amount = 0.45
proximity = 150

[scanners.validation]
min_confidence = 0.50
invalid_patterns = [
    "^DOC-AAA-000000$",
    "^DOC-XXX-[0-9]{6}$",
    "^DOC-[A-Z]{3}-123456$"
]

Confidence scenarios:

| Context | Base | Boost | Reduce | Final |
| --- | --- | --- | --- | --- |
| No keywords | 0.55 | | | 0.55 |
| "attorney-client" nearby | 0.55 | +0.35 | | 0.90 |
| "example document" nearby | 0.55 | | -0.45 | 0.10 (rejected) |
| "confidential draft" | 0.55 | +0.35 | -0.45 | 0.45 (rejected) |

The low base confidence (0.55) combined with aggressive reduce (-0.45) ensures that example/template documents are filtered even when “confidential” appears nearby.
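The scenario table reduces to simple arithmetic against min_confidence = 0.50, which you can verify directly:

```python
MIN_CONFIDENCE = 0.50

scenarios = {
    "no keywords":                 0.55,
    "'attorney-client' nearby":    0.55 + 0.35,
    "'example document' nearby":   0.55 - 0.45,
    "'confidential draft' nearby": 0.55 + 0.35 - 0.45,
}

for context, score in scenarios.items():
    verdict = "reported" if score >= MIN_CONFIDENCE else "rejected"
    print(f"{context:30} {score:.2f} {verdict}")
```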

Testing Custom Scanners

Validate your scanner configuration before deployment.

Validate Configuration

Check for syntax errors and unsafe patterns:

sudo aquilon-dlp --config /etc/aquilon/config.toml --validate-config

Successful validation:

Configuration valid
Loaded 3 custom scanners:
  - patient_id (bounded regex, confidence 0.60)
  - financial_account (bounded regex, confidence 0.75, validator: luhn)
  - internal_project (bounded regex, confidence 0.80)

Failed validation (unsafe pattern):

Configuration error: Scanner 'bad_scanner' has unsafe regex pattern
  Pattern: "ID-\d+"
  Error: Unbounded repetition detected
  Suggestion: Use bounded quantifiers like {1,20} instead of +

Scan Test Files

Test your scanner against sample data:

# Create test file
cat > /tmp/scanner_test.txt << 'EOF'
Patient PAT-12345678 visited on 2024-01-15.
Financial account FA-123456789015 balance.
Project IPROJ-US-1234-A is confidential.
Legal document DOC-ABC-654321 under attorney review.
EOF

# Run scan
sudo aquilon-dlp --config /etc/aquilon/config.toml --scan /tmp/scanner_test.txt

Expected output:

Scanning: /tmp/scanner_test.txt
Results:
  [patient_id] PAT-XXXXXXXX (confidence: 0.60, line 1)
    Context signals: healthcare, phi, hipaa
  [financial_account] FA-XXXXXXXXXXXX (confidence: 0.75, line 2)
    Context signals: financial, pci, account
    Validation: luhn passed
  [internal_project] IPROJ-XX-XXXX-X (confidence: 0.80, line 3)
    Context signals: internal, project, confidential
  [legal_document] DOC-XXX-XXXXXX (confidence: 0.90, line 4)
    Context signals: legal, confidential, privileged
    Confidence boosted by: "attorney"

Summary: 4 findings in 1 file

Testing Confidence Adjustments

Verify boost and reduce behavior:

# Test with boost keywords
cat > /tmp/boost_test.txt << 'EOF'
Patient medical record: PAT-12345678
EOF
sudo aquilon-dlp --config /etc/aquilon/config.toml --scan /tmp/boost_test.txt
# Expected: confidence 0.90 (0.60 base + 0.30 boost from "patient", "medical")

# Test with reduce keywords
cat > /tmp/reduce_test.txt << 'EOF'
Example record: PAT-12345678 (test data)
EOF
sudo aquilon-dlp --config /etc/aquilon/config.toml --scan /tmp/reduce_test.txt
# Expected: confidence 0.20 (0.60 base - 0.40 reduce from "example", "test")

Testing Validation Rules

Verify checksum validation:

# Valid Luhn number (12 digits, passes checksum)
echo "FA-123456789015" > /tmp/valid_luhn.txt
sudo aquilon-dlp --config /etc/aquilon/config.toml --scan /tmp/valid_luhn.txt
# Expected: Match found

# Invalid Luhn number (fails checksum)
echo "FA-123456789012" > /tmp/invalid_luhn.txt
sudo aquilon-dlp --config /etc/aquilon/config.toml --scan /tmp/invalid_luhn.txt
# Expected: No match (fails Luhn validation)
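To build your own Luhn-valid fixtures for the FA- scanner, compute the check digit for an 11-digit payload (an illustrative helper, not part of the Aquilon CLI):

```python
def luhn_check_digit(payload: str) -> str:
    """Return the digit that makes payload + digit pass the Luhn check."""
    total = 0
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:          # these positions double once the check digit is appended
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)

payload = "12345678901"
fixture = f"FA-{payload}{luhn_check_digit(payload)}"
print(fixture)  # a 12-digit Luhn-valid test value
```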

Using Policy Integration

Test custom scanners through policies:

# Policy referencing custom scanner
cat > /tmp/policy_test.toml << 'EOF'
watch_paths = ["/tmp"]
exclude_paths = []

[[scanners]]
name = "patient_id"
regex = "PAT-([0-9]{8})"
redaction_pattern = "PAT-XXXXXXXX"
base_confidence = 0.60

[policies]
enabled_policies = ["test_policy"]

[policies.policy_configs.test_policy]
enabled = true
scanners = ["custom:patient_id"]
min_confidence = 0.5

[work_queue]
max_queue_size = 10000
submit_timeout_secs = 5

[worker]
num_workers = 0

[resource_limits]
enabled = false

[metrics]
bind_address = "127.0.0.1"
port = 9000

[cache]
enabled = true
ttl_secs = 0

[scan]
max_scan_size_mb = 40
max_recursion_depth = 5
EOF

sudo aquilon-dlp --config /tmp/policy_test.toml --scan /tmp/scanner_test.txt

Note the custom: prefix when referencing custom scanners in policies.

Troubleshooting

Common issues and solutions when working with custom scanners.

Configuration Errors

| Error Message | Cause | Solution |
| --- | --- | --- |
| Unsafe regex pattern: unbounded repetition | Pattern uses +, *, or unbounded {n,} | Use bounded quantifiers: {1,20} instead of +, {0,100} instead of * |
| Invalid regex syntax | Malformed regular expression | Check TOML escaping: use \\d, 'single quotes', or [0-9] |
| Mismatched capture groups | Regex capture count doesn't match X sequences | Align capture groups with redaction X runs |
| Scanner name already exists | Duplicate name field | Each scanner needs a unique name |
| Invalid base_confidence | Value outside the 0.0 - 1.0 range | Use values between 0.0 and 1.0 |

Pattern Not Matching

Symptom: Scanner configured but no matches found.

Diagnostic steps:

  1. Test regex separately:

    echo "EMP-123456" | grep -E "EMP-([0-9]{6})"
    
  2. Check TOML escaping:

    # These are all equivalent:
    regex = "\\d{6}"      # Double backslash in double quotes
    regex = '\d{6}'       # Single quotes (literal)
    regex = "[0-9]{6}"    # Character class (recommended)
    
  3. Verify file is being scanned:

    • Check watch_paths includes the file location
    • Check exclude_paths doesn’t exclude it
    • Verify file size is under max_scan_size_mb
  4. Check confidence threshold:

    • If using policies, verify min_confidence isn’t filtering matches
    • Check if confidence_reduce keywords are nearby

False Positives

Symptom: Scanner matches too many non-relevant patterns.

Solutions:

  1. Add validation rules:

    [scanners.validation]
    invalid_patterns = ["^ACCT-0{10}$", "^ACCT-12345"]
    min_confidence = 0.70
    
  2. Use confidence reduce:

    [scanners.confidence_reduce]
    keywords = ["example", "test", "sample", "demo"]
    boost_amount = 0.40
    proximity = 100
    
  3. Add checksum validation:

    [scanners.validation]
    validator = "luhn"  # or "mod10", "mod11", "iban"
    

Policy Integration Issues

| Error Message | Cause | Solution |
| --- | --- | --- |
| Unknown scanner 'employee_id' | Missing custom: prefix | Use custom:employee_id in the policy scanners list |
| Scanner 'custom:foo' not found | Scanner not defined | Add a [[scanners]] entry with name = "foo" |
| Policy references disabled scanner | Scanner defined but not enabled | Check that the scanner configuration is complete |

Performance Issues

Symptom: Scanning is slow after adding custom scanners.

Solutions:

  1. Check pattern complexity:

    • Avoid nested alternations: (a|b|c) is fine, ((a|b)|(c|d)) is slow
    • Avoid overlapping patterns: [A-Za-z] + [a-z] creates backtracking
  2. Reduce proximity search:

    [scanners.confidence_boost]
    proximity = 100  # Smaller = faster (default is 200)
    
  3. Simplify validation:

    • invalid_patterns with simple patterns are fast
    • Complex regex in invalid_patterns can slow scanning

Redaction Issues

Symptom: Redacted output looks wrong.

| Issue | Cause | Solution |
| --- | --- | --- |
| Partial redaction | Capture group mismatch | Ensure the X count matches the capture group length |
| XXX for variable data | Variable-length capture | Use a fixed placeholder or document the behavior |
| No prefix in output | Prefix not in pattern | Put the prefix outside the capture group: PREFIX-([0-9]{6}) |

Example fix:

# Wrong - captures everything including prefix
regex = "(EMP-[0-9]{6})"
redaction_pattern = "XXXXXXXXXX"  # Loses prefix

# Correct - captures only sensitive part
regex = "EMP-([0-9]{6})"
redaction_pattern = "EMP-XXXXXX"  # Preserves prefix

Best Practices

Guidelines for building effective, maintainable custom scanners.

Pattern Design

  1. Always use bounded quantifiers

    • {6} for fixed length
    • {1,20} for variable length with maximum
    • Never use +, *, or {n,} (unbounded)
  2. Use character classes over escape sequences

    • [0-9] instead of \d (avoids TOML escaping issues)
    • [A-Za-z] instead of \w
    • [^a-z] for negation
  3. Capture only sensitive data

    # Good: prefix preserved, only digits captured
    regex = "EMP-([0-9]{6})"
    
    # Bad: entire match captured
    regex = "(EMP-[0-9]{6})"
    
  4. Test patterns before deployment

    echo "EMP-123456" | grep -E "EMP-([0-9]{6})"
    

Confidence Strategy

  1. Start with low base confidence (0.50-0.70)

    • Prevents over-alerting before context analysis
    • Allows boost/reduce to have meaningful effect
  2. Use boost for high-value context

    • Domain-specific keywords that indicate real data
    • Proximity 150-300 bytes for document context
  3. Use reduce aggressively for noise

    • Test, example, sample, demo, placeholder
    • Proximity 50-150 bytes for nearby indicators
  4. Document your confidence rationale

    description = "Patient IDs: low base (0.60) + medical boost (0.30) = 0.90 in healthcare docs"
    

Validation Rules

  1. Always add invalid_patterns for test data

    • Common sequences: all zeros, all ones, sequential (123456)
    • Known test values from documentation
  2. Use checksums when available

    • Financial accounts often have Luhn/mod10 digits
    • Reduces false positives by 90%+
  3. Set appropriate min_confidence

    • 0.50-0.60 for high-recall (find everything)
    • 0.70-0.80 for balanced precision/recall
    • 0.85+ for high-precision (minimize false positives)

Organization and Maintenance

  1. Use descriptive names

    name = "patient_mrn"        # Good: specific
    name = "id"                 # Bad: too generic
    
  2. Always include description

    description = "Medical Record Numbers: MRN-XXXXXXXX format, HIPAA-regulated"
    
  3. Use context_signals for SIEM integration

    context_signals = ["healthcare", "phi", "hipaa"]
    

    These tags appear in alerts and enable filtering/routing in your SIEM.

  4. Group related scanners

    # Healthcare scanners
    [[scanners]]
    name = "patient_mrn"
    # ...
    
    [[scanners]]
    name = "patient_ssn"
    # ...
    
    # Financial scanners
    [[scanners]]
    name = "account_number"
    # ...
    

Performance Optimization

  1. Order patterns by specificity

    • Most specific patterns first (fewer false matches)
    • Generic patterns last
  2. Minimize proximity for boost/reduce

    • Start with 100-150 bytes
    • Increase only if needed for context
  3. Avoid complex alternations

    # Slow: nested alternations
    regex = "((EMP|STAFF)-(ID|NUM))-([0-9]{6})"
    
    # Fast: separate scanners
    [[scanners]]
    name = "emp_id"
    regex = "EMP-ID-([0-9]{6})"
    
    [[scanners]]
    name = "staff_num"
    regex = "STAFF-NUM-([0-9]{6})"
    

Security Considerations

  1. Never log sensitive data in tests

    • Use obviously fake test data
    • Don’t use real examples in documentation
  2. Review patterns for over-matching

    • Simple patterns like [0-9]{9} match too broadly
    • Always include prefix/format markers
  3. Test with production-like data volume

    • Performance issues emerge at scale
    • Run against large sample files before deployment