
Custom Scanners

Custom scanners let you define organization-specific detection patterns using regular expressions. Use them when built-in scanners don’t cover your proprietary identifiers like employee IDs, project codes, or internal account numbers.

Key features:

  • Regex-based pattern matching with bounded quantifiers
  • Confidence tuning via keyword proximity (boost/reduce)
  • Validation rules with checksums and invalid patterns
  • Multi-capture group redaction

Custom scanners integrate automatically with policies using the custom: prefix (e.g., custom:employee_id).

For integrating custom scanners with policies, SIEM systems, and fleet deployment, see Custom Scanner Integration.

Quick Start

Add a custom scanner to your configuration file:

[[scanners]]
name = "employee_id"
regex = "EMP-([0-9]{6})"
redaction_pattern = "EMP-XXXXXX"
base_confidence = 0.85
description = "ACME Corp employee IDs"

Test your scanner:

# Validate configuration
sudo aquilon-dlp --config /etc/aquilon/config.toml --validate-config

# Scan a test file
echo "Employee ID: EMP-123456" > /tmp/test.txt
sudo aquilon-dlp --config /etc/aquilon/config.toml --scan /tmp/test.txt
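You can also sanity-check the pattern itself offline before touching the daemon. A quick Python check (illustrative only; regex dialects differ in edge cases, so treat --validate-config as authoritative):

```python
import re

# The quick-start pattern, exactly as configured above.
pattern = re.compile(r"EMP-([0-9]{6})")

match = pattern.search("Employee ID: EMP-123456")
print(match.group(0))  # whole match: EMP-123456
print(match.group(1))  # captured digits: 123456
```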

Discovering Built-in Scanners

Before creating a custom scanner, check if a built-in scanner already covers your use case. Aquilon DLP includes 30+ built-in scanners for common sensitive data types.

List Available Scanners

Use the CLI to see all available scanners (built-in and custom):

aquilon-dlp --list-scanners

Example output:

Built-in Scanners:
  ssn              - US Social Security Numbers
  credit_card      - Credit/debit card numbers (Visa, MC, Amex, etc.)
  email            - Email addresses
  phone            - Phone numbers (US, international)
  iban             - International Bank Account Numbers
  passport         - Passport numbers
  drivers_license  - Driver's license numbers
  ...

Custom Scanners:
  custom:employee_id   - ACME Corp employee IDs
  custom:project_code  - Internal project codes

Built-in Scanner Categories

Built-in scanners are organized by data type:

| Category | Scanners | Use Case |
| --- | --- | --- |
| PII | ssn, email, phone, address, date_of_birth | Personal data protection |
| Financial | credit_card, iban, bank_account, aba_routing | PCI DSS, financial compliance |
| Healthcare | medical_record_number, npi, health_plan_id | HIPAA compliance |
| Government | passport, drivers_license, ein | Identity documents |
| Technical | api_key, private_key, database_connection | Secret detection |

For complete scanner-to-compliance mappings, see Policy Frameworks.

When to Create Custom Scanners

Create custom scanners when:

  • Organization-specific identifiers: Employee IDs, project codes, internal account numbers
  • Industry-specific formats: Your company’s unique document numbering scheme
  • Regional identifiers not built-in: Some EU national IDs require custom patterns

Configuration Reference

All fields for [[scanners]] entries:

| Field | Required | Type | Description |
| --- | --- | --- | --- |
| name | Yes | String | Unique identifier (alphanumeric + underscore, max 64 chars). Referenced as custom:{name} in policies. |
| regex | Yes | String | Pattern to match. Must use bounded quantifiers (see Pattern Safety). |
| redaction_pattern | Yes | String | Template for redacting matches. X sequences map to capture group lengths. |
| base_confidence | Yes | Float | Base confidence score (0.0 - 1.0). Higher values = more confident the match is real. |
| description | No | String | Human-readable description for documentation. |
| context_signals | No | Array | Keywords attached to findings for classification (e.g., ["hr", "confidential"]). |
| confidence_boost | No | Object | Boost confidence when positive keywords are found nearby. See Confidence Tuning. |
| confidence_reduce | No | Object | Reduce confidence when negative keywords are found nearby. See Confidence Tuning. |
| validation | No | Object | Additional validation rules. See Validation Rules. |

Pattern Safety

All regex patterns must be bounded to prevent performance issues. Unbounded patterns like \d+, .*, or [A-Z]+ will be rejected.

# SAFE - bounded patterns
[[scanners]]
name = "fixed_length"
regex = "EMP-([0-9]{6})"           # Fixed length: exactly 6 digits
redaction_pattern = "EMP-XXXXXX"
base_confidence = 0.85

[[scanners]]
name = "max_length"
regex = "ID-([0-9]{1,20})"         # Maximum 20 digits
redaction_pattern = "ID-XXXX"
base_confidence = 0.85

[[scanners]]
name = "range_length"
regex = "CODE-([A-Z]{3,6})"        # 3 to 6 uppercase letters
redaction_pattern = "CODE-XXXX"
base_confidence = 0.85

Unsafe patterns that will be rejected:

  • \d+ (unbounded digits)
  • .* (unbounded anything)
  • [A-Z]+ (unbounded letters)
  • (.*) (unbounded capture)
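The safety check can be approximated with a small heuristic. This sketch is not Aquilon's actual validator (which also catches subtler cases such as nested repetition); it only illustrates the rule being enforced:

```python
import re

# Flags +, * (unless escaped as a literal) and open-ended {n,} quantifiers.
UNBOUNDED = re.compile(r"(?<!\\)[+*]|\{\d+,\}")

def looks_unbounded(pattern: str) -> bool:
    return UNBOUNDED.search(pattern) is not None

for p in [r"\d+", r".*", r"[A-Z]+", r"EMP-([0-9]{6})", r"ID-([0-9]{1,20})"]:
    status = "REJECTED" if looks_unbounded(p) else "ok"
    print(f"{p:20} {status}")
```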

Regex Escaping in TOML

TOML strings require backslash escaping. Use one of these approaches:

# Option 1: Escape backslashes (double them)
[[scanners]]
name = "escaped_digits"
regex = "ID-(\\d{6})"              # \d becomes \\d in double quotes
redaction_pattern = "ID-XXXXXX"
base_confidence = 0.85

# Option 2: Use literal strings (single quotes)
[[scanners]]
name = "literal_digits"
regex = 'ID-(\d{6})'               # No escaping needed in single quotes
redaction_pattern = "ID-XXXXXX"
base_confidence = 0.85

# Option 3: Use character classes (no escaping)
[[scanners]]
name = "char_class"
regex = "ID-([0-9]{6})"            # [0-9] instead of \d
redaction_pattern = "ID-XXXXXX"
base_confidence = 0.85

Confidence Tuning

Adjust confidence scores based on nearby keywords to reduce false positives and improve accuracy.

Boosting Confidence

Increase confidence when positive keywords appear near a match:

[[scanners]]
name = "employee_id"
regex = "EMP-([0-9]{6})"
redaction_pattern = "EMP-XXXXXX"
base_confidence = 0.70

[scanners.confidence_boost]
keywords = ["employee", "badge", "payroll", "personnel", "HR"]
boost_amount = 0.20
proximity = 200

Effect: When “employee” or “payroll” appears within 200 bytes, confidence increases from 0.70 to 0.90.

Reducing Confidence

Decrease confidence when negative keywords appear near a match:

[[scanners]]
name = "account_number"
regex = "ACC-([0-9]{8})"
redaction_pattern = "ACC-XXXXXXXX"
base_confidence = 0.80

[scanners.confidence_reduce]
keywords = ["example", "test", "fake", "sample", "demo"]
boost_amount = 0.50
proximity = 100

Effect: When “example” or “test” appears within 100 bytes, confidence decreases from 0.80 to 0.30.

Combining Boost and Reduce

Use both on the same scanner for nuanced confidence:

[[scanners]]
name = "project_code"
regex = "PROJ-([A-Z]{3})-([0-9]{4})"
redaction_pattern = "PROJ-XXX-XXXX"
base_confidence = 0.65

[scanners.confidence_boost]
keywords = ["confidential", "restricted", "internal"]
boost_amount = 0.25
proximity = 150

[scanners.confidence_reduce]
keywords = ["example", "documentation", "template"]
boost_amount = 0.35
proximity = 100

Confidence calculation:

| Context | Calculation | Result |
| --- | --- | --- |
| No keywords nearby | 0.65 (base) | 0.65 |
| "confidential" nearby | 0.65 + 0.25 | 0.90 |
| "template" nearby | 0.65 - 0.35 | 0.30 |
| Both nearby | Applied independently | Varies |
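The adjustment logic is straightforward to model. The sketch below assumes a simple substring search within a byte window and clamping to [0.0, 1.0]; Aquilon's exact keyword matching (case handling, word boundaries) may differ:

```python
def adjusted_confidence(text, match_pos, base, boost=None, reduce=None):
    """Apply keyword boost/reduce rules within a proximity window."""
    def keyword_nearby(rule):
        lo = max(0, match_pos - rule["proximity"])
        window = text[lo:match_pos + rule["proximity"]].lower()
        return any(kw.lower() in window for kw in rule["keywords"])

    score = base
    if boost and keyword_nearby(boost):
        score += boost["boost_amount"]
    if reduce and keyword_nearby(reduce):
        score -= reduce["boost_amount"]
    return min(1.0, max(0.0, score))  # clamp to the valid range

boost = {"keywords": ["confidential"], "boost_amount": 0.25, "proximity": 150}
reduce = {"keywords": ["template"], "boost_amount": 0.35, "proximity": 100}

text = "confidential project PROJ-ABC-1234"
print(round(adjusted_confidence(text, text.index("PROJ"), 0.65, boost, reduce), 2))
```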

Confidence Adjustment Fields

| Field | Type | Description |
| --- | --- | --- |
| keywords | Array | Words to search for in proximity to the match |
| boost_amount | Float | Amount to add (boost) or subtract (reduce) from confidence (0.0 - 1.0) |
| proximity | Integer | Maximum distance in bytes to search for keywords (1 - 10000) |

Validation Rules

Add validation rules to filter out false positives with checksums and pattern exclusions.

[[scanners]]
name = "company_account"
regex = "ACCT-([0-9]{10})"
redaction_pattern = "ACCT-XXXXXXXXXX"
base_confidence = 0.85

[scanners.validation]
min_confidence = 0.70
invalid_patterns = ["^ACCT-0{10}$", "^ACCT-1234567890$"]
validator = "luhn"

Validation Fields

| Field | Type | Description |
| --- | --- | --- |
| min_confidence | Float | Minimum confidence threshold. Matches below this are discarded. |
| invalid_patterns | Array | Regex patterns to reject (e.g., all zeros, test sequences). |
| validator | String | Checksum validator to apply: luhn, mod10, mod11, or iban. |

Available Validators

| Validator | Algorithm | Use Case |
| --- | --- | --- |
| luhn | Luhn (mod 10) | Credit cards, IMEI numbers, some account numbers |
| mod10 | Modulo 10 | Various identifiers with check digits |
| mod11 | Modulo 11 | ISBN-10, some national IDs |
| iban | IBAN checksum | International Bank Account Numbers |

Example: Filtering Test Data

[[scanners]]
name = "customer_id"
regex = "CUST-([0-9]{8})"
redaction_pattern = "CUST-XXXXXXXX"
base_confidence = 0.80

[scanners.validation]
# Reject common test patterns
invalid_patterns = [
    "^CUST-0{8}$",         # All zeros
    "^CUST-1{8}$",         # All ones
    "^CUST-12345678$",     # Sequential
    "^CUST-99999999$"      # All nines
]

Example: Luhn Checksum Validation

[[scanners]]
name = "loyalty_card"
regex = "([0-9]{4})([0-9]{4})([0-9]{4})([0-9]{4})"
redaction_pattern = "XXXX-XXXX-XXXX-XXXX"
base_confidence = 0.80
description = "16-digit loyalty card numbers with Luhn check"

[scanners.validation]
validator = "luhn"
invalid_patterns = ["^0{16}$", "^1{16}$"]

This configuration:

  1. Matches any 16-digit number formatted as 4 groups
  2. Validates it passes the Luhn checksum
  3. Rejects all-zeros and all-ones patterns
  4. Reports only valid matches
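For reference, the check performed by the luhn validator is the standard mod-10 algorithm. You configure the validator by name and never implement it yourself, but a sketch helps when preparing offline test data:

```python
def luhn_valid(digits: str) -> bool:
    """Luhn mod-10 check: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:          # every second digit, counting from the check digit
            d *= 2
            if d > 9:
                d -= 9          # same as summing the two digits of d
        total += d
    return total % 10 == 0

print(luhn_valid("4111111111111111"))  # well-known Luhn-valid test number -> True
print(luhn_valid("4111111111111112"))  # wrong check digit -> False
print(luhn_valid("0000000000000000"))  # Luhn-valid! hence the invalid_patterns above
```

The all-zeros case shows why checksum validation and invalid_patterns complement each other: a string of zeros passes Luhn but is never real data.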

Redaction Patterns

Redaction patterns control how matched text appears in alerts and logs. X sequences map to capture groups.

Single Capture Group

[[scanners]]
name = "employee_id"
regex = "EMP-([0-9]{6})"           # One capture group
redaction_pattern = "EMP-XXXXXX"   # 6 X's for the 6-digit capture
base_confidence = 0.85

| Input | Redacted Output |
| --- | --- |
| EMP-123456 | EMP-XXXXXX |
| EMP-987654 | EMP-XXXXXX |

Multiple Capture Groups

[[scanners]]
name = "project_code"
regex = "PROJ-([A-Z]{3})-([0-9]{4})"     # Two capture groups
redaction_pattern = "PROJ-XXX-XXXX"       # 3 X's, then 4 X's
base_confidence = 0.90

| Input | Redacted Output |
| --- | --- |
| PROJ-ABC-1234 | PROJ-XXX-XXXX |
| PROJ-XYZ-9999 | PROJ-XXX-XXXX |

Variable Length Captures

For variable-length captures, use a fixed number of X’s as a placeholder:

[[scanners]]
name = "order_number"
regex = "ORD-([0-9]{4,10})"        # 4 to 10 digits
redaction_pattern = "ORD-XXXX"     # Fixed placeholder
base_confidence = 0.85

| Input | Redacted Output |
| --- | --- |
| ORD-1234 | ORD-XXXX |
| ORD-1234567890 | ORD-XXXX |
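Conceptually, redaction replaces each whole match with the fixed template string, which is why every input yields the same output. A minimal model of that behavior (an assumption about the mechanism, not Aquilon's source):

```python
import re

def redact(text: str, regex: str, redaction_pattern: str) -> str:
    """Replace every scanner match with its redaction template."""
    return re.sub(regex, lambda m: redaction_pattern, text)

print(redact("IDs EMP-123456 and EMP-987654", r"EMP-([0-9]{6})", "EMP-XXXXXX"))
print(redact("Order ORD-1234567890 shipped", r"ORD-([0-9]{4,10})", "ORD-XXXX"))
```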

Redaction Best Practices

  1. Match X count to expected capture length when possible
  2. Use fixed placeholders for variable-length captures
  3. Keep redaction patterns recognizable (preserve prefixes/formatting)
  4. Don’t include actual data in the pattern string

Real-World Examples

Complete, production-ready configurations for common use cases.

Healthcare: Patient ID Detection

Detect patient identifiers with healthcare context boosting:

[[scanners]]
name = "patient_id"
regex = "PAT-([0-9]{8})"
redaction_pattern = "PAT-XXXXXXXX"
base_confidence = 0.60
description = "Healthcare patient identifiers"
context_signals = ["healthcare", "phi", "hipaa"]

[scanners.confidence_boost]
keywords = ["patient", "medical", "diagnosis", "treatment", "healthcare", "hospital", "clinic"]
boost_amount = 0.30
proximity = 250

[scanners.confidence_reduce]
keywords = ["example", "test", "sample", "demo", "mock"]
boost_amount = 0.40
proximity = 100

Why this works:

  • Low base confidence (0.60) prevents false positives on similar numeric patterns
  • Healthcare keywords boost confidence significantly when in medical context
  • Test/sample keywords reduce confidence to filter documentation examples
  • Context signals (phi, hipaa) integrate with SIEM for compliance workflows

Financial: Account Number with Validation

Detect account numbers using Luhn checksum validation:

[[scanners]]
name = "financial_account"
regex = "FA-([0-9]{12})"
redaction_pattern = "FA-XXXXXXXXXXXX"
base_confidence = 0.75
description = "Financial account numbers with check digit"
context_signals = ["financial", "pci", "account"]

[scanners.confidence_boost]
keywords = ["account", "balance", "transaction", "payment", "transfer", "deposit"]
boost_amount = 0.20
proximity = 200

[scanners.validation]
validator = "luhn"
min_confidence = 0.60
invalid_patterns = [
    "^FA-0{12}$",
    "^FA-123456789012$",
    "^FA-9{12}$"
]

Why this works:

  • Luhn validator rejects numbers that fail checksum (random digit sequences)
  • Invalid patterns filter known test data
  • Minimum confidence threshold adds another layer of filtering
  • Financial keywords boost real occurrences in transaction contexts

Engineering: Multi-Part Project Code

Detect complex identifiers with multiple capture groups:

[[scanners]]
name = "internal_project"
regex = "IPROJ-([A-Z]{2})-([0-9]{4})-([A-Z]{1})"
redaction_pattern = "IPROJ-XX-XXXX-X"
base_confidence = 0.80
description = "Internal project codes (region-number-phase)"
context_signals = ["internal", "project", "confidential"]

[scanners.confidence_boost]
keywords = ["confidential", "restricted", "internal", "proprietary"]
boost_amount = 0.15
proximity = 150

[scanners.confidence_reduce]
keywords = ["template", "example", "placeholder", "documentation"]
boost_amount = 0.30
proximity = 100

Pattern breakdown:

  • ([A-Z]{2}) - Two-letter region code (e.g., US, EU, AP)
  • ([0-9]{4}) - Four-digit project number
  • ([A-Z]{1}) - Single-letter phase indicator (A-Z)

Redaction mapping:

| Input | Output |
| --- | --- |
| IPROJ-US-1234-A | IPROJ-XX-XXXX-X |
| IPROJ-EU-9999-C | IPROJ-XX-XXXX-X |

Legal: Document Identifier Detection

A comprehensive scanner combining all advanced features:

[[scanners]]
name = "legal_document"
regex = "DOC-([A-Z]{3})-([0-9]{6})"
redaction_pattern = "DOC-XXX-XXXXXX"
base_confidence = 0.55
description = "Legal document identifiers"
context_signals = ["legal", "confidential", "privileged"]

[scanners.confidence_boost]
keywords = ["attorney", "legal", "privileged", "confidential", "counsel", "litigation"]
boost_amount = 0.35
proximity = 300

[scanners.confidence_reduce]
keywords = ["example", "sample", "test", "template", "draft"]
boost_amount = 0.45
proximity = 150

[scanners.validation]
min_confidence = 0.50
invalid_patterns = [
    "^DOC-AAA-000000$",
    "^DOC-XXX-[0-9]{6}$",
    "^DOC-[A-Z]{3}-123456$"
]

Confidence scenarios:

| Context | Base | Boost | Reduce | Final |
| --- | --- | --- | --- | --- |
| No keywords | 0.55 | | | 0.55 |
| "attorney-client" nearby | 0.55 | +0.35 | | 0.90 |
| "example document" nearby | 0.55 | | -0.45 | 0.10 (rejected) |
| "confidential draft" | 0.55 | +0.35 | -0.45 | 0.45 (rejected) |

The low base confidence (0.55) combined with aggressive reduce (-0.45) ensures that example/template documents are filtered even when “confidential” appears nearby.
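The scenario table reduces to simple arithmetic against min_confidence = 0.50, which you can verify directly:

```python
MIN_CONFIDENCE = 0.50

scenarios = {
    "no keywords":                 0.55,
    "'attorney-client' nearby":    0.55 + 0.35,
    "'example document' nearby":   0.55 - 0.45,
    "'confidential draft' nearby": 0.55 + 0.35 - 0.45,
}

for context, score in scenarios.items():
    verdict = "reported" if score >= MIN_CONFIDENCE else "rejected"
    print(f"{context:30} {score:.2f} {verdict}")
```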

Testing Custom Scanners

Validate your scanner configuration before deployment.

Validate Configuration

Check for syntax errors and unsafe patterns:

sudo aquilon-dlp --config /etc/aquilon/config.toml --validate-config

Successful validation:

Configuration valid
Loaded 3 custom scanners:
  - patient_id (bounded regex, confidence 0.60)
  - financial_account (bounded regex, confidence 0.75, validator: luhn)
  - internal_project (bounded regex, confidence 0.80)

Failed validation (unsafe pattern):

Configuration error: Scanner 'bad_scanner' has unsafe regex pattern
  Pattern: "ID-\d+"
  Error: Unbounded repetition detected
  Suggestion: Use bounded quantifiers like {1,20} instead of +

Scan Test Files

Test your scanner against sample data:

# Create test file
cat > /tmp/scanner_test.txt << 'EOF'
Patient PAT-12345678 visited on 2024-01-15.
Financial account FA-123456789015 balance.
Project IPROJ-US-1234-A is confidential.
Legal document DOC-ABC-654321 under attorney review.
EOF

# Run scan
sudo aquilon-dlp --config /etc/aquilon/config.toml --scan /tmp/scanner_test.txt

Expected output:

Scanning: /tmp/scanner_test.txt
Results:
  [patient_id] PAT-XXXXXXXX (confidence: 0.60, line 1)
    Context signals: healthcare, phi, hipaa
  [financial_account] FA-XXXXXXXXXXXX (confidence: 0.75, line 2)
    Context signals: financial, pci, account
    Validation: luhn passed
  [internal_project] IPROJ-XX-XXXX-X (confidence: 0.80, line 3)
    Context signals: internal, project, confidential
  [legal_document] DOC-XXX-XXXXXX (confidence: 0.90, line 4)
    Context signals: legal, confidential, privileged
    Confidence boosted by: "attorney"

Summary: 4 findings in 1 file

Testing Confidence Adjustments

Verify boost and reduce behavior:

# Test with boost keywords
cat > /tmp/boost_test.txt << 'EOF'
Patient medical record: PAT-12345678
EOF
sudo aquilon-dlp --config /etc/aquilon/config.toml --scan /tmp/boost_test.txt
# Expected: confidence 0.90 (0.60 base + 0.30 boost from "patient", "medical")

# Test with reduce keywords
cat > /tmp/reduce_test.txt << 'EOF'
Example record: PAT-12345678 (test data)
EOF
sudo aquilon-dlp --config /etc/aquilon/config.toml --scan /tmp/reduce_test.txt
# Expected: confidence 0.20 (0.60 base - 0.40 reduce from "example", "test")

Testing Validation Rules

Verify checksum validation:

# Valid Luhn number (12 digits, passes checksum)
echo "FA-123456789015" > /tmp/valid_luhn.txt
sudo aquilon-dlp --config /etc/aquilon/config.toml --scan /tmp/valid_luhn.txt
# Expected: Match found

# Invalid Luhn number (fails checksum)
echo "FA-123456789012" > /tmp/invalid_luhn.txt
sudo aquilon-dlp --config /etc/aquilon/config.toml --scan /tmp/invalid_luhn.txt
# Expected: No match (fails Luhn validation)
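To build your own Luhn-valid fixtures for the FA- scanner, compute the check digit for an 11-digit payload (an illustrative helper, not part of the Aquilon CLI):

```python
def luhn_check_digit(payload: str) -> str:
    """Return the digit that makes payload + digit pass the Luhn check."""
    total = 0
    for i, ch in enumerate(reversed(payload)):
        d = int(ch)
        if i % 2 == 0:          # these positions double once the check digit is appended
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return str((10 - total % 10) % 10)

payload = "12345678901"
fixture = f"FA-{payload}{luhn_check_digit(payload)}"
print(fixture)  # a 12-digit Luhn-valid test value
```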

Using Policy Integration

Test custom scanners through policies:

# Policy referencing custom scanner
cat > /tmp/policy_test.toml << 'EOF'
watch_paths = ["/tmp"]
exclude_paths = []

[[scanners]]
name = "patient_id"
regex = "PAT-([0-9]{8})"
redaction_pattern = "PAT-XXXXXXXX"
base_confidence = 0.60

[policies]
enabled_policies = ["test_policy"]

[policies.policy_configs.test_policy]
enabled = true
scanners = ["custom:patient_id"]
min_confidence = 0.5

[work_queue]
max_queue_size = 10000
submit_timeout_secs = 5

[worker]
num_workers = 0

[resource_limits]
enabled = false

[metrics]
bind_address = "127.0.0.1"
port = 9000

[cache]
enabled = true
ttl_secs = 0

[scan]
max_scan_size_mb = 40
max_recursion_depth = 5
EOF

sudo aquilon-dlp --config /tmp/policy_test.toml --scan /tmp/scanner_test.txt

Note the custom: prefix when referencing custom scanners in policies.

Troubleshooting

Common issues and solutions when working with custom scanners.

Configuration Errors

| Error Message | Cause | Solution |
| --- | --- | --- |
| Unsafe regex pattern: unbounded repetition | Pattern uses +, *, or unbounded {n,} | Use bounded quantifiers: {1,20} instead of +, {0,100} instead of * |
| Invalid regex syntax | Malformed regular expression | Check TOML escaping: use \\d, 'single quotes', or [0-9] |
| Mismatched capture groups | Regex capture count doesn't match X sequences | Align capture groups with redaction X runs |
| Scanner name already exists | Duplicate name field | Each scanner needs a unique name |
| Invalid base_confidence | Value outside the 0.0 - 1.0 range | Use values between 0.0 and 1.0 |

Pattern Not Matching

Symptom: Scanner configured but no matches found.

Diagnostic steps:

  1. Test regex separately:

    echo "EMP-123456" | grep -E "EMP-([0-9]{6})"
    
  2. Check TOML escaping:

    # These are all equivalent:
    regex = "\\d{6}"      # Double backslash in double quotes
    regex = '\d{6}'       # Single quotes (literal)
    regex = "[0-9]{6}"    # Character class (recommended)
    
  3. Verify file is being scanned:

    • Check watch_paths includes the file location
    • Check exclude_paths doesn’t exclude it
    • Verify file size is under max_scan_size_mb
  4. Check confidence threshold:

    • If using policies, verify min_confidence isn’t filtering matches
    • Check if confidence_reduce keywords are nearby

False Positives

Symptom: Scanner matches too many non-relevant patterns.

Solutions:

  1. Add validation rules:

    [scanners.validation]
    invalid_patterns = ["^ACCT-0{10}$", "^ACCT-12345"]
    min_confidence = 0.70
    
  2. Use confidence reduce:

    [scanners.confidence_reduce]
    keywords = ["example", "test", "sample", "demo"]
    boost_amount = 0.40
    proximity = 100
    
  3. Add checksum validation:

    [scanners.validation]
    validator = "luhn"  # or "mod10", "mod11", "iban"
    

Policy Integration Issues

| Error Message | Cause | Solution |
| --- | --- | --- |
| Unknown scanner 'employee_id' | Missing custom: prefix | Use custom:employee_id in the policy scanners list |
| Scanner 'custom:foo' not found | Scanner not defined | Add a [[scanners]] entry with name = "foo" |
| Policy references disabled scanner | Scanner defined but not enabled | Check that the scanner configuration is complete |

Performance Issues

Symptom: Scanning is slow after adding custom scanners.

Solutions:

  1. Check pattern complexity:

    • Avoid nested alternations: (a|b|c) is fine, ((a|b)|(c|d)) is slow
    • Avoid overlapping patterns: [A-Za-z] + [a-z] creates backtracking
  2. Reduce proximity search:

    [scanners.confidence_boost]
    proximity = 100  # Smaller = faster (default is 200)
    
  3. Simplify validation:

    • invalid_patterns with simple patterns are fast
    • Complex regex in invalid_patterns can slow scanning

Redaction Issues

Symptom: Redacted output looks wrong.

| Issue | Cause | Solution |
| --- | --- | --- |
| Partial redaction | Capture group mismatch | Ensure the X count matches the capture group length |
| XXX for variable data | Variable-length capture | Use a fixed placeholder or document the behavior |
| No prefix in output | Prefix not in pattern | Put the prefix outside the capture group: PREFIX-([0-9]{6}) |

Example fix:

# Wrong - captures everything including prefix
regex = "(EMP-[0-9]{6})"
redaction_pattern = "XXXXXXXXXX"  # Loses prefix

# Correct - captures only sensitive part
regex = "EMP-([0-9]{6})"
redaction_pattern = "EMP-XXXXXX"  # Preserves prefix

Best Practices

Guidelines for building effective, maintainable custom scanners.

Pattern Design

  1. Always use bounded quantifiers

    • {6} for fixed length
    • {1,20} for variable length with maximum
    • Never use +, *, or {n,} (unbounded)
  2. Use character classes over escape sequences

    • [0-9] instead of \d (avoids TOML escaping issues)
    • [A-Za-z] instead of \w
    • [^a-z] for negation
  3. Capture only sensitive data

    # Good: prefix preserved, only digits captured
    regex = "EMP-([0-9]{6})"
    
    # Bad: entire match captured
    regex = "(EMP-[0-9]{6})"
    
  4. Test patterns before deployment

    echo "EMP-123456" | grep -E "EMP-([0-9]{6})"
    

Confidence Strategy

  1. Start with low base confidence (0.50-0.70)

    • Prevents over-alerting before context analysis
    • Allows boost/reduce to have meaningful effect
  2. Use boost for high-value context

    • Domain-specific keywords that indicate real data
    • Proximity 150-300 bytes for document context
  3. Use reduce aggressively for noise

    • Test, example, sample, demo, placeholder
    • Proximity 50-150 bytes for nearby indicators
  4. Document your confidence rationale

    description = "Patient IDs: low base (0.60) + medical boost (0.30) = 0.90 in healthcare docs"
    

Validation Rules

  1. Always add invalid_patterns for test data

    • Common sequences: all zeros, all ones, sequential (123456)
    • Known test values from documentation
  2. Use checksums when available

    • Financial accounts often have Luhn/mod10 digits
    • Reduces false positives by 90%+
  3. Set appropriate min_confidence

    • 0.50-0.60 for high-recall (find everything)
    • 0.70-0.80 for balanced precision/recall
    • 0.85+ for high-precision (minimize false positives)

Organization and Maintenance

  1. Use descriptive names

    name = "patient_mrn"        # Good: specific
    name = "id"                 # Bad: too generic
    
  2. Always include description

    description = "Medical Record Numbers: MRN-XXXXXXXX format, HIPAA-regulated"
    
  3. Use context_signals for SIEM integration

    context_signals = ["healthcare", "phi", "hipaa"]
    

    These tags appear in alerts and enable filtering/routing in your SIEM.

  4. Group related scanners

    # Healthcare scanners
    [[scanners]]
    name = "patient_mrn"
    # ...
    
    [[scanners]]
    name = "patient_ssn"
    # ...
    
    # Financial scanners
    [[scanners]]
    name = "account_number"
    # ...
    

Performance Optimization

  1. Order patterns by specificity

    • Most specific patterns first (fewer false matches)
    • Generic patterns last
  2. Minimize proximity for boost/reduce

    • Start with 100-150 bytes
    • Increase only if needed for context
  3. Avoid complex alternations

    # Slow: nested alternations
    regex = "((EMP|STAFF)-(ID|NUM))-([0-9]{6})"
    
    # Fast: separate scanners
    [[scanners]]
    name = "emp_id"
    regex = "EMP-ID-([0-9]{6})"
    
    [[scanners]]
    name = "staff_num"
    regex = "STAFF-NUM-([0-9]{6})"
    

Security Considerations

  1. Never log sensitive data in tests

    • Use obviously fake test data
    • Don’t use real examples in documentation
  2. Review patterns for over-matching

    • Simple patterns like [0-9]{9} match too broadly
    • Always include prefix/format markers
  3. Test with production-like data volume

    • Performance issues emerge at scale
    • Run against large sample files before deployment