Custom Scanners
Custom scanners let you define organization-specific detection patterns using regular expressions. Use them when built-in scanners don’t cover your proprietary identifiers like employee IDs, project codes, or internal account numbers.
Key features:
- Regex-based pattern matching with bounded quantifiers
- Confidence tuning via keyword proximity (boost/reduce)
- Validation rules with checksums and invalid patterns
- Multi-capture group redaction
Custom scanners integrate automatically with policies using the custom: prefix (e.g., custom:employee_id).
For integrating custom scanners with policies, SIEM systems, and fleet deployment, see Custom Scanner Integration.
Quick Start
Add a custom scanner to your configuration file:
[[scanners]]
name = "employee_id"
regex = "EMP-([0-9]{6})"
redaction_pattern = "EMP-XXXXXX"
base_confidence = 0.85
description = "ACME Corp employee IDs"
Test your scanner:
# Validate configuration
sudo aquilon-dlp --config /etc/aquilon/config.toml --validate-config
# Scan a test file
echo "Employee ID: EMP-123456" > /tmp/test.txt
sudo aquilon-dlp --config /etc/aquilon/config.toml --scan /tmp/test.txt
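Independently of the agent, you can sanity-check the quick-start pattern and redaction template with any regex engine. A quick illustration in plain Python (not part of the product):

```python
import re

text = "Employee ID: EMP-123456"
pattern = r"EMP-([0-9]{6})"

match = re.search(pattern, text)
print(match.group(0))  # EMP-123456 (full match)
print(match.group(1))  # 123456 (captured digits)

# The redaction template replaces the match in alerts and logs:
print(re.sub(pattern, "EMP-XXXXXX", text))  # Employee ID: EMP-XXXXXX
```

If the pattern doesn't match here, it won't match in the scanner either, so this is a cheap first debugging step.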
Discovering Built-in Scanners
Before creating a custom scanner, check if a built-in scanner already covers your use case. Aquilon DLP includes 30+ built-in scanners for common sensitive data types.
List Available Scanners
Use the CLI to see all available scanners (built-in and custom):
aquilon-dlp --list-scanners
Example output:
Built-in Scanners:
ssn - US Social Security Numbers
credit_card - Credit/debit card numbers (Visa, MC, Amex, etc.)
email - Email addresses
phone - Phone numbers (US, international)
iban - International Bank Account Numbers
passport - Passport numbers
drivers_license - Driver's license numbers
...
Custom Scanners:
custom:employee_id - ACME Corp employee IDs
custom:project_code - Internal project codes
Built-in Scanner Categories
Built-in scanners are organized by data type:
| Category | Scanners | Use Case |
|---|---|---|
| PII | ssn, email, phone, address, date_of_birth | Personal data protection |
| Financial | credit_card, iban, bank_account, aba_routing | PCI DSS, financial compliance |
| Healthcare | medical_record_number, npi, health_plan_id | HIPAA compliance |
| Government | passport, drivers_license, ein | Identity documents |
| Technical | api_key, private_key, database_connection | Secret detection |
For complete scanner-to-compliance mappings, see Policy Frameworks.
When to Create Custom Scanners
Create custom scanners when:
- Organization-specific identifiers: Employee IDs, project codes, internal account numbers
- Industry-specific formats: Your company’s unique document numbering scheme
- Regional identifiers not built-in: Some EU national IDs require custom patterns
Configuration Reference
All fields for [[scanners]] entries:
| Field | Required | Type | Description |
|---|---|---|---|
| name | Yes | String | Unique identifier (alphanumeric + underscore, max 64 chars). Referenced as custom:{name} in policies. |
| regex | Yes | String | Pattern to match. Must use bounded quantifiers (see Pattern Safety). |
| redaction_pattern | Yes | String | Template for redacting matches. X sequences map to capture group lengths. |
| base_confidence | Yes | Float | Base confidence score (0.0 - 1.0). Higher values = more confident the match is real. |
| description | No | String | Human-readable description for documentation. |
| context_signals | No | Array | Keywords attached to findings for classification (e.g., ["hr", "confidential"]). |
| confidence_boost | No | Object | Boost confidence when positive keywords are found nearby. See Confidence Tuning. |
| confidence_reduce | No | Object | Reduce confidence when negative keywords are found nearby. See Confidence Tuning. |
| validation | No | Object | Additional validation rules. See Validation Rules. |
Pattern Safety
All regex patterns must be bounded to prevent performance issues. Unbounded patterns like \d+, .*, or [A-Z]+ will be rejected.
# SAFE - bounded patterns
[[scanners]]
name = "fixed_length"
regex = "EMP-([0-9]{6})" # Fixed length: exactly 6 digits
redaction_pattern = "EMP-XXXXXX"
base_confidence = 0.85
[[scanners]]
name = "max_length"
regex = "ID-([0-9]{1,20})" # Maximum 20 digits
redaction_pattern = "ID-XXXX"
base_confidence = 0.85
[[scanners]]
name = "range_length"
regex = "CODE-([A-Z]{3,6})" # 3 to 6 uppercase letters
redaction_pattern = "CODE-XXXX"
base_confidence = 0.85
Unsafe patterns that will be rejected:
- \d+ (unbounded digits)
- .* (unbounded anything)
- [A-Z]+ (unbounded letters)
- (.*) (unbounded capture)
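The rejection boils down to screening patterns for open-ended repetition. A heuristic sketch of the idea in Python (illustrative only, not the product's actual validator; `is_bounded` is a hypothetical name):

```python
import re

# Flag the unbounded constructs listed above: bare *, bare +, and
# open-ended {n,} repetitions. Escaped literals like \+ are allowed.
UNBOUNDED = re.compile(r"(?<!\\)(\*|\+|\{\d+,\})")

def is_bounded(pattern: str) -> bool:
    """True if the pattern contains none of the unbounded quantifiers."""
    return UNBOUNDED.search(pattern) is None

print(is_bounded(r"EMP-([0-9]{6})"))    # True  - fixed length
print(is_bounded(r"ID-([0-9]{1,20})"))  # True  - bounded range
print(is_bounded(r"ID-\d+"))            # False - bare +
print(is_bounded(r".*"))                # False - bare *
```

A real validator would parse the regex properly; the point is that every repetition must carry an explicit upper bound.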
Regex Escaping in TOML
TOML strings require backslash escaping. Use one of these approaches:
# Option 1: Escape backslashes (double them)
[[scanners]]
name = "escaped_digits"
regex = "ID-(\\d{6})" # \d becomes \\d in double quotes
redaction_pattern = "ID-XXXXXX"
base_confidence = 0.85
# Option 2: Use literal strings (single quotes)
[[scanners]]
name = "literal_digits"
regex = 'ID-(\d{6})' # No escaping needed in single quotes
redaction_pattern = "ID-XXXXXX"
base_confidence = 0.85
# Option 3: Use character classes (no escaping)
[[scanners]]
name = "char_class"
regex = "ID-([0-9]{6})" # [0-9] instead of \d
redaction_pattern = "ID-XXXXXX"
base_confidence = 0.85
Confidence Tuning
Adjust confidence scores based on nearby keywords to reduce false positives and improve accuracy.
Boosting Confidence
Increase confidence when positive keywords appear near a match:
[[scanners]]
name = "employee_id"
regex = "EMP-([0-9]{6})"
redaction_pattern = "EMP-XXXXXX"
base_confidence = 0.70
[scanners.confidence_boost]
keywords = ["employee", "badge", "payroll", "personnel", "HR"]
boost_amount = 0.20
proximity = 200
Effect: When “employee” or “payroll” appears within 200 bytes, confidence increases from 0.70 to 0.90.
Reducing Confidence
Decrease confidence when negative keywords appear near a match:
[[scanners]]
name = "account_number"
regex = "ACC-([0-9]{8})"
redaction_pattern = "ACC-XXXXXXXX"
base_confidence = 0.80
[scanners.confidence_reduce]
keywords = ["example", "test", "fake", "sample", "demo"]
boost_amount = 0.50
proximity = 100
Effect: When “example” or “test” appears within 100 bytes, confidence decreases from 0.80 to 0.30.
Combining Boost and Reduce
Use both on the same scanner for nuanced confidence:
[[scanners]]
name = "project_code"
regex = "PROJ-([A-Z]{3})-([0-9]{4})"
redaction_pattern = "PROJ-XXX-XXXX"
base_confidence = 0.65
[scanners.confidence_boost]
keywords = ["confidential", "restricted", "internal"]
boost_amount = 0.25
proximity = 150
[scanners.confidence_reduce]
keywords = ["example", "documentation", "template"]
boost_amount = 0.35
proximity = 100
Confidence calculation:
| Context | Calculation | Result |
|---|---|---|
| No keywords nearby | 0.65 (base) | 0.65 |
| “confidential” nearby | 0.65 + 0.25 | 0.90 |
| “template” nearby | 0.65 - 0.35 | 0.30 |
| Both nearby | Applied independently | Varies |
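The arithmetic in the table can be sketched as a small helper. This is an illustration of the documented behavior (independent boost and reduce, result clamped to the valid range); the engine's exact ordering and clamping are assumptions here:

```python
def adjusted_confidence(base: float, boost: float = 0.0, reduce: float = 0.0) -> float:
    """Apply boost and reduce independently, then clamp to 0.0 - 1.0."""
    return max(0.0, min(1.0, base + boost - reduce))

# The project_code scenarios from the table above:
print(round(adjusted_confidence(0.65), 2))                           # 0.65 - no keywords
print(round(adjusted_confidence(0.65, boost=0.25), 2))               # 0.9  - "confidential" nearby
print(round(adjusted_confidence(0.65, reduce=0.35), 2))              # 0.3  - "template" nearby
print(round(adjusted_confidence(0.65, boost=0.25, reduce=0.35), 2))  # 0.55 - both nearby
```

With both keyword sets nearby, the independent adjustments net out (here 0.65 + 0.25 - 0.35 = 0.55), which is why the table lists that outcome as "Varies".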
Confidence Adjustment Fields
| Field | Type | Description |
|---|---|---|
| keywords | Array | Words to search for in proximity to a match |
| boost_amount | Float | Amount to add (boost) or subtract (reduce) from confidence (0.0 - 1.0) |
| proximity | Integer | Maximum distance in bytes to search for keywords (1 - 10000) |
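Conceptually, proximity matching is a windowed substring search around each match. A minimal sketch (proximity is documented in bytes; for ASCII text bytes and characters coincide, which this sketch assumes, and `keyword_nearby` is a hypothetical name):

```python
def keyword_nearby(text: str, start: int, end: int, keywords, proximity: int) -> bool:
    """True if any keyword appears within `proximity` characters of the match span."""
    window = text[max(0, start - proximity):end + proximity].lower()
    return any(k.lower() in window for k in keywords)

text = "Payroll record for badge EMP-123456, filed by HR."
start = text.index("EMP-123456")
end = start + len("EMP-123456")

print(keyword_nearby(text, start, end, ["payroll", "badge"], proximity=200))  # True
print(keyword_nearby(text, start, end, ["payroll"], proximity=5))             # False: out of range
print(keyword_nearby(text, start, end, ["invoice"], proximity=200))           # False: absent
```

The second call shows why proximity tuning matters: the same keyword stops counting once it falls outside the window.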
Validation Rules
Add validation rules to filter out false positives with checksums and pattern exclusions.
[[scanners]]
name = "company_account"
regex = "ACCT-([0-9]{10})"
redaction_pattern = "ACCT-XXXXXXXXXX"
base_confidence = 0.85
[scanners.validation]
min_confidence = 0.70
invalid_patterns = ["^ACCT-0{10}$", "^ACCT-1234567890$"]
validator = "luhn"
Validation Fields
| Field | Type | Description |
|---|---|---|
| min_confidence | Float | Minimum confidence threshold. Matches below this are discarded. |
| invalid_patterns | Array | Regex patterns to reject (e.g., all zeros, test sequences). |
| validator | String | Checksum validator to apply: luhn, mod10, mod11, or iban. |
Available Validators
| Validator | Algorithm | Use Case |
|---|---|---|
| luhn | Luhn (mod 10) | Credit cards, IMEI numbers, some account numbers |
| mod10 | Modulo 10 | Various identifiers with check digits |
| mod11 | Modulo 11 | ISBN-10, some national IDs |
| iban | IBAN checksum | International Bank Account Numbers |
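For reference, the Luhn (mod 10) algorithm works as follows. This is a textbook implementation for experimenting with test values, not the shipped validator:

```python
def luhn_valid(digits: str) -> bool:
    """Luhn check: from the rightmost digit, double every second digit,
    subtract 9 from any doubled result above 9, and require the total
    to be a multiple of 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0

print(luhn_valid("4111111111111111"))  # True  (well-known test card number)
print(luhn_valid("123456789012"))      # False (fails the check)
```

Random digit sequences have only a 1-in-10 chance of passing, which is why a checksum validator cuts false positives so effectively.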
Example: Filtering Test Data
[[scanners]]
name = "customer_id"
regex = "CUST-([0-9]{8})"
redaction_pattern = "CUST-XXXXXXXX"
base_confidence = 0.80
[scanners.validation]
# Reject common test patterns
invalid_patterns = [
"^CUST-0{8}$", # All zeros
"^CUST-1{8}$", # All ones
"^CUST-12345678$", # Sequential
"^CUST-99999999$" # All nines
]
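The invalid_patterns stage behaves like a post-match reject list. A minimal sketch of the idea using the customer_id patterns above (`is_rejected` and `INVALID` are illustrative names, not product API):

```python
import re

INVALID = [
    r"^CUST-0{8}$",       # All zeros
    r"^CUST-1{8}$",       # All ones
    r"^CUST-12345678$",   # Sequential
    r"^CUST-99999999$",   # All nines
]

def is_rejected(match_text: str) -> bool:
    """True if the matched text hits any reject pattern."""
    return any(re.search(p, match_text) for p in INVALID)

print(is_rejected("CUST-00000000"))  # True  (all zeros, filtered)
print(is_rejected("CUST-12345678"))  # True  (sequential, filtered)
print(is_rejected("CUST-48301957"))  # False (reported as a finding)
```

Because the patterns are anchored with ^ and $, they reject only exact test values, never legitimate IDs that merely contain a zero run.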
Example: Luhn Checksum Validation
[[scanners]]
name = "loyalty_card"
regex = "([0-9]{4})([0-9]{4})([0-9]{4})([0-9]{4})"
redaction_pattern = "XXXX-XXXX-XXXX-XXXX"
base_confidence = 0.80
description = "16-digit loyalty card numbers with Luhn check"
[scanners.validation]
validator = "luhn"
invalid_patterns = ["^0{16}$", "^1{16}$"]
This configuration:
- Matches any run of 16 contiguous digits, captured as four 4-digit groups (the regex itself contains no separators)
- Validates it passes the Luhn checksum
- Rejects all-zeros and all-ones patterns
- Reports only valid matches
Redaction Patterns
Redaction patterns control how matched text appears in alerts and logs. X sequences map to capture groups.
Single Capture Group
[[scanners]]
name = "employee_id"
regex = "EMP-([0-9]{6})" # One capture group
redaction_pattern = "EMP-XXXXXX" # 6 X's for the 6-digit capture
base_confidence = 0.85
| Input | Redacted Output |
|---|---|
| EMP-123456 | EMP-XXXXXX |
| EMP-987654 | EMP-XXXXXX |
Multiple Capture Groups
[[scanners]]
name = "project_code"
regex = "PROJ-([A-Z]{3})-([0-9]{4})" # Two capture groups
redaction_pattern = "PROJ-XXX-XXXX" # 3 X's, then 4 X's
base_confidence = 0.90
| Input | Redacted Output |
|---|---|
| PROJ-ABC-1234 | PROJ-XXX-XXXX |
| PROJ-XYZ-9999 | PROJ-XXX-XXXX |
Variable Length Captures
For variable-length captures, use a fixed number of X’s as a placeholder:
[[scanners]]
name = "order_number"
regex = "ORD-([0-9]{4,10})" # 4 to 10 digits
redaction_pattern = "ORD-XXXX" # Fixed placeholder
base_confidence = 0.85
| Input | Redacted Output |
|---|---|
| ORD-1234 | ORD-XXXX |
| ORD-1234567890 | ORD-XXXX |
Redaction Best Practices
- Match X count to expected capture length when possible
- Use fixed placeholders for variable-length captures
- Keep redaction patterns recognizable (preserve prefixes/formatting)
- Don’t include actual data in the pattern string
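Mechanically, redaction substitutes the fixed template for the full match. In the product, X runs map to capture groups; for fixed-length groups a constant template achieves the same effect, as this plain-Python illustration shows:

```python
import re

# The project_code scanner from above, as plain data.
scanner = {
    "regex": r"PROJ-([A-Z]{3})-([0-9]{4})",
    "redaction_pattern": "PROJ-XXX-XXXX",
}

def redact(text: str) -> str:
    """Replace every match with the redaction template."""
    return re.sub(scanner["regex"], scanner["redaction_pattern"], text)

print(redact("Budget for PROJ-ABC-1234 and PROJ-XYZ-9999 approved."))
# Budget for PROJ-XXX-XXXX and PROJ-XXX-XXXX approved.
```

Note the template contains no backreferences, so none of the sensitive captured text can leak into the redacted output.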
Real-World Examples
Complete, production-ready configurations for common use cases.
Healthcare: Patient ID Detection
Detect patient identifiers with healthcare context boosting:
[[scanners]]
name = "patient_id"
regex = "PAT-([0-9]{8})"
redaction_pattern = "PAT-XXXXXXXX"
base_confidence = 0.60
description = "Healthcare patient identifiers"
context_signals = ["healthcare", "phi", "hipaa"]
[scanners.confidence_boost]
keywords = ["patient", "medical", "diagnosis", "treatment", "healthcare", "hospital", "clinic"]
boost_amount = 0.30
proximity = 250
[scanners.confidence_reduce]
keywords = ["example", "test", "sample", "demo", "mock"]
boost_amount = 0.40
proximity = 100
Why this works:
- Low base confidence (0.60) prevents false positives on similar numeric patterns
- Healthcare keywords boost confidence significantly when in medical context
- Test/sample keywords reduce confidence to filter documentation examples
- Context signals (phi, hipaa) integrate with SIEM for compliance workflows
Financial: Account Number with Validation
Detect account numbers using Luhn checksum validation:
[[scanners]]
name = "financial_account"
regex = "FA-([0-9]{12})"
redaction_pattern = "FA-XXXXXXXXXXXX"
base_confidence = 0.75
description = "Financial account numbers with check digit"
context_signals = ["financial", "pci", "account"]
[scanners.confidence_boost]
keywords = ["account", "balance", "transaction", "payment", "transfer", "deposit"]
boost_amount = 0.20
proximity = 200
[scanners.validation]
validator = "luhn"
min_confidence = 0.60
invalid_patterns = [
"^FA-0{12}$",
"^FA-123456789012$",
"^FA-9{12}$"
]
Why this works:
- Luhn validator rejects numbers that fail checksum (random digit sequences)
- Invalid patterns filter known test data
- Minimum confidence threshold adds another layer of filtering
- Financial keywords boost real occurrences in transaction contexts
Engineering: Multi-Part Project Code
Detect complex identifiers with multiple capture groups:
[[scanners]]
name = "internal_project"
regex = "IPROJ-([A-Z]{2})-([0-9]{4})-([A-Z]{1})"
redaction_pattern = "IPROJ-XX-XXXX-X"
base_confidence = 0.80
description = "Internal project codes (region-number-phase)"
context_signals = ["internal", "project", "confidential"]
[scanners.confidence_boost]
keywords = ["confidential", "restricted", "internal", "proprietary"]
boost_amount = 0.15
proximity = 150
[scanners.confidence_reduce]
keywords = ["template", "example", "placeholder", "documentation"]
boost_amount = 0.30
proximity = 100
Pattern breakdown:
- ([A-Z]{2}) - Two-letter region code (e.g., US, EU, AP)
- ([0-9]{4}) - Four-digit project number
- ([A-Z]{1}) - Single-letter phase indicator (A-Z)
Redaction mapping:
| Input | Output |
|---|---|
| IPROJ-US-1234-A | IPROJ-XX-XXXX-X |
| IPROJ-EU-9999-C | IPROJ-XX-XXXX-X |
Legal: Document ID with Full Feature Set
Comprehensive scanner combining all advanced features:
[[scanners]]
name = "legal_document"
regex = "DOC-([A-Z]{3})-([0-9]{6})"
redaction_pattern = "DOC-XXX-XXXXXX"
base_confidence = 0.55
description = "Legal document identifiers"
context_signals = ["legal", "confidential", "privileged"]
[scanners.confidence_boost]
keywords = ["attorney", "legal", "privileged", "confidential", "counsel", "litigation"]
boost_amount = 0.35
proximity = 300
[scanners.confidence_reduce]
keywords = ["example", "sample", "test", "template", "draft"]
boost_amount = 0.45
proximity = 150
[scanners.validation]
min_confidence = 0.50
invalid_patterns = [
"^DOC-AAA-000000$",
"^DOC-XXX-[0-9]{6}$",
"^DOC-[A-Z]{3}-123456$"
]
Confidence scenarios:
| Context | Base | Boost | Reduce | Final |
|---|---|---|---|---|
| No keywords | 0.55 | — | — | 0.55 |
| “attorney-client” nearby | 0.55 | +0.35 | — | 0.90 |
| “example document” nearby | 0.55 | — | -0.45 | 0.10 (rejected) |
| “confidential draft” | 0.55 | +0.35 | -0.45 | 0.45 (rejected) |
The low base confidence (0.55) combined with aggressive reduce (-0.45) ensures that example/template documents are filtered even when “confidential” appears nearby.
Testing Custom Scanners
Validate your scanner configuration before deployment.
Validate Configuration
Check for syntax errors and unsafe patterns:
sudo aquilon-dlp --config /etc/aquilon/config.toml --validate-config
Successful validation:
Configuration valid
Loaded 3 custom scanners:
- patient_id (bounded regex, confidence 0.60)
- financial_account (bounded regex, confidence 0.75, validator: luhn)
- internal_project (bounded regex, confidence 0.80)
Failed validation (unsafe pattern):
Configuration error: Scanner 'bad_scanner' has unsafe regex pattern
Pattern: "ID-\d+"
Error: Unbounded repetition detected
Suggestion: Use bounded quantifiers like {1,20} instead of +
Scan Test Files
Test your scanner against sample data:
# Create test file
cat > /tmp/scanner_test.txt << 'EOF'
Patient PAT-12345678 visited on 2024-01-15.
Financial account FA-123456789015 balance.
Project IPROJ-US-1234-A is confidential.
Legal document DOC-ABC-654321 under attorney review.
EOF
# Run scan
sudo aquilon-dlp --config /etc/aquilon/config.toml --scan /tmp/scanner_test.txt
Expected output:
Scanning: /tmp/scanner_test.txt
Results:
[patient_id] PAT-XXXXXXXX (confidence: 0.90, line 1)
Context signals: healthcare, phi, hipaa
Confidence boosted by: "patient"
[financial_account] FA-XXXXXXXXXXXX (confidence: 0.95, line 2)
Context signals: financial, pci, account
Confidence boosted by: "account"
Validation: luhn passed
[internal_project] IPROJ-XX-XXXX-X (confidence: 0.95, line 3)
Context signals: internal, project, confidential
Confidence boosted by: "confidential"
[legal_document] DOC-XXX-XXXXXX (confidence: 0.90, line 4)
Context signals: legal, confidential, privileged
Confidence boosted by: "attorney"
Summary: 4 findings in 1 file
Testing Confidence Adjustments
Verify boost and reduce behavior:
# Test with boost keywords
cat > /tmp/boost_test.txt << 'EOF'
Patient medical record: PAT-12345678
EOF
sudo aquilon-dlp --config /etc/aquilon/config.toml --scan /tmp/boost_test.txt
# Expected: confidence 0.90 (0.60 base + 0.30 boost from "patient", "medical")
# Test with reduce keywords
cat > /tmp/reduce_test.txt << 'EOF'
Example ID: PAT-12345678 (test data)
EOF
sudo aquilon-dlp --config /etc/aquilon/config.toml --scan /tmp/reduce_test.txt
# Expected: confidence 0.20 (0.60 base - 0.40 reduce from "example", "test")
Testing Validation Rules
Verify checksum validation:
# Valid Luhn number (12 digits, passes checksum)
echo "FA-123456789015" > /tmp/valid_luhn.txt
sudo aquilon-dlp --config /etc/aquilon/config.toml --scan /tmp/valid_luhn.txt
# Expected: Match found
# Invalid Luhn number (fails checksum)
echo "FA-123456789012" > /tmp/invalid_luhn.txt
sudo aquilon-dlp --config /etc/aquilon/config.toml --scan /tmp/invalid_luhn.txt
# Expected: No match (fails Luhn validation)
Using Policy Integration
Test custom scanners through policies:
# Policy referencing custom scanner
cat > /tmp/policy_test.toml << 'EOF'
watch_paths = ["/tmp"]
exclude_paths = []
[[scanners]]
name = "patient_id"
regex = "PAT-([0-9]{8})"
redaction_pattern = "PAT-XXXXXXXX"
base_confidence = 0.60
[policies]
enabled_policies = ["test_policy"]
[policies.policy_configs.test_policy]
enabled = true
scanners = ["custom:patient_id"]
min_confidence = 0.5
[work_queue]
max_queue_size = 10000
submit_timeout_secs = 5
[worker]
num_workers = 0
[resource_limits]
enabled = false
[metrics]
bind_address = "127.0.0.1"
port = 9000
[cache]
enabled = true
ttl_secs = 0
[scan]
max_scan_size_mb = 40
max_recursion_depth = 5
EOF
sudo aquilon-dlp --config /tmp/policy_test.toml --scan /tmp/scanner_test.txt
Note the custom: prefix when referencing custom scanners in policies.
Troubleshooting
Common issues and solutions when working with custom scanners.
Configuration Errors
| Error Message | Cause | Solution |
|---|---|---|
| Unsafe regex pattern: unbounded repetition | Pattern uses +, *, or unbounded {n,} | Use bounded quantifiers: {1,20} instead of +, {0,100} instead of * |
| Invalid regex syntax | Malformed regular expression | Check TOML escaping: use \\d, 'single quotes', or [0-9] |
| Mismatched capture groups | Regex capture count doesn't match X sequences | Align capture groups with redaction X runs |
| Scanner name already exists | Duplicate name field | Each scanner needs a unique name |
| Invalid base_confidence | Value outside the 0.0 - 1.0 range | Use values between 0.0 and 1.0 |
Pattern Not Matching
Symptom: Scanner configured but no matches found.
Diagnostic steps:
1. Test the regex separately:
   echo "EMP-123456" | grep -E "EMP-([0-9]{6})"
2. Check TOML escaping:
   # These are all equivalent:
   regex = "\\d{6}"    # Double backslash in double quotes
   regex = '\d{6}'     # Single quotes (literal)
   regex = "[0-9]{6}"  # Character class (recommended)
3. Verify the file is being scanned:
   - Check watch_paths includes the file location
   - Check exclude_paths doesn't exclude it
   - Verify the file size is under max_scan_size_mb
4. Check confidence thresholds:
   - If using policies, verify min_confidence isn't filtering matches
   - Check whether confidence_reduce keywords are nearby
False Positives
Symptom: Scanner matches too many non-relevant patterns.
Solutions:
1. Add validation rules:
   [scanners.validation]
   invalid_patterns = ["^ACCT-0{10}$", "^ACCT-12345"]
   min_confidence = 0.70
2. Use confidence reduce:
   [scanners.confidence_reduce]
   keywords = ["example", "test", "sample", "demo"]
   boost_amount = 0.40
   proximity = 100
3. Add checksum validation:
   [scanners.validation]
   validator = "luhn"  # or "mod10", "mod11", "iban"
Policy Integration Issues
| Error Message | Cause | Solution |
|---|---|---|
| Unknown scanner 'employee_id' | Missing custom: prefix | Use custom:employee_id in the policy scanners list |
| Scanner 'custom:foo' not found | Scanner not defined | Add a [[scanners]] entry with name = "foo" |
| Policy references disabled scanner | Scanner defined but not enabled | Check the scanner configuration is complete |
Performance Issues
Symptom: Scanning is slow after adding custom scanners.
Solutions:
1. Check pattern complexity:
   - Avoid nested alternations: (a|b|c) is fine, ((a|b)|(c|d)) is slow
   - Avoid overlapping patterns: [A-Za-z]{1,20}[a-z] creates backtracking
2. Reduce the proximity search:
   [scanners.confidence_boost]
   proximity = 100  # Smaller = faster (default is 200)
3. Simplify validation:
   - invalid_patterns with simple patterns are fast
   - Complex regex in invalid_patterns can slow scanning
Redaction Issues
Symptom: Redacted output looks wrong.
| Issue | Cause | Solution |
|---|---|---|
| Partial redaction | Capture group mismatch | Ensure X count matches capture group length |
| XXX for variable data | Variable-length capture | Use fixed placeholder or document behavior |
| No prefix in output | Prefix not in pattern | Add prefix outside capture group: PREFIX-([0-9]{6}) |
Example fix:
# Wrong - captures everything including prefix
regex = "(EMP-[0-9]{6})"
redaction_pattern = "XXXXXXXXXX" # Loses prefix
# Correct - captures only sensitive part
regex = "EMP-([0-9]{6})"
redaction_pattern = "EMP-XXXXXX" # Preserves prefix
Best Practices
Guidelines for building effective, maintainable custom scanners.
Pattern Design
- Always use bounded quantifiers
  - {6} for fixed length
  - {1,20} for variable length with a maximum
  - Never use +, *, or {n,} (unbounded)
- Use character classes over escape sequences
  - [0-9] instead of \d (avoids TOML escaping issues)
  - [A-Za-z] instead of \w
  - [^a-z] for negation
- Capture only the sensitive data
  # Good: prefix preserved, only digits captured
  regex = "EMP-([0-9]{6})"
  # Bad: entire match captured
  regex = "(EMP-[0-9]{6})"
- Test patterns before deployment
  echo "EMP-123456" | grep -E "EMP-([0-9]{6})"
Confidence Strategy
- Start with low base confidence (0.50-0.70)
  - Prevents over-alerting before context analysis
  - Allows boost/reduce to have a meaningful effect
- Use boost for high-value context
  - Domain-specific keywords that indicate real data
  - Proximity of 150-300 bytes for document context
- Use reduce aggressively for noise
  - Test, example, sample, demo, placeholder
  - Proximity of 50-150 bytes for nearby indicators
- Document your confidence rationale
  description = "Patient IDs: low base (0.60) + medical boost (0.30) = 0.90 in healthcare docs"
Validation Rules
- Always add invalid_patterns for test data
  - Common sequences: all zeros, all ones, sequential (123456)
  - Known test values from documentation
- Use checksums when available
  - Financial accounts often have Luhn/mod10 check digits
  - Reduces false positives by 90%+
- Set min_confidence appropriately
  - 0.50-0.60 for high recall (find everything)
  - 0.70-0.80 for balanced precision/recall
  - 0.85+ for high precision (minimize false positives)
Organization and Maintenance
- Use descriptive names
  name = "patient_mrn"  # Good: specific
  name = "id"           # Bad: too generic
- Always include a description
  description = "Medical Record Numbers: MRN-XXXXXXXX format, HIPAA-regulated"
- Use context_signals for SIEM integration
  context_signals = ["healthcare", "phi", "hipaa"]
  These tags appear in alerts and enable filtering/routing in your SIEM.
- Group related scanners
  # Healthcare scanners
  [[scanners]]
  name = "patient_mrn"
  # ...
  [[scanners]]
  name = "patient_ssn"
  # ...
  # Financial scanners
  [[scanners]]
  name = "account_number"
  # ...
Performance Optimization
- Order patterns by specificity
  - Most specific patterns first (fewer false matches)
  - Generic patterns last
- Minimize proximity for boost/reduce
  - Start with 100-150 bytes
  - Increase only if needed for context
- Avoid complex alternations
  # Slow: nested alternations
  regex = "((EMP|STAFF)-(ID|NUM))-([0-9]{6})"
  # Fast: separate scanners
  [[scanners]]
  name = "emp_id"
  regex = "EMP-ID-([0-9]{6})"
  [[scanners]]
  name = "staff_num"
  regex = "STAFF-NUM-([0-9]{6})"
Security Considerations
- Never log sensitive data in tests
  - Use obviously fake test data
  - Don't use real examples in documentation
- Review patterns for over-matching
  - Simple patterns like [0-9]{9} match too broadly
  - Always include prefix/format markers
- Test with production-like data volume
  - Performance issues emerge at scale
  - Run against large sample files before deployment
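One way to honor the fake-test-data rule while still exercising checksum validation is to synthesize values that pass the check. A sketch (illustrative only; the FA- prefix and 12-digit length mirror the financial_account example, and `luhn_checksum`/`fake_account` are hypothetical helpers):

```python
import random

def luhn_checksum(digits: str) -> int:
    """Luhn sum mod 10 (0 means the value passes the check)."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10

def fake_account(prefix: str = "FA-", payload_len: int = 11) -> str:
    """Generate an obviously-synthetic but Luhn-valid account value."""
    payload = "".join(random.choice("0123456789") for _ in range(payload_len))
    # Brute-force the check digit that makes the full number pass.
    check = next(c for c in "0123456789" if luhn_checksum(payload + c) == 0)
    return prefix + payload + check

print(fake_account())  # random 12-digit value that passes a Luhn validator
```

Generated values exercise the validator path without ever placing a real account number in a test file or log.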