Custom Scanner Integration

This guide covers integrating custom scanners with policies, SIEM systems, and fleet deployments. For creating custom scanners, see Custom Scanners.

Combining Built-in and Custom Scanners

Custom scanners work alongside built-in scanners in policies. The key difference is the naming convention:

Built-in scanners: Use the scanner name directly (e.g., ssn, email, iban)
Custom scanners: Use the custom: prefix (e.g., custom:employee_id)

[policies]
enabled_policies = ["data_protection"]

[policies.policy_configs.data_protection]
enabled = true
settings = { confidence_threshold = "0.7" }
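
The naming convention above can be illustrated with a short Python sketch that separates built-in scanner references from custom ones by the custom: prefix. The helper names (is_custom, split_scanners) are hypothetical and exist only for illustration; they are not part of Aquilon DLP.

```python
CUSTOM_PREFIX = "custom:"

def is_custom(scanner_ref: str) -> bool:
    """True when the reference uses the custom: prefix."""
    return scanner_ref.startswith(CUSTOM_PREFIX)

def split_scanners(refs):
    """Split scanner references into (built-in names, custom names without prefix)."""
    builtin = [r for r in refs if not is_custom(r)]
    custom = [r[len(CUSTOM_PREFIX):] for r in refs if is_custom(r)]
    return builtin, custom

builtin, custom = split_scanners(["ssn", "email", "iban", "custom:employee_id"])
print(builtin)  # ['ssn', 'email', 'iban']
print(custom)   # ['employee_id']
```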

For advanced policy composition (AND/OR rules, thresholds), see Policy Frameworks.

A common scenario is extending GDPR compliance with organization-specific identifiers. This example combines built-in EU data scanners with custom project codes:

# Custom scanner for internal project codes
[[scanners]]
name = "project_code"
regex = "PROJ-([A-Z]{2})-([0-9]{4})"
redaction_pattern = "PROJ-XX-XXXX"
base_confidence = 0.80
context_signals = ["internal", "confidential"]

[scanners.confidence_boost]
keywords = ["confidential", "restricted", "internal"]
boost_amount = 0.15
proximity = 150

[scanners.confidence_reduce]
keywords = ["example", "template", "documentation"]
boost_amount = 0.30  # In confidence_reduce, this amount is subtracted from the confidence
proximity = 100

[policies]
enabled_policies = ["gdpr_extended"]

[policies.policy_configs.gdpr_extended]
enabled = true
settings = { confidence_threshold = "0.7" }
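
Before deploying a scanner like project_code, it can help to try its regex against sample text. The sketch below assumes the scanner regex behaves the same under Python's re module (the actual regex engine is not stated here) and that redaction replaces the whole match with redaction_pattern.

```python
import re

# Pattern and redaction string copied from the project_code scanner above.
PROJECT_CODE = re.compile(r"PROJ-([A-Z]{2})-([0-9]{4})")
REDACTION = "PROJ-XX-XXXX"

def redact(text: str) -> str:
    """Replace every project code match with the redaction string (assumed behavior)."""
    return PROJECT_CODE.sub(REDACTION, text)

sample = "Budget for PROJ-AB-1234 is confidential; PROJ-ab-1234 is not a match."
print(PROJECT_CODE.findall(sample))  # [('AB', '1234')]
print(redact(sample))
# Budget for PROJ-XX-XXXX is confidential; PROJ-ab-1234 is not a match.
```

The lowercase variant is left untouched because the pattern only accepts uppercase letters in the two-character group.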

For complete GDPR scanner mappings and compliance guidance, see GDPR Compliance.

Reducing False Positives from Test Files

Development environments often contain test fixtures with fake sensitive data. Use these strategies to reduce false positives.

Path-Based Exclusions

Exclude entire directories from scanning using global exclude_paths:

watch_paths = ["/home/**", "/var/data/**"]

exclude_paths = [
    # Test directories
    "/home/*/projects/*/tests/**",
    "/home/*/projects/*/test/**",
    "/home/*/projects/*/__tests__/**",

    # Test fixtures and mock data
    "/home/*/projects/*/fixtures/**",
    "/home/*/projects/*/mock-data/**",
    "/home/*/projects/*/testdata/**",

    # Build artifacts
    "/home/*/projects/*/node_modules/**",
    "/home/*/projects/*/target/**",
    "/home/*/projects/*/.git/**"
]

Keyword-Based Confidence Reduction

For files that can’t be excluded by path, use confidence_reduce to lower confidence when test-related keywords appear nearby:

[[scanners]]
name = "customer_id"
regex = "CUST-([0-9]{8})"
redaction_pattern = "CUST-XXXXXXXX"
base_confidence = 0.80

[scanners.confidence_reduce]
keywords = [
    # Test indicators
    "test", "spec", "mock", "fake", "dummy",
    # Documentation indicators
    "example", "sample", "demo", "placeholder",
    # Development indicators
    "fixture", "seed", "factory"
]
boost_amount = 0.50
proximity = 100

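With base_confidence = 0.80 and boost_amount = 0.50, a match near a test keyword drops to 0.80 - 0.50 = 0.30 confidence (in confidence_reduce, boost_amount is the amount subtracted). That is well below the 0.7 confidence_threshold used in the policy examples in this guide.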
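
The arithmetic can be sketched in Python. This is a simplified model, not the product's implementation: it assumes keywords are matched as case-insensitive substrings, that proximity is counted in characters on either side of the match, and that the result is clamped to the range 0.0 to 1.0.

```python
import re

CUSTOMER_ID = re.compile(r"CUST-([0-9]{8})")
REDUCE_KEYWORDS = [
    "test", "spec", "mock", "fake", "dummy",
    "example", "sample", "demo", "placeholder",
    "fixture", "seed", "factory",
]

def score(text, base=0.80, reduce_amount=0.50, proximity=100):
    """Return (match, confidence) pairs under the assumed reduction model."""
    results = []
    for m in CUSTOMER_ID.finditer(text):
        # Window of text within `proximity` characters of the match.
        window = text[max(0, m.start() - proximity): m.end() + proximity].lower()
        conf = base
        if any(k in window for k in REDUCE_KEYWORDS):
            conf -= reduce_amount
        results.append((m.group(0), round(max(0.0, min(1.0, conf)), 2)))
    return results

print(score("Invoice for CUST-12345678 was paid."))  # [('CUST-12345678', 0.8)]
print(score("# test fixture: CUST-00000001"))        # [('CUST-00000001', 0.3)]
```

With a policy threshold of 0.7, the first match would be reported and the second would not.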

For detailed confidence tuning patterns, see Custom Scanners - Confidence Tuning. For global configuration options, see Configuration.

SIEM Integration

Custom scanner findings flow to your SIEM through the OSQuery aquilon_dlp_alerts table.

How context_signals Flow to Alerts

The context_signals you define on custom scanners appear in alert metadata:

[[scanners]]
name = "patient_id"
regex = "PAT-([0-9]{8})"
redaction_pattern = "PAT-XXXXXXXX"
base_confidence = 0.85
context_signals = ["healthcare", "phi", "hipaa"]  # These flow to alerts

Key Alert Fields for Custom Scanners

Query custom scanner alerts via OSQuery:

SELECT
    timestamp,
    path,
    scanner,
    confidence,
    policy,
    severity,
    context
FROM aquilon_dlp_alerts
WHERE scanner LIKE 'custom:%'
ORDER BY timestamp DESC
LIMIT 100;

The context JSON field contains context_signals for SIEM filtering and routing.
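
As a rough illustration of filtering and routing on that field, the Python sketch below parses the context column of an alert row and picks a destination queue. The JSON layout (a top-level context_signals array) and the queue names are assumptions for illustration; check the full alert schema for the real structure.

```python
import json

def signals(row):
    """Extract context_signals from a row's context JSON; empty list if absent or invalid."""
    try:
        ctx = json.loads(row.get("context") or "{}")
    except json.JSONDecodeError:
        return []
    found = ctx.get("context_signals", []) if isinstance(ctx, dict) else []
    return found if isinstance(found, list) else []

def route(row):
    """Pick a queue for the alert. Queue names are hypothetical."""
    found = signals(row)
    if "phi" in found or "hipaa" in found:
        return "healthcare-soc"
    return "default"

# Illustrative row shaped like the query result above.
alert = {
    "scanner": "custom:patient_id",
    "context": '{"context_signals": ["healthcare", "phi", "hipaa"]}',
}
print(route(alert))  # healthcare-soc
```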

Splunk Integration Example

Schedule OSQuery to export alerts, then query in Splunk:

index=osquery sourcetype=osquery:results name=aquilon_dlp_alerts
| spath input=columns.context
| search context_signals="*healthcare*"
| stats count by scanner, severity, policy

For complete SIEM integration including Elastic Stack, see Monitoring - SIEM Integration. For the full alert schema, see API Integration.

Fleet Deployment

Deploy custom scanner configurations across your fleet using MDM or configuration management tools.

Centralized Configuration

1. Create a base configuration with your custom scanners and policies
2. Deploy via MDM (Jamf, Intune, Kandji) to managed devices
3. Verify deployment using OSQuery fleet queries

Example verification query to confirm custom scanners are active:

SELECT
    name,
    version,
    status
FROM aquilon_dlp_status
WHERE status = 'running';
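
Once the query has run across the fleet, the results can be summarized to find hosts that need attention. The Python sketch below assumes the results have already been collected into a mapping from host name to the rows returned by the query; that shape, the host names, and the row values are hypothetical.

```python
def hosts_not_running(results_by_host):
    """Return hosts whose result set has no row with status == 'running'."""
    return sorted(
        host
        for host, rows in results_by_host.items()
        if not any(row.get("status") == "running" for row in rows)
    )

# Illustrative data only; host names and row values are made up.
fleet_results = {
    "laptop-001": [{"name": "aquilon-dlp", "version": "1.0.0", "status": "running"}],
    "laptop-002": [],  # the query returned no rows for this host
}
print(hosts_not_running(fleet_results))  # ['laptop-002']
```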

Deployment Resources

MDM deployment guide: MDM Deployment - PPPC profiles, staged rollout
Enterprise patterns: Enterprise Deployment - pilot groups, success metrics
macOS requirements: macOS Installation - Full Disk Access, entitlements

Performance Considerations

Custom scanners add minimal overhead, but keep these guidelines in mind for large fleets:

Scanner Count

Up to 20 custom scanners: Negligible performance impact
20-50 custom scanners: Monitor scan latency metrics
More than 50 custom scanners: Consider splitting into multiple policies by use case

Proximity Search Tuning

Large proximity values in confidence_boost/confidence_reduce increase memory usage per scan:

[scanners.confidence_boost]
keywords = ["confidential"]
boost_amount = 0.20
proximity = 100    # Recommended: 100-200 bytes
# proximity = 1000  # Avoid: increases memory per match

Monitoring Performance

Track scanner performance via Prometheus metrics:

aquilon_scan_duration_seconds - Per-file scan time
aquilon_scanner_matches_total - Matches by scanner name
aquilon_queue_depth - Work queue backlog
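
For a quick check without a full Prometheus setup, metrics in the text exposition format can be read with a few lines of Python. This is a minimal sketch: the sample text, its values, and the scanner label name are assumptions, and the parser ignores optional timestamps and label values that contain a closing brace.

```python
import re

# name, optional {labels}, value
LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{[^}]*\})?\s+(\S+)$')

def parse(text):
    """Parse exposition-format lines into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = LINE.match(line)
        if m:
            samples.append((m.group(1), m.group(2) or "", float(m.group(3))))
    return samples

# Illustrative sample; values and the label name are made up.
SAMPLE = """\
# TYPE aquilon_queue_depth gauge
aquilon_queue_depth 12
aquilon_scanner_matches_total{scanner="custom:patient_id"} 42
"""

for name, labels, value in parse(SAMPLE):
    print(name, labels, value)
```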

For metrics setup, see Monitoring.
