AEGISX cloud pipeline

ics-security · cloud

Can OT security telemetry reach the cloud without the pipeline becoming the weak link?

Alert flood at incident peak: 100+ emails / 5 min (cloud/docs/adr-001-alert-aggregation.md, incident summary)

Stuck records drained to DLQ: 379 (cloud/docs/adr-002-stream-resilience.md, verification)

Post-hardening e2e errors: 0 (adr-001/002 verification: 4 records, 2 aggregated alerts, ~600 ms)

Python · AWS Kinesis · Lambda · S3 · SNS · CloudFormation · 2026

Problem

AEGISX is a university capstone: a smart-grid industrial-control-system security platform built by a team of seven, spanning OT encryption, a Modbus firewall, an LSTM anomaly detector, a data diode, and an operator dashboard. Everything scored stayed on the factory floor. My slice was the cloud: take the security events the on-prem stack produces — every Modbus frame relayed through an AES-256-GCM proxy pair, scored live by the LSTM before it leaves the premises — and stream them into AWS storage and alerting that an operator can trust. The risk is obvious: a monitoring pipeline that floods, drops, or stalls is itself a security failure.

Approach

I owned the cloud subtree end to end and treated the boundary as a contract: a canonical 14-field event schema, validated in the processor, with invalid records quarantined rather than dropped. Architecture decisions are recorded as ADRs — authored, as the documents themselves state, by me with an AI pair — and the infrastructure is captured as a validated CloudFormation template after live console drift proved that memory is not documentation. Verification came before claims: unit tests on the aggregation logic, a live end-to-end injection, and a test runbook.
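The validate-or-quarantine contract can be sketched roughly as below. This is a minimal illustration, not the project's processor: the four field names stand in for the real 14-field schema, which is not listed here, and `validate_event` and `route` are hypothetical helper names.

```python
# Hypothetical subset of the schema: field name -> accepted type(s).
# The real canonical schema has 14 fields; these names are assumptions.
REQUIRED_FIELDS = {
    "event_id": str,
    "timestamp": str,
    "label": str,
    "confidence": (int, float),
}


def validate_event(event):
    """Return a list of problems; an empty list means the event is valid."""
    if not isinstance(event, dict):
        return ["event is not a JSON object"]
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in event:
            problems.append(f"missing field: {name}")
        # bool is a subclass of int, so reject it explicitly.
        elif isinstance(event[name], bool) or not isinstance(event[name], expected):
            problems.append(f"wrong type for field: {name}")
    return problems


def route(events):
    """Split events into (valid, quarantined); nothing is silently dropped."""
    valid, quarantined = [], []
    for event in events:
        problems = validate_event(event)
        if problems:
            # Keep the original record next to the reasons it failed.
            quarantined.append({"record": event, "problems": problems})
        else:
            valid.append(event)
    return valid, quarantined
```

Keeping the rejected record together with its failure reasons is what makes a quarantine prefix useful for later inspection, as opposed to a counter of dropped events.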

Architecture

Diagram, on-prem (team stack): Modbus traffic → AES-256-GCM proxy (firewall · RTT tracking) → LSTM scoring (on-prem, pre-cloud) → producer (4 workers · backoff)

Diagram, AWS ap-southeast-2 (my slice): Kinesis (single shard) → Lambda (validate · classify) → S3 (data lake + quarantine) and SNS alerts (aggregated / batch); SQS DLQ (14-day retention) on failure

Planned, not built: Timestream, SageMaker, CloudFront

Scored events leave the producer with exponential-backoff retries, land in a Kinesis stream, and trigger a Python 3.12 Lambda in batches of up to 100: validate against the schema, classify each record as normal, anomaly, or tamper, write everything to a Hive-partitioned S3 data lake (invalid records to a quarantine prefix), and publish one aggregated SNS alert per batch — tamper unconditionally, anomalies only at confidence 75 or above. Failures retry three times with batch bisection, age out after six hours, and land in a dead-letter queue. The diagram is honest about scope: the analytics layer — Timestream, SageMaker retraining, CloudFront — stayed on the drawing board.
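The alerting rule in this paragraph (tamper always, anomalies only at confidence 75 or above, at most one message per batch) might look something like the sketch below. The `label` and `confidence` field names, the summary layout, and the function names are assumptions for illustration; the actual Lambda code is not reproduced here, and the SNS publish call is deliberately left out.

```python
from collections import Counter

# From the text: anomalies alert only at confidence 75 or above.
ANOMALY_CONFIDENCE_GATE = 75


def should_alert(event):
    """Tamper alerts unconditionally; anomalies must clear the gate."""
    label = event.get("label")
    if label == "tamper":
        return True
    if label == "anomaly":
        return event.get("confidence", 0) >= ANOMALY_CONFIDENCE_GATE
    return False


def aggregate_alert(batch):
    """Return one summary dict for the batch, or None if nothing qualifies."""
    alerting = [e for e in batch if should_alert(e)]
    if not alerting:
        return None
    counts = Counter(e["label"] for e in alerting)
    return {
        "batch_size": len(batch),
        "alert_count": len(alerting),
        "tamper": counts.get("tamper", 0),
        "anomaly": counts.get("anomaly", 0),
        "max_confidence": max(e.get("confidence", 0) for e in alerting),
    }
```

Because the summary is built once per batch, the number of messages is bounded by the number of Lambda invocations instead of the number of events, which is the property that matters during a flood.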

Results

The pipeline’s defining result came from breaking it. A console-drifted three-second Lambda timeout (the template said thirty), per-event SNS publishing, and Kinesis’s default infinite retries combined into a self-sustaining loop: roughly nineteen errored invocations a minute and more than a hundred alert emails every five minutes, ended by an emergency disable of the event source mapping. The fix is recorded in two ADRs: batch-aggregated alerts behind a confidence gate, bounded retries with bisection, a six-hour record age cap, and the dead-letter queue. Verification followed: ten unit tests on the aggregation logic; a live four-event injection producing four S3 objects, two aggregated alerts, and zero errors in about 600 ms; and 379 stuck records drained cleanly into the DLQ. The README’s own status line is kept honest — demo-validated, but not yet production-hardened.
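For reference, the bounded-retry settings described here map onto the Lambda event source mapping parameters roughly as follows. This is a sketch only: the project records these settings in its CloudFormation template, which is not shown, so the helper name, the placeholder ARNs, and the `LATEST` starting position are assumptions. The keys are the parameter names that boto3's `create_event_source_mapping` accepts; for Kinesis, `MaximumRetryAttempts` and `MaximumRecordAgeInSeconds` both default to -1 (no limit), which is the default behaviour behind the incident.

```python
SIX_HOURS_SECONDS = 6 * 60 * 60  # record age cap from the text


def event_source_mapping_settings(stream_arn, function_name, dlq_arn):
    """Build keyword arguments for a Kinesis -> Lambda event source mapping.

    The ARNs and function name are placeholders supplied by the caller.
    """
    return {
        "EventSourceArn": stream_arn,
        "FunctionName": function_name,
        "StartingPosition": "LATEST",  # assumption; not stated in the text
        "BatchSize": 100,  # batches of up to 100 records
        "MaximumRetryAttempts": 3,  # bounded instead of the default -1
        "BisectBatchOnFunctionError": True,  # split a failing batch to isolate bad records
        "MaximumRecordAgeInSeconds": SIX_HOURS_SECONDS,
        "DestinationConfig": {"OnFailure": {"Destination": dlq_arn}},
    }
```

With boto3 installed and credentials configured, the resulting dict could be passed as `boto3.client("lambda").create_event_source_mapping(**settings)`; that call is not made here.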

Reflection

The incident taught more than the build. Cloud defaults are a threat model: infinite retries turned one slow function into an alert flood. Console drift is real — the live timeout silently disagreed with the template until it mattered, which is why the CloudFormation template now documents intent even where resources were first provisioned by hand. Honest status language (“demo-validated”) survived into the docs because overclaiming in a security project is its own kind of vulnerability. Left open at handover: migrating off a personal AWS account, completing the live OT-bridge integration, and the planned analytics layer. The ADRs credit their authorship plainly — me, with an AI pair — the same working arrangement that built this site.