AI & Compliance

Getting AI to work in compliance workflows at scale

AI & Compliance · Financial Services · 5 min read

71%

Reduction in false-positive alerts

40%

Fewer analyst hours per month

99.1%

SAR decision accuracy maintained

A bank's compliance function was processing 180,000 transaction monitoring alerts per month, of which fewer than 2% resulted in a filed suspicious activity report. The remaining 98% consumed analyst time that could have been spent on genuinely suspicious activity. The compliance leadership had evaluated AI solutions from three vendors, all of whom had demonstrated impressive accuracy rates in controlled demos. When those same vendors were asked to show the models running on the bank's actual transaction data, none of them could — because the bank's data infrastructure wasn't in a state that any of the models could consume.

The outcome

We built the data infrastructure that the AI required before we touched the models. Twelve months later, the transaction monitoring alert volume was down 71% with no decrease in SAR filing accuracy. The compliance team handles the same caseload with 40% fewer analyst hours.

01

The AI readiness gap in financial compliance

Transaction monitoring AI requires two things that most banks don't have: clean, consistently structured transaction data, and labelled historical data that tells the model which past alerts were genuine and which were false positives. The bank had transaction data — it had fifteen years of it — but the data was distributed across four different core banking systems that had been consolidated over time, each with a different schema and a different set of transaction type codes. A wire transfer in System A was coded differently from a wire transfer in System B. The consolidation projects had mapped the codes at the time, but the mapping tables hadn't been maintained as the systems evolved, and some categories of transaction were classified inconsistently across systems.

The labelled historical data was worse: analyst decisions on historical alerts were stored in a case management system that had been replaced twice, and the decision records in the current system only went back four years. The decisions made on the previous eleven years of alerts were in a decommissioned system that was accessible only through a read-only interface with a response time measured in minutes per query.
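To make the mapping-table problem concrete, here is a minimal sketch of how stale mapping tables surface: codes that appear in a source system's live transactions but that the system's mapping table can no longer resolve. All system names, type codes, and canonical type names below are invented for illustration — they are not the bank's actual codes.

```python
# Hypothetical per-system mapping tables from source-system transaction type
# codes to canonical types. Real tables would cover far more codes.
MAPPINGS = {
    "system_a": {"WTR": "wire_transfer", "ACH": "ach_credit"},
    "system_b": {"03": "wire_transfer", "07": "ach_credit"},
}

def unmapped_codes(system: str, observed_codes: set[str]) -> set[str]:
    """Return codes seen in live data that the mapping table cannot resolve."""
    return observed_codes - set(MAPPINGS[system])

# Example: System B later introduced code "09" without updating its table.
drift = unmapped_codes("system_b", {"03", "07", "09"})  # drift == {"09"}
```

A check like this, run against each system's live code set, is one way to detect the drift that had accumulated here over years of unmaintained mapping tables.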

02

Building a unified transaction data layer

The first six months of the engagement were spent building the data infrastructure, not the AI. We built a unified transaction data model that represented transaction types consistently regardless of which legacy system had originated the record. The model had 43 canonical transaction types, each with a defined mapping from the type codes used in each of the four source systems. We built a data pipeline that read from all four source systems — including the decommissioned read-only system — and transformed every transaction record into the canonical model. The pipeline ran on a 15-minute cadence for current transactions and ran a historical pass to backfill the unified data store with fifteen years of transaction history. The backfill took eleven weeks because the decommissioned system's response time limited the throughput of historical data extraction. By the end of the backfill, the unified data store contained 2.3 billion transaction records in a consistent format for the first time in the bank's history.
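The normalisation step at the heart of the pipeline can be sketched as follows. This is an illustrative simplification: the field names, the raw record shape, and the code map are assumptions, and the real model had 43 canonical types rather than the two shown here.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical mapping from (source system, type code) to canonical type.
CODE_MAP = {
    ("system_a", "WTR"): "wire_transfer",
    ("system_b", "03"): "wire_transfer",
}

@dataclass(frozen=True)
class CanonicalTransaction:
    source_system: str
    source_id: str
    canonical_type: str
    amount_minor_units: int  # amounts as integer minor units, avoiding floats
    timestamp: datetime

def normalise(system: str, raw: dict) -> CanonicalTransaction:
    """Map one raw source-system record into the canonical model."""
    return CanonicalTransaction(
        source_system=system,
        source_id=raw["id"],
        canonical_type=CODE_MAP[(system, raw["type_code"])],
        amount_minor_units=raw["amount_minor_units"],
        timestamp=datetime.fromisoformat(raw["ts"]),
    )

# A wire transfer from System B, despite its different code, lands in the
# same canonical type as one from System A.
tx = normalise("system_b", {
    "id": "t1", "type_code": "03",
    "amount_minor_units": 125000, "ts": "2020-01-15T09:30:00",
})  # tx.canonical_type == "wire_transfer"
```

The same `normalise` function serves both the 15-minute incremental reads and the historical backfill, which is what keeps the two paths consistent.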

03

Labelling historical alert decisions for model training

The labelled training data problem required a different approach. We couldn't recover the decision records from the decommissioned case management systems — the data was there, but extracting eleven years of records at the system's query speed would have taken longer than the engagement. Instead, we built training labels from the data we did have. The current case management system had four years of decision records: for each alert, whether the analyst had escalated to a filed SAR or closed the alert as a false positive. We combined these decision records with the behavioural features of the transactions that generated each alert — the transaction amount, the counterparty, the transaction type, the customer's historical pattern — to build a feature set that represented each alert at the time it was generated. The model was trained to predict, given the features of an alert, whether it was likely to result in a SAR filing.

We validated the training data by having the compliance team's senior analysts review a sample of the training labels — the cases where the model's prediction was most uncertain — to confirm that the historical decisions reflected the bank's current SAR filing standards. Fourteen percent of the reviewed labels required correction, which was higher than expected and prompted an expanded manual review of high-uncertainty training cases before model training began.
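The label-building and uncertainty-sampling steps above can be sketched in a few lines. The record fields, the feature choices, and the uncertainty band are invented for illustration — the real feature set and review criteria were the bank's own.

```python
# Hypothetical sketch: turn a historical case decision into a (features, label)
# training pair, then select the model's most uncertain cases for analyst review.

def build_training_pair(alert: dict, decision: str) -> tuple[list[float], int]:
    """Features describe the alert at generation time; label is 1 if it led to a SAR."""
    features = [
        float(alert["amount"]),
        float(alert["customer_avg_amount"]),   # customer's historical pattern
        float(alert["is_new_counterparty"]),
        float(alert["txn_type_risk_weight"]),
    ]
    label = 1 if decision == "sar_filed" else 0
    return features, label

def review_queue(scored: list[tuple[str, float]], band: float = 0.15) -> list[str]:
    """Alerts whose predicted SAR probability sits near 0.5 go to senior analysts."""
    return [alert_id for alert_id, p in scored if abs(p - 0.5) <= band]

# Confident predictions (0.92, 0.05) skip review; borderline ones (0.48, 0.60)
# are routed to the senior analysts.
queue = review_queue([("a", 0.92), ("b", 0.48), ("c", 0.05), ("d", 0.60)])
# queue == ["b", "d"]
```

Sampling the most uncertain predictions, rather than a uniform random sample, is what surfaced the 14% label-correction rate: those are exactly the cases where historical decisions are most likely to diverge from current filing standards.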

04

Production deployment and the governance framework

We deployed the model in shadow mode for eight weeks before it began influencing alert triage. In shadow mode, the model scored every alert but its score was not shown to analysts — analysts worked the alerts in their standard way, and at the end of each day we compared the model's scores to the analyst decisions. The shadow mode data showed that the model agreed with analyst decisions on 91.3% of alerts, and that where they disagreed, the analyst's decision was more often correct on high-value alerts and the model's was more often correct on low-value, high-volume alerts.

This finding shaped the governance framework for live deployment: the model's scores influence — but do not determine — the triage priority of each alert. Alerts the model scores as high-confidence genuine are elevated in the queue; alerts scored as high-confidence false positives are deprioritized but not suppressed. Analysts review all alerts; the model's role is to sequence the queue so that analyst attention goes first to the alerts most likely to require action. The compliance team's internal audit function reviews a sample of deprioritized alerts each month to confirm that genuine suspicious activity isn't being systematically deprioritized — this review has found no such cases in the 18 months since live deployment began.
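The governance constraint described above — sequence, never suppress, and audit the deprioritized tail — can be sketched as follows. The threshold and sample size are invented placeholders, not the bank's actual parameters.

```python
import random

def sequence_queue(alerts: list[dict]) -> list[dict]:
    """Order alerts so the highest SAR-likelihood scores are worked first.
    Every alert remains in the queue; nothing is suppressed."""
    return sorted(alerts, key=lambda a: a["score"], reverse=True)

def audit_sample(queue: list[dict], threshold: float = 0.2, k: int = 5) -> list[dict]:
    """Monthly audit: a random sample of the deprioritized (low-score) tail,
    reviewed to confirm genuine activity isn't being systematically buried."""
    tail = [a for a in queue if a["score"] < threshold]
    return random.sample(tail, min(k, len(tail)))

alerts = [{"id": 1, "score": 0.1}, {"id": 2, "score": 0.9}, {"id": 3, "score": 0.5}]
queue = sequence_queue(alerts)  # ids in order: 2, 3, 1 — all three still present
```

The key design choice is that `sequence_queue` returns a permutation of its input, never a subset: the model reorders analyst attention, while the decision to close or escalate stays with the analyst.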

05

Model performance, auditability, and what came next

In the eighteen months since live deployment, the transaction monitoring alert volume has declined from 180,000 per month to 52,000 per month — a 71% reduction — while the number of SARs filed per month has remained stable. The reduction in alert volume reflects the model's improved ability to separate legitimate activity from suspicious activity at the point of alert generation, not just at the triage stage. The compliance function's regulators reviewed the AI programme during a scheduled examination and found the governance framework — the shadow mode validation, the audit review process, the model explainability documentation — to be consistent with their guidance on model risk management. The bank has since extended the AI programme to two additional compliance workflows: customer due diligence refresh and correspondent banking transaction monitoring, both of which are in shadow mode deployment as of the most recent programme review.

Facing a similar infrastructure challenge?

We're happy to have a technical conversation about your specific environment — no commitment required.