Fraud detection is not a single model.
It is a decision system that has to operate under latency pressure, incomplete information, adversarial behavior, and messy business trade-offs.
If the system is too loose, fraud losses rise.
If the system is too aggressive, good customers get blocked, revenue drops, support queues explode, and trust erodes.
That tension is what makes fraud detection a system design problem rather than just a data science problem.
This guide designs a production fraud detection platform.
Problem Statement
Build a platform that evaluates risky actions in real time and helps risk teams investigate suspicious activity.
Examples:
- card payment authorization
- account signup
- password reset abuse
- coupon abuse
- refund fraud
- payout fraud
- account takeover
- bot-driven checkout attempts
The platform should:
- score actions in milliseconds
- combine deterministic rules with model output
- support manual review when needed
- learn from confirmed outcomes
- keep an audit trail for decisions
Requirements
Functional requirements:
- ingest online events in real time
- score transactions synchronously for critical flows
- support rules and ML model decisions together
- fetch historical and behavioral features
- support allow / review / deny outcomes
- support merchant or tenant-specific policies
- expose case management for analysts
- ingest feedback labels such as chargebacks and confirmed fraud
- support replay and backtesting
Non-functional requirements:
- p99 decision latency low enough for checkout or login
- high availability
- explainable decisions
- protection against duplicate event processing
- graceful degradation if some features are unavailable
- strict auditability
- low false-positive rate
- scalable feature retrieval and aggregation
The core challenge is not just "detect fraud."
It is making fast, explainable, business-safe decisions with incomplete and adversarial data.
Decision Outcomes
The platform usually returns one of three results:
ALLOW
REVIEW
DENY
Examples:
- low-risk trusted customer payment ->
ALLOW - first-time high-value transaction from unusual device ->
REVIEW - stolen card pattern or impossible geo velocity ->
DENY
Do not reduce the system to a binary allow/deny model unless your business can tolerate blunt decisions. Review is often what keeps false positives from wrecking revenue.
High-Level Architecture
Client Action
|
v
Risk API
|
+--> Feature Fetch
| +--> online feature store
| +--> hot aggregates / counters
| +--> historical profile lookup
|
+--> Rules Engine
|
+--> ML Scoring Service
|
+--> Decision Combiner
|
v
ALLOW / REVIEW / DENY
|
+--> Event Log
+--> Analyst Queue
+--> Training / Feedback Pipeline
The request path should remain small and deterministic:
- collect request context
- fetch essential features
- evaluate rules
- score model
- combine into final decision
- log everything
Example Flow: Payment Fraud Check
- checkout service calls risk API with payment attempt
- risk API computes derived fields such as amount bucket and local hour
- online features are loaded:
- card attempts in last 10 minutes
- device velocity
- user account age
- historical chargeback rate
- IP reputation
- rules engine checks hard constraints
- model returns fraud score, say
0.87 - decision combiner applies policy:
- score > 0.95 -> deny
- score between 0.75 and 0.95 -> review
- high-risk rule with override -> deny
- result returned to checkout in under 150ms
Data Inputs
Fraud systems usually combine several data classes.
1. Request context
- user id
- phone
- card fingerprint
- device fingerprint
- IP
- amount
- currency
- merchant
- SKU / category
- geolocation
2. Historical user features
- account age
- successful transaction count
- recent failed attempts
- prior refunds
- known device count
3. Shared risk features
- IP reputation
- BIN / issuer country
- card country mismatch
- email domain age
- device velocity across many accounts
4. Feedback labels
- chargeback received
- analyst confirmed fraud
- customer reported unauthorized activity
- trusted order / safe event
Without labels, the system cannot improve.
Online vs Offline Features
Some features are naturally real time:
- attempts from this IP in last 5 minutes
- number of cards seen on this device in last hour
- transaction count for this user today
Some are batch-driven:
- customer lifetime value
- chargeback rate over 90 days
- merchant-level dispute trend
- device risk profile from previous weeks
The fraud platform needs both.
Common architecture:
Streaming events -> online counters / feature store
Historical warehouse -> offline features / training datasets
That is why feature freshness and point-in-time correctness matter.
API Design
POST /v1/risk/evaluate
Idempotency-Key: pay-attempt-ord_123
{
"tenantId": "merchant_42",
"eventType": "payment_attempt",
"eventId": "evt_991",
"userId": "user_123",
"amount": 12999,
"currency": "INR",
"paymentMethod": {
"type": "card",
"cardFingerprint": "cf_77",
"bin": "411111",
"issuerCountry": "US"
},
"device": {
"deviceId": "dev_88",
"ip": "103.44.11.19",
"userAgent": "Mozilla/5.0"
},
"metadata": {
"checkoutId": "chk_55",
"cartValue": 12999,
"shippingCountry": "IN",
"billingCountry": "US"
}
}
Response:
{
"decision": "REVIEW",
"riskScore": 0.87,
"reasonCodes": [
"HIGH_DEVICE_VELOCITY",
"CARD_COUNTRY_MISMATCH",
"NEW_ACCOUNT_HIGH_VALUE"
],
"reviewQueue": "payments_high_risk"
}
Reason codes are not decoration. They are necessary for analysts, support, backtesting, and trust.
Event and Decision Storage
Raw request log
CREATE TABLE fraud_events (
event_id TEXT PRIMARY KEY,
tenant_id TEXT NOT NULL,
event_type TEXT NOT NULL,
user_id TEXT,
device_id TEXT,
ip INET,
amount_minor BIGINT,
currency TEXT,
payload JSONB NOT NULL,
received_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
Decision log
CREATE TABLE fraud_decisions (
decision_id UUID PRIMARY KEY,
event_id TEXT NOT NULL REFERENCES fraud_events(event_id),
tenant_id TEXT NOT NULL,
decision TEXT NOT NULL,
risk_score NUMERIC(5,4),
policy_version TEXT NOT NULL,
model_version TEXT,
reason_codes JSONB NOT NULL,
latency_ms INT NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
CREATE INDEX idx_fraud_decisions_tenant_created
ON fraud_decisions (tenant_id, created_at DESC);
Feedback table
CREATE TABLE fraud_feedback (
feedback_id UUID PRIMARY KEY,
event_id TEXT NOT NULL,
tenant_id TEXT NOT NULL,
label TEXT NOT NULL, -- fraud, legitimate, chargeback, safe
source TEXT NOT NULL, -- analyst, chargeback_feed, customer_report
confidence NUMERIC(5,4),
metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
created_at TIMESTAMPTZ NOT NULL DEFAULT now()
);
These three tables give you:
- traceability
- replay inputs
- model training labels
- decision auditing
Rules Engine
Fraud platforms should not be rules-only, but rules remain essential.
Why rules matter:
- immediate response to new attack patterns
- strong hard-stop protections
- deterministic explanations
- business overrides for specific merchants or flows
Examples:
- deny if IP is on explicit blocklist
- deny if device seen on >20 accounts in 10 minutes
- review if amount > threshold and account age < 1 day
- allow trusted VIP user below risk limit
Example rule model:
{
"ruleId": "new_account_high_value_review",
"eventType": "payment_attempt",
"priority": 200,
"condition": {
"all": [
{ "field": "account_age_hours", "op": "<", "value": 24 },
{ "field": "amount_minor", "op": ">", "value": 100000 }
]
},
"action": "REVIEW",
"reasonCode": "NEW_ACCOUNT_HIGH_VALUE"
}
Simple rule languages are easier to operate than overly clever ones.
Model Scoring
Rules are good for known patterns.
Models help with:
- weighted combination of many weak signals
- detecting non-obvious interactions
- adapting to evolving patterns
A typical scoring service:
input features -> feature vector -> model -> fraud score between 0 and 1
Example Python-style scoring pseudocode:
def score(features: dict) -> float:
vector = [
features["ip_txn_count_10m"],
features["device_accounts_24h"],
features["user_account_age_hours"],
features["amount_zscore"],
features["country_mismatch_flag"],
features["successful_payments_30d"],
features["chargeback_rate_90d"],
]
return model.predict_proba([vector])[0][1]
The model is only as good as:
- feature quality
- label quality
- point-in-time correctness
- safe thresholds
Feature Retrieval
The biggest operational risk in fraud systems is not often the model itself.
It is the feature path.
If feature retrieval is slow or inconsistent, the decision path breaks.
Typical feature sources:
- Redis for hot counters
- online feature store for materialized features
- relational DB for account profile
- external enrichment for IP reputation or BIN data
Good rule:
- external calls should be optional or precomputed
- the synchronous path should avoid long network chains
Example aggregation keys:
ip:103.44.11.19:txn_count_10m
device:dev_88:distinct_accounts_24h
user:user_123:failed_payments_1d
card:cf_77:merchant_attempts_1h
Real-Time Counters
Fraud systems rely heavily on short-window velocity checks.
Examples:
- 8 cards used on same device in 5 minutes
- 30 failed OTP attempts from same IP in 10 minutes
- 5 payout attempts from new bank accounts in 1 hour
These are usually tracked with Redis or streaming state stores.
Redis example:
-- increment counter with TTL
local current = redis.call('INCR', KEYS[1])
if current == 1 then
redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return current
Usage:
KEYS[1] = "ip:103.44.11.19:txn_count_600s"
ARGV[1] = 600
This gives you cheap velocity features for the scoring path.
Decision Combiner
Do not let the model directly decide everything.
You usually want a combiner like:
- hard rules first
- model score second
- business exceptions last
Example:
public RiskDecision combine(RuleOutcome rules, ModelOutcome model, Policy policy) {
if (rules.hardDeny()) {
return RiskDecision.deny(rules.reasonCodes());
}
if (rules.forceAllow()) {
return RiskDecision.allow(rules.reasonCodes());
}
double score = model.score();
if (score >= policy.denyThreshold()) {
return RiskDecision.deny(model.reasonCodes());
}
if (score >= policy.reviewThreshold()) {
return RiskDecision.review(model.reasonCodes());
}
return RiskDecision.allow(model.reasonCodes());
}
This keeps business control explicit instead of buried in opaque model behavior.
Latency Budget
For payments and login, the synchronous path must be tight.
Example budget:
Request parsing 5 ms
Feature fetch 30 ms
Rules evaluation 5 ms
Model scoring 20 ms
Decision logging async 5 ms
Safety margin 15 ms
-------------------------------
Total 80 ms
If your risk API depends on six downstream services, this budget will not survive real traffic.
Degradation Strategy
What happens if some parts of the risk stack are unavailable?
Possible failures:
- Redis unavailable
- model service slow
- external enrichment down
- feature store lagging
Your policy should define fallback behavior by event type.
Examples:
- low-value signup: fail open or soft review
- card payout: fail closed or review
- password reset: maybe extra challenge instead of full denial
A fraud platform must be explicit about fail-open vs fail-closed by flow.
Case Management
Review queues are part of the platform, not an afterthought.
Analysts need:
- event details
- feature snapshot at decision time
- decision reason codes
- linked user/device/IP history
- action tools: mark fraud, mark safe, escalate
Example case table:
CREATE TABLE fraud_cases (
case_id UUID PRIMARY KEY,
tenant_id TEXT NOT NULL,
event_id TEXT NOT NULL,
status TEXT NOT NULL, -- open, investigating, resolved
queue TEXT NOT NULL,
assignee TEXT,
decision_snapshot JSONB NOT NULL,
created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
updated_at TIMESTAMPTZ NOT NULL DEFAULT now(),
resolved_at TIMESTAMPTZ
);
Do not make analysts query five systems to understand one payment attempt.
Feedback Loop
Fraud systems improve only if decisions are joined with outcomes later.
Examples of outcome sources:
- chargeback feed
- merchant dispute
- analyst label
- customer complaint
- manual allowlist decision
This data should feed:
- rule tuning
- threshold tuning
- model retraining
- precision / recall tracking
Without outcome joins, you are operating blind.
Offline Training and Backtesting
Any serious fraud platform needs replay.
Questions risk teams will ask:
- what would have happened if deny threshold were 0.92 instead of 0.95?
- how many good users would have been reviewed?
- what chargeback loss would have been prevented?
- what if this new rule had been active last month?
That requires:
- stored input events
- stored feature snapshots or reproducible feature generation
- stored model and policy versions
Backtesting is how you improve without gambling on production.
Multi-Tenant Design
If the platform serves many merchants or products:
- isolate data by tenant
- allow tenant-specific thresholds and rule overrides
- support tenant-specific feature configurations
- prevent one noisy tenant from overwhelming shared counters or review queues
Different businesses have different fraud tolerances.
A digital wallet, an airline, and a food-delivery platform should not share the exact same decision thresholds.
Common Failure Modes
1. Feature leakage
Training data accidentally uses future information.
Result:
- offline metrics look amazing
- production performance is much worse
Fix:
- point-in-time correct feature generation
2. Duplicate event scoring
Same event is processed multiple times.
Result:
- duplicate review cases
- inflated counters
Fix:
- idempotency key and dedupe on event id
3. Model drift
Attack patterns evolve and model performance decays.
Fix:
- track approval rate, fraud rate, review rate, precision, recall proxies
4. Over-aggressive rules
A new rule blocks too many good users.
Fix:
- rule shadow mode
- staged rollout
- tenant-level blast radius control
5. Slow external enrichments
IP or device intelligence provider becomes slow.
Fix:
- precompute where possible
- timeout aggressively
- continue with degraded policy
Observability
Metrics to track:
- decision latency p50 / p95 / p99
- allow / review / deny rate
- per-tenant approval rate
- model score distribution
- feature fetch timeout rate
- rule match counts
- review queue backlog
- confirmed fraud by segment
- chargeback rate over time
Important dashboards:
- decision distribution by tenant
- false-positive proxies after new rule rollout
- model score drift
- hot IP / device patterns
Fraud systems need both engineering monitoring and business monitoring.
Example End-to-End Service
@Service
public class FraudDecisionService {
public RiskDecision evaluate(FraudRequest request) {
FraudEvent event = rawEventStore.save(request);
FeatureBundle features = featureService.load(event);
RuleOutcome ruleOutcome = rulesEngine.evaluate(event, features);
ModelOutcome modelOutcome = modelService.score(features);
RiskDecision decision = decisionCombiner.combine(
ruleOutcome,
modelOutcome,
policyService.policyFor(event.tenantId(), event.eventType())
);
decisionLogStore.save(event, features, ruleOutcome, modelOutcome, decision);
if (decision.requiresReview()) {
caseService.openCase(event, features, decision);
}
return decision;
}
}
That is the conceptual core: event, features, rules, model, decision, logging, and case creation.
What I Would Build First
Phase 1:
- real-time risk API
- deterministic rules engine
- hot counters in Redis
- decision log and analyst queue
Phase 2:
- model scoring service
- online feature store
- feedback ingestion
- shadow evaluation for new policies
Phase 3:
- automated retraining pipeline
- sophisticated device graph features
- merchant-specific policy customization
- replay and backtesting UI
This order matters. Teams often rush into fancy ML before they have high-quality event logs, counters, and analyst feedback.
Production Checklist
- idempotent event ingestion
- low-latency feature path
- hard rules supported
- model version tracked
- decision reason codes stored
- analyst case tooling available
- feedback labels ingested
- shadow mode supported for rule/model rollout
- fail-open / fail-closed policy defined per event type
- replay and backtesting path exists
Final Takeaway
A fraud detection platform is a real-time decision and learning system.
Its job is not to maximize model score accuracy in isolation.
Its job is to reduce fraud loss while protecting legitimate user experience, staying explainable, and remaining safe under constant change.
If you design it well, risk teams move fast without blind spots.
If you design it poorly, you get the worst of both worlds: fraud still gets through, and good users get blocked.
