
Example: Customer API Platform

About This Example

This is a fictional but realistic Solution Architecture Document for Meridian Financial Services’ Customer API Platform. It demonstrates the Architecture Description Standard at Comprehensive depth — the highest level of documentation rigour. Every section is completed with realistic content to show what a mature, well-documented SAD looks like for a Tier 1 Critical, regulated financial services API platform.

Fictional company: Meridian Financial Services (MFS) — a mid-sized UK retail bank. Fictional solution: Customer API Platform — a cloud-native REST API providing account and transaction data to partner fintech applications under Open Banking regulations.


| Field | Value |
| --- | --- |
| Document Title | Solution Architecture Document — Customer API Platform |
| Application / Solution Name | Customer API Platform (CAP) |
| Application ID | APP-0472 |
| Author(s) | Fred Bloggs (Lead Solution Architect) |
| Owner | Fred Bloggs |
| Version | 2.1 |
| Status | Approved |
| Created Date | 2024-09-15 |
| Last Updated | 2025-11-20 |
| Classification | Internal — Restricted |

| Version | Date | Author / Editor | Description of Change |
| --- | --- | --- | --- |
| 0.1 | 2024-09-15 | Fred Bloggs | Initial draft with executive summary and logical view |
| 0.2 | 2024-09-28 | Fred Bloggs | Added physical view, data view, security view |
| 0.3 | 2024-10-10 | Joe Bloggs | Security review feedback incorporated |
| 0.5 | 2024-10-22 | Fred Bloggs, Jane Doe | Added quality attributes, governance, lifecycle |
| 1.0 | 2024-11-05 | Fred Bloggs | First approved version following ARB review |
| 1.1 | 2025-01-15 | Fred Bloggs | Updated cost model following reserved instance purchase |
| 1.2 | 2025-03-20 | Fred Bloggs | Added fraud detection integration (Phase 2) |
| 2.0 | 2025-08-01 | Fred Bloggs | Major revision: EKS upgrade to 1.29, Graviton migration, updated capacity projections |
| 2.1 | 2025-11-20 | Fred Bloggs | Updated DR testing results, refreshed cost analysis |

| Name | Role | Contribution Type |
| --- | --- | --- |
| Fred Bloggs | Lead Solution Architect | Author |
| Joe Bloggs | Principal Security Architect | Reviewer |
| Jane Doe | Data Architect | Reviewer |
| Tom Bloggs | SRE Lead | Reviewer |
| Dr. Helen Zhao | CTO | Approver |
| Marcus Doe | CISO | Approver |
| Alice Doe | Head of Compliance | Approver |
| Dave Bloggs | ARB Chair | Approver |

This SAD describes the architecture of the Customer API Platform (CAP), Meridian Financial Services’ Open Banking and partner API solution. It replaces the legacy SOAP-based Partner Integration Layer (PIL) and provides secure, high-performance RESTful APIs exposing account information and transaction data to authorised third-party providers (TPPs) and partner fintech applications.

In scope:

  • API Gateway and all microservices (Account, Transaction, Auth, Notification)
  • AWS infrastructure across all environments (dev, test, staging, production, DR)
  • Integration with core banking system, fraud detection, and notification services
  • Security architecture including OAuth 2.0, mTLS, and encryption
  • Operational tooling (monitoring, alerting, logging, tracing)

Out of scope:

  • Core banking system internals (documented in SAD APP-0102)
  • Mobile banking application (documented in SAD APP-0389)
  • Partner onboarding business processes (documented in OPS-0055)
  • Detailed API specification (maintained in Swagger/OpenAPI on internal developer portal)

Related documents:

  • Core Banking Modernisation SAD (APP-0102)
  • MFS Information Security Policy (SEC-POL-001)
  • Open Banking Implementation Plan (PROG-0088)

The Customer API Platform (CAP) is a cloud-native, microservices-based REST API platform that exposes account information and transaction data to authorised partner fintech applications and third-party providers. It is Meridian Financial Services’ primary channel for Open Banking compliance and strategic partner integrations.

CAP replaces the legacy SOAP-based Partner Integration Layer, which suffered from poor scalability, high latency, and an inability to meet the performance and security requirements of the UK Open Banking standard. The new platform is built on AWS using containerised microservices orchestrated by Amazon EKS, fronted by AWS API Gateway, and secured with OAuth 2.0 and mutual TLS.

| Driver | Description | Priority |
| --- | --- | --- |
| Regulatory compliance (PSD2 / Open Banking) | UK Competition and Markets Authority (CMA) mandate to provide open APIs for account information and payment initiation to authorised TPPs | High |
| Legacy platform end-of-life | The existing SOAP-based Partner Integration Layer is on unsupported middleware (Oracle SOA Suite 11g) with known security vulnerabilities | High |
| Partner ecosystem growth | Strategic initiative to onboard 25+ fintech partners over the next 18 months, requiring a modern, scalable API platform | High |
| Operational cost reduction | Current platform requires 3 FTEs for manual operational support; target is to reduce to 1 FTE with automation | Medium |
| Developer experience | Partner developers report a 4-week average onboarding time with the SOAP platform; target is under 3 days with self-service APIs | Medium |

| Question | Response |
| --- | --- |
| Which organisational strategy or initiative does this solution support? | MFS Digital Transformation Programme (DTP-2024), specifically Workstream 3: Open Banking & Partner Ecosystem |
| Has this solution been reviewed against the organisation’s capability model? | Yes — mapped to API Management, Identity & Access Management, and Data Integration capabilities |
| Does this solution duplicate any existing capability? | No — replaces the legacy Partner Integration Layer (PIL), which will be decommissioned |

| Capability | Shared Service / Platform | Reused? | Justification (if not reused) |
| --- | --- | --- | --- |
| Identity & Access (Internal) | Okta (corporate SSO) | Yes | Used for internal admin and developer portal access |
| Identity & Access (External) | ForgeRock Identity Gateway | No | Does not support the financial-grade OAuth 2.0 profile (FAPI) required for Open Banking; using AWS API Gateway with custom authoriser |
| API Management | AWS API Gateway | Yes | Corporate-approved API management platform |
| Monitoring & Logging | Splunk Enterprise | Yes | Corporate SIEM and log aggregation platform |
| CI/CD | GitHub Actions | Yes | Corporate standard CI/CD platform |
| Messaging / Notifications | Amazon SES | Yes | Corporate-approved email notification service |
| Container Platform | Amazon EKS | Yes | Corporate-approved container orchestration platform |

In scope:

  • Customer API Platform microservices: API Gateway configuration, Account Service, Transaction Service, Auth Service, Notification Service
  • AWS infrastructure: EKS cluster, RDS PostgreSQL, ElastiCache Redis, S3, CloudFront, WAF, Shield
  • All environments: development, test, staging, production, DR
  • Integration with core banking (Oracle DB via Direct Connect), fraud detection (Featurespace ARIC), notification service (SES)
  • Partner authentication and authorisation (OAuth 2.0, mTLS)
  • Internal authentication (Okta SSO)
  • Operational tooling: Splunk, Grafana, PagerDuty, Jaeger

Out of scope:

  • Core banking system modifications (separate project PROJ-0102)
  • Partner onboarding portal front-end (separate project PROJ-0115)
  • Payment initiation APIs (Phase 3, planned for 2026-Q2)
  • Mobile banking app integration (separate SAD APP-0389)

The current Partner Integration Layer (PIL) was built in 2016 on Oracle SOA Suite 11g, hosted on-premises in MFS’ Slough data centre. It provides SOAP/XML interfaces to 8 existing partner integrations.

Key limitations:

  • Performance: Average response time of 1.2 seconds (P95: 3.8 seconds), far exceeding the Open Banking 1-second mandate
  • Scalability: Vertically scaled on two physical servers; cannot handle projected 5,000 req/s demand
  • Security: Does not support OAuth 2.0 or mTLS as required by Open Banking security profile
  • Supportability: Oracle SOA Suite 11g reached end-of-support in 2022; two critical CVEs remain unpatched
  • Cost: Annual licensing and support costs of GBP 280,000 plus 3 FTEs for manual operations
  • Onboarding: Partner onboarding requires 4 weeks of manual configuration and testing

  • What is being retained: Core banking Oracle database (read replicas will be consumed via new integration layer)
  • What is being replaced: Oracle SOA Suite middleware, SOAP/XML interfaces, on-premises hosting
  • What is being decommissioned: PIL application servers (post 6-month parallel-run period)

| Decision / Constraint | Rationale | Impact |
| --- | --- | --- |
| AWS as hosting platform | Corporate cloud-first strategy mandates AWS; existing enterprise agreement | All infrastructure on AWS |
| EKS for container orchestration | Existing team skills in Kubernetes; corporate-approved platform | Microservices deployed as Kubernetes pods |
| PostgreSQL over DynamoDB | Relational data model for financial data; strong consistency requirements; team expertise | RDS PostgreSQL for Account and Transaction data |
| Event-driven notification pattern | Decouple notification logic from core API processing; support multiple channels | Amazon EventBridge + SQS for async notifications |
| Data must remain in UK | FCA and data sovereignty requirements | eu-west-2 (London) primary; eu-west-1 (Ireland) DR only for non-PII data |

| Field | Value |
| --- | --- |
| Project Name | Customer API Platform (Open Banking) |
| Project Code / ID | PROJ-0098 |
| Project Manager | Nelly Bloggs |
| Estimated Solution Cost (Capex) | GBP 1,200,000 |
| Estimated Solution Cost (Opex) | GBP 384,000 per annum |
| Target Go-Live Date | 2025-03-01 (Phase 1 — achieved); 2025-09-01 (Phase 2 — achieved) |

Selected criticality: Tier 1: Critical

Justification: The Customer API Platform is a regulatory obligation under PSD2/Open Banking. Service failure would result in:

  • Breach of CMA Open Banking mandate, with potential regulatory fines
  • Disruption to 25+ partner fintech applications serving over 200,000 end customers
  • Reputational damage to MFS’ position as a trusted Open Banking provider
  • Revenue loss from partner transaction fees (estimated GBP 45,000 per hour of downtime)

| Stakeholder | Role / Group | Key Concerns | Relevant Views |
| --- | --- | --- | --- |
| Dr. Helen Zhao | CTO | Strategic alignment, technology direction, cost justification | Executive Summary, Cost |
| Marcus Doe | CISO | Threat model, data protection, PCI-DSS compliance, incident response | Security View, Governance |
| Alice Doe | Head of Compliance | Open Banking compliance, FCA regulations, audit trail, data sovereignty | Security View, Data View, Governance |
| Fred Bloggs | Lead Solution Architect | Design integrity, standards compliance, technical debt, scalability | All views |
| Joe Bloggs | Principal Security Architect | Authentication, encryption, network security, penetration testing | Security View, Physical View |
| Jane Doe | Data Architect | Data classification, PII handling, data sovereignty, retention | Data View |
| Tom Bloggs | SRE Lead | Observability, incident response, reliability, on-call | Quality Attributes, Lifecycle |
| Amir Doe | Development Lead | Component design, API contracts, CI/CD, developer experience | Logical View, Integration & Data Flow, Lifecycle |
| Nelly Bloggs | Project Manager | Delivery timeline, cost, dependencies, risks | Executive Summary, Governance |
| Sally Doe | Partner Manager | Partner onboarding experience, API availability, SLA commitments | Integration & Data Flow, Reliability |
| External API consumers | Partner fintech developers | API documentation, latency, uptime, versioning, error handling | Integration & Data Flow, Performance |
| Dave Bloggs | ARB Chair | Architecture standards compliance, reuse assessment, governance | All views |
| Finance team | Finance & Procurement | Cost forecasting, reserved instance optimisation, budget adherence | Cost Optimisation |

| Concern | Stakeholder(s) | Addressed In |
| --- | --- | --- |
| Regulatory compliance (PSD2, Open Banking) | Alice Doe, Dr. Helen Zhao | 1. Executive Summary, 2.3 Compliance, 3.5 Security View, 6. Governance |
| Data protection and PII handling | Marcus Doe, Jane Doe | 3.4 Data View, 3.5 Security View |
| Platform availability and SLA | Tom Bloggs, Sally Doe, External API consumers | 4.2 Reliability, 5.5 Operations & Support |
| API performance and latency | Amir Doe, External API consumers | 4.3 Performance, 3.2 Integration & Data Flow |
| Cost-effectiveness and budget | Dr. Helen Zhao, Finance team, Nelly Bloggs | 4.4 Cost Optimisation |
| Security posture and threat mitigation | Marcus Doe, Joe Bloggs | 3.5 Security View |
| Partner onboarding and developer experience | Sally Doe, External API consumers | 3.2 Integration & Data Flow, 3.6 Scenarios |
| Operational supportability | Tom Bloggs | 4.1 Operational Excellence, 5.5 Operations & Support |
| Scalability for growth | Fred Bloggs, Dr. Helen Zhao | 4.2 Reliability, 4.3 Performance, 3.3 Physical View |
| Migration from legacy PIL | Nelly Bloggs, Amir Doe | 1.5 Current State, 5.2 Service Transition |
| Vendor lock-in | Fred Bloggs, Dave Bloggs | 3.1 Logical View, 5.10 Exit Planning |

| Regulation / Standard | Applicability | Impact on Design |
| --- | --- | --- |
| PSD2 / Open Banking (UK) | Mandatory — MFS is a CMA-designated bank | Must provide Open Banking APIs conforming to OBIE specifications; strong customer authentication (SCA) required |
| PCI-DSS v4.0 | Applicable — platform processes cardholder transaction data | Network segmentation, encryption, access controls, audit logging, vulnerability management |
| UK GDPR / Data Protection Act 2018 | Applicable — platform processes customer PII | Data minimisation, right to erasure support, DPIA completed, lawful basis documented |
| FCA SYSC 13 (Operational Resilience) | Applicable — platform supports important business service | RTO/RPO targets, impact tolerance testing, scenario-based resilience testing |
| ISO 27001 | MFS is certified; platform must conform | Information security controls, risk assessment, access management |

  • Yes — the platform supports PSD2-regulated account information services (AIS) provided to authorised third-party providers.

| Standard | Version | Applicability |
| --- | --- | --- |
| MFS Information Security Policy | 4.2 | All sections — security controls, access management |
| MFS Data Classification Standard | 2.0 | Data View — classification of all data stores |
| OBIE API Specification | 3.1.11 | Integration & Data Flow View — API contracts |
| MFS Cloud Security Standard | 1.3 | Physical View, Security View — AWS security controls |
| NIST Cybersecurity Framework | 2.0 | Security View — threat model and controls mapping |

```mermaid
graph TD
  Partners[Partner Apps] --> APIGW[API Gateway]
  Admins[Internal Admins] --> APIGW
  APIGW --> AuthSvc[Auth Service]
  APIGW --> AcctSvc[Account Service]
  APIGW --> TxnSvc[Transaction Service]
  AcctSvc --> RDS[RDS PostgreSQL]
  TxnSvc --> RDS
  AcctSvc --> Redis[ElastiCache Redis]
  TxnSvc --> Redis
  AcctSvc --> EB[EventBridge]
  TxnSvc --> EB
  EB --> NotifSvc[Notification Service]
  NotifSvc --> SES[Amazon SES]
  NotifSvc --> SNS[Amazon SNS]
  AcctSvc -- Direct Connect --> CoreBank[Core Banking Oracle DB]
  TxnSvc -- API --> Fraud[Fraud Detection]
```

| Component | Type | Description | Technology | Owner |
| --- | --- | --- | --- | --- |
| API Gateway | Managed Service | Entry point for all external API requests; handles rate limiting, request validation, API key management, and request routing | AWS API Gateway (REST) | Platform Team |
| Auth Service | Microservice | Handles OAuth 2.0 token issuance, mTLS validation, scope enforcement, and consent management for TPPs | Java 21 (Spring Boot 3.3) on EKS | API Team |
| Account Service | Microservice | Provides account information endpoints (balances, details, standing orders, direct debits) conforming to OBIE spec | Java 21 (Spring Boot 3.3) on EKS | API Team |
| Transaction Service | Microservice | Provides transaction history endpoints with filtering, pagination, and enrichment | Java 21 (Spring Boot 3.3) on EKS | API Team |
| Notification Service | Microservice | Processes event-driven notifications to partners (webhooks) and internal teams (email, Slack) | Node.js 20 (Express) on EKS | API Team |
| PostgreSQL (Accounts DB) | Database | Stores account metadata, consent records, and partner registration data | Amazon RDS PostgreSQL 16 (Multi-AZ) | DBA Team |
| PostgreSQL (Transactions DB) | Database | Stores transaction data replicated from core banking, plus API audit records | Amazon RDS PostgreSQL 16 (Multi-AZ) | DBA Team |
| Redis Cache | Cache | Caches frequently accessed account data and rate limiting state; reduces load on core banking | Amazon ElastiCache Redis 7.x (cluster mode) | Platform Team |
| Event Bus | Messaging | Decouples notification and audit event processing from synchronous API flows | Amazon EventBridge + SQS | Platform Team |
| Audit Log Store | Object Storage | Long-term storage of API audit logs for compliance (7-year retention) | Amazon S3 (Glacier Deep Archive for aged data) | Platform Team |
| Core Banking Adapter | Integration Component | Reads from core banking Oracle DB read replicas via JDBC; transforms data to platform domain model | Java 21 library within Account/Transaction Services | API Team |

| Service ID | Service Name | Capability ID | Capability Name |
| --- | --- | --- | --- |
| SVC-001 | Account Information Service | CAP-AIS | Open Banking Account Information |
| SVC-002 | Transaction History Service | CAP-TXN | Transaction Data Retrieval |
| SVC-003 | Partner Authentication | CAP-AUTH | TPP Authentication & Consent |
| SVC-004 | Event Notification | CAP-NOTIFY | Partner Webhook Notifications |

| Application Name | Application ID | Impact Type | Change Details | Comments |
| --- | --- | --- | --- | --- |
| Core Banking System | APP-0102 | Use | Read-only access to Oracle DB read replicas via Direct Connect | No changes to core banking; new read replica provisioned |
| Fraud Detection (Featurespace ARIC) | APP-0310 | Use | Consume fraud scoring API for high-value transaction requests | Existing API; new integration client |
| Partner Onboarding Portal | APP-0456 | Create | New web portal for partner self-service registration and API key management | Dependent on CAP Auth Service APIs |
| Legacy Partner Integration Layer | APP-0198 | Decommission | Will be retired after 6-month parallel run | Migration of 8 existing partners required |
| Corporate Splunk Instance | APP-0067 | Use | All logs and security events forwarded to Splunk | Existing HEC endpoints used |
| PagerDuty | APP-0089 | Use | Alerting integration for P1/P2 incidents | Existing service; new integration configured |

| Pattern | Where Applied | Rationale |
| --- | --- | --- |
| API Gateway | AWS API Gateway fronting all services | Centralised rate limiting, authentication, request validation, and API versioning; decouples clients from internal service topology |
| Microservices | Account, Transaction, Auth, Notification Services | Independent scaling, deployment, and failure isolation for services with different performance profiles |
| Event-Driven Architecture | Notification Service, audit logging | Decouples async processing (webhooks, emails, audit writes) from synchronous API response path; improves P95 latency |
| CQRS (partial) | Transaction Service | Read-optimised query model populated from core banking CDC stream; separates read path from authoritative write path in core banking |
| Circuit Breaker | Core Banking Adapter, Fraud Detection client | Prevents cascade failures when downstream dependencies are degraded; implemented via Resilience4j (see the sketch after this table) |
| Strangler Fig | Migration from legacy PIL | Gradual migration of partner traffic from SOAP to REST APIs using API Gateway routing rules |
| Sidecar | Envoy proxy on each pod | Consistent mTLS termination, observability, and traffic management across all services |
| Cache-Aside | Account Service with Redis | Reduces latency and load on core banking for frequently accessed account data (balance lookups) |
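
To make the Circuit Breaker row concrete, here is a minimal Resilience4j sketch of how the Core Banking Adapter could wrap its read-replica calls. The thresholds, the `queryReadReplica` helper, and the exception handling are illustrative assumptions rather than the platform's actual configuration.

```java
import io.github.resilience4j.circuitbreaker.CallNotPermittedException;
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import io.github.resilience4j.circuitbreaker.CircuitBreakerRegistry;

import java.time.Duration;

public class CoreBankingClient {

    private final CircuitBreaker breaker;

    public CoreBankingClient() {
        // Hypothetical thresholds; the real values live in service configuration.
        CircuitBreakerConfig config = CircuitBreakerConfig.custom()
                .failureRateThreshold(50)                         // open once 50% of recent calls fail
                .slowCallDurationThreshold(Duration.ofSeconds(2)) // treat slow JDBC calls as failures
                .waitDurationInOpenState(Duration.ofSeconds(30))  // probe the replica again after 30s
                .slidingWindowSize(20)
                .build();
        this.breaker = CircuitBreakerRegistry.of(config).circuitBreaker("coreBanking");
    }

    public String fetchBalance(String accountId) {
        try {
            // Hypothetical JDBC lookup against the Oracle read replica.
            return breaker.executeSupplier(() -> queryReadReplica(accountId));
        } catch (CallNotPermittedException e) {
            // Breaker is open: fail fast rather than queueing threads behind a degraded dependency.
            throw new IllegalStateException("Core banking temporarily unavailable", e);
        }
    }

    private String queryReadReplica(String accountId) {
        throw new UnsupportedOperationException("placeholder for the real JDBC call");
    }
}
```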

3.1.6 Technology & Vendor Lock-in Assessment

| Component / Service | Vendor / Technology | Lock-in Level | Mitigation | Portability Notes |
| --- | --- | --- | --- | --- |
| API Gateway | AWS API Gateway | Moderate | OpenAPI specs are portable; routing logic is declarative | Could migrate to Kong or Apigee with moderate effort; API contracts remain unchanged |
| EKS | AWS (Kubernetes) | Low | Standard Kubernetes manifests; Helm charts used | Portable to any Kubernetes cluster (AKS, GKE, self-hosted) |
| RDS PostgreSQL | AWS (PostgreSQL) | Low | Standard PostgreSQL; no AWS-specific extensions used | Portable to any PostgreSQL host; pg_dump for migration |
| ElastiCache Redis | AWS (Redis) | Low | Standard Redis protocol | Portable to any Redis deployment |
| EventBridge | AWS EventBridge | Moderate | Event schema documented in JSON Schema; consumers use SQS | Would require replacement with another event bus (e.g., Azure Event Grid, Kafka) |
| S3 | AWS S3 | Low | Standard object storage API | Portable to any S3-compatible storage (MinIO, Azure Blob with S3 gateway) |
| IAM & KMS | AWS IAM / KMS | High | Core to security architecture; deeply integrated | Would require significant re-engineering for alternative cloud; mitigated by Terraform IaC |

| Question | Response |
| --- | --- |
| Caching to avoid recomputation / repeated downstream calls | Yes — ElastiCache Redis used for session state, partner JWT verification keys, and short-lived rate-limiter counters; ~85% cache hit rate on partner authentication, eliminating ~12M Cognito calls per month |
| Batch processes consolidated rather than continuously polling | Yes — transaction enrichment runs as nightly batch (00:30 UTC) rather than per-event; Featurespace ARIC fraud signals consumed via webhook (push) rather than polling |
| Async / event-driven patterns to flatten peak load | Yes — EventBridge + SQS for transaction events, partner notifications, and audit log shipping; consumer pods scale on queue depth via Karpenter, releasing capacity when idle |
| Heavy framework choices weighed against lighter alternatives | Considered — Spring Boot retained for the core API (existing team skill, mature ecosystem); Lambda evaluated and rejected for synchronous APIs (cold-start latency would breach P95 < 200ms SLA) |

```mermaid
graph LR
  Partner[Partner App] -- TLS 1.3 + mTLS --> APIGW[API Gateway]
  APIGW --> Auth[Auth Service]
  Auth --> Svc[Account/Transaction Service]
  Svc --> Adapter[Core Banking Adapter]
  Adapter -- Direct Connect --> Oracle[Oracle DB Replica]
  Svc -- async --> EB[EventBridge]
  EB --> SQS[SQS]
  SQS --> Notif[Notification Service]
  Notif --> SES[SES / Webhooks]
  EB -- audit --> AuditQ[SQS audit]
  AuditQ --> S3[S3 audit logs]
```

| Source Component | Destination Component | Protocol / Encryption | Authentication Method | Purpose |
| --- | --- | --- | --- | --- |
| API Gateway | Auth Service | HTTPS / TLS 1.3 | IAM (service-to-service) | Token validation and scope checking |
| API Gateway | Account Service | HTTPS / TLS 1.3 | IAM (service-to-service) | Route authenticated account requests |
| API Gateway | Transaction Service | HTTPS / TLS 1.3 | IAM (service-to-service) | Route authenticated transaction requests |
| Account Service | PostgreSQL (Accounts DB) | JDBC / TLS 1.3 | IAM DB authentication | Read/write account metadata and consent records |
| Transaction Service | PostgreSQL (Transactions DB) | JDBC / TLS 1.3 | IAM DB authentication | Read transaction data and audit records |
| Account Service | ElastiCache Redis | Redis protocol / TLS 1.3 | AUTH token (rotated via Secrets Manager) | Cache-aside for account balance lookups |
| Account Service | Core Banking Adapter | In-process (library) | N/A | Transform and proxy core banking data |
| Transaction Service | Core Banking Adapter | In-process (library) | N/A | Transform and proxy core banking data |
| Account Service | EventBridge | HTTPS / TLS 1.3 | IAM | Publish audit and notification events |
| Transaction Service | EventBridge | HTTPS / TLS 1.3 | IAM | Publish audit and notification events |
| EventBridge | SQS (Notification Queue) | AWS internal / encrypted | IAM | Route notification events to processing queue |
| EventBridge | SQS (Audit Queue) | AWS internal / encrypted | IAM | Route audit events to audit processing |
| Notification Service | SQS (Notification Queue) | HTTPS / TLS 1.3 | IAM | Consume notification events |
| Notification Service | SES | HTTPS / TLS 1.3 | IAM | Send email notifications |

| Source Application | Destination Application | Protocol / Encryption | Authentication | Security Proxy | Purpose |
| --- | --- | --- | --- | --- | --- |
| Partner fintech apps | CAP API Gateway | HTTPS / TLS 1.3 + mTLS | OAuth 2.0 (FAPI profile) | AWS WAF, Shield Advanced | Account and transaction API requests |
| CAP (Core Banking Adapter) | Core Banking Oracle DB | JDBC / TLS 1.2 | Oracle DB credentials (Secrets Manager) | N/A (Direct Connect private link) | Read account and transaction data from read replicas |
| CAP (Transaction Service) | Featurespace ARIC | HTTPS / TLS 1.3 | API key + IP allowlist | NAT Gateway (fixed IP) | Fraud score requests for high-value transactions |
| CAP (Notification Service) | Partner webhook endpoints | HTTPS / TLS 1.3 | HMAC-SHA256 signed payloads (see the signing sketch below) | NAT Gateway | Event notifications to partners |
| Internal administrators | CAP admin APIs | HTTPS / TLS 1.3 | Okta SSO (OIDC) | Corporate VPN | Partner management, configuration, monitoring |
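
The HMAC-SHA256 webhook signing referenced above can be sketched with the JDK's standard crypto APIs. The header name and hex encoding are assumptions for illustration; the real contract is defined in the partner integration guide.

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

public final class WebhookSigner {

    // Hypothetical header name; the partner integration guide defines the real one.
    public static final String SIGNATURE_HEADER = "X-MFS-Signature";

    /** Computes the signature the Notification Service would attach to an outbound webhook. */
    public static String sign(String payload, byte[] sharedSecret) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(sharedSecret, "HmacSHA256"));
        return HexFormat.of().formatHex(mac.doFinal(payload.getBytes(StandardCharsets.UTF_8)));
    }

    /** Partner-side check: recompute and compare in constant time to resist timing attacks. */
    public static boolean verify(String payload, byte[] sharedSecret, String receivedHex) throws Exception {
        byte[] expected = HexFormat.of().parseHex(sign(payload, sharedSecret));
        return MessageDigest.isEqual(expected, HexFormat.of().parseHex(receivedHex));
    }
}
```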

| User Type | Access Method | Authentication | Protocol |
| --- | --- | --- | --- |
| Partner fintech applications | REST API | OAuth 2.0 (client credentials with FAPI profile) + mTLS | HTTPS / TLS 1.3 |
| Internal administrators | Web portal (React SPA) via corporate network | Okta SSO (OIDC) with MFA | HTTPS / TLS 1.3 |
| SRE / Operations | kubectl, AWS Console, Grafana dashboards via VPN | Okta SSO + AWS IAM Identity Centre | HTTPS / TLS 1.3, SSH (bastion) |

| API / Interface | Type | Direction | Format | Version | Documentation |
| --- | --- | --- | --- | --- | --- |
| Account Information API | REST | Exposed | JSON | v3.1 (OBIE compliant) | Internal developer portal (Swagger) |
| Transaction History API | REST | Exposed | JSON | v3.1 (OBIE compliant) | Internal developer portal (Swagger) |
| Consent Management API | REST | Exposed | JSON | v1.0 (internal) | Internal developer portal (Swagger) |
| Partner Webhook Notifications | REST (callback) | Exposed (outbound) | JSON | v1.0 (internal) | Partner integration guide |
| Core Banking Data API | JDBC | Consumed | SQL/ResultSet | N/A | DBA team wiki |
| Featurespace ARIC Fraud API | REST | Consumed | JSON | v2.4 | Featurespace developer docs |
| Splunk HTTP Event Collector | REST | Consumed | JSON | N/A | Splunk docs |
| PagerDuty Events API | REST | Consumed | JSON | v2 | PagerDuty docs |

```mermaid
graph TD
  R53[Route 53] --> WAF[AWS WAF + Shield]
  WAF --> CF[CloudFront]
  CF --> APIGW[API Gateway]
  subgraph Primary[eu-west-2 London - 2 AZs]
      subgraph Public[Public Subnets]
          NLB[NLB]
          NAT[NAT Gateways]
      end
      subgraph Private[Private Subnets]
          EKS[EKS Node Groups]
          RDS[RDS PostgreSQL Multi-AZ]
          ElastiCache[ElastiCache Redis]
      end
      subgraph Isolated[Isolated Subnets]
          DX[Direct Connect Gateway]
      end
  end
  APIGW --> NLB
  NLB --> EKS
  EKS --> RDS
  EKS --> ElastiCache
  DX -- to on-premises --> OnPrem[Core Banking]
  subgraph DR[eu-west-1 DR Pilot Light]
      RDSDR[RDS Replica]
  end
  RDS -- cross-region replica --> RDSDR
```

| Attribute | Selection |
| --- | --- |
| Hosting Venue Type | Cloud (primary) with on-premises connectivity (core banking) |
| Hosting Region(s) | UK (eu-west-2 London — primary), Ireland (eu-west-1 — DR) |
| Service Model | PaaS (EKS, RDS, ElastiCache) and SaaS (API Gateway, EventBridge, S3) |
| Cloud Provider | AWS |
| Account / Subscription Type | MFS Production AWS Organisation — Workload Account (cap-prod-001) |

| Attribute | Detail |
| --- | --- |
| Container Platform | Amazon EKS 1.29 |
| Base Image(s) | amazoncorretto:21-alpine (Java services), node:20-alpine (Notification Service) |
| Cluster Size | 3 managed node groups: system (3 nodes), application (6-12 nodes, auto-scaling), monitoring (2 nodes) |
| Node Instance Type | m7g.xlarge (Graviton3, 4 vCPU, 16 GB RAM) for application nodes; m7g.large for system and monitoring |
| Pod Resource Limits | Account Service: 1 vCPU / 2 GB RAM; Transaction Service: 1.5 vCPU / 3 GB RAM; Auth Service: 0.5 vCPU / 1 GB RAM; Notification Service: 0.5 vCPU / 1 GB RAM |
| Pod Replicas (Production) | Account Service: 4-8 (HPA); Transaction Service: 4-10 (HPA); Auth Service: 3-6 (HPA); Notification Service: 2-4 (HPA) |

  • Anti-Malware — Amazon GuardDuty (runtime monitoring on EKS)
  • Endpoint Detection and Response (EDR) — CrowdStrike Falcon sensor on EKS nodes
  • Vulnerability Management — Amazon Inspector (continuous scanning of container images and EKS nodes)
  • Other: AWS Systems Manager Agent for patching and compliance

| Question | Response |
| --- | --- |
| Is this an Internet-facing application? | Yes — API Gateway is Internet-facing for partner access |
| Outbound Internet connectivity required? | Yes — for partner webhook delivery and Featurespace ARIC API calls (via NAT Gateway with fixed Elastic IPs) |
| Cloud-to-on-premises connectivity required? | Yes — AWS Direct Connect (1 Gbps dedicated, with VPN backup) to MFS Slough data centre for core banking Oracle DB access |
| Wireless networking required? | No |
| Third-party / co-location connectivity required? | No — third-party integrations (Featurespace) are over public Internet via TLS |
| Cloud network peering required? | Yes — VPC peering to MFS Shared Services VPC (for Splunk forwarding, Okta agent) |

| Attribute | Selection |
| --- | --- |
| User access method | API (partner applications), Web (HTTPS) for internal admin portal |
| User locations | End-customers (Internet, globally), Internal (UK offices, remote VPN) |
| Administrator access method | Bastion Host (SSH), AWS Console (via IAM Identity Centre), kubectl (via EKS OIDC) |
| VPN required | Yes — for administrator access only (Cisco AnyConnect corporate VPN) |
| Direct Connect / ExpressRoute | Yes — AWS Direct Connect 1 Gbps to Slough data centre |

| Protocol | Used? | Purpose |
| --- | --- | --- |
| HTTPS (TLS 1.2+) | Yes | All API traffic (TLS 1.3 enforced where possible; TLS 1.2 minimum for legacy partners) |
| SFTP | No | N/A |
| ODBC / JDBC | Yes | Core banking Oracle DB connectivity via JDBC over TLS |
| TCP (other) | Yes | Redis protocol (port 6379) within VPC, encrypted in transit |
| gRPC | No | N/A |
| WebSocket | No | N/A |

| Metric | Value |
| --- | --- |
| Peak egress bandwidth to Internet | 500 Mb/s |
| Peak ingress bandwidth from Internet | 200 Mb/s |
| Peak bandwidth between on-prem and cloud | 800 Mb/s (over 1 Gbps Direct Connect) |
| Traffic characteristics | Burst — significant peaks during business hours (08:00-18:00 UK), month-end, and salary payment dates |
| QoS requirements | API responses must not be queued or throttled below SLA thresholds |
| Network performance expectations | < 5ms latency within VPC; < 10ms to core banking via Direct Connect |

| Control | Implemented | Detail |
| --- | --- | --- |
| DDoS Protection | Yes | AWS Shield Advanced on API Gateway, NLB, and CloudFront |
| Rate Limiting | Yes | API Gateway: 100 req/s per partner (burst 200); global: 5,000 req/s |
| Source IP Restrictions | Yes | mTLS required for all API access; optional IP allowlisting for partners who request it |
| Web Application Firewall (WAF) | Yes | AWS WAF v2 with OWASP Top 10 managed rule group, rate-based rules, SQL injection and XSS rules |
| Client Verification Controls | Yes | mTLS with partner-specific client certificates; FAPI-compliant OAuth 2.0 |
| File Upload Protection | No | API does not accept file uploads |

| Environment | Description | Count & Venue | Compute Solution |
| --- | --- | --- | --- |
| Development | Developer workstations and shared dev cluster | 1x AWS (eu-west-2) | EKS (2 nodes, m7g.large), RDS db.t4g.medium |
| Test / QA | Automated integration and contract testing | 1x AWS (eu-west-2) | EKS (3 nodes, m7g.large), RDS db.t4g.large |
| Staging / Pre-Production | Production-mirror for release validation and performance testing | 1x AWS (eu-west-2) | EKS (4 nodes, m7g.xlarge), RDS db.r7g.large |
| Production | Live service environment | 1x AWS (eu-west-2), Multi-AZ | EKS (6-12 nodes, m7g.xlarge), RDS db.r7g.xlarge Multi-AZ |
| DR | Disaster recovery (pilot light) | 1x AWS (eu-west-1) | EKS (2 nodes, scaled up during failover), RDS read replica |

  • No — production and non-production environments are in separate AWS accounts with no direct connectivity. Data flows between environments only through the CI/CD pipeline (GitHub Actions deploying to each environment in sequence).

Partner applications access the API programmatically; there are no end-user compute or BYOD requirements. Internal administrators use standard corporate Windows 11 laptops via VPN.

Not applicable — no IoT devices are part of this solution.

| Question | Response |
| --- | --- |
| Hosting regions chosen for low carbon intensity | eu-west-2 (London) chosen primarily for UK data residency. AWS London is on track for 100% renewable energy matching by 2025 (AWS commitment). DR region eu-west-1 (Ireland) operates at lower carbon intensity than the AWS European average. |
| Non-production environments auto-shutdown out of hours | Yes — dev and staging EKS clusters scale to zero application pods 19:00-07:00 weekdays and all weekend (system pods remain). Non-prod RDS instances paused on the same schedule. Estimated saving: 62% of non-prod compute and 55% of non-prod RDS spend. |
| Compute family chosen for performance-per-watt | Yes — Graviton3 (c7g.xlarge / m7g.xlarge) throughout. AWS published data shows ~60% better performance-per-watt vs equivalent x86 m6i; Graviton3 was the dominant factor in the 2025-Q3 cost reduction. |
| Auto-scaling configured to release capacity when idle | Yes — Karpenter consolidates underutilised nodes within 5 minutes of becoming idle; HPA scales pods on CPU and queue depth; idle workloads return resources to the pool rather than being held. |
| DR strategy proportionate to recovery objective | Warm standby in eu-west-1 (RDS read replica + S3 cross-region replication; EKS cluster scaled to minimum). Hot active-active was considered and rejected: would have doubled compute footprint for an RTO improvement (4h -> 1h) that the business sponsor confirmed was unnecessary. |

| Data Name | Store Technology | Authoritative? | Retention Period | Data Size | Classification | Personal Data? | Encryption Level | Key Management |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Account metadata | RDS PostgreSQL 16 | No (core banking is authoritative) | Refreshed daily; 90-day history | 50 GB | Restricted | Yes (name, sort code, account number) | Application (column-level for PII) + Storage (AES-256) | AWS KMS (CMK with auto-rotation) |
| Transaction data | RDS PostgreSQL 16 | No (core banking is authoritative) | 2 years rolling | 500 GB (growing 15 GB/month) | Restricted | Yes (payee names, transaction descriptions) | Application (column-level for PII) + Storage (AES-256) | AWS KMS (CMK with auto-rotation) |
| Consent records | RDS PostgreSQL 16 | Yes | 7 years from consent expiry | 10 GB | Restricted | Yes (customer ID, TPP ID, consent scope) | Application + Storage (AES-256) | AWS KMS (CMK with auto-rotation) |
| Partner registration data | RDS PostgreSQL 16 | Yes | Life of partner + 3 years | 1 GB | Internal | No (organisation data only) | Storage (AES-256) | AWS KMS (CMK with auto-rotation) |
| Cached account balances | ElastiCache Redis 7.x | No (cache, not authoritative) | TTL: 60 seconds | 2 GB (in-memory) | Restricted | Yes (account balances) | In-transit (TLS) + At-rest (encryption enabled) | AWS KMS (ElastiCache-managed) |
| API audit logs | S3 (Standard, then Glacier) | Yes | 7 years | 200 GB/year | Restricted | Yes (customer IDs in request context) | Storage (SSE-S3 with bucket key) | AWS-managed keys (SSE-S3) |
| Application logs | S3 via Fluent Bit | No (copy, forwarded to Splunk) | 90 days in S3; 1 year in Splunk | 50 GB/year | Internal | No (PII redacted in logging framework) | Storage (SSE-S3) | AWS-managed keys |
| EKS cluster metrics | Amazon Managed Prometheus | No | 90 days | 20 GB | Internal | No | Storage (AWS-managed) | AWS-managed keys |

| Attribute | Detail |
| --- | --- |
| Storage Product | Amazon RDS (PostgreSQL), Amazon S3, Amazon ElastiCache |
| Storage Size | RDS: 1 TB provisioned IOPS (gp3); S3: estimated 2 TB over 7 years; ElastiCache: 2 x cache.r7g.large (26 GB) |
| Storage Type | Block (RDS EBS gp3), Object (S3), In-memory (ElastiCache) |
| Replication | RDS: synchronous Multi-AZ standby + asynchronous cross-region read replica (DR); S3: cross-region replication for audit logs; ElastiCache: cluster mode with replicas |
| Minimum RPO | 15 minutes (continuous backup with RDS point-in-time recovery) |

| Classification Level | Data Types | Handling Requirements |
| --- | --- | --- |
| Public | API documentation, partner onboarding guides | Open access, no special controls |
| Internal | Application logs (PII-redacted), partner registration data, infrastructure metrics | Internal access controls, standard encryption at rest |
| Restricted | Account data, transaction data, consent records, audit logs | Encrypted at rest and in transit, field-level encryption for PII, access-controlled and audited, 7-year retention for consent/audit data |

| Stage | Description | Controls |
| --- | --- | --- |
| Creation / Ingestion | Account and transaction data replicated from core banking Oracle DB via CDC (nightly batch + near-real-time CDC for balances); consent records created via Auth Service | Schema validation, data type enforcement, PII field identification and tagging at ingestion |
| Processing | API requests query PostgreSQL; PII fields decrypted only at point of use within service; response payloads assembled and returned | Column-level decryption in application code; no PII in logs; request/response audit events emitted |
| Storage | PostgreSQL (Multi-AZ, gp3 IOPS), Redis (in-memory with persistence), S3 (audit logs) | AES-256 encryption at rest (KMS CMK), TLS in transit, automated backups (daily full, continuous WAL archiving) |
| Sharing / Transfer | API responses to authorised partners; audit logs to Splunk; notification events to partners via webhooks | TLS 1.3 in transit, OAuth 2.0 scope enforcement, HMAC-signed webhook payloads, PII minimisation in responses |
| Archival | Audit logs transitioned from S3 Standard to S3 Glacier after 1 year, then Glacier Deep Archive after 3 years | S3 lifecycle policies, retrieval SLA: 12 hours from Glacier, 48 hours from Deep Archive |
| Deletion / Purging | Transaction data purged after 2 years (rolling); consent records purged 7 years after expiry; Redis cache TTL-based eviction | PostgreSQL scheduled jobs (pg_cron); S3 lifecycle expiration rules; deletion logged in audit trail |

| Assessment Type | ID | Status | Link |
| --- | --- | --- | --- |
| DPIA | DPIA-2024-047 | Completed, approved by DPO | Confluence: /compliance/dpia-047 |
| PIA | PIA-2024-031 | Completed | Confluence: /compliance/pia-031 |

| Approach | Selected |
| --- | --- |
| Sensitive data is masked (describe method below) | [x] |

Production data used in staging environment only, with all PII fields masked using a deterministic tokenisation approach (Delphix DataVault). Account numbers, names, and addresses are replaced with realistic synthetic data. Test and development environments use entirely synthetic data generated by the API team.

  • Yes — checksums (SHA-256) are computed for all data replicated from core banking and validated on ingestion. Transaction amounts are verified using double-entry accounting reconciliation jobs that run hourly, comparing aggregated balances against core banking source of truth.
  • No — no data is stored on end-user devices. All data is served via API and is not cached client-side (Cache-Control: no-store headers applied to all API responses containing customer data).
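
A minimal sketch of the ingestion checksum described in the first bullet, assuming a canonical record form agreed with the core banking team (the canonical form and rejection behaviour here are assumptions):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

public final class IngestionChecksum {

    /** SHA-256 over the record's canonical form, hex-encoded to match the source system. */
    public static String sha256(String canonicalRecord) throws Exception {
        MessageDigest md = MessageDigest.getInstance("SHA-256");
        return HexFormat.of().formatHex(md.digest(canonicalRecord.getBytes(StandardCharsets.UTF_8)));
    }

    /** On mismatch the record is rejected and alerted on, never silently ingested. */
    public static boolean matches(String canonicalRecord, String expectedHex) throws Exception {
        return MessageDigest.isEqual(
                HexFormat.of().parseHex(sha256(canonicalRecord)),
                HexFormat.of().parseHex(expectedHex));
    }
}
```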

| Destination | Data Type | Classification | Transfer Method | Protection |
| --- | --- | --- | --- | --- |
| Authorised TPPs (partner fintech apps) | Account balances, transaction history | Restricted | REST API over HTTPS / TLS 1.3 | OAuth 2.0 scope enforcement, mTLS, PII minimisation, consent-based access only |
| Featurespace ARIC | Transaction metadata (no PII) | Internal | REST API over HTTPS / TLS 1.3 | API key authentication, IP allowlist, PII stripped before transmission |
| Splunk (corporate instance) | Application and security logs | Internal | HTTPS (Splunk HEC) | PII redacted at source by logging framework; TLS 1.3 in transit |

  • Yes — all customer data (PII and transaction data) must remain within the United Kingdom (eu-west-2 London region). The DR region (eu-west-1 Ireland) stores only non-PII operational data (metrics, redacted logs). Cross-region replication for RDS is configured to exclude PII columns (custom replication using CDC with PII filtering). Audit logs in S3 are replicated to eu-west-1 with PII fields encrypted using a region-specific KMS key that prevents decryption outside eu-west-2.

| Question | Response |
| --- | --- |
| Retention periods minimised to regulator + business need | Yes — transaction data retained for 7 years (FCA SYSC requirement); audit logs 7 years; access logs 13 months (regulatory minimum); ephemeral session data ≤ 24 hours. Lifecycle policies enforce expiry automatically; no “indefinite” retention. |
| Older data tiered to cold/archive storage | Yes — audit logs and transaction archives transition S3 Standard → Standard-IA (30 days) → Glacier Instant Retrieval (90 days) → Glacier Deep Archive (1 year). RDS snapshots > 35 days exported to S3 Glacier. ~78% of historical data sits in archive tiers. |
| Unused or duplicate replicas identified and removed | Yes — weekly orphaned-snapshot job; quarterly review of read replicas (currently 2, justified by read traffic distribution). No legacy unused buckets (verified via AWS Trusted Advisor). |
| Compression applied to reduce storage and transfer | Yes — Brotli compression on HTTPS responses (~70% reduction on JSON payloads); gzip on S3 audit log uploads; Parquet (with Snappy) for analytics exports to Snowflake. |
| Cross-region replication justified by recovery requirement | Yes — only audit logs and operational metrics replicate cross-region. Customer PII does not (data sovereignty + reduced cross-region transfer carbon cost). DR for the RDS primary is via daily encrypted backup snapshots restored on-demand, not continuous replication. |
| Large data transfers scheduled to off-peak windows | Yes — nightly Snowflake export runs 02:00-04:00 UTC; weekly partner reconciliation transfers run Sunday 03:00 UTC; both deliberately scheduled when UK grid carbon intensity is lowest (per carbonintensity.org.uk historical data). |

| Question | Response |
| --- | --- |
| Does the solution support regulated activities? | Yes — PSD2 Account Information Services (AIS) and PCI-DSS-scoped transaction data processing |
| Is the solution SaaS or third-party hosted? | No — self-managed on AWS (IaaS/PaaS) |
| Has a third-party risk assessment been completed? | Yes — AWS: MFS-TRA-2023-012 (approved); Featurespace: MFS-TRA-2024-008 (approved) |

| Impact Category | Business Impact if Compromised |
| --- | --- |
| Confidentiality | Critical — exposure of customer financial data would trigger mandatory FCA notification, potential regulatory fines (up to 4% of annual turnover under GDPR), and severe reputational damage |
| Integrity | High — manipulation of transaction or balance data could lead to incorrect financial reporting and partner disputes |
| Availability | Critical — outage breaches CMA Open Banking mandate and SLA commitments to 25+ partners; estimated GBP 45,000/hour revenue impact |
| Non-Repudiation | High — inability to prove API request/response authenticity could undermine dispute resolution with partners and regulators |

A STRIDE-based threat model was conducted (reference: SEC-TM-2024-019). Key threats:

| Threat | Attack Vector | Likelihood | Impact | Mitigation |
| --- | --- | --- | --- | --- |
| Stolen OAuth token used by unauthorised party | Token theft via compromised partner application | Medium | High | Short-lived tokens (5 min expiry), token binding to mTLS certificate, refresh token rotation, token revocation endpoint |
| API abuse / data scraping | Compromised partner credentials used for bulk data extraction | Medium | High | Rate limiting (100 req/s per partner), anomaly detection via WAF, consent-scoped data access, audit log monitoring |
| SQL injection | Malformed API parameters targeting PostgreSQL | Low | Critical | Parameterised queries only (no dynamic SQL), WAF SQL injection rules, SAST scanning in CI/CD |
| DDoS attack on API endpoint | Volumetric or application-layer DDoS | Medium | High | AWS Shield Advanced, WAF rate-based rules, API Gateway throttling, CloudFront edge absorption |
| Insider threat (admin misuse) | Privileged administrator accesses customer data | Low | Critical | JIT access via CyberArk, all admin actions logged and alerted, segregation of duties, quarterly access reviews |
| Man-in-the-middle on Direct Connect | Interception of core banking data in transit | Low | Critical | TLS 1.2 encryption on JDBC connections over Direct Connect; private VLAN; MACsec on Direct Connect |
| Container escape | Compromised container breaks out to host | Low | High | Read-only root filesystem, non-root containers, Kubernetes pod security standards (restricted), Falco runtime detection |

| Access Type | Role(s) | Destination(s) | Authentication Method | Credential Protection |
| --- | --- | --- | --- | --- |
| Internal admin portal | Platform Admin, Partner Manager, Compliance Viewer | Admin API, Grafana, partner management UI | Okta SSO (OIDC) with MFA (FIDO2/push) | Okta credential policies (90-day rotation, 16-char min) |
| SRE / Operations | SRE Engineer, DBA | EKS (kubectl), AWS Console, RDS, bastion host | Okta SSO via AWS IAM Identity Centre; SSH via bastion with short-lived certificates | CyberArk for privileged sessions; SSH certificates (8-hour validity) |
| Service accounts | CI/CD pipeline, monitoring agents | EKS API, AWS services, Splunk | IAM roles (IRSA for EKS pods), GitHub OIDC for CI/CD | No long-lived credentials; IAM roles with least privilege |

| Access Type | Role(s) | Destination(s) | Authentication Method | Credential Protection |
| --- | --- | --- | --- | --- |
| Partner applications | TPP (Third-Party Provider) | CAP API Gateway | OAuth 2.0 client credentials (FAPI profile) + mTLS | Client certificates issued by MFS PKI (2048-bit RSA, 1-year validity); client secrets stored in partner’s own secret management |
| Partner developers | Developer | Developer portal (documentation) | API key (read-only documentation access) | API keys rotated annually; rate-limited to 10 req/s |

| Control | Response |
| --- | --- |
| Does the application use SSO or group-wide authentication? | Yes — Okta SSO for all internal access; OAuth 2.0 for external partner access |
| What is the unique identifier for user accounts? | Internal: Okta user ID (email-based); External: TPP registration ID (OBIE-assigned) |
| What is the authentication flow? | Internal: OIDC authorization code flow with PKCE; External: OAuth 2.0 client credentials with FAPI-compliant token request over mTLS |
| How are credentials issued to users? | Internal: Okta provisioning from Active Directory; External: client certificate and secret issued during partner onboarding |
| What are the credential complexity rules? | Internal: Okta policy (16-char min, complexity required); External: 2048-bit RSA certificates, 256-bit client secrets |
| What are the credential rotation rules? | Internal: 90-day password rotation; External: annual certificate renewal, client secret rotation supported |
| What are the account lockout rules? | Internal: 5 failed attempts, 30-minute lockout; External: 10 failed auth attempts, automatic partner notification and 1-hour lockout |
| How can users reset forgotten credentials? | Internal: Okta self-service with MFA verification; External: partner contacts MFS API Support team |

| Control | Response |
| --- | --- |
| How are sessions established after authentication? | Internal: OIDC session cookie (HttpOnly, Secure, SameSite=Strict), 8-hour max session; External: OAuth 2.0 access tokens (JWT, 5-minute expiry) with refresh tokens (24-hour expiry) |
| How are session tokens protected against misuse? | JWTs are signed (RS256) and optionally encrypted (A256GCM); token binding to mTLS certificate thumbprint prevents token replay; refresh tokens are single-use with rotation |
| What are the session timeout and concurrency limits? | Internal: 30-minute idle timeout, 8-hour absolute; External: access tokens 5-minute absolute, no concurrency limits on stateless API access |
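
The mTLS token binding mentioned above follows the certificate-bound access token pattern (RFC 8705): the token carries the SHA-256 thumbprint of the client certificate in its cnf claim, and the Auth Service compares it against the certificate actually presented on the connection. A minimal sketch, assuming the JWT has already been signature-verified and the cnf thumbprint extracted by the caller:

```java
import java.security.MessageDigest;
import java.security.cert.X509Certificate;
import java.util.Base64;

public final class TokenBindingCheck {

    /**
     * Returns true when the presented mTLS client certificate matches the
     * x5t#S256 thumbprint carried in the access token's cnf claim.
     */
    public static boolean matches(X509Certificate clientCert, String cnfThumbprint) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-256").digest(clientCert.getEncoded());
        String presented = Base64.getUrlEncoder().withoutPadding().encodeToString(digest);
        // Constant-time comparison; a mismatch indicates a replayed or stolen token.
        return MessageDigest.isEqual(presented.getBytes(), cnfThumbprint.getBytes());
    }
}
```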

| Access Type | Role / Scope | Entitlement Store | Provisioning Process |
| --- | --- | --- | --- |
| Business Users (internal admin) | Platform Admin, Partner Manager, Compliance Viewer, Read-Only | Okta groups mapped to Kubernetes RBAC and application roles | Okta group membership managed by line managers via ServiceNow request |
| Technology Users (SRE) | SRE Engineer (full), DBA (database only), Developer (non-prod only) | AWS IAM Identity Centre permission sets + Kubernetes RBAC | IAM Identity Centre permission sets assigned via Terraform; JIT elevation via CyberArk |
| Service Accounts | Scoped IAM roles per service (least privilege) | AWS IAM policies attached to IRSA roles | Terraform-managed; reviewed quarterly |
| External Partners | OAuth 2.0 scopes: accounts:read, transactions:read, consent:manage | OAuth 2.0 token claims, enforced by Auth Service | Scopes assigned during partner onboarding; consent per customer |

| Control | Response |
| --- | --- |
| Account re-certification process | Quarterly access review by Platform Admin; annual review by CISO office for all privileged accounts |
| Segregation of duties controls | Developers cannot deploy to production (CI/CD pipeline enforces); DBAs cannot modify application code; Compliance Viewers have read-only access |
| Delegated authorisation capabilities | Partner access is consent-based: customers authorise specific TPPs to access their data via the consent flow; consent is time-limited and revocable (see the consent-check sketch below) |
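
The consent-based delegated authorisation in the last row reduces to a simple predicate at request time. A sketch of the check the Auth Service could apply, with illustrative field names (the real consent record lives in the Accounts DB):

```java
import java.time.Instant;
import java.util.Set;

/** Field names are illustrative assumptions, not the platform's actual schema. */
record ConsentRecord(String tppId, String accountId, Set<String> scopes,
                     Instant expiresAt, boolean revoked) {

    /** A request is permitted only while the consent is unrevoked, unexpired,
        bound to this TPP and account, and covers the scope being exercised. */
    boolean permits(String requestingTpp, String requestedAccount,
                    String requiredScope, Instant now) {
        return !revoked
                && now.isBefore(expiresAt)
                && tppId.equals(requestingTpp)
                && accountId.equals(requestedAccount)
                && scopes.contains(requiredScope);
    }
}
```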

| Account Type | Management Approach |
| --- | --- |
| OS privileged accounts (root/admin) | EKS managed nodes: no SSH access by default; SSM Session Manager for emergency access with audit trail; root disabled |
| Infrastructure / platform admin | AWS IAM Identity Centre with JIT privilege elevation via CyberArk; 4-hour maximum session; all actions CloudTrail-logged |
| Application admin | Admin API protected by Okta SSO + MFA; admin actions audited; no direct database access (all operations via admin API) |

3.5.3 Network Security & Perimeter Protection

| Control | Implementation |
| --- | --- |
| Network segmentation | VPC with public, private, and isolated subnets across 2 AZs; security groups per service (allow only required ports/protocols); NACLs as secondary layer; EKS pods use Calico network policies for pod-to-pod segmentation |
| Ingress filtering | AWS WAF v2 (OWASP Top 10 rules, rate limiting, geo-restriction to permitted countries), Shield Advanced, API Gateway throttling; NLB in public subnet routes to API Gateway |
| Egress filtering | NAT Gateway with fixed Elastic IPs for outbound (partner webhooks, Featurespace); egress security groups restrict destinations to known endpoints; VPC Flow Logs for monitoring |
| Encryption in transit | TLS 1.3 enforced for partner API traffic; TLS 1.2 minimum for all other connections; certificates managed by AWS Certificate Manager (ACM) for public endpoints; private CA for internal mTLS |

| Attribute | Detail |
| --- | --- |
| Encryption deployment level | Storage (all data stores) + Application (field-level for PII columns) |
| Key type | Symmetric (AES-256 for storage and field-level encryption) |
| Algorithm / cipher / key length | AES-256-GCM (field-level), AES-256 (RDS, S3, ElastiCache) |
| Key generation method | AWS KMS (HSM-backed, FIPS 140-2 Level 3) |
| Key storage | AWS KMS (customer-managed keys per data classification) |
| Key rotation schedule | Annual automatic rotation (KMS-managed); field-level encryption keys rotated semi-annually with re-encryption job |
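
The AES-256-GCM field-level encryption above can be sketched with the JDK's standard crypto APIs. In the real platform the data key would be a KMS-wrapped envelope key; here it is passed in as a plain SecretKey, and the IV-prefixed output layout is an assumption:

```java
import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;

import java.nio.ByteBuffer;
import java.security.SecureRandom;

public final class FieldCipher {

    private static final int IV_BYTES = 12;   // 96-bit IV, the recommended size for GCM
    private static final int TAG_BITS = 128;  // full-length authentication tag
    private static final SecureRandom RNG = new SecureRandom();

    /** Encrypts one PII column value; output is IV || ciphertext so rotation jobs can re-encrypt in place. */
    public static byte[] encrypt(SecretKey dataKey, byte[] plaintext) throws Exception {
        byte[] iv = new byte[IV_BYTES];
        RNG.nextBytes(iv); // fresh IV per value; never reuse an IV under the same key
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, dataKey, new GCMParameterSpec(TAG_BITS, iv));
        byte[] ciphertext = cipher.doFinal(plaintext);
        return ByteBuffer.allocate(iv.length + ciphertext.length).put(iv).put(ciphertext).array();
    }

    public static byte[] decrypt(SecretKey dataKey, byte[] blob) throws Exception {
        ByteBuffer buf = ByteBuffer.wrap(blob);
        byte[] iv = new byte[IV_BYTES];
        buf.get(iv);
        byte[] ciphertext = new byte[buf.remaining()];
        buf.get(ciphertext);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.DECRYPT_MODE, dataKey, new GCMParameterSpec(TAG_BITS, iv));
        return cipher.doFinal(ciphertext); // throws AEADBadTagException if the value was tampered with
    }
}
```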

| Attribute | Detail |
| --- | --- |
| Secret store | AWS Secrets Manager (database credentials, API keys); Kubernetes Secrets (encrypted with KMS via EKS envelope encryption) for pod configuration |
| Secret distribution | Retrieved on-demand by services at runtime via Secrets Manager SDK; Kubernetes Secrets mounted as volumes (not environment variables) |
| Secret protection on host | Memory only — secrets are never written to disk; Kubernetes Secrets encrypted at rest in etcd via KMS |
| Secret rotation | Automatic — Secrets Manager Lambda rotation for RDS credentials (30-day cycle); partner API keys rotated annually via partner onboarding portal |
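
Runtime retrieval via the Secrets Manager SDK looks roughly like the following (AWS SDK for Java v2; the secret name is hypothetical). Because rotation replaces the value every 30 days, callers should re-fetch or cache briefly rather than hold credentials for the process lifetime:

```java
import software.amazon.awssdk.services.secretsmanager.SecretsManagerClient;
import software.amazon.awssdk.services.secretsmanager.model.GetSecretValueRequest;

public final class DbCredentials {

    /** Fetches the current RDS application-user secret; the secret name is a hypothetical example. */
    public static String fetchSecretJson() {
        try (SecretsManagerClient client = SecretsManagerClient.create()) {
            GetSecretValueRequest request = GetSecretValueRequest.builder()
                    .secretId("cap/prod/rds/app-user")
                    .build();
            // secretString() returns the JSON document maintained by the rotation Lambda.
            return client.getSecretValue(request).secretString();
        }
    }
}
```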

3.5.5 Security Monitoring & Threat Detection

| Capability | Implementation |
| --- | --- |
| Security event logging | All API requests logged with partner ID, IP, timestamp, requested scopes, response status; authentication events (success/failure); authorisation decisions; admin actions. Logs forwarded to Splunk via Fluent Bit |
| SIEM integration | Splunk Enterprise (corporate instance) — all security events forwarded via HTTP Event Collector (HEC); custom Splunk correlation rules for anomaly detection |
| Infrastructure event detection | AWS GuardDuty (EKS runtime monitoring, S3 protection, malware scanning); AWS CloudTrail (all API calls); VPC Flow Logs (network anomaly detection); Falco (container runtime security) |
| Security alerting | Splunk alerts for: failed authentication spikes (>10 in 5 min per partner), unusual data access patterns, privilege escalation attempts, WAF rule triggers. Alerts routed to PagerDuty (P1: immediate page; P2: 15-min response) |

UC-01: Partner Retrieves Account Balance

| Attribute | Detail |
| --- | --- |
| Actor(s) | Partner fintech application (authorised TPP) |
| Trigger | Partner app sends GET /accounts/{accountId}/balance request |
| Pre-conditions | Partner has valid OAuth 2.0 access token with accounts:read scope; customer has granted consent to this TPP for this account |
| Main Flow | 1. Partner sends HTTPS request with Bearer token and mTLS client certificate to API Gateway. 2. API Gateway validates request structure and routes to Auth Service. 3. Auth Service validates OAuth token, verifies mTLS certificate binding, checks consent record in PostgreSQL. 4. Auth Service returns authorisation decision to API Gateway. 5. API Gateway routes to Account Service. 6. Account Service checks Redis cache for balance (60s TTL). 7. Cache hit: return cached balance. Cache miss: Account Service queries core banking read replica via JDBC, caches result, returns balance. 8. API Gateway returns JSON response to partner. 9. Audit event emitted to EventBridge. |
| Post-conditions | Partner receives account balance; audit log records the access; cache updated if miss occurred |
| Views Involved | Logical (services), Integration & Data Flow (API flow), Physical (EKS, RDS, Redis, Direct Connect), Data (account data, cache), Security (OAuth, mTLS, consent, audit) |
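
Steps 6-7 of the main flow are the cache-aside pattern from the Logical View. A minimal sketch using the Jedis client, with a hypothetical endpoint and a placeholder for the core banking lookup:

```java
import redis.clients.jedis.JedisPooled;

public class BalanceReader {

    private static final long TTL_SECONDS = 60; // matches the 60s balance TTL in the Data View

    // Hypothetical in-VPC endpoint; production uses TLS and an AUTH token from Secrets Manager.
    private final JedisPooled redis = new JedisPooled("cap-redis.internal", 6379);

    public String getBalance(String accountId) {
        String key = "balance:" + accountId;
        String cached = redis.get(key);
        if (cached != null) {
            return cached;                          // cache hit: no core banking round trip
        }
        String fresh = queryCoreBanking(accountId); // cache miss: read the Oracle read replica
        redis.setex(key, TTL_SECONDS, fresh);       // populate with a short TTL to bound staleness
        return fresh;
    }

    private String queryCoreBanking(String accountId) {
        throw new UnsupportedOperationException("placeholder for the JDBC read-replica call");
    }
}
```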

UC-02: Rate Limit Exceeded

| Attribute | Detail |
| --- | --- |
| Actor(s) | Partner fintech application |
| Trigger | Partner exceeds 100 req/s rate limit |
| Pre-conditions | Partner is authenticated and making valid requests |
| Main Flow | 1. Partner sends request to API Gateway. 2. API Gateway rate-limiting check identifies partner has exceeded 100 req/s quota. 3. API Gateway returns HTTP 429 Too Many Requests with Retry-After header. 4. Rate limit event logged and counted. 5. If sustained (>5 min), Splunk alert triggers notification to Partner Manager. 6. Notification Service sends email to partner’s registered technical contact. |
| Post-conditions | Partner receives 429 response; partner is notified; rate limit event logged for analysis |
| Views Involved | Logical (API Gateway, Notification Service), Integration & Data Flow (rate limiting flow), Security (abuse detection), Operational Excellence (alerting) |
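
From the partner's side, the well-behaved response to step 3 is to honour the Retry-After header rather than retrying immediately. A sketch using the JDK HTTP client (mTLS configuration omitted for brevity; the fallback pause and attempt count are assumptions):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PartnerApiClient {

    private final HttpClient http = HttpClient.newHttpClient();

    /** Retries a throttled call after the pause the gateway asks for, up to three attempts. */
    public HttpResponse<String> getWithBackoff(URI uri, String bearerToken) throws Exception {
        for (int attempt = 0; attempt < 3; attempt++) {
            HttpRequest request = HttpRequest.newBuilder(uri)
                    .header("Authorization", "Bearer " + bearerToken)
                    .GET()
                    .build();
            HttpResponse<String> response = http.send(request, HttpResponse.BodyHandlers.ofString());
            if (response.statusCode() != 429) {
                return response;
            }
            long pauseSeconds = response.headers()
                    .firstValue("Retry-After")
                    .map(Long::parseLong)
                    .orElse(1L); // fall back to 1s if the header is absent
            Thread.sleep(pauseSeconds * 1_000);
        }
        throw new IllegalStateException("Rate limit still exceeded after retries");
    }
}
```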

UC-03: Fraud Alert Triggered During Transaction Retrieval

| Attribute | Detail |
| --- | --- |
| Actor(s) | Partner fintech application, Featurespace ARIC |
| Trigger | Partner requests transaction history for an account flagged for suspected fraud |
| Pre-conditions | Partner has valid token with transactions:read scope; account has active fraud flag in Featurespace |
| Main Flow | 1. Partner sends GET /accounts/{accountId}/transactions. 2. Request authenticated and authorised as per UC-01 flow. 3. Transaction Service queries Featurespace ARIC fraud scoring API for account risk score. 4. ARIC returns high-risk score (>0.85). 5. Transaction Service applies fraud response policy: returns limited transaction data (last 30 days only, no pending transactions), adds X-Fraud-Review: true header. 6. High-priority security event emitted to EventBridge. 7. Splunk alert fires immediately; PagerDuty pages on-call fraud analyst. 8. Notification Service sends webhook to MFS internal fraud team channel (Slack). |
| Post-conditions | Partner receives restricted data set; fraud team alerted; full audit trail recorded; account flagged for manual review |
| Views Involved | Logical (Transaction Service, Notification Service), Integration & Data Flow (Featurespace integration), Security (fraud detection, data restriction), Operational Excellence (alerting, escalation) |

3.6.2 Architecture Decision Records (ADRs)


ADR-001: EKS over ECS for Container Orchestration

| Field | Content |
| --- | --- |
| Status | Accepted |
| Date | 2024-10-01 |
| Context | The platform requires a container orchestration solution to run microservices. Both Amazon EKS (managed Kubernetes) and Amazon ECS (AWS-native container service) were evaluated. |
| Decision | Use Amazon EKS (Kubernetes). |
| Alternatives Considered | ECS Fargate: lower operational overhead, but limited pod-level networking control and no support for Envoy sidecar injection (Istio/Linkerd) needed for mTLS mesh. ECS on EC2: more control but still lacks the Kubernetes ecosystem (Helm, Argo CD, Calico network policies). Self-managed Kubernetes on EC2: maximum control but unacceptable operational burden for a 6-person platform team. |
| Consequences | Positive: rich ecosystem (Helm, Argo CD, Calico, Prometheus), strong portability to other clouds, existing team Kubernetes skills. Negative: higher operational complexity than ECS Fargate, Kubernetes version upgrade overhead every 12-14 months. |
| Quality Attribute Tradeoffs | Operational Excellence: increased complexity (negative) offset by richer observability tooling (positive). Reliability: Kubernetes self-healing (positive). Cost: slightly higher than ECS Fargate due to node management (negative). Portability: significantly better (positive). |

ADR-002: PostgreSQL over DynamoDB for Primary Data Store

| Field | Content |
| --- | --- |
| Status | Accepted |
| Date | 2024-10-05 |
| Context | The platform needs a primary data store for account metadata, transaction data, and consent records. The data is relational (accounts have transactions, consent links customers to TPPs and accounts) and requires strong consistency for financial accuracy. |
| Decision | Use Amazon RDS PostgreSQL 16. |
| Alternatives Considered | DynamoDB: excellent scalability and operational simplicity, but poor fit for relational queries (joins across accounts/transactions/consent), no native support for the field-level encryption patterns used for PII, and the team has limited DynamoDB experience. Aurora PostgreSQL: considered, but standard RDS PostgreSQL meets performance requirements at lower cost; Aurora’s distributed storage overhead is unnecessary at current data volumes. |
| Consequences | Positive: strong relational model for financial data, excellent ecosystem (pg_cron, pgcrypto for field-level encryption), team expertise, straightforward backup/recovery. Negative: vertical scaling limits (mitigated by read replicas and Redis caching), operational overhead of PostgreSQL tuning. |
| Quality Attribute Tradeoffs | Performance: adequate for 5,000 req/s with caching layer (neutral). Reliability: Multi-AZ provides HA (positive). Cost: lower than Aurora at current scale (positive). Portability: standard PostgreSQL, highly portable (positive). |

ADR-003: Event-Driven Architecture for Notifications and Audit

| Field | Content |
| --- | --- |
| Status | Accepted |
| Date | 2024-10-08 |
| Context | The platform must send notifications (partner webhooks, internal alerts, compliance emails) and write audit logs. These operations must not increase API response latency. |
| Decision | Use Amazon EventBridge with SQS for asynchronous notification and audit processing. |
| Alternatives Considered | Synchronous processing: simple, but adds 50-100ms to every API response for audit writes and notification dispatch; unacceptable for the P95 < 200ms target. Amazon SNS + SQS: works, but lacks EventBridge's content-based filtering and schema registry. Apache Kafka (MSK): powerful, but over-engineered for current throughput (5,000 events/s); the operational overhead of Kafka cluster management is not justified. |
| Consequences | Positive: API response latency unaffected by notification/audit processing, natural decoupling enables independent scaling of the Notification Service, EventBridge schema registry aids contract evolution. Negative: eventual consistency for audit logs (acceptable: audit logs are written within seconds), added infrastructure complexity. |
| Quality Attribute Tradeoffs | Performance: significant improvement in P95 latency (positive). Reliability: event replay capability aids recovery (positive). Cost: EventBridge pricing is consumption-based, cost-effective at current volumes (positive). Operational Excellence: additional component to monitor (negative, mitigated by managed service). |
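
The asynchronous pattern in ADR-003 is straightforward to illustrate. Below is a minimal sketch using the AWS SDK for Java v2; the bus name, event source, and payload shape are hypothetical placeholders, not CAP's actual configuration.

```java
// Sketch only: publishing an audit event to EventBridge off the request path.
// Bus name, source, and detail-type are hypothetical.
import software.amazon.awssdk.services.eventbridge.EventBridgeClient;
import software.amazon.awssdk.services.eventbridge.model.PutEventsRequest;
import software.amazon.awssdk.services.eventbridge.model.PutEventsRequestEntry;

public class AuditEventPublisher {
    private final EventBridgeClient events = EventBridgeClient.create();

    /** The API response does not wait on downstream consumers (SQS, Notification Service). */
    public void publishAuditEvent(String partnerId, String action, String resource) {
        PutEventsRequestEntry entry = PutEventsRequestEntry.builder()
                .eventBusName("cap-audit-bus")            // hypothetical bus name
                .source("cap.account-service")            // hypothetical event source
                .detailType("AuditRecordCreated")
                .detail(String.format(
                        "{\"partnerId\":\"%s\",\"action\":\"%s\",\"resource\":\"%s\"}",
                        partnerId, action, resource))
                .build();
        events.putEvents(PutEventsRequest.builder().entries(entry).build());
    }
}
```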

4.1.1 Observability — Logging

| Log Type | Events Logged | Local Storage | Retention Period | Remote Services |
| --- | --- | --- | --- | --- |
| Application logs | API request/response metadata (no PII), service errors, business events, performance metrics | stdout/stderr (container) | Ephemeral (container lifecycle) | Fluent Bit -> S3 (90 days) + Splunk (1 year) |
| Data store logs | PostgreSQL slow queries (>100ms), connection events, error logs | RDS log files | 7 days (RDS) | CloudWatch Logs -> Splunk |
| Infrastructure logs | EKS control plane logs, node-level system logs, VPC Flow Logs | CloudWatch Logs | 90 days (CloudWatch) | Splunk (security-relevant subset) |
| Security event logs | Authentication success/failure, authorisation decisions, admin actions, WAF blocks, GuardDuty findings | CloudWatch Logs + S3 | 7 years (S3) + 1 year (Splunk) | Splunk (all security events), PagerDuty (critical alerts) |

4.1.2 Observability — Monitoring & Alerting

| Alert Category | Trigger Condition | Notification Method | Recipient |
| --- | --- | --- | --- |
| API error rate (5xx) | > 1% of requests over 5 minutes | PagerDuty (P1) | SRE on-call |
| API latency (P95) | > 500ms over 5 minutes | PagerDuty (P2) | SRE on-call |
| Authentication failure spike | > 10 failures per partner in 5 minutes | PagerDuty (P2) + Slack | SRE on-call + Security team |
| Database connection pool exhaustion | > 80% pool utilisation | PagerDuty (P2) | SRE on-call + DBA |
| EKS node not ready | Any node NotReady for > 2 minutes | PagerDuty (P2) | SRE on-call |
| Certificate expiry approaching | < 30 days to expiry | Slack + Email | Platform team |
| Cost anomaly | > 20% increase in daily spend | Email | Platform team + Finance |
| Partner rate limit sustained breach | Partner exceeds limit for > 5 minutes | Slack + Email | Partner Manager |
| Disk utilisation (RDS) | > 80% storage used | PagerDuty (P3) + Slack | DBA + SRE |
| Fraud alert (high-risk score) | ARIC score > 0.85 | PagerDuty (P1) + Slack | Fraud team + SRE |
| Capability | Tool | Coverage |
| --- | --- | --- |
| Application Performance Monitoring | Grafana (with Prometheus data source) | All microservices (request rate, error rate, duration — RED metrics) |
| Infrastructure Monitoring | Amazon CloudWatch + Prometheus (via Amazon Managed Prometheus) | EKS cluster, RDS, ElastiCache, API Gateway, S3, VPC |
| Log Aggregation | Splunk Enterprise (corporate) | All application, infrastructure, and security logs |
| Distributed Tracing | Jaeger (deployed on EKS monitoring node group) | All microservices — full request tracing from API Gateway to core banking |
| Dashboards | Grafana (6 dashboards) | API overview, per-partner metrics, infrastructure health, cost, SLA compliance, security events |
| Alerting & Incident Management | PagerDuty | All P1-P3 alerts; integrated with Splunk and CloudWatch |
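
The RED metrics referenced above are typically captured in-process and scraped by Prometheus. A minimal sketch, assuming Micrometer (the usual choice in Spring Boot services); the metric and tag names are illustrative, not CAP's actual conventions.

```java
// Sketch only: recording RED metrics (rate, errors, duration) with Micrometer.
import io.micrometer.core.instrument.MeterRegistry;
import io.micrometer.core.instrument.Timer;
import java.util.function.Supplier;

public class RedMetrics {
    private final MeterRegistry registry;

    public RedMetrics(MeterRegistry registry) {
        this.registry = registry;
    }

    public <T> T timed(String endpoint, Supplier<T> call) {
        Timer.Sample sample = Timer.start(registry);
        String outcome = "success";
        try {
            return call.get();              // request rate and duration come from the timer
        } catch (RuntimeException e) {
            outcome = "error";              // error rate is derived from this tag in Grafana
            throw e;
        } finally {
            sample.stop(Timer.builder("cap.api.requests")   // hypothetical metric name
                    .tag("endpoint", endpoint)
                    .tag("outcome", outcome)
                    .register(registry));
        }
    }
}
```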
| Question | Response |
| --- | --- |
| What metrics are collected for capacity monitoring? | CPU utilisation, memory utilisation, pod count, HPA scaling events, RDS connections, RDS storage, Redis memory, API Gateway request count, EKS node count |
| How are capacity trends analysed? | Weekly automated report from Grafana (30-day trend); monthly capacity review meeting with SRE and Platform team; quarterly projection against growth model |
| Are capacity thresholds and alerts configured? | Yes — alerts at 70% (warning) and 85% (critical) for CPU, memory, storage, and connection pools |
| Is there a capacity planning process? | Yes — annual capacity plan updated quarterly; aligned with the partner onboarding forecast from the business development team |
| Procedure | Description | Owner | Documentation |
| --- | --- | --- | --- |
| Incident response | P1: 15-min response, P2: 30-min response; follow ITIL incident management; post-incident review within 48 hours | SRE Lead (Tom Bloggs) | Confluence: /ops/runbooks/incident-response |
| Change management | All changes via GitHub PR; production deploys require 2 approvals; change window: Tuesday-Thursday; emergency change process for P1 fixes | SRE Lead | Confluence: /ops/runbooks/change-management |
| Escalation paths | L1: SRE on-call -> L2: SRE Lead -> L3: Solution Architect -> L4: CTO. Security incidents: CISO notified immediately for P1 | SRE Lead | Confluence: /ops/runbooks/escalation |
| On-call rotation | 24x7, 1-week rotation across 6 SRE engineers; secondary on-call for DBA coverage | SRE Lead | PagerDuty schedule: cap-production |
| Partner communication | Status page updates within 15 minutes of a confirmed incident; post-incident report to affected partners within 5 business days | Partner Manager (Sally Doe) | Confluence: /ops/runbooks/partner-comms |
| Database maintenance | Weekly vacuum/analyse (automated via pg_cron); monthly index review; quarterly RDS minor version assessment | DBA team | Confluence: /ops/runbooks/database-maintenance |

4.2.1 Geographic Footprint & Disaster Recovery

| Question | Response |
| --- | --- |
| Is the application deployed across multiple hosting venues for continuity? | Yes — primary in eu-west-2 (London) with DR in eu-west-1 (Ireland) using an active-passive (pilot light) configuration |
| What is the DR strategy? | Active-Passive (pilot light): the DR region has an EKS cluster with minimum nodes (2), an RDS read replica (promoted during failover), and pre-configured EventBridge rules. Scaled up during failover. |
| Are there data sovereignty requirements affecting geographic choices? | Yes — PII must remain in the UK (eu-west-2). The DR region stores non-PII data only. Failover for PII-containing services requires manual approval from Compliance. |
| Attribute | Response |
| --- | --- |
| Scaling capability | Full auto-scaling (Horizontal Pod Autoscaler on all services; Karpenter for EKS node auto-scaling) |
| Scaling details | HPA scales pods based on CPU (target 60%) and custom metrics (request queue depth). Karpenter provisions new Graviton nodes within 90 seconds. API Gateway has no scaling limits. RDS: read replicas can be added; vertical scaling requires brief downtime (planned maintenance window). ElastiCache: cluster mode with automatic resharding. |
| Attribute | Response |
| --- | --- |
| Dependencies adequately sized? | Yes (confirmed) — core banking read replicas tested at 3x current peak load; Featurespace ARIC SLA guarantees 10,000 req/s |
| Dependency details | Core banking Oracle read replicas: 2 replicas in eu-west-2, confirmed to handle 15,000 queries/s. Direct Connect: 1 Gbps dedicated with VPN backup. Featurespace ARIC: SLA-backed at 10,000 req/s with <100ms P95. |
Is the solution designed to tolerate and recover from component failures?

  • Yes
    • Component failures: Each microservice runs 4+ replicas across 2 AZs; Kubernetes automatically reschedules failed pods. Pod disruption budgets ensure a minimum of 2 replicas during rolling updates.
    • Graceful degradation: If core banking is unavailable, the Account Service returns cached data from Redis (with a staleness indicator). If Featurespace ARIC is unavailable, the Transaction Service returns full data without fraud scoring (with a logged exception).
    • Circuit breaker patterns: Resilience4j circuit breakers on the Core Banking Adapter (open after 5 consecutive failures, half-open after 30s) and the Featurespace client (open after 3 failures, half-open after 15s); see the sketch after this list.
    • Health checks: Kubernetes liveness probes (HTTP /health/live, 10s interval) and readiness probes (HTTP /health/ready, 5s interval, checks DB connectivity). A failed readiness probe removes the pod from service.
    • Testing practices: Monthly chaos testing with Gremlin (pod kill, AZ failure simulation, network latency injection). Quarterly DR failover drill. Annual game day exercise simulating multi-component failure.
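
A minimal sketch of the stated Core Banking Adapter breaker policy using Resilience4j; the class and method names are illustrative, and CAP's real configuration may differ in detail.

```java
// Sketch only: count-based breaker approximating "open after 5 consecutive failures,
// half-open after 30 s", with a cached-data fallback for graceful degradation.
import io.github.resilience4j.circuitbreaker.CircuitBreaker;
import io.github.resilience4j.circuitbreaker.CircuitBreakerConfig;
import java.time.Duration;
import java.util.function.Supplier;

public class CoreBankingBreaker {
    private static final CircuitBreakerConfig CONFIG = CircuitBreakerConfig.custom()
            .slidingWindowType(CircuitBreakerConfig.SlidingWindowType.COUNT_BASED)
            .slidingWindowSize(5)                            // evaluate the last 5 calls
            .minimumNumberOfCalls(5)
            .failureRateThreshold(100.0f)                    // all 5 must fail = 5 consecutive failures
            .waitDurationInOpenState(Duration.ofSeconds(30)) // half-open after 30 s
            .build();

    private final CircuitBreaker breaker = CircuitBreaker.of("core-banking", CONFIG);

    public String fetchAccount(Supplier<String> coreBankingCall, Supplier<String> cachedFallback) {
        try {
            return breaker.executeSupplier(coreBankingCall);
        } catch (Exception e) {
            return cachedFallback.get();   // graceful degradation: serve Redis-cached data
        }
    }
}
```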
| Component / Dependency | Failure Mode | Detection Method | Recovery Behaviour | User Impact |
| --- | --- | --- | --- | --- |
| Single EKS pod | Pod crash or OOM | Kubernetes liveness probe failure | Automatic restart (restartPolicy: Always); traffic redirected to healthy pods | Transparent (in-flight request may receive 503, retry expected) |
| Entire Availability Zone | AZ outage | CloudWatch AZ health checks, EKS node status | Karpenter launches replacement nodes in a healthy AZ within 90 seconds; pods rescheduled automatically | Brief degraded performance (30-90 seconds) while pods reschedule |
| RDS primary instance | Database failure | RDS Multi-AZ automatic health check | Automatic failover to standby (60-120 seconds); application reconnects via DNS endpoint | 60-120 second interruption; connection pool recovers automatically |
| ElastiCache Redis | Cache node failure | Redis cluster health check | Automatic failover to replica; cluster mode redistributes slots | Brief cache miss spike; requests fall through to the database (increased latency for 30-60 seconds) |
| Core Banking (Oracle DB) | Read replica unavailable | JDBC connection timeout (5s), circuit breaker | Circuit breaker opens; Account Service returns cached data from Redis with an X-Data-Freshness: stale header | Degraded: stale data returned (up to 60s old); partners notified via status page |
| Featurespace ARIC | API timeout or error | HTTP timeout (2s), circuit breaker | Circuit breaker opens; Transaction Service returns unscored data with an X-Fraud-Check: bypassed header; security alert raised | Degraded: full data returned without fraud filtering; manual fraud review triggered |
| Direct Connect | Link failure | CloudWatch Direct Connect metrics, BGP session monitoring | Automatic failover to site-to-site VPN backup (pre-configured, 30s convergence) | Increased latency to core banking (5-15ms additional); throughput reduced |
| API Gateway | Service disruption | Route 53 health checks | DNS failover to DR region (if activated); CloudFront serves a cached error page during brief disruption | Potential 1-5 minute disruption during regional failover |
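
For the stale-data degradation row above, a minimal Spring sketch of how a response might carry the X-Data-Freshness header. The method names and the "fresh" value are assumptions; only the "stale" value is stated in this document.

```java
// Sketch only: surfacing data freshness to partners via a response header.
import org.springframework.http.ResponseEntity;

public class AccountResponses {

    /** Served when the core banking circuit is open and Redis-cached data is used. */
    public ResponseEntity<String> staleFromCache(String cachedJson) {
        return ResponseEntity.ok()
                .header("X-Data-Freshness", "stale")   // signals up-to-60s-old data
                .body(cachedJson);
    }

    /** Normal path: data read directly from the core banking replica. */
    public ResponseEntity<String> fresh(String json) {
        return ResponseEntity.ok()
                .header("X-Data-Freshness", "fresh")   // hypothetical value, not in the SAD
                .body(json);
    }
}
```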
| Attribute | Detail |
| --- | --- |
| Backup strategy | RDS: automated snapshots + continuous WAL archiving (point-in-time recovery); S3: versioning enabled on all buckets; EKS: Velero backup of Kubernetes resources and persistent volumes |
| Backup product/service | AWS RDS Automated Backups, AWS Backup (for cross-account/cross-region copies), Velero (EKS) |
| Backup type | Full (daily RDS snapshot) + Incremental (continuous WAL/transaction log) |
| Backup frequency | RDS: daily automated snapshot at 03:00 UTC + continuous WAL archiving; S3: real-time versioning; EKS (Velero): daily at 04:00 UTC |
| Backup retention | RDS snapshots: 35 days; WAL archive: 35 days; S3 versions: 90 days; Velero: 30 days; cross-region backup copies: 7 days |
| Control | Detail |
| --- | --- |
| Immutability | RDS snapshots: locked via AWS Backup Vault Lock (compliance mode, 35-day retention); S3: Object Lock (governance mode) on the audit log bucket |
| Encryption | All backups encrypted with an AWS KMS CMK (same key as source data); cross-region copies re-encrypted with a region-specific CMK |
| Access control | Backup operations restricted to the DBA IAM role and the AWS Backup service role; snapshot sharing disabled; cross-account backup vault in an isolated security account |
| # | Scenario | Recovery Approach | RTO | RPO |
| --- | --- | --- | --- | --- |
| 1 | Primary AZ failure | Automatic: Karpenter reschedules pods to the surviving AZ; RDS Multi-AZ failover | 5 minutes | 0 (synchronous replication) |
| 2 | Primary region failure (eu-west-2) | Manual DR activation: promote the RDS read replica in eu-west-1, scale up the EKS cluster, update Route 53 DNS | 1 hour | 15 minutes (async replication lag) |
| 3 | Critical software component failure (e.g., Account Service crash loop) | Automatic: Kubernetes rolls back to the last known good deployment (revision history); manual: Argo CD rollback via Git revert | 10 minutes (auto) / 30 minutes (manual) | 0 |
| 4 | Direct Connect failure | Automatic: BGP failover to site-to-site VPN (30s convergence) | 30 seconds | 0 |
| 5 | External connectivity failure (Internet) | AWS Shield Advanced DDoS mitigation; CloudFront absorbs volumetric attacks; status page updated | 15 minutes (mitigation) | 0 |
| 6 | Ransomware / cyber-attack | Isolate affected components (security group lockdown); restore from immutable backups (AWS Backup Vault Lock); forensic investigation using preserved snapshots | 4 hours | 15 minutes (point-in-time recovery) |
| 7 | Accidental data corruption / deletion | RDS point-in-time recovery to the moment before corruption; S3 version restore for objects; Velero restore for Kubernetes resources | 1 hour | 1 minute (continuous WAL) |
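
Scenario 7's point-in-time recovery can be scripted. A hedged sketch using the AWS SDK for Java v2; the instance identifiers and timestamp are hypothetical, and in practice the DBA runbook governs this procedure.

```java
// Sketch only: restoring a new RDS instance to a point just before corruption.
import java.time.Instant;
import software.amazon.awssdk.services.rds.RdsClient;
import software.amazon.awssdk.services.rds.model.RestoreDbInstanceToPointInTimeRequest;

public class PointInTimeRestore {
    public static void main(String[] args) {
        // Example timestamp; continuous WAL archiving makes any point in the window valid.
        Instant justBeforeCorruption = Instant.parse("2025-11-01T14:32:00Z");
        try (RdsClient rds = RdsClient.create()) {
            rds.restoreDBInstanceToPointInTime(RestoreDbInstanceToPointInTimeRequest.builder()
                    .sourceDBInstanceIdentifier("cap-prod-postgres")           // hypothetical
                    .targetDBInstanceIdentifier("cap-prod-postgres-restored")  // hypothetical
                    .restoreTime(justBeforeCorruption)
                    .build());
        }
    }
}
```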

| Metric | Target | Measurement Method |
| --- | --- | --- |
| Response time (P50) | < 80ms | Jaeger trace duration, API Gateway CloudWatch metrics |
| Response time (P95) | < 200ms | Jaeger trace duration, Grafana dashboard |
| Response time (P99) | < 500ms | Jaeger trace duration, Grafana dashboard |
| Throughput | 5,000 req/s sustained, 8,000 req/s burst | API Gateway request count metrics, load test validation |
| Error rate (5xx) | < 0.01% | API Gateway 5xx count / total count |
| Partner-specific rate limit | 100 req/s per partner (burst: 200 req/s) | API Gateway usage plan metrics |
| Cache hit ratio (Redis) | > 85% | ElastiCache CloudWatch metrics |
| Core banking query latency | < 50ms (P95) | Jaeger span duration for JDBC calls |
| Attribute | Detail |
| --- | --- |
| Performance testing approach | Load testing (sustained 5,000 req/s for 1 hour), stress testing (ramp to 15,000 req/s), soak testing (3,000 req/s for 24 hours), spike testing (0 to 8,000 req/s in 30 seconds) |
| Testing tools | k6 (Grafana Labs) for load generation; Grafana for real-time monitoring during tests |
| Testing environment | Staging environment (production-mirror sizing); quarterly test in production (read-only traffic, off-peak) |
| Testing frequency | Every release in staging (automated in CI/CD via k6 Cloud); quarterly production validation; ad hoc before major partner onboarding |
| Metric | Current | 1 Year | 3 Years | 5 Years |
| --- | --- | --- | --- | --- |
| Partner applications (total) | 25 | 50 | 80 | 120 |
| Peak requests per second | 2,500 | 5,000 | 8,000 | 15,000 |
| Data volume (PostgreSQL) | 560 GB | 740 GB | 1.2 TB | 2.0 TB |
| Transaction volume (per day) | 12M | 25M | 45M | 80M |
| Storage requirement (total incl. audit) | 800 GB | 1.2 TB | 2.5 TB | 5.0 TB |
| Question | Response |
| --- | --- |
| Will the current design scale to accommodate projected growth? | Yes for the 3-year horizon. At the 5-year mark, PostgreSQL vertical scaling may reach limits; migration to Aurora PostgreSQL or the introduction of read replica sharding will be evaluated at the 3-year review. |
| Are there known seasonal or cyclical demand patterns? | Yes — 30% traffic increase on salary payment dates (25th-28th of the month), 50% increase in January (financial year activities), and 20% reduction during UK bank holidays. Auto-scaling handles these patterns. |
| Strategy | Implementation |
| --- | --- |
| Right-sizing | Graviton3 instances (m7g.xlarge) selected for best price-performance; pod resource requests set based on 6 months of production metrics; quarterly rightsizing review using AWS Compute Optimizer |
| Caching | Redis cache-aside pattern for account balances (60s TTL); API Gateway response caching for partner metadata (5-min TTL); DNS caching for internal service discovery (30s TTL); see the sketch after this table |
| Connection pooling | HikariCP connection pools per service: Account Service (max 20), Transaction Service (max 30); PgBouncer considered but not needed at current scale |
| Asynchronous processing | Audit logging and notifications fully asynchronous via EventBridge + SQS; no synchronous writes in the API response path except the primary query |
| Content delivery | Not applicable (API-only, no static assets); API Gateway edge-optimised endpoint provides global edge routing |
| Database optimisation | Composite indexes on frequently queried columns (account_id + date range); partitioned transaction table by month; EXPLAIN ANALYSE review for all new queries; pg_stat_statements monitoring for slow queries |
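
The cache-aside pattern above, sketched with Spring Data Redis and the stated 60-second TTL. The key format and loader function are illustrative assumptions.

```java
// Sketch only: cache-aside for account balances with a 60 s TTL.
import java.time.Duration;
import java.util.function.Function;
import org.springframework.data.redis.core.StringRedisTemplate;

public class BalanceCache {
    private static final Duration TTL = Duration.ofSeconds(60);
    private final StringRedisTemplate redis;

    public BalanceCache(StringRedisTemplate redis) {
        this.redis = redis;
    }

    public String getBalance(String accountId, Function<String, String> loadFromCoreBanking) {
        String key = "balance:" + accountId;                  // hypothetical key format
        String cached = redis.opsForValue().get(key);         // 1. try the cache first
        if (cached != null) {
            return cached;                                    // hit: no core banking round trip
        }
        String fresh = loadFromCoreBanking.apply(accountId);  // 2. miss: read from source
        redis.opsForValue().set(key, fresh, TTL);             // 3. populate with 60 s expiry
        return fresh;
    }
}
```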
| Attribute | Detail |
| --- | --- |
| Latency requirements | < 5ms within the VPC (pod-to-pod); < 10ms to core banking (Direct Connect); < 30ms to partner applications (Internet, UK-based) |
| Bandwidth requirements | 500 Mb/s peak egress; 200 Mb/s peak ingress; 800 Mb/s Direct Connect |
| QoS requirements | No specific QoS marking; the priority is low latency for API traffic |
| Content delivery strategy | API Gateway edge-optimised endpoints; CloudFront distribution for developer portal static assets only |
| Network optimisation | HTTP/2 enabled on API Gateway; gzip compression for responses > 1 KB; connection keep-alive (60s timeout); TCP Fast Open enabled on NLB |

| Posture | Selected | Detail |
| --- | --- | --- |
| Most cost-effective options intentionally not selected | [x] | Graviton instances are more cost-effective than x86 equivalents (20% saving); however, Multi-AZ RDS and Redis cluster mode were chosen for reliability over single-AZ (30% cost premium justified by Tier 1 criticality) |

Has detailed cost modelling been performed?

  • Yes — detailed cost modelling performed using the AWS Pricing Calculator and validated against 6 months of production billing data. A TCO comparison against the legacy PIL (on-premises Oracle SOA Suite) showed a 45% reduction in total annual operating cost.
| Component | Monthly Cost (GBP) | Notes |
| --- | --- | --- |
| EKS cluster (control plane + nodes) | 8,200 | 1 cluster, 8 m7g.xlarge nodes (average), Graviton pricing |
| RDS PostgreSQL (Multi-AZ) | 4,800 | 2x db.r7g.xlarge, Multi-AZ, 1 TB gp3 storage, reserved instance (1-year) |
| ElastiCache Redis (cluster mode) | 1,600 | 2x cache.r7g.large with replicas, reserved instance |
| API Gateway | 2,100 | 5,000 req/s average, REST API pricing |
| S3 (audit logs + application logs) | 400 | Standard + lifecycle to Glacier; growing 15 GB/month |
| Direct Connect | 1,200 | 1 Gbps dedicated connection + data transfer |
| CloudFront + WAF + Shield Advanced | 3,200 | Shield Advanced: GBP 2,400/month; WAF: GBP 300/month; CloudFront: GBP 500/month |
| EventBridge + SQS | 300 | Consumption-based pricing |
| Monitoring (Prometheus, Grafana) | 600 | Amazon Managed Prometheus + Grafana |
| Secrets Manager + KMS | 200 | Per-secret and per-API-call pricing |
| NAT Gateway + data transfer | 900 | 2 NAT Gateways (Multi-AZ) + data processing |
| Other (Route 53, CloudWatch, etc.) | 500 | DNS, CloudWatch Logs, AWS Backup |
| Total monthly (production) | 24,000 | |
| Total annual (production) | 288,000 | |
| Non-production environments | 8,000/month | Dev + Test + Staging (smaller sizing, no reserved instances) |
| Total annual (all environments) | 384,000 | |
Were any requirements compromised to reduce cost?

  • No — the design fully meets all requirements. The primary cost decision was reserving capacity (1-year reserved instances for RDS and ElastiCache), which reduced annual cost by GBP 38,000 compared to on-demand pricing.
| Practice | Implementation |
| --- | --- |
| Cost monitoring | CloudHealth (VMware Aria Cost) for daily cost tracking and anomaly detection; Grafana cost dashboard; weekly cost report to the Platform team |
| Cost allocation | AWS resource tagging strategy: Project (CAP), Environment (prod/staging/test/dev), Service (account-svc/txn-svc/auth-svc/notify-svc), CostCentre (CC-4720) |
| Reserved capacity | 1-year reserved instances for RDS (db.r7g.xlarge) and ElastiCache (cache.r7g.large); EKS nodes use Savings Plans (1-year, partial upfront) |
| Rightsizing reviews | Monthly review of AWS Compute Optimizer recommendations; quarterly review of pod resource requests vs actual utilisation |
| Waste elimination | Automated shutdown of dev and test EKS clusters at 19:00 weekdays and all weekend (Lambda-based scheduler); Spot instances for non-production node groups |
| Budget governance | AWS Budget alerts at 80% and 100% of monthly forecast; approval required from the Platform Lead for any change > GBP 500/month |

| Question | Response |
| --- | --- |
| Has the hosting location been chosen to reduce environmental impact? | Partially — eu-west-2 (London) was chosen primarily for data sovereignty, but the AWS London region operates at a lower carbon intensity than some other European regions. AWS is committed to 100% renewable energy by 2025 for all regions. |
| What is the expected workload demand pattern? | Variable — significant peaks during UK business hours (08:00-18:00) and at month-end; lower demand evenings and weekends |
| Question | Response |
| --- | --- |
| Must the application be available continuously? | Yes — regulatory obligation for 24x7 availability (Open Banking). However, traffic drops significantly outside UK business hours. |
| Can the solution be shut down or scaled down during off-peak hours? | Partially — auto-scaling reduces pod count during off-peak (minimum 2 replicas maintained for HA); EKS nodes scale down from 8 to 4 overnight |
| Are non-production environments configured to downscale or shut down when not in use? | Yes — dev and test clusters shut down at 19:00 weekdays and are fully off at weekends (saves approximately GBP 3,000/month); staging runs 24x7 only during release weeks |
| Question | Response |
| --- | --- |
| Are resources rightsized to avoid overprovisioning? | Yes — pod resource requests based on P95 utilisation data; Karpenter consolidates pods onto fewer nodes during low-demand periods |
| Is vCPU utilisation monitored? | Yes — target 40-60% average utilisation during business hours; alerts if sustained below 20% (rightsizing trigger) or above 80% (scaling trigger) |
| Are the highest performance-per-watt hardware options used? | Yes — Graviton3 (ARM-based) instances provide up to 60% better energy efficiency than comparable x86 instances (AWS published benchmarks) |
| Question | Response |
| --- | --- |
| How do the language and framework choices contribute to efficiency? | Java 21 with virtual threads (Project Loom) reduces memory overhead for concurrent request handling; GraalVM Native Image evaluated but deferred due to the reflection-heavy Spring Boot framework |
| Has the code been optimised for the target platform and workload? | Yes — connection pooling (HikariCP), efficient JSON serialisation (Jackson with the afterburner module), lazy database fetching to avoid unnecessary data transfer |
| Are efficient algorithms and data structures used? | Yes — database queries use indexed lookups; pagination enforced on all list endpoints to prevent unbounded result sets; the Redis cache reduces redundant core banking queries by 85% |
| Is the number of vCPU hours per job/request minimised? | Yes — average request processing time is 15ms CPU time; async offloading of audit/notification reduces per-request compute by approximately 40% compared to a synchronous design |
| Question | Response |
| --- | --- |
| Is data held close to compute to reduce network transfer? | Yes — Redis cache co-located in the same VPC/AZ as application pods; PostgreSQL in the same region; core banking read replicas in the same AWS region connected via Direct Connect |
| Are data replicas minimised? | Replicas are justified: RDS Multi-AZ (HA requirement), Redis replicas (HA), DR read replica (regulatory DR requirement). No unnecessary copies. |
| Is old or unused data removed to reduce storage? | Yes — S3 lifecycle policies transition audit logs to Glacier (1 year) then Deep Archive (3 years); transaction data purged after 2 years; Redis TTL evicts stale cache entries |
| Are efficient data formats and compression used? | Yes — gzip compression on API responses; PostgreSQL TOAST compression for large text fields; S3 objects compressed before archival |
| Are jobs prioritised and distributed to optimise resource usage? | Yes — nightly batch jobs (data replication from core banking) scheduled during off-peak hours (02:00-05:00 UTC) to use capacity freed by auto-scaling |
| Are efficient networking patterns used? | Yes — VPC endpoints for S3, SQS, EventBridge, and Secrets Manager to avoid NAT Gateway charges and Internet transit; Direct Connect for high-volume core banking traffic |

Is the application developed in-house?

  • Yes — all microservices are developed internally by the API Team.
| Attribute | Detail |
| --- | --- |
| Source control platform | GitHub Enterprise (MFS organisation) |
| CI/CD platform | GitHub Actions (corporate standard) |
| Build automation | GitHub Actions workflows triggered on push and PR; Maven builds for Java services, npm for Node.js; Docker multi-stage builds for container images |
| Deployment automation | Argo CD (GitOps) for Kubernetes deployments; Terraform for infrastructure changes; Helm charts for all services |
| Test automation | Unit tests (JUnit 5, Jest), integration tests (Testcontainers), contract tests (Pact), security scanning, and container image scanning — all in the CI pipeline |
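
As an illustration of the Testcontainers-based integration tests above, a minimal JUnit 5 sketch that spins up a disposable PostgreSQL 16 container per CI run; the test class itself is hypothetical.

```java
// Sketch only: integration test against a throwaway PostgreSQL 16 container,
// matching production's major version.
import org.junit.jupiter.api.Test;
import org.testcontainers.containers.PostgreSQLContainer;
import org.testcontainers.junit.jupiter.Container;
import org.testcontainers.junit.jupiter.Testcontainers;
import static org.junit.jupiter.api.Assertions.assertTrue;

@Testcontainers
class AccountRepositoryIT {

    @Container
    static final PostgreSQLContainer<?> postgres = new PostgreSQLContainer<>("postgres:16");

    @Test
    void connectsToThrowawayDatabase() {
        assertTrue(postgres.isRunning());
        // In a real test this URL would be wired into the repository under test.
        String jdbcUrl = postgres.getJdbcUrl();
        assertTrue(jdbcUrl.startsWith("jdbc:postgresql://"));
    }
}
```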
| Control | Implementation |
| --- | --- |
| Security requirements identification | Threat model (SEC-TM-2024-019) reviewed at sprint planning; security stories in the backlog; OWASP ASVS Level 2 as baseline |
| Static Application Security Testing (SAST) | SonarQube (integrated in GitHub Actions; quality gate blocks merge on critical/high findings) |
| Dynamic Application Security Testing (DAST) | Yes — OWASP ZAP (weekly automated scan against the staging environment) |
| Software Composition Analysis (SCA) | Snyk (integrated in GitHub Actions; blocks merge on high/critical CVEs; daily monitoring of deployed images) |
| Container image scanning | Snyk Container + Amazon Inspector (continuous scanning of ECR images; alerts on new CVEs) |
| Secure coding practices | OWASP Secure Coding Guidelines; mandatory security training (annual); peer code review required for all PRs; security champion in the API team |
| Patch management | Critical CVEs: 24-hour SLA for a mitigation plan, 7-day SLA for patch deployment. High: 30-day SLA. Medium/Low: next scheduled release. |
| Classification | Selected? | Description |
| --- | --- | --- |
| Replace | [x] | The legacy SOAP-based Partner Integration Layer (PIL) is being replaced entirely with the new cloud-native Customer API Platform |
| Attribute | Detail |
| --- | --- |
| Deployment strategy | Strangler Fig — partner traffic gradually migrated from the legacy PIL to the new CAP using API Gateway routing rules; both systems run in parallel during the transition |
| Data migration mode | Continuous Sync — core banking data replicated to CAP’s PostgreSQL via CDC; no bulk data migration required (CAP reads from core banking, not PIL) |
| Data migration method | CDC (Change Data Capture) from the core banking Oracle database to PostgreSQL: initial sync via Debezium + Kafka Connect, steady state via direct Oracle GoldenGate replication |
| Data volume to migrate | 0 GB (no data migrated from PIL; CAP builds its own data store from the core banking source) |
| End-user cutover approach | Phased — partners migrated individually over a 3-month window; each partner given 4-week notice and a 2-week parallel-run period |
| External system cutover | Phased — partners cut over individually; legacy PIL endpoints deprecated with a 6-month sunset notice |
| Maximum acceptable downtime | Zero — the parallel run ensures no downtime; partners switch DNS/config to the new endpoints at their convenience during the migration window |
| Rollback plan | API Gateway routing rules can redirect traffic back to the legacy PIL within 5 minutes; partner-specific rollback is possible without affecting other partners |
| Acceptance criteria | All 8 legacy partners migrated and confirmed; PIL traffic at zero for 30 consecutive days; PIL decommission approval from all stakeholders |
| Transient infrastructure needed? | Yes — Debezium + Kafka Connect cluster for the initial CDC setup (decommissioned once steady-state CDC is established via direct Oracle-to-PostgreSQL replication) |
| Test Type | Scope | Approach | Environment | Automated? |
| --- | --- | --- | --- | --- |
| Integration testing | All service-to-service interactions, database queries, external API calls | Testcontainers (PostgreSQL, Redis, LocalStack) in CI; full integration suite in staging | CI + Staging | Yes |
| Contract testing | API contracts between CAP and partner applications; internal service contracts | Pact (consumer-driven contract tests); OBIE conformance test suite | CI + Staging | Yes |
| Performance testing | Load, stress, soak, spike testing against a production-equivalent environment | k6 load tests in the CI/CD pipeline (smoke: every deploy; full: weekly) | Staging (production-mirror) | Yes |
| Security testing | SAST, DAST, SCA, penetration testing | SAST/SCA: every PR; DAST: weekly; annual penetration test by NCC Group | CI + Staging + Production | Partially (pen test is manual) |
| DR testing | Failover to eu-west-1, RDS promotion, DNS cutover | Quarterly automated failover drill; annual full DR exercise with the SRE team | Production + DR | Partially (scripted but manually triggered) |
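
A minimal consumer-driven contract sketch with Pact JVM (JUnit 5). The provider/consumer names, state, and interaction are hypothetical simplifications of what the real Pact suite would verify.

```java
// Sketch only: a consumer-driven contract test against a Pact mock server.
import au.com.dius.pact.consumer.MockServer;
import au.com.dius.pact.consumer.dsl.PactDslWithProvider;
import au.com.dius.pact.consumer.junit5.PactConsumerTestExt;
import au.com.dius.pact.consumer.junit5.PactTestFor;
import au.com.dius.pact.core.model.RequestResponsePact;
import au.com.dius.pact.core.model.annotations.Pact;
import org.junit.jupiter.api.Assertions;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;

@ExtendWith(PactConsumerTestExt.class)
@PactTestFor(providerName = "cap-account-service")   // hypothetical provider name
class PartnerContractTest {

    @Pact(consumer = "partner-app")                   // hypothetical consumer name
    RequestResponsePact accountExists(PactDslWithProvider builder) {
        return builder
                .given("account acc-123 exists")
                .uponReceiving("a request for account acc-123")
                .path("/accounts/acc-123")
                .method("GET")
                .willRespondWith()
                .status(200)
                .body("{\"accountId\":\"acc-123\"}")
                .toPact();
    }

    @Test
    @PactTestFor(pactMethod = "accountExists")
    void partnerCanReadAccount(MockServer server) throws Exception {
        var http = java.net.http.HttpClient.newHttpClient();
        var response = http.send(
                java.net.http.HttpRequest.newBuilder(
                        java.net.URI.create(server.getUrl() + "/accounts/acc-123")).build(),
                java.net.http.HttpResponse.BodyHandlers.ofString());
        Assertions.assertEquals(200, response.statusCode());
    }
}
```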
| Attribute | Detail |
| --- | --- |
| Release frequency | Weekly (every Tuesday); hotfixes as needed (emergency change process) |
| Release process | Feature branch -> PR (automated tests + 2 approvals) -> merge to main -> automated deploy to staging -> manual approval gate -> blue-green deploy to production via Argo CD |
| Release validation | Automated smoke tests post-deploy (5-minute suite); canary analysis (10% traffic for 15 minutes); automated rollback if error rate > 0.1% |
| Feature flags / toggles | LaunchDarkly for feature flags; used for partner-specific feature rollouts and kill switches for new functionality |
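
A minimal sketch of a partner-scoped kill switch evaluated with the LaunchDarkly Java SDK; the flag key and context shape are assumptions, not CAP's actual flag inventory.

```java
// Sketch only: per-partner feature flag / kill switch via LaunchDarkly.
import com.launchdarkly.sdk.LDContext;
import com.launchdarkly.sdk.server.LDClient;

public class FraudScoringToggle {
    private final LDClient launchDarkly;

    public FraudScoringToggle(LDClient launchDarkly) {
        this.launchDarkly = launchDarkly;
    }

    public boolean fraudScoringEnabledFor(String partnerId) {
        // Hypothetical context: flags targeted by partner rather than end user.
        LDContext partner = LDContext.builder(partnerId).kind("partner").build();
        // Defaults to false (feature off) if LaunchDarkly is unreachable.
        return launchDarkly.boolVariation("txn-fraud-scoring", partner, false);
    }
}
```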
| Attribute | Detail |
| --- | --- |
| Support model | L1: MFS Service Desk (basic triage); L2: SRE team (6 engineers, dedicated to CAP and 2 other platform services); L3: API development team; L4: Solution Architect / CTO |
| Support hours | 24x7 (SRE on-call rotation); development team: UK business hours (09:00-17:30) with on-call for P1 escalations |
| SLAs | External (partner-facing): 99.95% monthly availability, P95 response time < 200ms, incident notification within 15 minutes. Internal: P1 response < 15 min, P2 < 30 min, P3 < 4 hours |
| Escalation paths | L1 -> L2 (15 min for P1, 1 hour for P2) -> L3 (30 min for P1, 4 hours for P2) -> L4 (1 hour for P1). Security incidents: immediate CISO notification. |
| Question | Response |
| --- | --- |
| Non-prod auto-shutdown schedule and enforcement | Karpenter scale-to-zero on dev and test EKS clusters 19:00-07:00 weekdays + all weekend; non-prod RDS paused via Lambda cron; enforced by an AWS Config rule (alerts FinOps if a non-prod resource runs continuously > 24h without a documented exception) |
| Periodic right-sizing review cadence | Quarterly via AWS Compute Optimizer + Grafana utilisation dashboards. The last review (Q3 2025) downsized 18 over-provisioned pods, recovering ~GBP 2,400/month |
| Unused / orphaned resource reclamation | Weekly Lambda job tags resources idle > 14 days; FinOps reviews and confirms before deletion. Scope: snapshots, EBS volumes, ELB targets, unused security groups |
| Carbon footprint reported alongside cost | Yes — monthly FinOps review includes AWS Customer Carbon Footprint Tool output; reported to the ARB and Sustainability committee quarterly |
| Environment retirement actually deletes (vs stops) | Yes — the decommissioning runbook requires Terraform destroy + S3 bucket emptying + KMS key scheduled-deletion; the CMDB entry is marked Retired only after AWS Cost Explorer confirms zero spend for 30 days |
| Skill Area | Current Level | Action Required |
| --- | --- | --- |
| AWS (EKS, RDS, networking) | High | Ongoing: 2 engineers pursuing AWS Solutions Architect Professional certification |
| Infrastructure as Code (Terraform) | High | None — team fully proficient |
| CI/CD (GitHub Actions, Argo CD) | High | None — team developed the pipeline |
| Java / Spring Boot | High | Ongoing: Java 21 virtual threads training completed Q1 2025 |
| Kubernetes operations | High | Ongoing: CKA certification for 2 junior engineers |
| PostgreSQL DBA | Medium | Action: DBA team member allocated 50% to CAP; advanced PostgreSQL training planned for Q1 2026 |
| Security & compliance | Medium | Action: security champion training completed; annual OWASP training for all developers |
| Question | Response |
| --- | --- |
| Can the team fully operate and support this solution in production? | A: Fully capable |
| If B, C, or D: what additional resources are required? | N/A |
| Is a managed service being considered for ongoing operations? | No — the SRE team operates the platform; AWS managed services (RDS, EKS, ElastiCache) reduce operational burden |

Application start-up sequence:

  1. EKS cluster and node groups are always running (managed by Karpenter auto-scaling).
  2. RDS PostgreSQL instances are always running (Multi-AZ).
  3. ElastiCache Redis cluster is always running (cluster mode).
  4. Kubernetes deployments are managed by Argo CD; pods start in order: Auth Service first (dependency for other services), then Account Service and Transaction Service (parallel), then Notification Service.
  5. Kubernetes readiness probes ensure services are only added to the load balancer after successful health checks (database connectivity, Redis connectivity, configuration loaded); see the sketch after this list.
  6. API Gateway is always available (managed service); no start-up required.
  7. Full start-up from cold (e.g., after a DR failover scale-up) takes approximately 8 minutes.
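
A minimal readiness-check sketch for step 5, using a Spring Boot Actuator HealthIndicator. The wiring and the single-connection check are simplified assumptions; the real probe also verifies Redis and configuration.

```java
// Sketch only: readiness check backing /health/ready. A failed check keeps the pod
// out of (or removes it from) the load balancer.
import javax.sql.DataSource;
import org.springframework.boot.actuate.health.Health;
import org.springframework.boot.actuate.health.HealthIndicator;

public class DatabaseReadinessIndicator implements HealthIndicator {
    private final DataSource dataSource;

    public DatabaseReadinessIndicator(DataSource dataSource) {
        this.dataSource = dataSource;
    }

    @Override
    public Health health() {
        try (var connection = dataSource.getConnection()) {
            // isValid issues a lightweight validation query with a 2 s timeout.
            return connection.isValid(2)
                    ? Health.up().build()
                    : Health.down().withDetail("db", "connection invalid").build();
        } catch (Exception e) {
            return Health.down(e).build();
        }
    }
}
```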
| Concern | Approach |
| --- | --- |
| Keeping software versions current and supported | EKS: upgraded within 60 days of a new minor release; RDS PostgreSQL: minor versions applied in the monthly maintenance window; Java/Node.js: upgraded within 90 days of an LTS release; all dependencies tracked by Snyk |
| Hardware lifecycle management | N/A — all cloud-managed; Graviton instance generations reviewed annually for cost/performance improvements |
| Certificate management | Partner mTLS certificates: 1-year validity, automated renewal reminders at 60/30/7 days; internal TLS: AWS Certificate Manager (auto-renewal); KMS keys: annual automatic rotation |
| Dependency management | Snyk monitors all dependencies continuously; Dependabot PRs for automated updates; quarterly dependency review meeting |
| Attribute | Detail |
| --- | --- |
| Intended lifespan | 7-10 years; major architecture review planned at 5 years (2030) |
| End-of-life triggers | Replacement by a next-generation API platform; regulatory change removing the Open Banking obligation (unlikely); AWS service deprecation |
| Decommissioning blockers | 25+ partner integrations dependent on the platform; 7-year audit log retention obligation |
| Data disposal | Customer data: secure deletion from RDS (NIST 800-88 compliant); audit logs: retained in S3 Glacier until the 7-year obligation is met, then lifecycle-expired; encryption keys: scheduled for deletion after data disposal |
| Infrastructure disposal | Terraform destroy for all AWS resources; DNS records removed; IAM roles deleted; GitHub repositories archived (not deleted, for audit trail) |
| Attribute | Detail |
| --- | --- |
| Exit strategy | All microservices are containerised with standard Kubernetes manifests (Helm charts); PostgreSQL is standard (no AWS-specific extensions); data exportable via pg_dump; audit logs in S3 exportable via the standard S3 API |
| Data portability | PostgreSQL: pg_dump/pg_restore to any PostgreSQL host; S3 audit logs: standard object download; Redis: cache can be rebuilt from source data (no persistent data); EventBridge schemas documented in JSON Schema |
| Vendor lock-in assessment | Overall: Low-Moderate. Primary lock-in is AWS IAM/KMS (High) and EventBridge (Moderate). All other components use standard, portable technologies. Estimated exit effort: 3-4 months for a 6-person team. |
| Exit timeline estimate | 6 months (including 3 months infrastructure migration + 3 months partner migration and parallel run) |

| ID | Constraint | Category | Impact on Design | Last Assessed |
| --- | --- | --- | --- | --- |
| C-001 | Must comply with PCI-DSS v4.0 for transaction data handling | Regulatory | Network segmentation, encryption at rest and in transit, access controls, vulnerability management, audit logging — all mandated by PCI-DSS | 2025-11-01 |
| C-002 | All customer PII must reside within the UK (data sovereignty) | Regulatory | Primary region must be eu-west-2 (London); DR region (eu-west-1) restricted to non-PII data only; cross-region replication must filter PII | 2025-11-01 |
| C-003 | Must integrate with the existing core banking Oracle database via read replicas | Technical | Cannot replace the core banking data source; must maintain JDBC connectivity via Direct Connect; data model constrained by the Oracle schema | 2025-06-15 |
| C-004 | 99.95% monthly availability SLA committed to partners | Commercial | Multi-AZ deployment mandatory; active-passive DR required; auto-scaling and fault tolerance must support the SLA; monthly SLA reporting to partners | 2025-11-01 |
| ID | Assumption | Impact if False | Certainty | Status | Owner | Evidence |
| --- | --- | --- | --- | --- | --- | --- |
| A-001 | Core banking Oracle read replicas will support 15,000 queries/s at peak | Platform cannot meet performance targets; would require a caching redesign or additional read replicas | High | Closed | Jane Doe | Load test results (TEST-2025-031) confirmed 18,000 queries/s sustained |
| A-002 | Featurespace ARIC API will maintain <100ms P95 latency under our projected load | Fraud checking would increase API response time beyond the P95 target; the circuit breaker would bypass fraud checks more frequently | Medium | Open | Fred Bloggs | Featurespace SLA contractually commits to 100ms P95 at 10,000 req/s; no independent verification at our projected 3-year volume |
| A-003 | Partner adoption will grow linearly to 80 partners over 3 years | Non-linear growth could exceed capacity plans; under-adoption would mean over-provisioned infrastructure (cost waste) | Medium | Open | Sally Doe | Business development pipeline shows 50 partners in negotiation; growth rate tracking to plan |
| ID | Risk Event | Category | Severity | Likelihood | Owner |
| --- | --- | --- | --- | --- | --- |
| R-001 | Core banking Oracle DB upgrade causes schema changes that break data replication | Technical | High | Medium | Jane Doe |
| R-002 | Partner onboarding volume exceeds forecast, overwhelming support capacity | Operational | Medium | Medium | Sally Doe |
| R-003 | Critical vulnerability discovered in a base container image requiring emergency patching across all services | Security | High | High | Joe Bloggs |
| R-004 | AWS eu-west-2 region experiences a prolonged outage exceeding the DR activation threshold | Operational | Critical | Low | Tom Bloggs |
| ID | Mitigation Strategy | Mitigation Plan | Residual Risk | Last Assessed |
| --- | --- | --- | --- | --- |
| R-001 | Mitigate | Contract testing against the core banking schema (Pact); advance notification agreement with the DBA team (60-day notice for schema changes); schema compatibility layer in the Core Banking Adapter | Medium | 2025-11-01 |
| R-002 | Mitigate | Self-service partner onboarding portal (Phase 2, delivered); automated API key provisioning; partner onboarding runbook; escalation to additional support resource if queue > 5 partners | Low | 2025-11-01 |
| R-003 | Mitigate | Snyk continuous monitoring with P1 alert on critical CVEs; pre-built patched base images maintained in ECR; emergency deployment pipeline (bypasses staging for security patches); rollback capability | Medium | 2025-11-01 |
| R-004 | Accept (with mitigation) | Active-passive DR in eu-west-1; quarterly DR drills; RTO 1 hour validated through testing; accept 15-minute RPO for async replication lag | Low | 2025-11-01 |
| ID | Dependency | Direction | Status | Owner | Evidence | Last Assessed |
| --- | --- | --- | --- | --- | --- | --- |
| D-001 | Core banking Oracle DB read replicas provisioned in eu-west-2 via Direct Connect | Inbound | Resolved | DBA team | Direct Connect live; read replicas operational since 2025-01-15 | 2025-11-01 |
| D-002 | Featurespace ARIC API available and contracted for CAP usage | Inbound | Committed | Procurement | Contract MFS-VENDOR-2024-089 signed; API access provisioned | 2025-09-01 |
| D-003 | Partner Onboarding Portal (APP-0456) consuming Auth Service APIs for partner registration | Outbound | Resolved | Partner Portal team | Integration live since 2025-06-01 | 2025-11-01 |
| ID | Issue | Category | Impact | Owner | Resolution Plan | Status | Last Assessed |
| --- | --- | --- | --- | --- | --- | --- | --- |
| I-001 | Redis cluster failover caused a 45-second cache miss spike during the October maintenance window | Operational | Low | Tom Bloggs | Updated maintenance procedure to pre-warm the cache before failover; implemented dual-write to the new primary during planned failover | Resolved | 2025-11-01 |
| I-002 | Three partners have not completed mTLS certificate renewal (certificates expiring in 60 days) | Operational | Medium | Sally Doe | Automated renewal reminders sent at 90/60/30/7 days; partner manager directly contacting non-responsive partners; contingency: temporary API key fallback (with CISO approval) | In Progress | 2025-11-15 |
| Question | Response |
| --- | --- |
| Does this design create any exception to current policies and standards? | No |
| If yes, have exceptions been logged and accepted through the exceptions process? | N/A |

| Question | Response |
| --- | --- |
| Does this design create an issue against the process library? | No |
| If yes, has this been acknowledged by the process owner? | N/A |

| Question | Response |
| --- | --- |
| Does the design materially change the organisation’s technology risk profile? | No — the design reduces risk by replacing unsupported legacy middleware with a modern, actively maintained platform. The introduction of cloud-hosted customer data is covered by the existing AWS risk assessment (MFS-TRA-2023-012). |
| If yes, has this been evaluated with Risk and Controls teams? | N/A |
| ADR # | Title | Status | Date | Impact |
| --- | --- | --- | --- | --- |
| ADR-001 | EKS over ECS for container orchestration | Accepted | 2024-10-01 | Determines the container platform and operational model for all microservices |
| ADR-002 | PostgreSQL over DynamoDB for primary data store | Accepted | 2024-10-05 | Determines the database technology, data model, and backup/recovery approach |
| ADR-003 | Event-driven architecture for notifications and audit | Accepted | 2024-10-08 | Determines the async processing pattern and notification architecture |
| Standard / Principle | Requirement | How the Design Satisfies It | Evidence Section |
| --- | --- | --- | --- |
| PCI-DSS v4.0 Req 1 | Install and maintain network security controls | VPC segmentation, security groups, NACLs, WAF, Shield | 3.3 Physical View, 3.5 Security View |
| PCI-DSS v4.0 Req 3 | Protect stored account data | AES-256 encryption at rest, field-level encryption for PII, KMS key management | 3.4 Data View, 3.5 Security View |
| PCI-DSS v4.0 Req 4 | Protect cardholder data with strong cryptography during transmission | TLS 1.3 enforced for all external connections; TLS 1.2 minimum for all internal | 3.2 Integration & Data Flow, 3.5 Security View |
| PCI-DSS v4.0 Req 7 | Restrict access to system components and cardholder data by business need to know | RBAC + ABAC via OAuth scopes, Kubernetes RBAC, IAM least privilege | 3.5 Security View |
| PCI-DSS v4.0 Req 10 | Log and monitor all access to system components and cardholder data | Comprehensive audit logging, Splunk SIEM integration, 7-year retention | 4.1 Operational Excellence, 3.5 Security View |
| OBIE Standard 3.1.11 | API conformance for Account Information Services | REST APIs conform to the OBIE specification; contract tests validate compliance | 3.2 Integration & Data Flow, 3.6 Scenarios |
| UK GDPR Art 5(1)(f) | Integrity and confidentiality of personal data | Field-level encryption, mTLS, access controls, audit trail, DPIA completed | 3.4 Data View, 3.5 Security View |
| UK GDPR Art 17 | Right to erasure | Consent revocation endpoint; data deletion job for expired consents; audit trail of deletions | 3.4 Data View, 3.6 Scenarios |
| FCA SYSC 13 | Operational resilience for important business services | Multi-AZ, DR strategy, impact tolerance testing, chaos testing, quarterly DR drills | 4.2 Reliability |
| MFS Cloud Security Standard 1.3 | Encryption, access management, monitoring for cloud workloads | KMS encryption, IAM least privilege, GuardDuty, CloudTrail, Splunk integration | 3.3 Physical View, 3.5 Security View |

| Term | Definition |
| --- | --- |
| ARIC | Adaptive, Real-time, Individual, Contextual — Featurespace’s fraud detection platform |
| CAP | Customer API Platform — the solution described in this SAD |
| CDC | Change Data Capture — a pattern for capturing and replicating data changes |
| CMA | Competition and Markets Authority — UK regulator that mandated Open Banking |
| EKS | Elastic Kubernetes Service — AWS managed Kubernetes |
| FAPI | Financial-grade API — an OAuth 2.0 security profile for financial services |
| HPA | Horizontal Pod Autoscaler — Kubernetes auto-scaling mechanism |
| IRSA | IAM Roles for Service Accounts — EKS feature for pod-level IAM |
| MFS | Meridian Financial Services — the fictional organisation in this example |
| mTLS | Mutual TLS — two-way TLS authentication where both client and server present certificates |
| OBIE | Open Banking Implementation Entity — the UK body governing Open Banking standards |
| PIL | Partner Integration Layer — the legacy SOAP-based system being replaced |
| PSD2 | Payment Services Directive 2 — EU directive mandating open banking |
| SCA | Strong Customer Authentication — PSD2 requirement for multi-factor authentication |
| TPP | Third-Party Provider — an authorised fintech that accesses bank APIs under Open Banking |
| Document | Version | Description | Location |
| --- | --- | --- | --- |
| OBIE Account and Transaction API Specification | 3.1.11 | Open Banking UK API specification for AIS | https://openbankinguk.github.io/read-write-api-site3/ |
| PCI-DSS | 4.0 | Payment Card Industry Data Security Standard | https://www.pcisecuritystandards.org/ |
| MFS Information Security Policy | 4.2 | Corporate information security policy | Confluence: /security/policies/isp-v4.2 |
| MFS Cloud Security Standard | 1.3 | Security controls for AWS workloads | Confluence: /security/standards/cloud-sec-v1.3 |
| MFS Data Classification Standard | 2.0 | Data classification scheme and handling requirements | Confluence: /data/standards/classification-v2.0 |
| AWS Well-Architected Framework | 2024 | AWS architecture best practices | https://aws.amazon.com/architecture/well-architected/ |
| NIST Cybersecurity Framework | 2.0 | Cybersecurity risk management framework | https://www.nist.gov/cyberframework |
| CAP Threat Model | SEC-TM-2024-019 | STRIDE-based threat model for the Customer API Platform | Confluence: /security/threat-models/cap-2024 |
| DPIA - Customer API Platform | DPIA-2024-047 | Data Protection Impact Assessment | Confluence: /compliance/dpia-047 |
| Standard / Pattern ID | Name | Version | Applicability |
| --- | --- | --- | --- |
| OBIE-AIS-3.1.11 | Open Banking Account Information API | 3.1.11 | 3.2 Integration & Data Flow |
| PCI-DSS-4.0 | Payment Card Industry Data Security Standard | 4.0 | 3.5 Security View, 6.8 Compliance Traceability |
| OWASP-ASVS-4.0 | Application Security Verification Standard | 4.0 | 5.1 Application Security in Development |
| NIST-800-88 | Guidelines for Media Sanitization | Rev 1 | 5.9 End-of-Life |
| C4-Model | C4 Model for Software Architecture | N/A | 3.1 Logical View (diagramming approach) |
| 12-Factor | The Twelve-Factor App | N/A | 3.1 Logical View (microservice design principles) |
| Role | Name | Date | Signature / Approval Reference |
| --- | --- | --- | --- |
| Lead Solution Architect | Fred Bloggs | 2025-11-20 | JIRA: CAP-ARB-2025-003 (approved) |
| Principal Security Architect | Joe Bloggs | 2025-11-18 | JIRA: CAP-SEC-2025-012 (approved) |
| Data Architect | Jane Doe | 2025-11-15 | JIRA: CAP-DATA-2025-007 (approved) |
| Head of Compliance | Alice Doe | 2025-11-19 | JIRA: CAP-COMP-2025-004 (approved) |
| CISO | Marcus Doe | 2025-11-19 | JIRA: CAP-SEC-2025-013 (approved) |
| CTO | Dr. Helen Zhao | 2025-11-20 | JIRA: CAP-ARB-2025-003 (approved) |
| ARB Chair | Dave Bloggs | 2025-11-20 | JIRA: CAP-ARB-2025-003 (approved) |

Assessment Summary

This SAD was assessed at Comprehensive depth. The scores below reflect a mature, well-documented architecture for a Tier 1 Critical, regulated financial services platform.

| Section | Score | Justification |
| --- | --- | --- |
| 0. Document Control | 5 | Full version history, multiple contributors and approvers, clear scope, related documents referenced |
| 1. Executive Summary | 5 | Clear business drivers with priority, strategic alignment with reuse assessment, current-state architecture documented, business criticality justified with revenue impact |
| 2. Stakeholders & Concerns | 5 | Comprehensive stakeholder register including external parties, concerns matrix fully mapped to sections, regulatory context with five applicable regulations |
| 3.1 Logical View | 5 | Full component decomposition with technology choices, design patterns documented with rationale, vendor lock-in assessed for all components, service-to-capability mapping complete |
| 3.2 Integration & Data Flow | 5 | All internal and external integrations documented with protocols and authentication, API contracts versioned, end user access patterns documented, SLAs defined per interface |
| 3.3 Physical View | 5 | Deployment diagram described, compute fully specified (Graviton instances, pod sizing), full networking documented including Direct Connect, environments listed with sizing, security agents deployed |
| 3.4 Data View | 5 | All data stores classified with retention and encryption, field-level encryption for PII, data sovereignty addressed with cross-region filtering, DPIA completed, data integrity controls evidenced |
| 3.5 Security View | 5 | STRIDE threat model with 7 threats and mitigations, comprehensive IAM (internal + external + privileged), mTLS and OAuth 2.0 FAPI, HSM-backed encryption, SIEM integration with correlation rules |
| 3.6 Scenarios | 5 | Three architecturally significant use cases crossing all views, three ADRs with alternatives and quality attribute tradeoffs |
| 4.1 Operational Excellence | 5 | Centralised logging with Splunk, Grafana dashboards, PagerDuty alerting with escalation, Jaeger distributed tracing, comprehensive runbooks, capacity planning process |
| 4.2 Reliability | 5 | Multi-AZ with active-passive DR, RTO 1hr / RPO 15min validated through quarterly testing, chaos testing with Gremlin, fault tolerance with circuit breakers, immutable backups |
| 4.3 Performance | 5 | P50/P95/P99 targets defined, 5,000 req/s throughput target, automated performance testing with k6, caching strategy documented, 3-year growth projections |
| 4.4 Cost | 5 | Detailed monthly cost breakdown by component, reserved instance analysis, CloudHealth monitoring, FinOps practices documented, tagging strategy, rightsizing reviews |
| 4.5 Sustainability | 4 | Graviton instances for energy efficiency, non-prod auto-shutdown, auto-scaling for demand matching. Score reduced from 5: no carbon metrics baselined, no formal sustainability KPIs. |
| 5. Lifecycle | 5 | Full CI/CD with security scanning, Strangler Fig migration plan, test strategy covering all types, weekly releases with blue-green and canary, team skills assessed, exit plan documented |
| 6. Governance | 5 | 4 constraints, 3 assumptions (with evidence), 4 risks with mitigation plans, 3 dependencies tracked to resolution, 2 issues tracked, compliance traceability table mapping 10 requirements |
| 7. Appendices | 5 | Domain-specific glossary, 9 reference documents, 6 standards/patterns referenced, full approval sign-off with JIRA references |
| Overall | 4.9 | Comprehensive depth achieved across all sections. Exemplary documentation for a Tier 1 Critical regulated platform. |