
Case Studies: IaC & Automation - Domain 3: Continuous Improvement

Case Study 1

A global financial services company manages over 400 AWS accounts using AWS Organizations, with workloads deployed across multiple regions for regulatory compliance. The infrastructure team currently uses AWS CloudFormation StackSets to deploy networking resources, security baselines, and compliance controls. However, deployment failures in individual accounts often go undetected for days, causing compliance audit failures. The company requires real-time detection of failed stack instances, automated remediation where possible, and detailed audit trails for all infrastructure changes. The solution must integrate with their existing ServiceNow ITSM platform for ticket creation and must not require custom Lambda functions for core monitoring functionality. The team has a limited operations staff and prefers managed services over custom-built solutions.

Which solution provides the MOST operationally efficient approach to monitoring and remediating CloudFormation StackSet deployment failures across all accounts?

  1. Enable AWS CloudTrail in the management account and configure an EventBridge rule to detect StackSet operation failures, sending notifications to an SNS topic that triggers ServiceNow API integration through an EventBridge API destination, while implementing AWS Config rules with automatic remediation actions to redeploy failed stacks
  2. Configure CloudFormation StackSets with automatic deployment failure detection enabled, use AWS Systems Manager OpsCenter to aggregate stack instance failures across all accounts, create OpsItems automatically with ServiceNow integration through AWS Service Management Connector, and configure EventBridge rules in each member account to trigger Systems Manager Automation documents for remediation
  3. Deploy AWS CloudWatch synthetic monitors in each account to poll CloudFormation stack status every 5 minutes, aggregate logs to a central CloudWatch Logs account using subscription filters, create CloudWatch alarms for failure patterns, and use CloudWatch Events to trigger ServiceNow webhooks and Systems Manager remediation workflows
  4. Implement AWS Control Tower with customizations for organizations (CfCT) to replace existing StackSets, use Service Catalog for infrastructure provisioning with built-in approval workflows, enable drift detection through AWS Config, and configure SNS notifications for all provisioning failures that integrate with ServiceNow

Answer & Explanation

Correct Answer: 2 - Configure CloudFormation StackSets with automatic deployment failure detection enabled, use AWS Systems Manager OpsCenter to aggregate stack instance failures across all accounts, create OpsItems automatically with ServiceNow integration through AWS Service Management Connector, and configure EventBridge rules in each member account to trigger Systems Manager Automation documents for remediation

Why this is correct: This solution leverages native CloudFormation StackSets failure detection capabilities combined with Systems Manager OpsCenter, which is specifically designed to aggregate operational issues across multiple accounts in an AWS Organization. The AWS Service Management Connector for ServiceNow provides pre-built, managed integration without custom Lambda functions. EventBridge rules in member accounts can natively trigger Systems Manager Automation documents for remediation. This approach uses entirely managed services, satisfies the "no custom Lambda" constraint, provides real-time detection through native CloudFormation events, and offers detailed audit trails through OpsCenter's built-in tracking.
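The detection half of this pattern can be sketched in a few lines. Below is an illustrative (not literal exam) Python sketch of the EventBridge event pattern a rule in each member account might use to catch failed stack instances; the `source` and `detail-type` values follow CloudFormation's documented "CloudFormation Stack Status Change" events, and the tiny matcher only approximates how EventBridge evaluates patterns.

```python
# Event pattern for an EventBridge rule matching CloudFormation stack
# failures (the statuses listed are a subset chosen for illustration).
STACKSET_FAILURE_PATTERN = {
    "source": ["aws.cloudformation"],
    "detail-type": ["CloudFormation Stack Status Change"],
    "detail": {
        "status-details": {
            "status": ["CREATE_FAILED", "UPDATE_FAILED", "ROLLBACK_FAILED"]
        }
    },
}


def matches_failure(event: dict) -> bool:
    """Minimal matcher mirroring how EventBridge would evaluate the pattern."""
    status = event.get("detail", {}).get("status-details", {}).get("status")
    return (
        event.get("source") in STACKSET_FAILURE_PATTERN["source"]
        and event.get("detail-type") in STACKSET_FAILURE_PATTERN["detail-type"]
        and status in STACKSET_FAILURE_PATTERN["detail"]["status-details"]["status"]
    )
```

A matching event would then be routed to an OpsCenter OpsItem target (or a Systems Manager Automation document), keeping the whole path in managed services.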

Why the other options are wrong:

  • Option 1: While EventBridge API destinations can integrate with ServiceNow, AWS Config rules with automatic remediation are not designed to redeploy failed CloudFormation StackSet instances: Config remediates resource-level drift, not stack deployment failures. This creates a gap in the remediation strategy and would require custom Lambda functions to properly handle StackSet redeployment, violating the constraint.
  • Option 3: Using CloudWatch synthetic monitors to poll CloudFormation status every 5 minutes is an inefficient, custom-built approach that introduces detection latency and does not meet the "real-time detection" requirement. This violates the preference for managed services over custom solutions and creates unnecessary operational overhead with multiple polling mechanisms across 400 accounts.
  • Option 4: Implementing AWS Control Tower CfCT would require replacing the existing CloudFormation StackSets infrastructure, a massive architectural change that introduces significant disruption and migration effort. The question asks for a monitoring and remediation solution, not a complete infrastructure provisioning re-architecture. This option also introduces Service Catalog, which is not necessary for the stated requirements and adds complexity.

Key Insight: The exam tests whether candidates understand that Systems Manager OpsCenter is the AWS-native aggregation point for operational issues across Organizations, and that the AWS Service Management Connector provides managed ITSM integration; many candidates default to building custom EventBridge/Lambda/SNS integration patterns when managed connectors already exist.

Case Study 2

A healthcare technology company operates a HIPAA-compliant platform on AWS with strict change control requirements. All infrastructure modifications must be peer-reviewed, approved by security teams, and automatically tested before deployment to production. The company uses Terraform for infrastructure as code with state stored in S3 buckets with versioning and encryption enabled. Development teams frequently bypass the review process by running "terraform apply" directly from their local machines, creating compliance violations. The security team needs to enforce that all Terraform changes go through a centralized CI/CD pipeline with mandatory approval gates, prevent direct CLI access to production state files, and maintain complete audit logs of who approved what changes. The solution must work with their existing GitLab Enterprise installation and must not require teams to change their local Terraform workflow for development environments.

Which combination of controls will enforce the required governance while maintaining developer productivity? (Select TWO)

  1. Implement S3 bucket policies on production Terraform state files that deny all access except from the GitLab CI/CD runner IAM role, remove production AWS credentials from developer workstations, and configure GitLab pipelines with manual approval stages that log approver identity to CloudTrail through assumed roles
  2. Migrate from S3 backend to Terraform Cloud with remote execution enabled, configure workspace-level permissions to restrict applies to CI/CD service accounts only, implement sentinel policies for compliance checks, and use Terraform Cloud's native approval workflow with audit logging
  3. Deploy HashiCorp Vault to manage AWS credentials dynamically, configure short-lived tokens for developers that only permit read access to state files, implement Vault policies that only grant write access to production accounts through the CI/CD pipeline identity, and enable Vault audit logging
  4. Configure AWS Systems Manager Session Manager with IAM policies that prevent developers from assuming roles with Terraform state write permissions, create a custom approval workflow in AWS Service Catalog that executes Terraform commands through Systems Manager Automation documents, and use CloudTrail to log all approval decisions
  5. Implement GitLab merge request approval rules with designated security team approvers, configure GitLab CI/CD pipelines to execute Terraform plan on merge requests and require approval before running apply, use GitLab environment-specific deployment permissions to restrict production applies to pipeline service accounts only, and enable GitLab audit events

Answer & Explanation

Correct Answer: 1 and 5

Why these are correct: Option 1 addresses the AWS-side controls by using S3 bucket policies to technically prevent direct state file access from developer credentials while allowing the CI/CD pipeline role to function. This creates an enforcement mechanism at the infrastructure level that cannot be bypassed. Option 5 implements the required peer-review and approval workflow within GitLab (their existing tooling), provides audit trails through GitLab's native audit events, and uses GitLab's environment-specific deployment permissions to restrict who can deploy to production; this satisfies the "centralized CI/CD pipeline with mandatory approval gates" requirement. Together, these create defense-in-depth: GitLab enforces the process workflow and approvals, while AWS IAM/S3 policies enforce technical controls preventing circumvention.
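The Option 1 control can be sketched as a bucket policy whose explicit Deny overrides any Allow a developer's credentials might carry. The sketch below is a minimal Python rendering of such a policy under assumed names (the bucket, the `prod/` state prefix, and the runner role ARN are placeholders, not values from the scenario); `aws:PrincipalArn` is a real IAM global condition key that matches the role ARN for assumed-role sessions.

```python
# Placeholder ARN for the GitLab CI runner's IAM role (assumption).
RUNNER_ROLE_ARN = "arn:aws:iam::111122223333:role/gitlab-ci-runner"


def state_lockdown_policy(bucket: str, runner_role_arn: str) -> dict:
    """Bucket policy denying production state access to every principal
    except the CI runner role; an explicit Deny beats any Allow."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyStateAccessExceptPipeline",
                "Effect": "Deny",
                "Principal": "*",
                "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
                "Resource": f"arn:aws:s3:::{bucket}/prod/*",
                "Condition": {
                    "StringNotEquals": {"aws:PrincipalArn": runner_role_arn}
                },
            }
        ],
    }
```

Because the Deny is evaluated against the effective principal, even a developer with broad IAM permissions cannot read or mutate the production state objects from a workstation.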

Why the other options are wrong:

  • Option 2: Migrating to Terraform Cloud would require significant workflow changes and a state file migration, and it introduces a third-party SaaS dependency for a HIPAA-compliant workload. While technically capable, this represents a major architectural change rather than a set of controls on existing infrastructure. The question asks for a solution that works with the existing GitLab installation, not one that replaces the orchestration mechanism.
  • Option 3: While HashiCorp Vault provides excellent credential management, implementing it solely for this use case introduces a new complex system to operate and maintain. The read-only token approach doesn't prevent a determined developer from using their own AWS credentials to access state files unless combined with S3 bucket policy controls (already covered in Option 1). This adds operational overhead without addressing the core approval workflow requirement: Vault manages credentials but doesn't enforce approval gates.
  • Option 4: This option fundamentally misunderstands the use case by trying to run Terraform through Systems Manager Automation documents and Service Catalog; these services are not designed for general-purpose IaC pipeline orchestration. This would require extensive custom automation, breaks standard Terraform workflows, and creates significant operational complexity. Service Catalog is for pre-approved product provisioning, not for managing git-based IaC approval workflows.

Key Insight: Many candidates over-engineer solutions by introducing new tools (Terraform Cloud, Vault) when the requirement is to enforce controls on existing systems. The correct answer combines technical enforcement (S3 bucket policies preventing access) with process enforcement (GitLab approval workflows); neither alone is sufficient, but together they prevent both accidental and intentional bypass of the approval process.

Case Study 3

An e-commerce company uses AWS CDK written in TypeScript to define their infrastructure across 50 microservices. Each development team maintains their own CDK application in separate repositories. The platform team has created reusable CDK constructs for standard patterns like VPC configuration, ALB setup, and ECS task definitions that all teams should use to maintain consistency. However, teams frequently copy-paste construct code instead of importing shared libraries, leading to configuration drift and security inconsistencies. Some teams are running outdated construct versions with known security vulnerabilities. The platform team needs a solution to distribute, version, and enforce usage of approved CDK constructs across all teams. Updates to shared constructs must be tested before teams adopt them, and teams must receive notifications when using deprecated construct versions.

What is the MOST effective approach to managing and distributing reusable CDK constructs across development teams?

  1. Publish shared CDK constructs as private NPM packages to AWS CodeArtifact with semantic versioning, implement npm package deprecation warnings for outdated versions, configure CodeArtifact event notifications to alert teams when new versions are available, and use CDK Aspects to scan applications for deprecated construct usage during synthesis
  2. Store shared constructs in a centralized GitHub repository with a monorepo structure, use git submodules for teams to reference shared code, implement GitHub Actions workflows to run automated testing on construct changes, create GitHub issues automatically when teams reference deprecated construct versions through scheduled scans
  3. Create AWS Service Catalog products for each shared construct pattern, distribute them across accounts using CloudFormation StackSets, implement AWS Config rules to detect when teams deploy resources that don't match approved patterns, and use EventBridge to notify teams of compliance violations
  4. Package shared constructs as AWS Lambda layers containing the CDK code, distribute layers across accounts using AWS Resource Access Manager, configure Lambda to execute construct synthesis on behalf of development teams, and use CloudWatch alarms to detect when teams deploy resources using outdated layer versions

Answer & Explanation

Correct Answer: 1 - Publish shared CDK constructs as private NPM packages to AWS CodeArtifact with semantic versioning, implement npm package deprecation warnings for outdated versions, configure CodeArtifact event notifications to alert teams when new versions are available, and use CDK Aspects to scan applications for deprecated construct usage during synthesis

Why this is correct: AWS CodeArtifact is specifically designed for managing software packages and artifacts, including NPM packages. Publishing CDK constructs as NPM packages aligns with standard TypeScript/Node.js development practices: teams can import them using normal package management (npm install), semantic versioning provides clear upgrade paths, and package deprecation is a standard NPM feature that CodeArtifact supports. CDK Aspects provide a native mechanism to scan CDK applications during synthesis and enforce policies or detect deprecated patterns. CodeArtifact integrates with EventBridge for notifications when packages are published. This solution follows established software distribution patterns and leverages CDK's built-in extensibility mechanisms.
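At its core, the enforcement an Aspect performs during synthesis is a version comparison against an approved floor. Here is a pure-Python sketch of that check (the package name and minimum version are hypothetical, and a real Aspect would run in the CDK app's own language, typically TypeScript here):

```python
# Hypothetical shared-construct package and its minimum approved version.
MINIMUM_APPROVED = {"@acme/vpc-construct": (2, 1, 0)}


def parse_semver(version: str) -> tuple:
    """Turn '1.9.3' into (1, 9, 3) so tuples compare numerically."""
    return tuple(int(part) for part in version.split("."))


def deprecated_warnings(dependencies: dict) -> list:
    """Return one warning per shared construct pinned below the approved floor."""
    warnings = []
    for package, version in dependencies.items():
        floor = MINIMUM_APPROVED.get(package)
        if floor and parse_semver(version) < floor:
            warnings.append(
                f"{package}@{version} is deprecated; upgrade to "
                f">={'.'.join(map(str, floor))}"
            )
    return warnings
```

Emitting these warnings as synthesis errors (rather than log lines) is what turns the scan into a hard gate in CI.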

Why the other options are wrong:

  • Option 2: Git submodules are notoriously difficult to manage and often lead to versioning confusion: teams must manually update submodule references and can easily fall behind. This approach doesn't provide the structured versioning and dependency management that NPM provides. Creating GitHub issues through scheduled scans is a reactive, batch-oriented approach rather than immediate feedback during development. Git submodules also have no concept of deprecation warnings that surface during the normal development workflow.
  • Option 3: Service Catalog products are for deploying pre-configured AWS resources, not for distributing code libraries or constructs that developers import into their applications. This fundamentally misunderstands the difference between distributing reusable code (CDK constructs) and distributing pre-approved infrastructure configurations. Config rules detecting non-compliant resources is a post-deployment detection mechanism that doesn't prevent teams from using outdated constructs; it only detects the problem after deployment, which is too late in the development cycle.
  • Option 4: Lambda layers are for sharing code that runs within Lambda functions, not for distributing development-time libraries used in CDK applications. CDK synthesis happens on developer workstations or CI/CD pipelines, not within Lambda functions. This option fundamentally misapplies Lambda layers to a use case they weren't designed for. Additionally, having Lambda "execute construct synthesis" doesn't align with how CDK works: synthesis is part of the development and deployment pipeline, not a runtime operation.

Key Insight: This question tests whether candidates understand that CDK constructs are NPM packages (in TypeScript CDK applications) and should be distributed using standard package management tooling. Many candidates overthink this by trying to use AWS infrastructure services (Service Catalog, Lambda layers) when the problem is fundamentally about software library distribution, not infrastructure provisioning.

Case Study 4

A media streaming company manages infrastructure using Terraform with separate workspaces for development, staging, and production environments. Their production environment serves 2 million active users and generates $500K in revenue per hour during peak times. Recently, a developer accidentally ran "terraform apply" with the production workspace selected, intending to deploy changes to development. The change deleted critical RDS database instances, causing a 3-hour outage while restoring from backups. The engineering director mandates that production infrastructure changes must now require multi-person verification, with a mandatory 24-hour waiting period between planning and applying changes. Additionally, all production applies must execute only during scheduled maintenance windows (Sundays 2-6 AM UTC). The solution must prevent any individual-including administrators-from bypassing these controls.

What solution architecture enforces these requirements with the LEAST operational overhead while preventing policy circumvention?

  1. Implement AWS Service Control Policies (SCPs) that prevent CloudFormation, EC2, RDS, and other service write operations outside maintenance windows, migrate from Terraform to CloudFormation for production deployments, use AWS CloudFormation Change Sets with SNS notification for manual review, require two separate IAM users to approve changes through CloudFormation console before execution
  2. Configure separate AWS accounts for each environment using AWS Organizations, implement SCPs that deny all production account write operations outside maintenance windows with conditions based on aws:CurrentTime, remove direct AWS credentials for production from all users, create a CI/CD pipeline in GitLab that requires merge approval from two designated reviewers, implement a 24-hour delay between merge and deployment using GitLab scheduled pipelines
  3. Implement Terraform Sentinel policies that enforce time-based restrictions and require dual approval before applies, configure Terraform Enterprise with role-based access control requiring two approvers for production workspace operations, enable Terraform audit logging to S3, and use Lambda functions to enforce the 24-hour delay by locking the workspace after plan and unlocking it after the waiting period
  4. Create an AWS Lambda function that acts as a Terraform wrapper, storing planned changes in DynamoDB with timestamps and approval flags, require two different IAM principals to invoke approval APIs updating the DynamoDB records, schedule a separate Lambda function using EventBridge scheduled rules during maintenance windows to execute approved Terraform applies, and use KMS encryption for Terraform state with key policies requiring multi-party approval for decrypt operations

Answer & Explanation

Correct Answer: 2 - Configure separate AWS accounts for each environment using AWS Organizations, implement SCPs that deny all production account write operations outside maintenance windows with conditions based on aws:CurrentTime, remove direct AWS credentials for production from all users, create a CI/CD pipeline in GitLab that requires merge approval from two designated reviewers, implement a 24-hour delay between merge and deployment using GitLab scheduled pipelines

Why this is correct: Service Control Policies with time-based conditions using aws:CurrentTime provide enforcement at the AWS API level that cannot be bypassed; even account administrators cannot override SCPs. This technically enforces the maintenance window requirement. Removing direct credentials and routing all production changes through a CI/CD pipeline eliminates the ability for anyone to run Terraform locally against production. GitLab's merge approval workflows provide the dual-approval mechanism with built-in audit trails, and scheduled pipelines can enforce the 24-hour delay. This solution leverages AWS's strongest policy enforcement mechanism (SCPs) combined with pipeline-based controls, ensuring no individual can circumvent the process regardless of permissions.
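The window logic itself is simple to state precisely. One practical caveat worth knowing: `aws:CurrentTime` compares absolute timestamps (via `DateGreaterThan`/`DateLessThan`) and IAM has no day-of-week condition key, so a recurring weekly window is typically implemented by regenerating the SCP's date bounds on a schedule. The sketch below captures the check being enforced, as a hedged illustration rather than SCP syntax:

```python
from datetime import datetime, timezone


def in_maintenance_window(now: datetime) -> bool:
    """True only during the scenario's window: Sundays 02:00-06:00 UTC.
    Expects a timezone-aware datetime; weekday() == 6 is Sunday."""
    utc = now.astimezone(timezone.utc)
    return utc.weekday() == 6 and 2 <= utc.hour < 6
```

A scheduled job could evaluate this function each week and push an SCP whose Deny statement covers everything outside the next computed window.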

Why the other options are wrong:

  • Option 1: Mandating migration from Terraform to CloudFormation is a massive undertaking that introduces significant project risk and operational disruption; the question asks for a solution to enforce controls, not re-architect the entire IaC approach. Additionally, CloudFormation's manual approval process through the console doesn't provide the programmatic enforcement needed to prevent circumvention: a user with sufficient IAM permissions could still execute changes directly via CLI/API, bypassing the console workflow.
  • Option 3: Terraform Sentinel policies are part of Terraform Enterprise/Cloud, and while they can enforce policies, they require the specific commercial product. The Lambda-based workspace locking mechanism is a custom solution that creates operational complexity and potential failure modes (what if Lambda fails? what if the lock isn't properly released?). Time-based restrictions in Sentinel don't provide AWS API-level enforcement: a user could potentially bypass Terraform entirely and use the AWS CLI to make changes, which Sentinel cannot prevent.
  • Option 4: This option describes building an entirely custom orchestration system around Terraform using Lambda and DynamoDB, which creates significant operational overhead and complexity, the opposite of "least operational overhead." This custom-built solution would require ongoing maintenance, error handling, monitoring, and introduces multiple points of failure. The KMS key policy requiring multi-party approval for decryption would prevent normal Terraform operations from reading state files, breaking standard Terraform functionality.

Key Insight: The critical insight is that SCPs provide AWS API-level enforcement that cannot be bypassed by any principal in the account, making them the strongest control mechanism. Many candidates focus on Terraform-level controls (Sentinel, custom wrappers) without recognizing that these can be circumvented by using AWS APIs directly; true enforcement must occur at the AWS control plane level, not at the tooling level.

Case Study 5

A logistics company operates a hybrid cloud environment with significant infrastructure remaining in their on-premises data centers due to existing hardware investments and latency requirements for warehouse automation systems. They are adopting infrastructure as code practices and need to manage both AWS resources and on-premises VMware infrastructure using a unified approach. The on-premises environment includes 200 VMs across 8 vSphere clusters, while AWS resources include VPCs, EC2 instances, RDS databases, and S3 buckets across 3 regions. The infrastructure team is experienced with Terraform and prefers to continue using it rather than learning new tools. They require strong consistency guarantees: if an automation run partially fails, all changes must roll back automatically to prevent split-brain configurations. The solution must provide a single audit trail for all infrastructure changes regardless of target environment.

Which approach provides unified infrastructure management across hybrid environments with automatic rollback capabilities?

  1. Use Terraform with both the AWS provider and the vSphere provider in a single root module, enable Terraform's automated rollback feature for failed applies, configure remote state in S3 with DynamoDB state locking, implement a CI/CD pipeline that runs terraform apply with -auto-rollback flag, and aggregate Terraform logs from both providers into CloudWatch Logs for centralized audit trails
  2. Implement AWS CloudFormation StackSets for AWS resources and use CloudFormation custom resources with Lambda functions that communicate with vCenter APIs to manage VMware resources, enable CloudFormation rollback triggers for both resource types, and consolidate CloudTrail logs with vCenter event logs in Amazon OpenSearch Service for unified auditing
  3. Deploy Terraform using separate configurations for AWS and VMware environments with state files in different backends, use AWS Systems Manager Automation documents to orchestrate Terraform runs across both environments within a single automation execution, implement compensating transactions in Systems Manager runbooks to manually reverse changes when failures occur, and aggregate logs in CloudWatch
  4. Use Terraform with separate state files for AWS and vSphere providers but orchestrated through a single CI/CD pipeline, implement pre-commit hooks to validate cross-environment dependencies, create custom scripting that wraps Terraform with transaction-like behavior by maintaining snapshots of both state files and reverting both if either apply fails, and send all logs to a centralized SIEM

Answer & Explanation

Correct Answer: 4 - Use Terraform with separate state files for AWS and vSphere providers but orchestrated through a single CI/CD pipeline, implement pre-commit hooks to validate cross-environment dependencies, create custom scripting that wraps Terraform with transaction-like behavior by maintaining snapshots of both state files and reverting both if either apply fails, and send all logs to a centralized SIEM

Why this is correct: This is actually the best available option despite requiring custom scripting, because the scenario's requirements cannot be met by native Terraform or CloudFormation capabilities. Terraform does not have native cross-provider automatic rollback (Option 1 mentions a non-existent "-auto-rollback flag"). The critical requirement is "if an automation run partially fails, all changes must roll back automatically"; this requires transactional semantics across multiple independent systems (AWS and VMware). Separate state files for each provider with custom orchestration logic provide the only realistic path to implementing rollback behavior: maintaining state snapshots and coordinating rollback across both environments when any failure occurs. A single CI/CD pipeline provides unified orchestration and a single audit trail.
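The "transaction-like" wrapper can be sketched abstractly. In the sketch below, each environment exposes an `apply` and a `restore` callable and state snapshots are in-memory copies; these are simplifying assumptions standing in for wrapping `terraform apply` per backend and snapshotting state via S3 object versions or `terraform state pull`.

```python
import copy


def apply_all_or_rollback(environments: list) -> bool:
    """Apply each environment in order; on any failure, restore every
    environment touched so far (including the one that failed) from its
    pre-run snapshot, approximating all-or-nothing semantics."""
    snapshots = {env["name"]: copy.deepcopy(env["state"]) for env in environments}
    attempted = []
    for env in environments:
        attempted.append(env)
        try:
            env["apply"](env)  # stand-in for `terraform apply` on one backend
        except Exception:
            for touched in attempted:  # compensating rollback
                touched["restore"](touched, snapshots[touched["name"]])
            return False
    return True
```

Note the limits this sketch shares with any real implementation: reverting state files does not by itself revert live resources, so the `restore` step must also drive a destroy/re-apply of the delta, which is exactly the custom logic the correct answer acknowledges.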

Why the other options are wrong:

  • Option 1: This option contains a critical technical inaccuracy: Terraform does not have an "automated rollback feature" or "-auto-rollback flag" for failed applies. When a Terraform apply fails partway through, some resources may be created while others fail; Terraform does not automatically destroy successfully created resources to roll back to the previous state. While you can use a single root module with multiple providers, this doesn't solve the rollback requirement. Terraform applies are not transactional across resources.
  • Option 2: CloudFormation does not have a provider for VMware/vSphere infrastructure. While Lambda functions with custom resources could theoretically make vCenter API calls, this would require building extensive custom logic to translate CloudFormation resource models to vCenter operations-essentially building a complete custom provider. CloudFormation rollback triggers work for CloudFormation resources but wouldn't automatically rollback VMware changes made through custom resources, failing to meet the strong consistency requirement.
  • Option 3: Systems Manager Automation documents are designed for operational tasks and simple orchestration, not for managing complex IaC deployments with rollback semantics. "Compensating transactions" and "manually reverse changes" contradict the requirement for automatic rollback. This approach also separates AWS and VMware configurations into different state files without a cohesive rollback mechanism: if the AWS apply succeeds but the VMware apply fails, the Systems Manager runbook would need extensive custom logic to identify and reverse AWS changes, which is complex and error-prone.

Key Insight: This question tests whether candidates understand that Terraform does not provide native transactional rollback across resources or providers; this is a common misconception. When infrastructure spans multiple independent systems (AWS and VMware), achieving atomic commit/rollback behavior requires custom orchestration logic, as no single tool provides this natively. The question also tests recognition that some requirements necessitate custom solutions even when the prompt asks for "least overhead."

Case Study 6

A government agency is migrating legacy applications to AWS and must comply with strict security baselines that require all infrastructure to meet NIST 800-53 controls before deployment. The security team has defined 47 specific configuration requirements covering resource tagging, encryption settings, network configurations, and logging. Currently, infrastructure is deployed using AWS CloudFormation, but non-compliant resources are frequently deployed, discovered weeks later during manual audits, and then remediated retroactively. The agency requires a preventive control mechanism that blocks non-compliant infrastructure deployment before resources are created, with exemptions possible only through a formal waiver process approved by the security office. Failed deployment attempts must generate security incident tickets. The solution must work with existing CloudFormation templates without requiring major template rewrites.

What is the MOST effective solution to enforce security baselines preventively during infrastructure deployment?

  1. Implement AWS CloudFormation Guard to define policy-as-code rules matching NIST 800-53 requirements, integrate CloudFormation Guard validation into CI/CD pipelines as a pre-deployment gate that blocks pipeline progression if validation fails, store approved waiver exceptions in a DynamoDB table that CloudFormation Guard queries during validation, and configure pipeline failures to create ServiceNow security incidents through API integration
  2. Enable AWS Config rules for all 47 security requirements with auto-remediation actions, configure Config rules to evaluate resources immediately upon creation, implement AWS Config conformance packs based on NIST 800-53 operational best practices, use EventBridge to detect non-compliant resource creation and trigger Lambda functions that delete non-compliant resources and create security incident tickets
  3. Deploy AWS Service Catalog with pre-approved CloudFormation templates that meet security requirements, disable direct CloudFormation access for development teams through IAM policies, require all infrastructure provisioning through Service Catalog portfolios, implement Service Catalog TagOptions for required tagging, and configure Service Catalog to integrate with ITSM for waiver workflows
  4. Implement CloudFormation hooks that execute custom Lambda functions before resource provisioning, develop Lambda functions that validate each resource against security baselines, store waiver exceptions in AWS Systems Manager Parameter Store with expiration dates, configure Lambda functions to allow provisioning only if resources are compliant or have active waivers, and publish failures to SNS topics connected to security ticketing systems

Answer & Explanation

Correct Answer: 1 - Implement AWS CloudFormation Guard to define policy-as-code rules matching NIST 800-53 requirements, integrate CloudFormation Guard validation into CI/CD pipelines as a pre-deployment gate that blocks pipeline progression if validation fails, store approved waiver exceptions in a DynamoDB table that CloudFormation Guard queries during validation, and configure pipeline failures to create ServiceNow security incidents through API integration

Why this is correct: AWS CloudFormation Guard is purpose-built for policy-as-code validation of CloudFormation templates. It evaluates templates before deployment (a preventive control) and can enforce rules matching compliance frameworks such as NIST 800-53. Integrating Guard into CI/CD pipelines provides a hard gate: non-compliant infrastructure is never deployed, because resources are never created if the template fails validation. Guard can also be extended with custom logic to consult an exception database for approved waivers. This approach works with existing CloudFormation templates without modification (Guard validates templates as written; the templates themselves do not need to change), provides pre-deployment enforcement rather than post-deployment detection, and integrates naturally into existing CI/CD workflows.
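The waiver-exception flow described above can be sketched in a few lines. This is a hypothetical illustration, not Guard's own API: the `WAIVERS` dict stands in for the DynamoDB table the answer mentions, and the rule IDs are invented; in a real pipeline the lookup would be a boto3 query keyed on the failing rule's ID.

```python
from datetime import date

# Hypothetical stand-in for the DynamoDB waiver table; in practice this
# would be a boto3 GetItem/Query against a table keyed on rule ID.
WAIVERS = {
    "s3-bucket-encryption": {"expires": date(2099, 1, 1), "approved_by": "security-team"},
    "iam-wildcard-actions": {"expires": date(2020, 1, 1), "approved_by": "security-team"},
}

def is_waived(rule_id: str, today: date) -> bool:
    """Return True if a rule failure has an unexpired, approved waiver."""
    waiver = WAIVERS.get(rule_id)
    return waiver is not None and waiver["expires"] >= today

def gate(failed_rules: list[str], today: date) -> list[str]:
    """Return the rule failures that should block the pipeline (no active waiver)."""
    return [r for r in failed_rules if not is_waived(r, today)]

blocking = gate(["s3-bucket-encryption", "iam-wildcard-actions"], date(2024, 6, 1))
print(blocking)  # ['iam-wildcard-actions'] -- the expired waiver no longer exempts its rule
```

The key property is that expiry is enforced at validation time, so a waiver cannot silently outlive its approval.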

Why the other options are wrong:

  • Option 2: AWS Config rules are detective controls, not preventive controls: they evaluate resources after creation and detect non-compliance, but they don't prevent non-compliant resources from being created in the first place. Even with auto-remediation, there's a window during which non-compliant resources exist. The approach of using Lambda to delete non-compliant resources retroactively fails to meet the "blocks non-compliant infrastructure deployment before resources are created" requirement; resources are created first and then deleted, which may cause service disruptions and doesn't prevent the compliance violation from occurring.
  • Option 3: While Service Catalog provides a curated portfolio approach, restricting teams to only Service Catalog products represents a significant operational change and workflow disruption. This requires migrating all CloudFormation templates to Service Catalog products and retraining teams on new provisioning workflows. The question states "must work with existing CloudFormation templates without requiring major template rewrites"; forcing a Service Catalog-only workflow is a major operational change even if the templates themselves don't change. Additionally, this doesn't provide validation logic for the 47 specific requirements; it relies on manual curation of products.
  • Option 4: CloudFormation hooks execute during stack operations but run per-resource and are complex to implement for comprehensive policy validation across 47 different requirements. Developing custom Lambda functions for all security validations creates significant development and maintenance overhead. While this technically provides preventive control, CloudFormation Guard is the AWS-native, purpose-built tool for exactly this use case; building custom Lambda-based validation reinvents functionality that already exists in managed form.

Key Insight: The key distinction is between preventive controls (blocking non-compliant deployment before resources are created) versus detective controls (identifying non-compliance after deployment). Many candidates default to AWS Config because it's well-known for compliance checking, but fail to recognize that Config is fundamentally a detective control. CloudFormation Guard is less widely known but is the correct tool for policy-as-code validation before deployment.

Case Study 7

A SaaS company provides a multi-tenant analytics platform with separate AWS accounts for each major customer to ensure billing and security isolation. They currently have 200 customer accounts and add approximately 10 new customer accounts monthly. Each customer account requires identical baseline infrastructure: VPC with standard CIDR ranges, security groups, IAM roles for service access, CloudTrail logging, GuardDuty enablement, and S3 buckets with specific encryption and lifecycle policies. The infrastructure team uses AWS CloudFormation StackSets to deploy this baseline configuration. Recently, StackSet operations have become unreliable: deployments to 200 accounts take 6-8 hours, frequently encounter rate-limit errors, and partial failures leave accounts in inconsistent states requiring manual remediation. The team needs to significantly reduce deployment time, handle AWS API rate limits gracefully, and improve visibility into which accounts are in which state during deployments.

What is the PRIMARY cause of the performance and reliability issues with CloudFormation StackSets in this scenario?

  1. CloudFormation StackSets uses sequential deployment by default, deploying to one account at a time unless maximum concurrent accounts is increased, and the team likely has not configured the MaxConcurrentAccounts parameter to enable parallel deployments across multiple accounts simultaneously, resulting in unnecessarily long deployment times
  2. The team is likely using a single-region deployment strategy when they should be using CloudFormation StackSets with multiple region targets to distribute API calls across regional endpoints and avoid throttling on a single region's service quotas for CloudFormation operations
  3. CloudFormation StackSets is not designed for deployments exceeding 100 accounts and lacks the architecture needed to handle API rate limiting at scale; the team should migrate to AWS Control Tower which uses a different underlying deployment mechanism specifically architected for multi-account management at this scale
  4. The team is likely deploying all resources in a single StackSet when they should split the baseline configuration into multiple StackSets organized by resource type, with separate StackSets for networking, security, and logging resources to parallelize deployments and reduce the blast radius of failures

Answer & Explanation

Correct Answer: 1 - CloudFormation StackSets uses sequential deployment by default, deploying to one account at a time unless maximum concurrent accounts is increased, and the team likely has not configured the MaxConcurrentAccounts parameter to enable parallel deployments across multiple accounts simultaneously, resulting in unnecessarily long deployment times

Why this is correct: CloudFormation StackSets has configurable concurrency settings, controlled by the maximum-concurrent-accounts setting (the MaxConcurrentCount field of the StackSet operation preferences, labeled "Maximum concurrent accounts" in the console), which defaults to 1 if not specified. This means StackSet operations deploy to accounts one at a time by default. With 200 accounts, if each account takes 2-3 minutes to deploy, sequential deployment results in the 6-8 hour deployment times described. Increasing the concurrency to a higher value (e.g., 20-50) would enable parallel deployment across multiple accounts simultaneously, dramatically reducing total deployment time. The rate-limit errors suggest the team tried to work around slow deployments without understanding the concurrency settings, for example by manually triggering multiple StackSet operations, or set the concurrency too high without a corresponding adjustment to the failure-tolerance settings.
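The arithmetic behind the symptom is easy to check. The sketch below estimates wave-based deployment time, and a comment shows where the concurrency knob lives in the boto3 `OperationPreferences` structure (the field names come from the public API; the call itself is not executed here, and the stack set name and values are illustrative).

```python
import math

def stackset_deploy_minutes(accounts: int, minutes_per_account: float,
                            max_concurrent: int) -> float:
    """Rough estimate: accounts deploy in waves of size max_concurrent."""
    return math.ceil(accounts / max_concurrent) * minutes_per_account

# Default concurrency (1) vs. a tuned value, for 200 accounts at ~2.5 min each:
print(stackset_deploy_minutes(200, 2.5, 1))   # 500.0 minutes, i.e. ~8.3 hours
print(stackset_deploy_minutes(200, 2.5, 25))  # 20.0 minutes

# The real knob lives in the StackSet operation preferences (boto3 sketch,
# not executed; "customer-baseline" is a hypothetical name):
# cfn.update_stack_set(
#     StackSetName="customer-baseline",
#     OperationPreferences={
#         "MaxConcurrentCount": 25,        # console label: "Maximum concurrent accounts"
#         "FailureToleranceCount": 5,
#         "ConcurrencyMode": "SOFT_FAILURE_TOLERANCE",  # decouples the two settings
#     },
# )
```

Note that in the default strict failure-tolerance mode, MaxConcurrentCount may exceed FailureToleranceCount by at most one, which is why the two settings must be tuned together.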

Why the other options are wrong:

  • Option 2: Multi-region deployment in CloudFormation StackSets means deploying the same resources to multiple regions within each account; it doesn't distribute the control-plane API calls for StackSet orchestration itself across regions. StackSet orchestration operations always go through the management account's region where the StackSet was created. Adding multiple regions would actually increase deployment time and API call volume, worsening the problem rather than solving it. This option reflects a misunderstanding of how StackSets regional deployment works.
  • Option 3: CloudFormation StackSets absolutely supports deployments to hundreds of accounts; AWS Organizations commonly manages thousands of accounts, and StackSets is a core tool at that scale. The 100-account limit mentioned does not exist. While AWS Control Tower provides additional features for account management, its underlying deployment mechanism for baseline configurations still uses CloudFormation StackSets. Migrating to Control Tower wouldn't solve a configuration issue with StackSet concurrency settings.
  • Option 4: While splitting resources into multiple StackSets can be a valid architectural pattern for managing dependencies, it doesn't address the root cause of 6-8 hour deployment times. Multiple StackSets would still suffer from the same sequential deployment issue if MaxConcurrentAccounts isn't configured properly. In fact, multiple StackSets add complexity: now you must orchestrate the order of StackSet deployments and manage dependencies between them (e.g., the VPC must exist before the security groups). This doesn't solve the concurrency problem and adds operational complexity.

Key Insight: This question tests whether candidates understand CloudFormation StackSets operational parameters and common misconfiguration issues. Many engineers deploy StackSets with default settings without realizing that concurrency is configurable and defaults to an extremely conservative value. The 6-8 hour deployment time is the telltale symptom of sequential deployment across 200 accounts; this question rewards candidates who can identify root causes from operational symptoms.

Case Study 8

A fintech startup operates a microservices architecture with 30 services, each defined in separate GitHub repositories with their own CI/CD pipelines deploying to EKS clusters. The infrastructure team maintains Kubernetes manifests and Helm charts for each service. As the company scales, they face increasing challenges: configuration drift between environments, inconsistent resource definitions across services (some teams use Deployments, others StatefulSets for similar workloads), no standardization of labels and annotations, and difficulty enforcing security policies like required resource limits and pod security standards. The CTO wants to implement guardrails that ensure all microservices follow company standards for Kubernetes configurations while allowing development teams autonomy to define service-specific requirements. The solution must provide immediate validation feedback to developers during local development and in CI/CD pipelines, not just post-deployment.

What approach provides the MOST effective policy enforcement for Kubernetes configurations during development and deployment?

  1. Implement Open Policy Agent (OPA) Gatekeeper in each EKS cluster with constraint templates defining company policies, deploy admission webhooks that evaluate pod specifications at creation time and reject non-compliant deployments, configure OPA to audit existing resources for policy violations, and integrate OPA with monitoring tools to alert on policy violations
  2. Use Kyverno installed in each EKS cluster to define policy resources as Kubernetes custom resources, create ClusterPolicy objects for required labels, resource limits, and pod security standards, enable Kyverno's admission control to validate and mutate resources during admission, implement Kyverno's validation failure reporting in Kubernetes events, and configure CI/CD pipelines to use kubectl dry-run to detect policy violations before deployment
  3. Implement Conftest in CI/CD pipelines using OPA policies written in Rego language to validate Kubernetes manifests and Helm chart outputs as part of the build process, create policy libraries for required standards stored in a centralized repository that all pipelines reference, configure pre-commit hooks in local developer environments to run Conftest validation before commits, and block pipeline progression when policy violations are detected
  4. Migrate all microservices to AWS App Mesh for service mesh capabilities, use App Mesh virtual node and virtual service configurations to standardize service definitions, implement AWS Service Catalog to create approved Kubernetes deployment templates that teams must use, disable direct kubectl access for developers and require all deployments through Service Catalog portfolios with embedded compliance controls

Answer & Explanation

Correct Answer: 3 - Implement Conftest in CI/CD pipelines using OPA policies written in Rego language to validate Kubernetes manifests and Helm chart outputs as part of the build process, create policy libraries for required standards stored in a centralized repository that all pipelines reference, configure pre-commit hooks in local developer environments to run Conftest validation before commits, and block pipeline progression when policy violations are detected

Why this is correct: Conftest provides policy validation during development and in CI/CD pipelines, before resources ever reach the Kubernetes cluster. This gives developers immediate feedback during local development (via pre-commit hooks) and provides a hard gate in CI/CD pipelines that prevents non-compliant configurations from being deployed. This is a "shift-left" approach that catches issues early in the development cycle. Conftest can validate Kubernetes manifests, Helm chart outputs, and other structured configuration files using OPA's Rego policy language. Centralized policy libraries ensure consistency across all 30 microservice repositories while allowing service-specific customization within the policy constraints. This satisfies the requirement for "immediate validation feedback to developers during local development and in CI/CD pipelines, not just post-deployment."
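To make the idea concrete, here is the kind of check a Conftest Rego policy would encode, sketched in plain Python against a manifest already parsed into a dict. The required label names and the violation wording are illustrative assumptions, not company standards from the scenario.

```python
# Python sketch of checks a Conftest Rego policy would express; the manifest
# is a plain dict as it would arrive after YAML parsing.
REQUIRED_LABELS = {"app", "team", "cost-center"}  # illustrative label standard

def check_manifest(manifest: dict) -> list[str]:
    """Return human-readable policy violations for a Deployment-like manifest."""
    violations = []
    labels = manifest.get("metadata", {}).get("labels", {})
    missing = REQUIRED_LABELS - labels.keys()
    if missing:
        violations.append(f"missing required labels: {sorted(missing)}")
    containers = (manifest.get("spec", {}).get("template", {})
                  .get("spec", {}).get("containers", []))
    for c in containers:
        if "limits" not in c.get("resources", {}):
            violations.append(f"container {c.get('name', '?')} has no resource limits")
    return violations

manifest = {
    "kind": "Deployment",
    "metadata": {"labels": {"app": "billing"}},
    "spec": {"template": {"spec": {"containers": [{"name": "api", "resources": {}}]}}},
}
for v in check_manifest(manifest):
    print(v)
```

In a real pipeline these rules would live in the centralized Rego policy library, and both the pre-commit hook and the CI stage would fail on any non-empty violation list.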

Why the other options are wrong:

  • Option 1: OPA Gatekeeper runs as admission control within the Kubernetes cluster, which means validation happens when resources are applied to the cluster, not during local development or early in CI/CD pipelines. Developers wouldn't get feedback until they attempt to deploy to a cluster, which is later in the development cycle. While Gatekeeper is excellent for runtime enforcement (preventing non-compliant resources from entering clusters), it doesn't meet the requirement for "immediate validation feedback to developers during local development." The feedback loop is too late in the process.
  • Option 2: Kyverno also runs as admission control within the Kubernetes cluster, with the same limitation as OPA Gatekeeper: validation happens at deployment time to the cluster, not during local development. While the option mentions using kubectl dry-run in CI/CD pipelines, this still requires connectivity to a Kubernetes cluster and doesn't provide the local development feedback loop requested. Kyverno is a cluster-based policy engine, not a development-time validation tool. Developers working locally wouldn't receive validation feedback until pushing to CI/CD or connecting to a cluster.
  • Option 4: This option fundamentally misunderstands the requirement by suggesting migration to App Mesh and Service Catalog, which are completely different tools solving different problems. App Mesh is a service mesh for traffic management, not a policy enforcement tool for Kubernetes configurations. Requiring Service Catalog for deployments and disabling kubectl access represents a massive operational change that reduces developer autonomy, the opposite of "allowing development teams autonomy to define service-specific requirements." This doesn't address configuration standardization or policy enforcement at all.

Key Insight: The critical distinction is between validation-time policy enforcement (Conftest during development/CI) versus admission-time policy enforcement (OPA Gatekeeper, Kyverno in the cluster). Both are valuable, but the requirement explicitly asks for feedback "during local development and in CI/CD pipelines, not just post-deployment." Many candidates default to admission controllers because they're commonly discussed for Kubernetes policy enforcement, but miss that admission control is cluster-based and doesn't provide the early feedback loop requested.

Case Study 9

An insurance company uses AWS CodePipeline to orchestrate deployments across development, staging, and production environments. Their pipeline includes stages for source (CodeCommit), build (CodeBuild), test (CodeBuild with integration tests), and deployment (CloudFormation). Recently, deployments to production have been frequently rolled back due to issues discovered only after deployment: performance problems, integration failures with external APIs, and subtle configuration errors that passed earlier testing stages. These rollbacks cause service disruptions and have occurred 8 times in the last 2 months. The engineering team wants to add comprehensive validation between the staging deployment and production deployment that would have caught these specific issues. The validation must run automatically, should detect performance regressions, verify integration with external dependencies, and confirm that all production-specific configurations (different API endpoints, database connection strings, encryption keys) are correct before traffic is served.

What combination of actions will provide the MOST comprehensive pre-production validation to reduce rollback frequency? (Select TWO)

  1. Add a CodePipeline manual approval stage between staging and production deployments that requires the operations manager to review deployment logs, CloudWatch metrics from staging, and integration test results before approving production deployment
  2. Implement a CodePipeline stage between staging and production that uses CodeBuild to execute synthetic transaction tests against the production environment after deployment but before updating the load balancer target group to route traffic, validating that external API integrations work with production credentials and that performance meets SLA thresholds
  3. Add AWS CodeDeploy blue/green deployment configuration with CloudWatch alarms as deployment monitors, configuring automatic rollback if error rates exceed thresholds during the first 15 minutes after deployment, with alarms monitoring application-specific metrics including external API call success rates and response times
  4. Implement AWS AppConfig feature flags to deploy configuration changes separately from code deployments, use AppConfig validators to verify production-specific configurations against expected schemas before activation, gradually roll out configuration changes using AppConfig deployment strategies with automatic rollback on CloudWatch alarm breaches
  5. Create a separate AWS account for pre-production validation that mirrors production configuration exactly, add a CodePipeline stage that deploys to this environment and runs comprehensive smoke tests, integration tests, and performance benchmarks before promoting to actual production

Answer & Explanation

Correct Answer: 3 and 4

Why these are correct: Option 3 addresses the deployment-time validation requirement by implementing blue/green deployments with automated monitoring and rollback capabilities based on actual production behavior during initial traffic exposure. CloudWatch alarms monitoring application-specific metrics (external API success rates, response times) can detect integration and performance issues early, triggering automatic rollback before widespread impact; this directly addresses the performance problems and integration failures mentioned. Option 4 addresses the configuration-specific validation requirement by separating configuration management from code deployment using AppConfig. AppConfig validators can verify production-specific configurations (API endpoints, connection strings) before they're activated, and gradual rollout with automatic rollback provides safe deployment of configurations; this directly addresses the "subtle configuration errors" and "production-specific configurations" issues. Together, these provide both deployment-time validation (Option 3) and configuration-time validation (Option 4), covering the full spectrum of issues described.
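AppConfig supports JSON Schema validators and Lambda validators; the sketch below shows, in plain Python, the kind of check such a Lambda validator might run against a production configuration before activation. The config field names (`api_endpoint`, `kms_key_arn`) and the rules themselves are hypothetical, not taken from the scenario.

```python
import re

def validate_prod_config(config: dict) -> list[str]:
    """Checks a (hypothetical) production config before AppConfig activates it."""
    errors = []
    endpoint = config.get("api_endpoint", "")
    if not endpoint.startswith("https://"):
        errors.append("api_endpoint must use https")
    if ".staging." in endpoint:
        errors.append("api_endpoint points at a staging host")
    # Require a well-formed KMS key ARN rather than a bare key ID or placeholder.
    if not re.fullmatch(r"arn:aws:kms:[a-z0-9-]+:\d{12}:key/.+",
                        config.get("kms_key_arn", "")):
        errors.append("kms_key_arn is not a valid KMS key ARN")
    return errors

good = {"api_endpoint": "https://api.example.com",
        "kms_key_arn": "arn:aws:kms:us-east-1:123456789012:key/abc"}
bad = {"api_endpoint": "https://api.staging.example.com", "kms_key_arn": "not-an-arn"}
print(validate_prod_config(good))  # []
print(validate_prod_config(bad))
```

In AppConfig terms, a non-empty error list would cause the validator to reject the configuration version, so the bad values never reach the running service.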

Why the other options are wrong:

  • Option 1: Manual approval stages add human oversight but don't provide automated technical validation; they rely on a person reviewing information and making judgment calls. Manual reviews are prone to human error, don't scale well, slow down deployment velocity, and wouldn't reliably catch the "subtle configuration errors" mentioned. The question asks for comprehensive validation that "must run automatically"; manual approval is the opposite of automation. This is a process control, not a technical validation.
  • Option 2: While synthetic transaction tests are valuable, running them "after deployment but before updating the load balancer target group" is problematic because the new version is already deployed to production infrastructure. If production-specific configurations are wrong (different API endpoints, connection strings), the tests would need to somehow validate production configs before deployment, not after. Additionally, this describes a blue/green deployment pattern but implements it awkwardly through custom CodeBuild scripting rather than using AWS CodeDeploy's native blue/green capabilities (which is what Option 3 provides more cleanly).
  • Option 5: Creating a separate pre-production account that "mirrors production configuration exactly" seems appealing but has a critical flaw: if it truly mirrors production configuration including production API endpoints, database connection strings, and encryption keys, then tests in this environment would affect production data and external systems (a testing anti-pattern). If it doesn't use production configurations, then it doesn't validate production-specific configurations as required. This also adds significant operational complexity (maintaining configuration parity between two environments) and infrastructure cost without addressing the core issue of validating actual production behavior.

Key Insight: The question tests understanding of where different types of validation should occur in a deployment pipeline. Configuration validation (AppConfig) should happen before deployment, while behavior validation (blue/green with monitoring) should happen during initial exposure to production traffic. Many candidates choose options that describe reasonable-sounding practices (manual approvals, additional test environments) but don't provide the automated technical validation against production-specific conditions that the scenario requires.

Case Study 10

A pharmaceutical research company uses AWS to process genomic data analysis workflows. Scientists submit job requests through a web portal, which triggers AWS Step Functions state machines that orchestrate data processing pipelines using AWS Batch for compute-intensive analysis, with results stored in S3 and metadata in DynamoDB. The infrastructure team manages the Step Functions state machine definitions, Batch compute environment configurations, and IAM roles using Terraform. The current challenge is that when infrastructure changes are deployed (such as updating Batch instance types, modifying Step Functions retry logic, or adjusting IAM permissions), there's no safe way to test these changes without impacting production workloads. Failed infrastructure changes have caused pipeline failures affecting time-sensitive research projects. The team needs a methodology to safely test infrastructure changes before applying them to production, with the ability to validate that active job executions would succeed with the new infrastructure configuration. Budget constraints limit options for maintaining duplicate complete environments.

What is the MOST cost-effective approach to safely test infrastructure changes before production deployment?

  1. Implement Terraform workspaces to create complete parallel infrastructure stacks for testing, deploy infrastructure changes to the test workspace, execute representative sample jobs through the test infrastructure, validate results and costs, then apply the same changes to the production workspace and destroy the test workspace after validation is complete
  2. Use AWS CloudFormation StackSets to deploy infrastructure changes to a separate testing account first, create a data pipeline to replicate a subset of production S3 data and DynamoDB metadata to the testing account, execute full end-to-end test jobs in the testing environment, validate results, then deploy the same CloudFormation changes to the production account using StackSets after successful testing
  3. Implement feature flags using AWS AppConfig to toggle between old and new infrastructure configurations, deploy both configurations in production simultaneously with flag-controlled selection, gradually shift traffic to new configurations for selected test jobs while monitoring for issues, roll back by changing the flag if problems occur, then remove old configurations once validation is complete
  4. Create Terraform modules for all infrastructure components with versioning, implement automated integration testing using Terratest that provisions temporary testing infrastructure, executes synthetic validation jobs, verifies expected behavior, then automatically destroys test infrastructure, and only promotes Terraform module versions to production after tests pass in CI/CD pipelines

Answer & Explanation

Correct Answer: 4 - Create Terraform modules for all infrastructure components with versioning, implement automated integration testing using Terratest that provisions temporary testing infrastructure, executes synthetic validation jobs, verifies expected behavior, then automatically destroys test infrastructure, and only promotes Terraform module versions to production after tests pass in CI/CD pipelines

Why this is correct: Terratest provides automated infrastructure testing for Terraform code: it can provision resources, run validation tests against them, and then automatically tear them down. This approach minimizes cost because test infrastructure exists only during test execution (typically minutes to hours), not continuously. Versioned Terraform modules provide reusability and allow infrastructure changes to be tested safely, independently of production. Automated testing in CI/CD provides consistent validation before production deployment. The ephemeral nature of the test infrastructure addresses "budget constraints limit options for maintaining duplicate complete environments": you're not maintaining permanent duplicate environments, you're creating temporary test environments on demand. This validates infrastructure changes before they reach production while keeping costs minimal.
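Terratest itself is a Go library, but the provision-validate-destroy loop it automates can be sketched language-agnostically. The Python below shells out to a `terraform` binary assumed to be on PATH; the module path and validation callback are placeholders, and nothing is actually executed against AWS in this sketch.

```python
import subprocess

def tf_cmd(args: list[str], module_dir: str) -> list[str]:
    """Build a terraform command line scoped to a module directory."""
    return ["terraform", f"-chdir={module_dir}", *args]

def terraform(args: list[str], module_dir: str) -> None:
    """Run terraform and raise if it exits non-zero (fails the CI job)."""
    subprocess.run(tf_cmd(args, module_dir), check=True)

def run_ephemeral_test(module_dir: str, validate) -> None:
    """Provision temporary infrastructure, run a validation callback, always destroy."""
    terraform(["init", "-input=false"], module_dir)
    try:
        terraform(["apply", "-auto-approve", "-input=false"], module_dir)
        validate()  # e.g. submit a synthetic Batch job and poll the Step Functions execution
    finally:
        # Destroy runs even when validation fails, so no test infra outlives the test.
        terraform(["destroy", "-auto-approve", "-input=false"], module_dir)

# Illustration only: show the command that would run, rather than invoking terraform.
print(tf_cmd(["apply", "-auto-approve"], "./modules/batch-pipeline"))
```

The `try`/`finally` shape mirrors Terratest's idiomatic `defer terraform.Destroy(...)`: the teardown is unconditional, which is what keeps ephemeral testing cheap.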

Why the other options are wrong:

  • Option 1: Terraform workspaces share the same backend and are primarily designed for managing different environments' state, not for creating complete parallel infrastructure stacks for testing. While technically possible, workspaces can become confusing and error-prone when managing truly separate infrastructure. More critically, this approach doesn't describe automated testing; it describes manual deployment to a test workspace, manual job execution, and manual validation, which is time-consuming, error-prone, and doesn't integrate well into CI/CD pipelines. The "deploy, test, deploy again to production" workflow is inefficient and still requires manual intervention.
  • Option 2: The question states they use Terraform, but this option suggests switching to CloudFormation and StackSets, which would require rewriting all infrastructure code, a massive undertaking. Maintaining a separate testing account with replicated data addresses the testing requirement but fails the "most cost-effective" constraint: running a complete parallel account with compute environments, ongoing data replication, and persistent infrastructure is expensive. This is essentially a traditional staging environment approach, which the question's budget constraint specifically indicates they want to avoid.
  • Option 3: AppConfig feature flags are designed for application configuration changes and feature toggling, not infrastructure configuration changes. You cannot use feature flags to "toggle between old and new infrastructure configurations" for things like Batch compute environment instance types or Step Functions retry logic; these are infrastructure-level settings, not application runtime configurations. This represents a fundamental misunderstanding of what AppConfig does. Infrastructure changes like switching IAM role permissions or Batch compute environment settings cannot be toggled at runtime via feature flags.

Key Insight: The question tests understanding of infrastructure testing strategies and cost optimization through ephemeral test environments. Many candidates default to traditional staging environment approaches (permanent parallel infrastructure) without considering that automated, ephemeral testing infrastructure can provide validation at much lower cost. Terratest represents modern IaC testing practices in which infrastructure is tested like application code (provision, test, destroy) rather than maintained as expensive permanent test environments.
