Ansible vs Tf vs Cf: Disaster Recovery

At this point, I am working on my disaster recovery plan by developing templates of my school’s current resources. I have come to understand that CloudFormation is great if you are only working with AWS resources. Terraform works well with AWS but also with other cloud providers, so if you are multi-cloud or opposed to vendor lock-in, Terraform makes a lot of sense. As for Ansible… I can configure servers with it, but deploying cloud resources is not as easy as I would like. I think that is why, everywhere I interview, CF & TF are used for Infrastructure as Code and Ansible is used for configuration management. Still, I am replicating my resource stacks in all three tools, just for the practice. Today, I am making some serious headway with CloudFormation.

My first task was to get a sense of what is deployed in my organization’s region, and CloudFormer helped a lot here. I launched the CloudFormer server from its template, then switched it to a larger instance type, since it seemed to be grinding on the size of our DNS records. Once I did that, I walked through the interface and was rewarded with about 7,000 lines of YAML declaring everything that was built. I ran it two more times, splitting the DNS records (2,700 lines) from the main output file (4,600 lines). This gave me a good understanding of everything I had built in the console over the last three years, and a baseline from which to build a copy in another region. I am also considering deploying it in the current region, rebuilding all the resources, and then deleting the old ones to clear out the cruft of experimentation.

The next step was to recreate the VPC stack. This was my first use of the !Ref intrinsic function, which I used to refer to the parameters I added to the template:

Parameters:

  EnvironmentName:
    Description: An environment name that will be prefixed to resource names
    Type: String
    Default: DR-Prod

  VpcCIDR:
    Description: Please enter the IP range (CIDR notation) for this VPC
    Type: String
    Default: 10.0.0.0/16

  PublicSubnet1CIDR:
    Description: Please enter the IP range (CIDR notation) for the public subnet in the first Availability Zone
    Type: String
    Default: 10.0.0.0/24

## truncated for example

Resources:
  VPC:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: !Ref VpcCIDR
      EnableDnsSupport: true
      EnableDnsHostnames: true
      Tags:
        - Key: Name
          Value: !Ref EnvironmentName

  InternetGateway:
    Type: AWS::EC2::InternetGateway
    Properties:
      Tags:
        - Key: Name
          Value: !Ref EnvironmentName

  InternetGatewayAttachment:
    Type: AWS::EC2::VPCGatewayAttachment
    Properties:
      InternetGatewayId: !Ref InternetGateway
      VpcId: !Ref VPC

# subnets
  PublicSubnet1:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      AvailabilityZone: !Select [ 0, !GetAZs '' ]
      CidrBlock: !Ref PublicSubnet1CIDR
      MapPublicIpOnLaunch: true
      Tags:
        - Key: Name
          Value: !Sub ${EnvironmentName} Public Subnet (AZ1)

# routing
  PublicRouteTable:
    Type: AWS::EC2::RouteTable
    Properties:
      VpcId: !Ref VPC
      Tags:
        - Key: Name
          Value: !Sub ${EnvironmentName} Public Routes

  DefaultPublicRoute:
    Type: AWS::EC2::Route
    DependsOn: InternetGatewayAttachment
    Properties:
      RouteTableId: !Ref PublicRouteTable
      DestinationCidrBlock: 0.0.0.0/0
      GatewayId: !Ref InternetGateway

  PublicSubnet1RouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      RouteTableId: !Ref PublicRouteTable
      SubnetId: !Ref PublicSubnet1

## truncated for example

This also let me combine !Select with !GetAZs to pick Availability Zones without naming them explicitly, and use the !Sub function to build more descriptive Name tags.
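For a second public subnet, the same pattern only needs the index bumped to 1. This is a sketch of a Resources fragment that assumes a `PublicSubnet2CIDR` parameter declared like `PublicSubnet1CIDR` (hypothetical here, since that part of the template is truncated). Because `!GetAZs ''` returns the Availability Zones of whichever region the stack is deployed in, the same template should work unchanged in the DR region.

```yaml
  PublicSubnet2:
    Type: AWS::EC2::Subnet
    Properties:
      VpcId: !Ref VPC
      # index 1 = the second AZ that !GetAZs returns for the stack's region
      AvailabilityZone: !Select [ 1, !GetAZs '' ]
      CidrBlock: !Ref PublicSubnet2CIDR   # assumed parameter, not shown above
      MapPublicIpOnLaunch: true
      Tags:
        - Key: Name
          # ${AWS::Region} is a pseudo parameter, so the tag records which region built it
          Value: !Sub ${EnvironmentName} Public Subnet (AZ2) ${AWS::Region}

  PublicSubnet2RouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Properties:
      RouteTableId: !Ref PublicRouteTable
      SubnetId: !Ref PublicSubnet2
```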

As I began to build other stacks, such as the security stack, I found that I needed dynamically generated values like resource IDs. That led me to the Outputs section, which I had only ever used to generate a URL as part of a tutorial.

Outputs:
  VPC:
    Description: A reference to the created VPC
    Value: !Ref VPC
    Export:
      Name: VPC

  PublicSubnet1:
    Description: A reference to the public subnet in the 1st Availability Zone
    Value: !Ref PublicSubnet1
    Export:
      Name: PublicSubnet1
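Export names must be unique within an account and region, so a fixed name like VPC would collide if a second copy of the network stack were launched in the same region (for example, during the rebuild-in-place idea mentioned earlier). One common variation, sketched here, prefixes each export with the existing `EnvironmentName` parameter, and the consuming template takes the same parameter. Note that the short form `!ImportValue` can't wrap the short form `!Sub`, so the import uses the full `Fn::ImportValue` key. The two documents stand for two separate template files.

```yaml
# Network (VPC) template: prefix each export with the environment name
Outputs:
  VPC:
    Description: A reference to the created VPC
    Value: !Ref VPC
    Export:
      # with the default EnvironmentName this exports "DR-Prod-VPC"
      Name: !Sub ${EnvironmentName}-VPC
---
# Consuming template (a separate file)
Parameters:
  EnvironmentName:
    Description: Must match the EnvironmentName used by the network stack
    Type: String
    Default: DR-Prod

Resources:
  sgMdmProd:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Mdm server in production
      # long form needed: the short form !ImportValue cannot contain the short form !Sub
      VpcId:
        Fn::ImportValue: !Sub ${EnvironmentName}-VPC
```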

Each exported value is then available to other stacks in the same account and region, which can retrieve it with the !ImportValue function. Here’s part of the security stack, which handles security groups and ingress rules. Later I’ll add a set of NACLs, which will be good practice for my upcoming Networking Specialty exam.

Description: "This template applies the network security stack, implementing security groups, egress and ingress rules, network access control lists, sets up the bastion host, and applies the NAT and route rules for the private subnets."


Resources:

# security groups and ingress rules

# mdm security groups
  sgMdmProd:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Mdm server in production
      VpcId: !ImportValue VPC
      Tags:
      - Key: Name
        Value: MdmProd

# all access 8443
  ingressMdmProd01:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      GroupId: !Ref sgMdmProd
      IpProtocol: tcp
      FromPort: '8443'
      ToPort: '8443'
      CidrIp: 0.0.0.0/0

# ldap 389 only to ldap server
  ingressMdmProd05:
    Type: AWS::EC2::SecurityGroupIngress
    Properties:
      GroupId: !Ref sgMdmProd
      IpProtocol: tcp
      FromPort: '389'
      ToPort: '389'
      SourceSecurityGroupId: !Ref sgOdProd
      SourceSecurityGroupOwnerId: '<redacted>'

# truncated for examples
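The NACLs aren’t written yet, but as a rough sketch of where they might start, here is a Resources fragment for a public-subnet network ACL that allows inbound 8443 (mirroring the MDM rule) and outbound return traffic on ephemeral ports. The resource names are hypothetical, and it assumes the VPC stack exports `PublicSubnet1` as shown earlier. Unlike security groups, NACLs are stateless, so replies need their own outbound rule; also, a newly created NACL denies everything not explicitly allowed, so every port the subnet actually uses would need a rule before the association goes in.

```yaml
  naclPublic:
    Type: AWS::EC2::NetworkAcl
    Properties:
      VpcId: !ImportValue VPC
      Tags:
        - Key: Name
          Value: PublicNacl

  # inbound 8443 from anywhere, mirroring ingressMdmProd01
  naclPublicIn8443:
    Type: AWS::EC2::NetworkAclEntry
    Properties:
      NetworkAclId: !Ref naclPublic
      RuleNumber: 100
      Protocol: 6            # 6 = TCP
      RuleAction: allow
      Egress: false
      CidrBlock: 0.0.0.0/0
      PortRange:
        From: 8443
        To: 8443

  # stateless: responses leave on ephemeral ports, so allow them outbound
  # (ingress and egress entries have separate rule-number lists)
  naclPublicOutEphemeral:
    Type: AWS::EC2::NetworkAclEntry
    Properties:
      NetworkAclId: !Ref naclPublic
      RuleNumber: 100
      Protocol: 6
      RuleAction: allow
      Egress: true
      CidrBlock: 0.0.0.0/0
      PortRange:
        From: 1024
        To: 65535

  naclPublicSubnet1Association:
    Type: AWS::EC2::SubnetNetworkAclAssociation
    Properties:
      SubnetId: !ImportValue PublicSubnet1
      NetworkAclId: !Ref naclPublic
```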

I exported my security groups so they could be used by other templates that create resources requiring security groups. The best part about keeping the network security stack separate is that I can review it closely and make sure there are no extraneous rules and no traffic coming in from anywhere I don’t want.
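A minimal sketch of that hand-off, assuming a hypothetical export name `sgMdmProd` and a hypothetical `MdmAmiId` parameter in the server template: the security template exports the group ID, and a separate server template imports it. `!GetAtt ... GroupId` is used so the exported value is unambiguously the group ID.

```yaml
# Security template: export the group ID
Outputs:
  sgMdmProd:
    Description: Security group for the production MDM server
    Value: !GetAtt sgMdmProd.GroupId
    Export:
      Name: sgMdmProd            # hypothetical export name
---
# Server template (a separate file): import it
Parameters:
  MdmAmiId:
    Description: AMI for the MDM server (hypothetical parameter)
    Type: AWS::EC2::Image::Id

Resources:
  MdmServer:
    Type: AWS::EC2::Instance
    Properties:
      ImageId: !Ref MdmAmiId
      InstanceType: t3.medium    # placeholder size
      SubnetId: !ImportValue PublicSubnet1
      # with SubnetId set, groups are given by ID, not by name
      SecurityGroupIds:
        - !ImportValue sgMdmProd
```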

As my list of stacks got longer, I learned about nested stacks and the DependsOn attribute. This is my root stack:

AWSTemplateFormatVersion: '2010-09-09'
Description: "disaster recovery / automatic failover"

Resources:
  DrVpc:
    Type: AWS::CloudFormation::Stack
    Properties:
      TemplateURL: https://s3.amazonaws.com/<redacted>/DrVpc.yaml
      TimeoutInMinutes: '5'
  
  DrSecGroupRules:
    Type: AWS::CloudFormation::Stack
    DependsOn: DrVpc
    Properties:
      TemplateURL: https://s3.amazonaws.com/<redacted>/DrSecurityRules.yaml
      TimeoutInMinutes: '5'

  DrRds:
    Type: AWS::CloudFormation::Stack
    DependsOn: DrSecGroupRules
    Properties:
      TemplateURL: https://s3.amazonaws.com/<redacted>/DrRds.yaml
      TimeoutInMinutes: '20'

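By default, CloudFormation creates resources in parallel unless one references another (for example with !Ref or !GetAtt), and nested stacks are no exception. My child stacks share IDs through exports and !ImportValue, which the root template can’t see, so on its own CloudFormation would try to build them all at the same time. The DependsOn attribute makes CloudFormation wait to create a stack until the stack it depends on has finished.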
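An alternative, sketched here under the assumption that DrSecurityRules.yaml is changed to accept a `VpcId` parameter (the original imports the VPC ID instead), is to pass values from one nested stack to the next with `!GetAtt <stack>.Outputs.<name>`. That reference makes the dependency implicit, so CloudFormation orders the stacks without an explicit DependsOn, and the child templates no longer rely on region-wide export names.

```yaml
Resources:
  DrVpc:
    Type: AWS::CloudFormation::Stack
    Properties:
      TemplateURL: https://s3.amazonaws.com/<redacted>/DrVpc.yaml
      TimeoutInMinutes: 5

  DrSecGroupRules:
    Type: AWS::CloudFormation::Stack
    # no DependsOn: the GetAtt below already makes this stack wait for DrVpc
    Properties:
      TemplateURL: https://s3.amazonaws.com/<redacted>/DrSecurityRules.yaml
      TimeoutInMinutes: 5
      Parameters:
        # "VPC" is the output name in DrVpc.yaml; VpcId is a hypothetical child parameter
        VpcId: !GetAtt DrVpc.Outputs.VPC
```

In the child template, the import would then become `VpcId: !Ref VpcId`, with `VpcId` declared under Parameters as type `AWS::EC2::VPC::Id`.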

I set TimeoutInMinutes so that I would not wait an hour or more for a stack to fail. I had a situation where a resource in the VPC stack was not getting created, but not failing either. I thought CloudFormation waited 60 minutes by default before declaring a failure (I think that was a test question somewhere), but as far as I can tell from the AWS::CloudFormation::Stack documentation, a nested stack has no timeout at all unless you set one, so I set one to reduce the suspense. As I tested each stack individually, I noted how long each one took to build and set my timeouts accordingly. The RDS stack took the longest, with each database taking about 5 minutes to spin up, which is important to know 🙂
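One way to keep those measured timeouts adjustable without editing the root template is to make them parameters. This is a fragment of the root template; `RdsTimeoutMinutes` is a hypothetical parameter name, with its default taken from the 20 minutes used above.

```yaml
Parameters:
  RdsTimeoutMinutes:
    Description: Minutes to wait for the RDS nested stack before failing (hypothetical)
    Type: Number
    Default: 20
    MinValue: 5                  # guard against a value too short for the databases

Resources:
  DrRds:
    Type: AWS::CloudFormation::Stack
    DependsOn: DrSecGroupRules
    Properties:
      TemplateURL: https://s3.amazonaws.com/<redacted>/DrRds.yaml
      TimeoutInMinutes: !Ref RdsTimeoutMinutes
```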

At present, I have stacks for the VPC, security, load balancers, EFS volumes, and databases. I’m working on the individual server stacks, trying to figure out how to get the necessary data from the backups onto the EFS volumes and into the databases. I may have to build some Lambda functions to copy AMIs over automatically and update the templates with the new AMI IDs. That bears some consideration, and probably some googling 🙂
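One way to avoid rewriting templates every time an AMI is copied is to have the server template read the AMI ID from SSM Parameter Store when the stack is created or updated. Whatever copies the AMI into the DR region (a Lambda function or otherwise) would then only need to update that one parameter. This is an untested sketch; `/dr/mdm/ami-id` is a hypothetical parameter path, and since Parameter Store is regional, the parameter would have to exist in the DR region.

```yaml
Parameters:
  MdmAmiId:
    Description: SSM parameter holding the current MDM AMI ID (hypothetical path)
    # CloudFormation looks up the SSM parameter's value at create/update time
    Type: AWS::SSM::Parameter::Value<AWS::EC2::Image::Id>
    Default: /dr/mdm/ami-id

Resources:
  MdmServer:
    Type: AWS::EC2::Instance
    Properties:
      ImageId: !Ref MdmAmiId
      InstanceType: t3.medium    # placeholder size
```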

My disaster recovery drill is scheduled for Friday night, so I have three more days to finish these CloudFormation templates. We’ll see how it goes, and as I learn about helper functions and init scripts, I’ll post about them here. Once the CF templates are done, it’s time to do it all over again in Terraform!
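For reference, the usual shape of the helper-script pattern is a CreationPolicy on the instance plus a `cfn-signal` call at the end of UserData, so the stack waits for the server to report that it finished configuring rather than just that it booted. This Resources fragment assumes an AMI with the aws-cfn-bootstrap helper scripts installed in /opt/aws/bin (Amazon Linux 2 ships them there) and reuses a hypothetical `MdmAmiId` parameter; the actual configuration step is left as a placeholder.

```yaml
  MdmServer:
    Type: AWS::EC2::Instance
    CreationPolicy:
      ResourceSignal:
        Count: 1
        Timeout: PT15M           # fail if no success signal arrives within 15 minutes
    Properties:
      ImageId: !Ref MdmAmiId     # hypothetical parameter
      InstanceType: t3.medium    # placeholder size
      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash
          # ... configure the server here (cfn-init, restore scripts, etc.) ...
          # report the exit code of the last command back to CloudFormation
          /opt/aws/bin/cfn-signal -e $? --stack ${AWS::StackName} --resource MdmServer --region ${AWS::Region}
```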
