Deploying Cloud Native Qumulo on AWS with Terraform

This section explains how to deploy Cloud Native Qumulo (CNQ) by creating the persistent storage and the cluster compute and cache resources with Terraform. It also provides recommendations for Terraform deployments and information about post-deployment actions and optimization.

For an overview of CNQ on AWS, its prerequisites, and limits, see How Cloud Native Qumulo Works.

The aws-terraform-cnq.zip file contains comprehensive Terraform configurations that let you deploy S3 buckets and then create a CNQ cluster with 4 to 24 instances that adhere to the AWS Well-Architected Framework and have fully elastic compute and capacity.

Prerequisites

This section explains the prerequisites to deploying CNQ on AWS.

To allow your Qumulo instance to report metrics to Qumulo, your AWS VPC must have outbound Internet connectivity through a NAT gateway or a firewall. Your instance shares no file data during this process.

Important
Connectivity to api.missionq.qumulo.com is required for a successful deployment of a Qumulo instance and the formation of a quorum.

The following features require specific versions of Qumulo Core:

Feature	Minimum Qumulo Core Version
Adding S3 buckets to increase persistent storage capacity Increasing the soft capacity limit for an existing CNQ cluster	7.2.1.1
S3 Intelligent-Tiering storage class Infrequent Access S3 access tier	7.2.0.2
Creating persistent storage Important You must create persistent storage by using a separate Terraform deployment before you deploy the compute and cache resources for your cluster.	7.1.3 with version 4.0 of this repository

Before you configure your Terraform environment, you must sign in to the AWS CLI.

A custom IAM role or user must include the following AWS services:
- cloudformation:*
- ec2:*
- elasticloadbalancing:*
- iam:*
- kms:*
- lambda:*
- logs:*
- resource-groups:*
- route53:*
- s3:*
- secretsmanager:*
- sns:*
- ssm:*
- sts:*
Note
Although the AdministratorAccess managed IAM policy provides sufficient permissions, your organization might use a custom policy with more restrictions.

How the CNQ Provisioner Works

The CNQ Provisioner is an m5.large EC2 instance that uses custom user data to configure your Qumulo cluster and any additional AWS environment requirements.

The Provisioner stores all necessary state information in the AWS Parameter Store and shuts down automatically when it completes any of its following major tasks:

Click to expand

Qumulo Cluster Configuration

Forms the first quorum with specific Hot or Cold parameters
Adds nodes to the quorum (when expanding the cluster)
Assigns floating IP addresses to nodes in the cluster
Manages cluster replacement (new compute and cache resources) for changing instance sizes
Manages the addition of S3 buckets and soft capacity limit increases
Changes the administrative password

AWS Configuration

Checks for connectivity to Amazon S3
Checks for the presence of an S3 Gateway in the VPC (this is required for provisioning)
Checks that all S3 buckets are empty before forming quorum
Checks for connectivity to the public Internet running a curl command against api.missionq.qumulo.com/
Configures the throughput and IOPS for the EBS gp3 volume
Tags EBS volumes with deployment_unique_name and volume type
Tracks software versions, cluster IP addresses, instance IDs, and UUID in the AWS Parameter Store
Tracks the last-run-status for the Provisioner in the Parameter Store

Step 1: Deploying Cluster Persistent Storage

This section explains how to deploy the S3 buckets that act as persistent storage for your Qumulo cluster.

Log in to Nexus, click Downloads > Deployment on AWS, and then download the Terraform configuration, Debian package, and host configuration file.
In your S3 bucket, create the qumulo-core-install directory. Within this directory, create another directory with the Qumulo Core version as its name. The following is an example path:
```
my-s3-bucket-name/my-s3-bucket-prefix/qumulo-core-install/7.2.3
```
Tip
Make a new subdirectory for every new release of Qumulo Core.
Copy qumulo-core.deb and host_configuration.tar.gz into the directory named after the Qumulo Core version (in this example, it is 7.2.3).
Copy aws-terraform-cnq.zip to your Terraform environment and decompress it.
Navigate to the aws-terraform-cnq directory and then run the terraform init command.
Navigate to the persistent-storage directory and then take the following steps:
1. Review the terraform.tfvars file.
  - Specify the correct aws_region for your cluster’s persistent storage.
  - Enter the soft_capacity_limit.
2. Run the terraform apply command.
  
  Tip
  Note the value for deployment_unique_name that Terraform outputs. You will need this value for deploying your cluster.
Terraform creates each S3 bucket with a unique state for its deployment.

Step 2: Deploying Cluster Compute and Cache Resources

This section explains how to deploy compute and cache resources for a Qumulo cluster by using a Ubuntu AMI and the Qumulo Core .deb installer.

Important

Provisioning completes successfully when the provisioning instance shuts down automatically. If the provisioning instance doesn't shut down, the provisioning cycle has failed and must be troubleshooted. To monitor the provisioner's status, you can watch the Terraform status posts in your terminal or in the AWS Parameter Store, under /qumulo/<my-deployment-name>/last-run-status.
The first variable in the example configuration files in the aws-terraform-cnq repository is deployment_name. To help avoid conflicts between Network Load Balancers (NLBs), resource groups, cross-region CloudWatch views, and other deployment components, Terraform ignores the deployment_name value and any changes to it. Terraform generates the additional deployment_unique_name variable; appends a random, 11-digit alphanumeric value to it; and then tags all future resources with this variable, which never changes during subsequent Terraform deployments.
If you plan to deploy multiple Qumulo clusters, give the q_cluster_name variable a unique name for each cluster.
(Optional) If you use Amazon Route 53 private hosted zones, give the q_fqdn_name variable a unique name for each cluster
Familiarize yourself with how the CNQ on AWS Provisioner works and don't interfere with its operation.

Configure your VPC to use the gateway VPC endpoint for S3.

Important
It isn’t possible to deploy your cluster without a gateway.
Navigate to the aws-terraform-cnq directory.
Choose config-standard.tfvars or config-advanced.tfvars and fill in the values for all required variables. For more information, see readme.pdf in aws-terraform-cnq.zip.
To log in to your cluster’s Web UI, use the IP address from the Terraform output as the endpoint and the username and password that you have configured during deployment as the credentials.

Important
If you change the administrative password for your cluster by using the Qumulo Web UI, qq CLI, or REST API after deployment, you must add your new password to AWS Secrets Manager.

You can use the Web UI to create and manage NFS exports, SMB shares, snapshots, and continuous replication relationships You can also join your cluster to Active Directory, configure LDAP, and perform many other operations.
Mount your Qumulo file system by using NFS or SMB and your cluster’s DNS name or IP address.

Tip
You can configure DNS names with a Route 53 private hosted zone by using a template.

Step 3: Performing Post-Deployment Actions

This section describes the common actions you can perform on a CNQ cluster after deploying it.

Adding a Node to an Existing Cluster

Important
To add a node to an existing cluster, the total node count must be greater than that of the current deployment.

Edit terraform.tfvars and change the value of q_node_count to a new value.
Run the terraform apply command.
To ensure that the Provisioner shut downs automatically, review the /qumulo/my-deployment-name/last-run-status parameter in the AWS Parameter Store.
To check that the cluster is healthy, log in to the Web UI.

Removing a Node from an Existing Cluster

Removing a node from an existing cluster is a two-step process. First, you remove the node from the cluster’s quorum. Next, you tidy up your AWS resources.

Step 1: Remove the Node from the Cluster’s Quorum

You must perform this step while the cluster is running.

Copy the remove-nodes.sh script from the utilities directory to an AWS Linux 2 AMI running in your VPC.
Tip
- To make the script executable, run the chmod +x remove-nodes.sh command.
- To see a list of required parameters, run remove-nodes.sh
Run the remove-nodes.sh script and specify the AWS region, the unique deployment name, the current node count, and the final node count.

In the following example, we reduce a cluster from 6 to 4 nodes.
```
./remove-nodes.sh \
  --region us-west-2 \
  --qstackname my-unique-deployment-name \
  --currentnodecount 6 \
  --finalnodecount 4
```
When prompted, confirm the nodes’ removal.
To check that the cluster is healthy, log in to the Web UI.

Step 2: Tidy Up Your AWS Resources

Edit terraform.tfvars and change the value of q_node_count to a lower value (for example, 4).
Run the terraform apply command.
To monitor the provisioner’s status, you can watch the Terraform status posts in your terminal or in the AWS Parameter Store, under /qumulo/<my-deployment-name>/last-run-status.

The node and the infrastructure associated with the node are removed.
To check that the cluster is healthy, log in to the Web UI.

Changing the EC2 Instance Type for an Existing Cluster

Changing the EC2 instance type is a three-step process. First, you create a new deployment in a new Terraform workspace (this process ensures that the required instances are available) and join the new instances to a quorum. Next, you clean up your S3 bucket policies. Finally, you remove the existing instances.

Important
To avoid a number of potential issues, you must perform this cluster replacement procedure as a two-quorum event. For example, if you stop the existing instances by using the AWS Management Console and change the instance types, two quorum events occur for each node and the read and write cache isn’t optimized for the instance type. Moreover, your system might experience resource constraints before you can change the type of every instance.

Step 1: Create a New Deployment in a New Terraform Workspace

To create a new Terraform workspace, run the terraform workspace new my-new-workspace-name command.
To initialize the workspace, run the terraform init command.
Use the existing deployment name or choose a new name.

Note
Regardless, Terraform assigns a unique name with an 11-digit alphanumeric suffix to your deployment.
Edit the terraform.tfvars file and take the following steps:
1. Specify the value for the q_instance_type variable.
2. Set the value of the q_replacement_cluster variable to true.
3. Set the value of the q_existing_deployment_unique_name variable to the current deployment’s name.
4. (Optional) To change the number of nodes, specify the value for the q_node_count variable.
Run the terraform apply command.
To ensure that the Provisioner shut downs automatically, review the /qumulo/my-deployment-name/last-run-status parameter in the AWS Parameter Store.
To perform future node addition or removal operations, edit the terraform.tfvars file and set the q_replacement_cluster variable to false.
To check that the cluster is healthy, log in to the Web UI.

Step 2: Remove the Existing Instances

To select the previous Terraform workspace (for example, default), run the terraform workspace select <default> command.
To ensure that the correct workspace is selected, run the terraform workspace show command.
To delete the previous instance in its entirety, run the terraform destroy command.

The previous instances are deleted.

Note
The persistent storage deployment remains in its original Terraform workspace. You can perform the next cluster replacement procedure in the original Terraform workspace, and so on.

Step 3: Clean Up S3 Bucket Policies

To select the new Terraform workspace, run the terraform workspace select <my-new-workspace-name> command.
Edit the terraform.tfvars file and set the q_replacement_cluster variable to false.
Run the terraform apply command. This ensures that the S3 bucket policies have least privilege.

Increasing the Soft Capacity Limit for an Existing Cluster

Increasing the soft capacity limit for an existing cluster is a two-step process. First, you set new persistent storage parameters. Next, you set new compute and cache deployment parameters.

Step 1: Set New Persistent Storage Parameters

Edit the terraform.tfvars file in the persistent-storage directory and set the soft_capacity_limit variable to a higher value.
Run the terraform apply command.

Terraform creates new S3 buckets as necessary.

Step 2: Update Existing Compute and Cache Resource Deployment

Navigate to the root directory of the aws-terraform-cnq repository.
Run the terraform apply command.

Terraform updates the necessary IAM roles and S3 bucket policies, adds S3 buckets to the persistent storage list for the cluster, and increases the soft capacity limit. When the Provisioner shuts down automatically, this process is complete.

Deleting an Existing Cluster

Deleting a cluster is a two-step process. First, you delete your Cloud Native Qumulo resources. Next you delete your persistent storage.