This section explains how to safely replace hardware components on Cisco UCS servers running Qumulo Core, including procedures for node recusal and reintroduction.

Component Replacement Overview

Component                  Hot-Swappable    Node Offline Required
Drives (NVMe, SSD, HDD)    Yes              No
Boot Drive                 No               Yes (special procedure)
Power Supplies             Yes              No
Fans                       Yes              No (but special handling required)
DIMMs (Memory)             No               Yes
CPUs                       No               Yes
Motherboard                No               Yes
NICs                       No               Yes

Hot-Swappable Components

Drive Replacement

Drives in Qumulo clusters are hot-swappable. You can replace a failed drive without taking the node offline.

  1. Identify the failed drive using the Qumulo Web UI or qq cluster_slots_status command.
  2. Locate the physical drive using the Drive Bay Mapping.
  3. Remove the failed drive.
  4. Insert the replacement drive.
  5. Qumulo Core automatically detects and incorporates the new drive.
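
The drive-identification step above can be sketched in a few lines. This is a hypothetical example: the exact JSON schema emitted by qq cluster_slots_status is an assumption here, and the field names ("node_id", "slot", "state") are illustrative rather than the documented output format.

```python
import json

def find_unhealthy_slots(slots_json):
    """Return (node_id, slot) pairs for any slot not reporting 'healthy'.

    Assumes qq cluster_slots_status emits a JSON array with one object per
    drive slot; the field names used here are illustrative assumptions.
    """
    slots = json.loads(slots_json)
    return [(s["node_id"], s["slot"]) for s in slots if s["state"] != "healthy"]

# Sample output shaped like what the command might return:
sample = '''[
    {"node_id": 1, "slot": 1, "state": "healthy"},
    {"node_id": 1, "slot": 2, "state": "dead"}
]'''
print(find_unhealthy_slots(sample))  # [(1, 2)]
```

Cross-reference any slot this reports against the Drive Bay Mapping before pulling hardware.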

For physical drive replacement procedures, refer to the Cisco Installation and Service Guides listed in the Technical Specifications.

Power Supply Replacement

Power supplies are hot-swappable when the server has redundant power supplies.

  1. Verify the server has redundant power supplies and the remaining PSU is functioning.
  2. Remove the failed power supply.
  3. Insert the replacement power supply.

Fan Replacement

Fans are hot-swappable but require prompt replacement to prevent thermal issues.

  1. Identify the failed fan using CIMC or front panel LEDs.
  2. Have the replacement fan ready before beginning.
  3. Follow the fan replacement procedure in the Cisco Installation and Service Guide for your server model. This includes removing the top cover and accessing the fan modules.

For detailed fan replacement procedures, refer to the Cisco Installation and Service Guides listed in Technical Specifications.


Components Requiring Node Offline

For DIMM, CPU, motherboard, or NIC replacements, the node must be safely taken offline using the following procedure.

Before You Begin

Step 1: Verify Cluster Protection Status

You can verify cluster protection status using either the Web UI or the CLI.

Using the Qumulo Web UI

  1. Log in to the Qumulo Web UI.
  2. Navigate to Cluster > Cluster Overview.
  3. Under Data Protection, verify that the cluster can tolerate a node failure.

Using the qq CLI

SSH into any node and run:

qq protection_status_get

You will see output similar to:

{
    "remaining_drive_failures": 1,
    "remaining_node_failures": 2
}
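
The output above can be checked programmatically before proceeding. In this sketch, treating remaining_node_failures >= 1 as the go/no-go threshold is our assumption (it reflects the "cluster can tolerate a node failure" requirement in Step 1), not an official cutoff:

```python
import json

def safe_to_offline_node(status_json):
    """Return True if the cluster can tolerate losing one node.

    Parses the JSON shape shown in the qq protection_status_get sample;
    the >= 1 threshold is an assumption matching the Step 1 requirement.
    """
    status = json.loads(status_json)
    return status["remaining_node_failures"] >= 1

status = '{"remaining_drive_failures": 1, "remaining_node_failures": 2}'
print(safe_to_offline_node(status))  # True
```

If this returns False, do not take the node offline; resolve existing failures first.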

Step 2: Recuse the Node

Recusing a node safely removes it from the cluster quorum, allowing data protection mechanisms to account for the missing node.

  1. SSH into the node that requires maintenance.

  2. Run the recuse command:

    /opt/qumulo/recuse_node.py --reason "Component replacement"
    
  3. Verify that a red banner appears in the Qumulo Web UI stating:

    Unable to communicate with node X
    
    This banner confirms the node was successfully recused from the cluster.

Step 3: Power Off the Node

After the node is recused, power it off:

Using CIMC

  1. Log in to the node’s CIMC interface.
  2. Navigate to Server > Power.
  3. Click Power Off.

Using the CLI

From another node or management system:

ipmitool -I lanplus -H <CIMC-IP> -U admin -P <password> power off
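
If you script the ipmitool call above, building the argument list in one place avoids shell-quoting mistakes when the CIMC password contains special characters. This is a minimal sketch; the host, user, and password values are placeholders:

```python
import shlex

def ipmi_power_cmd(cimc_host, user, password, action="off"):
    """Assemble the ipmitool power command shown above as an argv list.

    Passing an argv list to subprocess.run() (rather than a shell string)
    sidesteps quoting issues in the password. Values here are placeholders.
    """
    return ["ipmitool", "-I", "lanplus",
            "-H", cimc_host, "-U", user, "-P", password,
            "power", action]

cmd = ipmi_power_cmd("198.51.100.10", "admin", "s3cret")
print(shlex.join(cmd))
```

The same helper works for power on in Step 5 by passing action="on".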

Using the Physical Power Button

Press and hold the power button on the front panel until the server powers off.

Step 4: Replace the Component

Perform the hardware replacement according to Cisco documentation:

DIMM Replacement Notes

When replacing DIMMs:

  1. Follow Cisco’s DIMM population guidelines.
  2. Ensure replacement DIMMs match the specifications of the original.
  3. Use proper ESD precautions.

Fan Replacement Notes (Requiring Node Offline)

If multiple fans need replacement or if the server must be powered off:

  1. Follow Cisco’s fan module replacement procedures.
  2. Ensure all fans are properly seated before powering on.

Step 5: Power On the Node

After completing the component replacement:

  1. Power on the node using the power button, CIMC, or IPMI command.
  2. Wait for the node to complete POST and boot Qumulo Core.
  3. Verify the component is functioning:
    • For DIMMs: Check total memory in CIMC or qq node_status
    • For NICs: Check network connectivity
    • For fans: Check CIMC sensor readings

Step 6: Reintroduce the Node

After verifying the component replacement was successful, reintroduce the node to the cluster:

  1. SSH into the node that was serviced.

  2. Run the reintroduce command:

    /opt/qumulo/sbin/reintroduce_node.sh
    
  3. Monitor the Qumulo Web UI to verify:

    • The red banner disappears
    • The node appears healthy in the cluster overview
    • Data reprotect begins (if applicable)

Step 7: Verify Cluster Health

After reintroduction:

  1. Check cluster protection status:

    qq protection_status_get
    
  2. Verify all nodes are healthy:

    qq cluster_slots_status
    
  3. Monitor the Web UI for any alerts or warnings.
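
The two CLI checks in this step can be combined into a single pass/fail report. As before, the JSON schema for qq cluster_slots_status is an assumption; the protection-status shape matches the Step 1 sample:

```python
import json

def cluster_recovered(protection_json, slots_json):
    """Return True when protection is restored and all slots are healthy.

    Sketch of the Step 7 checks: field names for the slots output are
    illustrative assumptions, not the documented schema.
    """
    protection = json.loads(protection_json)
    slots = json.loads(slots_json)
    node_ok = protection["remaining_node_failures"] >= 1
    slots_ok = all(s["state"] == "healthy" for s in slots)
    return node_ok and slots_ok

protection = '{"remaining_drive_failures": 1, "remaining_node_failures": 2}'
slots = '[{"node_id": 2, "slot": 1, "state": "healthy"}]'
print(cluster_recovered(protection, slots))  # True
```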


Boot Drive Replacement

After you replace the boot drive, you must initialize the replacement boot drive by using the Qumulo Core Installer and then rebuild it by running a script on the node in your cluster.

Step 1: Initialize the Replacement Boot Drive

  1. Create a Qumulo Core USB Drive Installer.

  2. Power on your node, enter the boot menu, and select your USB drive.

    The Qumulo Core Installer begins to run automatically.

  3. When prompted, take the following steps:

    1. Select [x] Perform maintenance.

    2. Select [1] Boot drive reset and then follow the prompts.

    The Qumulo Core Installer initializes the boot drive.

  4. When the process is complete, the node is powered down automatically.

Step 2: Rebuild the Replacement Boot Drive

  1. Power on your node and log in to the node by using the qq CLI.

  2. To get root privileges, run the sudo qsh command.

  3. To stop the Qumulo Networking Services, run the service qumulo-networking stop command.

  4. To configure the IP address for the node, run the ip addr add command and specify the node’s IP address. For example:

    ip addr add 203.0.113.0/24 dev bond0
    
  5. Ensure that the node can ping other nodes in the cluster.

  6. Run the rebuild_boot_drive.py script and specify the IP address of another node in the cluster, the ID of the node whose boot drive has been replaced, and the password of the administrative account of the cluster. For example:

    /opt/qumulo/rebuild_boot_drive.py \
      --address 203.0.113.1 \
      --node-id 2 \
      --username admin \
      --password my\(Special\*Password
    

    Follow the prompts.

  7. When the process is complete, reboot the node.


Getting Help

If you have any questions or encounter issues during component replacement, contact the Qumulo Care Team through Slack, email, or by phone.

For Cisco hardware-specific issues, contact Cisco Support.