Update downstream backup to include RKE2 + restore when etcd down

2026-04-27 00:35:41 +00:00 · 2023-08-01 16:09:32 -07:00
parent e2b8d6ccda
commit 6eef53207e
2 changed files with 217 additions and 19 deletions
--- a/docs/how-to-guides/new-user-guides/backup-restore-and-disaster-recovery/back-up-rancher-launched-kubernetes-clusters.md
+++ b/docs/how-to-guides/new-user-guides/backup-restore-and-disaster-recovery/back-up-rancher-launched-kubernetes-clusters.md
@@ -16,6 +16,9 @@ Snapshots of the etcd database are taken and saved either [locally onto the etcd

 ### Snapshot Components

+<Tabs groupId="k8s-distro">
+<TabItem value="RKE">
+
 When Rancher creates a snapshot, it includes three components:

 - The cluster data in etcd
@@ -24,24 +27,57 @@ When Rancher creates a snapshot, it includes three components:

 Because the Kubernetes version is now included in the snapshot, it is possible to restore a cluster to a prior Kubernetes version.

+</TabItem>
+<TabItem value="RKE2/K3s">
+
+Rancher delegates snapshot creation to the downstream Kubernetes engine. When the Kubernetes engine creates a snapshot, it includes three components:
+
+- The cluster data in etcd
+- The Kubernetes version
+- The cluster configuration
+
+Because the Kubernetes version is included in the snapshot, it is possible to restore a cluster to a prior Kubernetes version while also restoring an etcd snapshot.
+
+</TabItem>
+</Tabs>
+
 The multiple components of the snapshot allow you to select from the following options if you need to restore a cluster from a snapshot:

 - **Restore just the etcd contents:** This restore is similar to restoring to snapshots in Rancher before v2.4.0.
 - **Restore etcd and Kubernetes version:** This option should be used if a Kubernetes upgrade is the reason that your cluster is failing, and you haven't made any cluster configuration changes.
 - **Restore etcd, Kubernetes versions and cluster configuration:** This option should be used if you changed both the Kubernetes version and cluster configuration when upgrading.

-It's always recommended to take a new snapshot before any upgrades.
+It is always recommended to take a new snapshot before performing any configuration changes or upgrades.
+

 ### Generating the Snapshot from etcd Nodes

+<Tabs groupId="k8s-distro">
+<TabItem value="RKE">
+
 For each etcd node in the cluster, the etcd cluster health is checked. If the node reports that the etcd cluster is healthy, a snapshot is created from it and optionally uploaded to S3.

 The snapshot is stored in `/opt/rke/etcd-snapshots`. If the directory is configured on the nodes as a shared mount, it will be overwritten. On S3, the snapshot will always be from the last node that uploads it, as all etcd nodes upload it and the last will remain.

 In the case when multiple etcd nodes exist, any created snapshot is created after the cluster has been health checked, so it can be considered a valid snapshot of the data in the etcd cluster.

+</TabItem>
+<TabItem value="RKE2/K3s">
+
+Snapshots are enabled by default.
+
+The snapshot directory defaults to `/var/lib/rancher/<RUNTIME>/server/db/snapshots`, where `<RUNTIME>` is either `rke2` or `k3s`.
+
+In RKE2, snapshots are stored on each etcd node. If you have multiple etcd or etcd + control-plane nodes, you will have multiple copies of local etcd snapshots.
+
+</TabItem>
+</Tabs>
+
 ### Snapshot Naming Conventions

+<Tabs groupId="k8s-distro">
+<TabItem value="RKE">
+
 The name of the snapshot is auto-generated. The `--name` option can be used to override the name of the snapshot when creating one-time snapshots with the RKE CLI.

 When Rancher creates a snapshot of an RKE cluster, the snapshot name is based on the type (whether the snapshot  is manual or recurring) and the target (whether the snapshot is saved locally or uploaded to S3). The naming convention is as follows:
@@ -58,8 +94,39 @@ Some example snapshot names are:
 - c-9dmxz-ms-t6bjb
 - c-9dmxz-rs-8gxc8

+</TabItem>
+<TabItem value="RKE2/K3s">
+
+The name of the snapshot is auto-generated. The `--name` option can be used to override the base name of the snapshot when creating one-time snapshots with the RKE2 or K3s CLI.
+
+When Rancher creates a snapshot of an RKE2 or K3s cluster, the snapshot name is based on the type (whether the snapshot is manual or recurring) and the target (whether the snapshot is saved locally or uploaded to S3). The naming convention is as follows:
+
+`<name>-<node>-<timestamp>`
+
+`<name>`: The base name set by `--name`. It can be one of the following:
+
+- `etcd-snapshot` is prepended on recurring snapshots
+- `on-demand` is prepended on manual, on-demand snapshots
+
+`<node>`: The name of the node that the snapshot was taken on.
+
+`<timestamp>`: The Unix timestamp of the snapshot creation date.
+
+Some example snapshot names are:
+
+- `on-demand-my-super-rancher-k8s-node1-1652288934`
+- `on-demand-my-super-rancher-k8s-node2-1652288936`
+- `etcd-snapshot-my-super-rancher-k8s-node1-1652289945`
+- `etcd-snapshot-my-super-rancher-k8s-node2-1652289948`
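
As an illustration of the naming convention above, the following Python sketch splits a snapshot name into its three parts. It is not part of Rancher, RKE2, or K3s: the function name `parse_snapshot_name` is hypothetical, and the sketch assumes the base name is one of the two defaults listed above, so a custom `--name` value would need to be added to `KNOWN_BASE_NAMES`.

```python
from datetime import datetime, timezone

# Assumption: only the two default base names from the list above are handled.
KNOWN_BASE_NAMES = ("etcd-snapshot", "on-demand")


def parse_snapshot_name(snapshot_name):
    """Split '<name>-<node>-<timestamp>' into its three components.

    Node names may themselves contain hyphens, so the timestamp is taken
    from the last hyphen-separated field and the base name is matched as
    a known prefix.
    """
    head, _, raw_timestamp = snapshot_name.rpartition("-")
    if not head or not (raw_timestamp.isascii() and raw_timestamp.isdigit()):
        raise ValueError(f"no trailing Unix timestamp in {snapshot_name!r}")
    for base in KNOWN_BASE_NAMES:
        prefix = base + "-"
        if head.startswith(prefix) and len(head) > len(prefix):
            return {
                "name": base,
                "node": head[len(prefix):],
                "created": datetime.fromtimestamp(int(raw_timestamp), tz=timezone.utc),
            }
    raise ValueError(f"unrecognized base name in {snapshot_name!r}")


if __name__ == "__main__":
    parsed = parse_snapshot_name("on-demand-my-super-rancher-k8s-node1-1652288934")
    print(parsed["name"], parsed["node"], parsed["created"].isoformat())
```

Matching the base name as a prefix, rather than splitting on every hyphen, keeps hyphenated node names such as `my-super-rancher-k8s-node1` intact.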
+
+</TabItem>
+</Tabs>
+
 ### How Restoring from a Snapshot Works

+<Tabs groupId="k8s-distro">
+<TabItem value="RKE">
+
 On restore, the following process is used:

 1. The snapshot is retrieved from S3, if S3 is configured.
@@ -68,8 +135,34 @@ On restore, the following process is used:
 4. The other etcd nodes download the snapshot and validate the checksum so that they all use the same snapshot for the restore.
 5.  The cluster is restored and post-restore actions will be done in the cluster.

+</TabItem>
+<TabItem value="RKE2/K3s">
+
+On restore, Rancher delivers a few sets of plans to perform the restoration. A set of phases is used, namely:
+
+- Started
+- Shutdown
+- Restore
+- RestartCluster
+- Finished
+
+If the etcd snapshot restore fails, the phase will be set to `Failed`.
+
+1. The etcd snapshot restore request is received, and depending on `restoreRKEConfig`, the cluster configuration and Kubernetes version are reconciled.
+1. The phase is set to `Started`.
+1. The phase is set to `Shutdown`, and the entire cluster is shut down using plans that run the distribution `killall.sh` script. A new init node is elected. If the snapshot being restored is a local snapshot, the node that the snapshot resides on will be selected as the init node. If the snapshot is being restored from S3, the existing init node will be used.
+1. The phase is set to `Restore`, and the init node has the snapshot restored onto it.
+1. The phase is set to `RestartCluster`, and the cluster is restarted/rejoined to the new init node that has the freshly restored snapshot information.
+1. The phase is set to `Finished`, and the cluster is deemed successfully restored. The `cattle-cluster-agent` will reconnect, and the cluster will finish reconciliation.
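
The phase progression described above can be sketched as a small state machine. This is only an illustrative model of the documented ordering, not Rancher's implementation: the function name `next_restore_phase` is hypothetical, and treating `Finished` and `Failed` as terminal phases is an assumption.

```python
# Phase names as listed above, in the order they are set during a restore.
RESTORE_PHASES = ("Started", "Shutdown", "Restore", "RestartCluster", "Finished")
FAILED = "Failed"


def next_restore_phase(current, step_succeeded=True):
    """Return the phase that follows `current` in the documented order.

    Assumption: `Finished` and `Failed` are terminal, and a failure in any
    other phase moves the restore to `Failed`.
    """
    if current in (FAILED, RESTORE_PHASES[-1]):
        return current
    if not step_succeeded:
        return FAILED
    # Raises ValueError if `current` is not a known phase name.
    return RESTORE_PHASES[RESTORE_PHASES.index(current) + 1]


if __name__ == "__main__":
    phase = RESTORE_PHASES[0]
    while phase != RESTORE_PHASES[-1]:
        print(phase)
        phase = next_restore_phase(phase)
    print(phase)
```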
+
+</TabItem>
+</Tabs>
+
 ## Configuring Recurring Snapshots

+<Tabs groupId="k8s-distro">
+<TabItem value="RKE">
+
 Select how often you want recurring snapshots to be taken as well as how many snapshots to keep. The amount of time is measured in hours. With timestamped snapshots, the user has the ability to do a point-in-time recovery.

 By default, [Rancher launched Kubernetes clusters](../../../pages-for-subheaders/launch-kubernetes-with-rancher.md) are configured to take recurring snapshots (saved to local disk). To protect against local disk failure, using the [S3 Target](#s3-backup-target) or replicating the path on disk is advised.
@@ -85,30 +178,89 @@ In the **Advanced Cluster Options** section, there are several options available
 | Recurring etcd Snapshot Creation Period | Time in hours between recurring snapshots| 12 hours |
 | Recurring etcd Snapshot Retention Count | Number of snapshots to retain| 6 |

+</TabItem>
+<TabItem value="RKE2/K3s">
+
+Set the schedule for when recurring snapshots are taken and how many snapshots to keep. The schedule uses conventional cron format. The retention policy dictates the number of snapshots matching a name to keep per node.
+
+By default, [Rancher launched Kubernetes clusters](../../../pages-for-subheaders/launch-kubernetes-with-rancher.md) are configured to take recurring snapshots (saved to local disk) every 5 hours starting at 12 AM. To protect against local disk failure, using the [S3 Target](#s3-backup-target) or replicating the path on disk is advised.
+
+When provisioning or editing the cluster, the configuration for snapshots can be found under **Cluster Configuration**. Click **etcd**.
+
+| Option | Description | Default Value|
+| --- | ---| --- |
+| Recurring etcd Snapshot Enabled | Enable/Disable recurring snapshots | Yes |
+| Recurring etcd Snapshot Creation Period | Cron schedule for recurring snapshots | `0 */5 * * *` |
+| Recurring etcd Snapshot Retention Count | Number of snapshots to retain | 5 |
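
To make the default schedule in the table above concrete, the following Python sketch lists the hours matched by an hour field of `*/5` under standard cron step semantics, along with the gap between consecutive runs. The helper names are hypothetical and the sketch does not call any Rancher or cron library.

```python
def matching_hours(step, hours_in_day=24):
    """Return the hours of the day matched by a cron hour field of `*/step`."""
    return [hour for hour in range(hours_in_day) if hour % step == 0]


def gaps_between_runs(hours, hours_in_day=24):
    """Return the gap in hours from each run to the next, wrapping past midnight."""
    return [
        (hours[(i + 1) % len(hours)] - hour) % hours_in_day or hours_in_day
        for i, hour in enumerate(hours)
    ]


if __name__ == "__main__":
    hours = matching_hours(5)
    print("Snapshot hours:", hours)
    print("Gaps between runs:", gaps_between_runs(hours))
```

Because the step restarts at midnight each day, the gap between the last run of one day and the first run of the next is shorter than five hours.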
+
+</TabItem>
+</Tabs>
+
 ## One-Time Snapshots

+<Tabs groupId="k8s-distro">
+<TabItem value="RKE">
+
 In addition to recurring snapshots, you may want to take a "one-time" snapshot. For example, before upgrading the Kubernetes version of a cluster it's best to backup the state of the cluster to protect against upgrade failure.

 1. In the upper left corner, click **☰ > Cluster Management**.
 1. On the **Clusters** page, navigate to the cluster where you want to take a one-time snapshot.
 1. Click **⋮ > Take Snapshot**.

+</TabItem>
+<TabItem value="RKE2/K3s">
+
+In addition to recurring snapshots, you may want to take a "one-time" snapshot. For example, before upgrading the Kubernetes version of a cluster, it's best to back up the state of the cluster to protect against upgrade failure.
+
+1. In the upper left corner, click **☰ > Cluster Management**.
+1. On the **Clusters** page, navigate to the cluster where you want to take a one-time snapshot.
+1. Navigate to the **Snapshots** tab and click **Snapshot Now**.
+
+### How Taking One-Time Snapshots Works
+
+On one-time snapshot creation, Rancher delivers a few sets of plans to perform snapshot creation. A set of phases is used, namely:
+
+- Started
+- RestartCluster
+- Finished
+
+If the etcd snapshot creation fails, the phase will be set to `Failed`.
+
+1. The etcd snapshot creation request is received.
+1. The phase is set to `Started`. All etcd nodes in the cluster receive a plan to create an etcd snapshot, per the cluster configuration.
+1. The phase is set to `RestartCluster`, and the plans on every etcd node are reset to the original plan for the etcd nodes.
+1. The phase is set to `Finished`.
+
+</TabItem>
+</Tabs>
+
 **Result:** Based on your [snapshot backup target](#snapshot-backup-targets), a one-time snapshot will be taken and saved in the selected backup target.

 ## Snapshot Backup Targets

 Rancher supports two different backup targets:

-* [Local Target](#local-backup-target)
-* [S3 Target](#s3-backup-target)
+- [Local Target](#local-backup-target)
+- [S3 Target](#s3-backup-target)

 ### Local Backup Target

+<Tabs groupId="k8s-distro">
+<TabItem value="RKE">
+
 By default, the `local` backup target is selected. The benefits of this option is that there is no external configuration. Snapshots are automatically saved locally to the etcd nodes in the [Rancher launched Kubernetes clusters](../../../pages-for-subheaders/launch-kubernetes-with-rancher.md) in `/opt/rke/etcd-snapshots`. All recurring snapshots are taken at configured intervals. The downside of using the `local` backup target is that if there is a total disaster and _all_ etcd nodes are lost, there is no ability to restore the cluster.

+</TabItem>
+<TabItem value="RKE2/K3s">
+
+By default, the `local` backup target is selected. The benefit of this option is that there is no external configuration. Snapshots are automatically saved locally to the etcd nodes in the [Rancher launched Kubernetes clusters](../../../pages-for-subheaders/launch-kubernetes-with-rancher.md) in `/var/lib/rancher/<runtime>/server/db/snapshots`, where `<runtime>` is either `k3s` or `rke2`. All recurring snapshots are taken per the cron schedule. The downside of using the `local` backup target is that if there is a total disaster and _all_ etcd nodes are lost, there is no ability to restore the cluster.
+
+</TabItem>
+</Tabs>
+
 ### S3 Backup Target

-The `S3` backup target allows users to configure a S3 compatible backend to store the snapshots. The primary benefit of this option is that if the cluster loses all the etcd nodes, the cluster can still be restored as the snapshots are stored externally. Rancher recommends external targets like `S3` backup, however its configuration requirements do require additional effort that should be considered.
+The `S3` backup target allows users to configure an S3-compatible backend to store the snapshots. The primary benefit of this option is that if the cluster loses all the etcd nodes, the cluster can still be restored because the snapshots are stored externally. Rancher recommends external targets like `S3` backup; however, its configuration requires additional effort that should be considered. Additionally, ensure that every cluster has a unique bucket and/or folder, as Rancher populates snapshot information for any available snapshot listed in the S3 bucket/folder configured for the cluster.

 | Option | Description | Required|
 |---|---|---|
@@ -127,10 +279,10 @@ The backup snapshot can be stored on a custom `S3` backup like [minio](https://m

 The `S3` backup target supports using IAM authentication to AWS API in addition to using API credentials. An IAM role gives temporary permissions that an application can use when making API calls to S3 storage. To use IAM authentication, the following requirements must be met:

- - The cluster etcd nodes must have an instance role that has read/write access to the designated backup bucket.
- - The cluster etcd nodes must have network access to the specified S3 endpoint.
- - The Rancher Server worker node(s) must have an instance role that has read/write to the designated backup bucket.
- - The Rancher Server worker node(s) must have network access to the specified S3 endpoint.
+- The cluster etcd nodes must have an instance role that has read/write access to the designated backup bucket.
+- The cluster etcd nodes must have network access to the specified S3 endpoint.
+- The Rancher Server worker node(s) must have an instance role that has read/write access to the designated backup bucket.
+- The Rancher Server worker node(s) must have network access to the specified S3 endpoint.

 To give an application access to S3, refer to the AWS documentation on [Using an IAM Role to Grant Permissions to Applications Running on Amazon EC2 Instances.](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_use_switch-role-ec2.html)

@@ -142,14 +294,10 @@ The list of all available snapshots for the cluster is available in the Rancher
 1. In the **Clusters** page, go to the cluster where you want to view the snapshots and click its name.
 1. Click the **Snapshots** tab to view the list of saved snapshots. These snapshots include a timestamp of when they were created.

-## Safe Timestamps
+## Safe Timestamps (RKE)

 Snapshot files are timestamped to simplify processing the files using external tools and scripts, but in some S3 compatible backends, these timestamps were unusable.

 The option `safe_timestamp` is added to support compatible file names. When this flag is set to `true`, all special characters in the snapshot filename timestamp are replaced.

 This option is not available directly in the UI, and is only available through the `Edit as Yaml` interface.
-
-## Enabling Snapshot Features for Clusters Created Before Rancher v2.2.0
-
-If you have any Rancher launched Kubernetes clusters that were created before v2.2.0, after upgrading Rancher, you must [edit the cluster](../../../pages-for-subheaders/cluster-configuration.md) and _save_ it, in order to enable the updated snapshot features. Even if you were already creating snapshots before v2.2.0, you must do this step as the older snapshots will not be available to use to [back up and restore etcd through the UI](restore-rancher-launched-kubernetes-clusters-from-backup.md).
--- a/docs/how-to-guides/new-user-guides/backup-restore-and-disaster-recovery/restore-rancher-launched-kubernetes-clusters-from-backup.md
+++ b/docs/how-to-guides/new-user-guides/backup-restore-and-disaster-recovery/restore-rancher-launched-kubernetes-clusters-from-backup.md
@@ -41,12 +41,66 @@ To restore snapshots from S3, the cluster needs to be configured to [take recurr
 1. In the upper left corner, click **☰ > Cluster Management**.
 1. In the **Clusters** page, go to the cluster where you want to view the snapshots and click the name of the cluster.
 1. Click the **Snapshots** tab to view the list of saved snapshots.
-1. Go to the snapshot you want to restore and click **⋮ > Restore Snapshot**.
+1. Go to the snapshot you want to restore and click **⋮ > Restore**.
+1. Select a **Restore Type**.
 1. Click **Restore**.

 **Result:** The cluster will go into `updating` state and the process of restoring the `etcd` nodes from the snapshot will start. The cluster is restored when it returns to an `active` state.

-## Recovering etcd without a Snapshot
+## Restoring a Cluster From a Snapshot When the controlplane/etcd Are Completely Unavailable
+
+In a disaster recovery scenario, the control plane and etcd nodes managed by Rancher in a downstream cluster may no longer be available or functioning. The cluster can be rebuilt by adding control plane and etcd nodes again, followed by restoring from an available snapshot.
+
+<Tabs groupId="k8s-distro">
+<TabItem value="RKE">
+
+Follow the procedure described in the [SUSE Knowledgebase](https://www.suse.com/support/kb/doc/?id=000020695).
+
+</TabItem>
+<TabItem value="RKE2/K3s">
+
+:::note
+
+Due to a [known issue](https://github.com/rancher/rancher/issues/41080), this procedure requires Rancher v2.7.5 or newer.
+
+:::
+
+:::note
+
+If you are using [local snapshots](./back-up-rancher-launched-kubernetes-clusters.md#local-backup-target), it is **VERY** important that you back up the snapshot you want to restore from. Copy it out of the `/var/lib/rancher/<k3s/rke2>/server/db/snapshots/` folder on the etcd node you are going to remove, then copy it into the `/var/lib/rancher/<k3s/rke2>/server/db/snapshots/` folder on your new node. Furthermore, if you are using local snapshots and restoring to a new node, restoration cannot currently be done via the UI.
+
+:::
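
The backup step in the note above could be scripted along the following lines. This is a minimal sketch, not a Rancher tool: the function names and the `root` parameter are hypothetical (`root` exists only so the sketch can be tried against a scratch directory), and the destination directory is whatever safe location you choose. It assumes it is run on the etcd node with permission to read the snapshot directory.

```python
import shutil
from pathlib import Path


def default_snapshot_dir(runtime, root="/"):
    """Return the default local snapshot directory for `rke2` or `k3s`."""
    if runtime not in ("rke2", "k3s"):
        raise ValueError(f"unsupported runtime: {runtime!r}")
    return Path(root) / "var/lib/rancher" / runtime / "server/db/snapshots"


def back_up_snapshot(runtime, snapshot_filename, destination_dir, root="/"):
    """Copy one local snapshot file to `destination_dir` and return the new path."""
    source = default_snapshot_dir(runtime, root) / snapshot_filename
    if not source.is_file():
        raise FileNotFoundError(source)
    destination_dir = Path(destination_dir)
    destination_dir.mkdir(parents=True, exist_ok=True)
    # copy2 also preserves file metadata such as the modification time.
    return Path(shutil.copy2(source, destination_dir / snapshot_filename))
```

The same helper can be pointed at the new node's snapshot directory as the destination when copying the file back before the restore.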
+
+1. Remove all etcd nodes from your cluster.
+
+    1. In the upper left corner, click **☰ > Cluster Management**.
+    1. In the **Clusters** page, go to the cluster where you want to remove nodes.
+    1. In the **Machines** tab, click **⋮ > Delete** on each node you want to delete. Initially, the nodes hang in a `deleting` state, but once all etcd nodes are deleting, they are removed together. This is because Rancher sees that all etcd nodes are deleting and "short circuits" the etcd safe-removal logic.
+
+1. After all etcd nodes are removed, add a new etcd node that you are planning to restore from.
+
+    - For custom clusters, go to the **Registration** tab then copy and run the registration command on your node. If the node has previously been used in a cluster, [clean the node](../manage-clusters/clean-cluster-nodes.md#cleaning-up-nodes) first.
+    - For node driver clusters, a new node is provisioned automatically.
+
+    At this point, Rancher will indicate that restoration from etcd snapshot is required.
+
+1. Restore from an etcd snapshot.
+
+    - For S3 snapshots, restore using the UI.
+      1. Click the **Snapshots** tab to view the list of saved snapshots.
+      1. Go to the snapshot you want to restore and click **⋮ > Restore**.
+      1. Select a **Restore Type**.
+      1. Click **Restore**.
+    - For local snapshots, restoring through the UI is **not** available. Edit the cluster YAML instead.
+      1. In the upper right corner, click **⋮ > Edit YAML**.
+      1. Define `spec.cluster.rkeConfig.etcdSnapshotRestore.name` as the filename of the snapshot on disk in `/var/lib/rancher/<k3s/rke2>/server/db/snapshots/`.
+
+1. After restoration is successful, you can scale your etcd nodes back up to the desired redundancy.
+
+</TabItem>
+</Tabs>
+
+## Recovering etcd without a Snapshot (RKE)

 If the group of etcd nodes loses quorum, the Kubernetes cluster will report a failure because no operations, e.g. deploying workloads, can be executed in the Kubernetes cluster. The cluster should have three etcd nodes to prevent a loss of quorum. If you want to recover your set of etcd nodes, follow these instructions:

@@ -75,7 +129,3 @@ If the group of etcd nodes loses quorum, the Kubernetes cluster will report a fa
 5. Run the revised command.

 6. After the single nodes is up and running, Rancher recommends adding additional etcd nodes to your cluster. If you have a [custom cluster](../../../pages-for-subheaders/use-existing-nodes.md) and you want to reuse an old node, you are required to [clean up the nodes](../manage-clusters/clean-cluster-nodes.md) before attempting to add them back into a cluster.
-
-## Enabling Snapshot Features for Clusters Created Before Rancher v2.2.0
-
-If you have any Rancher launched Kubernetes clusters that were created before v2.2.0, after upgrading Rancher, you must [edit the cluster](../../../pages-for-subheaders/cluster-configuration.md) and _save_ it, in order to enable the updated snapshot features. Even if you were already creating snapshots before v2.2.0, you must do this step as the older snapshots will not be available to use to [back up and restore etcd through the UI](restore-rancher-launched-kubernetes-clusters-from-backup.md).