Operate Vault in recovery mode
Challenge
In exceptional circumstances, you might need to troubleshoot a Vault server that has become unavailable for general use, for example after a configuration change.
Recovery via snapshot is not always viable in such cases, often because the root cause of the issue prevents the Vault server from starting or servicing user requests.
Diagnosing and resolving such outages can require access to the storage at a low level that is not possible with a running Vault cluster.
Solution
Users of Vault version 1.3.0 or higher can operate Vault in recovery mode to troubleshoot and recover from some extreme circumstances when other methods are unavailable.
Recovery mode allows for direct low level interaction with raw portions of the internal storage for any supported storage type.
Vault limits recovery mode operation to list, read, delete, and write operations against keys and values contained under the root path /sys/raw/.
While operating in recovery mode, Vault is not available for responding to standard user requests, and instead just provides the minimum functionality required for maintenance and recovery purposes.
You can learn more about operating Vault in recovery mode by following the lab in this tutorial.
Warning
Ensure you have a backup or snapshot of the Vault server data before using any of the information from this tutorial in a live setting.
Prerequisites
To perform the steps in this tutorial, you need Vault. The Community Edition is suitable for this tutorial.
The Vault foundations tutorials are a great starting point if you are not familiar with Vault.
Some examples use jq to format JSON output, but do not strictly require it.
Prepare environment
Create a temporary directory to contain the work you will do in this scenario, and assign its path to the environment variable LEARN_VAULT.
$ mkdir -p /tmp/learn-vault-recovery/data && \
export LEARN_VAULT=/tmp/learn-vault-recovery
Write the example configuration
You will begin the scenario with the example configuration file, vault-server.hcl.
Write it to the scenario home directory.
$ cat > "${LEARN_VAULT}"/vault-server.hcl << EOF
api_addr = "http://127.0.0.1:8200"
cluster_addr = "http://127.0.0.1:8201"
cluster_name = "learn-recovery-server"
default_lease_ttl = "10h"
disable_mlock = true
max_lease_ttl = "10h"
pid_file = "$LEARN_VAULT/pidfile"
ui = true
listener "tcp" {
address = "127.0.0.1:8200"
tls_disable = "true"
}
backend "file" {
path = "$LEARN_VAULT/data"
node_id = "learn-recovery-server"
}
EOF
Insecure operation
The listener stanza disables TLS (tls_disable = "true"). In production, Vault should always use TLS to enable secure communication between clients and the Vault server. It requires a certificate file and key file on each Vault host.
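Because the heredoc delimiter EOF is unquoted, the shell expands $LEARN_VAULT while writing the file, so the saved configuration should contain absolute paths rather than the variable name. The following is a minimal sketch of a check for that, assuming a POSIX shell. The function name config_has_unexpanded_var is hypothetical and not part of Vault.

```shell
# Succeed (exit 0) when FILE still contains a literal, unexpanded
# $LEARN_VAULT reference, which would suggest the heredoc delimiter was quoted.
# config_has_unexpanded_var is a hypothetical helper, not a Vault command.
config_has_unexpanded_var() {
  grep -q '\$LEARN_VAULT' "$1"
}
```

In this scenario, you could call it as follows:
$ config_has_unexpanded_var "$LEARN_VAULT/vault-server.hcl" && echo "unexpanded variable found"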
Start Vault server
$ vault server -config $LEARN_VAULT/vault-server.hcl
==> Vault server configuration:
Api Address: http://127.0.0.1:8200
Cgo: disabled
Cluster Address: https://127.0.0.1:8201
Go Version: go1.16.5
Listener 1: tcp (addr: "127.0.0.1:8200", cluster address: "127.0.0.1:8201", max_request_duration: "1m30s", max_request_size: "33554432", tls: "disabled")
Log Level: info
Mlock: supported: false, enabled: false
Recovery Mode: false
Storage: file
Version: Vault v1.8.0
Version Sha: 82a99f14eb6133f99a975e653d4dac21c17505c7
==> Vault server started! Log data will stream in below:
2021-08-12T15:21:47.361-0400 [INFO] proxy environment: http_proxy="" https_proxy="" no_proxy=""
Initialize, unseal, and login
In another terminal session, export the VAULT_ADDR environment variable to address the Vault server.
$ export VAULT_ADDR=http://127.0.0.1:8200
Initialize Vault, and write initialization output to the file named .vault_init in the temporary scenario directory specified by $LEARN_VAULT.
$ vault operator init \
-key-shares=1 \
-key-threshold=1 \
> $LEARN_VAULT/.vault_init
Insecure operation
Do not run an unsealed Vault in production with a single key share and a single key threshold. This approach is just used here to simplify the unsealing process for this demonstration.
Set the environment variable UNSEAL_KEY with the unseal key as its value.
$ UNSEAL_KEY="$(grep 'Unseal Key 1' "$LEARN_VAULT/.vault_init" | awk '{print $NF}')"
Unseal Vault.
$ vault operator unseal "$UNSEAL_KEY"
Key Value
--- -----
Seal Type shamir
Initialized true
Sealed false
Total Shares 1
Threshold 1
Version 1.8.0
Storage Type file
Cluster Name learn-recovery-server
Cluster ID 5820993a-bde7-b8f9-9894-b5fe07378833
HA Enabled false
Set the environment variable ROOT_TOKEN to the value of the initial root token.
$ ROOT_TOKEN=$(grep 'Initial Root Token' "$LEARN_VAULT/.vault_init" | awk '{print $NF}')
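The grep and awk pipelines above both take the last field of a labeled line in the initialization output. A minimal sketch of the same idea as one reusable function follows, assuming a POSIX shell and the plain-text init output format used in this tutorial. The name init_field is hypothetical and not part of the vault CLI. It matches the label only at the start of a line and stops at the first match.

```shell
# Print the last whitespace-separated field of the first line in FILE that
# starts with LABEL, for example "Unseal Key 1" or "Initial Root Token".
# init_field is a hypothetical helper, not part of the vault CLI.
init_field() {
  awk -v label="$1" 'index($0, label) == 1 { print $NF; exit }' "$2"
}
```

In this scenario, you could call it as follows:
$ UNSEAL_KEY="$(init_field 'Unseal Key 1' "$LEARN_VAULT/.vault_init")"
$ ROOT_TOKEN="$(init_field 'Initial Root Token' "$LEARN_VAULT/.vault_init")"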
Note
For the purpose of this tutorial, you can use the root token to work with Vault. However, you should use root tokens just for initial setup or in emergencies. As a best practice, use tokens with an appropriate set of policies based on your role in the organization.
Authenticate to Vault with the initial root token.
$ vault login -no-print "$ROOT_TOKEN"
Confirm that you've authenticated to Vault with the initial root token by checking that the token has the root policy attached.
$ vault token lookup | grep policies
policies [root]
You are now prepared to begin the scenario.
Scenario Introduction
To explore a Vault server running in recovery mode, you will perform the following:
- Run a Vault server using filesystem storage.
- Log in with the initial root token, enable an audit device, and configure resource quotas.
- Stop the Vault server.
- Start the server again in recovery mode.
- Generate a recovery mode token, and use that token to perform some basic examination of the storage items through the /sys/raw endpoint.
Enable file audit device and resource quota
Enable an audit device and some simple configuration in Vault so that you get a better picture of the data in Vault later through the lens of recovery mode.
Enable a file audit device with output to the file at $LEARN_VAULT/audit.log.
$ vault audit enable file file_path=$LEARN_VAULT/audit.log
Success! Enabled the file audit device at: file/
Configure resource quotas to exempt the path sys/health from rate limiting, and to enable rate limit audit logging and rate limit response headers.
You will examine this information later as an example of configuration that you can change in recovery mode, for instance to unblock the server from an undesired behavior.
$ vault write /sys/quotas/config \
rate_limit_exempt_paths=sys/health \
enable_rate_limit_audit_logging=true \
enable_rate_limit_response_headers=true
Output:
Success! Data written to: sys/quotas/config
Stop Vault server
Return to the terminal session where you started the Vault server.
Press CTRL+C (or CTRL+BREAK on Windows) to stop the Vault server.
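The warning at the start of this tutorial asks for a backup before any recovery work. For the file storage used in this scenario, one simple approach is to archive the data directory while the server is stopped. The following is a minimal sketch under that assumption. The function name backup_data_dir and the archive naming are hypothetical, and other storage types have their own snapshot mechanisms that this sketch does not cover.

```shell
# Archive the storage directory SRC into a timestamped tarball under DEST_DIR
# and print the archive path. Run this only while the Vault server is stopped.
# backup_data_dir and the archive naming are hypothetical, not Vault features.
backup_data_dir() {
  src=$1
  dest_dir=$2
  archive="$dest_dir/vault-data-$(date +%Y%m%d%H%M%S).tar.gz"
  mkdir -p "$dest_dir" &&
    tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" &&
    printf '%s\n' "$archive"
}
```

In this scenario, you could call it as follows:
$ backup_data_dir "$LEARN_VAULT/data" "$LEARN_VAULT/backup"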
Start server in recovery mode
The /sys/raw API endpoint is not enabled by default. You must start the Vault server in recovery mode, then generate a recovery mode operation token to access the /sys/raw endpoint.
You will use that token for all operations in this scenario.
Start Vault server in recovery mode.
$ vault server -config $LEARN_VAULT/vault-server.hcl -recovery
Notice from the output that the server is now running in recovery mode.
==> Vault server configuration:
Seal Type: shamir
Cluster Address: http://127.0.0.1:8201
Go Version: go1.16.5
Log Level: info
Recovery Mode: true
Storage: file
Version: Vault v1.8.0
Version Sha: 82a99f14eb6133f99a975e653d4dac21c17505c7
==> Vault server started! Log data will stream in below:
2021-08-16T11:21:59.106-0400 [INFO] proxy environment: http_proxy="" https_proxy="" no_proxy=""
This same information would typically be present in the server logs of a production Vault.
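When the startup output is captured in a log file, a quick text search can confirm which mode the server started in. The following is a minimal sketch, assuming the log contains the "Recovery Mode: true" line shown above. The function name started_in_recovery_mode is hypothetical, and so is the log path in the usage line.

```shell
# Succeed when startup output read from FILE reports recovery mode as enabled.
# The match is based on the "Recovery Mode: true" line shown in this tutorial.
# started_in_recovery_mode is a hypothetical helper, not part of Vault.
started_in_recovery_mode() {
  grep -Eq 'Recovery Mode:[[:space:]]*true' "$1"
}
```

For example, with a hypothetical log location:
$ started_in_recovery_mode /var/log/vault/server.log && echo "server started in recovery mode"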
Generate recovery mode operation token
All examples of querying the /sys/raw endpoint demonstrated in this tutorial require the use of a recovery mode operation token. You will generate one here with the vault CLI using vault operator generate-root.
Return to the other terminal session where you first authenticated with Vault, and generate a one-time password (OTP).
$ vault operator generate-root -generate-otp -recovery-token
l5T1Uym6Fz5ogWOYTzSBAUj7cD
Use the OTP value to initialize the token generation process.
$ vault operator generate-root -init \
-otp=l5T1Uym6Fz5ogWOYTzSBAUj7cD \
-recovery-token
Example output:
Nonce efbe7aa1-2029-89e0-09c1-a45bd3822d4c
Started true
Progress 0/1
Complete false
OTP Length 26
You must pass in a quorum of unseal or recovery keys as necessary to generate an encoded token. For this scenario, you pass in just the single unseal key value.
Set the environment variable UNSEAL_KEY with the unseal key as its value.
$ UNSEAL_KEY="$(grep 'Unseal Key 1' "$LEARN_VAULT/.vault_init" | awk '{print $NF}')"
Generate the encoded token.
$ vault operator generate-root \
-nonce efbe7aa1-2029-89e0-09c1-a45bd3822d4c \
-recovery-token $UNSEAL_KEY
Successful output resembles this example, and includes the encoded token.
Nonce efbe7aa1-2029-89e0-09c1-a45bd3822d4c
Started true
Progress 1/1
Complete true
Encoded Token HhtmRzA9P0ABFVkGIRQaKQAZBQQ1LQRQMDY
Decode the encoded token to generate the recovery mode operation token.
$ vault operator generate-root \
-decode=HhtmRzA9P0ABFVkGIRQaKQAZBQQ1LQRQMDY \
-otp=l5T1Uym6Fz5ogWOYTzSBAUj7cD \
-recovery-token $UNSEAL_KEY
Example output:
r.2veDRvGoliFCUpTcVFtxngSr
Note that the returned token value has the prefix "r.", designating it as a recovery mode operation token.
Use the value of this recovery mode operation token for all examples of listing and reading /sys/raw/... paths throughout the tutorial.
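Before using a decoded value, it can help to confirm that it has the recovery token prefix, since a regular token will not work against /sys/raw in recovery mode. The following is a minimal sketch, assuming a POSIX shell. The function name is_recovery_token is hypothetical, and the check looks at the format only.

```shell
# Succeed when TOKEN starts with "r.", the prefix this tutorial shows for
# recovery mode operation tokens. This checks the format only; it does not
# contact Vault or prove that the token is valid.
# is_recovery_token is a hypothetical helper, not part of the vault CLI.
is_recovery_token() {
  case "$1" in
    r.?*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example:
$ is_recovery_token r.2veDRvGoliFCUpTcVFtxngSr && echo "looks like a recovery token"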
Examine storage paths
First, list the top level sys/raw/ path.
$ VAULT_TOKEN=r.2veDRvGoliFCUpTcVFtxngSr vault list sys/raw
Keys
----
core/
logical/
sys/
While Vault encrypts all sensitive secret values, configuration information written to Vault without sensitive content gets stored as plaintext or JSON.
For example, you can find audit device information in the core/audit key, which itself holds a single key named value. You can read the key, and pass its value to jq for a prettier version.
$ VAULT_TOKEN=r.2veDRvGoliFCUpTcVFtxngSr vault read \
-field=value \
sys/raw/core/audit | jq
Example output:
{
"type": "audit",
"entries": [
{
"table": "audit",
"path": "file/",
"type": "file",
"description": "",
"uuid": "d2e93952-0eb9-61f5-4f84-eb3f5ca5979b",
"backend_aware_uuid": "",
"accessor": "audit_file_4ae38500",
"config": {},
"options": {
"file_path": "/tmp/learn-vault-recovery/audit.log"
},
"local": false,
"seal_wrap": false,
"namespace_id": "root"
}
]
}
This information corresponds precisely to the file based audit device you enabled earlier.
Tip
When troubleshooting production Vault servers with blocked audit devices, listing this information helps you learn the target file, network port, or socket for the purposes of unblocking the device.
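A blocked file audit device is often tied to its target path, so pulling the file_path out of the audit table and checking its directory is a natural first step. The following is a minimal sketch that avoids a jq dependency, assuming JSON shaped like the example output above and paths without escaped quotes. Both function names are hypothetical, and the text matching is a rough stand-in for a real JSON parser.

```shell
# Print every "file_path" value found in audit-table JSON read on stdin.
# Assumes paths contain no escaped double quotes. Hypothetical helper.
audit_file_paths() {
  grep -o '"file_path":[[:space:]]*"[^"]*"' |
    sed 's/^"file_path":[[:space:]]*"//; s/"$//'
}

# Succeed when the directory that should hold the audit log at PATH exists
# and is writable by the current user. Hypothetical helper.
audit_dir_writable() {
  dir=$(dirname "$1")
  [ -d "$dir" ] && [ -w "$dir" ]
}
```

In this scenario, you could list the audit file paths as follows:
$ VAULT_TOKEN=r.2veDRvGoliFCUpTcVFtxngSr vault read -field=value sys/raw/core/audit | audit_file_paths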
Now list the resource quotas path, sys/raw/sys/quotas.
$ VAULT_TOKEN=r.2veDRvGoliFCUpTcVFtxngSr vault list sys/raw/sys/quotas/
Keys
----
config
default_rate_limit_exempt_paths_toggle
The returned keys contain the resource quota configuration you wrote earlier. Again, there is a single key named value containing the JSON configuration.
$ VAULT_TOKEN=r.2veDRvGoliFCUpTcVFtxngSr vault read \
-field=value \
sys/raw/sys/quotas/config | jq
Example output:
{
"enable_rate_limit_audit_logging": true,
"enable_rate_limit_response_headers": true,
"rate_limit_exempt_paths": ["sys/health"]
}
The configuration details match what you wrote earlier in the enable resource quota step before starting Vault in recovery mode.
Most extreme troubleshooting scenarios which require recovery mode typically involve more than listing or reading keys and values. You typically also need to delete particular keys related to the functionality that is blocking operations.
Warning
Exercise extreme caution when using delete or write operations while in recovery mode. Always validate the key name and contents, and have a snapshot from a time before the modifications at hand before performing any operation that writes to the storage. Enterprise users can coordinate with HashiCorp Customer Success for help with this process.
Feel free to explore the other keys and values, and when you finish, you can clean up the scenario environment.
Cleanup
You can clean up from this scenario by following these steps.
From the terminal session where the Vault server is running, press CTRL+C (or CTRL+BREAK on Windows) to stop the server.
Remove the data created in the scenario.
$ rm -rf "$LEARN_VAULT"
Unset environment variables.
$ unset ROOT_TOKEN UNSEAL_KEY VAULT_ADDR
Unset environment variables in the other terminal.
$ unset UNSEAL_KEY VAULT_ADDR
Usage tips
Here are some tips to keep in mind when using recovery mode in production.
- Always have a recent snapshot available to restore from if you must revert any changes made in recovery mode.
- Review the Recovery Mode documentation, which describes the required -recovery runtime configuration flag. You should refer to that documentation before configuring your Vault server startup script to start Vault in recovery mode.
- When using the vault CLI, formatting output as JSON with the flag -format=json can often help with listing items which you need to iterate over.
- Be sure to update your Vault server startup script to remove -recovery from the flags so that you can start the server for regular operation when recovery mode operation is complete.
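A leftover -recovery flag in a startup script keeps the server out of regular operation, so a quick check before restarting can be worthwhile. The following is a minimal sketch, assuming the flag appears as a standalone word in a script or service unit file. The function name has_recovery_flag and the file path in the usage line are hypothetical, and other spellings such as -recovery=true are not detected.

```shell
# Succeed when FILE (a startup script or service unit) passes a standalone
# -recovery or --recovery flag. Other forms, such as -recovery=true, are not
# detected. has_recovery_flag is a hypothetical helper, not part of Vault.
has_recovery_flag() {
  grep -Eq -e '(^|[[:space:]])--?recovery([[:space:]]|$)' "$1"
}
```

For example, with a hypothetical unit file location:
$ has_recovery_flag /etc/systemd/system/vault.service && echo "remove -recovery before restarting"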
Summary
You learned how to operate a Vault server in recovery mode, and how to generate and use a recovery mode operation token.
You also learned how to examine information in the low-level storage using the recovery mode operation token, with an emphasis on caution around write operations.