Don’t SSH into Production

Routine system administration tasks should be handled through automation and services, in code, rather than by hand. Rarely needing to log in to system consoles for routine maintenance can be seen as an indicator of capability maturity. Logins to critical servers via SSH should be audited to determine who accessed the servers and what they did. Auditing gets complex when SSH access is the standard policy, and more so once you consider cases like SSH forwarding and tunneling.
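To make the "who accessed the servers" half of that audit concrete, here is a minimal Python sketch that summarizes successful logins from sshd log lines. It assumes the usual OpenSSH "Accepted <method> for <user> from <host> port <port>" message; the function name `summarize_logins` and the sample lines are hypothetical and only for illustration. It says nothing about what a user did after logging in, which is the harder half of the problem.

```python
import re
from collections import defaultdict

# Assumption: sshd logs successful logins as
# "Accepted <method> for <user> from <host> port <port> ...".
ACCEPTED = re.compile(
    r"Accepted (?P<method>\S+) for (?P<user>\S+) "
    r"from (?P<host>\S+) port (?P<port>\d+)"
)


def summarize_logins(lines):
    """Return {user: [(source_host, auth_method), ...]} for successful logins."""
    logins = defaultdict(list)
    for line in lines:
        match = ACCEPTED.search(line)
        if match:
            logins[match["user"]].append((match["host"], match["method"]))
    return dict(logins)


# Hypothetical sample lines, not taken from a real system.
SAMPLE_LOG = [
    "Mar  3 10:15:01 web1 sshd[1201]: Accepted publickey for alice from 10.0.0.5 port 51234 ssh2",
    "Mar  3 10:16:40 web1 sshd[1207]: Failed password for bob from 10.0.0.9 port 40022 ssh2",
    "Mar  3 11:02:13 web1 sshd[1302]: Accepted password for carol from 10.0.0.7 port 38811 ssh2",
]

if __name__ == "__main__":
    for user, entries in summarize_logins(SAMPLE_LOG).items():
        print(user, entries)
```

In practice these lines would come from the log aggregation system rather than from a file read on the server itself.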

As a test, before logging in to a server to carry out a task, ask yourself the following:

• Was this task tested first in a dev/QA/test environment?
• Is this a one-off task (versus a routine task or request)?

If you answer no to either question, you should reconsider your workflow and think of ways to automate away the kind of work you SSH for.

Let’s review some common reasons a cloud engineer would want to log in to a server:

• To examine logs, like application, container, or operating system logs. This is a solved problem. Using a stack like Elasticsearch, Fluentd, and Kibana, or a third-party logging service in the cloud, will provide log aggregation, search, visualization, and permanent storage capabilities, with a proper life cycle and backups.

• For monitoring, to look at server telemetry like CPU/RAM/disk usage or exposed application performance metrics. This is also a solved problem; we have a myriad of commercial and open source tools at our disposal.

• For routine changes in the system, such as making configuration changes, patching the operating system, managing software installations and upgrades, and performing backups and restores. All these changes should ideally be done using infrastructure as code. We declare our infrastructure in code (which we keep versioned) and make our changes in that code. Then, depending on our workflow, philosophy, and tooling, we can use configuration management tools, or we can re-create the server image, or we can use our favorite coding language and take advantage of the cloud vendor’s software development kit (SDK) or API.

• Running tests. “Testing” in production can be needed to get a real view of application behavior; fake test data rarely behaves like the real thing. Or we may need to run a query that is not shown in a reporting server. While these are valid tasks, we should still avoid ad hoc manual operations and look into replacing them with code and systems that will perform such operations with less risk.

• “My server is a snowflake that needs constant TLC.” Look into “cattle versus pets,” because you have some problems.

• “I don’t know what is running on this server or what this server is supposed to run.” You have bigger problems you need to address.
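As an illustration of the "routine changes" item above, the following Python sketch shows the idempotent "ensure this setting" style that configuration management tools generally follow: we describe the desired state, and applying it twice is the same as applying it once. The function name `ensure_setting` and the `key = value` file format are assumptions made for this example, not the behavior of any particular tool.

```python
def ensure_setting(config_text, key, value):
    """Return config_text with `key = value` defined exactly once.

    Idempotent: running it on its own output changes nothing.
    Assumes a simple `key = value` line format (hypothetical).
    """
    wanted = f"{key} = {value}"
    out = []
    found = False
    for line in config_text.splitlines():
        name = line.split("=", 1)[0].strip()
        if "=" in line and name == key:
            if not found:
                out.append(wanted)  # replace the first definition
                found = True
            # any later duplicate definition of the key is dropped
        else:
            out.append(line)  # unrelated lines pass through untouched
    if not found:
        out.append(wanted)  # key was missing, so append it
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    original = "port = 8080\nmax_connections = 50\n"
    print(ensure_setting(original, "max_connections", "200"), end="")
```

Because the function is pure, the change can be reviewed, versioned, and tested in a dev/QA environment before any tooling applies it to a production server.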

There are a few valid reasons to SSH into a production server that is part of an application running in the cloud. Sometimes while troubleshooting, we need to log in to a server as a last resort because the information we have from the log and metrics servers is not enough to determine the cause of a problem. For example, we may not be getting the logs or metrics themselves, or we may have network issues of the type “this host doesn’t seem to be able to talk to this other host” and we want to verify that connectivity. We may also have hard Linux kernel issues, or strange behavior not explained by logs or indirect information. Another reason to SSH into servers is exploration or learning for new team members.
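Even a check like “can this host talk to this other host” can sometimes be captured in code and run by whatever automation or health-check agent already lives on the server. The sketch below is one minimal way to do that in Python with a plain TCP connection attempt; the function name `can_connect`, the default timeout, and the host and port in the usage example are illustrative assumptions.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the name and tries each address in turn.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical target: replace with the host and port under investigation.
    print(can_connect("127.0.0.1", 22, timeout=1.0))
```

A successful TCP connection only shows that the port is reachable; it does not prove the application behind it is healthy, so this complements rather than replaces deeper troubleshooting.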

In any case, the next time you are about to log in to a server, stop and think:
“How could I accomplish this task without manually getting into the server?”