This post walks through the keys to troubleshooting ‘server down’ issues in Linux: some common causes of outages, and how you can investigate and resolve them. A well-configured Linux server offers an incredible level of stability and performance. Inevitably, though, outages are a fact of life.
First though, it’s important to know what exactly an outage is.
An outage is pretty universally defined as…
A period of time where a service is unavailable or when the output of a service is incomplete, corrupted, or otherwise unable to provide normal levels of interaction.
For example, we’d consider both of the following to be outages:
• A website not being returned at all when a user requests it
• A site being returned, but with all of its images missing
Outages come in many shapes and sizes, however. Here are some we commonly deal with as we manage and monitor hundreds of customer servers every day.
A piece of software on the server stops functioning, either in an unexpected way or altogether. This is usually pretty black and white to investigate.
Is the service running? If not, then you can try starting it back up.
If the service still fails to start, try checking the logs for information.
Typically, services only fail when a configuration has changed somewhere, so recent changes are usually the first place to start any investigation.
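The "is it running, and if not, what do the logs say?" check above can be sketched in a few lines. This is a minimal sketch that assumes the server uses systemd, so that `systemctl is-active` and `journalctl -u` are available; on other init systems the commands will differ. The function names and the `run` parameter (there only so the command runner can be swapped out) are our own illustration, not part of any tool.

```python
import subprocess


def service_state(name, run=subprocess.run):
    """Return the state systemd reports for a unit, e.g. 'active' or 'failed'."""
    result = run(
        ["systemctl", "is-active", name],
        capture_output=True,
        text=True,
    )
    # systemctl prints the state on stdout; fall back if nothing came back.
    return result.stdout.strip() or "unknown"


def recent_logs(name, lines=20, run=subprocess.run):
    """Return the last few journal lines for a unit as one string."""
    result = run(
        ["journalctl", "-u", name, "-n", str(lines), "--no-pager"],
        capture_output=True,
        text=True,
    )
    return result.stdout
```

For example, `service_state("nginx")` (the unit name is just a placeholder) should come back as `active` on a healthy box; if it returns anything else, `recent_logs("nginx")` is a reasonable next thing to read.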
Resources on the server(s) are so strained that requests are extremely slow, or fail due to a timeout.
It’s fairly easy to spot when CPU or RAM is being used up with a simple `top` command.
Disk usage should also be monitored, but many people don’t think about disk IO being the bottleneck. This can be especially true in virtualised environments where IO is limited, sometimes intentionally, by the likes of AWS.
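Because IO pressure is less visible than CPU or RAM, it can help to measure it directly. The sketch below is one rough way to do that, assuming a Linux kernel that exposes `MemAvailable` in `/proc/meminfo` and the usual field order on the aggregate `cpu` line of `/proc/stat` (user, nice, system, idle, iowait, and so on). The helper names are hypothetical; the functions take the file contents as text, so you would read the files yourself and, for iowait, take two samples a few seconds apart.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (mostly kB)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_available_percent(meminfo):
    """Share of total memory the kernel estimates is still available."""
    return 100.0 * meminfo["MemAvailable"] / meminfo["MemTotal"]


def cpu_times(stat_text):
    """Return the aggregate CPU counters (user through steal) from /proc/stat text."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(value) for value in fields[1:9]]
    raise ValueError("no aggregate 'cpu' line found")


def iowait_percent(before, after):
    """Percentage of CPU time spent waiting on IO between two samples."""
    deltas = [later - earlier for earlier, later in zip(before, after)]
    total = sum(deltas)
    # Index 4 is iowait: user, nice, system, idle, iowait, ...
    return 100.0 * deltas[4] / total if total else 0.0
```

A persistently high iowait percentage while CPU usage in `top` looks modest is a hint that the disk, rather than the processor, is what requests are queuing behind.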
Resource exhaustion can also be caused by an attack, whether that is a scan probing for vulnerabilities or a deliberate denial of service (DoS). Typically, these requests need to be blocked further upstream.
In this age of the cloud, it’s easy to forget that servers are still physical (even though they are owned by someone else). Physical hardware does still break and these things need to be checked and monitored.
Hard drives are still one of the most common elements to fail. This is why RAID is still important in this day and age.
Memory issues are some of the hardest to diagnose as they cause instability to the system, making it fall over in new and interesting ways (we’ve seen them all!).
Fans can also fail, causing the system to overheat and other parts to fail sooner.
We had a new customer come to us recently, because their system was unstable… two of the three fans had failed, causing memory corruption and a single disk error in their RAID array which they knew nothing about, as they had no monitoring.
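A degraded array like the one in that story is easy to miss without a check. As an illustration only, and assuming Linux software RAID (md), where the kernel reports array health in `/proc/mdstat`, the sketch below flags arrays whose member map contains an underscore (for example `[U_]`, meaning one member is missing). Hardware RAID controllers need their vendor's tooling instead, and the function name here is hypothetical.

```python
import re

# Matches the "[2/1] [U_]" part of an mdstat status line.
STATUS = re.compile(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]")
# Matches an array header line such as "md0 : active raid1 sdb1[1] sda1[0]".
HEADER = re.compile(r"^(md\S*)\s*:")


def degraded_arrays(mdstat_text):
    """Return {array_name: member_map} for arrays with a missing member."""
    degraded = {}
    current = None
    for line in mdstat_text.splitlines():
        header = HEADER.match(line)
        if header:
            current = header.group(1)
            continue
        status = STATUS.search(line)
        if status and current:
            if "_" in status.group(3):
                degraded[current] = status.group(3)
            current = None  # only the first status line belongs to this array
    return degraded
```

Feeding it the contents of `/proc/mdstat` from a cron job and alerting when the result is non-empty would be a crude starting point; a proper monitoring system is still the better answer.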
Your site is down, your server is unreachable and apparently dead in the water… or is it? Could it be absolutely fine but no one can talk to it?
You might be thinking, “what’s the difference?” The difference is in what needs to be done to resolve the issue. Chances are the problem is upstream from your server and there may be nothing you can actually control yourself, in which case playing a waiting game is your only option.
“Netsplits” are still surprisingly common on the internet. This is where two parts of the internet are still “working” but unable to talk to each other.
Uptime monitoring utilities can come in handy here. These utilities constantly attempt to connect to your site and/or applications, from various locations around the globe and send notifications if there are issues. The logs can also help you determine if only certain geographies are unable to access your service, or if the problem is truly global.
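The core of such a check is simple enough to sketch. The example below only tests whether a TCP connection can be opened, using Python's standard `socket` module; the function names and the target list are hypothetical. Bear in mind that a probe run from a single machine only tells you whether that machine can reach the server, which is exactly why real uptime monitors test from several locations.

```python
import socket
import time


def tcp_probe(host, port, timeout=5.0):
    """Try to open a TCP connection; return (reachable, detail)."""
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - started
            return True, f"connected in {elapsed * 1000:.0f} ms"
    except OSError as error:
        # Covers refused connections, timeouts and DNS failures alike.
        return False, f"{type(error).__name__}: {error}"


def probe_all(targets, timeout=5.0):
    """Probe each (host, port) pair and key the results by 'host:port'."""
    return {
        f"{host}:{port}": tcp_probe(host, port, timeout)
        for host, port in targets
    }
```

Calling `probe_all([("example.com", 443)])` from a machine outside your own network is a quick way to tell "the server is down" apart from "I cannot reach the server from here".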
Typically, one or more of the above outage types is happening, but to the ISP or provider hosting your server or site rather than to your own system. When a service provider’s systems experience problems, this can cause cascading issues with anything built on top of, or relying on, their systems.
This is typically resolved with a support ticket to the provider(s) having problems.
In our experience, the only way to mitigate this type of issue is to host across multiple providers. This is not for the faint of heart but is possible with enough planning.
Some final thoughts on troubleshooting ‘server down’ issues in Linux
The lines between the outage types we’ve listed are often blurred, as one can have a “domino effect” leading to others. For example, a failing hard disk can cause reads and writes to slow down massively, which can lead to requests taking longer than usual, which can lead to resource exhaustion as things get tied up.
Debugging outages can be time-consuming if there are lots of moving parts to your application or service. We find our members of staff develop a gut instinct over years and years of debugging problems and honing their skills.
If you struggle with troubleshooting ‘server down’ issues on Linux boxes in your infrastructure, whether shared or self-hosted, if any of the above seems too much for you, or if you’d just like somebody else to take your worries away, please contact us and we’re sure we can help.