Turning Prometheus data into metrics for alerting

As you may have seen in previous blog posts, we have a Warboard in our office which shows the current status of the servers we manage. Most of our customers are using our new alerting stack, but some have their own monitoring solutions which we want to integrate with. One of these was Prometheus. This blog post covers how we transformed raw Prometheus values into percentages which we could display on our Warboard and create alerts against.

Integrating with Prometheus

In order to get a summary of the data from Prometheus to display on the Warboard we first needed to look at what information the Node Exporter provided and how it was tagged. Node Exporter is the software which makes the raw server stats available for Prometheus to collect. Given that our primary concerns are CPU, memory, disk IO and disk space usage we needed to construct queries to calculate them as percentages to be displayed on the Warboard.

Prometheus makes the data accessible through its API at "/api/v1/query?query=<query>". Most of the syntax is fairly logical, with the general rule being that if we take an average or maximum value we need to specify "by (instance)" in order to keep each server separate. Node Exporter mostly returns raw values from the kernel rather than trying to manipulate them. This is nice as it gives you the freedom to decide how to use the data, but it does mean we have to give our queries a little extra consideration:


CPU

(1 - avg(irate(node_cpu{mode="idle"}[10m])) by (instance)) * 100

CPU usage is reported as a counter that increases over time, so we need to calculate the current percentage of usage ourselves. Fortunately Prometheus has the rate and irate functions for this. Since rate is mostly for use in calculating whether or not alert thresholds have been crossed and we are just trying to display the most recent data, irate seems a better fit. We are currently taking data over the last 10 minutes to ensure we get data for all servers, even if they haven’t reported very recently. As total CPU usage isn’t reported, it is easiest to use the idle CPU time to calculate the total as 100% – idle% rather than trying to add up all of the other CPU usage metrics. Since we want separate data for each server we need to group by instance.
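As a quick sketch of how we pull these numbers out, a query like the one above just needs URL encoding before being sent to the API (the `localhost:9090` address is an assumption; substitute your own Prometheus server):

```python
from urllib.parse import urlencode

PROMETHEUS = "http://localhost:9090"  # assumption: adjust to your Prometheus server

def instant_query_url(promql: str) -> str:
    """Build the URL for an instant query against the Prometheus HTTP API."""
    return f"{PROMETHEUS}/api/v1/query?{urlencode({'query': promql})}"

cpu_query = '(1 - avg(irate(node_cpu{mode="idle"}[10m])) by (instance)) * 100'
print(instant_query_url(cpu_query))
```

Fetching that URL (with curl or urllib) returns JSON containing one result per instance, which is what we feed to the Warboard.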


Memory

((node_memory_MemTotal - node_memory_MemFree) / node_memory_MemTotal) * 100

The memory query is very simple; the only interesting thing to mention is that MemAvailable wasn’t added until Linux 3.14, so we are using MemFree to get consistent values from every server.
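The same arithmetic, pulled out of PromQL for clarity (the values are the raw byte counts Node Exporter reports):

```python
def memory_used_pct(mem_total: float, mem_free: float) -> float:
    """Percentage of memory in use, mirroring the PromQL expression above."""
    return ((mem_total - mem_free) / mem_total) * 100
```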

Disk IO

(max(avg(irate(node_disk_io_time_ms[10m])) by (instance, device)) by (instance))/10

Throughout setting up alerting I feel disk IO has been the “most interesting” metric to calculate. For both Telegraf, which we discuss setting up here, and Node Exporter, I found the kernel docs most useful for confirming that “io_time” was the correct metric from which to calculate disk IO as a percentage. Since we need a percentage we have to rule out anything dealing with bytes or blocks, as we don’t want to benchmark or assume the speed of every disk. This leaves us with “io_time” and “weighted_io_time”. “weighted_io_time” might give a more accurate representation of how heavily disks are being used, since it multiplies the time a process spends waiting by the number of processes waiting. However, we need to use “io_time” to calculate a percentage, or we would have to factor in the number of processes running at a given time. If there are multiple disks on a system we display the disk with the greatest IO: we are trying to spot issues, so we only need to consider the busiest device. Finally, we need to divide by 1000 to convert milliseconds to seconds and multiply by 100 to get a percentage, which the query combines into a single division by 10.
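To make the unit juggling concrete, here is the same conversion outside PromQL. `irate` over io_time gives milliseconds of IO per second per device; we keep the busiest device, then divide by 1000 and multiply by 100 (i.e. divide by 10). The device names here are made up for illustration:

```python
def disk_io_pct(io_ms_per_sec_by_device: dict) -> float:
    """Busiest device's io_time rate (ms of IO per second) as a percentage."""
    busiest = max(io_ms_per_sec_by_device.values())
    return busiest / 1000 * 100  # equivalent to dividing by 10

# hypothetical irate(node_disk_io_time_ms[10m]) values for one instance
rates = {"sda": 250.0, "sdb": 40.0}
print(disk_io_pct(rates))
```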

Disk Space

max(((node_filesystem_size{fstype=~"ext4|vfat"} - node_filesystem_free{fstype=~"ext4|vfat"}) / node_filesystem_size{fstype=~"ext4|vfat"}) * 100) by (instance)

As Node Exporter reports a filesystem size of 0 for nsfs volumes, and there are quite a few temporary and container filesystems that we aren’t trying to monitor, we either need to exclude the ones we aren’t interested in or just include those that we are. As with disk IO, many servers have multiple devices/mountpoints, so we just display the fullest disk, since again we are trying to spot potential issues.

If you are looking to set up your own queries, I would recommend having a look through the Prometheus functions documentation and running some ad hoc queries from the "/graph" section of Prometheus to see what data you have available.

If you need any help with Prometheus monitoring then please get in touch and we’ll be happy to help.

End of Life for New Relic ‘Servers’ – What are your options?

Today (14 Nov 2017) New Relic are making their ‘Alerts’ and ‘Servers’ services end of life (EOL). This will impact anyone who used these services to monitor server resources such as CPU, memory, disk space and disk IO. All existing alert policies will cease from today.

If you rely on these alerts to monitor your servers then hopefully you have a contingency plan in place already, but if not, below are your options.

If you do nothing

New Relic Servers goes EOL TODAY (14 Nov 2017) and data will stop being collected. You will no longer be able to monitor your system resources, meaning outages that could otherwise have been prevented could sneak up on you. We do not recommend this option. See below for how to remove the `newrelic-sysmond` daemon.

Upgrade to New Relic Infrastructure

“Infrastructure” is their new paid server monitoring offering. Infrastructure pricing is based on your server’s CPU, so prices vary, and it offers added functionality over the legacy New Relic Servers offering. The maximum price per server per month is $7.20; however, the minimum monthly charge is $9.90, so it’s not cost-effective if you’re only looking to monitor your main production system. Most of the new functionality is integration with other products (including their own), so it’s up to you whether this additional functionality is useful and worth the cost for your requirements.

Dogsbody Technology Minder

Over the last year we have been developing our own replacement for New Relic Servers using open source solutions. This product was designed with old New Relic Servers customers in mind, giving all the information needed to run and maintain a Linux server. It also has the monitoring and hooks required to alert the relevant people of issues, allowing us to prevent issues before they happen. This is a paid service, but it is included as standard with all our maintenance packages, so any customers using New Relic Servers are being upgraded automatically. If you would like further information please do contact us.

Another alternative monitoring solution

There are plenty of other monitoring providers and solutions out there, from build-your-own open source stacks to paid services. Monitoring your system resources is essential in helping to prevent major outages of your systems. Pick the best one for you and let the service take the hard work out of monitoring your servers. We have experience with a number of implementations, including the TICK stack (Telegraf, InfluxDB, Chronograf, Kapacitor) and Prometheus.

Removing the `newrelic-sysmond` daemon

If you were using New Relic Servers then you are running the `newrelic-sysmond` daemon on your systems. While New Relic have turned the service off, we have confirmed with them that the daemon will keep running, using valuable system resources.

We highly recommend that you uninstall the daemon (tidy server, tidy mind) by following New Relic’s uninstallation guide. That way it won’t keep consuming your system’s resources for no benefit.


Happy Server Monitoring

If you need help, further advice, or to discuss our monitoring solutions please do contact us.

4 Common Server Setups For Your Web Application

There are so, so many possibilities you may consider when designing your infrastructure, each with its distinct advantages and disadvantages. Here we’ll cover 4 of the most popular combinations and explain the pros and cons of each.

To start with, take the following bullet points and think for a moment how your architecture might differ if you made each one the top priority when designing it:

  • ease of management
  • cost
  • reliability
  • performance
  • scalability
  • availability

Everything on one server

All components sit on a single system. For a typical modern web application, this would include all of the parts needed to run the app, such as a web server, database server, and the application code itself. A very common arrangement of these components is the LAMP stack, an acronym for Linux, Apache, MySQL, PHP (or Python/Perl), which is used at all levels from entry to enterprise, just with tweaks for the latter.

When to use it?

If you’re just after a quick and simple setup to host your basic app/site, then you’re going to struggle to find anything easier to get started with. A tried and tested workhorse.


Pros:

  • Simple! With some hosting providers you can set this up in just a few clicks


Cons:

  • All your eggs are in one basket. If this server goes offline, so does your site/app
  • Not very easy to scale horizontally
  • All components compete for the same limited resources

Separate Database Server

Splitting out the database component of your infrastructure from the rest of it (and this will be an ongoing theme) allows you to isolate the resources available to each of these components. This makes capacity planning much easier, and can also give you some fault tolerance in more advanced configurations. This is almost always the first infrastructure upgrade we see people spring for and we often recommend it ourselves. It’s a quick win, and pretty easy to do.

When to use it?

When you’re looking for your first set of upgrades for your infrastructure, your database layer is struggling, or you’d like better separation between your web and database components.


Pros:

  • Simple upgrade from a single server setup
  • Better separation between resources, leading to easier scaling


Cons:

  • Additional cost over a single server
  • A little more complex than a single server setup
  • More areas need to be considered, primarily network performance

Caching Layer

Caching content can make an absolutely massive difference to site performance and scalability. Caching involves storing in memory a (usually) commonly requested asset, think a logo or a web page, so that it can be served to a visitor without having to be generated and/or loaded from the disk every time.

Two of the most popular pieces of software used in caching are Memcached and Redis, both of which are key-value stores. This means that each piece of cached data has a key, which is essentially a name used to identify the data, and a value, which is the data itself. As explained above, these values are returned to a visitor instead of having to generate the data fresh again.

Plugins for Redis/Memcached exist for most popular CMSes, allowing for very easy integration with an existing setup. See this tutorial for an example of integrating WordPress with Redis.

When to use it?

When you’re serving a lot of static content, such as images, CSS or Javascript.


Pros:

  • Alleviates load on your application servers
  • Big performance increases for little effort


Cons:

  • Can be difficult to tune correctly

Load Balancing

The clue is in the name. In this setup, the load is balanced between multiple servers, allowing them all to serve a portion of the traffic. This also has the benefit of added redundancy: if one of the servers were to go offline, the other server(s) are still around to handle the load.

When to use it?

When redundancy is important, and when scaling individual servers any higher becomes an unreasonable process.


Pros:

  • Increased redundancy
  • Scaling with zero downtime by adding additional servers to the pool


Cons:

  • Single point of failure. If the load balancer goes down, all traffic to your site goes down with it
  • Additional complexity. A prime example is sessions: ensuring that visitors have a consistent web experience regardless of which app server handles their request
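The simplest balancing strategy, round robin, just hands each new request to the next server in the pool. A minimal sketch (the server names here are invented):

```python
from itertools import cycle

pool = ["app1", "app2", "app3"]  # hypothetical app servers behind the balancer
rotation = cycle(pool)

def route_request() -> str:
    """Pick the next server in rotation for an incoming request."""
    return next(rotation)
```

Real load balancers (Nginx, HAProxy and friends) layer health checks on top of this, and address the session problem with options like sticky sessions or a shared session store.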


Managing infrastructure can be hard, especially when you get to advanced levels such as load balancing and auto-scaling, but there are very positive results to be had for relatively modest amounts of work. Planning for the future and considering the differing needs of the many parts of your app can save you a lot of heartache and expense down the line. Still unsure? Contact us and we’ll be happy to help.


Feature image credit CWCS Managed Hosting, licensed CC BY 2.0.

Surviving A “Hug of Death”

One of the wonders of the modern internet is the ability to share content and have it accessible from anywhere in the world instantly. This allows the spread of information to take place at unparalleled speeds, and can result in a sort of virtual flash mob where something gets very popular very quickly without the chance to manage accordingly. Just as in real life, these flash mobs can get out of hand.

These “virtual flash mobs” have been called a few different things over the years; a common one was “getting slashdotted”, where the traffic resulted from something getting popular on Slashdot. Another, and my favourite, is the Reddit “hug of death”.

This blog post will aim to help you understand, prepare for, and handle a hug of death.


Monitoring

As mentioned above, hugs of death tend to start quickly, so you’d better have high-resolution monitoring in place. If you want to respond before things get too bad, you’ll need to act quickly. That is, of course, unless you have automated responses, which we’ll discuss below.


Optimisation

Optimising any website is important, but it’s particularly important on high-traffic sites, as any inefficiencies are amplified as the traffic level grows. Identifying these weak spots ahead of time and working to resolve them can save a lot of heartache down the line. An area of particular importance for optimisation is your SQL/database layer, as this is often the first area to struggle, and can be much harder to scale horizontally than other parts of a stack.


Use a CDN

Using a CDN to serve your site’s assets helps in two ways: it lowers download latency for clients by placing the assets in a closer geographic location, and it removes a lot of load from your origin servers by offloading the work to these external locations.

Tiered Infrastructure

Having the different levels of your infrastructure scale independently can allow you to be much more agile with your response to a hug of death. Upscaling only the areas of your infrastructure that are struggling at any given time can allow you to concentrate your efforts in the most important places, and save you valuable money by not wasting it scaling every part of a “monolithic” infrastructure instead of just the area that needs it.


Automatic scaling

What makes responding to a hug of death easy? Not having to “respond” at all. With the ever-increasing popularity of cloud computing, having your infrastructure react automatically to an increase in load is not only possible but very easy. We won’t go into specifics, but it basically boils down to “if the load is above 70%, add another server and share the traffic between all servers”.
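That rule of thumb can be sketched as follows (the 70% threshold is just the example figure above; real auto-scaling policies also scale back down and apply cooldown periods):

```python
SCALE_UP_THRESHOLD = 70.0  # percent average load that triggers adding a server

def desired_server_count(current_servers: int, avg_load_pct: float) -> int:
    """Naive auto-scaling rule: add one server while average load is too high."""
    if avg_load_pct > SCALE_UP_THRESHOLD:
        return current_servers + 1
    return current_servers
```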

As scary as a hug of death sounds, they’re actually great overall. It means you’ve done something right, as everybody wants a piece of what you’ve got. If you want some help preparing then please get in touch and we’ll be happy to help.

AWS services that need to be on your radar

We are avid AWS users, and the AWS Summit this year really added to our excitement. AWS have grown quicker and larger than any other server host in the past few years, and with that growth has come a flood of new AWS technologies and services. Below are our favourite solutions; it is time to put them on your radar.

What is AWS?

AWS (Amazon Web Services) are the biggest cloud server provider; their countless services and solutions can help any company adopt the cloud. Unlike some of their competitors, AWS allow you to provision server resources nearly instantly: within minutes you can have a server ready and running. This instant provisioning makes AWS a must for anyone looking into scalable infrastructure.

1) Elastic File System (EFS)

EFS has been on our radar since it was first announced. EFS is Amazon’s solution to NFS (Network File System) as a service, and it is the perfect addition to any scalable infrastructure, enabling you to share content instantly between all of your servers across all availability zones. If you wanted your own highly available NFS infrastructure it would take at least five servers and hundreds of pounds to recreate their scale. It has been a long time coming, and EFS has finally been released from beta, rolling out into other regions including the EU. Huzzah!

2) CodeStar

CodeStar is Amazon’s new project management glue: it pulls together a number of Amazon services, making application development and deployment a seamless process. Within minutes you can turn your code repository into a highly available infrastructure. CodeStar automatically integrates with:

  • CodeCommit – A git compatible repository hosting system which scales to your needs.
  • CodeBuild – Compile, test and create applications that are ready to deploy.
  • CodeDeploy – Automatically rolls out updates to your infrastructure, handling the deployment process to help you avoid downtime.
  • CodePipeline – The pipeline that takes your code from CodeCommit, through testing, and into CodeDeploy.
  • Atlassian JIRA – CodeStar can also tie into JIRA, a popular issue tracking and project management tool.

I have just started using CodeStar for my personal website and I love it; it makes continuous deployment a pleasure. All of those little tweaks are just one git push away from being live, and if anything goes wrong CodeDeploy can quickly roll back to previous versions.

3) Athena

In a couple of clicks Athena turns your S3 data into a SQL-queryable database. It natively supports CSV, TSV, JSON, Parquet, ORC and, my favourite, Apache web logs. Once the data has been loaded you can get started writing SQL queries.

Earlier this week there was an attack on one of our servers; within minutes I had the web logs loaded into Athena and was turning the data into reports.

4) Elastic Container Service (ECS)

ECS takes all of the hassle out of your container environment, letting you focus on developing your application. ECS has been designed with the cloud ethos from the ground up: built for scaling and for isolating tasks. It ties straight into AWS CloudFormation, allowing you to start a cluster of EC2 instances ready for you to push your Docker images and get your code running.

In Summary

One common theme you might have picked up on is that Amazon is ready for when you want to move fast. Streamlined processes are at the heart of their new offerings, ready for horizontal scaling and ready for continuous integration.

Are you making the most of AWS?

Is the cloud right for you?

Drop us a line and we will help you find your perfect solution.

Feature image by Robbie Sproule licensed CC BY 2.0.

Is your uptime in your control?

People have always relied on 3rd parties to provide services for them, and this is especially true in the technology sector. Think server providers, payment providers, image hosting, CSS & JS libraries, CDNs etc. The list is endless. Using external providers is of course fine, why re-invent the wheel after all? You should be concentrating on what makes your product/service unique, not already-solved problems. (It’s also the Linux ethos!)

Why should you care?

With that said, relying on other people’s services is obviously a problem if their service isn’t up. Luckily, most big service providers, and lots of smaller providers too, have status pages where they provide the current status of their systems and services (see ours here). These status pages are great during an unforeseen outage, as you can get the latest information, such as when they expect the issue to be fixed, without having to contact their support team with questions, at a time when their support is probably under a lot of strain due to the outage in question.

Lots of status pages even allow you to subscribe to updates, meaning you’ll receive an email or SMS (or even have them call a web-hook for an integration into your alerting systems) when there is an issue.

As much as everyone hates outages, they are unfortunately a part of life, and when it’s another service provider’s outage, there isn’t much you can do. (Ideally you should never have a single point of failure, i.e. high availability, but that is a blog post for another time).

What can you do about it?

However, not all outages are unforeseen, and lots of common issues are easy to prevent ahead of time with some simple steps:

    1. Monitor the status pages / blogs of your service providers for warnings of future work that could affect you, and make a record of it
    2. Subscribe to any relevant mailing lists. These not only let you know about issues, but allow you to take part in a discussion around the issue and its effects
    3. Set up your own checks for service providers that don’t have a status page and/or an automated reminder system (we can help with this).
    4. Make sure that reminder notifications are actually being seen, not just received. You could have all of the warning time in the world, but if nobody reads the notification, you can’t action anything.

Other things to consider

As mentioned above, your customers are likely to be more forgiving of your outage if it is somebody else’s fault, but they’re not going to be happy if it’s your fault, and they are really not going to be happy if it was easily preventable.

The two most common problems that fall into this bracket are domain name and SSL certificate renewals. Every website needs a domain name, and massive amounts of sites use SSL in at least some areas. If your domain name expires, your site could become unavailable immediately (depending on your domain registrar and how nice they are).

SSL certificate expiries can also cause your site to become unavailable immediately. On top of this, browsers will give nasty warnings about the site being insecure. This is likely to stick in the mind of some visitors, meaning it could damage your traffic and/or reputation even after the initial issue has been resolved. It’s also really easy to set up checks for these two things yourselves.
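As a sketch of such a check, the Python standard library can fetch a certificate and work out how long it has left. The date parsing is split into its own function so the arithmetic is clear; `notAfter` strings look like `'Jun  1 12:00:00 2026 GMT'`:

```python
import socket
import ssl
from datetime import datetime

def days_left(not_after: str, now: datetime) -> int:
    """Days between `now` and a certificate's notAfter timestamp."""
    expires = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z")
    return (expires - now).days

def cert_days_remaining(host: str, port: int = 443) -> int:
    """Connect to a host and return the days until its TLS certificate expires."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            cert = tls.getpeercert()
    return days_left(cert["notAfter"], datetime.utcnow())
```

Run something like this daily from cron and alert when the number drops below, say, 14, and an expiry should never catch you out.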

If you don’t want to set these up, then we handle this for you as part of our maintenance packages. Just contact us and we can get this set up for you right away.


Data Privacy Day 2017

This year we bring you an infographic showing how data privacy is good for business. We also encourage you to check out last year’s post about the requirements of running a business in the UK.

Feature image by g4ll4is under the CC BY-SA 2.0 license.


HTTP/2 – what you should know

HTTP/2 is a fairly new technology, offering significant improvements over its predecessors whilst remaining backwards compatible with previous web browsers and services. HTTP/2 is only going to get bigger, and it’s certainly not going away any time soon, so here’s what you should know about it.

Before we get too in depth with the advantages of HTTP/2 and the reasons you should be using it, it’s important we understand what HTTP is in the first place, and how it fits into modern internet use.

HTTP stands for Hypertext Transfer Protocol, and it is one of the main building blocks of the modern web as we know it. It is a set of rules for how information should be sent and received between systems. Any text, images and media you see and interact with on a standard web page (including this one) will most likely have been sent to you using HTTP.

The downsides of regular ol’ HTTP

HTTP has been around for a long time. This of course is not inherently bad, but HTTP was designed a long time ago, and things have changed a lot since then. HTTP/1.1, which is the version that a very large majority of the modern web uses, was first standardised in 1997, and saw major development before that date too.

That’s 20 years ago now, and in that time the internet has gone from something connecting only large enterprises and government facilities, into a truly global communications utility used daily by billions of people.

The original HTTP/1.1 spec was never designed with this sense of scale and use in mind, and so it has shortcomings in the modern day, resulting in the need for often time-consuming and complex workarounds.

One of the biggest drawbacks of HTTP/1.1 is that each connection can only handle one request at a time, so browsers open multiple connections to fetch assets in parallel. This adds overhead, which is amplified by the large number of assets used on most modern websites, and amplified even further by the additional cost of negotiating HTTPS connections when loading assets securely.

What is HTTP/2 and what are the advantages?

HTTP/2 is the newest version of HTTP, standardised in mid-2015 and taking influence from the earlier SPDY protocol initially designed by Google. HTTP/2 offers significant improvements over previous versions in the following ways:

  • Server push – the web server running your website can push assets to visitors before they request them, speeding up the overall load times of pages
  • Concurrency – all communication can happen via one connection, removing the overhead and complexity of establishing and maintaining multiple connections, which again results in speed improvements
  • Dependency specifications – you can now specify which of the items on your page are most important, and make sure the most important ones are dealt with first. This means the content somebody wants to see can be displayed sooner
  • Header compression – decreases the amount of data to be transferred by compressing the metadata in messages being sent and received, lowering bandwidth usage and once again decreasing load times

All of these advantages, combined with sites and applications making the most of them, can result in significant improvements in page load speeds, particularly on mobile devices, and a much nicer overall experience on the web.

An interesting point on HTTP/2 is that although there is nothing in the RFC that specifies HTTP/2 should only support encrypted connections (using TLS or SSL), some major browsers such as Firefox and Chrome have stated they will not support HTTP/2 over plain-HTTP connections. This means that in a lot of cases, you’ll have to support HTTPS in order to reap the benefits that HTTP/2 provides, but you should really be using HTTPS by now anyway, so this is not too big a deal.

Sound good? We can help!

If HTTP/2 sounds like something you’re interested in, then just get in touch and we’re more than happy to help. We’ve been running HTTP/2 on our website for quite a while now, and we’d love to help you get it running on yours!

How will Ubuntu 12.04 end of life affect me?

In April 2017, Ubuntu 12.04 reaches end of life (EOL).
We recommend that you upgrade to Ubuntu 16.04.

Over time technology and security evolves, new bugs are fixed and new threats prevented, so in order to maintain a secure infrastructure it is important to keep all software and systems up to date.

Operating systems are key to security, providing the libraries and technologies behind NGINX, Apache and anything else running your application. Old operating systems don’t support the latest technologies which new releases of software depend on, leading to compatibility issues.

Running old Ubuntu 12.04 systems past April 2017 leaves you at risk of:

  • Security vulnerabilities of the system in question
  • Making your network more vulnerable as a whole
  • Software incompatibility
  • Compliance issues (PCI)
  • Poor performance and reliability

Ubuntu End of life dates:

Ubuntu LTS (long term support) operating systems come with a 5 year end of life policy. This means that after 5 years a release receives no maintenance updates, including security updates.

  • Ubuntu 12.04 : April 2017
  • Ubuntu 14.04 : April 2019
  • Ubuntu 16.04 : April 2021


Just picking up your files and moving them from Ubuntu 12.04 to Ubuntu 16.04 will speed up your site, thanks to the newer software versions alone:

  • Apache 2.2 -> Apache 2.4
  • MySQL 5.5 -> MySQL 5.7
  • PHP 5.3 -> PHP 7.0

Are you still using an old operating system?

Want to upgrade?

Not sure if this affects you?

Drop us a line and see what we can do for you!


Feature image by See1,Do1,Teach1 licensed CC BY 2.0.

Open-sourcing our Raspberry Pi Displayboard

Our office warboard runs off a simple Raspberry Pi plugged into a wall-mounted TV; however, the code to get this working reliably has taken a bit of tweaking over the years.

Today we continue our efforts to give back to the open source community by publishing our recipe for a solid, stable displayboard that can be used for anything from digital signage to office displays.

You can find all the code in our pi-display GitHub Repo.

This code…

  • Waits for the TV/display to be turned on before proceeding.
  • Reconfigures the resolution to match the best resolution the TV/display has to offer.
  • Fixes itself and any bad configuration should corruption occur from a bad webpage.
  • Works with the latest SSL technologies (TLS1.2).
  • Supports CEC commands allowing you to control the TV via the HDMI cable.
  • Installs the fonts required for correct webpage rendering.

Our office warboard is not only locked down to certain IP addresses but also uses the latest SSL protocols and ciphers. The stock Chromium on the Raspberry Pi wasn’t up to date (v22 when the current version is v51) and didn’t support the latest security protocols.

This repo used to use the Epiphany browser instead, which was more up to date (but not as stable). Now (28 Sep 2016) the Raspberry Pi team have released PIXEL, which includes a much more up-to-date version of the Chromium browser.

This install also downloads and compiles the latest cec-client that allows you to turn the TV on and off each day via cron.

Let us know if you find this useful and feel free to fork and/or make pull requests 🙂