
Prometheus: relabel your scrape_config

Prometheus labels every data point — the most well-known example of a label is (probably) instance.

Take a look at this query result (query: up{job="prometheus"}):

up{instance="127.0.0.1:9090",job="prometheus"} 1

So what does this tell me?

I queried for the "up" metric and filtered it for "prometheus" — yay. The "1" says my service is alive. So far so good.

Readability

Since we are in the process of running a few Prometheus servers (in federation), each of those servers will report back with instance="127.0.0.1:9090" (along with other labels, of course).

While this works, I'm not a computer. If the "instance" reported an FQDN or some other readable name, it would make any dashboard or alert more approachable. Or readable, if you will.

The instance label

instance is a standard field used in various Grafana dashboards out there. Dashboards often use the value in instance to provide you with a dropdown list of (well) instances (or nodes) to select from.

To not end up with a dropdown full of 127.0.0.1:9090, here is a snippet on how to work with labels to make life a little easier.

Rewriting labels

Consider the following scrape_config:

- job_name: "prometheus"
  metrics_path: "/metrics"
  static_configs:
  - targets:
    - "127.0.0.1:9090"

It produces the result above.

Now, extend it slightly to include a name and relabel the instance field with it:

- job_name: "prometheus"
  metrics_path: "/metrics"
  relabel_configs:
    - source_labels: [name]
      target_label: instance
  static_configs:
  - targets:
    - "127.0.0.1:9090"
    labels:
      name: my-prometheus.example.org

Query again:

up{instance="my-prometheus.example.org",job="prometheus",name="my-prometheus.example.org"} 1

Now "instance" is set to something I can grok by glancing over it. Which makes me happy.

Fin

Thanks for following along!

Bootstrapping molecule instances with volumes

We use Ansible for all kinds of things. One of them is formatting and mounting a volume so we can actually use it.

When I introduced the code for that, it worked (flawlessly, of course) — until I hit a bug when I provisioned another cluster. Long story short, I was able to fix the bug. But since we rely on this to always work and I wanted to make sure all situations were covered, I decided to extend one of our tests.

Background

We currently use Hetzner Cloud to bootstrap instances for CI. They are pretty okay. At times you have odd issues where a CPU hangs or a server doesn't respond to SSH, but since it's cheap, it hasn't bothered me enough yet to find something else. Add to that, they are European (slightly fewer issues with data privacy, etc.), know how VAT works (meaning the invoices are correct) and allow paying invoices via SEPA (meaning fewer credit card fees, no currency conversions, etc.).

Extending the test

A molecule scenario is driven by a file called molecule.yml. It'll look similar to this:

---
dependency:
  name: galaxy
driver:
  name: hetznercloud
lint:
  name: yamllint
platforms:
  - name: node-01-${DRONE_BUILD_NUMBER:-111}
    server_type: cx11
    image: centos-7
  - name: node-02-${DRONE_BUILD_NUMBER:-111}
    server_type: cx11
    image: centos-7
provisioner:
  name: ansible
  config_options:
    ssh_connection:
      pipelining: True
  lint:
    name: ansible-lint
verifier:
  name: testinfra
  lint:
    name: flake8

Most of it is as generated; we added different names though, since this test requires multiple instances and we wanted to run multiple builds at the same time, which is why we append $DRONE_BUILD_NUMBER from the environment. (The fallback ensures the number is still set when you drone exec a build locally.)

TL;DR — the scenario will have two instances available: node-01-XYZ and node-02-XYZ.

Going from there, you have two additional files of interest: create.yml and destroy.yml.

The first is used to bootstrap instances through Ansible's hcloud_server module, the second cleans up after the scenario/build has finished.
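For orientation, the instance-creation task in create.yml is built around that module and looks roughly like this (a trimmed-down sketch — the generated playbook passes more parameters, e.g. SSH keys and the API token):

- name: Create molecule instance(s)
  hcloud_server:
    name: "{{ item.name }}"
    server_type: "{{ item.server_type }}"
    image: "{{ item.image }}"
    state: present
  with_items: "{{ molecule_yml.platforms }}"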

Adding the volume

In create.yml, I added the following task after "Wait for instance(s) creation to complete":

- name: Attach a volume
  hcloud_volume:
    name: "my-volume-{{ item.name }}"
    server: "{{ item.name }}"
    size: 15
    automount: no
    state: present
  with_items: "{{ molecule_yml.platforms }}"

The task uses Ansible's hcloud_volume module and ensures each of my nodes has a 15 GB volume attached. The volume is called "my-volume" with the instance name appended as a suffix (e.g. my-volume-node-01-XYZ). For our purposes, we also decided to attach it without mounting it, as our Ansible role takes care of that.

Deleting the volume(s)

To save a few bucks and to clean up after each test run, open destroy.yml and add the following block after the instances are terminated:

- name: Delete volume(s)
  block:
    - name: Detach a volume
      hcloud_volume:
        name: "my-volume-{{ item.instance }}"
        state: absent
      with_items: "{{ instance_conf }}"
  ignore_errors: yes
  when: not skip_instances

Side note

Another neat trick: you can add arbitrary variables to the entries in platforms in molecule.yml.

For example, to set the size of the volume, try the following:

platforms:
  - name: instance
    disk_size: 20

Then use that variable in the hcloud_volume task as {{ item.disk_size }} (see the sketch below). And if disk size is not your objective, you could use this to control whether each instance gets a volume at all, or whether only certain nodes in your setup need one. This is all a bit of a hack and maybe it'll go away, but for the time being I'm glad no one bothered to validate these YAML keys or apply a schema.
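For illustration, wiring disk_size into the task from earlier could look like this (the default(15) fallback is my own addition):

- name: Attach a volume
  hcloud_volume:
    name: "my-volume-{{ item.name }}"
    server: "{{ item.name }}"
    size: "{{ item.disk_size | default(15) }}"
    automount: no
    state: present
  with_items: "{{ molecule_yml.platforms }}"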

Fin

Thanks for reading!

NetworkManager (for resolv.conf and firewalld) on CentOS7

As I am spiralling into Linux server administration, there's certainly a lot to learn. A lot of it leaves me wanting BSD, but since that's not an option... here we go.

NetworkManager

NetworkManager on Linux (or CentOS specifically) manages the network. Whatever content (blog posts, knowledge-base articles) I found usually suggests that you uninstall it first. A common problem is that people are unable to manage /etc/resolv.conf — because changes they make to that file keep getting overwritten.

Internals

The NetworkManager gets everything it needs from a few configuration files.

These are located in: /etc/sysconfig/network-scripts/

They're easy enough to manage with automation (Ansible, Chef, Salt), and here's how you get a grip on DNS.

As an example, the host I'm dealing with has an eth0 device. Its configuration lives in that directory, in an ifcfg-eth0 file, and its contents are the following:

TYPE="Ethernet"
PROXY_METHOD="none"
BROWSER_ONLY="no"
BOOTPROTO="none"
DEFROUTE="yes"
IPV4_FAILURE_FATAL="no"
NAME="eth0"
UUID="63b28d0a-41f0-4e3a-bf30-c05c98772dbb"
DEVICE="eth0"
ONBOOT="yes"
IPADDR="172.21.0.12"
PREFIX="24"
GATEWAY="172.21.0.1"
IPV6_PRIVACY="no"
ZONE=public
DNS1="172.21.0.1"

Most of this speaks for itself, but there are a few titbits in here.

Managing DNS and resolv.conf

In order to (statically) manage the nameservers used by this host, I put the following into the file:

DNS1="172.21.0.1"

If I needed multiple DNS servers (e.g. for fallback):

DNS1="172.21.0.1"
DNS2="172.21.0.2"
DNS3="172.21.0.3"

In order to apply this, you can use a hammer and reboot — or use your best friend (sarcasm) systemd:

$ systemctl restart NetworkManager

Done!
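If everything worked, NetworkManager regenerates /etc/resolv.conf from those settings, and it should end up looking roughly like this:

# Generated by NetworkManager
nameserver 172.21.0.1
nameserver 172.21.0.2
nameserver 172.21.0.3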

introducing firewalld

firewalld is another interesting component. It breaks your firewall down into zones, services and sources (and a few other things). It's not half bad, even though pf is still superior. Its biggest advantage is that it hides iptables from me (mostly). And it allows me to define rules in structured XML, which is still easier to read and assert on than the output of iptables -nL.

To, for example, put my eth0 device into the public zone, add this to ifcfg-eth0:

ZONE=public

This also implies that I can't put this device into another zone at the same time — that would conflict. But this makes sense. We can of course change this and put different devices into different zones. I believe public may be the implicit default.
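To double-check where a device ended up (instead of trusting the file), firewall-cmd can report the zone assignment — output along these lines:

$ firewall-cmd --get-zone-of-interface=eth0
public
$ firewall-cmd --get-active-zones
public
  interfaces: eth0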

FIN

Thanks for reading!

Ansible Molecule drivers

(Hello again. I haven't blogged in a while, but since I'm growing weary of platforms such as Medium, here we go.)

I've recently spent ~~too much~~ a lot of time with Ansible. Once I got into the rhythm of playbooks, roles and maybe modules/libraries, I desperately needed a way to test my YAML. And by testing, I didn't mean the annoying linting that Ansible ships with, but actual (integration) tests to verify everything works.


Enter Molecule (and TestInfra)

Molecule seems to be the go-to in the Ansible universe. It's an odd project — primarily because it's so undocumented (and I don't mean Python class docs, but human-readable examples).

One of the fun things about Molecule is drivers. Drivers allow me to use Docker, VirtualBox or a cloud service (like AWS, Azure, DO, Hetzner ;-)) to start instances that my playbook is run on (and then TestInfra runs assertions and collects the results). In a nutshell, this is what Molecule does — think of it as test-kitchen.
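To give an idea of the TestInfra side: an assertion file can be as small as the following sketch (file name, package and service are made-up placeholders, not from any real scenario):

# test_default.py — hypothetical TestInfra checks run against each instance
def test_package_installed(host):
    assert host.package("chrony").is_installed

def test_service_running(host):
    service = host.service("chronyd")
    assert service.is_running
    assert service.is_enabled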

Drivers and switching them

Drivers are crucial to a scenario. You can't, or at least shouldn't, create a scenario and then switch it to another driver. When a scenario is initialised with a driver, Molecule generates create.yml and destroy.yml (playbook) files. These files are highly specific to the driver, and Molecule doesn't play well when they are incorrect or missing.
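If you do need a different driver, the safest route is to initialise a fresh scenario with it and let Molecule generate matching create/destroy playbooks — something along these lines (the exact flags differ between Molecule versions, so treat this as a sketch):

$ molecule init scenario --scenario-name hcloud --driver-name hetznercloud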

It took me too long to figure this out. Hence, I'm leaving a note here.

Fin

I promise I'll blog more. Again. Thanks for reading!

What's wrong with composer and your .travis.yml?

I'm a huge advocate of CI, and of one service in particular called Travis-CI.

Travis-CI runs a continuous integration platform for both open source and commercial products. In a nutshell: Travis-CI listens for a commit to a GitHub repository and runs your test suite. Simple as that, no Jenkins required.

At Imagine Easy we happily take advantage of both. :)

So what's wrong?

For some reason, every other open source project (and probably a lot of closed source projects) uses Travis-CI wrong in a way that will eventually break their builds.

When exactly? Whenever Travis-CI promotes composer to be a first-class citizen on the platform and attempts to run composer install automatically for you.

There may be breakage, but there may also be a slowdown, because by then you could end up with not one but two composer install runs before your tests actually run.

Here's what needs fixing

A lot of projects use composer like so:

language: php
before_script: composer install --dev
script: phpunit

Here's what you have to adjust

language: php
install: composer install --dev
script: phpunit

install vs. before_script

I had never seen the install target before — not consciously, at least. And since I don't do a lot of CI for Ruby projects, I wasn't exposed to it there either. On a Ruby build, Travis-CI will automatically run bundler for you, using said install target.

order of execution

In a nutshell, here are the relevant targets and how they execute:

  1. before_install
  2. install
  3. before_script
  4. script

The future

The future is that Travis-CI will do the following:

  1. before_install will self-update the composer(.phar)
  2. install will run composer install
  3. There is also the rumour of a composer_opts (or similar) setting, so you can provide something like --prefer-source without having to override the install target yourself (you can approximate this today, as sketched below)
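Until Travis-CI ships that, you can approximate it yourself. A sketch of what that might look like — the self-update step and --prefer-source are optional and merely my preference:

language: php
before_install: composer self-update
install: composer install --dev --prefer-source
script: phpunit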

Fin

Is any of this your fault? I don't think so, since the documentation leaves a lot to be desired. Scanning it while writing this blog post, I can't find a mention of the install target on the pages related to building PHP projects.

Long story short: go update your configurations now! ;)

I've started with doctrine/cache and doctrine/dbal, and will make it a habit to send a PR each time I see a configuration which is not what it should be.