
Terraform and OpenStack: Boot an instance from CD-ROM

In the spirit of "this took me way too long", here's how to boot an instance with a CD-ROM on OpenStack, using Terraform.

Why would I need this?

In a perfect world, I have templates to bootstrap instances, meaning the instances are ready to go when booted. I customise them with cloud-init and let them do all kinds of cool (or necessary) stuff like configuring the network, setting hostnames, adding user accounts and then maybe joining them to a cluster.

But I don't live in a perfect world. Still, I try to automate as much as I can, so I don't have to remember any of it.

Use-case

The use-case is the installation (or setup) of a Sophos firewall. The vendor provides an image which has to be booted; an installer and a setup wizard then have to be completed to finish the installation.

Using Terraform

Let's look at the code first - the following is used to create the instance:

resource "openstack_compute_instance_v2" "vpn_host" {
  depends_on = [
    data.openstack_images_image_v2.vpn_image
  ]

  name        = "vpn"
  flavor_name = "dynamic-M1"

  security_groups = [
    "default",
  ]

  # boot device
  block_device {
    source_type           = "blank"
    volume_size           = "100"
    boot_index            = 0
    destination_type      = "volume"
    delete_on_termination = false
  }

  # cd-rom
  block_device {
    uuid             = data.openstack_images_image_v2.vpn_image.id
    source_type      = "image"
    destination_type = "volume"
    boot_index       = 1
    volume_size      = 1
    device_type      = "cdrom"
  }

  network {
    port = openstack_networking_port_v2.vpn_port.id
  }

  network {
    uuid = data.openstack_networking_network_v2.public_network.id
  }
}

I am omitting some code, but let's walk through this.

How to CD-ROM (block_device)

I am approaching this in reverse order — let me talk about the second block_device block first.

This is the bit that took me the longest because I didn't know how disk_bus and device_type play together, or which of the two is needed.

The moral of the story is, if the Terraform provider documentation is too vague, read OpenStack's documentation on device mapping instead. Or in your case, you are reading my blog post! :-)

To continue: the image of the Sophos firewall is referenced by data.openstack_images_image_v2.vpn_image.id. Therefore, I have a data provider which pulls the image from OpenStack (or Glance):

data "openstack_images_image_v2" "vpn_image" {
  name = "fancy readable name of the ISO here"
}

During terraform apply, Terraform will try to resolve it. If successful, the result is used to create a (Cinder) volume from the image. The size of 1 (GB) is what OpenStack suggested when I did this via the fancy web UI, so I used the same value in my Terraform setup.

The important part of the block_device block is device_type = "cdrom". Without it OpenStack will refuse to boot from the volume even though we provide a boot_index.

Small caveat: I had to add a depends_on as Terraform's dependency graph would not wait for the data provider to resolve during apply.

Boot device

Last but not least: I also need a bootable root partition to install to, and that's the first block_device block in my code snippet.

If all goes well, the provisioning is as follows:

  1. OpenStack starts the instance
  2. It discovers that the first disk is not bootable (yet)
  3. It proceeds with the CD-ROM (attached to /dev/hda in my case).

After the installation is finished, subsequent reboots of the instance always use the first disk. This is similar to dropping a CD into a (real) server, installing it (from the CD) and leaving the CD (in the drive) at the data center (just in case). :-)

The rest

The rest is hopefully straightforward.

I defined two other networks (with another Terraform run) which are used via data providers.

One is used as a port (for fixed IP allocation/configuration, openstack_networking_port_v2.vpn_port.id) and the other provides the VPN instance with another accessible IP for dial-in and remote management from the public network (via data.openstack_networking_network_v2.public_network.id).
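
For reference, a sketch of what the referenced pieces might look like. The network and IP values below are placeholders; only the resource and data source names match what the instance block above references:

# assumption: the external network and subnet names are placeholders
data "openstack_networking_network_v2" "public_network" {
  name = "public"
}

data "openstack_networking_subnet_v2" "vpn_subnet" {
  name = "vpn-subnet"
}

resource "openstack_networking_port_v2" "vpn_port" {
  name       = "vpn-port"
  network_id = data.openstack_networking_subnet_v2.vpn_subnet.network_id

  # fixed IP allocation for the VPN instance
  fixed_ip {
    subnet_id  = data.openstack_networking_subnet_v2.vpn_subnet.id
    ip_address = "10.0.0.10"
  }
}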

Fin

Thanks for reading.

Terraform: Resource not found

Here are a few things I learned and did when I encountered the very verbose "Resource not found" error from Terraform.

Debug your Infrastructure as Code

More logs?

This is my obvious go-to. Terraform comes with different log levels, though it will tell you itself that every level but TRACE is not to be trusted:

2021/03/02 09:21:33 [WARN] Log levels other than TRACE are currently unreliable, and are supported only for backward compatibility. Use TF_LOG=TRACE to see Terraform's internal logs.

FWIW, DEBUG and ERROR seem to produce okay output to narrow down problems, while TRACE is overwhelming and therefore not very helpful.
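
To keep the terminal readable, the logs can also be written to a file via TF_LOG_PATH (the file name here is just an example):

$ TF_LOG=DEBUG TF_LOG_PATH=./terraform-debug.log terraform plan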

Refresh, plan?

To narrow down a problem I can run terraform refresh (or import, or plan) and hope for the best, but what I found incredibly valuable was adding -target to these commands. It allows me to test resources one by one.

To retrieve a list of what is currently known to Terraform's state:

$ terraform state list
data.openstack_images_image_v2.centos
data.openstack_networking_network_v2.public_network
openstack_compute_instance_v2.jump_host
openstack_compute_keypair_v2.ssh_key
openstack_networking_network_v2.network
openstack_networking_secgroup_rule_v2.jump_host_rule
openstack_networking_secgroup_rule_v2.monitoring_rule
openstack_networking_secgroup_v2.jump_group
openstack_networking_subnet_v2.monitoring

Which seems accurate in my case.

Then I proceeded to go through each of them to find out what I may or may not know:

$ terraform plan -target openstack_compute_keypair_v2.ssh_key
...

Of course, it only failed on the one using literally everything else:

$ terraform plan -target openstack_compute_instance_v2.jump_host
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but will not be
persisted to local or remote state storage.

data.openstack_networking_network_v2.public_network: Refreshing state... [id=foo]
data.openstack_images_image_v2.centos: Refreshing state... [id=foo]
openstack_compute_keypair_v2.ssh_key: Refreshing state... [id=foo]
openstack_networking_network_v2.network: Refreshing state... [id=foo]
openstack_networking_subnet_v2.monitoring: Refreshing state... [id=foo]
openstack_compute_instance_v2.jump_host: Refreshing state... [id=foo]

Error: Resource not found


Releasing state lock. This may take a few moments...

Provider

If you've read this far, you probably feel my pain. Let's take a look at the provider, which in my case is the OpenStack provider for Terraform. This is where I wish I had looked yesterday.

The OpenStack provider comes with its own log level: OS_DEBUG=1. This only works with the appropriate Terraform TF_LOG= statement (spoiler: not TF_LOG=TRACE).

This is what I started out with:

$ TF_LOG=ERROR OS_DEBUG=1 terraform plan -target openstack_compute_instance_v2.jump_host
... [WARN] Log levels other than TRACE are currently unreliable, and are supported only for backward compatibility.
  Use TF_LOG=TRACE to see Terraform's internal logs.
  ----
<...snip...>
openstack_networking_subnet_v2.monitoring: Refreshing state... [id=foo]
openstack_compute_instance_v2.jump_host: Refreshing state... [id=foo]
... [ERROR] eval: *terraform.EvalRefresh, err: Resource not found
... [ERROR] eval: *terraform.EvalSequence, err: Resource not found

Error: Resource not found


Releasing state lock. This may take a few moments...

Slightly more helpful (well, not really).

Now re-run the command with TF_LOG=DEBUG and the output will contain API calls made to OpenStack:

... [DEBUG] ..._v1.32.0: Vary: OpenStack-API-Version X-OpenStack-Nova-API-Version
... [DEBUG] ..._v1.32.0: X-Compute-Request-Id: bar
... [DEBUG] ..._v1.32.0: X-Openstack-Nova-Api-Version: 2.1
... [DEBUG] ..._v1.32.0: X-Openstack-Request-Id: bar
... [DEBUG] ..._v1.32.0: 2021/03/02 11:46:21 [DEBUG] OpenStack Response Body: {
... [DEBUG] ..._v1.32.0:   "itemNotFound": {
... [DEBUG] ..._v1.32.0:     "code": 404,
... [DEBUG] ..._v1.32.0:     "message": "Flavor foobar could not be found."
... [DEBUG] ..._v1.32.0:   }
... [DEBUG] ..._v1.32.0: }

And that explains why my terraform plan fails: the flavour I used four months ago is no longer available.

Fin

If I ever get to it, I have to figure out why those error messages are not bubbled up, or why TF_LOG=DEBUG doesn't enable OS_DEBUG=1 as well.

Thank you for reading. Have a great day!

Ansible Galaxy: Install private roles from private GitHub repositories

When I googled how to install private roles using ansible-galaxy, I found suggestions such as, "use git+https://github.com/..." or even better, "I am not sure what you're doing, but it works for me (since Ansible 2.2)".

So, since neither of these suggestions helped me and because I was unable to find documentation with obvious examples, here is how you achieve this.

Assuming you have your ssh key and configuration figured out, put this into requirements.yml:

---
- name: namespace.role
  src: git@github.com:my-organization/private-repository.git
  version: 1.0.0

This makes ansible-galaxy install -r requirements.yml git-clone the role using your ssh key.
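
For completeness, the install command then looks like this; the roles path is an assumption, adjust it to your project layout:

$ ansible-galaxy install -r requirements.yml -p roles/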

Prometheus: relabel your scrape_config

Prometheus labels every data point — the most well-known example of a label is (probably) instance.

Take a look at this query result (query: up{job="prometheus"}):

up{instance="127.0.0.1:9090",job="prometheus"} 1

So what does this tell me?

I queried for the "up" metric and filtered it for "prometheus" — yay. The "1" says my service is alive. So far, so good.

Readability

Since we are in the process of running a few Prometheus servers (in federation), each of those metrics will report back with instance="127.0.0.1:9090" (along with other labels of course).

While this works, I'm not a computer. If the "instance" reported an FQDN or some other readable name, it would make any dashboard or alert more approachable. Or readable, if you will.

The instance label

instance is a standard field used in various Grafana dashboards out there. Dashboards often use the value in instance to provide you with a dropdown list of (well) instances (or nodes) to select from.
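
As an aside, such a dropdown is usually backed by a Grafana template variable. A sketch of the variable query, assuming a Prometheus data source and the job from above:

label_values(up{job="prometheus"}, instance)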

To not end up with a dropdown full of 127.0.0.1:9090, here is a snippet on how to work with labels to make life a little easier.

Rewriting labels

Consider the following scrape_config:

- job_name: "prometheus"
  metrics_path: "/metrics"
  static_configs:
  - targets:
    - "127.0.0.1:9090"

It produces the result above.

Now, extend it slightly to include a name and relabel the instance field with it:

- job_name: "prometheus"
  metrics_path: "/metrics"
  relabel_configs:
    - source_labels: [name]
      target_label: instance
  static_configs:
  - targets:
    - "127.0.0.1:9090"
    labels:
      name: my-prometheus.example.org

Query again:

up{instance="my-prometheus.example.org",job="prometheus",name="my-prometheus.example.org"} 1

Now "instance" is set to something I can grok by glancing over it. Which makes me happy.

Fin

Thanks for following along!

Bootstrapping molecule instances with volumes

We use Ansible for all kinds of things. One of them is formatting and mounting a volume so it can actually be used.

When I introduced the code for that, it worked (flawlessly, of course) — until I hit a bug when I provisioned another cluster. Long story short, I was able to fix the bug. But since we rely on it to always work, and I wanted to make sure I had all situations covered, I decided to extend one of our tests.

Background

We currently use Hetzner Cloud to bootstrap instances for CI. They are pretty okay. At times you have odd issues where a CPU hangs or a server doesn't respond to SSH, but since it's cheap, it hasn't bothered me enough yet to find something else. Add to that, they are European (slightly fewer issues with data privacy, etc.), know how VAT works (meaning the invoices are correct) and allow paying invoices via SEPA (meaning fewer credit card fees, no currency conversions, etc.).

Extending the test

A molecule scenario is driven by a file called molecule.yml. It'll look similar to this:

---
dependency:
  name: galaxy
driver:
  name: hetznercloud
lint:
  name: yamllint
platforms:
  - name: node-01-${DRONE_BUILD_NUMBER:-111}
    server_type: cx11
    image: centos-7
  - name: node-02-${DRONE_BUILD_NUMBER:-111}
    server_type: cx11
    image: centos-7
provisioner:
  name: ansible
  config_options:
    ssh_connection:
      pipelining: True
  lint:
    name: ansible-lint
verifier:
  name: testinfra
  lint:
    name: flake8

Most of it is as generated. We added different names though, as this test requires multiple instances and we wanted to run multiple builds at the same time, which is why we append $DRONE_BUILD_NUMBER from the environment. (The :-111 fallback ensures the number is still set when you drone exec a build locally.)

TL;DR — the scenario will have two instances available: node-01-XYZ and node-02-XYZ.

Going from there, you have two additional files of interest: create.yml and destroy.yml.

The first is used to bootstrap instances through Ansible's hcloud_server module; the second cleans up after the scenario/build has finished.
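
For orientation, the instance creation inside create.yml boils down to a task roughly like the following. This is a simplified sketch: the generated file also takes care of the SSH key, async creation and writing out the instance config:

# simplified sketch of the instance creation task in create.yml
- name: Create molecule instance(s)
  hcloud_server:
    name: "{{ item.name }}"
    server_type: "{{ item.server_type }}"
    image: "{{ item.image }}"
    state: present
  with_items: "{{ molecule_yml.platforms }}"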

Adding the volume

In create.yml, I added the following task after "Wait for instance(s) creation to complete":

- name: Attach a volume
  hcloud_volume:
    name: "my-volume-{{ item.name }}"
    server: "{{ item.name }}"
    size: 15
    automount: no
    state: present
  with_items: "{{ molecule_yml.platforms }}"

The task uses Ansible's hcloud_volume module and ensures each of my nodes has a 15 GiB volume attached. The volume is called "my-volume" with the name of the instance appended (e.g. my-volume-node-01-XYZ). For our purposes, we also decided to attach it without mounting it, as we take care of that with our Ansible role.

Deleting the volume(s)

To save a few bucks and to clean up after each test run, open destroy.yml and add the following block after the instances are terminated:

- name: Delete volume(s)
  block:
    - name: Detach a volume
      hcloud_volume:
        name: "my-volume-{{ item.instance }}"
        state: absent
      with_items: "{{ instance_conf }}"
  ignore_errors: yes
  when: not skip_instances

Side note

Another neat trick: you can add arbitrary variables to the platforms entries in molecule.yml.

For example, to set the size of the volume try the following:

platforms:
  - name: instance
    disk_size: 20

You can then use that variable in the hcloud_volume task as {{ item.disk_size }} (see the sketch below). And if disk size is not your objective, you could control here whether each instance gets a volume at all, or if you only need one for certain roles in your setup. This is all a bit of a hack and maybe it'll go away, but for the time being I am glad no one bothered to validate these YAML keys or apply a schema.
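
Tying it together, the volume task from above could then pick up the per-platform size. A sketch, with a default in case a platform does not define disk_size:

- name: Attach a volume
  hcloud_volume:
    name: "my-volume-{{ item.name }}"
    server: "{{ item.name }}"
    # fall back to 15 GiB if the platform does not define disk_size
    size: "{{ item.disk_size | default(15) }}"
    automount: no
    state: present
  with_items: "{{ molecule_yml.platforms }}"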

Fin

Thanks for reading!