Terraform and OpenStack: Boot an instance from CD-ROM

In the spirit of "this took me way too long", here's how to boot an instance with a CD-ROM on OpenStack, using Terraform.

Why would I need this?

In a perfect world, I have templates to bootstrap instances, meaning the instances are ready to go when booted. I customise them with cloud-init and let them do all kinds of cool (or necessary) stuff like configuring the network, setting hostnames, adding user accounts and then maybe joining them to a cluster.

But I don't live in a perfect world. Still, I try to automate as much as I can, so I don't have to remember any of it.

Use-case

The use-case is the installation (or setup) of a Sophos firewall. The vendor provides an image which has to be booted, after which an installer and setup wizard have to be completed to finish the installation.

Using Terraform

Let's look at the code first - the following is used to create the instance:

resource "openstack_compute_instance_v2" "vpn_host" {
  depends_on = [
    data.openstack_images_image_v2.vpn_image
  ]

  name        = "vpn"
  flavor_name = "dynamic-M1"

  security_groups = [
    "default",
  ]

  # boot device
  block_device {
    source_type           = "blank"
    volume_size           = "100"
    boot_index            = 0
    destination_type      = "volume"
    delete_on_termination = false
  }

  # cd-rom
  block_device {
    uuid             = data.openstack_images_image_v2.vpn_image.id
    source_type      = "image"
    destination_type = "volume"
    boot_index       = 1
    volume_size      = 1
    device_type      = "cdrom"
  }

  network {
    port = openstack_networking_port_v2.vpn_port.id
  }

  network {
    uuid = data.openstack_networking_network_v2.public_network.id
  }
}

I am omitting some code, but let's walk through this.

How to CD-ROM (block_device)

I am approaching this in reverse order — let me talk about the second block_device block first.

This is the bit that took me the longest, because I didn't know how disk_bus and device_type play together, or which of the two is needed.

The moral of the story is, if the Terraform provider documentation is too vague, read OpenStack's documentation on device mapping instead. Or in your case, you are reading my blog post! :-)

To continue, the image of the Sophos firewall is referenced by data.openstack_images_image_v2.vpn_image.id. Therefore, I have a data provider which pulls the image from OpenStack (or Glance):

data "openstack_images_image_v2" "vpn_image" {
  name = "fancy readable name of the ISO here"
}

During terraform apply, Terraform will try to resolve it. If successful, the result will be used to create a (Cinder) volume from the image. The volume size of 1 (GB) is what OpenStack suggested when I did this via the fancy web UI, so I used the same value in my Terraform setup.

The important part of the block_device block is device_type = "cdrom". Without it OpenStack will refuse to boot from the volume even though we provide a boot_index.

Small caveat: I had to add a depends_on as Terraform's dependency graph would not wait for the data provider to resolve during apply.

Boot device

Last but not least: I also need a bootable root partition to install to, and that's the first block_device block in my code snippet.

If all goes well, the provisioning is as follows:

  1. OpenStack starts the instance
  2. It discovers that the first disk is not bootable (yet)
  3. It proceeds with the CD-ROM (attached to /dev/hda in my case).

After the installation is finished, subsequent reboots of the instance always use the first disk. This is similar to dropping a CD into a (real) server, installing it (from the CD) and leaving the CD (in the drive) at the data center (just in case). :-)

The rest

The rest is hopefully straightforward.

I defined two other networks (with another Terraform run) which are used via data providers.

One is consumed through a port for fixed IP allocation/configuration (openstack_networking_port_v2.vpn_port.id), and the other gives the VPN instance an additional IP on the public network for dial-in and remote management (via data.openstack_networking_network_v2.public_network.id).
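
To give an idea of what that looks like, here is a rough sketch of the port and the data providers backing it. The names, the subnet and the fixed IP address are made up for illustration and are not part of my actual setup:

data "openstack_networking_network_v2" "internal_network" {
  name = "vpn-internal"
}

data "openstack_networking_subnet_v2" "internal_subnet" {
  name = "vpn-internal-subnet"
}

resource "openstack_networking_port_v2" "vpn_port" {
  name       = "vpn-port"
  network_id = data.openstack_networking_network_v2.internal_network.id

  # fixed IP allocation for the VPN instance
  fixed_ip {
    subnet_id  = data.openstack_networking_subnet_v2.internal_subnet.id
    ip_address = "10.0.0.10"
  }
}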

Fin

Thanks for reading.

Terraform: Resource not found

Here are a few things I learned and did when I encountered the very verbose "Resource not found" error from Terraform.

Debug your Infrastructure as Code

More logs?

This is my obvious choice or go-to. Terraform comes with different log levels, though it will tell you itself that every level but TRACE is not to be trusted:

2021/03/02 09:21:33 [WARN] Log levels other than TRACE are currently unreliable, and are supported only for backward compatibility. Use TF_LOG=TRACE to see Terraform's internal logs.

FWIW, DEBUG and ERROR seem to produce okay output to narrow down problems, while TRACE is overwhelming, which is not very helpful either.
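
If the log output becomes too much for the terminal, Terraform can also write it to a file via TF_LOG_PATH; the file name below is just an example:

$ TF_LOG=DEBUG TF_LOG_PATH=./terraform-debug.log terraform plan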

Refresh, plan?

To narrow down a problem I can run terraform refresh (or import, plan) and hope for the best, but what I found incredibly valuable was adding a -target to either. This allows me to test resources one by one.

To retrieve a list of what is currently known to Terraform's state:

$ terraform state list
data.openstack_images_image_v2.centos
data.openstack_networking_network_v2.public_network
openstack_compute_instance_v2.jump_host
openstack_compute_keypair_v2.ssh_key
openstack_networking_network_v2.network
openstack_networking_secgroup_rule_v2.jump_host_rule
openstack_networking_secgroup_rule_v2.monitoring_rule
openstack_networking_secgroup_v2.jump_group
openstack_networking_subnet_v2.monitoring

Which seems accurate in my case.

Then I proceeded to go through each of them to find out what I may or may not know:

$ terraform plan -target openstack_compute_keypair_v2.ssh_key
...

Of course, it only failed on the one using literally everything else:

$ terraform plan -target openstack_compute_instance_v2.jump_host
Refreshing Terraform state in-memory prior to plan...
The refreshed state will be used to calculate this plan, but will not be
persisted to local or remote state storage.

data.openstack_networking_network_v2.public_network: Refreshing state... [id=foo]
data.openstack_images_image_v2.centos: Refreshing state... [id=foo]
openstack_compute_keypair_v2.ssh_key: Refreshing state... [id=foo]
openstack_networking_network_v2.network: Refreshing state... [id=foo]
openstack_networking_subnet_v2.monitoring: Refreshing state... [id=foo]
openstack_compute_instance_v2.jump_host: Refreshing state... [id=foo]

Error: Resource not found


Releasing state lock. This may take a few moments...

Provider

If you've read this far, you probably feel my pain. Let's take a look at the provider, which in my case is the OpenStack provider for Terraform. This is where I wish I had looked yesterday.

The OpenStack provider comes with its own log level: OS_DEBUG=1. This only works with the appropriate Terraform TF_LOG= statement (spoiler: not TF_LOG=TRACE).

This is what I started out with:

$ TF_LOG=ERROR OS_DEBUG=1 terraform plan -target openstack_compute_instance_v2.jump_host
... [WARN] Log levels other than TRACE are currently unreliable, and are supported only for backward compatibility.
  Use TF_LOG=TRACE to see Terraform's internal logs.
  ----
<...snip...>
openstack_networking_subnet_v2.monitoring: Refreshing state... [id=foo]
openstack_compute_instance_v2.jump_host: Refreshing state... [id=foo]
... [ERROR] eval: *terraform.EvalRefresh, err: Resource not found
... [ERROR] eval: *terraform.EvalSequence, err: Resource not found

Error: Resource not found


Releasing state lock. This may take a few moments...

Slightly more helpful (well, not really).

Now re-run the command with TF_LOG=DEBUG and the output will contain API calls made to OpenStack:

... [DEBUG] ..._v1.32.0: Vary: OpenStack-API-Version X-OpenStack-Nova-API-Version
... [DEBUG] ..._v1.32.0: X-Compute-Request-Id: bar
... [DEBUG] ..._v1.32.0: X-Openstack-Nova-Api-Version: 2.1
... [DEBUG] ..._v1.32.0: X-Openstack-Request-Id: bar
... [DEBUG] ..._v1.32.0: 2021/03/02 11:46:21 [DEBUG] OpenStack Response Body: {
... [DEBUG] ..._v1.32.0:   "itemNotFound": {
... [DEBUG] ..._v1.32.0:     "code": 404,
... [DEBUG] ..._v1.32.0:     "message": "Flavor foobar could not be found."
... [DEBUG] ..._v1.32.0:   }
... [DEBUG] ..._v1.32.0: }

And this explains why my terraform plan fails: the flavour I used four months ago is no longer available.
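
To double-check that outside of Terraform, the OpenStack CLI can list the flavours currently visible to the project (assuming python-openstackclient is installed and your credentials are sourced):

$ openstack flavor list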

Fin

If I ever get to it, I have to figure out why those error messages are not bubbled up. Or why TF_LOG=DEBUG doesn't invoke OS_DEBUG=1.

Thank you for reading. Have a great day!

Ansible Galaxy: Install private roles from private GitHub repositories

When I googled how to install private roles using ansible-galaxy, I found suggestions such as, "use git+https://github.com/..." or even better, "I am not sure what you're doing, but it works for me (since Ansible 2.2)".

So, since neither of these suggestions helped me, and because I was unable to find documentation with obvious examples, here is how to achieve this.

Assuming you have your ssh key and configuration figured out, put this into requirements.yml:

---
- name: namespace.role
  src: git@github.com:my-organization/private-repository.git
  version: 1.0.0

This forces ansible-galaxy install -r requirements.yml to git-clone the role using your ssh key.
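
For reference, the full invocation looks like this; the roles path is only an example:

$ ansible-galaxy install -r requirements.yml --roles-path ./roles

The version refers to a tag (or branch/commit) in the git repository.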

Prometheus: relabel your scrape_config

Prometheus labels every data point — the most well-known example of a label is (probably) instance.

Take a look at this query result (query: up{job="prometheus"}):

up{instance="127.0.0.1:9090",job="prometheus"} 1

So what does this tell me?

I queried for the "up" metric and filtered it for the "prometheus" job. Yay. The "1" says my service is alive. So far, so good.

Readability

Since we are in the process of running a few Prometheus servers (in federation), each of those metrics will report back with instance="127.0.0.1:9090" (along with other labels of course).

While this works, I'm not a computer. If the "instance" reported an FQDN or some other readable name, it would make any dashboard or alert more approachable. Or readable, if you will.

The instance label

instance is a standard label used in various Grafana dashboards out there. Dashboards often use the value in instance to provide you with a dropdown list of (well) instances (or nodes) to select from.

To not end up with a dropdown full of 127.0.0.1:9090, here is a snippet on how to work with labels to make life a little easier.

Rewriting labels

Consider the following scrape_config:

- job_name: "prometheus"
  metrics_path: "/metrics"
  static_configs:
  - targets:
    - "127.0.0.1:9090"

It produces the result above.

Now, extend it slightly to include a name and relabel the instance field with it:

- job_name: "prometheus"
  metrics_path: "/metrics"
  relabel_configs:
    - source_labels: [name]
      target_label: instance
  static_configs:
  - targets:
    - "127.0.0.1:9090"
    labels:
      name: my-prometheus.example.org

Query again:

up{instance="my-prometheus.example.org",job="prometheus",name="my-prometheus.example.org"} 1

Now "instance" is set to something I can grok by glancing over it. Which makes me happy.

Fin

Thanks for following along!

Speeding up composer on AWS OpsWorks

At EasyBib, we're heavy users of composer and AWS OpsWorks. Since we recently moved a lot of our applications to a continuous deployment model, the benefits of speeding up the deployment process (~4-5 minutes) became more obvious.

Composer install

Whenever we run composer install, there are a lot of round-trips between the server, our satis and GitHub (or Amazon S3).

One of my first ideas was to get around a continuous reinstall by symlinking the vendor directory between releases. This doesn't work consistently for two reasons:

What's a release?

OpsWorks, or Chef in particular, calls deployed code releases.

A release is the checkout/clone/download of your application and lives in /srv/www:

srv/
└── www
    └── my_app
        ├── current -> /srv/www/my_app/releases/20131008134950
        ├── releases
        └── shared

The releases directory contains your application code, and the latest release is always symlinked into place as current.

Atomic deploys

  1. Deploys need to be atomic. We don't want to break whatever is currently online, not even for a second or a fraction of one.
  2. We have to be able to roll back deployments.

Symlinking the vendor directory between releases doesn't work because it would break the currently running code (who knows how long the composer install or a restart of the application server takes), and it would require an additional safety net to be able to roll back a failed deploy.

Ruby & Chef to the rescue

Whenever a deployment is run, Chef allows us to hook into the process using deploy hooks. These hooks are documented for OpsWorks as well.

The available hooks are:

  • before migrate
  • before symlink (!)
  • before restart
  • after restart

In order to use them, create a deploy directory in your application and put a couple of Ruby files in there:

  • before_migrate.rb
  • before_symlink.rb
  • before_restart.rb
  • after_restart.rb

If you're a little in the know about Rails, these hooks will look familiar.

The migration hook is typically used to run database migrations, something we don't do and probably never will. ;-) But rest assured: at this point the checkout of your application is complete; in other words, the code is on the instance.

The symlink hook is what we use to run composer install to get the web app up to speed; we'll take a closer look in a second.

Before restart is a hook used to run commands before your application server reloads: for example, purging cache directories, or whatever else you want to get in order before /etc/init.d/php-fpm reload is executed to revive APC.

And last but not least, after restart: our applications use it to send an annotation to New Relic saying that we successfully deployed a new release.
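
To make the restart hooks a little more concrete, here is a minimal sketch of a before_restart.rb that purges a cache directory. The cache path is an assumption for illustration, not part of our actual setup:

# before_restart.rb: minimal sketch, the cache path is an assumption
cache_dir = "#{release_path}/app/cache"

# wipe the cache so the application server starts from a clean slate
::FileUtils.rm_rf(::Dir.glob("#{cache_dir}/*")) if ::File.directory?(cache_dir)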

Before symlink

So up until now, the before_symlink.rb looked like this:

composer_command = "/usr/local/bin/php"
composer_command << " #{release_path}/composer.phar"
composer_command << " --no-dev"
composer_command << " --prefer-source"
composer_command << " --optimize-autoloader"
composer_command << " install"

run "cd #{release_path} && #{composer_command}"

Note: release_path is a variable automatically available/populated in the scope of this script. If you need more, your node attributes are available as well.

Anyway, reading Scaling Symfony2 with AWS OpsWorks inspired me to attempt to copy my vendors around. But instead of doing it in a recipe, I decided to use one of the available deploy hooks:

app_current = ::File.expand_path("#{release_path}/../../current")
vendor_dir  = "#{app_current}/vendor"

deploy_user  = "www-data"
deploy_group = "www-data"

release_vendor = "#{release_path}/vendor"

# copy the currently deployed release's vendor directory into the new release (if there is one)
::FileUtils.cp_r vendor_dir, release_vendor if ::File.exists?(vendor_dir)
# make sure the webserver owns the copied files
::FileUtils.chown_R deploy_user, deploy_group, release_vendor if ::File.exists?(release_vendor)

composer_command = "/usr/local/bin/php"
composer_command << " #{release_path}/composer.phar"
composer_command << " --no-dev"
composer_command << " --prefer-source"
composer_command << " --optimize-autoloader"
composer_command << " install"

run "cd #{release_path} && #{composer_command}"

Step by step:

  • copy the current release's vendor to the new release (if it exists)
  • chown all files to the webserver (if the new vendor exists)

This allows the deploy hook to complete, even if we're on a fresh instance.

Benchmarks?

Effectively, this cut deployment from 4-5 minutes to 2-3 minutes. With a tailwind, a 50% improvement.

FIN

That's all. Happy deploying!