How to use Consul DNS in Nomad Docker jobs

Created on: 2021-11-29 16:55

Recently I've been playing with Nextcloud. At first I had a lot of trouble getting it to run performantly in my Nomad cluster - lots of crashes, and the 'Preview' app basically toppling over the whole application. However, I was able to get this working after running the backend, DB and cache in different containers, using the distinct_hosts constraint and making sure I was provisioning enough resources for the backend (specifically 1.2GB memory).
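
For reference, here's a minimal sketch of the kind of job spec I mean (the names and image are illustrative, not my exact config):

job "nextcloud" {
  datacenters = ["do1"]

  # Place each allocation of this job on a different client node.
  constraint {
    operator = "distinct_hosts"
    value    = "true"
  }

  group "app" {
    task "nextcloud" {
      driver = "docker"

      config {
        image = "nextcloud"
      }

      resources {
        memory = 1200 # MB - the backend fell over with much less
      }
    }
  }

  # The db and cache live in their own groups (omitted here), so the
  # constraint spreads the three groups across different nodes.
}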

I've not tuned my setup perfectly, and from time to time the DB or the cache gets restarted (I'm not sure if this is load related, or the preemption that I enabled recently). These restarts were causing the host and port mappings of services to change, since Nomad assigned them dynamically in my setup. Unfortunately the Nextcloud Docker images don't support dynamically adjusting configuration based on changes to environment variables - so whenever the db/cache host/port mappings changed, my Nextcloud app stopped working!
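
To make the failure mode concrete, this is roughly what a dynamically mapped port looks like in a job spec (again a sketch - the port label and service name are made up):

group "db" {
  network {
    # No static port: Nomad picks a free host port at placement time, so
    # the host:port pair can change whenever the alloc is rescheduled.
    port "pg" {
      to = 5432 # container port
    }
  }

  service {
    name = "nextcloud-db"
    port = "pg" # registered in Consul with whatever host port was picked
  }
}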

I needed to use Consul's DNS interface and static port bindings in Nomad jobs to provide fixed config to Nextcloud. DNS takes care of services moving between hosts; the static port binding is needed because an A record only carries an IP, not a port.

In my Nomad/Consul cluster I have a Consul agent installed on each machine. Each agent can provide DNS resolution of registered services to its host: a service name resolves to the IP address(es) of the hosts currently running that service. By default each service can be resolved with a pattern like <service name>.service.<datacenter name>.consul.

For example, in my cluster in datacenter do1, each host running Nomad registers a service called nomad-client with Consul. From a host in do1 I can explicitly query the Consul nameserver provided by the agent:

$ dig @127.0.0.1 -p 8600 nomad-client.service.do1.consul

; <<>> DiG 9.16.1-Ubuntu <<>> @127.0.0.1 -p 8600 nomad-client.service.do1.consul
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 49055
;; flags: qr aa rd; QUERY: 1, ANSWER: 4, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;nomad-client.service.do1.consul. IN	A

;; ANSWER SECTION:
nomad-client.service.do1.consul. 0 IN	A	10.0.0.3
nomad-client.service.do1.consul. 0 IN	A	10.0.0.2
nomad-client.service.do1.consul. 0 IN	A	10.0.0.6
nomad-client.service.do1.consul. 0 IN	A	10.0.0.4

;; Query time: 4 msec
;; SERVER: 127.0.0.1#8600(127.0.0.1)
;; WHEN: Mon Nov 29 18:54:15 UTC 2021
;; MSG SIZE  rcvd: 124

However, by default the Consul nameserver isn't exposed by Docker to containers. To enable this, I installed dnsmasq (configured to forward lookups under the consul domain to the local agent), disabled the built-in systemd-resolved and manually updated resolv.conf:

/etc/dnsmasq.d/10-consul.conf

server=/consul/127.0.0.1#8600

/etc/resolv.conf

nameserver 10.0.0.2 # Use the IP of the host, not 127.0.0.1
nameserver 8.8.8.8

Then execute:

apt-get install -y dnsmasq
systemctl disable systemd-resolved  # free up port 53 for dnsmasq
systemctl stop systemd-resolved
systemctl enable dnsmasq
systemctl start dnsmasq
systemctl restart docker  # so new containers pick up the resolver change

Importantly, resolv.conf needs to include the IP of a host running a Consul agent (it's reasonable to just use the private IP of the host you're on). If you use 127.0.0.1, Docker will strip that nameserver when you launch a container, because inside the container 127.0.0.1 is the container's own loopback address and can't reach the host's resolver.

Once this is done you can verify that containers running on the host can use Consul to resolve services with a command like docker run --rm busybox ping <service name>.service.<datacenter name>.consul (busybox ships with ping; the stock ubuntu image doesn't).

These steps need to be taken on every host for containers on the host to resolve services with DNS. I use Terraform/Packer to deploy my cluster, which is how the change got applied to every node.
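
For what it's worth, in a Packer HCL template this can be a couple of provisioners baking the change into the machine image (a sketch - the source name is made up, and since the resolv.conf entry needs each host's own private IP, I'd leave that file to boot-time configuration):

build {
  sources = ["source.digitalocean.nomad_client"] # illustrative

  # Ship the dnsmasq config onto the image.
  provisioner "file" {
    source      = "10-consul.conf"
    destination = "/tmp/10-consul.conf"
  }

  provisioner "shell" {
    inline = [
      "sudo apt-get install -y dnsmasq",
      "sudo mv /tmp/10-consul.conf /etc/dnsmasq.d/10-consul.conf",
      "sudo systemctl disable systemd-resolved",
      "sudo systemctl enable dnsmasq",
    ]
  }
}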

With this change made, I updated the Nomad jobs for the db and cache services to use static host ports. This means my Nextcloud config can be completely static, and connections will be correctly routed to backend services, which may still move between hosts dynamically.
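
Concretely, the result is shaped something like this (a sketch with illustrative names - POSTGRES_HOST is one of the env vars the official Nextcloud image reads, and the port values are assumptions, not my exact config):

# The db group pins a static host port...
group "db" {
  network {
    port "pg" {
      static = 5432 # the same host port on whichever node runs the alloc
      to     = 5432
    }
  }

  service {
    name = "nextcloud-db"
    port = "pg"
  }
}

# ...so the Nextcloud task can point at a Consul DNS name that never changes.
task "nextcloud" {
  driver = "docker"

  config {
    image = "nextcloud"
  }

  env {
    POSTGRES_HOST = "nextcloud-db.service.do1.consul:5432"
  }
}

The trade-off with a static port is that only one allocation using it can run per node - which is fine here, since the jobs are spread across hosts anyway.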