Multihomed DNS with octodns

12 minute read Published: 2021-06-18

DNS failure has an outsized impact on service availability. Azure's DNS recently failed, which took out a bunch of Azure services (including the status site... of course), customer sites, etc.

It's not much comfort having spent enormous amounts of money on engineering systems for multiple availability zones in "the cloud" when DNS takes down your entire site.

No one should be single homing their DNS in 2021. No one should be signing contracts or paying $$$ for DNS in 2021, either, but that's a different topic altogether.

Anyway. Luckily, octodns exists to solve this problem for you.

octodns is one of those tools which:

  1. Works.
  2. Works well.
  3. Actually does what it says. Reliably.

octodns grew out of an internal project at GitHub. It allows you to manage your zone at an unlimited number of DNS providers from a single YAML file. So you can commit your DNS configuration to git, version it, audit it, etc, etc. DevOps (!) and what not. Cool.

Some of the octodns providers (as in, the Python classes that know how to talk to each service) aren't as full-featured or up-to-date as others, and you'll have to test that out yourself. Since it's OSS, you should fix the providers that you need and submit some PRs. But most of the big names (AWS Route53, GCP DNS, Azure DNS (heh), Dyn, etc) are well supported.

This post is going to detail how to set up triple homed DNS on AWS, GCP and Azure.

At the end, when you run whois example.com, the output is going to look something like:

   Name Server: NS-1417.AWSDNS-49.ORG
   Name Server: NS-185.AWSDNS-23.COM
   Name Server: NS-1964.AWSDNS-53.CO.UK
   Name Server: NS-915.AWSDNS-50.NET
   Name Server: NS-CLOUD-E1.GOOGLEDOMAINS.COM
   Name Server: NS-CLOUD-E2.GOOGLEDOMAINS.COM
   Name Server: NS-CLOUD-E3.GOOGLEDOMAINS.COM
   Name Server: NS-CLOUD-E4.GOOGLEDOMAINS.COM
   Name Server: NS1-08.AZURE-DNS.COM
   Name Server: NS2-08.AZURE-DNS.NET
   Name Server: NS3-08.AZURE-DNS.ORG
   Name Server: NS4-08.AZURE-DNS.INFO

Step One

First, let's set up our octodns config/production.yaml:

---
providers:
  config:
    class: octodns.provider.yaml.YamlProvider
    directory: ./config
    default_ttl: 300
    enforce_order: True
  route53:
    class: octodns.provider.route53.Route53Provider
    access_key_id: env/AWS_ACCESS_KEY_ID
    secret_access_key: env/AWS_SECRET_ACCESS_KEY
  gcloud:
    class: octodns.provider.googlecloud.GoogleCloudProvider
    credentials_file: ./config/gcloud_dns.json
  azuredns:
    class: octodns.provider.azuredns.AzureProvider
    client_id: env/AZURE_CLIENT_ID
    key: env/AZURE_KEY
    directory_id: env/AZURE_DIRECTORY_ID
    sub_id: env/AZURE_SUB_ID
    resource_group: 'my-resource-group'

zones:
  example.com.:
    sources:
      - config
    targets:
      - route53
      - gcloud
      - azuredns

You'll need to go get all those credentials from the three providers in targets. That's fairly straightforward.

You most likely already have your zone set up at one of the providers you've configured.

We're going to assume the route53 provider is your current primary DNS.

octodns makes it easy to dump that entire zone into a YAML file, like so:

% octodns-dump --config-file=config/production.yaml --output-dir=tmp/ example.com. route53

Now inspect the tmp/example.com.yaml file and copy it over to config/example.com.yaml.

Step Two

Go to the other providers, if you haven't already, and create the zone in their DNS product. You'll get a bunch of name servers which you'll need for the next step.

So, right now for example.com we have:

AWS:

ns-1417.awsdns-49.org.
ns-185.awsdns-23.com.
ns-1964.awsdns-53.co.uk.
ns-915.awsdns-50.net.

Azure:

ns1-08.azure-dns.com.
ns2-08.azure-dns.net.
ns3-08.azure-dns.org.
ns4-08.azure-dns.info.

GCP:

ns-cloud-e1.googledomains.com.
ns-cloud-e2.googledomains.com.
ns-cloud-e3.googledomains.com.
ns-cloud-e4.googledomains.com.

Edit the config/example.com.yaml file, and at the top you'll see a type: NS entry.

Replace it with our new triple homed NS setup:

---
? ''
: - ttl: 21600
    type: NS
    values:
    - ns-1417.awsdns-49.org.
    - ns-185.awsdns-23.com.
    - ns-1964.awsdns-53.co.uk.
    - ns-915.awsdns-50.net.
    - ns-cloud-e1.googledomains.com.
    - ns-cloud-e2.googledomains.com.
    - ns-cloud-e3.googledomains.com.
    - ns-cloud-e4.googledomains.com.
    - ns1-08.azure-dns.com.
    - ns2-08.azure-dns.net.
    - ns3-08.azure-dns.org.
    - ns4-08.azure-dns.info.
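If you'd rather not hand-copy twelve host names, the merge can be sketched in a few lines of Python. This is only an illustration: merged_ns_values is a hypothetical helper (not part of octodns), and it assumes you paste in the name servers each provider gave you.

```python
# Hypothetical helper, not part of octodns: merge the name servers handed out
# by each provider into one list for the apex NS record.
def merged_ns_values(*provider_lists):
    merged = []
    for names in provider_lists:
        for name in names:
            # Normalise case and make sure every name is fully qualified.
            fqdn = name.strip().lower()
            if not fqdn.endswith("."):
                fqdn += "."
            # Skip duplicates while keeping first-seen order.
            if fqdn not in merged:
                merged.append(fqdn)
    return merged


if __name__ == "__main__":
    aws = ["ns-1417.awsdns-49.org.", "ns-185.awsdns-23.com.",
           "ns-1964.awsdns-53.co.uk.", "ns-915.awsdns-50.net."]
    gcp = ["ns-cloud-e1.googledomains.com.", "ns-cloud-e2.googledomains.com.",
           "ns-cloud-e3.googledomains.com.", "ns-cloud-e4.googledomains.com."]
    azure = ["ns1-08.azure-dns.com.", "ns2-08.azure-dns.net.",
             "ns3-08.azure-dns.org.", "ns4-08.azure-dns.info."]
    # Print the entries in the same "- value" form used in the zone file.
    for value in merged_ns_values(aws, gcp, azure):
        print(f"    - {value}")
```

The printed lines can be pasted under values: in config/example.com.yaml.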

Step Three

Before we move on, let's sync up our new providers with our zone so they're ready to serve requests.

% octodns-sync --config-file=./config/production.yaml --target gcloud --doit
% octodns-sync --config-file=./config/production.yaml --target azuredns --doit

The default run mode of octodns-sync is a dry run. You can leave the --doit flag out to see what octodns would do. It'll only modify things when the flag is used.
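If you script your syncs, it's worth keeping the dry run as the default there too. A minimal sketch, assuming octodns-sync is on your PATH; sync_command and sync are hypothetical wrapper names, and the only flags used are the ones shown above.

```python
import subprocess


def sync_command(config_file, target=None, doit=False):
    """Build the octodns-sync argument list; a dry run unless doit is True."""
    cmd = ["octodns-sync", f"--config-file={config_file}"]
    if target:
        cmd += ["--target", target]
    if doit:
        cmd.append("--doit")
    return cmd


def sync(config_file, target=None, doit=False):
    """Run octodns-sync and raise CalledProcessError if it exits non-zero."""
    return subprocess.run(sync_command(config_file, target, doit), check=True)


if __name__ == "__main__":
    # Only print the planned commands here; call sync(...) to actually run them.
    for target in ("gcloud", "azuredns"):
        print(" ".join(sync_command("./config/production.yaml", target)))
```

Calling sync("./config/production.yaml", "gcloud") shows the plan; adding doit=True applies it.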

Now that our new providers are in sync, we can bring our setup into production, but before we do...

Step Four: A Slight Wrinkle In the Plan

While it makes no real difference to the operation of our setup, most DNS lint type tools are going to complain if our SOA record isn't the same across our three providers.

That's easy enough to solve -- go to each provider and edit the SOA record to match all the others. Except in our case -- for no good reason -- Azure won't allow us to edit the Host portion of the SOA record. AWS and GCP allow us to edit all fields. So that means we'll have to take the Host that Azure has set and use it in the SOA record on the other two providers.

You can see this in the Azure DNS management page or we can:

% dig soa example.com @ns1-08.azure-dns.com. 

; <<>> DiG 9.10.6 <<>> soa example.com @ns1-08.azure-dns.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46900
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 3
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;example.com.		IN	SOA

;; ANSWER SECTION:
example.com.	900	IN	SOA	ns1-08.azure-dns.com. awsdns-hostmaster.amazon.com. 2019031701 3600 600 259200 60

;; ADDITIONAL SECTION:
ns1-08.azure-dns.com.	3600	IN	A	40.90.4.8
ns1-08.azure-dns.com.	3600	IN	AAAA	2603:1061::8

;; Query time: 749 msec
;; SERVER: 2603:1061::8#53(2603:1061::8)
;; WHEN: Fri Jun 18 14:12:07 PDT 2021
;; MSG SIZE  rcvd: 170

So ns1-08.azure-dns.com. is our SOA host. Now let's make our AWS and GCP SOA match Azure. octodns doesn't support syncing SOA records, so you'll have to do this manually. But you should only need to modify your SOA record once, so... whatever. Once that's done, check each provider:

AWS:

% dig soa example.com @ns-1417.awsdns-49.org.

; <<>> DiG 9.10.6 <<>> soa example.com @ns-1417.awsdns-49.org.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 64845
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 12, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;example.com.		IN	SOA

;; ANSWER SECTION:
example.com.	900	IN	SOA	ns1-08.azure-dns.com. awsdns-hostmaster.amazon.com. 2019031701 3600 600 259200 60

;; AUTHORITY SECTION:
example.com.	21600	IN	NS	ns-1417.awsdns-49.org.
example.com.	21600	IN	NS	ns-185.awsdns-23.com.
example.com.	21600	IN	NS	ns-1964.awsdns-53.co.uk.
example.com.	21600	IN	NS	ns-915.awsdns-50.net.

;; Query time: 231 msec
;; SERVER: 2600:9000:5305:8900::1#53(2600:9000:5305:8900::1)
;; WHEN: Fri Jun 18 14:22:15 PDT 2021
;; MSG SIZE  rcvd: 492

GCP:

% dig soa example.com @ns-cloud-e1.googledomains.com.

; <<>> DiG 9.10.6 <<>> soa example.com @ns-cloud-e1.googledomains.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44729
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 12, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;example.com.		IN	SOA

;; ANSWER SECTION:
example.com.	900	IN	SOA	ns1-08.azure-dns.com. awsdns-hostmaster.amazon.com. 2019031701 3600 600 259200 60

;; AUTHORITY SECTION:
example.com.	21600	IN	NS	ns-cloud-e1.googledomains.com.
example.com.	21600	IN	NS	ns-cloud-e2.googledomains.com.
example.com.	21600	IN	NS	ns-cloud-e3.googledomains.com.
example.com.	21600	IN	NS	ns-cloud-e4.googledomains.com.
example.com.	21600	IN	NS	ns-915.awsdns-50.net.
example.com.	21600	IN	NS	ns-1964.awsdns-53.co.uk.
example.com.	21600	IN	NS	ns-1417.awsdns-49.org.
example.com.	21600	IN	NS	ns-185.awsdns-23.com.
example.com.	21600	IN	NS	ns1-08.azure-dns.com.
example.com.	21600	IN	NS	ns2-08.azure-dns.net.
example.com.	21600	IN	NS	ns3-08.azure-dns.org.
example.com.	21600	IN	NS	ns4-08.azure-dns.info.

;; Query time: 155 msec
;; SERVER: 2001:4860:4802:32::6e#53(2001:4860:4802:32::6e)
;; WHEN: Fri Jun 18 14:23:15 PDT 2021
;; MSG SIZE  rcvd: 492
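Eyeballing three dig outputs gets old quickly, so here is a rough sketch of the same comparison in Python. It assumes dig is installed and shells out to dig +short; parse_soa, query_soa and soa_mismatches are hypothetical names made up for this example.

```python
import subprocess

# The seven SOA RDATA fields, in wire order (RFC 1035).
SOA_FIELDS = ("mname", "rname", "serial", "refresh", "retry", "expire", "minimum")


def parse_soa(line):
    """Parse a dig SOA answer line (full or +short form) into a dict."""
    parts = line.split()
    if len(parts) < len(SOA_FIELDS):
        raise ValueError(f"not an SOA record: {line!r}")
    # The RDATA is always the last seven whitespace-separated fields.
    return dict(zip(SOA_FIELDS, parts[-len(SOA_FIELDS):]))


def query_soa(zone, server):
    """Ask one authoritative server for the zone's SOA record."""
    out = subprocess.run(
        ["dig", "+short", "soa", zone, f"@{server}"],
        capture_output=True, text=True, check=True,
    ).stdout
    lines = out.strip().splitlines()
    if not lines:
        raise RuntimeError(f"no SOA answer from {server}")
    return parse_soa(lines[0])


def soa_mismatches(soas):
    """Return the SOA field names whose values differ between servers."""
    return [f for f in SOA_FIELDS if len({soa[f] for soa in soas.values()}) > 1]


if __name__ == "__main__":
    servers = ["ns1-08.azure-dns.com.", "ns-1417.awsdns-49.org.",
               "ns-cloud-e1.googledomains.com."]
    soas = {server: query_soa("example.com", server) for server in servers}
    print(soa_mismatches(soas) or "SOA records match")
```

An empty list from soa_mismatches means the three providers agree on every SOA field.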

Looks good. Let's sync up our new NS records to our existing provider:

% octodns-sync --config-file=./config/production.yaml --target route53 --doit

And let's make sure all providers are in sync one more time:

% octodns-sync --config-file=./config/production.yaml --doit

Then it's time to check your NS records at all providers.

AWS:

% dig ns example.com @ns-1417.awsdns-49.org.

; <<>> DiG 9.10.6 <<>> ns example.com @ns-1417.awsdns-49.org.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11881
;; flags: qr aa rd; QUERY: 1, ANSWER: 12, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;example.com.		IN	NS

;; ANSWER SECTION:
example.com.	21600	IN	NS	ns-1417.awsdns-49.org.
example.com.	21600	IN	NS	ns-185.awsdns-23.com.
example.com.	21600	IN	NS	ns-1964.awsdns-53.co.uk.
example.com.	21600	IN	NS	ns-915.awsdns-50.net.
example.com.	21600	IN	NS	ns-cloud-e1.googledomains.com.
example.com.	21600	IN	NS	ns-cloud-e2.googledomains.com.
example.com.	21600	IN	NS	ns-cloud-e3.googledomains.com.
example.com.	21600	IN	NS	ns-cloud-e4.googledomains.com.
example.com.	21600	IN	NS	ns1-08.azure-dns.com.
example.com.	21600	IN	NS	ns2-08.azure-dns.net.
example.com.	21600	IN	NS	ns3-08.azure-dns.org.
example.com.	21600	IN	NS	ns4-08.azure-dns.info.

;; Query time: 99 msec
;; SERVER: 2600:9000:5305:8900::1#53(2600:9000:5305:8900::1)
;; WHEN: Fri Jun 18 14:35:12 PDT 2021
;; MSG SIZE  rcvd: 431

Azure:

% dig ns example.com @ns1-08.azure-dns.com. 

; <<>> DiG 9.10.6 <<>> ns example.com @ns1-08.azure-dns.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50887
;; flags: qr aa rd; QUERY: 1, ANSWER: 12, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;example.com.		IN	NS

;; ANSWER SECTION:
example.com.	21600	IN	NS	ns1-08.azure-dns.com.
example.com.	21600	IN	NS	ns2-08.azure-dns.net.
example.com.	21600	IN	NS	ns3-08.azure-dns.org.
example.com.	21600	IN	NS	ns4-08.azure-dns.info.
example.com.	21600	IN	NS	ns-915.awsdns-50.net.
example.com.	21600	IN	NS	ns-1964.awsdns-53.co.uk.
example.com.	21600	IN	NS	ns-1417.awsdns-49.org.
example.com.	21600	IN	NS	ns-185.awsdns-23.com.
example.com.	21600	IN	NS	ns-cloud-e1.googledomains.com.
example.com.	21600	IN	NS	ns-cloud-e2.googledomains.com.
example.com.	21600	IN	NS	ns-cloud-e3.googledomains.com.
example.com.	21600	IN	NS	ns-cloud-e4.googledomains.com.

;; Query time: 144 msec
;; SERVER: 2603:1061::8#53(2603:1061::8)
;; WHEN: Fri Jun 18 14:35:34 PDT 2021
;; MSG SIZE  rcvd: 431

GCP:

% dig ns example.com @ns-cloud-e1.googledomains.com.

; <<>> DiG 9.10.6 <<>> ns example.com @ns-cloud-e1.googledomains.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13742
;; flags: qr aa rd; QUERY: 1, ANSWER: 12, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;example.com.		IN	NS

;; ANSWER SECTION:
example.com.	21600	IN	NS	ns-cloud-e1.googledomains.com.
example.com.	21600	IN	NS	ns-cloud-e2.googledomains.com.
example.com.	21600	IN	NS	ns-cloud-e3.googledomains.com.
example.com.	21600	IN	NS	ns-cloud-e4.googledomains.com.
example.com.	21600	IN	NS	ns-915.awsdns-50.net.
example.com.	21600	IN	NS	ns-1964.awsdns-53.co.uk.
example.com.	21600	IN	NS	ns-1417.awsdns-49.org.
example.com.	21600	IN	NS	ns-185.awsdns-23.com.
example.com.	21600	IN	NS	ns1-08.azure-dns.com.
example.com.	21600	IN	NS	ns2-08.azure-dns.net.
example.com.	21600	IN	NS	ns3-08.azure-dns.org.
example.com.	21600	IN	NS	ns4-08.azure-dns.info.

;; Query time: 68 msec
;; SERVER: 2001:4860:4802:32::6e#53(2001:4860:4802:32::6e)
;; WHEN: Fri Jun 18 14:35:57 PDT 2021
;; MSG SIZE  rcvd: 431

If everything matches, we're ready to tell the world about our new setup.

If it doesn't, re-check your steps and make sure you used the --doit flag.
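Note that each provider returns the records in a different order, so a plain diff of the dig output won't help. A small sketch of an order-insensitive comparison, assuming you've saved each provider's dig output as text; ns_set and providers_agree are hypothetical helper names.

```python
def ns_set(dig_output):
    """Collect the NS targets from the record lines of a dig response."""
    names = set()
    for line in dig_output.splitlines():
        if line.startswith(";"):
            continue  # skip dig's comment and question lines
        parts = line.split()
        # Record lines look like: name ttl IN NS target
        if len(parts) == 5 and parts[2] == "IN" and parts[3] == "NS":
            names.add(parts[4].lower())
    return names


def providers_agree(outputs):
    """True if every provider returned the same, non-empty set of NS records."""
    sets = [ns_set(text) for text in outputs.values()]
    return len(sets) > 0 and len(sets[0]) > 0 and all(s == sets[0] for s in sets)


if __name__ == "__main__":
    aws = ("example.com.\t21600\tIN\tNS\tns-1417.awsdns-49.org.\n"
           "example.com.\t21600\tIN\tNS\tns1-08.azure-dns.com.\n")
    azure = ("example.com.\t21600\tIN\tNS\tns1-08.azure-dns.com.\n"
             "example.com.\t21600\tIN\tNS\tns-1417.awsdns-49.org.\n")
    print(providers_agree({"aws": aws, "azure": azure}))
```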

One Final Step

Go to your domain registrar and update your domain's name servers to point to all of the name servers above.

After a couple of hours you can use What's My DNS to check the NS record type of your domain to see how far the change has propagated. It shouldn't take more than a couple of days.

And that's it. A triple homed, version controlled, easily auditable DNS setup. Woo.

Does it work?

That's all great in theory, you say, but does this setup protect us against a DNS failure at one of the providers in practice?

Luckily, we run our own custom built global monitoring system to be able to answer such questions. Every 10 seconds we collect a bunch of metrics from 26 or so monitoring agents around the world.

During Azure DNS's unavailability between 2021-04-01 21:21 UTC and 22:00 UTC, we saw 2 failures:

Alert alert.noflow.cdn has triggered!

endpoint=www.example.com,loc=amsterdam
metric time.ttfb was absent for 60 seconds recorded at Thu, Apr  1 2021 at 21:39:22 UTC

endpoint=www.example.com,loc=mumbai
metric time.ttfb was absent for 60 seconds recorded at Thu, Apr  1 2021 at 21:39:22 UTC

Alert alert.noflow.cdn has cleared at Thu, Apr  1 2021 at 21:42:17 UTC

For a total of 3 minutes.

And it turned out that those monitoring agents had an older /etc/resolv.conf configuration which we fixed up.

I think it's safe to say that it works.