DNS failure has an outsized impact on service availability. Azure's DNS recently failed, which took out a bunch of Azure services (including the status site... of course), customer sites, etc.
It's not much comfort having spent enormous amounts of money on engineering systems for multiple availability zones in "the cloud" when DNS takes down your entire site.
No one should be single homing their DNS in 2021. No one should be signing contracts or paying $$$ for DNS in 2021, either, but that's a different topic altogether.
Anyway. Luckily, octodns exists to solve this problem for you.
octodns is one of those tools which:
- Works.
- Works well.
- Actually does what it says. Reliably.
octodns grew out of an internal project at GitHub. It allows you to manage your zone at an unlimited number of DNS providers from a single YAML file. So you can commit your DNS configuration to git, version it, audit it, etc, etc. DevOps (!) and what not. Cool.
Some of the octodns providers (as in, the Python classes that know how to talk to each service) aren't as full-featured or up-to-date as others, and you'll have to test that out yourself. Since it's OSS, you should fix the providers that you need and submit some PRs. But most of the big names (AWS Route53, GCP DNS, Azure DNS (heh), Dyn, etc) are well supported.
This post is going to detail how to set up triple homed DNS on AWS, GCP and Azure.
At the end, when you run a whois example.com, it's going to look something like:
Name Server: NS-1417.AWSDNS-49.ORG
Name Server: NS-185.AWSDNS-23.COM
Name Server: NS-1964.AWSDNS-53.CO.UK
Name Server: NS-915.AWSDNS-50.NET
Name Server: NS-CLOUD-E1.GOOGLEDOMAINS.COM
Name Server: NS-CLOUD-E2.GOOGLEDOMAINS.COM
Name Server: NS-CLOUD-E3.GOOGLEDOMAINS.COM
Name Server: NS-CLOUD-E4.GOOGLEDOMAINS.COM
Name Server: NS1-08.AZURE-DNS.COM
Name Server: NS2-08.AZURE-DNS.NET
Name Server: NS3-08.AZURE-DNS.ORG
Name Server: NS4-08.AZURE-DNS.INFO
Step One
First, let's set up our octodns config/production.yaml:
---
providers:
  config:
    class: octodns.provider.yaml.YamlProvider
    directory: ./config
    default_ttl: 300
    enforce_order: True
  route53:
    class: octodns.provider.route53.Route53Provider
    access_key_id: env/AWS_ACCESS_KEY_ID
    secret_access_key: env/AWS_SECRET_ACCESS_KEY
  gcloud:
    class: octodns.provider.googlecloud.GoogleCloudProvider
    credentials_file: ./config/gcloud_dns.json
  azuredns:
    class: octodns.provider.azuredns.AzureProvider
    client_id: env/AZURE_CLIENT_ID
    key: env/AZURE_KEY
    directory_id: env/AZURE_DIRECTORY_ID
    sub_id: env/AZURE_SUB_ID
    resource_group: 'my-resource-group'
zones:
  example.com.:
    sources:
      - config
    targets:
      - route53
      - gcloud
      - azuredns
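Several values in that config are `env/NAME` references, which octodns reads from the environment when it runs. Before the first sync it can be handy to confirm they are all set. The following is a minimal sketch, not part of octodns: the helper names `referenced_env_vars` and `missing_env_vars` are made up for illustration, and it scans the file as plain text rather than parsing the YAML.

```python
import os
import re

# Matches config values such as "access_key_id: env/AWS_ACCESS_KEY_ID".
ENV_REF = re.compile(r":[ \t]*env/([A-Za-z_][A-Za-z0-9_]*)[ \t]*$", re.MULTILINE)


def referenced_env_vars(config_text):
    """Return the environment variable names the config refers to, in file order."""
    return ENV_REF.findall(config_text)


def missing_env_vars(config_text, environ=None):
    """Return the referenced variables that are unset or empty in `environ`."""
    env = os.environ if environ is None else environ
    return [name for name in referenced_env_vars(config_text) if not env.get(name)]


# Demo on an in-memory fragment of the config above (hypothetical environment).
sample = (
    "route53:\n"
    "  access_key_id: env/AWS_ACCESS_KEY_ID\n"
    "  secret_access_key: env/AWS_SECRET_ACCESS_KEY\n"
)
print(missing_env_vars(sample, {"AWS_ACCESS_KEY_ID": "placeholder"}))
```

To check the real file, read `config/production.yaml` into a string and pass it to `missing_env_vars` without the second argument, so it falls back to `os.environ`.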
You'll need to go get all those credentials from the three providers in targets. That's fairly straightforward.
You most likely already have your zone set up at one of the providers you've configured.
We're going to assume the route53 provider is your current primary DNS.
octodns makes it easy to dump that entire zone into a YAML file, like so:
% octodns-dump --config-file=config/production.yaml --output-dir=tmp/ example.com. route53
Now inspect the tmp/example.com.yaml file and copy it over to config/example.com.yaml.
Step Two
Go to the other providers, if you haven't already, and create the zone in their DNS product. You'll get a bunch of name servers which you'll need for the next step.
So, right now for example.com we have:
AWS:
ns-1417.awsdns-49.org.
ns-185.awsdns-23.com.
ns-1964.awsdns-53.co.uk.
ns-915.awsdns-50.net.
Azure:
ns1-08.azure-dns.com.
ns2-08.azure-dns.net.
ns3-08.azure-dns.org.
ns4-08.azure-dns.info.
GCP:
ns-cloud-e1.googledomains.com.
ns-cloud-e2.googledomains.com.
ns-cloud-e3.googledomains.com.
ns-cloud-e4.googledomains.com.
Edit the config/example.com.yaml file, and at the top you'll see a type: NS entry.
Replace it with our new triple homed NS setup:
---
? ''
: - ttl: 21600
    type: NS
    values:
    - ns-1417.awsdns-49.org.
    - ns-185.awsdns-23.com.
    - ns-1964.awsdns-53.co.uk.
    - ns-915.awsdns-50.net.
    - ns-cloud-e1.googledomains.com.
    - ns-cloud-e2.googledomains.com.
    - ns-cloud-e3.googledomains.com.
    - ns-cloud-e4.googledomains.com.
    - ns1-08.azure-dns.com.
    - ns2-08.azure-dns.net.
    - ns3-08.azure-dns.org.
    - ns4-08.azure-dns.info.
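Typing twelve name servers by hand is an easy place to drop a trailing dot or paste a duplicate. As a rough sketch, the combined NS entry can be generated from the per-provider lists instead. The function names here (`fqdn`, `merged_name_servers`, `render_ns_record`) are invented for this example, and the output layout simply mirrors the entry above.

```python
def fqdn(name):
    """Lower-case a host name and make sure it ends with a trailing dot."""
    name = name.strip().lower()
    return name if name.endswith(".") else name + "."


def merged_name_servers(providers):
    """Flatten per-provider name servers, keeping order and dropping duplicates."""
    merged = []
    for servers in providers.values():
        for server in servers:
            host = fqdn(server)
            if host not in merged:
                merged.append(host)
    return merged


def render_ns_record(servers, ttl=21600):
    """Render the apex NS entry as YAML text in the same layout as above."""
    lines = ["---", "? ''", ": - ttl: %d" % ttl, "    type: NS", "    values:"]
    lines += ["    - " + server for server in servers]
    return "\n".join(lines) + "\n"


# Name servers copied from the provider lists in this step.
providers = {
    "aws": ["ns-1417.awsdns-49.org.", "ns-185.awsdns-23.com.",
            "ns-1964.awsdns-53.co.uk.", "ns-915.awsdns-50.net."],
    "gcp": ["ns-cloud-e1.googledomains.com.", "ns-cloud-e2.googledomains.com.",
            "ns-cloud-e3.googledomains.com.", "ns-cloud-e4.googledomains.com."],
    "azure": ["ns1-08.azure-dns.com.", "ns2-08.azure-dns.net.",
              "ns3-08.azure-dns.org.", "ns4-08.azure-dns.info."],
}
print(render_ns_record(merged_name_servers(providers)))
```

The printed text is only the NS entry; the rest of the dumped zone file stays as it is.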
Step Three
Before we move on, let's sync up our new providers with our zone so they're ready to serve requests.
% octodns-sync --config-file=./config/production.yaml --target gcloud --doit
% octodns-sync --config-file=./config/production.yaml --target azuredns --doit
The default run mode of octodns-sync is a dry run. You can leave the --doit flag out to see what octodns would do. It'll only modify things when the flag is used.
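If you end up scripting these runs, it helps to keep the dry run as the default there too. The sketch below only builds the command line from the flags used in this post (`--config-file`, `--target`, `--doit`); `sync_command` is a hypothetical helper, and nothing is executed until you hand the result to something like `subprocess.run`.

```python
import shlex


def sync_command(config_file, target=None, doit=False):
    """Build an octodns-sync argv list; without doit=True it stays a dry run."""
    argv = ["octodns-sync", "--config-file=" + config_file]
    if target is not None:
        # Limit the run to a single provider, as in the commands above.
        argv += ["--target", target]
    if doit:
        # Only this flag lets octodns actually change records.
        argv.append("--doit")
    return argv


# Print the dry-run commands for the two new providers.
for target in ("gcloud", "azuredns"):
    print(shlex.join(sync_command("./config/production.yaml", target=target)))
```

Reviewing the dry-run plan first and then rerunning with `doit=True` keeps the same two-step habit the CLI encourages.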
Now that our new providers are in sync, we can bring our setup into production, but before we do...
Step Four: A Slight Wrinkle In the Plan
While it makes no real difference to the operation of our setup, most DNS lint type tools are going to complain if our SOA record isn't the same across our three providers.
That's easy enough to solve -- go to each provider and edit the SOA record to match all the others. Except in our case -- for no good reason -- Azure won't allow us to edit the Host portion of the SOA record. AWS and GCP allow us to edit all fields. So that means we'll have to use the Host that Azure has set on the other two providers.
You can see this in the Azure DNS management page, or we can ask one of the Azure name servers directly:
% dig soa example.com @ns1-08.azure-dns.com.
; <<>> DiG 9.10.6 <<>> soa example.com @ns1-08.azure-dns.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 46900
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 3
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;example.com. IN SOA
;; ANSWER SECTION:
example.com. 900 IN SOA ns1-08.azure-dns.com. awsdns-hostmaster.amazon.com. 2019031701 3600 600 259200 60
;; ADDITIONAL SECTION:
ns1-08.azure-dns.com. 3600 IN A 40.90.4.8
ns1-08.azure-dns.com. 3600 IN AAAA 2603:1061::8
;; Query time: 749 msec
;; SERVER: 2603:1061::8#53(2603:1061::8)
;; WHEN: Fri Jun 18 14:12:07 PDT 2021
;; MSG SIZE rcvd: 170
So ns1-08.azure-dns.com. is our SOA host. Now let's make our AWS and GCP SOA match Azure. octodns doesn't support syncing SOA records, so you'll have to do this manually, but you should only need to modify your SOA record once, so... whatever.
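Comparing three dig outputs by eye gets old quickly. Here is a small sketch that parses the SOA line from each provider's ANSWER SECTION and reports any provider that disagrees with the first one. The names `Soa`, `parse_soa` and `mismatched_providers` are made up for this example, and it assumes the standard eleven-field line dig prints for an SOA answer.

```python
from typing import NamedTuple


class Soa(NamedTuple):
    mname: str    # primary name server (the "Host" field discussed above)
    rname: str    # responsible mailbox
    serial: int
    refresh: int
    retry: int
    expire: int
    minimum: int


def parse_soa(answer_line):
    """Parse one SOA line from the ANSWER SECTION of dig output."""
    fields = answer_line.split()
    # Layout: name ttl class type mname rname serial refresh retry expire minimum
    if len(fields) != 11 or fields[3].upper() != "SOA":
        raise ValueError("not an SOA answer line: %r" % answer_line)
    numbers = [int(value) for value in fields[6:]]
    return Soa(fields[4].lower(), fields[5].lower(), *numbers)


def mismatched_providers(answers):
    """Return providers whose SOA differs from the first provider's SOA."""
    parsed = {provider: parse_soa(line) for provider, line in answers.items()}
    if not parsed:
        return []
    reference = next(iter(parsed.values()))
    return sorted(name for name, soa in parsed.items() if soa != reference)


azure_line = ("example.com. 900 IN SOA ns1-08.azure-dns.com. "
              "awsdns-hostmaster.amazon.com. 2019031701 3600 600 259200 60")
print(parse_soa(azure_line).mname)
```

Put the Azure answer first in the dictionary passed to `mismatched_providers`, since Azure's host is the one the other two have to copy.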
AWS:
% dig soa example.com @ns-1417.awsdns-49.org.
; <<>> DiG 9.10.6 <<>> soa example.com @ns-1417.awsdns-49.org.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 64845
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 12, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;example.com. IN SOA
;; ANSWER SECTION:
example.com. 900 IN SOA ns1-08.azure-dns.com. awsdns-hostmaster.amazon.com. 2019031701 3600 600 259200 60
;; AUTHORITY SECTION:
example.com. 21600 IN NS ns-1417.awsdns-49.org.
example.com. 21600 IN NS ns-185.awsdns-23.com.
example.com. 21600 IN NS ns-1964.awsdns-53.co.uk.
example.com. 21600 IN NS ns-915.awsdns-50.net.
;; Query time: 231 msec
;; SERVER: 2600:9000:5305:8900::1#53(2600:9000:5305:8900::1)
;; WHEN: Fri Jun 18 14:22:15 PDT 2021
;; MSG SIZE rcvd: 492
GCP:
% dig soa example.com @ns-cloud-e1.googledomains.com.
; <<>> DiG 9.10.6 <<>> soa example.com @ns-cloud-e1.googledomains.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 44729
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 12, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;example.com. IN SOA
;; ANSWER SECTION:
example.com. 900 IN SOA ns1-08.azure-dns.com. awsdns-hostmaster.amazon.com. 2019031701 3600 600 259200 60
;; AUTHORITY SECTION:
example.com. 21600 IN NS ns-cloud-e1.googledomains.com.
example.com. 21600 IN NS ns-cloud-e2.googledomains.com.
example.com. 21600 IN NS ns-cloud-e3.googledomains.com.
example.com. 21600 IN NS ns-cloud-e4.googledomains.com.
example.com. 21600 IN NS ns-915.awsdns-50.net.
example.com. 21600 IN NS ns-1964.awsdns-53.co.uk.
example.com. 21600 IN NS ns-1417.awsdns-49.org.
example.com. 21600 IN NS ns-185.awsdns-23.com.
example.com. 21600 IN NS ns1-08.azure-dns.com.
example.com. 21600 IN NS ns2-08.azure-dns.net.
example.com. 21600 IN NS ns3-08.azure-dns.org.
example.com. 21600 IN NS ns4-08.azure-dns.info.
;; Query time: 155 msec
;; SERVER: 2001:4860:4802:32::6e#53(2001:4860:4802:32::6e)
;; WHEN: Fri Jun 18 14:23:15 PDT 2021
;; MSG SIZE rcvd: 492
Looks good. Let's sync up our new NS records to our existing provider:
% octodns-sync --config-file=./config/production.yaml --target route53 --doit
And let's make sure all providers are in sync one more time:
% octodns-sync --config-file=./config/production.yaml --doit
Then it's time to check your NS records at all providers.
AWS:
% dig ns example.com @ns-1417.awsdns-49.org.
; <<>> DiG 9.10.6 <<>> ns example.com @ns-1417.awsdns-49.org.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 11881
;; flags: qr aa rd; QUERY: 1, ANSWER: 12, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;example.com. IN NS
;; ANSWER SECTION:
example.com. 21600 IN NS ns-1417.awsdns-49.org.
example.com. 21600 IN NS ns-185.awsdns-23.com.
example.com. 21600 IN NS ns-1964.awsdns-53.co.uk.
example.com. 21600 IN NS ns-915.awsdns-50.net.
example.com. 21600 IN NS ns-cloud-e1.googledomains.com.
example.com. 21600 IN NS ns-cloud-e2.googledomains.com.
example.com. 21600 IN NS ns-cloud-e3.googledomains.com.
example.com. 21600 IN NS ns-cloud-e4.googledomains.com.
example.com. 21600 IN NS ns1-08.azure-dns.com.
example.com. 21600 IN NS ns2-08.azure-dns.net.
example.com. 21600 IN NS ns3-08.azure-dns.org.
example.com. 21600 IN NS ns4-08.azure-dns.info.
;; Query time: 99 msec
;; SERVER: 2600:9000:5305:8900::1#53(2600:9000:5305:8900::1)
;; WHEN: Fri Jun 18 14:35:12 PDT 2021
;; MSG SIZE rcvd: 431
Azure:
% dig ns example.com @ns1-08.azure-dns.com.
; <<>> DiG 9.10.6 <<>> ns example.com @ns1-08.azure-dns.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 50887
;; flags: qr aa rd; QUERY: 1, ANSWER: 12, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 1232
;; QUESTION SECTION:
;example.com. IN NS
;; ANSWER SECTION:
example.com. 21600 IN NS ns1-08.azure-dns.com.
example.com. 21600 IN NS ns2-08.azure-dns.net.
example.com. 21600 IN NS ns3-08.azure-dns.org.
example.com. 21600 IN NS ns4-08.azure-dns.info.
example.com. 21600 IN NS ns-915.awsdns-50.net.
example.com. 21600 IN NS ns-1964.awsdns-53.co.uk.
example.com. 21600 IN NS ns-1417.awsdns-49.org.
example.com. 21600 IN NS ns-185.awsdns-23.com.
example.com. 21600 IN NS ns-cloud-e1.googledomains.com.
example.com. 21600 IN NS ns-cloud-e2.googledomains.com.
example.com. 21600 IN NS ns-cloud-e3.googledomains.com.
example.com. 21600 IN NS ns-cloud-e4.googledomains.com.
;; Query time: 144 msec
;; SERVER: 2603:1061::8#53(2603:1061::8)
;; WHEN: Fri Jun 18 14:35:34 PDT 2021
;; MSG SIZE rcvd: 431
GCP:
% dig ns example.com @ns-cloud-e1.googledomains.com.
; <<>> DiG 9.10.6 <<>> ns example.com @ns-cloud-e1.googledomains.com.
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 13742
;; flags: qr aa rd; QUERY: 1, ANSWER: 12, AUTHORITY: 0, ADDITIONAL: 1
;; WARNING: recursion requested but not available
;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;example.com. IN NS
;; ANSWER SECTION:
example.com. 21600 IN NS ns-cloud-e1.googledomains.com.
example.com. 21600 IN NS ns-cloud-e2.googledomains.com.
example.com. 21600 IN NS ns-cloud-e3.googledomains.com.
example.com. 21600 IN NS ns-cloud-e4.googledomains.com.
example.com. 21600 IN NS ns-915.awsdns-50.net.
example.com. 21600 IN NS ns-1964.awsdns-53.co.uk.
example.com. 21600 IN NS ns-1417.awsdns-49.org.
example.com. 21600 IN NS ns-185.awsdns-23.com.
example.com. 21600 IN NS ns1-08.azure-dns.com.
example.com. 21600 IN NS ns2-08.azure-dns.net.
example.com. 21600 IN NS ns3-08.azure-dns.org.
example.com. 21600 IN NS ns4-08.azure-dns.info.
;; Query time: 68 msec
;; SERVER: 2001:4860:4802:32::6e#53(2001:4860:4802:32::6e)
;; WHEN: Fri Jun 18 14:35:57 PDT 2021
;; MSG SIZE rcvd: 431
If everything matches, we're ready to tell the world about our new setup.
If it doesn't, re-check your steps and make sure you used the --doit flag.
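Each provider returns the twelve records in a different order, so "matches" here means the same set, not the same listing. A quick sketch of that comparison, assuming you have saved each dig output as text (the helper names `ns_values` and `out_of_sync` are hypothetical):

```python
def ns_values(dig_output):
    """Collect NS targets from dig output, skipping comment and blank lines."""
    servers = set()
    for line in dig_output.splitlines():
        fields = line.split()
        # Record lines look like: example.com. 21600 IN NS ns1-08.azure-dns.com.
        if len(fields) == 5 and not line.startswith(";") and fields[3].upper() == "NS":
            servers.add(fields[4].lower())
    return servers


def out_of_sync(outputs):
    """Return providers whose NS set differs from the union of all providers."""
    per_provider = {name: ns_values(text) for name, text in outputs.items()}
    expected = set().union(*per_provider.values())
    return sorted(name for name, found in per_provider.items() if found != expected)


# Shortened, hypothetical outputs: the second provider is missing one record.
outputs = {
    "aws": "example.com. 21600 IN NS ns-915.awsdns-50.net.\n"
           "example.com. 21600 IN NS ns1-08.azure-dns.com.\n",
    "azure": ";; ANSWER SECTION:\n"
             "example.com. 21600 IN NS ns1-08.azure-dns.com.\n",
}
print(out_of_sync(outputs))
```

An empty list means every provider serves the same NS set; any name that shows up is the one to sync again.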
One Final Step
Go to your domain registrar and update your domain's name servers to point to all of the name servers above.
After a couple of hours you can use What's My DNS to check the NS record type of your domain to see how far the change has propagated. It shouldn't take more than a couple of days.
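Once the registrar has picked up the change, the whois output from the start of this post can be checked the same way. This is a minimal sketch that assumes the `Name Server:` line format shown there; `whois_name_servers` and `missing_from_registrar` are names invented for illustration, and other registries may format their whois output differently.

```python
def normalize(host):
    """Lower-case a host name and give it exactly one trailing dot."""
    return host.strip().lower().rstrip(".") + "."


def whois_name_servers(whois_text):
    """Collect the hosts listed on 'Name Server:' lines of whois output."""
    servers = set()
    for line in whois_text.splitlines():
        label, sep, value = line.partition(":")
        if sep and label.strip().lower() == "name server" and value.strip():
            servers.add(normalize(value))
    return servers


def missing_from_registrar(whois_text, expected):
    """Return expected name servers that the whois output does not list."""
    listed = whois_name_servers(whois_text)
    return sorted(normalize(host) for host in expected if normalize(host) not in listed)


# Shortened sample in the format shown at the top of the post.
sample = (
    "Name Server: NS-1417.AWSDNS-49.ORG\n"
    "Name Server: NS-CLOUD-E1.GOOGLEDOMAINS.COM\n"
)
expected = ["ns-1417.awsdns-49.org.", "ns-cloud-e1.googledomains.com.", "ns1-08.azure-dns.com."]
print(missing_from_registrar(sample, expected))
```

Anything it prints is a name server you configured at a provider but have not yet listed at the registrar.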
And that's it. A triple homed, version controlled, easily auditable DNS setup. Woo.
Does it work?
That's all great in theory, you say, but does this setup protect us against a DNS failure at one of the providers in practice?
Luckily, we run our own custom built global monitoring system to be able to answer such questions. Every 10 seconds we collect a bunch of metrics from 26 or so monitoring agents around the world.
During Azure DNS's unavailability between 2021-04-01 21:21 UTC and 22:00 UTC, we saw 2 failures:
Alert alert.noflow.cdn has triggered!
endpoint=www.example.com,loc=amsterdam
metric time.ttfb was absent for 60 seconds recorded at Thu, Apr 1 2021 at 21:39:22 UTC
endpoint=www.example.com,loc=mumbai
metric time.ttfb was absent for 60 seconds recorded at Thu, Apr 1 2021 at 21:39:22 UTC
Alert alert.noflow.cdn has cleared at Thu, Apr 1 2021 at 21:42:17 UTC
For a total of 3 minutes.
And it turned out that those monitoring agents had an older /etc/resolv.conf configuration which we fixed up.
I think it's safe to say that it works.