DNS Troubleshooting (the practical guide)
DNS outages waste time because people test the wrong layer. This guide starts with the fastest checks, then walks through the failures people most often search for: DNS_PROBE_FINISHED_NXDOMAIN, SERVFAIL, “DNS server not responding”, and “slow DNS”.
Rule 1: Prove whether this is DNS or reachability before changing records.
Triage in 90 seconds
You want one outcome: determine if the failure is name resolution, routing, or the service itself. Run these in order:
1) Can you reach the site by IP?
- If the service has a known IP, try it directly (or test a known good endpoint).
- If IP works but name doesn’t: this is probably DNS or resolver policy.
If you don’t know the IP, start with the DNS Lookup tool and compare the A/AAAA answers against what you expect.
2) Is it just you?
- Test from a second network (phone hotspot) or a second device.
- If multiple networks fail, suspect authoritative DNS or the service.
- If only one ISP fails, suspect resolver filtering, peering, or an ISP incident.
Use the ISP Outage Detector if you suspect the ISP path is the issue.
3) Is DNS slow or failing?
- Slow page loads with normal pings can be slow DNS.
- An NXDOMAIN or SERVFAIL response points to a specific resolution failure pattern (below).
Measure end-to-end latency and jitter separately from DNS.
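The three checks above can be combined into a small script. This is a minimal sketch using only the Python standard library; the function names and the returned labels are illustrative, and it assumes the service accepts TCP connections on port 443.

```python
import socket

def name_resolves(hostname: str) -> bool:
    """True if the system resolver returns at least one address."""
    try:
        return bool(socket.getaddrinfo(hostname, None))
    except socket.gaierror:
        return False

def ip_reachable(ip: str, port: int = 443, timeout: float = 3.0) -> bool:
    """True if a TCP connection to ip:port succeeds within the timeout."""
    try:
        with socket.create_connection((ip, port), timeout=timeout):
            return True
    except OSError:
        return False

def triage(resolves: bool, reachable_by_ip: bool) -> str:
    """Map the two observations to the layer worth inspecting first."""
    if reachable_by_ip and not resolves:
        return "dns"                  # IP works, name doesn't
    if resolves and not reachable_by_ip:
        return "routing-or-service"   # name is fine, path or service is not
    if not resolves and not reachable_by_ip:
        return "unclear"              # repeat from a second network
    return "ok"
```

Usage: `triage(name_resolves("www.example.com"), ip_reachable("203.0.113.10"))`, where the hostname and the known-good IP are placeholders for your own service. Run it from a second network before drawing conclusions.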
Match the symptom to the failure pattern
Most DNS incidents collapse into a small set of patterns. Identify the pattern first, then you’ll know which layer to inspect.
| Symptom users search | What it usually means | Where to look first |
|---|---|---|
| DNS_PROBE_FINISHED_NXDOMAIN | Name does not exist (or your resolver is returning NXDOMAIN due to policy) | Authoritative records, zone delegation, typos, expired domain |
| SERVFAIL | Resolver couldn’t complete the query (DNSSEC issues, lame delegation, upstream timeout) | DNSSEC/DS mismatch, authoritative reachability, broken NS |
| DNS server not responding | Client can’t reach a resolver, or resolver can’t reach authoritative | Local network, firewall, ISP resolver outage, UDP/53 blocked |
| It resolves on Wi-Fi but not on VPN | Split-horizon DNS, internal zones, VPN DNS override | VPN DNS settings, internal resolvers, search domains |
| Slow DNS | Resolver latency, cache misses, unreachable/slow authoritative | Resolver selection, EDNS Client Subnet (ECS), authoritative latency |
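The first two rows of the table correspond to DNS response codes: NXDOMAIN is RCODE 3 and SERVFAIL is RCODE 2, carried in the low four bits of the flags field in the DNS header. The sketch below builds a plain A-record query by hand and decodes the RCODE from a reply, using only the standard library. The function names are illustrative, the resolver is assumed to be an IPv4 address, and the reply is not checked against the transaction ID, so treat it as a diagnostic aid rather than a resolver.

```python
import socket
import struct

RCODES = {0: "NOERROR", 2: "SERVFAIL", 3: "NXDOMAIN", 5: "REFUSED"}

def build_query(name: str, qtype: int = 1, txid: int = 0x1234) -> bytes:
    """Build a single-question DNS query (qtype 1 = A) with recursion desired."""
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(p)]) + p.encode("ascii") for p in labels) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # class 1 = IN

def rcode_of(response: bytes) -> str:
    """Extract the response code from the low 4 bits of the header flags."""
    if len(response) < 12:
        raise ValueError("truncated DNS header")
    flags = struct.unpack("!H", response[2:4])[0]
    code = flags & 0x000F
    return RCODES.get(code, f"RCODE{code}")

def query_rcode(name: str, resolver: str, timeout: float = 3.0) -> str:
    """Send the query over UDP/53 and return the decoded RCODE."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
        s.settimeout(timeout)
        s.sendto(build_query(name), (resolver, 53))
        data, _ = s.recvfrom(512)
        return rcode_of(data)
```

A timeout from `query_rcode` (raised as `socket.timeout`) matches the “DNS server not responding” row rather than either RCODE.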
Identify the layer: client → resolver → authoritative
DNS is a pipeline. Don’t debug it backwards. Use the checks below to narrow the fault domain.
Client layer (your machine / browser)
- Flush local cache (OS + browser) when changes “don’t propagate”.
- Check whether a broken IPv6 path is breaking apps (an AAAA record exists, but IPv6 routing is poor or absent).
- VPNs often rewrite resolvers and search domains, so verify which resolver you’re actually using.
If only one device fails, it’s probably client cache, local firewall, or a resolver override.
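To verify which resolver a Unix-like client is configured to use, you can read `/etc/resolv.conf`. The parser below is a rough sketch that handles only the `nameserver` and `search` directives; the function name is illustrative. On hosts running a local stub resolver the file may list a loopback address instead of the real upstream, and Windows and macOS keep this configuration elsewhere.

```python
def parse_resolv_conf(text: str) -> dict:
    """Return the nameservers and search domains found in resolv.conf text."""
    config = {"nameservers": [], "search": []}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#;":   # blank line or comment
            continue
        parts = line.split()
        if parts[0] == "nameserver" and len(parts) > 1:
            config["nameservers"].append(parts[1])
        elif parts[0] == "search":
            config["search"] = parts[1:]  # the last search line wins
    return config
```

Usage: `parse_resolv_conf(open("/etc/resolv.conf").read())`. Compare the output with and without the VPN connected to see whether the VPN replaced your resolver or search domains.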
Resolver layer (recursive DNS)
- Compare results from multiple resolvers (ISP vs public).
- If one resolver returns NXDOMAIN but others resolve: suspect policy filtering or stale cache.
- If many resolvers SERVFAIL: suspect authoritative breakage or DNSSEC issues.
This is where “it works for me” conflicts usually come from.
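Once you have collected a response code per resolver (from `dig`, a lookup tool, or your own script), the resolver-layer rules above reduce to a few comparisons. This is a hedged sketch: the function name, the resolver labels, and the returned strings are all illustrative, and “many resolvers” is interpreted here as a strict majority.

```python
from collections import Counter

def compare_resolvers(rcodes: dict) -> str:
    """Classify a {resolver_label: rcode} mapping into a likely cause."""
    if not rcodes:
        raise ValueError("need at least one resolver result")
    counts = Counter(rcodes.values())
    total = len(rcodes)
    if counts["NOERROR"] == total:
        return "all-resolve"
    if counts["NXDOMAIN"] == total:
        return "name-missing-everywhere"   # check the authoritative records
    if counts["NXDOMAIN"] and counts["NOERROR"]:
        return "policy-or-stale-cache"     # some resolvers disagree
    if counts["SERVFAIL"] * 2 > total:
        return "authoritative-or-dnssec"   # most resolvers cannot complete
    return "mixed"
```

Usage: `compare_resolvers({"isp": "NXDOMAIN", "public-a": "NOERROR", "public-b": "NOERROR"})` returns `"policy-or-stale-cache"`, which points at the ISP resolver rather than the zone.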
Authoritative layer (zone owner / DNS host)
- Validate NS delegation: parent zone points to correct authoritative name servers.
- Confirm A/AAAA/CNAME chains aren’t broken or looping.
- Watch for DNSSEC mismatch (DS record at registrar doesn’t match zone signing).
If the zone itself is broken, every downstream resolver fails too; they just fail in different ways.
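Broken or looping CNAME chains are easy to check mechanically once you have the records in hand. The sketch below walks a chain over two plain dictionaries that you would fill from your zone export or lookup results; the function name, the data layout, and the hop limit of 8 are assumptions made for illustration.

```python
def follow_cname(name: str, cnames: dict, addresses: set, max_hops: int = 8):
    """Follow a CNAME chain; return (status, path).

    cnames maps an alias to its target; addresses is the set of names that
    have A/AAAA records. Names are compared lowercase, without a trailing dot.
    """
    path = []
    current = name.lower().rstrip(".")
    while True:
        if current in path:
            return "loop", path + [current]
        path.append(current)
        if current in addresses:
            return "ok", path
        if current not in cnames:
            return "broken", path      # chain ends at a name with no records
        if len(path) > max_hops:
            return "too-long", path
        current = cnames[current].lower().rstrip(".")
```

A result of `"broken"` names the last hop in `path` as the record to create or fix; `"loop"` shows the cycle explicitly.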
Fixes that actually work
These are the highest-signal remediations, ordered by how often they solve real incidents.
1) Verify the zone exists and delegation is sane
- Confirm the domain is not expired and is using the intended authoritative DNS provider.
- Check NS records at the registrar/parent zone match the active DNS host.
- Confirm authoritative name servers are reachable and answering publicly.
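Checking that the parent's NS set matches the active DNS host is a set comparison once names are normalized. A minimal sketch, assuming you have already fetched both NS lists (for example from the registrar panel and from the zone itself); the function names are illustrative.

```python
def normalize(names) -> set:
    """Lowercase each name and drop the trailing dot so comparisons are fair."""
    return {n.lower().rstrip(".") for n in names}

def delegation_diff(parent_ns, zone_ns) -> dict:
    """Report name servers that appear on only one side of the delegation."""
    parent, zone = normalize(parent_ns), normalize(zone_ns)
    return {
        "only_in_parent": sorted(parent - zone),  # delegated but not in the zone
        "only_in_zone": sorted(zone - parent),    # in the zone but not delegated
    }
```

Two empty lists mean the delegation is consistent. Anything in `only_in_parent` is a candidate for lame delegation, since resolvers are being sent to a server the zone does not list.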
2) Fix NXDOMAIN the right way
- If you recently created the zone/records: ensure you published the record in the correct zone.
- If using CNAME: ensure the target exists and doesn’t CNAME to itself.
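- If only some resolvers show NXDOMAIN: they have probably cached the negative answer. Wait for the negative-cache TTL (set by the zone’s SOA record) to expire, or flush the cache on resolvers you control. Changing the record does not evict answers that other resolvers have already cached.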
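For NXDOMAIN answers, the TTL you are waiting on is the negative-caching TTL. Per RFC 2308 it is the lesser of the SOA record's own TTL and the SOA MINIMUM field. The arithmetic is small enough to sketch; the function names are illustrative, and individual resolvers may cap the value lower than this.

```python
def negative_cache_ttl(soa_ttl: int, soa_minimum: int) -> int:
    """Seconds a resolver may cache NXDOMAIN, per RFC 2308."""
    return min(soa_ttl, soa_minimum)

def seconds_until_expiry(cached_at: float, now: float, ttl: int) -> float:
    """Remaining cache lifetime, given when the answer was cached."""
    return max(0.0, cached_at + ttl - now)
```

For example, a zone whose SOA has a TTL of 3600 and a MINIMUM of 300 lets resolvers cache a negative answer for at most 300 seconds, so a record created after a failed lookup should become visible within about five minutes on a resolver that honors these values.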
3) Treat SERVFAIL as a “resolver couldn’t complete or validate the query” clue
- DNSSEC is a common cause: DS mismatch or expired signatures can trigger SERVFAIL.
- Broken authoritative servers also produce SERVFAIL (timeouts, refused, lame delegation).
- Test from a different network/ISP to differentiate local resolver issues vs global authoritative issues.
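One common way to separate DNSSEC failures from other SERVFAIL causes is to repeat the query with the Checking Disabled (CD) bit set, which asks a validating resolver to skip validation (the same idea as `dig +cd`). The sketch below shows the header flags and the interpretation only; the function names and return labels are illustrative, and it assumes the resolver you query actually validates.

```python
import struct

RD = 0x0100  # Recursion Desired
CD = 0x0010  # Checking Disabled

def build_header(txid: int, checking_disabled: bool = False) -> bytes:
    """12-byte DNS header for a one-question query, optionally with CD set."""
    flags = RD | (CD if checking_disabled else 0)
    return struct.pack("!HHHHHH", txid, flags, 1, 0, 0, 0)

def servfail_cause(normal_rcode: str, cd_rcode: str) -> str:
    """Compare the rcode of a normal query with the rcode of a CD=1 query."""
    if normal_rcode != "SERVFAIL":
        return "not-servfail"
    if cd_rcode == "NOERROR":
        return "dnssec-validation"       # data exists but fails validation
    return "authoritative-or-upstream"   # fails even without validation
```

If the CD query succeeds, look at the DS record and signature expiry before touching anything else. If both queries fail, the problem is more likely reachability or delegation.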
4) If “DNS server not responding”, prove whether UDP/53 is blocked
- Some networks block outbound UDP/53 to external resolvers; corporate firewalls often do.
- When policy allows, point the client at a known resolver over encrypted DNS (DoH on TCP/443 or DoT on TCP/853), which avoids port 53 entirely.
- Confirm you can reach the resolver IPs (routing, ACLs, captive portals).
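To prove whether UDP/53 specifically is blocked, send the same query over UDP and over TCP to the same resolver and compare. This is a sketch under a few assumptions: the resolver is given as an IPv4 address, it accepts DNS over TCP, and any reply at all counts as “reachable”. The function names and labels are illustrative.

```python
import socket
import struct

def build_query(name: str, txid: int = 0x4242) -> bytes:
    """Minimal A-record query with recursion desired."""
    header = struct.pack("!HHHHHH", txid, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(p)]) + p.encode("ascii") for p in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)

def tcp_frame(message: bytes) -> bytes:
    """DNS over TCP prefixes each message with a 2-byte length."""
    return struct.pack("!H", len(message)) + message

def probe_udp(resolver: str, name: str = "example.com", timeout: float = 3.0) -> bool:
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as s:
            s.settimeout(timeout)
            s.sendto(build_query(name), (resolver, 53))
            data, _ = s.recvfrom(4096)
            return len(data) >= 12       # at least a full DNS header came back
    except OSError:                      # includes timeouts
        return False

def probe_tcp(resolver: str, name: str = "example.com", timeout: float = 3.0) -> bool:
    try:
        with socket.create_connection((resolver, 53), timeout=timeout) as s:
            s.sendall(tcp_frame(build_query(name)))
            return len(s.recv(2)) > 0    # any reply bytes at all
    except OSError:
        return False

def interpret(udp_ok: bool, tcp_ok: bool) -> str:
    if udp_ok:
        return "udp53-ok"
    return "udp53-blocked" if tcp_ok else "resolver-unreachable"
```

Usage: `interpret(probe_udp("192.0.2.53"), probe_tcp("192.0.2.53"))`, with the placeholder address replaced by your resolver. `"resolver-unreachable"` sends you back to routing, ACLs, and captive portals rather than DNS transport.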
5) If DNS is “slow”, separate lookup latency from path latency
- Slow DNS can look like “the internet is slow” even when ping is fine.
- Use a latency monitor for RTT/jitter and compare with DNS timings.
- Authoritative DNS in a distant region + cache misses can create consistent delays.
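Separating lookup latency from path latency means timing the resolution and the TCP connect as two distinct steps. A minimal sketch with the standard library; the function names are illustrative, port 443 is an assumption, and the lookup goes through the system resolver, so a warm local cache will make the DNS number look better than a true cache miss.

```python
import socket
import time

def timed(fn, *args):
    """Call fn(*args) and return (result, elapsed milliseconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, (time.perf_counter() - start) * 1000.0

def lookup_vs_connect(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Time name resolution and TCP connect separately for one host."""
    infos, dns_ms = timed(socket.getaddrinfo, host, port, 0, socket.SOCK_STREAM)
    family, socktype, proto, _, sockaddr = infos[0]

    def connect():
        # Connect to the already-resolved address so no lookup is included.
        with socket.socket(family, socktype, proto) as s:
            s.settimeout(timeout)
            s.connect(sockaddr)

    _, connect_ms = timed(connect)
    return {"address": sockaddr[0], "dns_ms": dns_ms, "connect_ms": connect_ms}
```

If `dns_ms` is consistently much larger than `connect_ms` on first lookups, the delay is in resolution; if it is the other way round, look at the path instead.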
Tools that help (use them as instruments)
Tools don’t fix DNS. They shorten your time-to-truth by proving which layer is broken.
- DNS Lookup — Verify A/AAAA/CNAME/MX/NS/SOA/TXT/PTR quickly; compare expected vs actual.
- Ping Monitor — Validate baseline RTT/jitter/loss so you don’t misdiagnose path issues as DNS slowness.
- Port Scanner — If you suspect DNS transport blocking or service reachability, confirm ports (53/443/etc.) are reachable where appropriate.
- Traceroute Map — When resolution is fine but access isn’t, map the path and identify where routing breaks.
- ISP Outage Detector — If only one ISP or region fails, validate upstream incidents before changing DNS.
- ISP Support Insights — Pull ASN/org/contacts/peering context so escalations go to the right owner.
FAQ
Why does it work on my phone but not my home Wi-Fi?
Different resolvers and different paths. Your phone may use a carrier resolver (or DoH in the browser), while your home network uses your ISP resolver. Compare DNS answers and then test reachability.
Why do I see different A records?
CDNs return different IPs by location and sometimes by resolver. That’s normal. The question is whether the returned IP is reachable and serving the correct certificate/service.
How long does DNS propagation take?
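“Propagation” is mostly cache expiration. A resolver that cached the old answer can keep serving it until that answer’s TTL expires. Lowering the TTL helps only if you do it before the change, at least one old TTL in advance; lowering it afterwards does nothing for answers that are already cached.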
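The “before, not after” point can be worked through as arithmetic. The sketch below estimates the longest time a resolver might keep serving the old answer after a change, assuming resolvers honor TTLs exactly (some clamp them); the function and parameter names are illustrative.

```python
def worst_case_stale_seconds(old_ttl: int, new_ttl: int, lowered_before: int) -> int:
    """Longest time after a change that a cache may still hold the old answer.

    lowered_before is how many seconds before the change the TTL was lowered
    from old_ttl to new_ttl (0 if it was not lowered in advance).
    """
    # A resolver that cached just before the TTL was lowered still holds the
    # old TTL; at change time it has old_ttl - lowered_before seconds left.
    leftover_old = max(0, old_ttl - lowered_before)
    # A resolver that cached just before the change holds the new TTL in full.
    return max(new_ttl, leftover_old)
```

With an old TTL of 86400 and a new TTL of 300, lowering a full day ahead gives a worst case of 300 seconds. Lowering only an hour ahead leaves some caches holding the old answer for up to 82800 seconds.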
Should I change DNS providers during an incident?
Only if you’ve proven authoritative DNS is the fault domain. If the issue is resolver policy, peering, or the service itself, swapping providers adds risk and usually doesn’t fix the outage.