Question 1

What's the order I should troubleshoot network issues in?

Accepted Answer

Work bottom-up: physical link → switching/ARP → routing → transport/firewall → DNS → application. Each layer depends on the one below it, so a fault low in the stack shows up as symptoms at every layer above. It is easy to lose hours debugging the application when the real problem is a flapping interface or a full conntrack table.
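As a rough sketch of the bottom-up order, the function below runs a list of checks in sequence and stops at the first failure. The function name, the `label<TAB>command` input format, and the example interface, gateway and hostname in the comments are all illustrative assumptions, not part of any standard tool. It passes each command to `sh -c`, so only feed it checks you wrote yourself.

```shell
#!/bin/sh
# Run checks in the order given (lowest layer first) and stop at the
# first one that fails. Input on stdin: one check per line, "label<TAB>command".
first_failing_layer() {
    tab=$(printf '\t')
    while IFS="$tab" read -r label cmd; do
        [ -n "$label" ] || continue          # skip blank lines
        # stdin is redirected so a check cannot swallow the remaining lines
        if sh -c "$cmd" </dev/null >/dev/null 2>&1; then
            printf 'ok    %s\n' "$label"
        else
            printf 'FAIL  %s\n' "$label"
            return 1
        fi
    done
    return 0
}

# Example input (hypothetical interface, gateway and hostname):
#   link<TAB>ip link show eth0 | grep -q LOWER_UP
#   gateway<TAB>ping -c 1 -W 2 192.0.2.1
#   dns<TAB>getent hosts example.com
#   app<TAB>nc -z -w 3 example.com 443
```

Everything above the first `FAIL` line can be treated as working, which keeps attention on the lowest broken layer.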

Question 2

Ping works but my app can't connect — what's wrong?

Accepted Answer

A successful ping only proves that ICMP echo gets through; firewalls filter TCP per port, so the port your app uses can still be blocked. Test the actual port with `nc -zv host port` or `curl -v`. Common causes: a stateful firewall blocking the port, the application listening on the wrong interface (for example only on 127.0.0.1), or SELinux/AppArmor denying the bind.
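To check the "not listening on the right interface" case on the server itself, one option is to read the output of `ss -ltn` and see which address the port is bound to. The helper below is a minimal sketch: `listen_scope` is a made-up name, and it assumes the usual `ss -ltn` column layout where the fourth field is `address:port`.

```shell
#!/bin/sh
# Read `ss -ltn` output on stdin and report where a TCP port is listening.
# Usage: ss -ltn | listen_scope 5432
listen_scope() {
    port=$1
    awk -v port="$port" '
        $1 == "LISTEN" {
            addr = $4
            n = split(addr, parts, ":")          # the port is the last ":" field
            if (parts[n] != port) next
            host = substr(addr, 1, length(addr) - length(parts[n]) - 1)
            found = 1
            # anything other than 127.x.x.x or [::1] is reachable from outside
            if (host !~ /^127\./ && host != "[::1]") external = 1
            print "listening on " host
        }
        END {
            if (!found)         print "nothing listening on port " port
            else if (!external) print "loopback only: remote clients cannot connect"
        }
    '
}
```

A "loopback only" result means the fix is in the application's bind address, not in the network.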

Question 3

How do I find packet loss between me and a remote host?

Accepted Answer

Use `mtr -rwzbc 100 <host>`: it sends 100 probes to each hop and reports per-hop loss and latency. Look for the first hop where loss appears and then persists on every later hop through to the destination; the fault is at that hop or on the link leading into it. Loss at a single intermediate hop, with no loss at the hops after it, is usually ICMP rate-limiting on that router, not real loss.
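The reading rule above can be sketched as a small filter over a saved mtr report. This is an illustration under assumptions: `mtr_verdict` is a hypothetical name, and the parser relies on hop lines starting with a hop number followed by a dot and on the Loss% column being the only field that ends in `%`.

```shell
#!/bin/sh
# Read an mtr report on stdin and decide whether loss reaches the destination.
# Exit status: 0 = destination clean, 1 = loss at destination, 2 = unparsable.
mtr_verdict() {
    awk '
        $1 ~ /^[0-9]+\./ {                  # hop lines start with "1." or "1.|--"
            for (i = 2; i <= NF; i++)
                if ($i ~ /^[0-9.]+%$/) {    # Loss% is the only field ending in %
                    hops++
                    num[hops]  = $1 + 0     # "3.|--" converts to 3
                    loss[hops] = $i + 0     # "12.0%" converts to 12
                    break
                }
        }
        END {
            if (hops == 0) { print "no hop lines found"; exit 2 }
            if (loss[hops] == 0) {
                print "no loss at destination: intermediate loss is likely ICMP rate-limiting"
                exit 0
            }
            # walk back from the destination while every hop still shows loss
            first = hops
            for (h = hops - 1; h >= 1 && loss[h] > 0; h--) first = h
            print "loss reaches destination, starting at hop " num[first]
            exit 1
        }
    '
}
```

For example, `mtr -rwzbc 100 <host> | mtr_verdict`. Hops that never answer show 100% loss and will be counted as lossy, so treat the reported starting hop as a hint and confirm it against the raw report.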

Question 4

What does 'no route to host' mean?

Accepted Answer

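Despite the wording, this error (EHOSTUNREACH) usually means the kernel found a route but could not reach the host over it. Typical causes: ARP/neighbor resolution failed for the destination or the gateway on the local segment, or a router or firewall in the path answered with an ICMP host-unreachable or host-prohibited message. A missing route normally produces 'Network is unreachable' instead. Run `ip r get <destination>` to see which route and interface would be used, then `ip neigh` to check whether the next hop resolved. Also confirm that the gateway is up and that your interface is not in a state that disqualifies it (no carrier, no IP).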

Question 5

Why does my connection work for a while then drop?

Accepted Answer

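Three usual suspects: a stateful firewall or NAT idle timeout (often 1 hour for TCP, sometimes only a few minutes), the conntrack table filling up (compare `cat /proc/sys/net/netfilter/nf_conntrack_count` with `nf_conntrack_max` in the same directory), or TCP keepalive that is disabled or slower than the idle timeout of a NAT in the path. The Linux default waits 7200 seconds before the first keepalive probe, so enable keepalive in the app and set its interval below the shortest idle timeout on the path.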
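For the conntrack case, the raw count only means something relative to the table limit. Here is a minimal sketch that reads both values and reports the fill level; the function name and the 80% warning threshold are assumptions chosen for illustration, and the optional directory argument exists only so the function can be pointed at test files.

```shell
#!/bin/sh
# Report conntrack table usage. Exit status: 0 = below threshold,
# 1 = at or above threshold, 2 = values could not be read.
conntrack_usage() {
    dir=${1:-/proc/sys/net/netfilter}
    count=$(cat "$dir/nf_conntrack_count") || return 2
    max=$(cat "$dir/nf_conntrack_max") || return 2
    [ "$max" -gt 0 ] || return 2
    pct=$((count * 100 / max))
    echo "conntrack: $count / $max (${pct}%)"
    [ "$pct" -lt 80 ]        # 80% is an arbitrary warning level, tune to taste
}
```

On a host where the conntrack module is not loaded the files do not exist and the function returns 2, which itself rules out a full conntrack table on that machine.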

Question 6

What's the fastest way to know if it's DNS?

Accepted Answer

Ping the destination by IP. If the IP works and the name doesn't, it's DNS. Then run `dig name` against your configured resolver and `dig @1.1.1.1 name` against a public one to isolate local vs upstream. `dig +trace name` walks the delegation chain when something deeper is broken.

Layer	Question	Commands / Notes
L1 — link	Cable, optic, port up?	`ethtool eth0`, switch port counters, optic dB levels. No link = stop here.
L2 — switching	ARP / MAC learned?	`ip neigh`, `show mac address-table`. Wrong VLAN and duplicate MACs live here.
L3 — routing	Route to destination?	`ip r get <dst>`, traceroute. Asymmetric routing is sneaky — check return path too.
L4 — transport	Port reachable?	`nc -zv host port`, `ss -tnp`. Firewalls and conntrack fills happen here.
L7 — application	Service actually answering?	`curl -v`, app logs, DNS resolution, TLS handshake. Most 'network outages' end up being L7.

Command	Rules out	How to read it
ping -c 4 8.8.8.8	Internet reachable?	If yes, L1-L3 to the internet is fine. Move to DNS / app layer.
ping -c 4 google.com	DNS working?	IP ping works but name doesn't → resolver/DNS issue. Check /etc/resolv.conf and `dig`.
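mtr -rwzbc 50 <dst>	Where's the loss?	Look for loss that starts at a hop and continues to the destination; that hop is the culprit. Loss at one hop only is usually ICMP rate-limiting, ignore it.
nc -zv <host> <port>	Port open from here?	Connection refused = host reachable but nothing listening (or a firewall REJECT). Timeout = packets dropped, usually a firewall, or the host is down. No route to host = ARP failure or ICMP unreachable from the path.
curl -v https://host	Full HTTP transaction	Shows each phase in order: DNS, connect, TLS, request/response. Add `-w '@curl-format.txt'` (a format file you write yourself) for per-phase timing.
dig +trace example.com	DNS delegation chain	Walks root → TLD → authoritative. Shows exactly where resolution breaks.
ss -s	Socket summary	Socket counts by state; a TIME_WAIT explosion shows up here in seconds. For a full conntrack table, compare `nf_conntrack_count` with `nf_conntrack_max` instead.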
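The `nc` readings in the table can be turned into a small lookup. The sketch below maps the error text to a likely cause; the function name is hypothetical, and the matched strings are the common Linux error messages, which differ between `nc` implementations, so treat the patterns as assumptions to adjust.

```shell
#!/bin/sh
# Map the error text of a failed TCP connect to a likely cause.
# Usage: classify_connect_error "$(nc -zv -w 3 <host> <port> 2>&1)"
classify_connect_error() {
    case $1 in
        *"Connection refused"*)
            echo "refused: host reachable, nothing listening or a firewall REJECT" ;;
        *"timed out"*)
            echo "timeout: packets dropped by a firewall, or host down" ;;
        *"No route to host"*)
            echo "host unreachable: ARP failure or ICMP unreachable from the path" ;;
        *"Network is unreachable"*)
            echo "no route: routing table has no match for the destination" ;;
        *)
            echo "unclassified: read the raw message" ;;
    esac
}
```

A refusal is generally the more encouraging result: it shows a reply came back from the far end, so routing in both directions works.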

Symptom	Usual cause	What to do
Intermittent timeouts	Packet loss or session limits	Run `mtr` for 5+ minutes. Check conntrack table size and NIC error counters.
Slow but works	MTU, congestion or DNS	Try `ping -M do -s 1472 <dst>` (1472 bytes of payload + 28 bytes of headers = 1500). A 'message too long' or 'Frag needed' error = MTU issue. Check the DNS query time with `dig +stats`.
DNS resolves wrong IP	Stale cache or split-horizon	`resolvectl flush-caches` (or `systemd-resolve --flush-caches` on older systemd). Compare `dig @1.1.1.1 name` vs `dig @<local-resolver> name`.
TLS handshake fails	Cert, SNI or TLS version	`openssl s_client -connect host:443 -servername host`. Check expiry, chain and accepted protocols.
Connection reset by peer	Firewall or app crash	RST mid-flow = something killed the session. Check stateful firewall idle timeouts and app-side OOM.
Asymmetric / one-way traffic	Routing + stateful FW	Reply path takes a different firewall that has no session state → drop. Use `ip r get` from both ends.
Works locally, fails remotely	MTU, path MTU discovery	ICMP fragmentation-needed is blocked somewhere, so path MTU discovery fails silently for large packets. Clamp MSS at the edge or allow ICMP type 3 code 4 through the firewalls.
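The 1472 in the MTU row comes from subtracting header sizes from the link MTU. A short sketch of that arithmetic for IPv4, assuming a 20-byte IP header with no options; the helper names are made up for illustration.

```shell
#!/bin/sh
# Largest ICMP echo payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header minus 8 bytes of ICMP header.
icmp_payload() { echo $(( $1 - 20 - 8 )); }

# Matching TCP MSS: MTU minus 20 bytes of IPv4 header minus 20 bytes of
# TCP header (TCP options, if any, reduce the usable payload further).
tcp_mss() { echo $(( $1 - 20 - 20 )); }

# icmp_payload 1500   -> 1472
# tcp_mss 1500        -> 1460
```

To probe a suspected smaller path MTU, for example behind a tunnel, substitute the result: `ping -M do -s "$(icmp_payload 1400)" <dst>`.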

Network Troubleshooting Commands

Triage by layer (bottom-up)

First-60-seconds triage

Symptom → root cause

FAQ
