Network Troubleshooting Commands
A working triage playbook: what to check first, what each command actually tells you, and how common symptoms map to root causes.
Triage by layer (bottom-up)
Don't debug L7 until L1-L4 are clean. Saves hours.
| Layer | Question | Commands / Notes |
|---|---|---|
| L1 — link | Cable, optic, port up? | `ethtool eth0`, switch port counters, optic dB levels. No link = stop here. |
| L2 — switching | ARP / MAC learned? | `ip neigh`, `show mac address-table`. Wrong VLAN and duplicate MACs live here. |
| L3 — routing | Route to destination? | `ip r get <dst>`, traceroute. Asymmetric routing is sneaky — check return path too. |
| L4 — transport | Port reachable? | `nc -zv host port`, `ss -tnp`. Firewalls and conntrack fills happen here. |
| L7 — application | Service actually answering? | `curl -v`, app logs, DNS resolution, TLS handshake. Most 'network outages' end up being L7. |
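The bottom-up order in the table can be sketched as a small shell helper that runs one check per layer and stops at the first failure. This is a minimal sketch, not a complete playbook: the `triage` function name, the `label=command` argument format, the interface `eth0`, the address `192.0.2.10` and the URL are placeholders chosen for illustration, and the L2 check is left out because it depends on your switch.

```shell
#!/bin/sh
# triage: run "label=command" checks in order and print the label of the
# first one that fails, or "clean" if every check passes.
# The function name and argument format are illustrative assumptions.
triage() {
  for check in "$@"; do
    label=${check%%=*}   # text before the first '='
    cmd=${check#*=}      # text after the first '='
    if ! sh -c "$cmd" >/dev/null 2>&1; then
      echo "$label"
      return 1
    fi
  done
  echo "clean"
}

# Example invocation (placeholder interface, address and URL):
# triage \
#   "L1=ethtool eth0 | grep -q 'Link detected: yes'" \
#   "L3=ip route get 192.0.2.10" \
#   "L4=nc -z -w 3 192.0.2.10 443" \
#   "L7=curl -fsS -o /dev/null https://example.com/"
```

Because the loop returns at the first failing label, a failure at L1 never wastes time on the L7 check.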
First-60-seconds triage
Run these in order. Each one rules out a whole class of problem.
| Command | Question it answers | How to read it |
|---|---|---|
| `ping -c 4 8.8.8.8` | Internet reachable? | If yes, L1-L3 to the internet is fine. Move to DNS / app layer. |
| `ping -c 4 google.com` | DNS working? | If the IP ping works but the name doesn't, it's a resolver/DNS issue. Check `/etc/resolv.conf` and `dig`. |
| `mtr -rwzbc 50 <dst>` | Where's the loss? | Look for loss that starts at one hop and persists through every later hop; that hop is the culprit. Loss at a single hop that clears up afterwards is usually ICMP rate-limiting, so ignore it. |
| `nc -zv <host> <port>` | Port open from here? | Connection refused = host reachable but nothing listening on that port (or a firewall sending resets). Timeout = packets silently dropped, usually by a firewall. No route = routing issue. |
| `curl -v https://host` | Full HTTP transaction | Shows every phase: DNS, connect, TLS, server response. Add `-w '@curl-format.txt'` for per-phase timing. |
| `dig +trace example.com` | DNS delegation chain | Walks root → TLD → authoritative. Shows exactly where resolution breaks. |
| `ss -s` | Socket summary | A TIME_WAIT explosion or socket exhaustion shows up here in seconds. For conntrack, compare `/proc/sys/net/netfilter/nf_conntrack_count` with `nf_conntrack_max`. |
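The `curl-format.txt` file named in the table does not ship with curl; you write it yourself. A minimal sketch, using curl's documented `--write-out` variables (the labels and layout are arbitrary choices, not a standard):

```shell
#!/bin/sh
# Create a write-out template for curl. The quoted 'EOF' keeps the
# backslashes and %{...} variables literal so curl can interpret them.
cat > curl-format.txt <<'EOF'
   dns lookup: %{time_namelookup}s\n
  tcp connect: %{time_connect}s\n
tls handshake: %{time_appconnect}s\n
   first byte: %{time_starttransfer}s\n
        total: %{time_total}s\n
EOF

# Example use (replace the URL with the service under test):
# curl -s -o /dev/null -w '@curl-format.txt' https://example.com/
```

Each value is cumulative from the start of the request, so the TLS cost is roughly `time_appconnect` minus `time_connect`, and server think-time is roughly `time_starttransfer` minus `time_appconnect`.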
Symptom → root cause
| Symptom | Usual cause | What to do |
|---|---|---|
| Intermittent timeouts | Packet loss or session limits | Run `mtr` for 5+ minutes. Check conntrack table size and NIC error counters. |
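| Slow but works | MTU, congestion or DNS | Try `ping -M do -s 1472 <dst>` (1472 bytes of payload + 28 bytes of ICMP/IP headers = 1500). An error such as 'message too long' or 'Frag needed' means the path MTU is below 1500. Check DNS latency with the query time in `dig +stats`. |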
| DNS resolves wrong IP | Stale cache or split-horizon | `resolvectl flush-caches` (or `systemd-resolve --flush-caches` on older systemd). Compare `dig @1.1.1.1` with `dig @<local-resolver>`. |
| TLS handshake fails | Cert, SNI or TLS version | `openssl s_client -connect host:443 -servername host`. Check expiry, chain and accepted protocols. |
| Connection reset by peer | Firewall or app crash | RST mid-flow = something killed the session. Check stateful firewall idle timeouts and app-side OOM. |
| Asymmetric / one-way traffic | Routing + stateful FW | Reply path takes a different firewall that has no session state → drop. Use `ip r get` from both ends. |
| Works locally, fails remotely | MTU, path MTU discovery | ICMP fragmentation-needed (type 3, code 4) is blocked somewhere, so PMTUD silently fails. Clamp the TCP MSS or allow that ICMP through. |
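Two rows above rest on simple arithmetic: the `-s 1472` ping payload, and how full the conntrack table is. A minimal sketch of both, assuming IPv4 without header options; the function names are hypothetical helpers, not existing tools:

```shell
#!/bin/sh
# Largest ICMP payload that fits in one unfragmented IPv4 packet:
# MTU minus 20 bytes of IPv4 header (no options) minus 8 bytes of ICMP header.
ping_payload_for_mtu() {
  echo $(( $1 - 28 ))
}

# Conntrack table utilisation as a whole-number percentage.
# Arguments: current entry count, table maximum.
conntrack_usage_pct() {
  echo $(( $1 * 100 / $2 ))
}

# Example (Linux with the nf_conntrack module loaded):
# ping -M do -s "$(ping_payload_for_mtu 1500)" <dst>
# conntrack_usage_pct \
#   "$(cat /proc/sys/net/netfilter/nf_conntrack_count)" \
#   "$(cat /proc/sys/net/netfilter/nf_conntrack_max)"
```

If the 1500-byte probe fails, retry with a smaller MTU guess (for example 1400) to bracket the real path MTU.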
FAQ
- What's the order I should troubleshoot network issues in?
- Bottom-up: physical link → switching/ARP → routing → transport/firewall → DNS → application. Each layer depends on the one below. Most engineers waste hours debugging L7 when the real problem is a flapping interface or a full conntrack table.
- Ping works but my app can't connect — what's wrong?
- A successful ping only proves that ICMP gets through. Firewalls filter per protocol and port, so TCP to your service can still be blocked. Use `nc -zv host port` or `curl -v` against the actual port. Common causes: a stateful firewall blocking the port, the application not listening on the right interface, or SELinux/AppArmor denying the bind.
- How do I find packet loss between me and a remote host?
- Use `mtr -rwzbc 100 <host>` — it sends 100 probes and reports per-hop loss + latency. Look for the first hop with consistent loss; that's your problem. Loss at a single intermediate hop only (with no loss after) is usually ICMP rate-limiting on that router, not real loss.
- What does 'no route to host' mean?
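- On Linux it usually means the last step failed: ARP/neighbor resolution for the destination or next hop got no answer, or a router or firewall returned ICMP host-unreachable. (A destination with no matching route at all normally reports 'Network is unreachable' instead.) Run `ip r get <destination>` to see which route and interface would be used, then `ip neigh` to check that the next hop resolves. Usual causes: a missing or wrong route, a gateway that is down, a firewall rejecting the traffic, or an interface in a state that disqualifies it (no carrier, no IP).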
- Why does my connection work for a while then drop?
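- Three usual suspects: stateful firewall idle timeout (often 1 hour for TCP), conntrack table filling up (compare `cat /proc/sys/net/netfilter/nf_conntrack_count` with `nf_conntrack_max` in the same directory), or TCP keepalive disabled while a NAT in the path drops idle sessions. Enable keepalive on the app, or lower its interval below the shortest idle timeout in the path; the Linux default of 7200 seconds before the first probe is longer than many firewall timeouts.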
- What's the fastest way to know if it's DNS?
- Ping the destination by IP. If IP works and name doesn't, it's DNS. Then `dig name` against your resolver and against `1.1.1.1` to isolate local vs upstream. `dig +trace` walks the delegation chain when something deeper is broken.
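The mtr reading rule from the FAQ (the first hop whose loss persists all the way to the destination) can be sketched as an awk filter. This assumes mtr's report output, where each hop line starts with the hop number followed by a dot and the loss column is the first field ending in `%`; `first_lossy_hop` is a hypothetical helper name.

```shell
#!/bin/sh
# Read an mtr report on stdin and print the number of the first hop from
# which every later hop, including the last, shows non-zero loss.
# Prints nothing when the final hop has no loss.
first_lossy_hop() {
  awk '
    /^[[:space:]]*[0-9]+\./ {
      loss = -1
      # The loss column is the first field that looks like "12.0%".
      for (i = 1; i <= NF; i++)
        if ($i ~ /^[0-9.]+%$/) { loss = $i + 0; break }
      if (loss < 0) next
      hop = $1
      sub(/\..*$/, "", hop)          # "4.|--" or "4." becomes "4"
      if (loss > 0) { if (start == "") start = hop }
      else start = ""                # loss cleared up: earlier hop was rate-limiting
    }
    END { if (start != "") print start }
  '
}

# Example:
# mtr -rwzbc 100 <host> | first_lossy_hop
```

A hop that shows loss but is followed by a clean hop resets the candidate, which is how the filter discards ICMP rate-limiting on intermediate routers.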