TCP Troubleshooting

Common TCP problems and how to identify them with JitterTrap. Each section describes what to look for in the charts and how to diagnose the root cause.

Bufferbloat

Symptoms: Latency increases dramatically under load. A connection that shows 20ms RTT when idle may spike to 500ms+ when saturated. Interactive applications become sluggish during bulk transfers.

What It Looks Like in JitterTrap

In the TCP RTT chart:

  • Baseline RTT is low (e.g., 20ms) when idle
  • RTT climbs steadily as throughput increases
  • RTT may reach 500ms or more at full load
  • RTT returns to baseline when transfer completes

In the Throughput chart:

  • High throughput correlates with high RTT
  • The correlation is the signature—RTT tracks throughput

How to Test for Bufferbloat

  1. Start JitterTrap and establish a baseline RTT to a remote host (a stand-alone probe is sketched after this list)
  2. Begin a large file transfer (saturate the link)
  3. Watch the RTT chart—if it climbs from 20ms to 200ms+, you have bufferbloat
  4. Stop the transfer and confirm RTT returns to baseline
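
A quick way to reproduce this test outside JitterTrap is to time plain TCP connect() calls, each of which approximates one network RTT (SYN out, SYN-ACK back). A minimal sketch, assuming Python 3 on the measuring host; TARGET and PORT are placeholders for any reachable TCP service:

```python
# Times a TCP handshake once per second. Run it idle for a baseline,
# then start a bulk transfer and watch the numbers climb if the path's
# buffers are bloating. TARGET and PORT are placeholders.
import socket
import time

TARGET = "example.com"   # hypothetical measurement target
PORT = 80                # any open TCP port on that host

while True:
    start = time.monotonic()
    try:
        with socket.create_connection((TARGET, PORT), timeout=5):
            rtt_ms = (time.monotonic() - start) * 1000
            print(f"connect RTT: {rtt_ms:.1f} ms")
    except OSError as exc:
        print(f"probe failed: {exc}")
    time.sleep(1)
```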

Causes: Oversized buffers in routers, switches, or host network stacks that allow excessive queuing.

Solutions:

  • Enable Active Queue Management (AQM) like fq_codel on routers
  • Reduce buffer sizes on network equipment
  • Use congestion control that avoids keeping queues full (e.g., BBR), and enable ECN so senders see congestion signals before loss occurs

References: RFC 7567 (AQM Recommendations), RFC 8289 (CoDel), Bufferbloat.net

Receive Window Starvation

Symptoms: Throughput is limited even though the network has capacity. The receiver can't process data fast enough.

What It Looks Like in JitterTrap

In the TCP Window chart:

  • Advertised window drops toward zero
  • Zero Window markers (⚠) appear
  • Window may oscillate between zero and small values
  • Pattern is consistent regardless of RTT

In the Throughput chart:

  • Throughput drops when window shrinks
  • May see "staircase" pattern as window opens and closes

How to Diagnose

  1. Watch the TCP Window chart for a suspect flow
  2. If window drops to zero while throughput also drops, receiver is the bottleneck
  3. Capture packets during the event
  4. In Wireshark, look for Window Full and Zero Window events

Causes: Slow application not reading from socket buffers, or socket receive buffer too small.
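
If you want to see this signature on demand (for example, to check what your charts show), you can starve a receiver deliberately. A minimal sketch, assuming Python 3 on the receiving host; the port and buffer sizes are arbitrary:

```python
# Deliberately starved receiver: a small receive buffer plus slow,
# tiny reads. Point any bulk sender at PORT and watch the advertised
# window shrink toward zero in JitterTrap or Wireshark.
import socket
import time

PORT = 5001  # arbitrary demo port

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Shrink the buffer before listen() so accepted sockets inherit it;
# the kernel may round the value up.
srv.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)
srv.bind(("", PORT))
srv.listen(1)
conn, addr = srv.accept()

while True:
    data = conn.recv(512)   # tiny reads...
    if not data:
        break
    time.sleep(0.5)         # ...with long pauses: the window slams shut
```

Reading promptly with an adequately sized SO_RCVBUF is the inverse of this sketch, which is why both appear under Solutions below.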

Solutions:

  • Profile and optimize the receiving application
  • Increase socket receive buffer size (SO_RCVBUF)
  • Check for application-level backpressure

References: RFC 793 (TCP Flow Control), RFC 7323 (Window Scaling)

Retransmission Storms

Symptoms: Poor throughput despite adequate bandwidth. High CPU usage on endpoints.

What It Looks Like in JitterTrap

In the TCP Window chart:

  • Frequent Retransmit markers (↩)
  • Markers may be clustered (burst loss) or evenly distributed (steady loss)
  • Window size may fluctuate as congestion control reacts

In the TCP RTT chart:

  • RTT may spike during retransmission events
  • Erratic RTT pattern if loss is causing timeout-based retransmits
  • Smoother RTT if fast retransmit (duplicate ACKs) is working

How to Diagnose

  1. Count retransmit markers over time—occasional is normal, frequent indicates a problem (a kernel-counter cross-check is sketched after this list)
  2. Note if retransmits are clustered (burst loss) or distributed (random loss)
  3. Set a trap to capture packets when retransmits exceed a threshold
  4. Analyze in Wireshark to determine if loss is at a specific hop
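
For step 1, the kernel's own counters give a cross-check that is independent of chart markers. A minimal sketch, assuming Python 3 on Linux; note that RetransSegs in /proc/net/snmp is system-wide, not per-flow:

```python
# Prints TCP segments retransmitted per second, system-wide, by
# sampling the kernel's RetransSegs counter. Linux-specific.
import time

def retrans_segs():
    # /proc/net/snmp holds two "Tcp:" lines: field names, then values.
    with open("/proc/net/snmp") as f:
        rows = [line.split() for line in f if line.startswith("Tcp:")]
    headers, values = rows[0], rows[1]
    return int(values[headers.index("RetransSegs")])

prev = retrans_segs()
while True:
    time.sleep(1)
    cur = retrans_segs()
    print(f"retransmitted segments/s: {cur - prev}")
    prev = cur
```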

Causes: Packet loss from congestion, bad links, or MTU issues.

Solutions:

  • Identify where loss is occurring (use packet capture)
  • Check for duplex mismatches
  • Verify MTU is consistent across path
  • Look for congested links or failing hardware

References: RFC 5681 (Fast Retransmit), RFC 6298 (RTO Calculation)

Head-of-Line Blocking

Symptoms: Periodic stalls in data delivery even when packets are arriving.

What It Looks Like in JitterTrap

In the Throughput chart:

  • Gaps or dips that don't correlate with network congestion
  • Throughput returns to normal after brief pause
  • Pattern may repeat periodically if loss keeps hitting the same point in each burst

In the TCP Window chart:

  • Dup ACK markers during the stall
  • Window may remain healthy (receiver has space, just waiting for in-order data)

How to Diagnose

  1. Look for throughput dips that don't match RTT spikes
  2. Check for Dup ACK markers (indicate out-of-order arrival)
  3. If application streams multiple independent data types over one TCP connection, head-of-line blocking is likely
  4. Capture during a stall to see the out-of-order packets in Wireshark

Causes: TCP's in-order delivery requirement means one lost packet stalls all following data.

Solutions:

  • Consider QUIC or other protocols with stream multiplexing
  • Use multiple TCP connections for independent data streams (see the sketch after this list)
  • Reduce RTT to minimize stall duration
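
A minimal sketch of the one-connection-per-stream workaround, assuming Python 3; the host, port numbers, and messages are hypothetical:

```python
# One TCP connection per independent stream: a lost segment on the
# "bulk" connection can no longer stall "control" or "telemetry".
import socket

HOST = "example.com"  # hypothetical server
STREAM_PORTS = {"telemetry": 9001, "control": 9002, "bulk": 9003}

conns = {name: socket.create_connection((HOST, port))
         for name, port in STREAM_PORTS.items()}

# Each stream progresses independently; head-of-line blocking is now
# confined to losses within its own connection.
conns["control"].sendall(b"set-rate 100\n")
conns["telemetry"].sendall(b"sample 42\n")
```

The cost is extra connection state and a separate congestion controller per connection; QUIC achieves the same isolation with streams inside a single connection.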

References: RFC 793 (In-Order Delivery), RFC 9000 (QUIC)

Nagle's Algorithm + Delayed ACK

Symptoms: Small writes have unexpectedly high latency (often ~40ms).

What It Looks Like in JitterTrap

In the TCP RTT chart:

  • Very consistent ~40ms RTT spikes
  • The regularity is the key signature—network jitter is random; this delay is fixed
  • Pattern appears on request/response workloads with small messages

In the Throughput chart:

  • Low throughput with periodic bursts
  • Each burst separated by ~40ms gaps

How to Diagnose

  1. Look for suspiciously consistent 40ms RTT
  2. Check if the pattern occurs only with small messages
  3. Capture packets and look for delayed ACKs (the classic timer is 200ms; Linux typically delays around 40ms)
  4. Test with TCP_NODELAY to confirm (see the sketch after this list); if RTT drops dramatically, this was the cause
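
A sketch of the step 4 confirmation test, assuming Python 3 and a request/response service you control; HOST, PORT, and the two-part request format are placeholders. It deliberately splits each request across two small writes, the write-write-read pattern that triggers the interaction:

```python
# Measures mean request latency with Nagle on (default) and off
# (TCP_NODELAY). If the default run shows a ~40ms floor that the
# nodelay run lacks, Nagle + delayed ACK was the cause.
import socket
import time

HOST, PORT = "example.com", 9000  # hypothetical service

def mean_latency_ms(nodelay, count=20):
    sock = socket.create_connection((HOST, PORT))
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, int(nodelay))
    start = time.monotonic()
    for _ in range(count):
        sock.sendall(b"HEAD")  # first small write goes out immediately
        sock.sendall(b"BODY")  # Nagle holds this until HEAD is ACKed,
                               # and delayed ACK defers that ACK ~40ms
        sock.recv(64)          # server replies after the full request
    sock.close()
    return (time.monotonic() - start) / count * 1000

print(f"default:     {mean_latency_ms(False):.1f} ms/request")
print(f"TCP_NODELAY: {mean_latency_ms(True):.1f} ms/request")
```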

Causes: Nagle's algorithm waits for ACK before sending small packets. Delayed ACK waits ~40ms before acknowledging. Together they create artificial delays.

Solutions:

  • Set TCP_NODELAY on latency-sensitive sockets
  • Use TCP_QUICKACK on the receiver
  • Batch small writes into larger ones

References: RFC 896 (Nagle's Algorithm), RFC 1122 §4.2.3.2 (Delayed ACK)

Congestion Window Collapse

Symptoms: Throughput drops sharply and recovers slowly after packet loss.

What It Looks Like in JitterTrap

In the Throughput chart:

  • Sawtooth pattern: gradual increase, sharp drop, slow recovery
  • Each cycle takes several RTTs to recover
  • May see multiple cycles during sustained transfer

In the TCP RTT chart:

  • RTT increases as congestion builds (bufferbloat)
  • Retransmit markers appear
  • RTT drops when congestion control backs off

How to Diagnose

  1. Look for the sawtooth throughput pattern
  2. Note if RTT spikes precede the throughput drops (bufferbloat triggering loss)
  3. Time the recovery—slow ramp indicates traditional AIMD congestion control
  4. Compare behavior with different congestion control algorithms, e.g., BBR vs CUBIC (a per-socket switch is sketched after this list)
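
For step 4, Linux exposes the congestion control algorithm as a per-socket option, so both can be tested from the same host. A minimal sketch, assuming Python 3.6+ on Linux with bbr loaded (see /proc/sys/net/ipv4/tcp_available_congestion_control); HOST and PORT are placeholders:

```python
# Opens one connection per algorithm; run the same bulk transfer over
# each and compare the sawtooth and recovery time in JitterTrap.
import socket

HOST, PORT = "example.com", 80  # hypothetical transfer target

for algo in (b"cubic", b"bbr"):
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Must be set before connect(); unprivileged users may only pick
    # algorithms listed in tcp_allowed_congestion_control.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, algo)
    sock.connect((HOST, PORT))
    name = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_CONGESTION, 16)
    print("connected using", name.split(b"\x00")[0].decode())
    # ... drive the bulk transfer here ...
    sock.close()
```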

Causes: TCP's congestion control cuts the sending rate dramatically after detecting loss.

Solutions:

  • Reduce packet loss (the real fix)
  • Consider BBR congestion control for lossy links
  • Use ECN to get early congestion signals before loss occurs

References: RFC 5681 (Congestion Control), RFC 8312 (CUBIC), RFC 3168 (ECN)

Retransmission Timeout (RTO) Stalls

Symptoms: Long stalls (1-3+ seconds) followed by a burst of activity. Much worse than typical packet loss recovery.

What It Looks Like in JitterTrap

In the TCP RTT chart:

  • Gaps of 1+ seconds with no data
  • Multiple Retransmit markers (↩) clustered after the gap
  • Pattern: silence, then burst of retransmits, then recovery

In the Throughput chart:

  • Complete stop, then sudden burst
  • Much longer pause than normal retransmission

How to Diagnose

  1. Time the stall duration—1+ seconds indicates RTO, not fast retransmit
  2. Check if retransmits cluster after the gap (RTO fired)
  3. Look for patterns—tail loss (end of burst) often triggers RTO
  4. Capture packets and check if fast retransmit (3 dup ACKs) failed

Causes: When fast retransmit (3 duplicate ACKs) fails, TCP falls back to RTO-based recovery. The minimum RTO is often 200ms-1s, and it doubles with each failed attempt (exponential backoff). A lost retransmit can cause multi-second stalls.
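
The arithmetic makes the stall sizes concrete; a sketch assuming a Linux-typical 200ms minimum RTO:

```python
# Cumulative stall when successive retransmissions are also lost:
# each timeout doubles (exponential backoff per RFC 6298).
rto_ms = 200  # Linux-typical floor; RFC 6298 itself recommends 1 second
total = 0
for attempt in range(1, 6):
    total += rto_ms
    print(f"attempt {attempt}: waited {rto_ms} ms (cumulative {total} ms)")
    rto_ms *= 2
```

Five consecutive losses of the same segment already cost over six seconds of silence.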

Solutions:

  • Investigate why fast retransmit is failing (tail loss, small windows)
  • Enable TLP (Tail Loss Probe) and RACK if available
  • For latency-sensitive applications, these stalls may be unacceptable—consider UDP

References: RFC 6298 (RTO Calculation), RFC 5681 §3.2 (Fast Retransmit)

Silly Window Syndrome

Symptoms: High packet rate but low throughput. Lots of small packets instead of full-sized segments.

What It Looks Like in JitterTrap

In the TCP Window chart:

  • Very small advertised window values (bytes, not KB)
  • Window may oscillate between tiny values

In the Top Talkers:

  • High packet count relative to byte count
  • Throughput is a fraction of expected

How to Diagnose

  1. Compare packet rate to byte rate—if packet rate is high but throughput is low, packets are small (a quick calculation is sketched after this list)
  2. Check TCP Window for tiny values
  3. Look for recovery pattern after window starvation
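
Step 1 is simple arithmetic; a sketch with made-up numbers (the packet and byte counts would come from Top Talkers or interface statistics):

```python
# Flags a flow whose mean payload per packet is far below the MSS,
# the arithmetic signature of tiny segments. All numbers here are
# illustrative placeholders.
MSS = 1460                                # typical Ethernet-path MSS
packets, payload_bytes = 10_000, 900_000  # hypothetical measurements

mean_segment = payload_bytes / packets
print(f"mean payload per packet: {mean_segment:.0f} B "
      f"({mean_segment / MSS:.0%} of MSS)")
if mean_segment < 0.2 * MSS:
    print("mostly tiny segments: suspect SWS or very chatty writes")
```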

Causes: Receiver advertises tiny windows (e.g., after window starvation recovery). Sender sends tiny segments to fill the advertised window. Overhead dominates.

Solutions:

  • Most TCP stacks have SWS avoidance built in
  • If you're seeing this, check for broken or embedded TCP implementations
  • Increase receive buffer sizes

References: RFC 813 (Window and Acknowledgement Strategy), RFC 1122 §4.2.3.4 (SWS Avoidance)

TCP vs UDP: When TCP Hurts

TCP is designed for reliable, ordered delivery of bulk data. These guarantees carry costs, especially for real-time systems, that are often invisible until you look closely:

  • Guaranteed delivery: stalls waiting for retransmits of data that may no longer be relevant
  • In-order delivery: head-of-line blocking; one lost packet blocks everything behind it
  • Congestion control: throughput collapse after loss, slow recovery, and competing flows affecting each other
  • Connection establishment: the three-way handshake means the first data byte reaches the server 1.5 RTT after the client starts, and both ends must hold connection state
  • Flow control: a slow receiver blocks a fast sender, even when the data could simply be dropped

Consider UDP when:

  • You can tolerate some loss
  • You need the lowest possible latency
  • Data has a "freshness" deadline
  • You want application-level control over retransmission decisions

Examples: VoIP, video conferencing, gaming, live telemetry, sensor data, financial trading, DNS.
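
As a sketch of the "freshness deadline" idea: a UDP receiver can simply discard stale datagrams instead of stalling for them, which TCP's delivery model cannot do. Assuming Python 3; the wire format (an 8-byte sender timestamp plus payload), the 50ms deadline, and the port are invented for illustration, and sender and receiver clocks are assumed roughly synchronized:

```python
# Drops datagrams older than the deadline rather than waiting or
# retransmitting; late data is worthless here, so loss is acceptable.
import socket
import struct
import time

DEADLINE_S = 0.050  # data older than 50ms is no longer useful

def process(payload: bytes) -> None:
    print(f"fresh payload: {payload!r}")  # stand-in for real handling

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("", 5005))  # arbitrary port

while True:
    datagram, addr = sock.recvfrom(2048)
    sent_at = struct.unpack("!d", datagram[:8])[0]  # sender timestamp
    if time.time() - sent_at > DEADLINE_S:
        continue  # stale: drop and move on, no retransmit, no stall
    process(datagram[8:])
```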

General Diagnostic Workflow

For any TCP performance issue:

  1. Establish baseline — Observe charts during normal operation. Know what "good" looks like.

  2. Identify the flow — Use Top Talkers to find the specific connection with issues.

  3. Check RTT first — High or variable RTT affects almost everything else.

    • High RTT → check for bufferbloat, long paths, or congestion
    • Variable RTT → check for jitter, route changes, or competing traffic

  4. Check the Window — If throughput is limited but RTT is reasonable:

    • Small window → receiver issue (application not reading, buffer too small)
    • Window collapse → congestion control reacting to loss

  5. Look for markers — Retransmit (↩) and Zero Window (⚠) markers tell you what's happening:

    • Many retransmits → packet loss problem
    • Zero window → receiver backpressure

  6. Correlate events — The most useful insights come from correlating multiple charts:

    • RTT spike + throughput drop → bufferbloat
    • Window drop + throughput drop → receiver starvation
    • Retransmit + throughput drop → packet loss

  7. Capture packets — Set traps to automatically capture when thresholds are exceeded. Analyze in Wireshark for definitive diagnosis.

References

Key RFCs

  • RFC 793: Transmission Control Protocol
  • RFC 813: Window and Acknowledgement Strategy in TCP
  • RFC 896: Congestion Control in IP/TCP (Nagle)
  • RFC 1122: Requirements for Internet Hosts
  • RFC 3168: Explicit Congestion Notification
  • RFC 5681: TCP Congestion Control
  • RFC 6298: Computing TCP's Retransmission Timer
  • RFC 7323: TCP Extensions for High Performance
  • RFC 7567: IETF Recommendations Regarding AQM
  • RFC 8289: Controlled Delay Active Queue Management (CoDel)
  • RFC 8312: CUBIC Congestion Control
  • RFC 9000: QUIC Transport Protocol