Network Working Group                                          V. Paxson
Request for Comments: 2525                                         Editor
Category: Informational                                      ACIRI / ICSI
                                                                M. Allman
                             NASA Glenn Research Center/Sterling Software
                                                                S. Dawson
                                           Real-Time Computing Laboratory
                                                                W. Fenner
                                                               Xerox PARC
                                                                J. Griner
                                              NASA Glenn Research Center
                                                               I. Heavens
                                                    Spider Software Ltd.
                                                                 K. Lahey
                                            NASA Ames Research Center/MRJ
                                                                 J. Semke
                                         Pittsburgh Supercomputing Center
                                                                  B. Volz
                                            Process Software Corporation
                                                               March 1999


                    Known TCP Implementation Problems

Status of this Memo

   This memo provides information for the Internet community. It does
   not specify an Internet standard of any kind. Distribution of this
   memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (1999). All Rights Reserved.

Table of Contents

   1. INTRODUCTION....................................................2
   2. KNOWN IMPLEMENTATION PROBLEMS...................................3
      2.1 No initial slow start........................................3
      2.2 No slow start after retransmission timeout...................6
      2.3 Uninitialized CWND...........................................9
      2.4 Inconsistent retransmission.................................11
      2.5 Failure to retain above-sequence data.......................13
      2.6 Extra additive constant in congestion avoidance.............17
      2.7 Initial RTO too low.........................................23
      2.8 Failure of window deflation after loss recovery.............26
      2.9 Excessively short keepalive connection timeout..............28
      2.10 Failure to back off retransmission timeout..................31

Paxson, et. al.               Informational                      [Page 1]

RFC 2525               TCP Implementation Problems            March 1999

      2.11 Insufficient interval between keepalives....................34
      2.12 Window probe deadlock.......................................36
      2.13 Stretch ACK violation.......................................40
      2.14 Retransmission sends multiple packets.......................43
      2.15 Failure to send FIN notification promptly...................45
      2.16 Failure to send a RST after Half Duplex Close...............47
      2.17 Failure to RST on close with data pending...................50
      2.18 Options missing from TCP MSS calculation....................54
   3. SECURITY CONSIDERATIONS........................................56
   4. ACKNOWLEDGEMENTS...............................................56
   5. REFERENCES.....................................................57
   6. AUTHORS' ADDRESSES.............................................58
   7. FULL COPYRIGHT STATEMENT.......................................60

1. Introduction

   This memo catalogs a number of known TCP implementation problems.
   The goal in doing so is to improve conditions in the existing
   Internet by enhancing the quality of current TCP/IP implementations.
   It is hoped that both performance and correctness issues can be
   resolved by making implementors aware of the problems and their
   solutions. In the long term, it is hoped that this will provide a
   reduction in unnecessary traffic on the network, the rate of
   connection failures due to protocol errors, and load on network
   servers due to time spent processing both unsuccessful connections
   and retransmitted data. This will help to ensure the stability of
   the global Internet.

   Each problem is defined as follows:

   Name of Problem
      The name associated with the problem. In this memo, the name is
      given as a subsection heading.

   Classification
      One or more problem categories for which the problem is
      classified: "congestion control", "performance", "reliability",
      "resource management".

   Description
      A definition of the problem, succinct but including necessary
      background material.

   Significance
      A brief summary of the sorts of environments for which the
      problem is significant.

   Implications
      Why the problem is viewed as a problem.

   Relevant RFCs
      The RFCs defining the TCP specification with which the problem
      conflicts. These RFCs often qualify behavior using terms such as
      MUST, SHOULD, MAY, and others written capitalized. See RFC 2119
      for the exact interpretation of these terms.

   Trace file demonstrating the problem
      One or more ASCII trace files demonstrating the problem, if
      applicable.

   Trace file demonstrating correct behavior
      One or more examples of how correct behavior appears in a trace,
      if applicable.

   References
      References that further discuss the problem.

   How to detect
      How to test an implementation to see if it exhibits the problem.
      This discussion may include difficulties and subtleties
      associated with causing the problem to manifest itself, and with
      interpreting traces to detect the presence of the problem (if
      applicable).

   How to fix
      For known causes of the problem, how to correct the
      implementation.

2. Known implementation problems

2.1.

   Name of Problem
      No initial slow start

   Classification
      Congestion control

   Description
      When a TCP begins transmitting data, it is required by RFC 1122,
      4.2.2.15, to engage in a "slow start" by initializing its
      congestion window, cwnd, to one packet (one segment of the
      maximum size). (Note that an experimental change to TCP,
      documented in [RFC2414], allows an initial value somewhat larger
      than one packet.) It subsequently increases cwnd by one packet
      for each ACK it receives for new data. The minimum of cwnd and
      the receiver's advertised window bounds the highest sequence
      number the TCP can transmit. A TCP that fails to initialize and
      increment cwnd in this fashion exhibits "No initial slow start".
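
      The rule just described is small enough to express directly. The
      sketch below is a toy model in Python (variable and function
      names are ours, not taken from any stack): cwnd starts at one
      segment, grows by one MSS per ACK for new data, and the usable
      window is bounded by the receiver's advertised window.

```python
MSS = 1460              # maximum segment size agreed in the SYN exchange
ADVERTISED_WND = 32768  # receiver's advertised window

cwnd = 1 * MSS          # RFC 1122, 4.2.2.15: begin with one segment

def usable_window(cwnd, adv_wnd):
    # The sender may have at most min(cwnd, adv_wnd) bytes outstanding.
    return min(cwnd, adv_wnd)

def on_ack_for_new_data():
    # During slow start, each ACK for new data opens cwnd by one
    # segment, roughly doubling the window once per round trip.
    global cwnd
    cwnd += MSS

# The first flight is limited to a single segment...
assert usable_window(cwnd, ADVERTISED_WND) == 1460
# ...and cwnd grows exponentially as ACKs return.
for _ in range(3):
    on_ack_for_new_data()
assert cwnd == 4 * MSS
```

      A TCP exhibiting "No initial slow start" behaves as though
      usable_window() returned adv_wnd alone, with no cwnd term.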

   Significance
      In congested environments, detrimental to the performance of
      other connections, and possibly to the connection itself.

   Implications
      A TCP failing to slow start when beginning a connection results
      in traffic bursts that can stress the network, leading to
      excessive queueing delays and packet loss.

      Implementations exhibiting this problem might do so because they
      suffer from the general problem of not including the required
      congestion window. These implementations will also suffer from
      "No slow start after retransmission timeout".

      There are different shades of "No initial slow start". From the
      perspective of stressing the network, the worst is a connection
      that simply always sends based on the receiver's advertised
      window, with no notion of a separate congestion window. Another
      form is described in "Uninitialized CWND" below.

   Relevant RFCs
      RFC 1122 requires use of slow start. RFC 2001 gives the
      specifics of slow start.

   Trace file demonstrating it
      Made using tcpdump [Jacobson89] recording at the connection
      responder. No losses reported by the packet filter.

      10:40:42.244503 B > A: S 1168512000:1168512000(0) win 32768
                      <mss 1460,nop,wscale 0> (DF) [tos 0x8]
      10:40:42.259908 A > B: S 3688169472:3688169472(0)
                      ack 1168512001 win 32768 <mss 1460>
      10:40:42.389992 B > A: . ack 1 win 33580 (DF) [tos 0x8]
      10:40:42.664975 A > B: P 1:513(512) ack 1 win 32768
      10:40:42.700185 A > B: . 513:1973(1460) ack 1 win 32768
      10:40:42.718017 A > B: . 1973:3433(1460) ack 1 win 32768
      10:40:42.762945 A > B: . 3433:4893(1460) ack 1 win 32768
      10:40:42.811273 A > B: . 4893:6353(1460) ack 1 win 32768
      10:40:42.829149 A > B: . 6353:7813(1460) ack 1 win 32768
      10:40:42.853687 B > A: . ack 1973 win 33580 (DF) [tos 0x8]
      10:40:42.864031 B > A: . ack 3433 win 33580 (DF) [tos 0x8]

      After the third packet, the connection is established. A, the
      connection responder, begins transmitting to B, the connection
      initiator. Host A quickly sends 6 packets comprising 7812 bytes,
      even though the SYN exchange agreed upon an MSS of 1460 bytes
      (implying an initial congestion window of 1 segment corresponds
      to 1460 bytes), and so A should have sent at most 1460 bytes.

      The ACKs sent by B to A in the last two lines indicate that this
      trace is not a measurement error (slow start really occurring but
      the corresponding ACKs having been dropped by the packet filter).

      A second trace confirmed that the problem is repeatable.

   Trace file demonstrating correct behavior
      Made using tcpdump recording at the connection originator. No
      losses reported by the packet filter.

      12:35:31.914050 C > D: S 1448571845:1448571845(0)
                      win 4380 <mss 1460>
      12:35:32.068819 D > C: S 1755712000:1755712000(0)
                      ack 1448571846 win 4096
      12:35:32.069341 C > D: . ack 1 win 4608
      12:35:32.075213 C > D: P 1:513(512) ack 1 win 4608
      12:35:32.286073 D > C: . ack 513 win 4096
      12:35:32.287032 C > D: . 513:1025(512) ack 1 win 4608
      12:35:32.287506 C > D: . 1025:1537(512) ack 1 win 4608
      12:35:32.432712 D > C: . ack 1537 win 4096
      12:35:32.433690 C > D: . 1537:2049(512) ack 1 win 4608
      12:35:32.434481 C > D: . 2049:2561(512) ack 1 win 4608
      12:35:32.435032 C > D: . 2561:3073(512) ack 1 win 4608
      12:35:32.594526 D > C: . ack 3073 win 4096
      12:35:32.595465 C > D: . 3073:3585(512) ack 1 win 4608
      12:35:32.595947 C > D: . 3585:4097(512) ack 1 win 4608
      12:35:32.596414 C > D: . 4097:4609(512) ack 1 win 4608
      12:35:32.596888 C > D: . 4609:5121(512) ack 1 win 4608
      12:35:32.733453 D > C: . ack 4097 win 4096

   References
      This problem is documented in [Paxson97].

   How to detect
      For implementations always manifesting this problem, it shows up
      immediately in a packet trace or a sequence plot, as illustrated
      above.

   How to fix
      If the root problem is that the implementation lacks a notion of
      a congestion window, then unfortunately this requires significant
      work to fix. However, doing so is important, as such
      implementations also exhibit "No slow start after retransmission
      timeout".

2.2.

   Name of Problem
      No slow start after retransmission timeout

   Classification
      Congestion control

   Description
      When a TCP experiences a retransmission timeout, it is required
      by RFC 1122, 4.2.2.15, to engage in "slow start" by initializing
      its congestion window, cwnd, to one packet (one segment of the
      maximum size). It subsequently increases cwnd by one packet for
      each ACK it receives for new data until it reaches the
      "congestion avoidance" threshold, ssthresh, at which point the
      congestion avoidance algorithm for updating the window takes
      over. A TCP that fails to enter slow start upon a timeout
      exhibits "No slow start after retransmission timeout".
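
      The required timeout behavior can be sketched as follows (a toy
      model in Python with hypothetical names; the ssthresh rule
      follows RFC 2001's "half the flight size, but at least two
      segments"):

```python
MSS = 1460

def on_retransmission_timeout(flight_size, state):
    # RFC 2001: on a timeout, remember half the data in flight (at
    # least two segments) as ssthresh, then restart slow start from
    # one segment.
    state["ssthresh"] = max(flight_size // 2, 2 * MSS)
    state["cwnd"] = 1 * MSS

def on_ack_for_new_data(state):
    if state["cwnd"] < state["ssthresh"]:
        state["cwnd"] += MSS                        # slow start
    else:
        # congestion avoidance: about one segment per round trip
        state["cwnd"] += MSS * MSS // state["cwnd"]

state = {"cwnd": 20 * MSS, "ssthresh": 64 * 1024}
on_retransmission_timeout(flight_size=20 * MSS, state=state)
assert state["cwnd"] == MSS            # back to one segment
assert state["ssthresh"] == 10 * MSS   # half the previous flight
```

      A TCP exhibiting this problem skips the on_retransmission_timeout
      step and continues sending from its old, large window.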

   Significance
      In congested environments, severely detrimental to the
      performance of other connections, and also the connection itself.

   Implications
      Entering slow start upon timeout forms one of the cornerstones of
      Internet congestion stability, as outlined in [Jacobson88]. If
      TCPs fail to do so, the network becomes at risk of suffering
      "congestion collapse" [RFC896].

   Relevant RFCs
      RFC 1122 requires use of slow start after loss. RFC 2001 gives
      the specifics of how to implement slow start. RFC 896 describes
      congestion collapse.

      The retransmission timeout discussed here should not be confused
      with the separate "fast recovery" retransmission mechanism
      discussed in RFC 2001.

   Trace file demonstrating it
      Made using tcpdump recording at the sending TCP (A). No losses
      reported by the packet filter.

      10:40:59.090612 B > A: . ack 357125 win 33580 (DF) [tos 0x8]
      10:40:59.222025 A > B: . 357125:358585(1460) ack 1 win 32768
      10:40:59.868871 A > B: . 357125:358585(1460) ack 1 win 32768
      10:41:00.016641 B > A: . ack 364425 win 33580 (DF) [tos 0x8]
      10:41:00.036709 A > B: . 364425:365885(1460) ack 1 win 32768
      10:41:00.045231 A > B: . 365885:367345(1460) ack 1 win 32768
      10:41:00.053785 A > B: . 367345:368805(1460) ack 1 win 32768
      10:41:00.062426 A > B: . 368805:370265(1460) ack 1 win 32768
      10:41:00.071074 A > B: . 370265:371725(1460) ack 1 win 32768
      10:41:00.079794 A > B: . 371725:373185(1460) ack 1 win 32768
      10:41:00.089304 A > B: . 373185:374645(1460) ack 1 win 32768
      10:41:00.097738 A > B: . 374645:376105(1460) ack 1 win 32768
      10:41:00.106409 A > B: . 376105:377565(1460) ack 1 win 32768
      10:41:00.115024 A > B: . 377565:379025(1460) ack 1 win 32768
      10:41:00.123576 A > B: . 379025:380485(1460) ack 1 win 32768
      10:41:00.132016 A > B: . 380485:381945(1460) ack 1 win 32768
      10:41:00.141635 A > B: . 381945:383405(1460) ack 1 win 32768
      10:41:00.150094 A > B: . 383405:384865(1460) ack 1 win 32768
      10:41:00.158552 A > B: . 384865:386325(1460) ack 1 win 32768
      10:41:00.167053 A > B: . 386325:387785(1460) ack 1 win 32768
      10:41:00.175518 A > B: . 387785:389245(1460) ack 1 win 32768
      10:41:00.210835 A > B: . 389245:390705(1460) ack 1 win 32768
      10:41:00.226108 A > B: . 390705:392165(1460) ack 1 win 32768
      10:41:00.241524 B > A: . ack 389245 win 8760 (DF) [tos 0x8]

      The first packet indicates the ack point is 357125. 130 msec
      after receiving the ACK, A transmits the packet after the ACK
      point, 357125:358585. 640 msec after this transmission, it
      retransmits 357125:358585, in an apparent retransmission timeout.
      At this point, A's cwnd should be one MSS, or 1460 bytes, as A
      enters slow start. The trace is consistent with this
      possibility.

      B replies with an ACK of 364425, indicating that A has filled a
      sequence hole. At this point, A's cwnd should be 1460*2 = 2920
      bytes, since in slow start receiving an ACK advances cwnd by MSS.
      However, A then launches 19 consecutive packets, which is
      inconsistent with slow start.

      A second trace confirmed that the problem is repeatable.

   Trace file demonstrating correct behavior
      Made using tcpdump recording at the sending TCP (C). No losses
      reported by the packet filter.

      12:35:48.442538 C > D: P 465409:465921(512) ack 1 win 4608
      12:35:48.544483 D > C: . ack 461825 win 4096
      12:35:48.703496 D > C: . ack 461825 win 4096
      12:35:49.044613 C > D: . 461825:462337(512) ack 1 win 4608
      12:35:49.192282 D > C: . ack 465921 win 2048
      12:35:49.192538 D > C: . ack 465921 win 4096
      12:35:49.193392 C > D: P 465921:466433(512) ack 1 win 4608
      12:35:49.194726 C > D: P 466433:466945(512) ack 1 win 4608
      12:35:49.350665 D > C: . ack 466945 win 4096
      12:35:49.351694 C > D: . 466945:467457(512) ack 1 win 4608
      12:35:49.352168 C > D: . 467457:467969(512) ack 1 win 4608
      12:35:49.352643 C > D: . 467969:468481(512) ack 1 win 4608
      12:35:49.506000 D > C: . ack 467969 win 3584

      After C transmits the first packet shown to D, it takes no action
      in response to D's ACKs for 461825, because the first packet
      already reached the advertised window limit of 4096 bytes above
      461825. 600 msec after transmitting the first packet, C
      retransmits 461825:462337, presumably due to a timeout. Its
      congestion window is now MSS (512 bytes).

      D acks 465921, indicating that C's retransmission filled a
      sequence hole. This ACK advances C's cwnd from 512 to 1024.
      Very shortly after, D acks 465921 again in order to update the
      offered window from 2048 to 4096. This ACK does not advance cwnd
      since it is not for new data. Very shortly after, C responds to
      the newly enlarged window by transmitting two packets. D acks
      both, advancing cwnd from 1024 to 1536. C in turn transmits
      three packets.

   References
      This problem is documented in [Paxson97].

   How to detect
      Packet loss is common enough in the Internet that generally it is
      not difficult to find an Internet path that will force
      retransmission due to packet loss.

      If the effective window prior to loss is large enough, however,
      then the TCP may retransmit using the "fast recovery" mechanism
      described in RFC 2001. In a packet trace, the signature of fast
      recovery is that the packet retransmission occurs in response to
      the receipt of three duplicate ACKs, and subsequent duplicate
      ACKs may lead to the transmission of new data, above both the ack
      point and the highest sequence transmitted so far. An absence of
      three duplicate ACKs prior to retransmission suffices to
      distinguish between timeout and fast recovery retransmissions.
      In the face of only observing fast recovery retransmissions,
      generally it is not difficult to repeat the data transfer until
      observing a timeout retransmission.

      Once armed with a trace exhibiting a timeout retransmission,
      determining whether the TCP follows slow start is done by
      computing the correct progression of cwnd and comparing it to the
      amount of data transmitted by the TCP subsequent to the timeout
      retransmission.
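
      For example, the expected progression can be computed
      mechanically and compared against the bytes in flight in the
      trace (a sketch; the 512-byte values mirror the correct-behavior
      trace above):

```python
def cwnd_progression(mss, acks_for_new_data, ssthresh=float("inf")):
    """Expected cwnd after each ACK following a timeout retransmission."""
    cwnd = mss  # slow start restarts from one segment
    values = [cwnd]
    for _ in range(acks_for_new_data):
        if cwnd < ssthresh:
            cwnd += mss                  # slow start
        else:
            cwnd += mss * mss // cwnd    # congestion avoidance
        values.append(cwnd)
    return values

# The correct-behavior trace above (MSS 512) shows 512 -> 1024 -> 1536.
assert cwnd_progression(512, 2) == [512, 1024, 1536]
```

      A trace whose post-timeout flights substantially exceed this
      progression indicates the problem.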

   How to fix
      If the root problem is that the implementation lacks a notion of
      a congestion window, then unfortunately this requires significant
      work to fix. However, doing so is critical, for reasons outlined
      above.

2.3.

   Name of Problem
      Uninitialized CWND

   Classification
      Congestion control

   Description
      As described above for "No initial slow start", when a TCP
      connection begins cwnd is initialized to one segment (or perhaps
      a few segments, if experimenting with [RFC2414]). One particular
      form of "No initial slow start", worth separate mention as the
      bug is fairly widely deployed, is "Uninitialized CWND". That is,
      while the TCP implements the proper slow start mechanism, it
      fails to initialize cwnd properly, so slow start in fact fails to
      occur.

      One way the bug can occur is if, during the connection
      establishment handshake, the SYN ACK packet arrives without an
      MSS option. The faulty implementation uses receipt of the MSS
      option to initialize cwnd to one segment; if the option fails to
      arrive, then cwnd is instead initialized to a very large value.

   Significance
      In congested environments, detrimental to the performance of
      other connections, and likely to the connection itself. The
      burst can be so large (see below) that it has deleterious effects
      even in uncongested environments.

   Implications
      A TCP exhibiting this behavior is stressing the network with a
      large burst of packets, which can cause loss in the network.

   Relevant RFCs
      RFC 1122 requires use of slow start. RFC 2001 gives the
      specifics of slow start.

   Trace file demonstrating it
      This trace was made using tcpdump running on host A. Host A is
      the sender and host B is the receiver. The advertised window and
      timestamp options have been omitted for clarity, except for the
      first segment sent by host A. Note that A sends an MSS option in
      its initial SYN but B does not include one in its reply.

      16:56:02.226937 A > B: S 237585307:237585307(0) win 8192
                      <mss 536,nop,wscale 0,nop,nop,timestamp[|tcp]>
      16:56:02.557135 B > A: S 1617216000:1617216000(0)
                      ack 237585308 win 16384
      16:56:02.557788 A > B: . ack 1 win 8192
      16:56:02.566014 A > B: . 1:537(536) ack 1
      16:56:02.566557 A > B: . 537:1073(536) ack 1
      16:56:02.567120 A > B: . 1073:1609(536) ack 1
      16:56:02.567662 A > B: P 1609:2049(440) ack 1
      16:56:02.568349 A > B: . 2049:2585(536) ack 1
      16:56:02.568909 A > B: . 2585:3121(536) ack 1

      [54 additional burst segments deleted for brevity]

      16:56:02.936638 A > B: . 32065:32601(536) ack 1
      16:56:03.018685 B > A: . ack 1

      After the three-way handshake, host A bursts 61 segments into the
      network, before duplicate ACKs on the first segment cause a
      retransmission to occur. Since host A did not wait for the ACK
      on the first segment before sending additional segments, it is
      exhibiting "Uninitialized CWND".

   Trace file demonstrating correct behavior
      See the example for "No initial slow start".

   References
      This problem is documented in [Paxson97].

   How to detect
      This problem can be detected by examining a packet trace recorded
      at either the sender or the receiver. However, the bug can be
      difficult to induce because it requires finding a remote TCP peer
      that does not send an MSS option in its SYN ACK.

   How to fix
      This problem can be fixed by ensuring that cwnd is initialized
      upon receipt of a SYN ACK, even if the SYN ACK does not contain
      an MSS option.
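
      A sketch of the fix (536 bytes is the default segment size RFC
      1122 specifies when no MSS option is received; the function name
      is hypothetical, not from any stack):

```python
DEFAULT_MSS = 536  # RFC 1122 default when the peer sends no MSS option

def on_syn_ack(mss_option):
    # Initialize cwnd unconditionally, whether or not the SYN ACK
    # carried an MSS option, so slow start always begins from one
    # segment rather than an uninitialized (huge) value.
    mss = mss_option if mss_option is not None else DEFAULT_MSS
    return {"mss": mss, "cwnd": 1 * mss}

assert on_syn_ack(1460)["cwnd"] == 1460
assert on_syn_ack(None)["cwnd"] == 536   # the formerly buggy path
```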

2.4.

   Name of Problem
      Inconsistent retransmission

   Classification
      Reliability

   Description
      If, for a given sequence number, a sending TCP retransmits
      different data than previously sent for that sequence number,
      then a strong possibility arises that the receiving TCP will
      reconstruct a different byte stream than that sent by the sending
      application, depending on which instance of the sequence number
      it accepts.

      Such a sending TCP exhibits "Inconsistent retransmission".

   Significance
      Critical for all environments.

   Implications
      Reliable delivery of data is a fundamental property of TCP.

   Relevant RFCs
      RFC 793, section 1.5, discusses the central role of reliability
      in TCP operation.

   Trace file demonstrating it
      Made using tcpdump recording at the receiving TCP (B). No losses
      reported by the packet filter.

      12:35:53.145503 A > B: FP 90048435:90048461(26)
                      ack 393464682 win 4096
                      4500 0042 9644 0000
                      3006 e4c2 86b1 0401 83f3 010a b2a4 0015
                      055e 07b3 1773 cb6a 5019 1000 68a9 0000
      data starts here>504f 5254 2031 3334 2c31 3737*2c34 2c31
                      2c31 3738 2c31 3635 0d0a
      12:35:53.146479 B > A: R 393464682:393464682(0) win 8192
      12:35:53.851714 A > B: FP 90048429:90048463(34)
                      ack 393464682 win 4096
                      4500 004a 965b 0000
                      3006 e4a3 86b1 0401 83f3 010a b2a4 0015
                      055e 07ad 1773 cb6a 5019 1000 8bd3 0000
      data starts here>5041 5356 0d0a 504f 5254 2031 3334 2c31
                      3737*2c31 3035 2c31 3431 2c34 2c31 3539
                      0d0a

      The sequence numbers shown in this trace are absolute and not
      adjusted to reflect the ISN. The 4-digit hex values show a dump
      of the packet's IP and TCP headers, as well as payload. A first
      sends to B data for 90048435:90048461. The corresponding data
      begins with hex words 504f, 5254, etc.

      B responds with a RST. Since the recording location was local to
      B, it is unknown whether A received the RST.

      A then sends 90048429:90048463, which includes six sequence
      positions below the earlier transmission, all 26 positions of the
      earlier transmission, and two additional sequence positions.

      The retransmission disagrees starting just after sequence
      90048447, annotated above with a leading '*'. These two bytes
      were originally transmitted as hex 2c34 but retransmitted as hex
      2c31. Subsequent positions disagree as well.

      This behavior has been observed in other traces involving
      different hosts. It is unknown how to repeat it.

      In this instance, no corruption would occur, since B has already
      indicated it will not accept further packets from A.

      A second example illustrates a slightly different instance of the
      problem. The tracing again was made with tcpdump at the
      receiving TCP (D).

      22:23:58.645829 C > D: P 185:212(27) ack 565 win 4096
                      4500 0043 90a3 0000
                      3306 0734 cbf1 9eef 83f3 010a 0525 0015
                      a3a2 faba 578c 70a4 5018 1000 9a53 0000
      data starts here>504f 5254 2032 3033 2c32 3431 2c31 3538
                      2c32 3339 2c35 2c34 330d 0a
      22:23:58.646805 D > C: . ack 184 win 8192
                      4500 0028 beeb 0000
                      3e06 ce06 83f3 010a cbf1 9eef 0015 0525
                      578c 70a4 a3a2 fab9 5010 2000 342f 0000
      22:31:36.532244 C > D: FP 186:213(27) ack 565 win 4096
                      4500 0043 9435 0000
                      3306 03a2 cbf1 9eef 83f3 010a 0525 0015
                      a3a2 fabb 578c 70a4 5019 1000 9a51 0000
      data starts here>504f 5254 2032 3033 2c32 3431 2c31 3538
                      2c32 3339 2c35 2c34 330d 0a

      In this trace, sequence numbers are relative. C sends 185:212,
      but D only sends an ACK for 184 (so sequence number 184 is
      missing). C then sends 186:213. The packet payload is identical
      to the previous payload, but the base sequence number is one
      higher, resulting in an inconsistent retransmission.

      Neither trace exhibits checksum errors.

   Trace file demonstrating correct behavior
      (Omitted, as presumably correct behavior is obvious.)

   References
      None known.

   How to detect
      This problem unfortunately can be very difficult to detect, since
      available experience indicates it is quite rare that it is
      manifested. No "trigger" has been identified that can be used to
      reproduce the problem.

   How to fix
      In the absence of a known "trigger", we cannot always assess how
      to fix the problem.

      In one implementation (not the one illustrated above), the
      problem manifested itself when (1) the sender received a zero
      window and stalled; (2) eventually an ACK arrived that offered a
      window larger than that in effect at the time of the stall; (3)
      the sender transmitted out of the buffer of data it held at the
      time of the stall, but (4) failed to limit this transfer to the
      buffer length, instead using the newly advertised (and larger)
      offered window. Consequently, in addition to the valid buffer
      contents, it sent whatever garbage values followed the end of the
      buffer. If it then retransmitted the corresponding sequence
      numbers, at that point it sent the correct data, resulting in an
      inconsistent retransmission. Note that this instance of the
      problem reflects a more general problem, that of initially
      transmitting incorrect data.

2.5.

   Name of Problem
      Failure to retain above-sequence data

   Classification
      Congestion control, performance

Paxson, et. al.              Informational                     [Page 13]

   Description
      When a TCP receives an "above sequence" segment, meaning one with
      a sequence number exceeding RCV.NXT but below RCV.NXT+RCV.WND, it
      SHOULD queue the segment for later delivery (RFC 1122, 4.2.2.20).
      (See RFC 793 for the definition of RCV.NXT and RCV.WND.)  A TCP
      that fails to do so is said to exhibit "Failure to retain above-
      sequence data".

      It may sometimes be appropriate for a TCP to discard above-
      sequence data to reclaim memory.  If they do so only rarely, then
      we would not consider them to exhibit this problem.  Instead, the
      particular concern is with TCPs that always discard above-sequence
      data.
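
      The queueing behavior RFC 1122 recommends can be sketched as a toy
      model; the Receiver class below and its field names are
      hypothetical, and segments are reduced to (sequence, length)
      pairs:

```python
# Sketch of a receiver that retains above-sequence data per RFC 1122,
# 4.2.2.20.  A simplified model, not any particular stack's code.

class Receiver:
    def __init__(self, rcv_nxt):
        self.rcv_nxt = rcv_nxt   # RCV.NXT: next in-order byte expected
        self.ooo = {}            # retained above-sequence segments, by seq

    def segment_arrives(self, seq, length):
        """Process a segment; return the cumulative ACK to send."""
        if seq == self.rcv_nxt:
            self.rcv_nxt += length
            # Filling a hole may make queued segments in-order as well.
            while self.rcv_nxt in self.ooo:
                self.rcv_nxt += self.ooo.pop(self.rcv_nxt)
        elif seq > self.rcv_nxt:
            self.ooo[seq] = length   # retain rather than discard
        return self.rcv_nxt

# Using the (relative) sequence numbers from the trace in this section:
r = Receiver(rcv_nxt=222686)
r.segment_arrives(223222, 536)        # above-sequence: queued
r.segment_arrives(223758, 536)        # above-sequence: queued
ack = r.segment_arrives(222686, 536)  # retransmission fills the hole
# ack == 224294: one retransmission recovers all three segments
```

      A receiver exhibiting the problem would instead drop the two
      queued segments, forcing the sender to retransmit each one.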

   Significance
      In environments prone to packet loss, detrimental to the
      performance of both other connections and the connection itself.

   Implications
      In times of congestion, a failure to retain above-sequence data
      will lead to numerous otherwise-unnecessary retransmissions,
      aggravating the congestion and potentially reducing performance by
      a large factor.

   Relevant RFCs
      RFC 1122 revises RFC 793 by upgrading the latter's MAY to a SHOULD
      on this issue.

   Trace file demonstrating it
      Made using tcpdump recording at the receiving TCP.  No losses
      reported by the packet filter.

      B is the TCP sender, A the receiver.  A exhibits failure to retain
      above-sequence data:

   10:38:10.164860 B > A: . 221078:221614(536) ack 1 win 33232 [tos 0x8]
   10:38:10.170809 B > A: . 221614:222150(536) ack 1 win 33232 [tos 0x8]
   10:38:10.177183 B > A: . 222150:222686(536) ack 1 win 33232 [tos 0x8]
   10:38:10.225039 A > B: . ack 222686 win 25800

      Here B has sent up to (relative) sequence 222686 in-sequence, and
      A accordingly acknowledges.

   10:38:10.268131 B > A: . 223222:223758(536) ack 1 win 33232 [tos 0x8]
   10:38:10.337995 B > A: . 223758:224294(536) ack 1 win 33232 [tos 0x8]
   10:38:10.344065 B > A: . 224294:224830(536) ack 1 win 33232 [tos 0x8]
   10:38:10.350169 B > A: . 224830:225366(536) ack 1 win 33232 [tos 0x8]
   10:38:10.356362 B > A: . 225366:225902(536) ack 1 win 33232 [tos 0x8]
   10:38:10.362445 B > A: . 225902:226438(536) ack 1 win 33232 [tos 0x8]
   10:38:10.368579 B > A: . 226438:226974(536) ack 1 win 33232 [tos 0x8]
   10:38:10.374732 B > A: . 226974:227510(536) ack 1 win 33232 [tos 0x8]
   10:38:10.380825 B > A: . 227510:228046(536) ack 1 win 33232 [tos 0x8]
   10:38:10.387027 B > A: . 228046:228582(536) ack 1 win 33232 [tos 0x8]
   10:38:10.393053 B > A: . 228582:229118(536) ack 1 win 33232 [tos 0x8]
   10:38:10.399193 B > A: . 229118:229654(536) ack 1 win 33232 [tos 0x8]
   10:38:10.405356 B > A: . 229654:230190(536) ack 1 win 33232 [tos 0x8]

      A now receives 13 additional packets from B.  These are above-
      sequence because 222686:223222 was dropped.  The packets do
      however fit within the offered window of 25800.  A does not
      generate any duplicate ACKs for them.

      The trace contributor (V. Paxson) verified that these 13 packets
      had valid IP and TCP checksums.

   10:38:11.917728 B > A: . 222686:223222(536) ack 1 win 33232 [tos 0x8]
   10:38:11.930925 A > B: . ack 223222 win 32232

      B times out for 222686:223222 and retransmits it.  Upon receiving
      it, A only acknowledges 223222.  Had it retained the valid above-
      sequence packets, it would instead have ack'd 230190.

   10:38:12.048438 B > A: . 223222:223758(536) ack 1 win 33232 [tos 0x8]
   10:38:12.054397 B > A: . 223758:224294(536) ack 1 win 33232 [tos 0x8]
   10:38:12.068029 A > B: . ack 224294 win 31696

      B retransmits two more packets, and A only acknowledges them.
      This pattern continues as B retransmits the entire set of
      previously-received packets.

      A second trace confirmed that the problem is repeatable.

   Trace file demonstrating correct behavior
      Made using tcpdump recording at the receiving TCP (C).  No losses
      reported by the packet filter.

   09:11:25.790417 D > C: . 33793:34305(512) ack 1 win 61440
   09:11:25.791393 D > C: . 34305:34817(512) ack 1 win 61440
   09:11:25.792369 D > C: . 34817:35329(512) ack 1 win 61440
   09:11:25.792369 D > C: . 35329:35841(512) ack 1 win 61440
   09:11:25.793345 D > C: . 36353:36865(512) ack 1 win 61440
   09:11:25.794321 C > D: . ack 35841 win 59904

      A sequence hole occurs because 35841:36353 has been dropped.

   09:11:25.794321 D > C: . 36865:37377(512) ack 1 win 61440
   09:11:25.794321 C > D: . ack 35841 win 59904
   09:11:25.795297 D > C: . 37377:37889(512) ack 1 win 61440
   09:11:25.795297 C > D: . ack 35841 win 59904
   09:11:25.796273 C > D: . ack 35841 win 61440
   09:11:25.798225 D > C: . 37889:38401(512) ack 1 win 61440
   09:11:25.799201 C > D: . ack 35841 win 61440
   09:11:25.807009 D > C: . 38401:38913(512) ack 1 win 61440
   09:11:25.807009 C > D: . ack 35841 win 61440
   (many additional lines omitted)
   09:11:25.884113 D > C: . 52737:53249(512) ack 1 win 61440
   09:11:25.884113 C > D: . ack 35841 win 61440

      Each additional, above-sequence packet C receives from D elicits a
      duplicate ACK for 35841.

   09:11:25.887041 D > C: . 35841:36353(512) ack 1 win 61440
   09:11:25.887041 C > D: . ack 53249 win 44032

      D retransmits 35841:36353 and C acknowledges receipt of data all
      the way up to 53249.

   References
      This problem is documented in [Paxson97].

   How to detect
      Packet loss is common enough in the Internet that generally it is
      not difficult to find an Internet path that will result in some
      above-sequence packets arriving.  A TCP that exhibits "Failure to
      retain ..." may not generate duplicate ACKs for these packets.
      However, some TCPs that do retain above-sequence data also do not
      generate duplicate ACKs, so failure to do so does not definitively
      identify the problem.  Instead, the key observation is whether
      upon retransmission of the dropped packet, data that was
      previously above-sequence is acknowledged.

      Two considerations in detecting this problem using a packet trace
      are that it is easiest to do so with a trace made at the TCP
      receiver, in order to unambiguously determine which packets
      arrived successfully, and that such packets may still be correctly
      discarded if they arrive with checksum errors.  The latter can be
      tested by capturing the entire packet contents and performing the
      IP and TCP checksum algorithms to verify their integrity; or by
      confirming that the packets arrive with the same checksum and
      contents as that with which they were sent, with a presumption
      that the sending TCP correctly calculates checksums for the
      packets it transmits.
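
      The checksum verification step can be sketched with the ones-
      complement sum that both IP and TCP use (RFC 1071); this minimal
      version omits the TCP pseudo-header, which a full verifier must
      also include:

```python
# Ones-complement Internet checksum (RFC 1071), as one might apply it
# to captured packets.  Minimal sketch: a real TCP verifier must also
# sum the pseudo-header (source/destination address, protocol, length).

def internet_checksum(data: bytes) -> int:
    if len(data) % 2:
        data += b"\x00"                # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]      # 16-bit words
        total = (total & 0xFFFF) + (total >> 16)   # fold the carry
    return ~total & 0xFFFF

# Example bytes from RFC 1071.  Summing over the data together with
# its own checksum field yields zero, the usual verification test.
words = bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7])
csum = internet_checksum(words)               # 0x220D
ok = internet_checksum(words + bytes([csum >> 8, csum & 0xFF])) == 0
```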

      It is considerably easier to verify that an implementation does
      NOT exhibit this problem.  This can be done by recording a trace
      at the data sender, and observing that sometimes after a
      retransmission the receiver acknowledges a higher sequence number
      than just that which was retransmitted.

   How to fix
      If the root problem is that the implementation lacks buffer, then
      unfortunately this requires significant work to fix.  However,
      doing so is important, for reasons outlined above.

2.6.

   Name of Problem
      Extra additive constant in congestion avoidance

   Classification
      Congestion control / performance

   Description
      RFC 1122 section 4.2.2.15 states that TCP MUST implement
      Jacobson's "congestion avoidance" algorithm [Jacobson88], which
      calls for increasing the congestion window, cwnd, by:

         MSS * MSS / cwnd

      for each ACK received for new data [RFC2001].  This has the effect
      of increasing cwnd by approximately one segment in each round trip
      time.

      Some TCP implementations add an additional fraction of a segment
      (typically MSS/8) to cwnd for each ACK received for new data
      [Stevens94, Wright95]:

         (MSS * MSS / cwnd) + MSS/8

      These implementations exhibit "Extra additive constant in
      congestion avoidance".
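
      The difference between the two increase rules can be sketched
      numerically.  This is a toy model, not any stack's code: cwnd is
      held in bytes, every segment is full-sized and individually
      ACKed, and an MSS of 4312 bytes (the value in the trace in this
      section) is used for concreteness:

```python
# Per-ACK congestion-avoidance increase with and without the spurious
# MSS/8 term.  Toy model: cwnd in bytes, one ACK per full-sized
# segment, integer arithmetic as a kernel would use.

MSS = 4312   # for concreteness

def correct_increase(cwnd):
    return cwnd + MSS * MSS // cwnd              # RFC 1122 / [Jacobson88]

def buggy_increase(cwnd):
    return cwnd + MSS * MSS // cwnd + MSS // 8   # extra additive constant

cwnd_ok = cwnd_bad = 31 * MSS                    # window of 31 packets
for _ in range(31):                              # ACKs for one round trip
    cwnd_ok = correct_increase(cwnd_ok)
    cwnd_bad = buggy_increase(cwnd_bad)

# After one RTT the correct rule has added about one segment, while
# the buggy rule has added nearly five (one, plus 31 * MSS/8).
```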

   Significance
      May be detrimental to performance even in completely uncongested
      environments (see Implications).

      In congested environments, may also be detrimental to the
      performance of other connections.

   Implications
      The extra additive term allows a TCP to more aggressively open its
      congestion window (quadratic rather than linear increase).  For
      congested networks, this can increase the loss rate experienced by
      all connections sharing a bottleneck with the aggressive TCP.

      However, even for completely uncongested networks, the extra
      additive term can lead to diminished performance, as follows.  In
      congestion avoidance, a TCP sender probes the network path to
      determine its available capacity, which often equates to the
      number of buffers available at a bottleneck link.  With linear
      congestion avoidance, the TCP only probes for sufficient capacity
      (buffer) to hold one extra packet per RTT.

      Thus, when it exceeds the available capacity, generally only one
      packet will be lost (since on the previous RTT it already found
      that the path could sustain a window with one less packet in
      flight).  If the congestion window is sufficiently large, then the
      TCP will recover from this single loss using fast retransmission
      and avoid an expensive (in terms of performance) retransmission
      timeout.

      However, when the additional additive term is used, then cwnd can
      increase by more than one packet per RTT, in which case the TCP
      probes more aggressively.  If in the previous RTT it had reached
      the available capacity of the path, then the excess due to the
      extra increase will again be lost, but now this will result in
      multiple losses from the flight instead of a single loss.  TCPs
      that do not utilize SACK [RFC2018] generally will not recover from
      multiple losses without incurring a retransmission timeout
      [Fall96,Hoe96], significantly diminishing performance.

   Relevant RFCs
      RFC 1122 requires use of the "congestion avoidance" algorithm.
      RFC 2001 outlines the fast retransmit/fast recovery algorithms.
      RFC 2018 discusses the SACK option.

   Trace file demonstrating it
      Recorded using tcpdump running on the same FDDI LAN as host A.
      Host A is the sender and host B is the receiver.  The connection
      establishment specified an MSS of 4,312 bytes and a window scale
      factor of 4.  We omit the establishment and the first 2.5 MB of
      data transfer, as the problem is best demonstrated when the window
      has grown to a large value.  At the beginning of the trace
      excerpt, the congestion window is 31 packets.  The connection is
      never receiver-window limited, so we omit window advertisements
      from the trace for clarity.

   11:42:07.697951 B > A: . ack 2383006
   11:42:07.699388 A > B: . 2508054:2512366(4312)
   11:42:07.699962 A > B: . 2512366:2516678(4312)
   11:42:07.700012 B > A: . ack 2391630
   11:42:07.701081 A > B: . 2516678:2520990(4312)
   11:42:07.701656 A > B: . 2520990:2525302(4312)
   11:42:07.701739 B > A: . ack 2400254
   11:42:07.702685 A > B: . 2525302:2529614(4312)
   11:42:07.703257 A > B: . 2529614:2533926(4312)
   11:42:07.703295 B > A: . ack 2408878
   11:42:07.704414 A > B: . 2533926:2538238(4312)
   11:42:07.704989 A > B: . 2538238:2542550(4312)
   11:42:07.705040 B > A: . ack 2417502
   11:42:07.705935 A > B: . 2542550:2546862(4312)
   11:42:07.706506 A > B: . 2546862:2551174(4312)
   11:42:07.706544 B > A: . ack 2426126
   11:42:07.707480 A > B: . 2551174:2555486(4312)
   11:42:07.708051 A > B: . 2555486:2559798(4312)
   11:42:07.708088 B > A: . ack 2434750
   11:42:07.709030 A > B: . 2559798:2564110(4312)
   11:42:07.709604 A > B: . 2564110:2568422(4312)
   11:42:07.710175 A > B: . 2568422:2572734(4312) *

   11:42:07.710215 B > A: . ack 2443374
   11:42:07.710799 A > B: . 2572734:2577046(4312)
   11:42:07.711368 A > B: . 2577046:2581358(4312)
   11:42:07.711405 B > A: . ack 2451998
   11:42:07.712323 A > B: . 2581358:2585670(4312)
   11:42:07.712898 A > B: . 2585670:2589982(4312)
   11:42:07.712938 B > A: . ack 2460622
   11:42:07.713926 A > B: . 2589982:2594294(4312)
   11:42:07.714501 A > B: . 2594294:2598606(4312)
   11:42:07.714547 B > A: . ack 2469246
   11:42:07.715747 A > B: . 2598606:2602918(4312)
   11:42:07.716287 A > B: . 2602918:2607230(4312)
   11:42:07.716328 B > A: . ack 2477870
   11:42:07.717146 A > B: . 2607230:2611542(4312)
   11:42:07.717717 A > B: . 2611542:2615854(4312)
   11:42:07.717762 B > A: . ack 2486494
   11:42:07.718754 A > B: . 2615854:2620166(4312)
   11:42:07.719331 A > B: . 2620166:2624478(4312)
   11:42:07.719906 A > B: . 2624478:2628790(4312) **

   11:42:07.719958 B > A: . ack 2495118
   11:42:07.720500 A > B: . 2628790:2633102(4312)
   11:42:07.721080 A > B: . 2633102:2637414(4312)
   11:42:07.721739 B > A: . ack 2503742
   11:42:07.722348 A > B: . 2637414:2641726(4312)
   11:42:07.722918 A > B: . 2641726:2646038(4312)
   11:42:07.769248 B > A: . ack 2512366

      The receiver's acknowledgment policy is one ACK per two packets
      received.  Thus, for each ACK arriving at host A, two new packets
      are sent, except when cwnd increases due to congestion avoidance,
      in which case three new packets are sent.

      With an ack-every-two-packets policy, cwnd should only increase
      one MSS per 2 RTT.  However, at the point marked "*" the window
      increases after 7 ACKs have arrived, and then again at "**" after
      6 more ACKs.

      While we do not have space to show the effect, this trace suffered
      from repeated timeout retransmissions due to multiple packet
      losses during a single RTT.

   Trace file demonstrating correct behavior
      Made using the same host and tracing setup as above, except now
      A's TCP has been modified to remove the MSS/8 additive constant.
      Tcpdump reported 77 packet drops; the excerpt below is fully
      self-consistent so it is unlikely that any of these occurred
      during the excerpt.

      We again begin when cwnd is 31 packets (this occurs significantly
      later in the trace, because the congestion avoidance is now less
      aggressive with opening the window).

   14:22:21.236757 B > A: . ack 5194679
   14:22:21.238192 A > B: . 5319727:5324039(4312)
   14:22:21.238770 A > B: . 5324039:5328351(4312)
   14:22:21.238821 B > A: . ack 5203303
   14:22:21.240158 A > B: . 5328351:5332663(4312)
   14:22:21.240738 A > B: . 5332663:5336975(4312)
   14:22:21.270422 B > A: . ack 5211927
   14:22:21.271883 A > B: . 5336975:5341287(4312)
   14:22:21.272458 A > B: . 5341287:5345599(4312)
   14:22:21.279099 B > A: . ack 5220551
   14:22:21.280539 A > B: . 5345599:5349911(4312)
   14:22:21.281118 A > B: . 5349911:5354223(4312)
   14:22:21.281183 B > A: . ack 5229175
   14:22:21.282348 A > B: . 5354223:5358535(4312)
   14:22:21.283029 A > B: . 5358535:5362847(4312)
   14:22:21.283089 B > A: . ack 5237799
   14:22:21.284213 A > B: . 5362847:5367159(4312)
   14:22:21.284779 A > B: . 5367159:5371471(4312)
   14:22:21.285976 B > A: . ack 5246423
   14:22:21.287465 A > B: . 5371471:5375783(4312)

   14:22:21.288036 A > B: . 5375783:5380095(4312)
   14:22:21.288073 B > A: . ack 5255047
   14:22:21.289155 A > B: . 5380095:5384407(4312)
   14:22:21.289725 A > B: . 5384407:5388719(4312)
   14:22:21.289762 B > A: . ack 5263671
   14:22:21.291090 A > B: . 5388719:5393031(4312)
   14:22:21.291662 A > B: . 5393031:5397343(4312)
   14:22:21.291701 B > A: . ack 5272295
   14:22:21.292870 A > B: . 5397343:5401655(4312)
   14:22:21.293441 A > B: . 5401655:5405967(4312)
   14:22:21.293481 B > A: . ack 5280919
   14:22:21.294476 A > B: . 5405967:5410279(4312)
   14:22:21.295053 A > B: . 5410279:5414591(4312)
   14:22:21.295106 B > A: . ack 5289543
   14:22:21.296306 A > B: . 5414591:5418903(4312)
   14:22:21.296878 A > B: . 5418903:5423215(4312)
   14:22:21.296917 B > A: . ack 5298167
   14:22:21.297716 A > B: . 5423215:5427527(4312)
   14:22:21.298285 A > B: . 5427527:5431839(4312)
   14:22:21.298324 B > A: . ack 5306791
   14:22:21.299413 A > B: . 5431839:5436151(4312)
   14:22:21.299986 A > B: . 5436151:5440463(4312)
   14:22:21.303696 B > A: . ack 5315415
   14:22:21.305177 A > B: . 5440463:5444775(4312)
   14:22:21.305755 A > B: . 5444775:5449087(4312)
   14:22:21.308032 B > A: . ack 5324039
   14:22:21.309525 A > B: . 5449087:5453399(4312)
   14:22:21.310101 A > B: . 5453399:5457711(4312)
   14:22:21.310144 B > A: . ack 5332663 ***

   14:22:21.311615 A > B: . 5457711:5462023(4312)
   14:22:21.312198 A > B: . 5462023:5466335(4312)
   14:22:21.341876 B > A: . ack 5341287
   14:22:21.343451 A > B: . 5466335:5470647(4312)
   14:22:21.343985 A > B: . 5470647:5474959(4312)
   14:22:21.350304 B > A: . ack 5349911
   14:22:21.351852 A > B: . 5474959:5479271(4312)
   14:22:21.352430 A > B: . 5479271:5483583(4312)
   14:22:21.352484 B > A: . ack 5358535
   14:22:21.353574 A > B: . 5483583:5487895(4312)
   14:22:21.354149 A > B: . 5487895:5492207(4312)
   14:22:21.354205 B > A: . ack 5367159
   14:22:21.355467 A > B: . 5492207:5496519(4312)
   14:22:21.356039 A > B: . 5496519:5500831(4312)
   14:22:21.357361 B > A: . ack 5375783
   14:22:21.358855 A > B: . 5500831:5505143(4312)
   14:22:21.359424 A > B: . 5505143:5509455(4312)
   14:22:21.359465 B > A: . ack 5384407
   14:22:21.360605 A > B: . 5509455:5513767(4312)
   14:22:21.361181 A > B: . 5513767:5518079(4312)
   14:22:21.361225 B > A: . ack 5393031
   14:22:21.362485 A > B: . 5518079:5522391(4312)
   14:22:21.363057 A > B: . 5522391:5526703(4312)
   14:22:21.363096 B > A: . ack 5401655
   14:22:21.364236 A > B: . 5526703:5531015(4312)
   14:22:21.364810 A > B: . 5531015:5535327(4312)
   14:22:21.364867 B > A: . ack 5410279
   14:22:21.365819 A > B: . 5535327:5539639(4312)
   14:22:21.366386 A > B: . 5539639:5543951(4312)
   14:22:21.366427 B > A: . ack 5418903
   14:22:21.367586 A > B: . 5543951:5548263(4312)
   14:22:21.368158 A > B: . 5548263:5552575(4312)
   14:22:21.368199 B > A: . ack 5427527
   14:22:21.369189 A > B: . 5552575:5556887(4312)
   14:22:21.369758 A > B: . 5556887:5561199(4312)
   14:22:21.369803 B > A: . ack 5436151
   14:22:21.370814 A > B: . 5561199:5565511(4312)
   14:22:21.371398 A > B: . 5565511:5569823(4312)
   14:22:21.375159 B > A: . ack 5444775
   14:22:21.376658 A > B: . 5569823:5574135(4312)
   14:22:21.377235 A > B: . 5574135:5578447(4312)
   14:22:21.379303 B > A: . ack 5453399
   14:22:21.380802 A > B: . 5578447:5582759(4312)
   14:22:21.381377 A > B: . 5582759:5587071(4312)
   14:22:21.381947 A > B: . 5587071:5591383(4312) ****

      "***" marks the end of the first round trip.  Note that cwnd did
      not increase (as evidenced by each ACK eliciting two new data
      packets).  Only at "****", which comes near the end of the second
      round trip, does cwnd increase by one packet.

      This trace did not suffer any timeout retransmissions.  It
      transferred the same amount of data as the first trace in about
      half as much time.  This difference is repeatable between hosts A
      and B.

   References
      [Stevens94] and [Wright95] discuss this problem.  The problem of
      Reno TCP failing to recover from multiple losses except via a
      retransmission timeout is discussed in [Fall96,Hoe96].

   How to detect
      If source code is available, that is generally the easiest way to
      detect this problem.  Search for each modification to the cwnd
      variable; (at least) one of these will be for congestion
      avoidance, and inspection of the related code should immediately
      identify the problem if present.

      The problem can also be detected by closely examining packet
      traces taken near the sender.  During congestion avoidance, cwnd
      will increase by an additional segment upon the receipt of
      (typically) eight acknowledgements without a loss.  This increase
      is in addition to the one segment increase per round trip time (or
      two round trip times if the receiver is using delayed ACKs).

      Furthermore, graphs of the sequence number vs. time, taken from
      packet traces, are normally linear during congestion avoidance.
      When viewing packet traces of transfers from senders exhibiting
      this problem, the graphs appear quadratic instead of linear.

      Finally, the traces will show that, with sufficiently large
      windows, nearly every loss event results in a timeout.

   How to fix
      This problem may be corrected by removing the "+ MSS/8" term from
      the congestion avoidance code that increases cwnd each time an ACK
      of new data is received.

2.7.

   Name of Problem
      Initial RTO too low

   Classification
      Performance

   Description
      When a TCP first begins transmitting data, it lacks the RTT
      measurements necessary to have computed an adaptive retransmission
      timeout (RTO).  RFC 1122, 4.2.3.1, states that a TCP SHOULD
      initialize RTO to 3 seconds.  A TCP that uses a lower value
      exhibits "Initial RTO too low".

   Significance
      In environments with large RTTs (where "large" means any value
      larger than the initial RTO), TCPs will experience very poor
      performance.

   Implications
      Whenever RTO < RTT, very poor performance can result as packets
      are unnecessarily retransmitted (because RTO will expire before an
      ACK for the packet can arrive) and the connection enters slow
      start and congestion avoidance.  Generally, the algorithms for
      computing RTO avoid this problem by adding a positive term to the
      estimated RTT.  However, when a connection first begins it must
      use some estimate for RTO, and if it picks a value less than RTT,
      the above problems will arise.

      Furthermore, when the initial RTO < RTT, it can take a long time
      for the TCP to correct the problem by adapting the RTT estimate,
      because the use of Karn's algorithm (mandated by RFC 1122,
      4.2.3.1) will discard many of the candidate RTT measurements made
      after the first timeout, since they will be measurements of
      retransmitted segments.
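
      This interaction can be sketched in a toy model; the numbers (a
      650 msec path RTT, a crude 2*RTT adaptive estimate) are
      hypothetical, chosen only to mirror the traces in this section:

```python
# Toy model of why a too-low initial RTO is slow to self-correct:
# Karn's algorithm takes no RTT sample from a retransmitted segment,
# so the bad estimate is never repaired.  Hypothetical numbers; not
# any particular stack's code.

def spurious_retransmissions(initial_rto, rtt=0.65, packets=10):
    rto_estimate = initial_rto
    spurious = 0
    for _ in range(packets):
        rto = rto_estimate
        retransmitted = False
        while rto < rtt:        # timer expires before the ACK arrives
            spurious += 1
            retransmitted = True
            rto *= 2            # exponential backoff for this segment
        if not retransmitted:
            # Karn's algorithm permits a sample only from segments sent
            # exactly once; a clean sample repairs the estimate.
            rto_estimate = 2 * rtt
    return spurious

too_low = spurious_retransmissions(initial_rto=0.2)  # 2 per segment
correct = spurious_retransmissions(initial_rto=3.0)  # none
```

      With the too-low initial RTO every segment is retransmitted, so a
      clean sample never arrives and the cycle repeats indefinitely, as
      in the trace below.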

   Relevant RFCs
      RFC 1122 states that TCPs SHOULD initialize RTO to 3 seconds and
      MUST implement Karn's algorithm.

   Trace file demonstrating it
      The following trace file was taken using tcpdump at host A, the
      data sender.  The advertised window and SYN options have been
      omitted for clarity.

   07:52:39.870301 A > B: S 2786333696:2786333696(0)
   07:52:40.548170 B > A: S 130240000:130240000(0) ack 2786333697
   07:52:40.561287 A > B: P 1:513(512) ack 1
   07:52:40.753466 A > B: . 1:513(512) ack 1
   07:52:41.133687 A > B: . 1:513(512) ack 1
   07:52:41.458529 B > A: . ack 513
   07:52:41.458686 A > B: . 513:1025(512) ack 1
   07:52:41.458797 A > B: P 1025:1537(512) ack 1
   07:52:41.541633 B > A: . ack 513
   07:52:41.703732 A > B: . 513:1025(512) ack 1
   07:52:42.044875 B > A: . ack 513
   07:52:42.173728 A > B: . 513:1025(512) ack 1
   07:52:42.330861 B > A: . ack 1537
   07:52:42.331129 A > B: . 1537:2049(512) ack 1
   07:52:42.331262 A > B: P 2049:2561(512) ack 1
   07:52:42.623673 A > B: . 1537:2049(512) ack 1
   07:52:42.683203 B > A: . ack 1537
   07:52:43.044029 B > A: . ack 1537
   07:52:43.193812 A > B: . 1537:2049(512) ack 1

      Note from the SYN/SYN-ACK exchange that the RTT is over 600 msec.
      However, from the elapsed time between the third and fourth lines
      (the first packet being sent and then retransmitted), it is
      apparent the RTO was initialized to under 200 msec.  The next line
      shows that this value has doubled to 400 msec (correct exponential
      backoff of RTO), but that still does not suffice to avoid an
      unnecessary retransmission.

      Finally, an ACK from B arrives for the first segment.  Later two
      more duplicate ACKs for 513 arrive, indicating that both the
      original and the two retransmissions arrived at B.  (Indeed, a
      concurrent trace at B showed that no packets were lost during the
      entire connection.)  This ACK opens the congestion window to two
      packets, which are sent back-to-back, but at 07:52:41.703732 RTO
      again expires after a little over 200 msec, leading to an
      unnecessary retransmission, and the pattern repeats.  By the end
      of the trace excerpt above, 1536 bytes have been successfully
      transmitted from A to B, over an interval of more than 2 seconds,
      reflecting terrible performance.

   Trace file demonstrating correct behavior
      The following trace file was taken using tcpdump at host C, the
      data sender.  The advertised window and SYN options have been
      omitted for clarity.

   17:30:32.090299 C > D: S 2031744000:2031744000(0)
   17:30:32.900325 D > C: S 262737964:262737964(0) ack 2031744001
   17:30:32.900326 C > D: . ack 1
   17:30:32.910326 C > D: . 1:513(512) ack 1
   17:30:34.150355 D > C: . ack 513
   17:30:34.150356 C > D: . 513:1025(512) ack 1
   17:30:34.150357 C > D: . 1025:1537(512) ack 1
   17:30:35.170384 D > C: . ack 1025
   17:30:35.170385 C > D: . 1537:2049(512) ack 1
   17:30:35.170386 C > D: . 2049:2561(512) ack 1
   17:30:35.320385 D > C: . ack 1537
   17:30:35.320386 C > D: . 2561:3073(512) ack 1
   17:30:35.320387 C > D: . 3073:3585(512) ack 1
   17:30:35.730384 D > C: . ack 2049

      The initial SYN/SYN-ACK exchange shows that RTT is more than 800
      msec, and for some subsequent packets it rises above 1 second, but
      C's retransmit timer does not ever expire.

   References
      This problem is documented in [Paxson97].

   How to detect
      This problem is readily detected by inspecting a packet trace of
      the startup of a TCP connection made over a long-delay path.  It
      can be diagnosed from either a sender-side or receiver-side trace.
      Long-delay paths can often be found by locating remote sites on
      other continents.

   How to fix
      As this problem arises from a faulty initialization, one hopes
      fixing it requires a one-line change to the TCP source code.
|
|||
|
|
|
|||
|
|
2.8.
|
|||
|
|
|
|||
|
|
Name of Problem
|
|||
|
|
Failure of window deflation after loss recovery
|
|||
|
|
|
|||
|
|
Classification
|
|||
|
|
Congestion control / performance
|
|||
|
|
|
|||
|
|
   Description

      The fast recovery algorithm allows TCP senders to continue to
      transmit new segments during loss recovery.  First, fast
      retransmission is initiated after a TCP sender receives three
      duplicate ACKs.  At this point, a retransmission is sent and cwnd
      is halved.  The fast recovery algorithm then allows additional
      segments to be sent when sufficient additional duplicate ACKs
      arrive.  Some implementations of fast recovery compute when to
      send additional segments by artificially incrementing cwnd, first
      by three segments to account for the three duplicate ACKs that
      triggered fast retransmission, and subsequently by 1 MSS for each
      new duplicate ACK that arrives.  When cwnd allows, the sender
      transmits new data segments.

      When an ACK arrives that covers new data, cwnd is to be reduced by
      the amount by which it was artificially increased.  However, some
      TCP implementations fail to "deflate" the window, causing an
      inappropriate amount of data to be sent into the network after
      recovery.  One cause of this problem is the "header prediction"
      code, which is used to handle incoming segments that require
      little work.  In some implementations of TCP, the header
      prediction code does not check to make sure cwnd has not been
      artificially inflated, and therefore does not reduce the
      artificially increased cwnd when appropriate.
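      The inflation/deflation arithmetic can be sketched as a small
      simulation (a minimal model in Python, counting cwnd in whole
      segments; the function names are illustrative and not taken from
      any implementation):

```python
def enter_fast_recovery(cwnd):
    """Third duplicate ACK: halve cwnd, then artificially inflate the
    halved value by the three duplicate ACKs already received."""
    ssthresh = max(cwnd // 2, 2)   # deflated target
    return ssthresh, ssthresh + 3  # (target, inflated cwnd)

def on_dup_ack(cwnd):
    """Each additional duplicate ACK inflates cwnd by one segment."""
    return cwnd + 1

def on_recovery_ack(ssthresh):
    """ACK covering new data: deflate cwnd back to the halved value.
    The buggy implementations skip this step in header prediction."""
    return ssthresh

# With cwnd at 7 segments when loss is detected (as in the trace
# below): halved to 3, inflated to 6, then to 8 after two more
# duplicate ACKs, and finally deflated back to 3.
ssthresh, cwnd = enter_fast_recovery(7)
cwnd = on_dup_ack(on_dup_ack(cwnd))
cwnd = on_recovery_ack(ssthresh)
```

      A sender that omits the final deflation step keeps cwnd at the
      inflated value, producing the burst discussed below.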

   Significance

      TCP senders that exhibit this problem will transmit a burst of
      data immediately after recovery, which can degrade performance, as
      well as network stability.  Effectively, the sender does not
      reduce the size of cwnd as much as it should (to half its value
      when loss was detected), if at all.  This can harm the performance
      of the TCP connection itself, as well as competing TCP flows.

   Implications

      A TCP sender exhibiting this problem does not reduce cwnd
      appropriately in times of congestion, and therefore may contribute
      to congestive collapse.

   Relevant RFCs

      RFC 2001 outlines the fast retransmit/fast recovery algorithms.
      [Brakmo95] outlines this implementation problem and offers a fix.

   Trace file demonstrating it

      The following trace file was taken using tcpdump at host A, the
      data sender.  The advertised window (which never changed) has been
      omitted for clarity, except for the first packet sent by each
      host.

   08:22:56.825635 A.7505 > B.7505: . 29697:30209(512) ack 1 win 4608
   08:22:57.038794 B.7505 > A.7505: . ack 27649 win 4096
   08:22:57.039279 A.7505 > B.7505: . 30209:30721(512) ack 1
   08:22:57.321876 B.7505 > A.7505: . ack 28161
   08:22:57.322356 A.7505 > B.7505: . 30721:31233(512) ack 1
   08:22:57.347128 B.7505 > A.7505: . ack 28673
   08:22:57.347572 A.7505 > B.7505: . 31233:31745(512) ack 1
   08:22:57.347782 A.7505 > B.7505: . 31745:32257(512) ack 1
   08:22:57.936393 B.7505 > A.7505: . ack 29185
   08:22:57.936864 A.7505 > B.7505: . 32257:32769(512) ack 1
   08:22:57.950802 B.7505 > A.7505: . ack 29697 win 4096
   08:22:57.951246 A.7505 > B.7505: . 32769:33281(512) ack 1
   08:22:58.169422 B.7505 > A.7505: . ack 29697
   08:22:58.638222 B.7505 > A.7505: . ack 29697
   08:22:58.643312 B.7505 > A.7505: . ack 29697
   08:22:58.643669 A.7505 > B.7505: . 29697:30209(512) ack 1
   08:22:58.936436 B.7505 > A.7505: . ack 29697
   08:22:59.002614 B.7505 > A.7505: . ack 29697
   08:22:59.003026 A.7505 > B.7505: . 33281:33793(512) ack 1
   08:22:59.682902 B.7505 > A.7505: . ack 33281
   08:22:59.683391 A.7505 > B.7505: P 33793:34305(512) ack 1
   08:22:59.683748 A.7505 > B.7505: P 34305:34817(512) ack 1 ***
   08:22:59.684043 A.7505 > B.7505: P 34817:35329(512) ack 1
   08:22:59.684266 A.7505 > B.7505: P 35329:35841(512) ack 1
   08:22:59.684567 A.7505 > B.7505: P 35841:36353(512) ack 1
   08:22:59.684810 A.7505 > B.7505: P 36353:36865(512) ack 1
   08:22:59.685094 A.7505 > B.7505: P 36865:37377(512) ack 1
      The first 12 lines of the trace show incoming ACKs clocking out a
      window of data segments.  At this point in the transfer, cwnd is 7
      segments.  The next 4 lines of the trace show 3 duplicate ACKs
      arriving from the receiver, followed by a retransmission from the
      sender.  At this point, cwnd is halved (to 3 segments) and
      artificially incremented by the three duplicate ACKs that have
      arrived, making cwnd 6 segments.  The next two lines show 2 more
      duplicate ACKs arriving, each of which increases cwnd by 1
      segment.  So, after these two duplicate ACKs arrive the cwnd is 8
      segments and the sender has permission to send 1 new segment
      (since there are 7 segments outstanding).  The next line in the
      trace shows this new segment being transmitted.  The next packet
      shown in the trace is an ACK from host B that covers the first 7
      outstanding segments (all but the new segment sent during
      recovery).  This should cause cwnd to be reduced to 3 segments and
      2 segments to be transmitted (since there is already 1 outstanding
      segment in the network).  However, as shown by the last 7 lines of
      the trace, cwnd is not reduced, causing a line-rate burst of 7 new
      segments.

   Trace file demonstrating correct behavior

      The trace would appear identical to the one above, except that it
      would stop after the line marked "***", because at this point host
      A would correctly reduce cwnd after recovery, allowing only 2
      segments to be transmitted, rather than producing a burst of 7
      segments.

   References

      This problem is documented and the performance implications
      analyzed in [Brakmo95].

   How to detect

      Failure of window deflation after loss recovery can be found by
      examining sender-side packet traces recorded during periods of
      moderate loss (so cwnd can grow large enough to allow for fast
      recovery when loss occurs).

   How to fix

      When this bug is caused by incorrect header prediction, the fix is
      to add a predicate to the header prediction test that checks to
      see whether cwnd is inflated; if so, the header prediction test
      fails and the usual ACK processing occurs, which (in this case)
      takes care to deflate the window.  See [Brakmo95] for details.

2.9.

   Name of Problem
      Excessively short keepalive connection timeout
   Classification
      Reliability

   Description

      Keep-alive is a mechanism for checking whether an idle connection
      is still alive.  According to RFC 1122, keepalive should only be
      invoked in server applications that might otherwise hang
      indefinitely and consume resources unnecessarily if a client
      crashes or aborts a connection during a network failure.

      RFC 1122 also specifies that if a keep-alive mechanism is
      implemented it MUST NOT interpret failure to respond to any
      specific probe as a dead connection.  The RFC does not specify a
      particular mechanism for timing out a connection when no response
      is received for keepalive probes.  However, if the mechanism does
      not allow ample time for recovery from network congestion or
      delay, connections may be timed out unnecessarily.

   Significance

      In congested networks, this problem can lead to unwarranted
      termination of connections.

   Implications

      It is possible for the network connection between two peer
      machines to become congested or to exhibit packet loss at the time
      that a keep-alive probe is sent on a connection.  If the keep-
      alive mechanism does not allow sufficient time before dropping
      connections in the face of unacknowledged probes, connections may
      be dropped even when both peers of a connection are still alive.

   Relevant RFCs

      RFC 1122 specifies that the keep-alive mechanism may be provided.
      It does not specify a mechanism for determining dead connections
      when keepalive probes are not acknowledged.

   Trace file demonstrating it

      Made using the Orchestra tool at the peer of the machine using
      keep-alive.  After connection establishment, incoming keep-alives
      were dropped by Orchestra to simulate a dead connection.

   22:11:12.040000 A > B: 22666019:0 win 8192 datasz 4 SYN
   22:11:12.060000 B > A: 2496001:22666020 win 4096 datasz 4 SYN ACK
   22:11:12.130000 A > B: 22666020:2496002 win 8760 datasz 0 ACK
   (more than two hours elapse)
   00:23:00.680000 A > B: 22666019:2496002 win 8760 datasz 1 ACK
   00:23:01.770000 A > B: 22666019:2496002 win 8760 datasz 1 ACK
   00:23:02.870000 A > B: 22666019:2496002 win 8760 datasz 1 ACK
   00:23:03.970000 A > B: 22666019:2496002 win 8760 datasz 1 ACK
   00:23:05.070000 A > B: 22666019:2496002 win 8760 datasz 1 ACK

      The initial three packets are the SYN exchange for connection
      setup.  About two hours later, the keepalive timer fires because
      the connection has been idle.  Keepalive probes are transmitted a
      total of 5 times, with a 1 second spacing between probes, after
      which the connection is dropped.  This is problematic because a 5
      second network outage at the time of the first probe results in
      the connection being killed.

   Trace file demonstrating correct behavior

      Made using the Orchestra tool at the peer of the machine using
      keep-alive.  After connection establishment, incoming keep-alives
      were dropped by Orchestra to simulate a dead connection.

   16:01:52.130000 A > B: 1804412929:0 win 4096 datasz 4 SYN
   16:01:52.360000 B > A: 16512001:1804412930 win 4096 datasz 4 SYN ACK
   16:01:52.410000 A > B: 1804412930:16512002 win 4096 datasz 0 ACK
   (two hours elapse)
   18:01:57.170000 A > B: 1804412929:16512002 win 4096 datasz 0 ACK
   18:03:12.220000 A > B: 1804412929:16512002 win 4096 datasz 0 ACK
   18:04:27.270000 A > B: 1804412929:16512002 win 4096 datasz 0 ACK
   18:05:42.320000 A > B: 1804412929:16512002 win 4096 datasz 0 ACK
   18:06:57.370000 A > B: 1804412929:16512002 win 4096 datasz 0 ACK
   18:08:12.420000 A > B: 1804412929:16512002 win 4096 datasz 0 ACK
   18:09:27.480000 A > B: 1804412929:16512002 win 4096 datasz 0 ACK
   18:10:43.290000 A > B: 1804412929:16512002 win 4096 datasz 0 ACK
   18:11:57.580000 A > B: 1804412929:16512002 win 4096 datasz 0 ACK
   18:13:12.630000 A > B: 1804412929:16512002 win 4096 datasz 0 RST ACK

      In this trace, when the keep-alive timer expires, 9 keepalive
      probes are sent at 75 second intervals.  75 seconds after the last
      probe is sent, a final RST segment is sent indicating that the
      connection has been closed.  This implementation waits about 11
      minutes before timing out the connection, while the first
      implementation shown allows only 5 seconds.

   References

      This problem is documented in [Dawson97].

   How to detect

      For implementations manifesting this problem, it shows up on a
      packet trace after the keepalive timer fires if the peer machine
      receiving the keepalive does not respond.  Usually the keepalive
      timer will fire at least two hours after keepalive is turned on,
      but it may be sooner if the timer value has been configured lower,
      or if the keepalive mechanism violates the specification (see the
      "Insufficient interval between keepalives" problem).  In this
      example, suppressing the response of the peer to keepalive probes
      was accomplished using the Orchestra toolkit, which can be
      configured to drop packets.  It could also have been done by
      creating a connection, turning on keepalive, and disconnecting the
      network connection at the receiver machine.

   How to fix

      This problem can be fixed by using a different method for timing
      out keepalives that allows a longer period of time to elapse
      before dropping the connection.  For example, the algorithm for
      timing out on dropped data could be used.  Another possibility is
      an algorithm such as the one shown in the trace above, which sends
      9 probes at 75 second intervals and then waits an additional 75
      seconds for a response before closing the connection.
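      The difference between the two policies reduces to simple
      arithmetic on the probe schedule.  A minimal sketch (in Python;
      the probe counts and spacings come from the traces above, and the
      function name is illustrative):

```python
def keepalive_grace(probe_count, interval):
    """Seconds of outage tolerated between the first unanswered probe
    and the connection being dropped, assuming the drop occurs one
    interval after the last probe (as in the traces above)."""
    return probe_count * interval

# Problematic implementation: 5 probes, 1 second apart.
short_grace = keepalive_grace(5, 1)    # 5 seconds of tolerated outage
# Conforming implementation: 9 probes, 75 seconds apart.
long_grace = keepalive_grace(9, 75)    # 675 seconds, about 11 minutes
```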

2.10.

   Name of Problem
      Failure to back off retransmission timeout

   Classification
      Congestion control / reliability

   Description

      The retransmission timeout is used to determine when a packet has
      been dropped in the network.  When this timeout has expired
      without the arrival of an ACK, the segment is retransmitted.  Each
      time a segment is retransmitted, the timeout is adjusted according
      to an exponential backoff algorithm, doubling each time.  If a TCP
      fails to receive an ACK after numerous attempts at retransmitting
      the same segment, it terminates the connection.  A TCP that fails
      to double its retransmission timeout upon repeated timeouts is
      said to exhibit "Failure to back off retransmission timeout".
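      The conforming backoff schedule can be sketched as follows (a
      minimal model in Python; the 64-second ceiling and 12-attempt
      count match the correct-behavior trace later in this section, and
      are assumptions rather than requirements):

```python
def backoff_schedule(initial_rto, max_rto=64, attempts=12):
    """Successive retransmission timeouts: the interval doubles on
    each retransmission, capped at max_rto.  Illustrative model only."""
    schedule, rto = [], initial_rto
    for _ in range(attempts):
        schedule.append(rto)
        rto = min(rto * 2, max_rto)
    return schedule

# Starting from a 1-second RTO:
#   [1, 2, 4, 8, 16, 32, 64, 64, 64, 64, 64, 64]
# A TCP with this bug instead retransmits at a fixed interval.
intervals = backoff_schedule(1)
```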

   Significance

      Backing off the retransmission timer is a cornerstone of network
      stability in the presence of congestion.  Consequently, this bug
      can have severe adverse effects in congested networks.  It also
      affects TCP reliability in congested networks, as discussed in the
      next section.

   Implications

      It is possible for the network connection between two TCP peers to
      become congested or to exhibit packet loss at the time that a
      retransmission is sent on a connection.  If the retransmission
      mechanism does not allow sufficient time before dropping
      connections in the face of unacknowledged segments, connections
      may be dropped even when, by waiting longer, the connection could
      have continued.

   Relevant RFCs

      RFC 1122 specifies mandatory exponential backoff of the
      retransmission timeout, and the termination of connections after
      some period of time (at least 100 seconds).

   Trace file demonstrating it

      Made using tcpdump on an intermediate host:

   16:51:12.671727 A > B: S 510878852:510878852(0) win 16384
   16:51:12.672479 B > A: S 2392143687:2392143687(0)
                          ack 510878853 win 16384
   16:51:12.672581 A > B: . ack 1 win 16384
   16:51:15.244171 A > B: P 1:3(2) ack 1 win 16384
   16:51:15.244933 B > A: . ack 3 win 17518 (DF)

   <receiving host disconnected>

   16:51:19.381176 A > B: P 3:5(2) ack 1 win 16384
   16:51:20.162016 A > B: P 3:5(2) ack 1 win 16384
   16:51:21.161936 A > B: P 3:5(2) ack 1 win 16384
   16:51:22.161914 A > B: P 3:5(2) ack 1 win 16384
   16:51:23.161914 A > B: P 3:5(2) ack 1 win 16384
   16:51:24.161879 A > B: P 3:5(2) ack 1 win 16384
   16:51:25.161857 A > B: P 3:5(2) ack 1 win 16384
   16:51:26.161836 A > B: P 3:5(2) ack 1 win 16384
   16:51:27.161814 A > B: P 3:5(2) ack 1 win 16384
   16:51:28.161791 A > B: P 3:5(2) ack 1 win 16384
   16:51:29.161769 A > B: P 3:5(2) ack 1 win 16384
   16:51:30.161750 A > B: P 3:5(2) ack 1 win 16384
   16:51:31.161727 A > B: P 3:5(2) ack 1 win 16384

   16:51:32.161701 A > B: R 5:5(0) ack 1 win 16384

      The initial three packets are the SYN exchange for connection
      setup, then a single data packet, to verify that data can be
      transferred.  Then the connection to the destination host was
      disconnected, and more data sent.  Retransmissions occur every
      second for 12 seconds, and then the connection is terminated with
      a RST.  This is problematic because a 12 second pause in
      connectivity could result in the termination of a connection.

   Trace file demonstrating correct behavior

      Again, a tcpdump taken from a third host:

   16:59:05.398301 A > B: S 2503324757:2503324757(0) win 16384
   16:59:05.399673 B > A: S 2492674648:2492674648(0)
                          ack 2503324758 win 16384
   16:59:05.399866 A > B: . ack 1 win 17520
   16:59:06.538107 A > B: P 1:3(2) ack 1 win 17520
   16:59:06.540977 B > A: . ack 3 win 17518 (DF)

   <receiving host disconnected>

   16:59:13.121542 A > B: P 3:5(2) ack 1 win 17520
   16:59:14.010928 A > B: P 3:5(2) ack 1 win 17520
   16:59:16.010979 A > B: P 3:5(2) ack 1 win 17520
   16:59:20.011229 A > B: P 3:5(2) ack 1 win 17520
   16:59:28.011896 A > B: P 3:5(2) ack 1 win 17520
   16:59:44.013200 A > B: P 3:5(2) ack 1 win 17520
   17:00:16.015766 A > B: P 3:5(2) ack 1 win 17520
   17:01:20.021308 A > B: P 3:5(2) ack 1 win 17520
   17:02:24.027752 A > B: P 3:5(2) ack 1 win 17520
   17:03:28.034569 A > B: P 3:5(2) ack 1 win 17520
   17:04:32.041567 A > B: P 3:5(2) ack 1 win 17520
   17:05:36.048264 A > B: P 3:5(2) ack 1 win 17520
   17:06:40.054900 A > B: P 3:5(2) ack 1 win 17520

   17:07:44.061306 A > B: R 5:5(0) ack 1 win 17520

      In this trace, when the retransmission timer expires, 12
      retransmissions are sent at exponentially-increasing intervals,
      until the interval value reaches 64 seconds, at which time the
      interval stops growing.  64 seconds after the last retransmission,
      a final RST segment is sent indicating that the connection has
      been closed.  This implementation waits about 9 minutes before
      timing out the connection, while the first implementation shown
      allows only 12 seconds.

   References

      None known.

   How to detect

      A simple transfer can be easily interrupted by disconnecting the
      receiving host from the network.  tcpdump or another appropriate
      tool should show the retransmissions being sent.  Several trials
      in a low-RTT environment may be required to demonstrate the bug.

   How to fix

      For one of the implementations studied, this problem seemed to be
      the result of an error introduced with the addition of the
      Brakmo-Peterson RTO algorithm [Brakmo95], which can return a value
      of zero where the older Jacobson algorithm always returns a
      positive value.  Brakmo and Peterson specified an additional step
      of min(rtt + 2, RTO) to avoid problems with this.  Unfortunately,
      in the implementation this step was omitted when calculating the
      exponential backoff for the RTO.  This results in an RTO of 0
      seconds being multiplied by the backoff, yielding again zero, and
      then being subjected to a later MAX operation that increases it to
      1 second, regardless of the backoff factor.

      A similar TCP persist failure has the same cause.
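      The failure mode is easy to see in a two-line model (an
      illustrative sketch in Python, not code from the implementation
      in question; only the buggy arithmetic path is modeled):

```python
def backed_off_rto_buggy(base_rto, backoff_factor):
    """Models the broken path: the backoff multiplies a base RTO that
    may be zero, and a later floor of 1 second masks the zero result,
    so the effective timeout never grows."""
    return max(base_rto * backoff_factor, 1)

# With a zero base RTO, every backoff factor yields the same 1-second
# timeout -- the once-per-second retransmissions of the first trace.
buggy = [backed_off_rto_buggy(0, 1 << i) for i in range(5)]  # [1, 1, 1, 1, 1]
# With a nonzero base RTO, the very same code backs off correctly.
sane = [backed_off_rto_buggy(1, 1 << i) for i in range(5)]   # [1, 2, 4, 8, 16]
```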

2.11.

   Name of Problem
      Insufficient interval between keepalives

   Classification
      Reliability

   Description

      Keep-alive is a mechanism for checking whether an idle connection
      is still alive.  According to RFC 1122, keep-alive may be included
      in an implementation.  If it is included, the interval between
      keep-alive packets MUST be configurable, and MUST default to no
      less than two hours.

   Significance

      In congested networks, this problem can lead to unwarranted
      termination of connections.

   Implications

      According to RFC 1122, keep-alive is not required of
      implementations because it could: (1) cause perfectly good
      connections to break during transient Internet failures; (2)
      consume unnecessary bandwidth ("if no one is using the connection,
      who cares if it is still good?"); and (3) cost money for an
      Internet path that charges for packets.  Regarding this last
      point, we note that in addition the presence of dial-on-demand
      links in the route can greatly magnify the cost penalty of excess
      keepalives, potentially forcing a full-time connection on a link
      that would otherwise only be connected a few minutes a day.

      If keepalive is provided, the RFC states that the required
      inter-keepalive distance MUST default to no less than two hours.
      If it does not, the probability of connections breaking increases,
      the bandwidth used due to keepalives increases, and cost increases
      over paths which charge per packet.
   Relevant RFCs

      RFC 1122 specifies that the keep-alive mechanism may be provided.
      It also specifies the two hour minimum for the default interval
      between keepalive probes.

   Trace file demonstrating it

      Made using the Orchestra tool at the peer of the machine using
      keep-alive.  Machine A was configured to use default settings for
      the keepalive timer.

   11:36:32.910000 A > B: 3288354305:0 win 28672 datasz 4 SYN
   11:36:32.930000 B > A: 896001:3288354306 win 4096 datasz 4 SYN ACK
   11:36:32.950000 A > B: 3288354306:896002 win 28672 datasz 0 ACK

   11:50:01.190000 A > B: 3288354305:896002 win 28672 datasz 0 ACK
   11:50:01.210000 B > A: 896002:3288354306 win 4096 datasz 0 ACK

   12:03:29.410000 A > B: 3288354305:896002 win 28672 datasz 0 ACK
   12:03:29.430000 B > A: 896002:3288354306 win 4096 datasz 0 ACK

   12:16:57.630000 A > B: 3288354305:896002 win 28672 datasz 0 ACK
   12:16:57.650000 B > A: 896002:3288354306 win 4096 datasz 0 ACK

   12:30:25.850000 A > B: 3288354305:896002 win 28672 datasz 0 ACK
   12:30:25.870000 B > A: 896002:3288354306 win 4096 datasz 0 ACK

   12:43:54.070000 A > B: 3288354305:896002 win 28672 datasz 0 ACK
   12:43:54.090000 B > A: 896002:3288354306 win 4096 datasz 0 ACK

      The initial three packets are the SYN exchange for connection
      setup.  About 13 minutes later, the keepalive timer fires because
      the connection is idle.  The keepalive is acknowledged, and the
      timer fires again in about 13 more minutes.  This behavior
      continues indefinitely until the connection is closed, and is a
      violation of the specification.

   Trace file demonstrating correct behavior

      Made using the Orchestra tool at the peer of the machine using
      keep-alive.  Machine A was configured to use default settings for
      the keepalive timer.

   17:37:20.500000 A > B: 34155521:0 win 4096 datasz 4 SYN
   17:37:20.520000 B > A: 6272001:34155522 win 4096 datasz 4 SYN ACK
   17:37:20.540000 A > B: 34155522:6272002 win 4096 datasz 0 ACK

   19:37:25.430000 A > B: 34155521:6272002 win 4096 datasz 0 ACK
   19:37:25.450000 B > A: 6272002:34155522 win 4096 datasz 0 ACK
   21:37:30.560000 A > B: 34155521:6272002 win 4096 datasz 0 ACK
   21:37:30.570000 B > A: 6272002:34155522 win 4096 datasz 0 ACK

   23:37:35.580000 A > B: 34155521:6272002 win 4096 datasz 0 ACK
   23:37:35.600000 B > A: 6272002:34155522 win 4096 datasz 0 ACK

   01:37:40.620000 A > B: 34155521:6272002 win 4096 datasz 0 ACK
   01:37:40.640000 B > A: 6272002:34155522 win 4096 datasz 0 ACK

   03:37:45.590000 A > B: 34155521:6272002 win 4096 datasz 0 ACK
   03:37:45.610000 B > A: 6272002:34155522 win 4096 datasz 0 ACK

      The initial three packets are the SYN exchange for connection
      setup.  Just over two hours later, the keepalive timer fires
      because the connection is idle.  The keepalive is acknowledged,
      and the timer fires again just over two hours later.  This
      behavior continues indefinitely until the connection is closed.

   References

      This problem is documented in [Dawson97].

   How to detect

      For implementations manifesting this problem, it shows up on a
      packet trace.  If the connection is left idle, the keepalive
      probes will arrive closer together than the two hour minimum.
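      This check is mechanical enough to automate.  A sketch of such a
      trace post-processor (in Python; timestamps are in seconds, and
      the 7200-second constant is the RFC 1122 default minimum):

```python
def violates_keepalive_minimum(probe_times, minimum=7200):
    """True if any two successive keepalive probes on an idle
    connection are closer together than the default two-hour minimum.
    probe_times: probe timestamps in seconds, ascending."""
    return any(later - earlier < minimum
               for earlier, later in zip(probe_times, probe_times[1:]))

# Roughly 13.5-minute spacing, as in the problem trace: a violation.
bad = violates_keepalive_minimum([0, 808, 1616, 2424])   # True
# Just over two hours apart, as in the conforming trace: acceptable.
ok = violates_keepalive_minimum([0, 7205, 14410])        # False
```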

2.12.

   Name of Problem
      Window probe deadlock

   Classification
      Reliability

   Description

      When an application reads a single byte from a full window, the
      window should not be updated, in order to avoid Silly Window
      Syndrome (SWS; see [RFC813]).  If the remote peer uses a single
      byte of data to probe the window, that byte can be accepted into
      the buffer.  In some implementations, at this point a negative
      argument to a signed comparison causes all further new data to be
      considered outside the window; consequently, it is discarded
      (after sending an ACK to resynchronize).  These discards include
      the ACKs for the data packets sent by the local TCP, so the TCP
      will consider the data unacknowledged.
      Consequently, the application may be unable to complete sending
      new data to the remote peer, because it has exhausted the transmit
      buffer available to its local TCP, and buffer space is never being
      freed because incoming ACKs that would do so are being discarded.
      If the application does not read any more data, which may happen
      due to its failure to complete such sends, then deadlock results.
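      The signed-comparison failure can be modeled in a few lines (an
      illustrative Python sketch; the variable names rcv_nxt and
      rcv_adv follow the BSD convention, which is an assumption about
      the affected code, and the sequence numbers are taken from the
      trace below):

```python
def in_window(seg_seq, rcv_nxt, rcv_wnd):
    """Naive acceptability test: a segment must start within
    [rcv_nxt, rcv_nxt + rcv_wnd).  With rcv_wnd computed as a signed
    value that has gone negative, nothing is ever acceptable."""
    return rcv_nxt <= seg_seq < rcv_nxt + rcv_wnd

# The window was full (advertised edge == next expected sequence),
# then the one-byte probe was accepted, advancing rcv_nxt past the
# advertised edge.
rcv_adv = 57345   # highest sequence number advertised to the peer
rcv_nxt = 57346   # next expected sequence, after accepting the probe
rcv_wnd = rcv_adv - rcv_nxt        # -1 under signed arithmetic
# Even a segment starting exactly at rcv_nxt -- such as the peer's
# pure ACK of the local TCP's own data -- is now rejected, so that
# data is never seen as acknowledged.
rejected = not in_window(rcv_nxt, rcv_nxt, rcv_wnd)   # True
```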

   Significance

      It's relatively rare for applications to use TCP in a manner that
      can exercise this problem.  Most applications only transmit bulk
      data if they know the other end is prepared to receive the data.
      However, if a client fails to consume data, putting the server in
      persist mode, and then consumes a small amount of data, it can
      mistakenly compute a negative window.  At this point the client
      will discard all further packets from the server, including ACKs
      of the client's own data, since they are not inside the
      (impossibly-sized) window.  If subsequently the client consumes
      enough data to then send a window update to the server, the
      situation will be rectified.  That is, this situation can only
      happen if the client consumes 1 < N < MSS bytes, so as not to
      cause a window update, and then starts its own transmission
      towards the server of more than a window's worth of data.

   Implications

      TCP connections will hang and eventually time out.

   Relevant RFCs

      RFC 793 describes zero window probing.  RFC 813 describes Silly
      Window Syndrome.

   Trace file demonstrating it

      Trace made from a version of tcpdump modified to print out the
      sequence number attached to an ACK even if it's dataless.  An
      unmodified tcpdump would not print seq:seq(0); however, for this
      bug, the sequence number in the ACK is important for unambiguously
      determining how the TCP is behaving.

   [ Normal connection startup and data transmission from B to A.
     Options, including MSS of 16344 in both directions, omitted
     for clarity. ]
   16:07:32.327616 A > B: S 65360807:65360807(0) win 8192
   16:07:32.327304 B > A: S 65488807:65488807(0) ack 65360808 win 57344
   16:07:32.327425 A > B: . 1:1(0) ack 1 win 57344
   16:07:32.345732 B > A: P 1:2049(2048) ack 1 win 57344
   16:07:32.347013 B > A: P 2049:16385(14336) ack 1 win 57344
   16:07:32.347550 B > A: P 16385:30721(14336) ack 1 win 57344
   16:07:32.348683 B > A: P 30721:45057(14336) ack 1 win 57344
   16:07:32.467286 A > B: . 1:1(0) ack 45057 win 12288
|
|||
|
|
Paxson, et. al.              Informational                     [Page 37]

RFC 2525                TCP Implementation Problems           March 1999


   16:07:32.467854 B > A: P 45057:57345(12288) ack 1 win 57344

   [ B fills up A's offered window ]
   16:07:32.667276 A > B: . 1:1(0) ack 57345 win 0

   [ B probes A's window with a single byte ]
   16:07:37.467438 B > A: . 57345:57346(1) ack 1 win 57344

   [ A resynchronizes without accepting the byte ]
   16:07:37.467678 A > B: . 1:1(0) ack 57345 win 0

   [ B probes A's window again ]
   16:07:45.467438 B > A: . 57345:57346(1) ack 1 win 57344

   [ A resynchronizes and accepts the byte (per the ack field) ]
   16:07:45.667250 A > B: . 1:1(0) ack 57346 win 0

   [ The application on A has started generating data.  The first
     packet A sends is small due to a memory allocation bug. ]
   16:07:51.358459 A > B: P 1:2049(2048) ack 57346 win 0

   [ B acks A's first packet ]
   16:07:51.467239 B > A: . 57346:57346(0) ack 2049 win 57344

   [ This looks as though A accepted B's ACK and is sending
     another packet in response to it.  In fact, A is trying
     to resynchronize with B, and happens to have data to send
     and can send it because the first small packet didn't use
     up cwnd. ]
   16:07:51.467698 A > B: . 2049:14337(12288) ack 57346 win 0

   [ B acks all of the data that A has sent ]
   16:07:51.667283 B > A: . 57346:57346(0) ack 14337 win 57344

   [ A tries to resynchronize.  Notice that by the packets
     seen on the network, A and B *are* in fact synchronized;
     A only thinks that they aren't. ]
   16:07:51.667477 A > B: . 14337:14337(0) ack 57346 win 0

   [ A's retransmit timer fires, and B acks all of the data.
     A once again tries to resynchronize. ]
   16:07:52.467682 A > B: . 1:14337(14336) ack 57346 win 0
   16:07:52.468166 B > A: . 57346:57346(0) ack 14337 win 57344
   16:07:52.468248 A > B: . 14337:14337(0) ack 57346 win 0

   [ A's retransmit timer fires again, and B acks all of the data.
     A once again tries to resynchronize. ]
   16:07:55.467684 A > B: . 1:14337(14336) ack 57346 win 0
   16:07:55.468172 B > A: . 57346:57346(0) ack 14337 win 57344
   16:07:55.468254 A > B: . 14337:14337(0) ack 57346 win 0

Trace file demonstrating correct behavior
   Made between the same two hosts after applying the bug fix
   mentioned below (and using the same modified tcpdump).

   [ Connection starts up with data transmission from B to A.
     Note that due to a separate bug (the fact that A and B
     are communicating over a loopback driver), B erroneously
     skips slow start. ]
   17:38:09.510854 A > B: S 3110066585:3110066585(0) win 16384
   17:38:09.510926 B > A: S 3110174850:3110174850(0)
                            ack 3110066586 win 57344
   17:38:09.510953 A > B: . 1:1(0) ack 1 win 57344
   17:38:09.512956 B > A: P 1:2049(2048) ack 1 win 57344
   17:38:09.513222 B > A: P 2049:16385(14336) ack 1 win 57344
   17:38:09.513428 B > A: P 16385:30721(14336) ack 1 win 57344
   17:38:09.513638 B > A: P 30721:45057(14336) ack 1 win 57344
   17:38:09.519531 A > B: . 1:1(0) ack 45057 win 12288
   17:38:09.519638 B > A: P 45057:57345(12288) ack 1 win 57344

   [ B fills up A's offered window ]
   17:38:09.719526 A > B: . 1:1(0) ack 57345 win 0

   [ B probes A's window with a single byte.  A resynchronizes
     without accepting the byte ]
   17:38:14.499661 B > A: . 57345:57346(1) ack 1 win 57344
   17:38:14.499724 A > B: . 1:1(0) ack 57345 win 0

   [ B probes A's window again.  A resynchronizes and accepts
     the byte, as indicated by the ack field ]
   17:38:19.499764 B > A: . 57345:57346(1) ack 1 win 57344
   17:38:19.519731 A > B: . 1:1(0) ack 57346 win 0

   [ B probes A's window with a single byte.  A resynchronizes
     without accepting the byte ]
   17:38:24.499865 B > A: . 57346:57347(1) ack 1 win 57344
   17:38:24.499934 A > B: . 1:1(0) ack 57346 win 0

   [ The application on A has started generating data.
     B acks A's data and A accepts the ACKs and the
     data transfer continues ]
   17:38:28.530265 A > B: P 1:2049(2048) ack 57346 win 0
   17:38:28.719914 B > A: . 57346:57346(0) ack 2049 win 57344

   17:38:28.720023 A > B: . 2049:16385(14336) ack 57346 win 0
   17:38:28.720089 A > B: . 16385:30721(14336) ack 57346 win 0
   17:38:28.720370 B > A: . 57346:57346(0) ack 30721 win 57344

   17:38:28.720462 A > B: . 30721:45057(14336) ack 57346 win 0
   17:38:28.720526 A > B: P 45057:59393(14336) ack 57346 win 0
   17:38:28.720824 A > B: P 59393:73729(14336) ack 57346 win 0
   17:38:28.721124 B > A: . 57346:57346(0) ack 73729 win 47104

   17:38:28.721198 A > B: P 73729:88065(14336) ack 57346 win 0
   17:38:28.721379 A > B: P 88065:102401(14336) ack 57346 win 0
   17:38:28.721557 A > B: P 102401:116737(14336) ack 57346 win 0
   17:38:28.721863 B > A: . 57346:57346(0) ack 116737 win 36864

References
   None known.

How to detect
   Initiate a connection from a client to a server.  Have the server
   continuously send data until its buffers have been full for long
   enough to exhaust the window.  Next, have the client read 1 byte
   and then delay for long enough that the server TCP sends a window
   probe.  Now have the client start sending data.  At this point, if
   it ignores the server's ACKs, then the client's TCP suffers from
   the problem.

How to fix
   In one implementation known to exhibit the problem (derived from
   4.3-Reno), the problem was introduced when the macro MAX() was
   replaced by the function call max() for computing the amount of
   space in the receive window:

      tp->rcv_wnd = max(win, (int)(tp->rcv_adv - tp->rcv_nxt));

   When data has been received into a window beyond what has been
   advertised to the other side, rcv_nxt > rcv_adv, making this
   difference negative.  The (int) cast makes clear that this is
   intended, but the unsigned max() function converts the negative
   value to a very large unsigned number, so it compares as "larger".
   The fix is to change max() to the signed imax():

      tp->rcv_wnd = imax(win, (int)(tp->rcv_adv - tp->rcv_nxt));

   4.3-Tahoe and before did not have this bug, since it used the
   macro MAX() for this calculation.

2.13.

Name of Problem
   Stretch ACK violation

Classification
   Congestion Control/Performance

Description
   To improve efficiency (both computer and network) a data receiver
   may refrain from sending an ACK for each incoming segment,
   according to [RFC1122].  However, an ACK should not be delayed an
   inordinate amount of time.  Specifically, ACKs SHOULD be sent for
   every second full-sized segment that arrives.  If a second full-
   sized segment does not arrive within a given timeout (of no more
   than 0.5 seconds), an ACK should be transmitted, according to
   [RFC1122].  A TCP receiver which does not generate an ACK for
   every second full-sized segment exhibits a "Stretch ACK
   Violation".

Significance
   TCP receivers exhibiting this behavior will cause TCP senders to
   generate burstier traffic, which can degrade performance in
   congested environments.  In addition, generating fewer ACKs
   increases the amount of time needed by the slow start algorithm to
   open the congestion window to an appropriate point, which
   diminishes performance in environments with large bandwidth-delay
   products.  Finally, generating fewer ACKs may cause needless
   retransmission timeouts in lossy environments, as it increases the
   possibility that an entire window of ACKs is lost, forcing a
   retransmission timeout.

Implications
   When not in loss recovery, every ACK received by a TCP sender
   triggers the transmission of new data segments.  The burst size is
   determined by the number of previously unacknowledged segments
   each ACK covers.  Therefore, a TCP receiver ack'ing more than 2
   segments at a time causes the sending TCP to generate a larger
   burst of traffic upon receipt of the ACK.  This large burst of
   traffic can overwhelm an intervening gateway, leading to higher
   drop rates for both the connection and other connections passing
   through the congested gateway.

   In addition, the TCP slow start algorithm increases the congestion
   window by 1 segment for each ACK received.  Therefore, increasing
   the ACK interval (thus decreasing the rate at which ACKs are
   transmitted) increases the amount of time it takes slow start to
   increase the congestion window to an appropriate operating point,
   and the connection consequently suffers from reduced performance.
   This is especially true for connections using large windows.

Relevant RFCs
   RFC 1122 outlines delayed ACKs as a recommended mechanism.

Trace file demonstrating it
   Trace file taken using tcpdump at host B, the data receiver (and
   ACK originator).  The advertised window (which never changed) and
   timestamp options have been omitted for clarity, except for the
   first packet sent by A:

   12:09:24.820187 A.1174 > B.3999: . 2049:3497(1448) ack 1
       win 33580 <nop,nop,timestamp 2249877 2249914> [tos 0x8]
   12:09:24.824147 A.1174 > B.3999: . 3497:4945(1448) ack 1
   12:09:24.832034 A.1174 > B.3999: . 4945:6393(1448) ack 1
   12:09:24.832222 B.3999 > A.1174: . ack 6393
   12:09:24.934837 A.1174 > B.3999: . 6393:7841(1448) ack 1
   12:09:24.942721 A.1174 > B.3999: . 7841:9289(1448) ack 1
   12:09:24.950605 A.1174 > B.3999: . 9289:10737(1448) ack 1
   12:09:24.950797 B.3999 > A.1174: . ack 10737
   12:09:24.958488 A.1174 > B.3999: . 10737:12185(1448) ack 1
   12:09:25.052330 A.1174 > B.3999: . 12185:13633(1448) ack 1
   12:09:25.060216 A.1174 > B.3999: . 13633:15081(1448) ack 1
   12:09:25.060405 B.3999 > A.1174: . ack 15081

   This portion of the trace clearly shows that the receiver (host B)
   sends an ACK for every third full-sized packet received.  Further
   investigation of this implementation found that the cause of the
   increased ACK interval was the TCP options being used.  The
   implementation sent an ACK after it was holding 2*MSS worth of
   unacknowledged data.  In the above case, the MSS is 1460 bytes, so
   the receiver transmits an ACK after it is holding at least 2920
   bytes of unacknowledged data.  However, the length of the TCP
   options being used [RFC1323] took 12 bytes away from the data
   portion of each packet.  This produced packets containing 1448
   bytes of data.  But the additional bytes used by the options in
   the header were not taken into account when determining when to
   trigger an ACK.  Therefore, it took 3 data segments before the
   data receiver was holding enough unacknowledged data (>= 2*MSS, or
   2920 bytes in the above example) to transmit an ACK.

Trace file demonstrating correct behavior
   Trace file taken using tcpdump at host B, the data receiver (and
   ACK originator), again with window and timestamp information
   omitted except for the first packet:

   12:06:53.627320 A.1172 > B.3999: . 1449:2897(1448) ack 1
       win 33580 <nop,nop,timestamp 2249575 2249612> [tos 0x8]
   12:06:53.634773 A.1172 > B.3999: . 2897:4345(1448) ack 1
   12:06:53.634961 B.3999 > A.1172: . ack 4345
   12:06:53.737326 A.1172 > B.3999: . 4345:5793(1448) ack 1
   12:06:53.744401 A.1172 > B.3999: . 5793:7241(1448) ack 1
   12:06:53.744592 B.3999 > A.1172: . ack 7241

   12:06:53.752287 A.1172 > B.3999: . 7241:8689(1448) ack 1
   12:06:53.847332 A.1172 > B.3999: . 8689:10137(1448) ack 1
   12:06:53.847525 B.3999 > A.1172: . ack 10137

   This trace shows the TCP receiver (host B) ack'ing every second
   full-sized packet, according to [RFC1122].  This is the same
   implementation shown above, with slight modifications that allow
   the receiver to take the length of the options into account when
   deciding when to transmit an ACK.

References
   This problem is documented in [Allman97] and [Paxson97].

How to detect
   Stretch ACK violations show up immediately in receiver-side packet
   traces of bulk transfers, as shown above.  However, packet traces
   made on the sender side of the TCP connection may lead to
   ambiguities when diagnosing this problem due to the possibility of
   lost ACKs.

2.14.

Name of Problem
   Retransmission sends multiple packets

Classification
   Congestion control

Description
   When a TCP retransmits a segment due to a timeout expiration or
   beginning a fast retransmission sequence, it should only transmit
   a single segment.  A TCP that transmits more than one segment
   exhibits "Retransmission Sends Multiple Packets".

   Instances of this problem have been known to occur due to
   miscomputations involving the use of TCP options.  TCP options
   increase the TCP header beyond its usual size of 20 bytes.  The
   total size of the header must be taken into account when
   retransmitting a packet.  If a TCP sender does not account for the
   length of the TCP options when determining how much data to
   retransmit, it will send too much data to fit into a single
   packet.  In this case, the correct retransmission will be followed
   by a short segment (tinygram) containing data that may not need to
   be retransmitted.

   A specific case is a TCP using the RFC 1323 timestamp option,
   which adds 12 bytes to the standard 20-byte TCP header.  On
   retransmission of a packet, the 12-byte option is incorrectly
   interpreted as part of the data portion of the segment.  A
   standard TCP header and a new 12-byte option are added to the
   data, which yields a transmission of 12 bytes more data than was
   contained in the original segment.  This overflow causes a second,
   smaller packet, carrying the 12 excess data bytes, to be
   transmitted.

Significance
   This problem is somewhat serious for congested environments
   because the TCP implementation injects more packets into the
   network than is appropriate.  However, since a tinygram is only
   sent in response to a fast retransmit or a timeout, it does not
   affect the sustained sending rate.

Implications
   A TCP exhibiting this behavior is stressing the network with more
   traffic than appropriate, and stressing routers by increasing the
   number of packets they must process.  The redundant tinygram will
   also elicit a duplicate ACK from the receiver, resulting in yet
   another unnecessary transmission.

Relevant RFCs
   RFC 1122 requires use of slow start after loss; RFC 2001
   explicates slow start; RFC 1323 describes the timestamp option
   that has been observed to lead to some implementations exhibiting
   this problem.

Trace file demonstrating it
   Made using tcpdump recording at a machine on the same subnet as
   Host A.  Host A is the sender and Host B is the receiver.  The
   advertised window and timestamp options have been omitted for
   clarity, except for the first segment sent by host A.  In
   addition, portions of the trace file not pertaining to the packet
   in question have been removed (missing packets are denoted by
   "[...]" in the trace).

   11:55:22.701668 A > B: . 7361:7821(460) ack 1
       win 49324 <nop,nop,timestamp 3485348 3485113>
   11:55:22.702109 A > B: . 7821:8281(460) ack 1
   [...]

   11:55:23.112405 B > A: . ack 7821
   11:55:23.113069 A > B: . 12421:12881(460) ack 1
   11:55:23.113511 A > B: . 12881:13341(460) ack 1
   11:55:23.333077 B > A: . ack 7821
   11:55:23.336860 B > A: . ack 7821
   11:55:23.340638 B > A: . ack 7821
   11:55:23.341290 A > B: . 7821:8281(460) ack 1
   11:55:23.341317 A > B: . 8281:8293(12) ack 1

   11:55:23.498242 B > A: . ack 7821
   11:55:23.506850 B > A: . ack 7821
   11:55:23.510630 B > A: . ack 7821

   [...]

   11:55:23.746649 B > A: . ack 10581

   The second line of the above trace shows the original transmission
   of a segment which is later dropped.  After 3 duplicate ACKs, line
   9 of the trace shows the dropped packet (7821:8281), with a 460-
   byte payload, being retransmitted.  Immediately following this
   retransmission, a packet with a 12-byte payload is unnecessarily
   sent.

Trace file demonstrating correct behavior
   The trace file would be identical to the one above, with the
   single line:

   11:55:23.341317 A > B: . 8281:8293(12) ack 1

   omitted.

References
   [Brakmo95]

How to detect
   This problem can be detected by examining a packet trace of the
   TCP connections of a machine using TCP options, during which a
   packet is retransmitted.

2.15.

Name of Problem
   Failure to send FIN notification promptly

Classification
   Performance

Description
   When an application closes a connection, the corresponding TCP
   should send the FIN notification promptly to its peer (unless
   prevented by the congestion window).  If a TCP implementation
   delays in sending the FIN notification, for example due to waiting
   until unacknowledged data has been acknowledged, then it is said
   to exhibit "Failure to send FIN notification promptly".

   Also, while not strictly required, FIN segments should include the
   PSH flag to ensure expedited delivery of any pending data at the
   receiver.

Significance
   The greatest impact occurs for short-lived connections, since for
   these the additional time required to close the connection
   introduces the greatest relative delay.

   The additional time can be significant in the common case of the
   sender waiting for an ACK that is delayed by the receiver.

Implications
   Can diminish total throughput as seen at the application layer,
   because connection termination takes longer to complete.

Relevant RFCs
   RFC 793 indicates that a receiver should treat an incoming FIN
   flag as implying the push function.

Trace file demonstrating it
   Made using tcpdump (no losses reported by the packet filter).

   10:04:38.68 A > B: S 1031850376:1031850376(0) win 4096
                  <mss 1460,wscale 0,eol> (DF)
   10:04:38.71 B > A: S 596916473:596916473(0) ack 1031850377
                  win 8760 <mss 1460> (DF)
   10:04:38.73 A > B: . ack 1 win 4096 (DF)
   10:04:41.98 A > B: P 1:4(3) ack 1 win 4096 (DF)
   10:04:42.15 B > A: . ack 4 win 8757 (DF)
   10:04:42.23 A > B: P 4:7(3) ack 1 win 4096 (DF)
   10:04:42.25 B > A: P 1:11(10) ack 7 win 8754 (DF)
   10:04:42.32 A > B: . ack 11 win 4096 (DF)
   10:04:42.33 B > A: P 11:51(40) ack 7 win 8754 (DF)
   10:04:42.51 A > B: . ack 51 win 4096 (DF)
   10:04:42.53 B > A: F 51:51(0) ack 7 win 8754 (DF)
   10:04:42.56 A > B: FP 7:7(0) ack 52 win 4096 (DF)
   10:04:42.58 B > A: . ack 8 win 8754 (DF)

   Machine B in the trace above does not send out a FIN notification
   promptly if there is any data outstanding.  It instead waits for
   all unacknowledged data to be acknowledged before sending the FIN
   segment.  The connection was closed at 10:04:42.33, after
   requesting 40 bytes to be sent.  However, the FIN notification
   isn't sent until 10:04:42.53, after the (delayed) acknowledgement
   of the 40 bytes of data arrives at 10:04:42.51.

Trace file demonstrating correct behavior
   Made using tcpdump (no losses reported by the packet filter).

   10:27:53.85 C > D: S 419744533:419744533(0) win 4096
                  <mss 1460,wscale 0,eol> (DF)
   10:27:53.92 D > C: S 10082297:10082297(0) ack 419744534
                  win 8760 <mss 1460> (DF)
   10:27:53.95 C > D: . ack 1 win 4096 (DF)
   10:27:54.42 C > D: P 1:4(3) ack 1 win 4096 (DF)
   10:27:54.62 D > C: . ack 4 win 8757 (DF)
   10:27:54.76 C > D: P 4:7(3) ack 1 win 4096 (DF)
   10:27:54.89 D > C: P 1:11(10) ack 7 win 8754 (DF)
   10:27:54.90 D > C: FP 11:51(40) ack 7 win 8754 (DF)
   10:27:54.92 C > D: . ack 52 win 4096 (DF)
   10:27:55.01 C > D: FP 7:7(0) ack 52 win 4096 (DF)
   10:27:55.09 D > C: . ack 8 win 8754 (DF)

   Here, Machine D sends a FIN with 40 bytes of data even before the
   original 10 octets have been acknowledged.  This is correct
   behavior, as it provides for the highest performance.

References
   This problem is documented in [Dawson97].

How to detect
   For implementations manifesting this problem, it shows up on a
   packet trace.

2.16.

Name of Problem
   Failure to send a RST after Half Duplex Close

Classification
   Resource management

Description
   RFC 1122, section 4.2.2.13, states that a TCP SHOULD send a RST if
   data is received after "half duplex close", i.e. if it cannot be
   delivered to the application.  A TCP that fails to do so is said
   to exhibit "Failure to send a RST after Half Duplex Close".

Significance
   Potentially serious for TCP endpoints that manage large numbers of
   connections, due to exhaustion of memory and/or process slots
   available for managing connection state.

Implications
   Failure to send the RST can lead to permanently hung TCP
   connections.  This problem has been demonstrated when HTTP clients
   abort connections, common when users move on to a new page before
   the current page has finished downloading.  The HTTP client closes
   by transmitting a FIN while the server is transmitting images,
   text, etc.  The server TCP receives the FIN, but its application
   does not close the connection until all data has been queued for
   transmission.  Since the server will not transmit a FIN until all
   the preceding data has been transmitted, deadlock results if the
   client TCP does not consume the pending data or tear down the
   connection: the window decreases to zero, since the client cannot
   pass the data to the application, and the server sends probe
   segments.  The client acknowledges the probe segments with a zero
   window.  As mandated in RFC 1122, section 4.2.2.17, the probe
   segments are transmitted forever.  Server connection state remains
   in CLOSE_WAIT, and eventually server processes are exhausted.

   Note that there are two bugs.  First, probe segments should be
   ignored if the window can never subsequently increase.  Second, a
   RST should be sent when data is received after half duplex close.
   Fixing the first bug, but not the second, results in the probe
   segments eventually timing out the connection, but the server
   remains in CLOSE_WAIT for a significant and unnecessary period.

Relevant RFCs
   RFC 1122, sections 4.2.2.13 and 4.2.2.17.

Trace file demonstrating it
   Made using an unknown network analyzer.  No drop information
   available.

   client.1391 > server.8080: S 0:1(0) ack: 0 win: 2000 <mss: 5b4>
   server.8080 > client.1391: SA 8c01:8c02(0) ack: 1 win: 8000 <mss:100>
   client.1391 > server.8080: PA
   client.1391 > server.8080: PA 1:1c2(1c1) ack: 8c02 win: 2000
   server.8080 > client.1391: [DF] PA 8c02:8cde(dc) ack: 1c2 win: 8000
   server.8080 > client.1391: [DF] A 8cde:9292(5b4) ack: 1c2 win: 8000
   server.8080 > client.1391: [DF] A 9292:9846(5b4) ack: 1c2 win: 8000
   server.8080 > client.1391: [DF] A 9846:9dfa(5b4) ack: 1c2 win: 8000
   client.1391 > server.8080: PA
   server.8080 > client.1391: [DF] A 9dfa:a3ae(5b4) ack: 1c2 win: 8000
   server.8080 > client.1391: [DF] A a3ae:a962(5b4) ack: 1c2 win: 8000
   server.8080 > client.1391: [DF] A a962:af16(5b4) ack: 1c2 win: 8000
   server.8080 > client.1391: [DF] A af16:b4ca(5b4) ack: 1c2 win: 8000
   client.1391 > server.8080: PA
   server.8080 > client.1391: [DF] A b4ca:ba7e(5b4) ack: 1c2 win: 8000
   server.8080 > client.1391: [DF] A b4ca:ba7e(5b4) ack: 1c2 win: 8000

   client.1391 > server.8080: PA
   server.8080 > client.1391: [DF] A ba7e:bdfa(37c) ack: 1c2 win: 8000
   client.1391 > server.8080: PA
   server.8080 > client.1391: [DF] A bdfa:bdfb(1) ack: 1c2 win: 8000
   client.1391 > server.8080: PA

   [ HTTP client aborts and enters FIN_WAIT_1 ]

   client.1391 > server.8080: FPA

   [ server ACKs the FIN and enters CLOSE_WAIT ]

   server.8080 > client.1391: [DF] A

   [ client enters FIN_WAIT_2 ]

   server.8080 > client.1391: [DF] A bdfa:bdfb(1) ack: 1c3 win: 8000

   [ server continues to try to send its data ]

   client.1391 > server.8080: PA < window = 0 >
   server.8080 > client.1391: [DF] A bdfa:bdfb(1) ack: 1c3 win: 8000
   client.1391 > server.8080: PA < window = 0 >
   server.8080 > client.1391: [DF] A bdfa:bdfb(1) ack: 1c3 win: 8000
   client.1391 > server.8080: PA < window = 0 >
   server.8080 > client.1391: [DF] A bdfa:bdfb(1) ack: 1c3 win: 8000
   client.1391 > server.8080: PA < window = 0 >
   server.8080 > client.1391: [DF] A bdfa:bdfb(1) ack: 1c3 win: 8000
   client.1391 > server.8080: PA < window = 0 >

   [ ... repeat ad exhaustium ... ]

Trace file demonstrating correct behavior
   Made using an unknown network analyzer.  No drop information
   available.

   client > server D=80 S=59500 Syn Seq=337 Len=0 Win=8760
   server > client D=59500 S=80 Syn Ack=338 Seq=80153 Len=0 Win=8760
   client > server D=80 S=59500 Ack=80154 Seq=338 Len=0 Win=8760

   [ ... normal data omitted ... ]

   client > server D=80 S=59500 Ack=14559 Seq=596 Len=0 Win=8760
   server > client D=59500 S=80 Ack=596 Seq=114559 Len=1460 Win=8760

   [ client closes connection ]

   client > server D=80 S=59500 Fin Seq=596 Len=0 Win=8760

Paxson, et. al. Informational [Page 49]
|
|||
|
|
|
|||
|
|
RFC 2525 TCP Implementation Problems March 1999
|
|||
|
|
|
|||
|
|
|
|||
|
|
      server > client D=59500 S=80 Ack=597 Seq=116019 Len=1460 Win=8760

      [ client sends RST (RFC1122 4.2.2.13) ]

      client > server D=80 S=59500 Rst Seq=597 Len=0 Win=0
      server > client D=59500 S=80 Ack=597 Seq=117479 Len=1460 Win=8760
      client > server D=80 S=59500 Rst Seq=597 Len=0 Win=0
      server > client D=59500 S=80 Ack=597 Seq=118939 Len=1460 Win=8760
      client > server D=80 S=59500 Rst Seq=597 Len=0 Win=0
      server > client D=59500 S=80 Ack=597 Seq=120399 Len=892 Win=8760
      client > server D=80 S=59500 Rst Seq=597 Len=0 Win=0
      server > client D=59500 S=80 Ack=597 Seq=121291 Len=1460 Win=8760
      client > server D=80 S=59500 Rst Seq=597 Len=0 Win=0

      "client" sends a number of RSTs, one in response to each incoming
      packet from "server".  One might wonder why "server" keeps sending
      data packets after it has received a RST from "client"; the
      explanation is that "server" had already transmitted all five of
      the data packets before receiving the first RST from "client", so
      it is too late to avoid transmitting them.

   How to detect
      The problem can be detected by inspecting packet traces of a
      large, interrupted bulk transfer.

2.17 Failure to RST on close with data pending

   Name of Problem
      Failure to RST on close with data pending

   Classification
      Resource management

   Description
      When an application closes a connection in such a way that it can
      no longer read any received data, the TCP SHOULD, per section
      4.2.2.13 of RFC 1122, send a RST if there is any unread received
      data, or if any new data is received.  A TCP that fails to do so
      exhibits "Failure to RST on close with data pending".

      Note that, for some TCPs, this situation can be caused by an
      application "crashing" while a peer is sending data.

      We have observed a number of TCPs that exhibit this problem.  The
      problem is less serious if any subsequent data sent to the now-
      closed connection endpoint elicits a RST (see illustration below).
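
      The close-path decision the RFC requires can be sketched in a few
      lines.  The following is only an illustration: the Conn record,
      its field names, and the function are ours, not any real stack's
      data structures.

```python
from dataclasses import dataclass, field

@dataclass
class Conn:
    """Toy connection record; fields are illustrative only."""
    receive_queue: bytes = b""        # data received but not yet read
    state: str = "ESTABLISHED"
    segments_sent: list = field(default_factory=list)

def application_close(conn: Conn) -> None:
    """Close-path decision per RFC 1122, Section 4.2.2.13 (sketch).

    If the application can no longer read pending data, the TCP must
    abort with a RST rather than perform an orderly FIN close.
    """
    if conn.receive_queue:
        conn.segments_sent.append("RST")   # data would otherwise be silently lost
        conn.state = "CLOSED"
    else:
        conn.segments_sent.append("FIN")   # normal orderly close
        conn.state = "FIN-WAIT-1"

c = Conn(receive_queue=b"unread data")
application_close(c)
print(c.segments_sent, c.state)            # prints: ['RST'] CLOSED
```

      A TCP exhibiting the problem takes the FIN branch unconditionally.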

   Significance
      This problem is most significant for endpoints that engage in
      large numbers of connections, as their ability to do so will be
      curtailed as they leak away resources.

   Implications
      Failure to reset the connection can lead to permanently hung
      connections, in which the remote endpoint takes no further action
      to tear down the connection because it is waiting on the local TCP
      to first take some action.  This is particularly the case if the
      local TCP also allows the advertised window to go to zero, and
      fails to tear down the connection when the remote TCP engages in
      "persist" probes (see example below).

   Relevant RFCs
      RFC 1122 section 4.2.2.13.  Also, 4.2.2.17 for the zero-window
      probing discussion below.

   Trace file demonstrating it
      Made using tcpdump.  No drop information available.

      13:11:46.04 A > B: S 458659166:458659166(0) win 4096
                         <mss 1460,wscale 0,eol> (DF)
      13:11:46.04 B > A: S 792320000:792320000(0) ack 458659167
                         win 4096
      13:11:46.04 A > B: . ack 1 win 4096 (DF)
      13:11:55.80 A > B: . 1:513(512) ack 1 win 4096 (DF)
      13:11:55.80 A > B: . 513:1025(512) ack 1 win 4096 (DF)
      13:11:55.83 B > A: . ack 1025 win 3072
      13:11:55.84 A > B: . 1025:1537(512) ack 1 win 4096 (DF)
      13:11:55.84 A > B: . 1537:2049(512) ack 1 win 4096 (DF)
      13:11:55.85 A > B: . 2049:2561(512) ack 1 win 4096 (DF)
      13:11:56.03 B > A: . ack 2561 win 1536
      13:11:56.05 A > B: . 2561:3073(512) ack 1 win 4096 (DF)
      13:11:56.06 A > B: . 3073:3585(512) ack 1 win 4096 (DF)
      13:11:56.06 A > B: . 3585:4097(512) ack 1 win 4096 (DF)
      13:11:56.23 B > A: . ack 4097 win 0
      13:11:58.16 A > B: . 4096:4097(1) ack 1 win 4096 (DF)
      13:11:58.16 B > A: . ack 4097 win 0
      13:12:00.16 A > B: . 4096:4097(1) ack 1 win 4096 (DF)
      13:12:00.16 B > A: . ack 4097 win 0
      13:12:02.16 A > B: . 4096:4097(1) ack 1 win 4096 (DF)
      13:12:02.16 B > A: . ack 4097 win 0
      13:12:05.37 A > B: . 4096:4097(1) ack 1 win 4096 (DF)
      13:12:05.37 B > A: . ack 4097 win 0
      13:12:06.36 B > A: F 1:1(0) ack 4097 win 0
      13:12:06.37 A > B: . ack 2 win 4096 (DF)
      13:12:11.78 A > B: . 4096:4097(1) ack 2 win 4096 (DF)
      13:12:11.78 B > A: . ack 4097 win 0
      13:12:24.59 A > B: . 4096:4097(1) ack 2 win 4096 (DF)
      13:12:24.60 B > A: . ack 4097 win 0
      13:12:50.22 A > B: . 4096:4097(1) ack 2 win 4096 (DF)
      13:12:50.22 B > A: . ack 4097 win 0

      Machine B in the trace above does not drop received data when the
      socket is "closed" by the application (in this case, the
      application process was terminated).  This occurred at
      approximately 13:12:06.36 and resulted in the FIN being sent in
      response to the close.  However, because there is no longer an
      application to deliver the data to, the TCP should have instead
      sent a RST.

      Note: Machine A's zero-window probing is also broken.  It is
      resending old data, rather than new data.  Section 3.7 in RFC 793
      and Section 4.2.2.17 in RFC 1122 discuss zero-window probing.

   Trace file demonstrating better behavior
      Made using tcpdump.  No drop information available.

      Better, but still not fully correct, behavior, per the discussion
      below.  We show this behavior because it has been observed for a
      number of different TCP implementations.

      13:48:29.24 C > D: S 73445554:73445554(0) win 4096
                         <mss 1460,wscale 0,eol> (DF)
      13:48:29.24 D > C: S 36050296:36050296(0) ack 73445555
                         win 4096 <mss 1460,wscale 0,eol> (DF)
      13:48:29.25 C > D: . ack 1 win 4096 (DF)
      13:48:30.78 C > D: . 1:1461(1460) ack 1 win 4096 (DF)
      13:48:30.79 C > D: . 1461:2921(1460) ack 1 win 4096 (DF)
      13:48:30.80 D > C: . ack 2921 win 1176 (DF)
      13:48:32.75 C > D: . 2921:4097(1176) ack 1 win 4096 (DF)
      13:48:32.82 D > C: . ack 4097 win 0 (DF)
      13:48:34.76 C > D: . 4096:4097(1) ack 1 win 4096 (DF)
      13:48:34.84 D > C: . ack 4097 win 0 (DF)
      13:48:36.34 D > C: FP 1:1(0) ack 4097 win 4096 (DF)
      13:48:36.34 C > D: . 4097:5557(1460) ack 2 win 4096 (DF)
      13:48:36.34 D > C: R 36050298:36050298(0) win 24576
      13:48:36.34 C > D: . 5557:7017(1460) ack 2 win 4096 (DF)
      13:48:36.34 D > C: R 36050298:36050298(0) win 24576

      In this trace, the application process is terminated on Machine D
      at approximately 13:48:36.34.  Its TCP sends the FIN with the
      window opened again (since it discarded the previously received
      data).  Machine C promptly sends more data, causing Machine D to
      reset the connection since it cannot deliver the data to the
      application.  Ideally, Machine D SHOULD send a RST instead of
      dropping the data and re-opening the receive window.

      Note: Machine C's zero-window probing is broken, the same as in
      the example above.

   Trace file demonstrating correct behavior
      Made using tcpdump.  No losses reported by the packet filter.

      14:12:02.19 E > F: S 1143360000:1143360000(0) win 4096
      14:12:02.19 F > E: S 1002988443:1002988443(0) ack 1143360001
                         win 4096 <mss 1460> (DF)
      14:12:02.19 E > F: . ack 1 win 4096
      14:12:10.43 E > F: . 1:513(512) ack 1 win 4096
      14:12:10.61 F > E: . ack 513 win 3584 (DF)
      14:12:10.61 E > F: . 513:1025(512) ack 1 win 4096
      14:12:10.61 E > F: . 1025:1537(512) ack 1 win 4096
      14:12:10.81 F > E: . ack 1537 win 2560 (DF)
      14:12:10.81 E > F: . 1537:2049(512) ack 1 win 4096
      14:12:10.81 E > F: . 2049:2561(512) ack 1 win 4096
      14:12:10.81 E > F: . 2561:3073(512) ack 1 win 4096
      14:12:11.01 F > E: . ack 3073 win 1024 (DF)
      14:12:11.01 E > F: . 3073:3585(512) ack 1 win 4096
      14:12:11.01 E > F: . 3585:4097(512) ack 1 win 4096
      14:12:11.21 F > E: . ack 4097 win 0 (DF)
      14:12:15.88 E > F: . 4097:4098(1) ack 1 win 4096
      14:12:16.06 F > E: . ack 4097 win 0 (DF)
      14:12:20.88 E > F: . 4097:4098(1) ack 1 win 4096
      14:12:20.91 F > E: . ack 4097 win 0 (DF)
      14:12:21.94 F > E: R 1002988444:1002988444(0) win 4096

      When the application terminates at 14:12:21.94, F immediately
      sends a RST.

      Note: Machine E's zero-window probing is (finally) correct.

   How to detect
      The problem can often be detected by inspecting packet traces of a
      transfer in which the receiving application terminates abnormally.
      When doing so, there can be an ambiguity (if only looking at the
      trace) as to whether the receiving TCP did indeed have unread data
      that it could now no longer deliver.  To provoke this to happen,
      it may help to suspend the receiving application so that it fails
      to consume any data, eventually exhausting the advertised window.
      At this point, since the advertised window is zero, we know that
      the receiving TCP has undelivered data buffered up.  Terminating
      the application process then should suffice to test the
      correctness of the TCP's behavior.
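
      On systems with a sockets API, this provocation can be scripted.
      The loopback sketch below assumes a Unix-like stack that sends a
      RST when a socket is closed with unread data; the buffer sizes
      and the 0.2-second delay are illustrative choices, not part of
      any specification.

```python
import socket
import time

# Receiver that never reads: small buffers so the window fills quickly.
listener = socket.socket()
listener.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 4096)
listener.bind(("127.0.0.1", 0))
listener.listen(1)

sender = socket.socket()
sender.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 4096)
sender.connect(listener.getsockname())
receiver, _ = listener.accept()

# Exhaust the advertised window (the receiver consumes nothing).
sender.setblocking(False)
try:
    while True:
        sender.send(b"x" * 512)
except BlockingIOError:
    pass            # send buffer and receiver window are now full

# "Terminate" the receiving application with data still unread.
# A correct TCP sends a RST here instead of quietly discarding.
receiver.close()
time.sleep(0.2)     # allow the RST to arrive

reset_seen = False
sender.setblocking(True)
try:
    for _ in range(10):
        sender.send(b"y")
except (ConnectionResetError, BrokenPipeError):
    reset_seen = True

print("RST observed:", reset_seen)
listener.close()
sender.close()
```

      On a stack exhibiting the problem, the sender instead sees an
      orderly FIN (or nothing at all) and never observes a reset.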

2.18 Options missing from TCP MSS calculation

   Name of Problem
      Options missing from TCP MSS calculation

   Classification
      Reliability / performance

   Description
      When a TCP determines how much data to send per packet, it
      calculates a segment size based on the MTU of the path.  It must
      then subtract from that MTU the size of the IP and TCP headers in
      the packet.  If IP options and TCP options are not taken into
      account correctly in this calculation, the resulting segment size
      may be too large.  TCPs that do so are said to exhibit "Options
      missing from TCP MSS calculation".
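
      RFC 1122 (Section 4.2.2.6) gives the correct rule as
      Eff.snd.MSS = min(SendMSS + 20, MMS_S) - TCPhdrsize - IPoptionsize,
      where the TCP header size includes any TCP options in use.  The
      sketch below is ours (the function name and parameters are not
      any stack's API) and simply evaluates that formula:

```python
def effective_send_mss(send_mss: int, mms_s: int,
                       ip_opt_len: int = 0, tcp_opt_len: int = 0) -> int:
    """Eff.snd.MSS per RFC 1122, Section 4.2.2.6.

    send_mss    -- MSS option value received from the peer
    mms_s       -- maximum transport-layer message the IP layer can send
    ip_opt_len  -- bytes of IP options on outgoing segments
    tcp_opt_len -- bytes of TCP options on outgoing segments
    """
    TCP_FIXED_HEADER = 20
    return (min(send_mss + 20, mms_s)
            - (TCP_FIXED_HEADER + tcp_opt_len)
            - ip_opt_len)

# Ethernet path (MTU 1500, so MMS_S = 1480), 8 bytes of IP options and
# 12 bytes of TCP timestamp options: at most 1440 data octets fit.
# A calculation that ignores both option terms yields 1460 instead.
print(effective_send_mss(1460, 1480, ip_opt_len=8, tcp_opt_len=12))  # 1440
```

      A broken implementation omits one or both option terms, producing
      segments that overflow the path MTU, as in the traces below.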

   Significance
      In some implementations, this causes the transmission of strangely
      fragmented packets.  In some implementations with Path MTU (PMTU)
      discovery [RFC1191], this problem can actually result in a total
      failure to transmit any data at all, regardless of the environment
      (see below).

      Arguably, especially since the wide deployment of firewalls, IP
      options appear only rarely in normal operations.

   Implications
      In implementations using PMTU discovery, this problem can result
      in packets that are too large for the output interface, and that
      have the DF (don't fragment) bit set in the IP header.  Thus, the
      IP layer on the local machine is not allowed to fragment the
      packet to send it out the interface.  It instead informs the TCP
      layer of the correct MTU size of the interface; the TCP layer
      again miscomputes the MSS by failing to take into account the size
      of IP options; and the problem repeats, with no data flowing.

   Relevant RFCs
      RFC 1122 describes the calculation of the effective send MSS.  RFC
      1191 describes Path MTU discovery.
   Trace file demonstrating it
      Trace file taken using tcpdump on host C.  The first trace
      demonstrates the fragmentation that occurs without path MTU
      discovery:

      13:55:25.488728 A.65528 > C.discard:
              P 567833:569273(1440) ack 1 win 17520
              <nop,nop,timestamp 3839 1026342>
              (frag 20828:1472@0+)
              (ttl 62, optlen=8 LSRR{B#} NOP)

      13:55:25.488943 A > C:
              (frag 20828:8@1472)
              (ttl 62, optlen=8 LSRR{B#} NOP)

      13:55:25.489052 C.discard > A.65528:
              . ack 566385 win 60816
              <nop,nop,timestamp 1026345 3839> (DF)
              (ttl 60, id 41266)

      Host A repeatedly sends 1440-octet data segments, but these are
      fragmented into two packets, one with 1432 octets of data, and
      another with 8 octets of data.

      The second trace demonstrates the failure to send any data
      segments, sometimes seen with hosts doing path MTU discovery:

      13:55:44.332219 A.65527 > C.discard:
              S 1018235390:1018235390(0) win 16384
              <mss 1460,nop,wscale 0,nop,nop,timestamp 3876 0> (DF)
              (ttl 62, id 20912, optlen=8 LSRR{B#} NOP)

      13:55:44.333015 C.discard > A.65527:
              S 1271629000:1271629000(0) ack 1018235391 win 60816
              <mss 1460,nop,wscale 0,nop,nop,timestamp 1026383 3876> (DF)
              (ttl 60, id 41427)

      13:55:44.333206 C.discard > A.65527:
              S 1271629000:1271629000(0) ack 1018235391 win 60816
              <mss 1460,nop,wscale 0,nop,nop,timestamp 1026383 3876> (DF)
              (ttl 60, id 41427)

      This is all of the activity seen on this connection.  Eventually
      host C will time out attempting to establish the connection.

   How to detect
      The "netcat" utility [Hobbit96] is useful for generating source
      routed packets:

      1% nc C discard
      (interactive typing)
      ^C
      2% nc C discard < /dev/zero
      ^C
      3% nc -g B C discard
      (interactive typing)
      ^C
      4% nc -g B C discard < /dev/zero
      ^C

      Lines 1 through 3 should generate appropriate packets, which can
      be verified using tcpdump.  If the problem is present, line 4
      should generate one of the two kinds of packet traces shown.

   How to fix
      The implementation should ensure that the effective send MSS
      calculation includes a term for the IP and TCP options, as
      mandated by RFC 1122.

3. Security Considerations

   This memo does not discuss any specific security-related TCP
   implementation problems, as the working group decided to pursue
   documenting those in a separate document.  Some of the implementation
   problems discussed here, however, can be used for denial-of-service
   attacks.  Those classified as congestion control present
   opportunities to subvert TCPs used for legitimate data transfer into
   excessively loading network elements.  Those classified as
   "performance", "reliability" and "resource management" may be
   exploitable for launching surreptitious denial-of-service attacks
   against the user of the TCP.  Both of these types of attacks can be
   extremely difficult to detect because in most respects they look
   identical to legitimate network traffic.

4. Acknowledgements

   Thanks to numerous correspondents on the tcp-impl mailing list for
   their input: Steve Alexander, Larry Backman, Jerry Chu, Alan Cox,
   Kevin Fall, Richard Fox, Jim Gettys, Rick Jones, Allison Mankin, Neal
   McBurnett, Perry Metzger, der Mouse, Thomas Narten, Andras Olah,
   Steve Parker, Francesco Potorti`, Luigi Rizzo, Allyn Romanow, Al
   Smith, Jerry Toporek, Joe Touch, and Curtis Villamizar.

   Thanks also to Josh Cohen for the traces documenting the "Failure to
   send a RST after Half Duplex Close" problem; and to John Polstra, who
   analyzed the "Window probe deadlock" problem.

5. References

   [Allman97]   M. Allman, "Fixing Two BSD TCP Bugs," Technical Report
                CR-204151, NASA Lewis Research Center, Oct. 1997.
                http://roland.grc.nasa.gov/~mallman/papers/bug.ps

   [RFC2414]    Allman, M., Floyd, S. and C. Partridge, "Increasing
                TCP's Initial Window", RFC 2414, September 1998.

   [RFC1122]    Braden, R., Editor, "Requirements for Internet Hosts --
                Communication Layers", STD 3, RFC 1122, October 1989.

   [RFC2119]    Bradner, S., "Key words for use in RFCs to Indicate
                Requirement Levels", BCP 14, RFC 2119, March 1997.

   [Brakmo95]   L. Brakmo and L. Peterson, "Performance Problems in
                BSD4.4 TCP," ACM Computer Communication Review,
                25(5):69-86, 1995.

   [RFC813]     Clark, D., "Window and Acknowledgement Strategy in TCP,"
                RFC 813, July 1982.

   [Dawson97]   S. Dawson, F. Jahanian, and T. Mitton, "Experiments on
                Six Commercial TCP Implementations Using a Software
                Fault Injection Tool," to appear in Software Practice &
                Experience, 1997.  A technical report version of this
                paper can be obtained at
                ftp://rtcl.eecs.umich.edu/outgoing/sdawson/CSE-TR-298-
                96.ps.gz.

   [Fall96]     K. Fall and S. Floyd, "Simulation-based Comparisons of
                Tahoe, Reno, and SACK TCP," ACM Computer Communication
                Review, 26(3):5-21, 1996.

   [Hobbit96]   Hobbit, Avian Research, netcat, available via anonymous
                ftp to ftp.avian.org, 1996.

   [Hoe96]      J. Hoe, "Improving the Start-up Behavior of a Congestion
                Control Scheme for TCP," Proc. SIGCOMM '96.

   [Jacobson88] V. Jacobson, "Congestion Avoidance and Control," Proc.
                SIGCOMM '88.  ftp://ftp.ee.lbl.gov/papers/congavoid.ps.Z

   [Jacobson89] V. Jacobson, C. Leres, and S. McCanne, tcpdump,
                available via anonymous ftp to ftp.ee.lbl.gov, Jun.
                1989.

   [RFC2018]    Mathis, M., Mahdavi, J., Floyd, S. and A. Romanow, "TCP
                Selective Acknowledgement Options", RFC 2018, October
                1996.

   [RFC1191]    Mogul, J. and S. Deering, "Path MTU discovery", RFC
                1191, November 1990.

   [RFC896]     Nagle, J., "Congestion Control in IP/TCP Internetworks",
                RFC 896, January 1984.

   [Paxson97]   V. Paxson, "Automated Packet Trace Analysis of TCP
                Implementations," Proc. SIGCOMM '97, available from
                ftp://ftp.ee.lbl.gov/papers/vp-tcpanaly-sigcomm97.ps.Z.

   [RFC793]     Postel, J., Editor, "Transmission Control Protocol," STD
                7, RFC 793, September 1981.

   [RFC2001]    Stevens, W., "TCP Slow Start, Congestion Avoidance, Fast
                Retransmit, and Fast Recovery Algorithms", RFC 2001,
                January 1997.

   [Stevens94]  W. Stevens, "TCP/IP Illustrated, Volume 1", Addison-
                Wesley Publishing Company, Reading, Massachusetts, 1994.

   [Wright95]   G. Wright and W. Stevens, "TCP/IP Illustrated, Volume
                2", Addison-Wesley Publishing Company, Reading,
                Massachusetts, 1995.

6. Authors' Addresses

   Vern Paxson
   ACIRI / ICSI
   1947 Center Street
   Suite 600
   Berkeley, CA 94704-1198

   Phone: +1 510/642-4274 x302
   EMail: vern@aciri.org

   Mark Allman <mallman@grc.nasa.gov>
   NASA Glenn Research Center/Sterling Software
   Lewis Field
   21000 Brookpark Road
   MS 54-2
   Cleveland, OH 44135
   USA

   Phone: +1 216/433-6586
   EMail: mallman@grc.nasa.gov

   Scott Dawson
   Real-Time Computing Laboratory
   EECS Building
   University of Michigan
   Ann Arbor, MI 48109-2122
   USA

   Phone: +1 313/763-5363
   EMail: sdawson@eecs.umich.edu

   William C. Fenner
   Xerox PARC
   3333 Coyote Hill Road
   Palo Alto, CA 94304
   USA

   Phone: +1 650/812-4816
   EMail: fenner@parc.xerox.com

   Jim Griner <jgriner@grc.nasa.gov>
   NASA Glenn Research Center
   Lewis Field
   21000 Brookpark Road
   MS 54-2
   Cleveland, OH 44135
   USA

   Phone: +1 216/433-5787
   EMail: jgriner@grc.nasa.gov

   Ian Heavens
   Spider Software Ltd.
   8 John's Place, Leith
   Edinburgh EH6 7EL
   UK

   Phone: +44 131/475-7015
   EMail: ian@spider.com

   Kevin Lahey
   NASA Ames Research Center/MRJ
   MS 258-6
   Moffett Field, CA 94035
   USA

   Phone: +1 650/604-4334
   EMail: kml@nas.nasa.gov

   Jeff Semke
   Pittsburgh Supercomputing Center
   4400 Fifth Ave
   Pittsburgh, PA 15213
   USA

   Phone: +1 412/268-4960
   EMail: semke@psc.edu

   Bernie Volz
   Process Software Corporation
   959 Concord Street
   Framingham, MA 01701
   USA

   Phone: +1 508/879-6994
   EMail: volz@process.com

7. Full Copyright Statement

   Copyright (C) The Internet Society (1999).  All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph are
   included on all such copies and derivative works.  However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.