TCP Selective Acknowledgement option (and related changes) for FreeBSD
The file sack.diffs
includes a number of modifications to TCP designed to improve
performance in the presence of losses, namely:
- MODIFIED FAST RETRANSMIT
When 3 or fewer packets are in transit, there is no chance that
Fast Retransmit can work. Given the small (16-32k) windows
commonly in use, and the size of ethernet packets,
this is a very common case at
connection establishment or after some losses.
With this change we lower the threshold for Fast Retransmit to 2
or 1 dup ack, depending on the number of outstanding packets.
This has proven to be very helpful on slow/lossy links.
- Newreno
Newreno implements an idea from J.Hoe, related to not shrinking
the send window in some cases after a Fast Retransmit.
- SACK (Selective Acknowledgement)
This implements the TCP Selective Acknowledgement options as
specified in RFC20xx. No change is made to the retransmit strategy.
- TSACK (Selective Acknowledgements in RFC1323 timestamps)
TSACK uses the value sent in RFC1323 timestamps as a cookie
which, when echoed back to the sender, serves as a
Selective Acknowledgement for a specific packet. It would not
require changes to the receiver, except that RFC1323 specifies
"Which timestamp to echo" in a way that I believe to be overly
conservative and which prevents TSACK from working. The modification
needed to make TSACK work is trivial (2 lines of code!).
Additionally, some cleanup of the TCP code is also present.
The software is in alpha stage, although it has been running for
a couple of weeks in intermediate forms, and it has been running on a
couple of our systems since Aug. 26, 1996.
MODIFIED FAST RETRANSMIT is really helpful on lossy links, and does
not need modifications at the receive side. Same for NEWRENO.
SACK (and/or TSACK), especially when combined with MODIFIED FAST
RETRANSMIT, can give some improvement in throughput, but only with
sufficiently large windows and a low loss rate.
The diffs are against FreeBSD 2.1R, although they should be easily
ported to other BSD-derived systems. Most options must be enabled in
the kernel config file via
option SPECIFICOPTIONNAME
and must also be switched on at run time via a sysctl variable before
they take effect.
Since this code is evolving, please check here
(http://www.iet.unipi.it/~luigi/research.html) to see
if there is a newer version. In particular, this code still has some
diagnostic output which goes to /var/log/messages.
Bugs, fixes and suggestions can be reported to me at
rizzo@iet.unipi.it
A brief description of all changes included in sack.diffs follows:
CLEANUP OF BSD CODE
- BSD code has some strange ways of updating the count of duplicate
acks. The count gets reset by some unexpected events (window
updates, old segments) and does not get checked/reset properly in
the header prediction code. A number of small fixes try to count
dupacks more consistently.
- added a flag, TF_FAST_RXMT, to indicate that we are in
fast retransmit/fast recovery. This is needed to support a different fast
retransmit policy, and makes the code somewhat easier to read.
- the count of retransmitted and dup bytes is now accumulated per
connection as well as globally. This is useful for statistical
purposes, and can be used later to determine if a connection is
experiencing losses or duplicate data.
The same ought to be done for retransmitted/dup packets.
- additional variables are added to tcpstat to count various
events.
MODIFIED FAST RETRANSMIT
- BSD enters fast retransmit when there are 3 consecutive duplicate
acks; the number 3 was chosen to reduce the chance that a reordering
of packets in the net is seen as a segment loss.
However, in the presence of heavy losses, or when the amount of
outstanding data is small, or the window is narrow, there are so few
packets in transit that 3 dupacks cannot happen, and the chance of a
reordering is low. In these situations, 1 or 2 dupacks almost certainly
mean that a segment has been lost. Instead of waiting for a timeout,
fast retransmit can be started earlier. This code identifies these
cases, and lowers the threshold for fast retransmit to 2 or 1 dupacks.
Note 1: in many cases (e.g. telnet, http), there are still a lot
of timeouts which occur after 0 dupacks, because often
there is only one segment in flight. We cannot do much about this.
Note 2: since the tcp control block accumulates statistics on the
amount of dup/retransmitted data, perhaps this behaviour can be made
more adaptive if the connection shows a significant reordering of
segments.
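As an illustration of the idea only, the threshold selection might be
sketched as below. The helper name dupack_threshold() and the exact
rule are assumptions, not the code in sack.diffs: the sketch relies on
the observation that if the first of N outstanding segments is lost,
at most N-1 dupacks can come back.

```c
#include <stdint.h>

#define DUPACK_THRESH_DEFAULT 3	/* standard BSD threshold */

/*
 * Hypothetical helper (not taken from sack.diffs): choose how many
 * duplicate acks trigger fast retransmit, given the bytes in flight
 * (snd_max - snd_una) and the maximum segment size.
 */
static int
dupack_threshold(uint32_t bytes_in_flight, uint32_t maxseg)
{
	uint32_t segs;

	if (maxseg == 0)
		return DUPACK_THRESH_DEFAULT;
	/* Round up to whole segments without overflowing. */
	segs = bytes_in_flight / maxseg + (bytes_in_flight % maxseg != 0);
	if (segs >= 4)
		return DUPACK_THRESH_DEFAULT;	/* 3 dupacks are possible */
	if (segs == 3)
		return 2;			/* at most 2 dupacks */
	return 1;	/* 2 or fewer segments: at most 1 dupack */
}
```

With a single segment in flight no dupack can arrive at all, so the
returned value of 1 never triggers; that is the timeout case of Note 1.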
NEWRENO (following a suggestion by J. Hoe)
In Reno, after a fast retransmit, a non-dup ack causes exit from
fast recovery. However, in case of multiple losses in the same
window, three more dupacks might be needed to detect each further
loss, and a subsequent fast retransmit would shrink the window even
further. We save the value of snd_max in snd_max_rxmt at the time of
the fast retransmit; then, if a non-dup ack does not advance snd_una
up to snd_max_rxmt, the segment at snd_una has been lost as well and
can be retransmitted immediately.
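The check described above can be sketched roughly as follows. Only
snd_una, snd_max and snd_max_rxmt are names from the patch; the struct
nr_state, the in_fast_rxmt field (standing in for the TF_FAST_RXMT
flag) and both functions are made up for this example.

```c
#include <stdint.h>

typedef uint32_t tcp_seq;
/* Sequence comparison that is safe across 32-bit wraparound. */
#define SEQ_LT(a, b)	((int32_t)((a) - (b)) < 0)

/* Hypothetical, minimal slice of the tcp control block. */
struct nr_state {
	tcp_seq	snd_una;	/* oldest unacknowledged byte */
	tcp_seq	snd_max;	/* highest sequence number sent */
	tcp_seq	snd_max_rxmt;	/* snd_max saved at fast retransmit */
	int	in_fast_rxmt;	/* stands in for TF_FAST_RXMT */
};

/* Called when fast retransmit starts. */
static void
nr_enter_fast_rxmt(struct nr_state *tp)
{
	tp->snd_max_rxmt = tp->snd_max;
	tp->in_fast_rxmt = 1;
}

/*
 * Called for a non-dup ack. Returns 1 if the segment now at snd_una
 * should be retransmitted immediately (partial ack), 0 otherwise.
 */
static int
nr_new_ack(struct nr_state *tp, tcp_seq ack)
{
	tp->snd_una = ack;
	if (!tp->in_fast_rxmt)
		return 0;
	if (SEQ_LT(ack, tp->snd_max_rxmt))
		return 1;	/* another loss in the same window */
	tp->in_fast_rxmt = 0;	/* whole window acked: recovery done */
	return 0;
}
```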
SACK
This is an implementation of the SACK options as described in the
recent internet draft, with which it is fully compliant. The maximum
lifetime of SACK can be set to 0 or more timeouts. The retransmission
strategy, during fast recovery, is as follows: if new data can
be sent within snd_wnd and snd_cwnd, then do it. Otherwise, old
blocks (up to, but not beyond, the last SACKed block) are sent
again. There is currently no provision to resend the block at snd_una
if this has been lost twice (a solution is in the works).
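The order of the two choices above can be sketched as a small decision
function. This is an illustration under assumptions, not the patch
code: only snd_una, snd_wnd and snd_cwnd are names used in the text;
the struct, the fields unsent, rxmt_next and last_sacked, and the
function itself are hypothetical.

```c
#include <stdint.h>

typedef uint32_t tcp_seq;
#define SEQ_LT(a, b)	((int32_t)((a) - (b)) < 0)

enum sack_action { SACK_SEND_NEW, SACK_RESEND_OLD, SACK_IDLE };

/* Hypothetical view of the sender state during fast recovery. */
struct sack_sender {
	tcp_seq	 snd_una;	/* oldest unacknowledged byte */
	tcp_seq	 snd_max;	/* highest byte sent so far */
	uint32_t snd_wnd;	/* receiver's advertised window */
	uint32_t snd_cwnd;	/* congestion window */
	uint32_t unsent;	/* bytes queued, never sent (assumed) */
	tcp_seq	 rxmt_next;	/* next unSACKed byte to resend (assumed) */
	tcp_seq	 last_sacked;	/* end of the last SACKed block (assumed) */
};

static enum sack_action
sack_recovery_action(const struct sack_sender *tp)
{
	uint32_t win = tp->snd_wnd < tp->snd_cwnd ? tp->snd_wnd : tp->snd_cwnd;
	uint32_t in_flight = tp->snd_max - tp->snd_una;

	/* First choice: new data, if both windows leave room for it. */
	if (tp->unsent > 0 && in_flight < win)
		return SACK_SEND_NEW;
	/* Otherwise resend old data, never past the last SACKed block. */
	if (SEQ_LT(tp->rxmt_next, tp->last_sacked))
		return SACK_RESEND_OLD;
	return SACK_IDLE;
}
```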
TSACK
This is a simplified version of SACK, which carries SACK information
embedded in slightly modified RFC1323 timestamps. There are some
tradeoffs in using TSACKs (almost no need for receiver support, less
precise SACKs) instead of SACKs, but TSACKs have some advantages over
SACKs in some cases.
ARTIFICIAL LOSSES
In order to test the behaviour of the above code, there is a new
function, tcp_dropit(), which allows some incoming data and ack
packets to be dropped. Currently the drop rate is 10% for data
segments, 5% for pure acks. Segments are dropped using a repetitive
pattern of 499 segments, in order to make results a bit more
reproducible (they aren't fully reproducible anyway, because the actual
generation of ACKs depends on the behaviour of the receiver process
and there is some interaction with timeouts).
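A repetitive drop pattern of this kind could look like the sketch
below. It is not the body of tcp_dropit(): the function name, the
per-kind counters and the "every 10th / every 20th" rule are
assumptions chosen to land near the 10% and 5% rates over a period of
499 segments.

```c
#define DROP_PERIOD	499	/* pattern length, from the text above */
#define DROP_EVERY_DATA	10	/* about 10% of data segments (assumed) */
#define DROP_EVERY_ACK	20	/* about 5% of pure acks (assumed) */

/* Hypothetical state: position in the pattern for each segment kind. */
struct drop_state {
	unsigned data_pos;
	unsigned ack_pos;
};

/* Returns 1 if this incoming segment should be dropped, 0 otherwise. */
static int
drop_pattern_hit(struct drop_state *st, int is_pure_ack)
{
	unsigned *pos = is_pure_ack ? &st->ack_pos : &st->data_pos;
	unsigned every = is_pure_ack ? DROP_EVERY_ACK : DROP_EVERY_DATA;
	int drop = (*pos % every) == every - 1;

	/* Wrap around so the same pattern repeats every DROP_PERIOD. */
	*pos = (*pos + 1) % DROP_PERIOD;
	return drop;
}
```

Because the pattern depends only on a counter, two runs that deliver
the same sequence of segments drop the same ones; as noted above, the
segment sequence itself still varies from run to run.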
All the above mechanisms can be enabled by setting the variable
net.inet.tcp.sack
as follows:
SACK lifetime   0..15   (0 and 1 are equivalent)
SACK            0x10    enables SACK negotiation and processing
TSACK           0x20    enables TSACK generation
MODIFIED_FR     0x40    enables modified fast retransmit
NEWRENO         0x80    enables newreno
LOSSY           0x100   enables dropping incoming data/acks
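As a worked example of combining these bits, the sketch below builds a
value for net.inet.tcp.sack. The macro names and the helper are made
up for illustration (they are not the identifiers in tcp_var.h), and
placing the lifetime in the low 4 bits is inferred from its 0..15
range.

```c
/* Hypothetical names for the bits listed above. */
#define SACKCTL_LIFETIME_MASK	0x00f	/* SACK lifetime, 0..15 */
#define SACKCTL_SACK		0x010
#define SACKCTL_TSACK		0x020
#define SACKCTL_MODIFIED_FR	0x040
#define SACKCTL_NEWRENO		0x080
#define SACKCTL_LOSSY		0x100

/* Combine a lifetime and a set of feature bits into one value. */
static unsigned
sackctl_value(unsigned lifetime, unsigned flags)
{
	return (lifetime & SACKCTL_LIFETIME_MASK) | flags;
}
```

For instance, SACK with a lifetime of 2 plus modified fast retransmit
and newreno gives 0x10 + 0x40 + 0x80 + 2 = 0xd2 (210 decimal), which
would be set with something like
sysctl -w net.inet.tcp.sack=210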
The following kernel options are needed:
option TSACK    enables TSACK generation
option SACK     enables SACK code, TSACK processing, LOSSY
Newreno and modified fast retransmit are compiled in by default.
You might also need the following changes to sysctl and netstat. The former
needs to be recompiled with the new tcp_var.h. The patch below just
allows you to enter values as hex numbers as well as decimal ones.
The patch to netstat (which also needs to be recompiled) is there to
allow you to see the additional statistic variables in the tcpstat
structure. Since these variables are allocated at the bottom of the
structure, an older netstat will still work; it just won't print all
the available information.
diff -cbwr /usr.sbin/sysctl/sysctl.c ./sysctl.c
*** /cdrom/usr/src/usr.sbin/sysctl/sysctl.c Sun Jun 11 06:32:58 1995
--- ./sysctl.c Mon Aug 19 16:28:31 1996
***************
*** 342,348 ****
if (newsize > 0) {
switch (type) {
case CTLTYPE_INT:
! intval = atoi(newval);
newval = &intval;
newsize = sizeof intval;
break;
--- 342,349 ----
if (newsize > 0) {
switch (type) {
case CTLTYPE_INT:
! sscanf(newval, "%i", &intval); /* XXX */
! /* intval = atoi(newval); */
newval = &intval;
newsize = sizeof intval;
break;
diff -cbwr netstat/inet.c /usr/src/usr.bin/netstat/inet.c
*** netstat/inet.c Sat Jul 29 11:42:54 1995
--- /usr/src/usr.bin/netstat/inet.c Fri Aug 23 17:02:49 1996
***************
*** 227,233 ****
--- 227,243 ----
p(tcps_conndrops, "\t%d embryonic connection%s dropped\n");
p2(tcps_rttupdated, tcps_segstimed,
"\t%d segment%s updated rtt (of %d attempt%s)\n");
+ p(tcps_zerodupw, "\t%d invalid dupack reset on window update\n");
p(tcps_rexmttimeo, "\t%d retransmit timeout%s\n");
+ p(tcps_rexmt[0], "\t\t%d retransmit timeout with 0 dup acks\n");
+ p(tcps_rexmt[1], "\t\t%d retransmit timeout with 1 dup acks\n");
+ p(tcps_rexmt[2], "\t\t%d retransmit timeout with 2 dup acks\n");
+ p(tcps_fastretransmit, "\t%d fast retransmit%s\n");
+ p(tcps_fastrexmt[0], "\t\t%d with 1 dup ack\n");
+ p(tcps_fastrexmt[1], "\t\t%d with 2 dup ack\n");
+ p(tcps_fastrexmt[2], "\t\t%d with 3 dup ack\n");
+ p(tcps_newreno, "\t%d newreno retrans\n");
+ p(tcps_fastrecovery, "\t%d fast recovery\n");
p(tcps_timeoutdrop, "\t\t%d connection%s dropped by rexmit timeout\n");
p(tcps_persisttimeo, "\t%d persist timeout%s\n");
p(tcps_persistdrop, "\t\t%d connection%s dropped by persist timeout\n");
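
The effect of the sysctl patch above can be checked in isolation with
a small sketch. The two helper functions are hypothetical wrappers
written for this example; only the atoi() and sscanf("%i") calls
mirror the patch.

```c
#include <stdio.h>
#include <stdlib.h>

/* Old behaviour: atoi() stops at the 'x', so "0xd2" parses as 0. */
static int
parse_intval_old(const char *newval)
{
	return atoi(newval);
}

/*
 * New behaviour: "%i" accepts decimal, hex (0x prefix) and octal
 * (leading 0). Returns 1 on success, 0 if nothing could be parsed.
 */
static int
parse_intval_new(const char *newval, int *intval)
{
	return sscanf(newval, "%i", intval) == 1;
}
```

One side effect worth keeping in mind: with "%i" a leading zero selects
octal, so a value typed as 010 is read as 8, not 10.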