“Once you eliminate the impossible, whatever remains, no matter how improbable, must be the truth.” — Sherlock Holmes
Intro
It’s not every day that you get to debug what may well be a packet of death. It was certainly the first time for me.
What do I mean by “a packet of death”? A software bug where the network stack crashes in reaction to a single received network packet, taking down the whole operating system with it. Like in the well-known case of the Windows ping of death.
Challenge accepted.
It starts with an oops
Around a year ago we started seeing kernel crashes in the Linux IPv4 stack. Servers were crashing sporadically, but we learned the hard way never to ignore cases like that — when possible, we always trace crashes. We also couldn’t tie the crashes to a particular kernel version, which might have indicated a regression that could hopefully be tracked down to a single faulty change in the Linux kernel.
The crashed servers were leaving behind only a crash report, affectionately known as a “kernel oops”. Let’s take a look at it and go over what information we have there.
Parts of the oops, like offsets into functions, need to be decoded in order to be human-readable. Fortunately Linux comes with the decode_stacktrace.sh script that did the work for us.
All we need to do is install the kernel debug and source packages before running the script. We will use the latest version of the script, as it has been significantly improved since Linux v5.4 came out.
$ RELEASE=`uname -r`
$ apt install linux-image-$RELEASE-dbg linux-source-$RELEASE
$ curl -sLO https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/scripts/decode_stacktrace.sh
$ curl -sLO https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/plain/scripts/decodecode
$ chmod +x decode_stacktrace.sh decodecode
$ ./decode_stacktrace.sh -r 5.4.14-cloudflare-2020.1.11 < oops.txt > oops-decoded.txt
When decoded, the oops report is even longer than before! But that is a good thing. There is new information there that can help us.
What has happened?
With this much input we can start sketching a picture of what could have happened. First thing to check: where exactly did we crash?
The report points at line 5160 in the skb_gso_transport_seglen() function. If we take a look at the source code, we can get a rough idea of what happens there. We are processing a Generic Segmentation Offload (GSO) packet carrying an encapsulated TCP packet. What is a GSO packet? In this context, it's a batch of consecutive TCP segments travelling through the network stack together to amortize the processing cost. We will take a closer look at GSO later.
net/core/skbuff.c:
5150) static unsigned int skb_gso_transport_seglen(const struct sk_buff *skb)
5151) {
…
5155) if (skb->encapsulation) {
…
5159) if (likely(shinfo->gso_type & (SKB_GSO_TCPV4 | SKB_GSO_TCPV6)))
5160) thlen += inner_tcp_hdrlen(skb); // <-- we crash here
5161) } else if (…) {
…
5172) return thlen + shinfo->gso_size;
5173) }
The exact line where we crashed belongs to an if-branch that handles tunnel traffic. It calculates the length of the TCP header of the inner, that is the encapsulated, packet. We do that to compute the length of the outer L4 segment, which accounts for the inner packet length.
To understand how the length of the inner TCP header is computed we have to peel off a few layers of inlined function calls:
inner_tcp_hdrlen(skb)
⇓
inner_tcp_hdr(skb)->doff * 4
⇓
((struct tcphdr *)skb_inner_transport_header(skb))->doff * 4
⇓
((struct tcphdr *)(skb->head + skb->inner_transport_header))->doff * 4
Now it is clear that inner_tcp_hdrlen(skb) simply reads the Data Offset field (doff) inside the inner TCP header. Because Data Offset carries the number of 32-bit words in the TCP header, we multiply it by 4 to get the TCP header length in bytes.
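To make that concrete, here is a rough user-space sketch of the same read. The struct is a simplified stand-in for the relevant sk_buff fields, not the real definition; the Data Offset happens to live in the upper 4 bits of byte 12 of the TCP header.

#include <stdint.h>

/* Simplified stand-in for the two struct sk_buff fields involved. */
struct fake_skb {
        unsigned char *head;              /* start of the packet buffer     */
        uint16_t inner_transport_header;  /* offset of the inner TCP header */
};

/* What inner_tcp_hdrlen() boils down to: find the inner TCP header, read
 * byte 12, take its upper 4 bits (Data Offset), and multiply by 4.
 */
static unsigned int fake_inner_tcp_hdrlen(const struct fake_skb *skb)
{
        const unsigned char *inner_tcph = skb->head + skb->inner_transport_header;

        return (inner_tcph[12] >> 4) * 4;
}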
From the memory access point of view, to read the Data Offset value we need to:
1. load skb->head value from address skb + offsetof(struct sk_buff, head),
2. load skb->inner_transport_header value from address skb + offsetof(struct sk_buff, inner_transport_header),
3. load the TCP Data Offset from skb->head + skb->inner_transport_header + offsetof(struct tcphdr, doff).
Potentially, any of these loads could trigger a page fault. But it’s unlikely that skb contains an invalid address, since we accessed the skb->encapsulation field without crashing just a few lines earlier. Our main suspect is the last load.
The invalid memory address we attempt to load from should be in one of the CPU registers at the time of the exception. And we have the CPU register snapshot in the oops report. Which register holds the address? That has been decided by the compiler. We will need to take a look at the instruction stream to discover that.
Remember the disassembly in the decoded kernel oops? Now is the time to go back to it. Hint: it’s in AT&T syntax. But to give everyone a fair chance to follow along, here’s the same disassembly in Intel syntax. (Alright, alright. You caught me. I just can’t read AT&T syntax.)
All code
========
0: c0 41 83 e0 rol BYTE PTR [rcx-0x7d],0xe0
4: 11 f6 adc esi,esi
6: 87 81 00 00 00 20 xchg DWORD PTR [rcx+0x20000000],eax
c: 74 30 je 0x3e
e: 0f b7 87 aa 00 00 00 movzx eax,WORD PTR [rdi+0xaa]
15: 0f b7 b7 b2 00 00 00 movzx esi,WORD PTR [rdi+0xb2]
1c: 48 01 c1 add rcx,rax
1f: 48 29 f0 sub rax,rsi
22: 45 85 c0 test r8d,r8d
25: 48 89 c6 mov rsi,rax
28: 74 0d je 0x37
2a:* 0f b6 41 0c movzx eax,BYTE PTR [rcx+0xc] <-- trapping instruction
2e: c0 e8 04 shr al,0x4
31: 0f b6 c0 movzx eax,al
34: 8d 04 86 lea eax,[rsi+rax*4]
37: 0f b7 52 04 movzx edx,WORD PTR [rdx+0x4]
3b: 01 d0 add eax,edx
3d: c3 ret
3e: 45 rex.RB
3f: 85 .byte 0x85
Code starting with the faulting instruction
===========================================
0: 0f b6 41 0c movzx eax,BYTE PTR [rcx+0xc]
4: c0 e8 04 shr al,0x4
7: 0f b6 c0 movzx eax,al
a: 8d 04 86 lea eax,[rsi+rax*4]
d: 0f b7 52 04 movzx edx,WORD PTR [rdx+0x4]
11: 01 d0 add eax,edx
13: c3 ret
14: 45 rex.RB
15: 85 .byte 0x85
When the trapped page fault happened, we tried to load from address %rcx + 0xc, or 12 bytes from whatever memory location %rcx held. Which is hardly a coincidence, since the Data Offset field is 12 bytes into the TCP header.

This means that %rcx holds the computed skb->head + skb->inner_transport_header address. Let’s take a look at it:
RSP: 0018:ffffa4740d344ba0 EFLAGS: 00010202
RAX: 000000000000feda RBX: ffff9d982becc900 RCX: ffff9d9624bbaffc
RDX: ffff9d9624babec0 RSI: 000000000000feda RDI: ffff9d982becc900
…
The RCX value doesn’t look particularly suspicious. We can say that:
- it’s in a kernel virtual address space, because it is greater than 0xffff000000000000 (expected), and
- it is very close to the 4 KiB page boundary (0xffff9d9624bbb000 - 4),

... and not much more.
We must go back further in the instruction stream. Where did the value in %rcx come from? What I like to do is try to correlate the machine code leading up to the crash with pseudo source code:
<function entry> # %rdi = skb
…
movzx eax,WORD PTR [rdi+0xaa] # %eax = skb->inner_transport_header
movzx esi,WORD PTR [rdi+0xb2] # %esi = skb->transport_header
add rcx,rax # %rcx = skb->head + skb->inner_transport_header
sub rax,rsi # %rax = skb->inner_transport_header - skb->transport_header
test r8d,r8d
mov rsi,rax # %rsi = skb->inner_transport_header - skb->transport_header
je 0x37
movzx eax,BYTE PTR [rcx+0xc] # %eax = *(skb->head + skb->inner_transport_header + offsetof(struct tcphdr, doff))
How did I decode that assembly snippet? We know that the skb address was passed to our function in the %rdi register, because the System V AMD64 ABI calling convention dictates that. If the %rdi register hasn’t been clobbered by any function calls, or reused because the compiler decided so, then maybe, just maybe, it still holds the skb address.
If 0xaa and 0xb2 are offsets into an sk_buff structure, then the pahole tool can tell us which fields they correspond to:
$ pahole --hex -C sk_buff /usr/lib/debug/vmlinux-5.4.14-cloudflare-2020.1.11 | grep '\(head\|inner_transport_header\|transport_header\);'
__u16 inner_transport_header; /* 0xaa 0x2 */
__u16 transport_header; /* 0xb2 0x2 */
unsigned char * head; /* 0xc0 0x8 */
To confirm our guesswork, we can disassemble the whole function in gdb.
It would be great to find out the values of the inner_transport_header and transport_header offsets. But the registers that were holding them, %rax and %rsi, respectively, were reused after the offset values were loaded.

However, we can still examine the difference between inner_transport_header and transport_header that both %rax and %rsi hold. Let’s take a look.
The suspicious offset
Here are the register values from the oops as a reminder:
RAX: 000000000000feda RBX: ffff9d982becc900 RCX: ffff9d9624bbaffc
RDX: ffff9d9624babec0 RSI: 000000000000feda RDI: ffff9d982becc900
From the register snapshot we can tell that:
%rax = %rsi = skb->inner_transport_header - skb->transport_header = 0xfeda = 65242
That is clearly suspicious. We expect that skb->transport_header < skb->inner_transport_header, so either:
- skb->inner_transport_header > 0xfeda, which would mean that between the outer and inner L4 packets there is 65k+ bytes worth of headers (unlikely), or
- 0xfeda is a garbage value, perhaps an effect of an underflow, if skb->inner_transport_header < skb->transport_header.
Let’s entertain the theory that an underflow has occurred.
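Both offsets are 16-bit fields in struct sk_buff (as the pahole output above shows), so a “negative” result wraps around when it is stored. Here is a minimal user-space illustration, with made-up offset values:

#include <stdio.h>

int main(void)
{
        /* Made-up offsets, chosen only to show the wrap-around: the inner
         * transport header ends up *before* the outer one.
         */
        unsigned short transport_header = 136;
        unsigned short inner_transport_header = 100;

        /* The subtraction itself happens as int (-36), but storing the
         * result in a 16-bit field wraps it modulo 2^16.
         */
        unsigned short diff = inner_transport_header - transport_header;

        printf("%u (0x%04x)\n", (unsigned)diff, (unsigned)diff);
        /* prints: 65500 (0xffdc) */

        return 0;
}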
Any other scenario, be it an out-of-bounds write or a use-after-free that corrupted the memory, is a scary prospect, one where we don’t stand much chance of debugging without help from tools like a KASAN report.
But if we assume for a moment that it’s an underflow, then the task is simple. We “just” need to audit all the places where the skb->inner_transport_header or skb->transport_header offsets could have been updated while the skb buffer travelled through the network stack.
That raises the question — what path did the packet take through the network stack before it brought the machine down?
Packet path
It is time to take a look at the call trace in the oops report. If we walk through it, it is apparent that a veth device received a packet. The packet then got routed and forwarded to some other network device. The kernel crashed before the egress device transmitted the packet out.
What immediately draws our attention is the veth_poll() function in the call trace. Polling inside a virtual device that acts as a simple pipe joining two network namespaces together? Puzzling!
The regular operation mode of a veth device is that transmission of a packet from one side of a veth-pair results in immediate, in-line reception of the packet by the other side of the pair, on the same CPU. There shouldn't be any polling, context switches, or the like.
However, in Linux v4.19 the veth driver gained support for native mode eXpress Data Path (XDP). XDP relies on NAPI, an interface between network drivers and the Linux network stack. NAPI requires that drivers register a poll() callback for fetching received packets.
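To show the contract, here is a generic NAPI sketch with a hypothetical driver and made-up names (my_poll, my_priv), not the actual veth code: the driver registers a poll() callback, and the stack later calls it to fetch up to a budget of received packets in one batch.

#include <linux/netdevice.h>

struct my_priv {
        struct napi_struct napi;
};

static int my_poll(struct napi_struct *napi, int budget)
{
        int done = 0;

        /* ... pull up to `budget` packets off the ring and hand them to the
         * stack, e.g. via napi_gro_receive(napi, skb) ... */

        if (done < budget)
                napi_complete_done(napi, done);

        return done;
}

static void my_napi_setup(struct net_device *netdev, struct my_priv *priv)
{
        /* Linux v5.4-era signature; newer kernels dropped the weight argument. */
        netif_napi_add(netdev, &priv->napi, my_poll, NAPI_POLL_WEIGHT);
        napi_enable(&priv->napi);
}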
The NAPI receive path in the veth driver is taken only when there is an XDP program attached. The fork occurs in veth_forward_skb, where the TX path ends and an RX path on the other side begins.
This is an important observation, because it is only on the NAPI/XDP path in the veth driver that received packets might get aggregated by Generic Receive Offload.
Super-packets
Early on we noted that the crash happens when processing a GSO packet. I promised we would get back to it, and now is the time.
Generic Segmentation Offload (GSO) is all about delaying the L4 segmentation process until the very last moment. So-called super-packets, which exceed the egress route MTU in size, travel all the way through the network stack, only to be cut into MTU-sized segments just before handing the data over to the network driver for transmission. This way we process just one big packet on the transmit path instead of a few smaller ones, and save on CPU cycles in all the IP-level stack functions like routing, nftables, or traffic control.
Where do these super-packets come from? They can be the result of a large write to a socket, or, as in our case, they can be received from one network and forwarded to another network.
The latter case, that is forwarding a super-packet, happens when Generic Receive Offload (GRO) kicks in during receive. GRO is the opposite process of GSO. Smaller, MTU-sized packets get merged to form a super-packet early on the receive path. The goal is the same — process less by pushing just one packet through the network stack layers.
Not just any packets can be fused together by GRO. Loosely speaking, any two packets to be merged must form a logical sequence in the network flow, and carry the same metadata in protocol headers. It is critical that no information is lost in the aggregation process. Otherwise, GSO won’t be able to reconstruct the segment stream when serializing packets in the network card transmission code.
To this end, each network protocol that supports GRO provides a callback which signals whether the above conditions hold true. The GRO implementation (dev_gro_receive()) then walks through the packet headers, the outer as well as the inner ones, and delegates the pre-merge check to the right protocol callback. If all the stars align, the packets get spliced at the end of the callback chain (skb_gro_receive()).
I will be frank. The code that performs GRO is pretty complex, and I spent a significant amount of time staring into it. Hat tip to its authors. However, for our little investigation it will be enough to understand that a TCP stream encapsulated with GRE¹ would trigger a callback chain like so:
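Roughly, the receive-side chain for such a packet would be (a sketch that mirrors the gro_complete chain we will see later; exact entry points can differ between kernel versions):

dev_gro_receive → inet_gro_receive → gre_gro_receive → inet_gro_receive → tcp4_gro_receive → tcp_gro_receive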
Armed with basic GRO/GSO understanding we are ready to take a shot at reproducing the crash.
The reproducer
Let’s recap what we know:
- a super-packet was received from a veth device,
- the veth device had an XDP program attached,
- the packet was forwarded to another device,
- the egress device was transmitting a GSO super-packet,
- the packet was encapsulated,
- the super-packet must have been produced by GRO on ingress.
This paints a pretty clear picture of what the setup should look like.
We can work with that. A simple shell script will be our setup machinery.
We will be sending traffic from 10.1.1.1 to 10.2.2.2. Our traffic pattern will be a TCP stream consisting of two consecutive segments so that GRO can merge something. A Scapy script will be great for that. Let’s call it send-a-pair.py and give it a run:
$ { sleep 5; sudo ip netns exec A ./send-a-pair.py; } &
[1] 1603
$ sudo ip netns exec B tcpdump -i BA -n -nn -ttt 'ip and not arp'
…
00:00:00.020506 IP 10.1.1.1 > 10.2.2.2: GREv0, length 1480: IP 192.168.1.1.12345 > 192.168.2.2.443: Flags [.], seq 0:1436, ack 1, win 8192, length 1436
00:00:00.000082 IP 10.1.1.1 > 10.2.2.2: GREv0, length 1480: IP 192.168.1.1.12345 > 192.168.2.2.443: Flags [.], seq 1436:2872, ack 1, win 8192, length 1436
Where is our super-packet? Look at the packet sizes: GRO didn’t merge anything.
Turns out NAPI is just too fast at fetching the packets from the Rx ring. We need a little buffering on transmit to increase our chances of GRO batching:
# Help GRO
ip netns exec A tc qdisc add dev AB root netem delay 200us slot 5ms 10ms packets 2 bytes 64k
With the delay in place, things look better:
00:00:00.016972 IP 10.1.1.1 > 10.2.2.2: GREv0, length 2916: IP 192.168.1.1.12345 > 192.168.2.2.443: Flags [.], seq 0:2872, ack 1, win 8192, length 2872
The 2,872 bytes shown by tcpdump clearly indicate GRO in action. And we are even hitting the crash point:
$ sudo bpftrace -e 'kprobe:skb_gso_transport_seglen { print(kstack()); }' -c '/usr/bin/ip netns exec A ./send-a-pair.py'
Attaching 1 probe...
skb_gso_transport_seglen+1
skb_gso_validate_network_len+17
__ip_finish_output+293
ip_output+113
ip_forward+876
ip_rcv+188
__netif_receive_skb_one_core+128
netif_receive_skb_internal+47
napi_gro_flush+151
napi_complete_done+183
veth_poll+1697
net_rx_action+314
…
^C
…but we are not crashing. We will need to dig deeper.
We know what packet metadata skb_gso_transport_seglen() looks at — the header offsets, then the encapsulation flag, and the GSO info. Let’s dump all of it:
$ sudo bpftrace ./why-no-crash.bt -c '/usr/bin/ip netns exec A ./send-a-pair.py'
Attaching 2 probes...
DEV LEN NH TH ENC INH ITH GSO SIZE SEGS TYPE FUNC
sink 2936 270 290 1 294 254 | 1436 2 0x41 skb_gso_transport_seglen
Since the skb->encapsulation flag (ENC) is set, both outer and inner header offsets should be valid. Are they?
The outer network / L3 header (NH) looks sane. When XDP is enabled, it reserves 256 bytes of headroom before the headers. The 14-byte-long Ethernet header follows the headroom. The IPv4 header should then start 270 bytes into the packet buffer.
The outer transport / L4 header offset is as expected as well. The IPv4 header takes 20 bytes, and the GRE header follows it.
The inner network header (INH) begins at the offset of 294 bytes. This makes sense, because the GRE header in its most basic form is 4 bytes long.
The surprise comes last. The inner transport header offset (ITH) points somewhere near the end of the headroom that XDP reserves. Instead, it should be 314, right after the inner IPv4 header.
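Putting the expected offsets next to the values from the trace:

256 (XDP headroom) + 14 (Ethernet)  = 270  → outer network header (NH), observed 270
270 + 20 (outer IPv4)               = 290  → outer transport header (TH), observed 290
290 + 4 (GRE)                       = 294  → inner network header (INH), observed 294
294 + 20 (inner IPv4)               = 314  → inner transport header (ITH), observed 254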
Is this the smoking gun we were looking for?
The bug
skb_gso_transport_seglen() calculates the length of the outer L4 segment when given a GSO packet. If the inner_transport_header offset is off, then the result of the calculation might be off as well. Worth checking.
We know that our segments are 1500 bytes long. That makes the L4 part 1480 bytes long. What does skb_gso_transport_seglen() say though?
$ sudo bpftrace -e 'kretprobe:skb_gso_transport_seglen { print(retval); }' -c …
Attaching 1 probe...
1460
Seems that we don’t agree. But if skb_gso_transport_seglen() is getting garbage on input we can’t really blame it.
If inner_transport_header is not correct, the TCP Data Offset read that we know happens inside the function cannot end well.
If we map it out, it looks like we are loading part of the source MAC address (upper 4 bits of the 5th byte, to be precise) and interpreting it as TCP Data Offset.
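A quick sanity check of that mapping, using the offsets we already know:

faulting read = ITH + offsetof(struct tcphdr, doff) = 254 + 12 = 266
the Ethernet header starts at 256 (right after the XDP headroom), so the source MAC occupies bytes 262 through 267
266 - 262 = 4, i.e. the 5th byte of the source MAC, and the shift right by 4 keeps its upper 4 bits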
Are we? There is an easy way to check.
If we ask nicely, tcpdump will tell us what the MAC addresses are (the -e flag prints the link-level headers).
Plugging that into the calculations that skb_gso_transport_seglen() does…
thlen = inner_transport_header(skb) - transport_header(skb) = 254 - 290 = -36
thlen += inner_tcp_hdr(skb)->doff * 4 = -36 + (0xf * 4) = -36 + 60 = 24
retval = gso_size + thlen = 1436 + 24 = 1460
…checks out!
Does this mean that I can control the return value by setting the source MAC address?!
$ sudo ip -n A link set AB address be:d6:07:5e:05:11 # Change the MAC address
$ sudo bpftrace -e 'kretprobe:skb_gso_transport_seglen { print(retval); }' -c …
Attaching 1 probe...
1400
Yes! The 5th byte of the new source MAC is 0x05, so its upper 4 bits are zero: 1436 + (-36) + (0 * 4) = 1400. This is it.
However, how does all this tie back to the original crash? The badly calculated L4 segment length will make GSO emit shorter segments on egress. But that’s all.
Remember the suspicious offset from the crash report?
%rax = %rsi = skb->inner_transport_header - skb->transport_header = 0xfeda = 65242
We now know that skb->transport_header should be 290. That makes skb->inner_transport_header = 65242 + 290 = 65532 = 0xfffc.

Which means that when we triggered the page fault we were trying to load memory from a location at

skb->head + skb->inner_transport_header + offsetof(tcphdr, doff) = skb->head + 0xfffc + 12 = 0xffff9d9624bbb008

Solving it for skb->head yields 0xffff9d9624bbb008 - 0xfffc - 12 = 0xffff9d9624bab000.
And this makes sense. The skb->head buffer is page-aligned, meaning it’s a multiple of 4 KiB on x86-64 platforms — the 12 least significant bits of the address are 0.
However, the address we were trying to read was (0xfffc+12)/4096 ~= 16 pages (or 64 KiB) past the skb->head page boundary (0xffff9d9624babfff).
Who knows if there was memory mapped to this address?! Looks like from time to time there wasn’t anything mapped there and the kernel page fault handling code went “oops!”.
The fix
It is finally time to understand who sets the header offsets in a super-packet.
Once GRO is done merging segments, it flushes the super-packet down the pipe by kicking off a chain of gro_complete callbacks:
napi_gro_complete → inet_gro_complete → gre_gro_complete → inet_gro_complete → tcp4_gro_complete → tcp_gro_complete
These callbacks are responsible for updating the header offsets and populating the GSO-related fields in the skb_shared_info struct. Later on, the transmit side will consume this data.
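For reference, here is a trimmed-down sketch of just those GSO-related fields as our trace prints them (the real definition lives in include/linux/skbuff.h; this is not the full struct):

/* Sketch: only the GSO metadata that our bpftrace output shows as the
 * SIZE, SEGS and TYPE columns.
 */
struct skb_shared_info_gso_fields {
        unsigned short gso_size;   /* payload bytes per segment (SIZE)  */
        unsigned short gso_segs;   /* number of merged segments (SEGS)  */
        unsigned int   gso_type;   /* SKB_GSO_* flag bits (TYPE)        */
};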
Let’s see how the packet metadata changes as it travels through the gro_complete callbacks² by adding a few more tracepoints to our bpftrace script:
$ sudo bpftrace ./why-no-crash.bt -c '/usr/bin/ip netns exec A ./send-a-pair.py'
Attaching 7 probes...
DEV LEN NH TH ENC INH ITH GSO SIZE SEGS TYPE FUNC
BA 2936 294 314 0 254 254 | 1436 0 0x00 napi_gro_complete
BA 2936 294 314 0 254 254 | 1436 0 0x00 inet_gro_complete
BA 2936 294 314 0 254 254 | 1436 0 0x00 gre_gro_complete
BA 2936 294 314 1 254 254 | 1436 0 0x40 inet_gro_complete
BA 2936 294 314 1 294 254 | 1436 0 0x40 tcp4_gro_complete
BA 2936 294 314 1 294 254 | 1436 0 0x41 tcp_gro_complete
sink 2936 270 290 1 294 254 | 1436 2 0x41 skb_gso_transport_seglen
As the packet travels through the gro_complete callbacks, the inner network header (INH) offset gets updated after we have processed the inner IPv4 header.

However, the same thing did not happen to the inner transport header (ITH) that is causing us trouble. We need to fix that. Notice that by the time tcp_gro_complete runs, skb->transport_header (314 in the trace above) already points at the inner TCP header, so copying it over is enough:
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -298,6 +298,9 @@ int tcp_gro_complete(struct sk_buff *skb)
if (th->cwr)
skb_shinfo(skb)->gso_type |= SKB_GSO_TCP_ECN;
+ if (skb->encapsulation)
+ skb->inner_transport_header = skb->transport_header;
+
return 0;
}
EXPORT_SYMBOL(tcp_gro_complete);
With the patch in place, the header offsets are finally all sane and the skb_gso_transport_seglen() return value is as expected:
$ sudo bpftrace ./why-no-crash.bt -c '/usr/bin/ip netns exec A ./send-a-pair.py'
Attaching 2 probes...
DEV LEN NH TH ENC INH ITH GSO SIZE SEGS TYPE FUNC
sink 2936 270 290 1 294 314 | 1436 2 0x41 skb_gso_transport_seglen
$ sudo bpftrace -e 'kretprobe:skb_gso_transport_seglen { print(retval); }' -c …
Attaching 1 probe...
1480
Don’t worry, though. The fix has most likely been in your kernel for a long time already. Patch d51c5907e980 (“net, gro: Set inner transport header offset in tcp/udp GRO hook”) has been merged into Linux v5.14, and backported to the v5.10.58 and v5.4.140 LTS kernels. The Linux kernel community has got you covered. But please, keep on updating your production kernels.
Outro
What a journey! We have learned a ton and fixed a real bug in the Linux kernel. In the end it was not a Packet of Death. Maybe next time we can find one ;-)
Enjoyed the read? Why not join Cloudflare and help us fix the remaining bugs in the Linux kernel? We are hiring in Lisbon, London, and Austin.
And if you would like to see more kernel blog posts, please let us know!
¹ Why GRE and not some other type of encapsulation? If you follow our blog closely, you might already know that Cloudflare Magic Transit uses veth pairs to route traffic into and out of network namespaces. It also happens to use GRE encapsulation. If you are curious why we chose network namespaces linked with veth pairs, be sure to watch the How we built Magic Transit talk.

² Just turn off GRO on all other network devices in use to get a clean output (sudo ethtool -K enp0s5 gro off).