
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Tue, 14 Apr 2026 11:31:37 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Mitigating a 754 Million PPS DDoS Attack Automatically]]></title>
            <link>https://blog.cloudflare.com/mitigating-a-754-million-pps-ddos-attack-automatically/</link>
            <pubDate>Thu, 09 Jul 2020 11:00:00 GMT</pubDate>
            <description><![CDATA[ On June 21, Cloudflare automatically mitigated a highly volumetric DDoS attack that peaked at 754 million packets per second. The attack was part of an organized four-day campaign starting on June 18 and ending on June 21. ]]></description>
            <content:encoded><![CDATA[ <p>On June 21, Cloudflare automatically mitigated a highly volumetric DDoS attack that peaked at 754 million packets per second. The attack was part of an organized four-day campaign starting on June 18 and ending on June 21: attack traffic was sent from over 316,000 IP addresses towards a single Cloudflare IP address that was mostly used for websites on our <a href="https://www.cloudflare.com/plans/free/">Free plan</a>. No downtime or service degradation was reported during the attack, and no charges accrued to customers due to our <a href="/unmetered-mitigation/">unmetered mitigation guarantee</a>.</p><p>The attack was detected and handled automatically by <a href="/meet-gatebot-a-bot-that-allows-us-to-sleep/">Gatebot</a>, our global DDoS detection and mitigation system, without any manual intervention by our teams. Notably, because our automated systems were able to mitigate the attack without issue, no alerts or pages were sent to our on-call teams, and no humans were involved at all.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5n37ZlU9nwh2KGUW5Hwctl/d1ec160e2ac256b5ededf7e4eff6842f/image2-3.png" />
            
            </figure><p>Attack Snapshot - Peaking at 754 Mpps. The two different colors in the graph represent two separate systems dropping packets.</p><p>During those four days, the attack utilized a combination of three attack vectors over the <a href="https://www.cloudflare.com/learning/ddos/glossary/tcp-ip/">TCP</a> protocol: <a href="https://www.cloudflare.com/learning/ddos/syn-flood-ddos-attack/">SYN floods</a>, <a href="https://www.cloudflare.com/learning/ddos/what-is-an-ack-flood/">ACK floods</a> and SYN-ACK floods. The attack campaign was sustained for multiple hours at rates of 400-600 million packets per second and peaked multiple times above 700 million packets per second, with a top peak of 754 million packets per second. Despite the high and sustained packet rates, our edge continued serving our customers during the attack without impacting performance at all.</p>
    <div>
      <h3>The Three Types of DDoS: Bits, Packets &amp; Requests</h3>
      <a href="#the-three-types-of-ddos-bits-packets-requests">
        
      </a>
    </div>
    <p>Attacks with high bits per second rates aim to saturate the Internet link by sending more bandwidth per second than the link can handle. Mitigating a bit-intensive flood is similar to a dam blocking gushing water in a canal with limited capacity, allowing just a portion through.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4c5HPaNGni1tyvuJx67mnk/bbe3d855e355eded524119e6c82d1dab/GBPS-river--copy_2x.png" />
            
            </figure><p>Bit Intensive DDoS Attacks as a Gushing River Blocked By Gatebot</p><p>In such cases, the Internet service provider may block or throttle the traffic above the allowance, resulting in a denial of service for legitimate users who are trying to connect to the website but are blocked by the service provider. In other cases, the link is simply saturated and everything behind that connection is offline.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5HRDL05jkCa9jJfvbbPfe3/a61b1019bab1fcf7cf7b8b040ec3b72a/MBPS-Mosquitos.png" />
            
            </figure><p>Swarm of Mosquitoes as a Packet Intensive DDoS Attack</p><p>However, in this DDoS campaign, the attack peaked at a mere 250 Gbps (I say mere, but ¼ Tbps is enough to knock pretty much anything offline if it isn’t behind some DDoS mitigation service), so it does not seem as though the attacker intended to saturate our Internet links, perhaps because they know that our global capacity exceeds 37 Tbps. Instead, it appears the attacker attempted (and failed) to overwhelm our routers and data center appliances with high packet rates reaching 754 million packets per second. As opposed to water rushing towards a dam, a flood of packets can be thought of as a swarm of millions of mosquitoes that you need to zap one by one.</p>
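<p>The peak figures quoted in this post are enough to check the point with back-of-the-envelope arithmetic: dividing peak bandwidth by peak packet rate gives the average packet size, which lands close to the 40-byte minimum of an IPv4+TCP header pair. A quick sketch (no numbers beyond those already in the post):</p>

```python
# Back-of-the-envelope check of the attack's peak figures:
# average packet size = peak bandwidth / peak packet rate.
peak_bits_per_second = 250e9     # ~250 Gbps peak bandwidth
peak_packets_per_second = 754e6  # 754 Mpps peak packet rate

avg_packet_bytes = peak_bits_per_second / 8 / peak_packets_per_second
print(round(avg_packet_bytes))  # 41 - barely above the 40-byte IPv4+TCP header minimum
```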
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2CT7G4c2uLASjHOw6GWuXM/00e5c2ec980ffcd802a5a38b8bcd4932/gatebot-zapper_2x.png" />
            
            </figure><p>Zapping Mosquitoes with Gatebot</p><p>Depending on the ‘weakest link’ in a data center, a packet intensive DDoS attack may impact the routers, switches, web servers, firewalls, DDoS mitigation devices or any other appliance that is used in-line. Typically, a high packet rate may cause the memory buffer to overflow, voiding the router’s ability to process additional packets. This is because there’s a small fixed CPU cost of handling each packet, and so if you can send a lot of small packets, you can block an Internet connection not by filling it but by causing the hardware that handles the connection to be overwhelmed with processing.</p><p>Another form of DDoS attack is one with a high HTTP request per second rate. An HTTP request intensive DDoS attack aims to overwhelm a web server’s resources with more HTTP requests per second than the server can handle. The goal of a DDoS attack with a high request per second rate is to max out the CPU and memory utilization of the server in order to crash it or prevent it from being able to respond to legitimate requests. Request intensive DDoS attacks allow the attacker to generate much less bandwidth than bit intensive attacks and still cause a denial of service.</p>
    <div>
      <h3>Automated DDoS Detection &amp; Mitigation</h3>
      <a href="#automated-ddos-detection-mitigation">
        
      </a>
    </div>
    <p>So how did we handle 754 million packets per second? First, Cloudflare’s network utilizes BGP Anycast to spread attack traffic globally across our fleet of data centers. Second, we built our own <a href="https://www.cloudflare.com/ddos/">DDoS protection</a> systems, Gatebot and dosd, which <a href="/l4drop-xdp-ebpf-based-ddos-mitigations/">drop packets inside the Linux kernel</a> for maximum efficiency in order to handle massive floods of packets. And third, we built our own L4 load-balancer, Unimog, which uses our appliances' health and various other metrics to load-balance traffic intelligently within a data center.</p><p>In 2017, we published a blog introducing Gatebot, one of our two DDoS protection systems. The blog was titled <a href="/meet-gatebot-a-bot-that-allows-us-to-sleep/">Meet Gatebot - a bot that allows us to sleep</a>, and that’s exactly what happened during this attack. The <a href="https://www.cloudflare.com/learning/security/what-is-an-attack-surface/">attack surface</a> was spread out globally by our Anycast network; Gatebot then detected and mitigated the attack automatically without human intervention. Traffic inside each data center was load-balanced intelligently to avoid overwhelming any one machine. And as promised in the blog title, the attack peak did in fact occur while our London team was asleep.</p><p>So how does Gatebot work? Gatebot asynchronously samples traffic from every one of our data centers in over 200 locations around the world. It also monitors our <a href="/rolling-with-the-punches-shifting-attack-tactics-dropping-packets-faster-cheaper-at-the-edge/">customers’ origin server health</a>. It then analyzes the samples to identify patterns and traffic anomalies that can indicate attacks.
Once an attack is detected, Gatebot sends mitigation instructions to the edge data centers.</p><p>To complement Gatebot, last year we released a new system codenamed dosd (denial of service daemon) which runs in every one of our data centers around the world in over 200 cities. Similarly to Gatebot, dosd detects and mitigates attacks autonomously but in the scope of a single server or data center. You can read more about dosd in our <a href="/rolling-with-the-punches-shifting-attack-tactics-dropping-packets-faster-cheaper-at-the-edge/">recent blog</a>.</p>
    <div>
      <h3>The DDoS Landscape</h3>
      <a href="#the-ddos-landscape">
        
      </a>
    </div>
    <p>While in recent months we’ve observed a <a href="/network-layer-ddos-attack-trends-for-q1-2020/">decrease in the size and duration of DDoS attacks</a>, highly volumetric and globally distributed DDoS attacks such as this one still persist. Regardless of the size, type or sophistication of the attack, Cloudflare offers <a href="/unmetered-mitigation/">unmetered DDoS protection</a> to all customers and plan levels—including the Free plans.</p> ]]></content:encoded>
            <category><![CDATA[DDoS]]></category>
            <category><![CDATA[Security]]></category>
            <category><![CDATA[Gatebot]]></category>
            <category><![CDATA[Attacks]]></category>
            <category><![CDATA[SYN]]></category>
            <guid isPermaLink="false">4PrPrO1u2DnN6fu86Hbko9</guid>
            <dc:creator>Omer Yoachimik</dc:creator>
        </item>
        <item>
            <title><![CDATA[When TCP sockets refuse to die]]></title>
            <link>https://blog.cloudflare.com/when-tcp-sockets-refuse-to-die/</link>
            <pubDate>Fri, 20 Sep 2019 15:53:33 GMT</pubDate>
            <description><![CDATA[ We noticed something weird - the TCP sockets which we thought should have been closed - were lingering around. We realized we don't really understand when TCP sockets are supposed to time out!

We naively thought enabling TCP keepalives would be enough... but it isn't! ]]></description>
            <content:encoded><![CDATA[ <p>While working on our <a href="https://www.cloudflare.com/products/cloudflare-spectrum/">Spectrum server</a>, we noticed something weird: the TCP sockets which we thought should have been closed were lingering around. We realized we don't really understand when TCP sockets are supposed to time out!</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4J7NyByY5rLMwGCjildxuX/a80e344de39529860fe89230fff4259c/Tcp_state_diagram_fixed_new.svga.png" />
            
            </figure><p><a href="https://commons.wikimedia.org/wiki/File:Tcp_state_diagram_fixed_new.svg">Image</a> by Sergiodc2 CC BY SA 3.0</p><p>In our code, we wanted to make sure we don't hold connections to dead hosts. In our early code we naively thought enabling TCP keepalives would be enough... but it isn't. It turns out a fairly modern <a href="https://tools.ietf.org/html/rfc5482">TCP_USER_TIMEOUT</a> socket option is equally important. Furthermore, it interacts with TCP keepalives in subtle ways. <a href="http://codearcana.com/posts/2015/08/28/tcp-keepalive-is-a-lie.html">Many people</a> are confused by this.</p><p>In this blog post, we'll try to show how these options work. We'll show how a TCP socket can time out during various stages of its lifetime, and how TCP keepalives and user timeout influence that. To better illustrate the internals of TCP connections, we'll mix the outputs of the <code>tcpdump</code> and the <code>ss -o</code> commands. This nicely shows the transmitted packets and the changing parameters of the TCP connections.</p>
    <div>
      <h2>SYN-SENT</h2>
      <a href="#syn-sent">
        
      </a>
    </div>
    <p>Let's start from the simplest case - what happens when one attempts to establish a connection to a server which discards inbound SYN packets?</p><p>The scripts used here <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2019-09-tcp-keepalives">are available on our GitHub</a>.</p><p><code>$ sudo ./test-syn-sent.py
# all packets dropped
00:00.000 IP host.2 &gt; host.1: Flags [S] # initial SYN

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-SENT 0      1      host:2     host:1    timer:(on,940ms,0)

00:01.028 IP host.2 &gt; host.1: Flags [S] # first retry
00:03.044 IP host.2 &gt; host.1: Flags [S] # second retry
00:07.236 IP host.2 &gt; host.1: Flags [S] # third retry
00:15.427 IP host.2 &gt; host.1: Flags [S] # fourth retry
00:31.560 IP host.2 &gt; host.1: Flags [S] # fifth retry
01:04.324 IP host.2 &gt; host.1: Flags [S] # sixth retry
02:10.000 connect ETIMEDOUT</code></p><p>Ok, this was easy. After the <code>connect()</code> syscall, the operating system sends a SYN packet. Since it didn't get any response, the OS will by default retry sending it 6 times. This can be tweaked by the sysctl:</p><p><code>$ sysctl net.ipv4.tcp_syn_retries
net.ipv4.tcp_syn_retries = 6</code></p><p>It's possible to overwrite this setting per-socket with the TCP_SYNCNT setsockopt:</p><p><code>setsockopt(sd, IPPROTO_TCP, TCP_SYNCNT, 6);</code></p><p>The retries are staggered at 1s, 3s, 7s, 15s, 31s, 63s marks (the inter-retry time starts at 2s and then doubles each time). By default, the whole process takes 130 seconds, until the kernel gives up with the ETIMEDOUT errno. At this moment in the lifetime of a connection, SO_KEEPALIVE settings are ignored, but TCP_USER_TIMEOUT is not. For example, setting it to 5000ms will cause the following interaction:</p><p><code>$ sudo ./test-syn-sent.py 5000
# all packets dropped
00:00.000 IP host.2 &gt; host.1: Flags [S] # initial SYN

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-SENT 0      1      host:2     host:1    timer:(on,996ms,0)

00:01.016 IP host.2 &gt; host.1: Flags [S] # first retry
00:03.032 IP host.2 &gt; host.1: Flags [S] # second retry
00:05.016 IP host.2 &gt; host.1: Flags [S] # what is this?
00:05.024 IP host.2 &gt; host.1: Flags [S] # what is this?
00:05.036 IP host.2 &gt; host.1: Flags [S] # what is this?
00:05.044 IP host.2 &gt; host.1: Flags [S] # what is this?
00:05.050 connect ETIMEDOUT</code></p><p>Even though we set user-timeout to 5s, we still saw the six SYN retries on the wire. This behaviour is probably a bug (as tested on a 5.2 kernel): we would expect only two retries to be sent - at the 1s and 3s marks - and the socket to expire at the 5s mark. Instead, we saw the two expected retries, but also four further retransmitted SYN packets aligned to the 5s mark - which makes no sense. Anyhow, we learned a thing - TCP_USER_TIMEOUT does affect the behaviour of <code>connect()</code>.</p>
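<p>Both toggles discussed above are ordinary per-socket options. Below is a minimal Python sketch of the configuration used in this experiment, assuming a Linux kernel; the numeric fallbacks 7 and 18 are the Linux option numbers for TCP_SYNCNT and TCP_USER_TIMEOUT, used only if the socket module doesn't export the constants.</p>

```python
import socket

# Linux option numbers, used as fallbacks if the constants are missing.
TCP_SYNCNT = getattr(socket, "TCP_SYNCNT", 7)
TCP_USER_TIMEOUT = getattr(socket, "TCP_USER_TIMEOUT", 18)

sd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Retry the SYN only twice (at the 1s and 3s marks) instead of the default six...
sd.setsockopt(socket.IPPROTO_TCP, TCP_SYNCNT, 2)
# ...and give up after 5000 ms without acknowledgement, as in the experiment above.
sd.setsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT, 5000)

print(sd.getsockopt(socket.IPPROTO_TCP, TCP_USER_TIMEOUT))  # 5000
```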
    <div>
      <h2>SYN-RECV</h2>
      <a href="#syn-recv">
        
      </a>
    </div>
    <p>SYN-RECV sockets are usually hidden from the application. They live as mini-sockets on the SYN queue. We wrote about <a href="/syn-packet-handling-in-the-wild/">the SYN and Accept queues in the past</a>. Sometimes, when SYN cookies are enabled, the sockets may skip the SYN-RECV state altogether.</p><p>In SYN-RECV state, the socket will retry sending SYN+ACK 5 times as controlled by:</p><p><code>$ sysctl net.ipv4.tcp_synack_retries
net.ipv4.tcp_synack_retries = 5</code></p><p>Here is how it looks on the wire:</p><p><code>$ sudo ./test-syn-recv.py
00:00.000 IP host.2 &gt; host.1: Flags [S]
# all subsequent packets dropped
00:00.000 IP host.1 &gt; host.2: Flags [S.] # initial SYN+ACK

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-RECV 0      0      host:1     host:2    timer:(on,996ms,0)

00:01.033 IP host.1 &gt; host.2: Flags [S.] # first retry
00:03.045 IP host.1 &gt; host.2: Flags [S.] # second retry
00:07.301 IP host.1 &gt; host.2: Flags [S.] # third retry
00:15.493 IP host.1 &gt; host.2: Flags [S.] # fourth retry
00:31.621 IP host.1 &gt; host.2: Flags [S.] # fifth retry
01:04.610 SYN-RECV disappears</code></p><p>With default settings, the SYN+ACK is re-transmitted at 1s, 3s, 7s, 15s, 31s marks, and the SYN-RECV socket disappears at the 64s mark.</p><p>Neither SO_KEEPALIVE nor TCP_USER_TIMEOUT affects the lifetime of SYN-RECV sockets.</p>
    <div>
      <h2>Final handshake ACK</h2>
      <a href="#final-handshake-ack">
        
      </a>
    </div>
    <p>After receiving the second packet in the TCP handshake - the SYN+ACK - the client socket moves to an ESTABLISHED state. The server socket remains in SYN-RECV until it receives the final ACK packet.</p><p>Losing this ACK doesn't change anything - the server socket will just take a bit longer to move from SYN-RECV to ESTAB. Here is how it looks:</p><p><code>00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.]
00:00.000 IP host.2 &gt; host.1: Flags [.] # initial ACK, dropped

State    Recv-Q Send-Q Local:Port  Peer:Port
SYN-RECV 0      0      host:1      host:2 timer:(on,1sec,0)
ESTAB    0      0      host:2      host:1

00:01.014 IP host.1 &gt; host.2: Flags [S.]
00:01.014 IP host.2 &gt; host.1: Flags [.]  # retried ACK, dropped

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-RECV 0      0      host:1     host:2    timer:(on,1.012ms,1)
ESTAB    0      0      host:2     host:1</code></p><p>As you can see, the SYN-RECV socket has the "on" timer, the same as in the example before. We might argue this final ACK doesn't really carry much weight. This thinking led to the development of the TCP_DEFER_ACCEPT feature - it basically causes the third ACK to be silently dropped. With this flag set, the socket remains in the SYN-RECV state until it receives the first packet with actual data:</p><p><code>$ sudo ./test-syn-ack.py
00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.]
00:00.000 IP host.2 &gt; host.1: Flags [.] # delivered, but the socket stays as SYN-RECV

State    Recv-Q Send-Q Local:Port Peer:Port
SYN-RECV 0      0      host:1     host:2    timer:(on,7.192ms,0)
ESTAB    0      0      host:2     host:1

00:08.020 IP host.2 &gt; host.1: Flags [P.], length 11  # payload moves the socket to ESTAB

State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 11     0      host:1     host:2
ESTAB 0      0      host:2     host:1</code></p><p>The server socket remained in the SYN-RECV state even after receiving the final TCP-handshake ACK. It has a funny "on" timer, with the counter stuck at 0 retries. It is converted to ESTAB - and moved from the SYN to the accept queue - after the client sends a data packet or after the TCP_DEFER_ACCEPT timer expires. Basically, with DEFER ACCEPT the SYN-RECV mini-socket <a href="https://marc.info/?l=linux-netdev&amp;m=118793048828251&amp;w=2">discards the data-less inbound ACK</a>.</p>
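<p>Experimenting with this behaviour takes only one extra setsockopt on the listening socket. A minimal Python sketch, assuming a Linux kernel; the numeric fallback 9 is the Linux option number for TCP_DEFER_ACCEPT:</p>

```python
import socket

TCP_DEFER_ACCEPT = getattr(socket, "TCP_DEFER_ACCEPT", 9)  # Linux option number

ls = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
ls.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
# Keep connections in SYN-RECV until data arrives, for up to ~5 seconds.
# The kernel rounds the value up to a SYN+ACK retransmission boundary
# (1, 3, 7, 15... seconds), so reading it back may return a larger number.
ls.setsockopt(socket.IPPROTO_TCP, TCP_DEFER_ACCEPT, 5)
ls.bind(("127.0.0.1", 0))
ls.listen(16)
print(ls.getsockopt(socket.IPPROTO_TCP, TCP_DEFER_ACCEPT) >= 5)  # True
```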
    <div>
      <h2>Idle ESTAB is forever</h2>
      <a href="#idle-estab-is-forever">
        
      </a>
    </div>
    <p>Let's move on and discuss a fully-established socket connected to an unhealthy (dead) peer. After completion of the handshake, the sockets on both sides move to the ESTABLISHED state, like:</p><p><code>State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 0      0      host:2     host:1
ESTAB 0      0      host:1     host:2</code></p><p>These sockets have no running timer by default - they will remain in that state forever, even if the communication is broken. The TCP stack will notice problems only when one side attempts to send something. This raises a question - what to do if you don't plan on sending any data over a connection? How do you make sure an idle connection is healthy, without sending any data over it?</p><p>This is where TCP keepalives come in. Let's see it in action - in this example we used the following toggles:</p><ul><li><p>SO_KEEPALIVE = 1 - Let's enable keepalives.</p></li><li><p>TCP_KEEPIDLE = 5 - Send first keepalive probe after 5 seconds of idleness.</p></li><li><p>TCP_KEEPINTVL = 3 - Send subsequent keepalive probes after 3 seconds.</p></li><li><p>TCP_KEEPCNT = 3 - Time out after three failed probes.</p></li></ul><p><code>$ sudo ./test-idle.py
00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.]
00:00.000 IP host.2 &gt; host.1: Flags [.]

State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 0      0      host:1     host:2
ESTAB 0      0      host:2     host:1  timer:(keepalive,2.992ms,0)

# all subsequent packets dropped
00:05.083 IP host.2 &gt; host.1: Flags [.], ack 1 # first keepalive probe
00:08.155 IP host.2 &gt; host.1: Flags [.], ack 1 # second keepalive probe
00:11.231 IP host.2 &gt; host.1: Flags [.], ack 1 # third keepalive probe
00:14.299 IP host.2 &gt; host.1: Flags [R.], seq 1, ack 1</code></p><p>Indeed! We can clearly see the first probe sent at the 5s mark and the two remaining probes 3s apart - exactly as we specified. After a total of three sent probes and a further three seconds of delay, the connection dies with ETIMEDOUT, and finally the RST is transmitted.</p><p>For keepalives to work, the send buffer must be empty. You can notice the keepalive timer active in the "timer:(keepalive)" line.</p>
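<p>The four toggles used in this experiment map one-to-one onto setsockopt calls. A minimal Python sketch, assuming a Linux kernel (TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are Linux-specific):</p>

```python
import socket

sd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sd.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)    # enable keepalives
sd.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 5)   # first probe after 5s of idleness
sd.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 3)  # subsequent probes every 3s
sd.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)    # give up after 3 failed probes
```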
    <div>
      <h2>Keepalives with TCP_USER_TIMEOUT are confusing</h2>
      <a href="#keepalives-with-tcp_user_timeout-are-confusing">
        
      </a>
    </div>
    <p>We mentioned the TCP_USER_TIMEOUT option before. It sets the maximum amount of time that transmitted data may remain unacknowledged before the kernel forcefully closes the connection. On its own, it doesn't do much in the case of idle connections. The sockets will remain ESTABLISHED even if the connectivity is dropped. However, this socket option does change the semantics of TCP keepalives. <a href="https://linux.die.net/man/7/tcp">The tcp(7) manpage</a> is somewhat confusing:</p><p><i>Moreover, when used with the TCP keepalive (SO_KEEPALIVE) option, TCP_USER_TIMEOUT will override keepalive to determine when to close a connection due to keepalive failure.</i></p><p>The original commit message has slightly more detail:</p><ul><li><p><a href="https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=dca43c75e7e545694a9dd6288553f55c53e2a3a3">tcp: Add TCP_USER_TIMEOUT socket option</a></p></li></ul><p>To understand the semantics, we need to look at the <a href="https://github.com/torvalds/linux/blob/b41dae061bbd722b9d7fa828f35d22035b218e18/net/ipv4/tcp_timer.c#L693-L697">kernel code in linux/net/ipv4/tcp_timer.c:693</a>:</p><p><code>if ((icsk-&gt;icsk_user_timeout != 0 &amp;&amp;
elapsed &gt;= msecs_to_jiffies(icsk-&gt;icsk_user_timeout) &amp;&amp;
icsk-&gt;icsk_probes_out &gt; 0) ||</code></p><p>For the user timeout to have any effect, the <code>icsk_probes_out</code> must not be zero. The check for user timeout is done only <i>after</i> the first probe went out. Let's check it out. Our connection settings:</p><ul><li><p>TCP_USER_TIMEOUT = 5*1000 - 5 seconds</p></li><li><p>SO_KEEPALIVE = 1 - enable keepalives</p></li><li><p>TCP_KEEPIDLE = 1 - send first probe quickly - 1 second idle</p></li><li><p>TCP_KEEPINTVL = 11 - subsequent probes every 11 seconds</p></li><li><p>TCP_KEEPCNT = 3 - send three probes before timing out</p></li></ul><p><code>00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.]
00:00.000 IP host.2 &gt; host.1: Flags [.]

# all subsequent packets dropped
00:01.001 IP host.2 &gt; host.1: Flags [.], ack 1 # first probe
00:12.233 IP host.2 &gt; host.1: Flags [R.] # timer for second probe fired, socket aborted due to TCP_USER_TIMEOUT</code></p><p>So what happened? The connection sent the first keepalive probe at the 1s mark. Seeing no response the TCP stack then woke up 11 seconds later to send a second probe. This time though, it executed the USER_TIMEOUT code path, which decided to terminate the connection immediately.</p><p>What if we bump TCP_USER_TIMEOUT to larger values, say between the second and third probe? Then, the connection will be closed on the third probe timer. With TCP_USER_TIMEOUT set to 12.5s:</p><p><code>00:01.022 IP host.2 &gt; host.1: Flags [.] # first probe
00:12.094 IP host.2 &gt; host.1: Flags [.] # second probe
00:23.102 IP host.2 &gt; host.1: Flags [R.] # timer for third probe fired, socket aborted due to TCP_USER_TIMEOUT</code></p><p>We’ve shown how TCP_USER_TIMEOUT interacts with keepalives for small and medium values. The last case is when TCP_USER_TIMEOUT is extraordinarily large. Say we set it to 30s:</p><p><code>00:01.027 IP host.2 &gt; host.1: Flags [.], ack 1 # first probe
00:12.195 IP host.2 &gt; host.1: Flags [.], ack 1 # second probe
00:23.207 IP host.2 &gt; host.1: Flags [.], ack 1 # third probe
00:34.211 IP host.2 &gt; host.1: Flags [.], ack 1 # fourth probe! But TCP_KEEPCNT was only 3!
00:45.219 IP host.2 &gt; host.1: Flags [.], ack 1 # fifth probe!
00:56.227 IP host.2 &gt; host.1: Flags [.], ack 1 # sixth probe!
01:07.235 IP host.2 &gt; host.1: Flags [R.], seq 1 # TCP_USER_TIMEOUT aborts conn on 7th probe timer</code></p><p>We saw six keepalive probes on the wire! With TCP_USER_TIMEOUT set, the TCP_KEEPCNT is totally ignored. If you want TCP_KEEPCNT to make sense, the only sensible USER_TIMEOUT value is slightly smaller than:</p>
            <pre><code>TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT</code></pre>
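<p>Plugging in the values from the idle-connection experiment earlier shows what that bound works out to:</p>

```python
# Keepalive settings from the idle-connection experiment above.
TCP_KEEPIDLE, TCP_KEEPINTVL, TCP_KEEPCNT = 5, 3, 3

upper_bound = TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT
print(upper_bound)  # 14 - and indeed that experiment's connection died at the ~14s mark
```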
            
    <div>
      <h2>Busy ESTAB socket is not forever</h2>
      <a href="#busy-estab-socket-is-not-forever">
        
      </a>
    </div>
    <p>Thus far we have discussed the case where the connection is idle. Different rules apply when the connection has unacknowledged data in a send buffer.</p><p>Let's prepare another experiment - after the three-way handshake, let's set up a firewall to drop all packets. Then, let's do a <code>send</code> on one end to have some dropped packets in-flight. An experiment shows the sending socket dies after ~16 minutes:</p><p><code>00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.]
00:00.000 IP host.2 &gt; host.1: Flags [.]

# All subsequent packets dropped
00:00.206 IP host.2 &gt; host.1: Flags [P.], length 11 # first data packet
00:00.412 IP host.2 &gt; host.1: Flags [P.], length 11 # early retransmit, doesn't count
00:00.620 IP host.2 &gt; host.1: Flags [P.], length 11 # 1st retry
00:01.048 IP host.2 &gt; host.1: Flags [P.], length 11 # 2nd retry
00:01.880 IP host.2 &gt; host.1: Flags [P.], length 11 # 3rd retry

State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 0      0      host:1     host:2
ESTAB 0      11     host:2     host:1    timer:(on,1.304ms,3)

00:03.543 IP host.2 &gt; host.1: Flags [P.], length 11 # 4th
00:07.000 IP host.2 &gt; host.1: Flags [P.], length 11 # 5th
00:13.656 IP host.2 &gt; host.1: Flags [P.], length 11 # 6th
00:26.968 IP host.2 &gt; host.1: Flags [P.], length 11 # 7th
00:54.616 IP host.2 &gt; host.1: Flags [P.], length 11 # 8th
01:47.868 IP host.2 &gt; host.1: Flags [P.], length 11 # 9th
03:34.360 IP host.2 &gt; host.1: Flags [P.], length 11 # 10th
05:35.192 IP host.2 &gt; host.1: Flags [P.], length 11 # 11th
07:36.024 IP host.2 &gt; host.1: Flags [P.], length 11 # 12th
09:36.855 IP host.2 &gt; host.1: Flags [P.], length 11 # 13th
11:37.692 IP host.2 &gt; host.1: Flags [P.], length 11 # 14th
13:38.524 IP host.2 &gt; host.1: Flags [P.], length 11 # 15th
15:39.500 connection ETIMEDOUT</code></p><p>The data packet is retransmitted 15 times, as controlled by:</p><p><code>$ sysctl net.ipv4.tcp_retries2
net.ipv4.tcp_retries2 = 15</code></p><p>From the <a href="https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt"><code>ip-sysctl.txt</code></a> documentation:</p><p><i>The default value of 15 yields a hypothetical timeout of 924.6 seconds and is a lower bound for the effective timeout. TCP will effectively time out at the first RTO which exceeds the hypothetical timeout.</i></p><p>The connection indeed died at ~940 seconds. Notice the socket has the "on" timer running. It doesn't matter at all if we set SO_KEEPALIVE - when the "on" timer is running, keepalives are not engaged.</p><p>TCP_USER_TIMEOUT keeps on working though. The connection will be aborted <i>exactly</i> after the user-timeout period since the last received packet. With the user timeout set, the <code>tcp_retries2</code> value is ignored.</p>
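<p>The 924.6 seconds quoted above can be reproduced with a few lines of arithmetic. This sketch assumes the Linux defaults TCP_RTO_MIN = 200ms and TCP_RTO_MAX = 120s:</p>

```python
# The RTO starts at TCP_RTO_MIN, doubles on each retransmission, and is
# capped at TCP_RTO_MAX. With tcp_retries2 = 15 the kernel waits through
# 16 periods: one before each of the 15 retries, plus a final one before
# giving up with ETIMEDOUT.
TCP_RTO_MIN, TCP_RTO_MAX = 0.2, 120.0
tcp_retries2 = 15

total, rto = 0.0, TCP_RTO_MIN
for _ in range(tcp_retries2 + 1):
    total += min(rto, TCP_RTO_MAX)
    rto *= 2

print(round(total, 1))  # 924.6
```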
    <div>
      <h2>Zero window ESTAB is... forever?</h2>
      <a href="#zero-window-estab-is-forever">
        
      </a>
    </div>
    <p>There is one final case worth mentioning. If the sender has plenty of data, and the receiver is slow, then TCP flow control kicks in. At some point the receiver will ask the sender to stop transmitting new data. This is a slightly different condition than the one described above.</p><p>In this case, with flow control engaged, there is no in-flight or unacknowledged data. Instead the receiver throttles the sender with a "zero window" notification. Then the sender periodically checks if the condition is still valid with "window probes". In this experiment we reduced the receive buffer size for simplicity. Here's how it looks on the wire:</p><p><code>00:00.000 IP host.2 &gt; host.1: Flags [S]
00:00.000 IP host.1 &gt; host.2: Flags [S.], win 1152
00:00.000 IP host.2 &gt; host.1: Flags [.]</code></p><p><code>00:00.202 IP host.2 &gt; host.1: Flags [.], length 576 # first data packet
00:00.202 IP host.1 &gt; host.2: Flags [.], ack 577, win 576
00:00.202 IP host.2 &gt; host.1: Flags [P.], length 576 # second data packet
00:00.244 IP host.1 &gt; host.2: Flags [.], ack 1153, win 0 # throttle it! zero-window</code></p><p><code>00:00.456 IP host.2 &gt; host.1: Flags [.], ack 1 # zero-window probe
00:00.456 IP host.1 &gt; host.2: Flags [.], ack 1153, win 0 # nope, still zero-window</code></p><p><code>State Recv-Q Send-Q Local:Port Peer:Port
ESTAB 1152   0      host:1     host:2
ESTAB 0      129920 host:2     host:1  timer:(persist,048ms,0)</code></p><p>The packet capture shows a couple of things. First, we can see two packets with data, each 576 bytes long. They were both immediately acknowledged. The second ACK carried a "win 0" notification: the sender was told to stop sending data.</p><p>But the sender is eager to send more! The last two packets show a first "window probe": the sender will periodically send payload-less "ack" packets to check if the window size has changed. As long as the receiver keeps on answering, the sender will keep on sending such probes forever.</p><p>The socket information shows three important things:</p><ul><li><p>The read buffer of the reader is filled - thus the "zero window" throttling is expected.</p></li><li><p>The write buffer of the sender is filled - we have more data to send.</p></li><li><p>The sender has a "persist" timer running, counting the time until the next "window probe".</p></li></ul><p>In this blog post we are interested in timeouts - what will happen if the window probes are lost? Will the sender notice?</p><p>By default, the window probe is retried 15 times - adhering to the usual <code>tcp_retries2</code> setting.</p><p>The TCP timer is in the <code>persist</code> state, so TCP keepalives will <i>not</i> be running. The SO_KEEPALIVE settings don't make any difference when window probing is engaged.</p><p>As expected, the TCP_USER_TIMEOUT toggle keeps on working. A slight difference is that, similarly to the user timeout on keepalives, it's engaged only when the retransmission timer fires. During such an event, if more than user-timeout seconds have passed since the last good packet, the connection will be aborted.</p>
    <div>
      <h2>Note about using application timeouts</h2>
      <a href="#note-about-using-application-timeouts">
        
      </a>
    </div>
    <p>In the past we have shared an interesting war story:</p><ul><li><p><a href="/the-curious-case-of-slow-downloads/">The curious case of slow downloads</a></p></li></ul><p>Our HTTP server gave up on the connection after an application-managed timeout fired. This was a bug - a slow connection might have been legitimately, if slowly, draining the send buffer, but the application server didn't notice that.</p><p>We abruptly dropped slow downloads, even though this wasn't our intention. We just wanted to make sure the client connection was still healthy. It would be better to use TCP_USER_TIMEOUT than to rely on application-managed timeouts.</p><p>But this is not sufficient. We also wanted to guard against a situation where a client stream is valid, but is stuck and doesn't drain the connection. The only way to achieve this is to periodically check the amount of unsent data in the send buffer, and see if it shrinks at a desired pace.</p><p>For typical applications sending data to the Internet, I would recommend:</p><ol><li><p>Enable TCP keepalives. This is needed to keep some data flowing in the idle-connection case.</p></li><li><p>Set TCP_USER_TIMEOUT to <code>TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT</code>.</p></li><li><p>Be careful when using application-managed timeouts. To detect TCP failures use TCP keepalives and user-timeout. If you want to spare resources and make sure sockets don't stay alive for too long, consider periodically checking if the socket is draining at the desired pace. You can use <code>ioctl(TIOCOUTQ)</code> for that, but it counts both the data buffered (not sent) on the socket and the in-flight (unacknowledged) bytes. A better way is to use the TCP_INFO <code>tcpi_notsent_bytes</code> field, which reports only the former counter.</p></li></ol><p>An example of checking the draining pace (in pseudocode):</p><pre><code>while True:
    notsent1 = get_tcp_info(c).tcpi_notsent_bytes
    notsent1_ts = time.time()
    ...
    poll.poll(POLL_PERIOD)
    ...
    notsent2 = get_tcp_info(c).tcpi_notsent_bytes
    notsent2_ts = time.time()
    pace_in_bytes_per_second = (notsent1 - notsent2) / (notsent2_ts - notsent1_ts)
    if pace_in_bytes_per_second &gt; 12000:
        pass  # pace is above the effective rate of 96Kbps, ok!
    else:
        pass  # socket is too slow...</code></pre><p>There are ways to further improve this logic. We could use <a href="https://lwn.net/Articles/560082/"><code>TCP_NOTSENT_LOWAT</code></a>, although it's generally only useful for situations where the send buffer is relatively empty. Then we could use the <a href="https://www.kernel.org/doc/Documentation/networking/timestamping.txt"><code>SO_TIMESTAMPING</code></a> interface for notifications about when data gets delivered. Finally, if we are done sending the data to the socket, it's possible to just call <code>close()</code> and defer handling of the socket to the operating system. Such a socket will be stuck in the FIN-WAIT-1 or LAST-ACK state until it correctly drains.</p>
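<p>Recommendations 1 and 2 above can be sketched in a few lines. This is a hedged illustration, not one of the original scripts; it assumes Linux, where Python exposes the keepalive and user-timeout socket constants, and the specific timer values are arbitrary examples:</p>

```python
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)

# (1) Enable TCP keepalives to keep some data flowing on an idle connection.
s.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, 60)   # idle seconds before probing
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, 5)   # seconds between probes
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, 3)     # probes before giving up

# (2) TCP_USER_TIMEOUT = TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT,
#     expressed in milliseconds: (60 + 5 * 3) seconds = 75000 ms.
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_USER_TIMEOUT, (60 + 5 * 3) * 1000)
```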
    <div>
      <h2>Summary</h2>
      <a href="#summary">
        
      </a>
    </div>
    <p>In this post we discussed five cases where the TCP connection may notice the other party going away:</p><ul><li><p>SYN-SENT: The duration of this state can be controlled by <code>TCP_SYNCNT</code> or <code>tcp_syn_retries</code>.</p></li><li><p>SYN-RECV: It's usually hidden from the application. It is tuned by <code>tcp_synack_retries</code>.</p></li><li><p>An idling ESTABLISHED connection will never notice any issues. A solution is to use TCP keepalives.</p></li><li><p>A busy ESTABLISHED connection adheres to the <code>tcp_retries2</code> setting, and ignores TCP keepalives.</p></li><li><p>A zero-window ESTABLISHED connection adheres to the <code>tcp_retries2</code> setting, and ignores TCP keepalives.</p></li></ul><p>The last two ESTABLISHED cases especially can be customized with TCP_USER_TIMEOUT, but this setting also affects other situations. Generally speaking, it can be thought of as a hint to the kernel to abort the connection after the given number of seconds have passed since the last good packet. It is a dangerous setting though, and if used in conjunction with TCP keepalives it should be set to a value slightly lower than <code>TCP_KEEPIDLE + TCP_KEEPINTVL * TCP_KEEPCNT</code>. Otherwise it will affect, and potentially cancel out, the TCP_KEEPCNT value.</p><p>In this post we presented scripts showing the effects of timeout-related socket options under various network conditions. Interleaving the <code>tcpdump</code> packet capture with the output of <code>ss -o</code> is a great way of understanding the networking stack. We were able to create reproducible test cases showing the "on", "keepalive" and "persist" timers in action. This is a very useful framework for further experimentation.</p><p>Finally, it's surprisingly hard to tune a TCP connection to be confident that the remote host is actually up. During our debugging we found that looking at the send buffer size and the currently active TCP timer can be very helpful in understanding whether the socket is actually healthy. 
The bug in our Spectrum application turned out to be a wrong TCP_USER_TIMEOUT setting - without it, sockets with large send buffers were lingering around for far longer than we intended.</p><p>The scripts used in this article <a href="https://github.com/cloudflare/cloudflare-blog/tree/master/2019-09-tcp-keepalives">can be found on our GitHub</a>.</p><p>Figuring this out has been a collaboration across three Cloudflare offices. Thanks to <a href="https://twitter.com/Hirenpanchasara">Hiren Panchasara</a> from San Jose, <a href="https://twitter.com/warrncn">Warren Nelson</a> from Austin and <a href="https://twitter.com/jkbs0">Jakub Sitnicki</a> from Warsaw. Fancy joining the team? <a href="https://www.cloudflare.com/careers/departments/?utm_referrer=blog">Apply here!</a></p> ]]></content:encoded>
            <category><![CDATA[SYN]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Spectrum]]></category>
            <category><![CDATA[Tech Talks]]></category>
            <guid isPermaLink="false">PTYUwpDIf4wDZ50CejAvL</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[SYN packet handling in the wild]]></title>
            <link>https://blog.cloudflare.com/syn-packet-handling-in-the-wild/</link>
            <pubDate>Mon, 15 Jan 2018 13:49:09 GMT</pubDate>
            <description><![CDATA[ Here at Cloudflare, we have a lot of experience of operating servers on the wild Internet. But we are always improving our mastery of this black art. On this very blog we have touched on multiple dark corners of the Internet protocols: like understanding FIN-WAIT-2 or receive buffer tuning. ]]></description>
            <content:encoded><![CDATA[ <p>Here at Cloudflare, we have a lot of experience of operating servers on the wild Internet. But we are always improving our mastery of this black art. On this very blog we have touched on multiple dark corners of the Internet protocols: like <a href="/this-is-strictly-a-violation-of-the-tcp-specification/">understanding FIN-WAIT-2</a> or <a href="/the-story-of-one-latency-spike/">receive buffer tuning</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/GBoY12EMNx75snoQtVcGI/46163cd4b532746e1a9603f57c4b044f/1471936772_652043cb6c_b.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/isaimoreno/1471936772/in/photolist-3f54wd-6mweJG-maUn5T-2tgqad-6YuCuM-pZ7r8T-Sa3LQ9-adTFS2-qSLQzk-sJ66Lq-71cJPS-oFU9rf-mbom12-23fVpJW-71ciCN-718DHR-j4VCQQ-71chKo-5DMBr4-5DLQFK-71cG4s-qQFjhZ-2RMBP6-718KWR-71cAFA-fAr8Ri-pe5zev-8TtDbQ-b6p5gk-qAdMqQ-qSBvUZ-qyg7oz-o5yof6-adTGvc-718xp4-5XQgJZ-bgGiwk-kf7aMc-qAjY14-718uti-smXfxF-8Kdnpx-nVVy8a-cmMJGb-puizaG-qP18i9-71cu1E-nYNfjq-718CjH-qyQM72">image</a> by <a href="https://www.flickr.com/photos/isaimoreno/">Isaí Moreno</a></p><p>One subject hasn't had enough attention though - SYN floods. We use Linux and it turns out that SYN packet handling in Linux is truly complex. In this post we'll shine some light on this subject.</p>
    <div>
      <h3>The tale of two queues</h3>
      <a href="#the-tale-of-two-queues">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/X9oaRfcffxKbxYg4VvWuc/8a2277a49d7e092794d581d499a6083c/all-1.jpeg.jpeg" />
            
            </figure><p>First we must understand that each bound socket in the "LISTENING" TCP state has two separate queues:</p><ul><li><p>The SYN Queue</p></li><li><p>The Accept Queue</p></li></ul><p>In the literature these queues are often given other names such as "reqsk_queue", "ACK backlog", "listen backlog" or even "TCP backlog", but I'll stick to the names above to avoid confusion.</p>
    <div>
      <h3>SYN Queue</h3>
      <a href="#syn-queue">
        
      </a>
    </div>
    <p>The SYN Queue stores inbound SYN packets<a href="#fn1">[1]</a> (specifically: <a href="https://elixir.free-electrons.com/linux/v4.14.12/source/include/net/inet_sock.h#L73"><code>struct inet_request_sock</code></a>). It's responsible for sending out SYN+ACK packets and retrying them on timeout. On Linux the number of retries is configured with:</p>
            <pre><code>$ sysctl net.ipv4.tcp_synack_retries
net.ipv4.tcp_synack_retries = 5</code></pre>
            <p>The <a href="https://www.kernel.org/doc/Documentation/networking/ip-sysctl.txt">docs describe this toggle</a>:</p>
            <pre><code>tcp_synack_retries - INTEGER

	Number of times SYNACKs for a passive TCP connection attempt
	will be retransmitted. Should not be higher than 255. Default
	value is 5, which corresponds to 31 seconds till the last
	retransmission with the current initial RTO of 1second. With
	this the final timeout for a passive TCP connection will
	happen after 63 seconds.
</code></pre>
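<p>The 31 and 63 second figures follow from exponential backoff of the retransmission timer: with an initial RTO of 1 second, the SYN+ACK is retransmitted at 1, 3, 7, 15 and 31 seconds, and the attempt is abandoned one more doubling later. A quick sketch of the arithmetic (the helper name is made up):</p>

```python
def synack_timeout_seconds(retries, initial_rto=1):
    # Each retransmission doubles the retransmission timeout (RTO):
    # 1s, 2s, 4s, ... The final timeout fires one full RTO after the
    # last retransmission.
    return sum(initial_rto * 2 ** i for i in range(retries + 1))

# With the default tcp_synack_retries = 5: the last retransmission
# happens at 1+2+4+8+16 = 31s, and the final timeout at 63s.
print(synack_timeout_seconds(5))  # 63
```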
            <p>After transmitting the SYN+ACK, the SYN Queue waits for an ACK packet from the client - the last packet in the three-way-handshake. All received ACK packets must first be matched against the fully established connection table, and only then against data in the relevant SYN Queue. On SYN Queue match, the kernel removes the item from the SYN Queue, happily creates a fully fledged connection (specifically: <a href="https://elixir.free-electrons.com/linux/v4.14.12/source/include/net/inet_sock.h#L183"><code>struct inet_sock</code></a>), and adds it to the Accept Queue.</p>
    <div>
      <h3>Accept Queue</h3>
      <a href="#accept-queue">
        
      </a>
    </div>
    <p>The Accept Queue contains fully established connections: ready to be picked up by the application. When a process calls <code>accept()</code>, the sockets are de-queued and passed to the application.</p><p>This is a rather simplified view of SYN packet handling on Linux. With socket toggles like <code>TCP_DEFER_ACCEPT</code><a href="#fn2">[2]</a> and <code>TCP_FASTOPEN</code> things work slightly differently.</p>
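<p>From the application's point of view both queues are invisible - all it does is <code>listen()</code> with a backlog and then <code>accept()</code>. A minimal sketch (the loopback address is arbitrary; port 0 lets the kernel pick a free port):</p>

```python
import socket

# Server side: bind and listen. The backlog argument caps both
# the SYN Queue and the Accept Queue.
sfd = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sfd.bind(("127.0.0.1", 0))
sfd.listen(1024)

# Client side: the three-way handshake completes entirely in the
# kernel; the new connection waits in the Accept Queue.
client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
client.connect(sfd.getsockname())

# Only now does the application de-queue the established connection.
conn, addr = sfd.accept()
```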
    <div>
      <h3>Queue size limits</h3>
      <a href="#queue-size-limits">
        
      </a>
    </div>
    <p>The maximum allowed length of both the Accept and SYN Queues is taken from the <code>backlog</code> parameter passed to the <code>listen(2)</code> syscall by the application. For example, this sets the Accept and SYN Queue sizes to 1,024:</p>
            <pre><code>listen(sfd, 1024)</code></pre>
            <p>Note: In kernels before 4.3 the <a href="https://github.com/torvalds/linux/commit/ef547f2ac16bd9d77a780a0e7c70857e69e8f23f#diff-56ecfd3cd70d57cde321f395f0d8d743L43">SYN Queue length was counted differently</a>.</p><p>This SYN Queue cap used to be configured by the <code>net.ipv4.tcp_max_syn_backlog</code> toggle, but this isn't the case anymore. Nowadays <code>net.core.somaxconn</code> caps both queue sizes. On our servers we set it to 16k:</p>
            <pre><code>$ sysctl net.core.somaxconn
net.core.somaxconn = 16384</code></pre>
            
    <div>
      <h3>Perfect backlog value</h3>
      <a href="#perfect-backlog-value">
        
      </a>
    </div>
    <p>Knowing all that, we might ask the question - what is the ideal <code>backlog</code> parameter value?</p><p>The answer is: it depends. For the majority of trivial TCP servers it doesn't really matter. For example, before version 1.11 <a href="https://github.com/golang/go/issues/6079">Golang famously didn't support customizing the backlog value</a>. There are valid reasons to increase this value though:</p><ul><li><p>When the rate of incoming connections is really large, even with a performant application, the inbound SYN Queue may need a larger number of slots.</p></li><li><p>The <code>backlog</code> value controls the SYN Queue size. This effectively can be read as "ACK packets in flight". The larger the average round trip time to the client, the more slots are going to be used. In the case of many clients far away from the server, hundreds of milliseconds away, it makes sense to increase the backlog value.</p></li><li><p>The <code>TCP_DEFER_ACCEPT</code> option causes sockets to remain in the SYN-RECV state longer and contribute to the queue limits.</p></li></ul><p>Overshooting the <code>backlog</code> is bad as well:</p><ul><li><p>Each slot in the SYN Queue uses some memory. During a SYN Flood it makes no sense to waste resources on storing attack packets. Each <code>struct inet_request_sock</code> entry in the SYN Queue takes 256 bytes of memory on kernel 4.14.</p></li></ul><p>To peek into the SYN Queue on Linux we can use the <code>ss</code> command and look for <code>SYN-RECV</code> sockets. For example, on one of Cloudflare's servers we can see 119 slots used in the tcp/80 SYN Queue and 78 in tcp/443.</p>
            <pre><code>$ ss -n state syn-recv sport = :80 | wc -l
119
$ ss -n state syn-recv sport = :443 | wc -l
78</code></pre>
            <p>Similar data can be shown with our <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2018-01-syn-floods/resq.stp">overengineered SystemTap script: <code>resq.stp</code></a>.</p>
    <div>
      <h3>Slow application</h3>
      <a href="#slow-application">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/M8OtXMLU8ocN9jg1du8p1/57edd7ac7c6340c44e21970f27a0596d/full-accept-1.jpeg.jpeg" />
            
            </figure><p>What happens if the application can't keep up with calling <code>accept()</code> fast enough?</p><p>This is when the magic happens! When the Accept Queue gets full (i.e., it holds <code>backlog</code>+1 connections) then:</p><ul><li><p>Inbound SYN packets to the SYN Queue are dropped.</p></li><li><p>Inbound ACK packets to the SYN Queue are dropped.</p></li><li><p>The TcpExtListenOverflows / <code>LINUX_MIB_LISTENOVERFLOWS</code> counter is incremented.</p></li><li><p>The TcpExtListenDrops / <code>LINUX_MIB_LISTENDROPS</code> counter is incremented.</p></li></ul><p>There is a strong rationale for dropping <i>inbound</i> packets: it's a push-back mechanism. The other party will sooner or later resend the SYN or ACK packets, by which point, the hope is, the slow application will have recovered.</p><p>This is desirable behavior for almost all servers. For completeness: it can be adjusted with the global <code>net.ipv4.tcp_abort_on_overflow</code> toggle, but better not touch it.</p><p>If your server needs to handle a large number of inbound connections and is struggling with <code>accept()</code> throughput, consider reading our <a href="/the-sad-state-of-linux-socket-balancing/">Nginx tuning / Epoll work distribution</a> post and a <a href="/perfect-locality-and-three-epic-systemtap-scripts/">follow up showing useful SystemTap scripts</a>.</p><p>You can trace the Accept Queue overflow stats by looking at <code>nstat</code> counters:</p>
            <pre><code>$ nstat -az TcpExtListenDrops
TcpExtListenDrops     49199     0.0</code></pre>
            <p>This is a global counter. It's not ideal - sometimes we saw it increasing, while all applications looked healthy! The first step should always be to print the Accept Queue sizes with <code>ss</code>:</p>
            <pre><code>$ ss -plnt sport = :6443|cat
State   Recv-Q Send-Q  Local Address:Port  Peer Address:Port
LISTEN  0      1024                *:6443             *:*</code></pre>
            <p>The column <code>Recv-Q</code> shows the number of sockets in the Accept Queue, and <code>Send-Q</code> shows the backlog parameter. In this case we see there are no outstanding sockets to be <code>accept()</code>ed, but we still saw the ListenDrops counter increasing.</p><p>It turns out our application was stuck for a fraction of a second. This was sufficient to let the Accept Queue overflow for a very brief period of time. Moments later it would recover. Cases like this are hard to debug with <code>ss</code>, so we wrote <a href="https://github.com/cloudflare/cloudflare-blog/blob/master/2018-01-syn-floods/acceptq.stp">an <code>acceptq.stp</code> SystemTap script</a> to help us. It hooks into the kernel and prints the SYN packets which are being dropped:</p>
            <pre><code>$ sudo stap -v acceptq.stp
time (us)        acceptq qmax  local addr    remote_addr
1495634198449075  1025   1024  0.0.0.0:6443  10.0.1.92:28585
1495634198449253  1025   1024  0.0.0.0:6443  10.0.1.92:50500
1495634198450062  1025   1024  0.0.0.0:6443  10.0.1.92:65434
...</code></pre>
            <p>Here you can see precisely which SYN packets were affected by the ListenDrops. With this script it's trivial to understand which application is dropping connections.</p>
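<p>The push-back behavior is easy to reproduce. The sketch below is an illustration, not one of our scripts; it assumes the default <code>net.ipv4.tcp_abort_on_overflow = 0</code>, so overflowing SYNs are silently dropped rather than reset:</p>

```python
import socket

# A listener that never calls accept(). With backlog = 1 the Accept
# Queue is full once it holds backlog + 1 = 2 established connections.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
addr = listener.getsockname()

# These two handshakes complete and park in the Accept Queue.
held = []
for _ in range(2):
    c = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    c.connect(addr)
    held.append(c)

# The queue is now full: the SYN of the next connection is dropped,
# the client never sees a SYN+ACK, and connect() eventually times out.
extra = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
extra.settimeout(2)
try:
    extra.connect(addr)
    overflowed = False
except OSError:  # socket.timeout is a subclass of OSError
    overflowed = True
```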
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3y08wVb2ROvhucECXpWm8x/f9ce889bf44d01ee2fd34c6a8d74b9cb/3713965419_20388fb368_b.jpg" />
            
            </figure><p><a href="https://creativecommons.org/licenses/by/2.0/">CC BY 2.0</a> <a href="https://www.flickr.com/photos/16339684@N00/3713965419/in/photolist-6Ec3wx-5jhnwn-bfyTRX-5jhnCa-phYcey-dxZ95n-egkTN-kwT1YH-k22LWZ-5jBUiy-bzvDWx-5jBV31-5jhnr8-5jBTkq-5jxzHk-4K3cbP-9EePyg-4e5XNt-4e5XNn-dxZ8Tn-dy5A89-dxZ6GH-cztXcJ-gF7oY-dxZ9jv-dxZ7qM-ZvSPCv-dxZ6YV-5jBTqs-5jxzaP-MvuyK-nmVwP1-5jBRhY-dxZ7YF-5jxAc2-5jBU9U-5jBTEy-ejbWe6-5jxBc6-99ENZW-99KUsi-9bWScw-5jBRow-5jxzmx-5jBTfw-r6HcW-dy5zXE-5jxzg4-5jxBYR-5jxA2B">image</a> by <a href="https://www.flickr.com/photos/16339684@N00/">internets_dairy</a></p>
    <div>
      <h3>SYN Flood</h3>
      <a href="#syn-flood">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7KIPBW1tsmG1nRfI3Guvmn/33abe68f9be1e5838ba472624c99be8d/full-syn-1.jpeg.jpeg" />
            
            </figure><p>If it's possible to overflow the Accept Queue, it must be possible to overflow the SYN Queue as well. What happens in that case?</p><p>This is what <a href="https://en.wikipedia.org/wiki/SYN_flood">SYN Flood attacks</a> are all about. In the past flooding the SYN Queue with bogus spoofed SYN packets was a real problem. Before 1996 it was possible to successfully deny the service of almost any TCP server with very little bandwidth, just by filling the SYN Queues.</p><p>The solution is <a href="https://lwn.net/Articles/277146/">SYN Cookies</a>. SYN Cookies are a construct that allows the SYN+ACK to be generated statelessly, without actually saving the inbound SYN and wasting system memory. SYN Cookies don't break legitimate traffic. When the other party is real, it will respond with a valid ACK packet including the reflected sequence number, which can be cryptographically verified.</p><p>By default SYN Cookies are enabled when needed - for sockets with a filled up SYN Queue. Linux updates a couple of counters on SYN Cookies. When a SYN cookie is being sent out:</p><ul><li><p>TcpExtTCPReqQFullDoCookies / <code>LINUX_MIB_TCPREQQFULLDOCOOKIES</code> is incremented.</p></li><li><p>TcpExtSyncookiesSent / <code>LINUX_MIB_SYNCOOKIESSENT</code> is incremented.</p></li><li><p>Linux used to increment <code>TcpExtListenDrops</code> but <a href="https://github.com/torvalds/linux/commit/9caad864151e525929d323de96cad382da49c3b2">doesn't from kernel 4.7</a>.</p></li></ul><p>When an inbound ACK is heading into the SYN Queue with SYN cookies engaged:</p><ul><li><p>TcpExtSyncookiesRecv / <code>LINUX_MIB_SYNCOOKIESRECV</code> is incremented when crypto validation succeeds.</p></li><li><p>TcpExtSyncookiesFailed / <code>LINUX_MIB_SYNCOOKIESFAILED</code> is incremented when crypto fails.</p></li></ul><p>A sysctl <code>net.ipv4.tcp_syncookies</code> can disable SYN Cookies or force-enable them. Default is good, don't change it.</p>
    <div>
      <h3>SYN Cookies and TCP Timestamps</h3>
      <a href="#syn-cookies-and-tcp-timestamps">
        
      </a>
    </div>
    <p>The SYN Cookies magic works, but isn't without disadvantages. The main problem is that there is very little data that can be saved in a SYN Cookie. Specifically, only 32 bits of the sequence number are returned in the ACK. These bits are used as follows:</p>
            <pre><code>+----------+--------+-------------------+
|  6 bits  | 2 bits |     24 bits       |
| t mod 32 |  MSS   | hash(ip, port, t) |
+----------+--------+-------------------+</code></pre>
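<p>To make the bit widths concrete, here is a toy encoder for the three fields. This only illustrates the layout above - the function names are made up, and the real kernel code in <code>net/ipv4/syncookies.c</code> is more involved:</p>

```python
def pack_cookie(time_counter, mss_index, hashed):
    # 6 bits of time counter | 2 bits of MSS index | 24 bits of hash
    assert 0 <= mss_index < 4 and 0 <= hashed < 2 ** 24
    return ((time_counter & 0x3F) << 26) | (mss_index << 24) | hashed

def unpack_cookie(cookie):
    # Recover the three fields from the 32-bit sequence number.
    return ((cookie >> 26) & 0x3F, (cookie >> 24) & 0x3, cookie & 0xFFFFFF)
```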
            <p>With the MSS setting <a href="https://github.com/torvalds/linux/blob/5bbcc0f595fadb4cac0eddc4401035ec0bd95b09/net/ipv4/syncookies.c#L142">truncated to only 4 distinct values</a>, Linux doesn't know any optional TCP parameters of the other party. Information about Timestamps, ECN, Selective ACK, or Window Scaling is lost, and can lead to degraded TCP session performance.</p><p>Fortunately Linux has a workaround. If TCP Timestamps are enabled, the kernel can reuse another slot of 32 bits in the Timestamp field. It contains:</p>
            <pre><code>+-----------+-------+-------+--------+
|  26 bits  | 1 bit | 1 bit | 4 bits |
| Timestamp |  ECN  | SACK  | WScale |
+-----------+-------+-------+--------+</code></pre>
            <p>TCP Timestamps should be enabled by default; to verify, check the sysctl:</p>
            <pre><code>$ sysctl net.ipv4.tcp_timestamps
net.ipv4.tcp_timestamps = 1</code></pre>
            <p>Historically there was plenty of discussion about the usefulness of TCP Timestamps.</p><ul><li><p>In the past timestamps leaked server uptime (whether that matters is another discussion). This <a href="https://github.com/torvalds/linux/commit/95a22caee396cef0bb2ca8fafdd82966a49367bb">was fixed 8 months ago</a>.</p></li><li><p>TCP Timestamps use a <a href="http://highscalability.com/blog/2015/10/14/save-some-bandwidth-by-turning-off-tcp-timestamps.html">non-trivial amount of bandwidth</a> - 12 bytes on each packet.</p></li><li><p>They can add additional randomness to packet checksums, which <a href="https://www.snellman.net/blog/archive/2017-07-20-s3-mystery/">can help with certain broken hardware</a>.</p></li><li><p>As mentioned above, TCP Timestamps can boost the performance of TCP connections if SYN Cookies are engaged.</p></li></ul><p>Currently at Cloudflare, we have TCP Timestamps disabled.</p><p>Finally, with SYN Cookies engaged some cool features won't work - things like <a href="https://lwn.net/Articles/645128/"><code>TCP_SAVED_SYN</code></a>, <code>TCP_DEFER_ACCEPT</code> or <code>TCP_FAST_OPEN</code>.</p>
    <div>
      <h3>SYN Floods at Cloudflare scale</h3>
      <a href="#syn-floods-at-cloudflare-scale">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1nw81MpcuH9JCz80nnq4du/4737ceb4a755113713159ee7b454370a/Screen-Shot-2016-12-02-at-10.53.27-1.png" />
            
            </figure><p>SYN Cookies are a great invention and solve the problem of smaller SYN Floods. At Cloudflare though, we try to avoid them if possible. While sending out a couple of thousand cryptographically verifiable SYN+ACK packets per second is okay, we see <a href="/the-daily-ddos-ten-days-of-massive-attacks/">attacks of more than 200 million packets per second</a>. At this scale, our SYN+ACK responses would just litter the Internet, bringing absolutely no benefit.</p><p>Instead, we attempt to drop the malicious SYN packets at the firewall layer. We use the <code>p0f</code> SYN fingerprints compiled to BPF. Read more in our blog post <a href="/introducing-the-p0f-bpf-compiler/">Introducing the p0f BPF compiler</a>. To detect and deploy the mitigations we developed an automation system we call "Gatebot". We described that system in <a href="/meet-gatebot-a-bot-that-allows-us-to-sleep/">Meet Gatebot - the bot that allows us to sleep</a>.</p>
    <div>
      <h3>Evolving landscape</h3>
      <a href="#evolving-landscape">
        
      </a>
    </div>
    <p>For more - slightly outdated - data on the subject, read <a href="https://veithen.github.io/2014/01/01/how-tcp-backlog-works-in-linux.html">an excellent explanation by Andreas Veithen from 2015</a> and <a href="https://www.giac.org/paper/gsec/2013/syn-cookies-exploration/103486">a comprehensive paper by Gerald W. Gordon from 2013</a>.</p><p>The Linux SYN packet handling landscape is constantly evolving. Until recently SYN Cookies were slow, due to an old-fashioned lock in the kernel. This was fixed in 4.4 and now you can rely on the kernel to be able to send millions of SYN Cookies per second, practically solving the SYN Flood problem for most users. With proper tuning it's possible to mitigate even the most annoying SYN Floods without affecting the performance of legitimate connections.</p><p>Application performance is also getting significant attention. Recent ideas like <code>SO_ATTACH_REUSEPORT_EBPF</code> introduce a whole new layer of programmability into the network stack.</p><p>It's great to see innovations and fresh thinking funneled into the networking stack, in the otherwise stagnant world of operating systems.</p><p><i>Thanks to Binh Le for helping with this post.</i></p><hr /><p><i>Dealing with the internals of Linux and NGINX sounds interesting? Join our </i><a href="https://boards.greenhouse.io/cloudflare/jobs/589572"><i>world famous team</i></a><i> in London, Austin, San Francisco and our elite office in Warsaw, Poland.</i></p><hr /><ol><li><p>I'm simplifying - technically speaking, the SYN Queue stores not-yet-ESTABLISHED connections, not SYN packets themselves. With <code>TCP_SAVE_SYN</code> it gets close enough though. <a href="#fnref1">↩︎</a></p></li><li><p>If <a href="http://man7.org/linux/man-pages/man7/tcp.7.html"><code>TCP_DEFER_ACCEPT</code></a> is new to you, definitely check FreeBSD's version of it - <a href="http://www.freebsd.org/cgi/man.cgi?query=accf_http&amp;sektion=9">accept filters</a>. 
<a href="#fnref2">↩︎</a></p></li></ol> ]]></content:encoded>
            <category><![CDATA[SYN]]></category>
            <category><![CDATA[TCP]]></category>
            <category><![CDATA[Programming]]></category>
            <guid isPermaLink="false">6G40dgfhgCyPSDplLMR9DO</guid>
            <dc:creator>Marek Majkowski</dc:creator>
        </item>
        <item>
            <title><![CDATA[How the CloudFlare Team Got Into Bondage (It's Not What You Think)]]></title>
            <link>https://blog.cloudflare.com/how-the-cloudflare-team-got-into-bondage-its/</link>
            <pubDate>Mon, 08 Apr 2013 07:18:00 GMT</pubDate>
            <description><![CDATA[ At CloudFlare, we're always looking for ways to eliminate bottlenecks. We're only able to deal with the very large amount of traffic that we handle because we've built a network that can efficiently handle an extremely high volume of network requests.  ]]></description>
            <content:encoded><![CDATA[ <p>(Image courtesy of ferelswirl)</p><p>At CloudFlare, we're always looking for ways to eliminate bottlenecks. We're only able to deal with the very large amount of traffic that we handle, especially during large denial of service attacks, because we've built a network that can efficiently handle an extremely high volume of network requests. This post is about the nitty gritty of port bonding, one of the technologies we use, and how it allows us to get the maximum possible network throughput out of our servers.</p>
    <div>
      <h3>Generation Three</h3>
      <a href="#generation-three">
        
      </a>
    </div>
    <p>A rack of equipment in CloudFlare's network has three core components: routers, switches, and servers. We own and install all our own equipment because it's impossible to have the flexibility and efficiency you need to do what we do running on someone else's gear. Over time, we've adjusted the specs of the gear we use based on the needs of our network and what we are able to cost effectively source from vendors.</p><p>Most of the equipment in our network today is based on our Generation 3 (G3) spec, which we deployed throughout 2012. Focusing just on the network connectivity for our G3 gear, our routers have multiple 10Gbps ports which connect out to the Internet as well as in to our switches. Our switches have a handful of 10Gbps ports that we use to connect to our routers and then 48 1Gbps ports that connect to the servers. Finally, our servers have 6 1Gbps ports, two on the motherboard (using Intel's chipset) and four on an Intel PCI network card. (There's an additional IPMI management port as well, but it doesn't figure into this discussion.)</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/45d2KU77a3AmxdK67bKfQf/c89532e9a3fe07c03c9d97ef58e28b97/cloudflare_servers.jpg.scaled500.jpg" />
            
            </figure><p>To get high levels of utilization and keep our server spec consistent and flexible, each of the servers in our network can perform any of the key CloudFlare functions: DNS, front-line, caching, and logging. Cache, for example, is spread across multiple machines in a facility. This means if we add another drive to one of the servers in a data center, then the total available storage space for the cache increases for all the servers in that data center. What's good about this is that, as we need to, we can add more servers and linearly scale capacity across storage, CPU, and, in some applications, RAM. The challenge is that in order to pull this off there needs to be a significant amount of communication between servers across our local area network (LAN).</p><p>When we originally started deploying our G3 servers in early 2012, we treated each 1Gbps port on the switches and routers discretely. While each server could, in theory, handle 6Gbps of traffic, each port could only handle 1Gbps. Usually this was no big deal because we load balanced customers across multiple servers in multiple data centers so on no individual server port was a customer likely to burst over 1Gbps. However, we found that, from time to time, when a customer would come under attack, traffic to individual machines could exceed 1Gbps and overwhelm a port.</p>
    <div>
      <h3>When A Problem Comes Along...</h3>
      <a href="#when-a-problem-comes-along">
        
      </a>
    </div>
    <p>The goal of a denial of service attack is to find a bottleneck and then send enough garbage requests to fill it up and prevent legitimate requests from getting through. At the same time, our goal when mitigating such an attack is first to ensure the attack doesn't harm other customers and then to stop the attack from hurting the actual target.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5nK3bhH8RhZTplMpH38xhS/a350c165fa9650f1ed579f1bd5519824/tumblr_m54e8tjzdQ1qfj10wo1_500.gif" />
            
</figure><p>For the most part, the biggest attacks by volume we see are Layer 3 attacks. In these, packets are stopped at the edge of our network and never reach our server infrastructure. As the <a href="/the-ddos-that-almost-broke-the-internet">very large attack against Spamhaus</a> showed, we have a significant amount of network capacity at our edge and are therefore able to stop these Layer 3 attacks very effectively.</p><p>While the big Layer 3 attacks get the most attention, an attack doesn't need to be so large if it can hit another, narrower bottleneck. For example, switches and routers are largely blind to Layer 7 attacks, which means our servers need to process the requests. That, in turn, means the requests associated with the attack need to pass across the smaller, 1Gbps port on the server. From time to time, we've found that these attacks reached a large enough scale to overwhelm a 1Gbps port on one of our servers, making it a potential bottleneck.</p><p>Beyond raw bandwidth, the other bottleneck with some attacks centers on network interrupts. In most operating systems, every time a packet is received by a server, the network card generates an interrupt (known as an IRQ). An IRQ is effectively an instruction to the CPU to stop whatever else it's doing and deal with an event, in this case a packet arriving over the network. Each network adapter has multiple queues per port that receive these IRQs and then hand them to the server's CPU. The clock speed and driver efficiency of the network adapters, together with the message passing rate of the bus, effectively set the maximum number of interrupts per second, and therefore packets per second, that a server's network interface can handle.</p><p>In certain attacks, like large SYN floods which send a very high volume of very small packets, there can be plenty of bandwidth on a port but a CPU can be bottlenecked on IRQ handling. When this happens, it can shut down a particular core on a CPU or, in the worst case, if IRQs aren't properly balanced, the whole CPU. To better deal with these attacks, we needed to find a way to spread IRQs more intelligently across more interfaces and, in turn, more CPU cores.</p><p>Both of these problems are annoying if they affect the customer under attack, but it is unacceptable if they spill over and affect customers who are not under attack. To ensure that would never happen, we needed to find a way both to increase network capacity and to ensure that customer attacks were isolated from one another. To accomplish this we launched what we affectionately refer to in the office as "Project Bondage."</p>
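One rough way to watch this kind of imbalance in practice is to total the per-CPU interrupt counters that Linux exposes in /proc/interrupts. A sketch (the column layout of /proc/interrupts is assumed here; exact interface names vary per machine):

```shell
#!/bin/sh
# Sum per-CPU interrupt counts for NIC queues (lines mentioning "eth").
# Reads /proc/interrupts-style text on stdin; assumed layout:
#   IRQ:  CPU0-count  CPU1-count ...  <device name>
irq_totals() {
    awk '/eth/ {
        for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++)
            total[i - 2] += $i          # per-CPU running sum
    }
    END {
        for (c = 0; c in total; c++)
            printf "CPU%d %d\n", c, total[c]
    }'
}

# On a live box: irq_totals < /proc/interrupts
```

If one CPU's total dwarfs the others during a flood, that core is the bottleneck regardless of how much port bandwidth remains.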
    <div>
      <h3>Getting Into Bondage</h3>
      <a href="#getting-into-bondage">
        
      </a>
    </div>
    <p>To deal with these challenges we started by implementing what is known as port bonding. The idea of port bonding is simple: use the resources of multiple ports in aggregate in order to support more traffic than any one port can on its own. We use a custom operating system based on the Debian line of Linux. Like most Linux varieties, our OS supports seven different port bonding modes:</p><ul><li><p>[0] Round-robin: Packets are transmitted sequentially through a list of connections</p></li><li><p>[1] Active-backup: Only one connection is active; when it fails, another is activated</p></li><li><p>[2] Balance-xor: Ensures that packets from a given source to a given destination always travel over the same connection</p></li><li><p>[3] Broadcast: Transmits everything over every active connection</p></li><li><p>[4] 802.3ad Dynamic Link Aggregation: Creates aggregation groups that share the same speed and duplex settings; switches upstream must support 802.3ad</p></li><li><p>[5] Balance-tlb: Adaptive transmit load balancing — outgoing traffic is balanced based on the total amount being transmitted</p></li><li><p>[6] Balance-alb: Adaptive load balancing — includes balance-tlb and balances incoming traffic by using ARP negotiation to dynamically change the source MAC addresses of outgoing packets</p></li></ul><p>We use mode 4, 802.3ad Dynamic Link Aggregation. This requires switches that support 802.3ad (our workhorse switch is the Juniper EX4200, which does). Our switches are configured to send packets from each stream to the same network interface. If you want to experiment with port bonding yourself, the next section covers the technical details of exactly how we set it up.</p>
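The reason every packet of a stream lands on the same link is that the bonding driver hashes the flow's identifiers and uses the result to pick a slave. Here is a deliberately simplified sketch of that idea (not the kernel's exact formula; the inputs and masking are illustrative):

```shell
#!/bin/sh
# Simplified flow-hash sketch: pick a slave link index for a flow.
# Illustration only -- not the kernel bonding driver's real algorithm.
#   $1 = src IP last octet, $2 = dst IP last octet,
#   $3 = src port, $4 = dst port, $5 = number of slave links
pick_slave() {
    hash=$(( ($1 ^ $2 ^ $3 ^ $4) & 65535 ))   # XOR the flow identifiers
    echo $(( hash % $5 ))                      # map hash onto a slave
}
```

Because the hash is deterministic, a given 5-tuple always maps to the same link, which preserves per-flow packet ordering across the aggregated ports.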
    <div>
      <h3>The Nitty Gritty</h3>
      <a href="#the-nitty-gritty">
        
      </a>
    </div>
    <p>Port bonding is configured on each server. It requires two Linux components that you can apt-get (assuming you're using a Debian-based Linux) if they're not already installed: ifenslave and ethtool. To initialize the bonding driver we use the following command:</p><p>modprobe bonding mode=4 miimon=100 xmit_hash_policy=1 lacp_rate=1</p><p>Here's how that command breaks down:</p><ul><li><p><b>mode=4</b>: the 802.3ad Dynamic Link Aggregation mode described above</p></li><li><p><b>miimon=100</b>: indicates that the devices are polled every 100ms to check for connection changes, such as a link being down or a link duplex having changed</p></li><li><p><b>xmit_hash_policy=1</b>: instructs the driver to spread the load over interfaces based on source and destination IP addresses instead of MAC addresses</p></li><li><p><b>lacp_rate=1</b>: sets the rate for transmitting LACPDU packets; 0 is once every 30 seconds, 1 is once every second, which allows our network devices to quickly and automatically configure a single logical connection at the switch</p></li></ul><p>After the bonding driver is initialized, we bring down the server's network interfaces:</p><p>ifconfig eth0 down</p><p>ifconfig eth1 down</p><p>We then bring up the bonding interface:</p><p>ifconfig bond0 192.168.0.2/24 up</p><p>We then enslave (seriously, that's the term) the interfaces in the bond:</p><p>ifenslave bond0 eth0</p><p>ifenslave bond0 eth1</p><p>Finally, we check the status of the bonded interface:</p><p>cat /proc/net/bonding/bond0</p><p>From an application perspective, bonded ports appear as a single logical network interface with a higher maximum throughput. Since our switch recognizes and supports 802.3ad Dynamic Link Aggregation, we don't have to make any changes to its configuration in order for port bonding to work. In our case, we aggregate three ports (3Gbps) for handling external traffic and the remaining three ports (3Gbps) for handling intra-server traffic across our LAN.</p>
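If you script against /proc/net/bonding/bond0 rather than eyeballing it, the lines worth checking are the "MII Status:" entries for the bond and each slave. A small sketch (the field names are assumed from typical bonding-driver output) that reports whether any link is down:

```shell
#!/bin/sh
# Check bonding-driver status text for links that are not up.
# Reads /proc/net/bonding/bond0-style text on stdin and prints
# "ok" if every "MII Status:" line reports "up", else "degraded".
bond_health() {
    if grep 'MII Status:' | grep -qv 'up'; then
        echo degraded
    else
        echo ok
    fi
}

# On a live box: bond_health < /proc/net/bonding/bond0
```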
    <div>
      <h3>Working Out the Kinks</h3>
      <a href="#working-out-the-kinks">
        
      </a>
    </div>
    <p>Expanding the maximum effective capacity of each logical interface is half the battle. The other half is ensuring that network interrupts (IRQs) don't become a bottleneck. By default, most Linux distributions rely on a service called irqbalance to set the CPU affinity of each IRQ queue. Unfortunately, we found that irqbalance does not effectively keep one queue from overwhelming another on the same CPU. The problem with this is that, because of the traffic we need to send from machine to machine, external attack traffic risked disrupting internal LAN traffic and affecting site performance beyond the customer under attack.</p><p>To solve this, the first thing we did was disable irqbalance:</p><p>/etc/init.d/irqbalance stop</p><p>update-rc.d irqbalance remove</p><p>Instead, we explicitly set up IRQ handling to isolate our external and internal (LAN) networks. Each of our servers has two physical CPUs (G3 hardware uses a low-power version of Intel's Westmere line of CPUs) with six physical cores each. We use Intel's hyperthreading technology, which effectively doubles the number of logical CPU cores: 12 per CPU, or 24 per server.</p>
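Pinning an IRQ to specific cores means writing a hex bitmask to /proc/irq/&lt;N&gt;/smp_affinity, one bit per logical core. A sketch of the mask arithmetic (the IRQ numbers and the 0-11 / 12-23 core split below are illustrative, not our exact layout):

```shell
#!/bin/sh
# Build a hex CPU-affinity mask covering logical cores $1..$2 inclusive.
# Bit N of the mask corresponds to logical core N.
cpu_mask() {
    lo=$1; hi=$2; mask=0
    for core in $(seq "$lo" "$hi"); do
        mask=$(( mask | (1 << core) ))   # set the bit for this core
    done
    printf '%x\n' "$mask"
}

# Illustrative use (requires root; IRQ numbers vary per machine):
#   cpu_mask 0 11  > /proc/irq/42/smp_affinity   # external traffic
#   cpu_mask 12 23 > /proc/irq/43/smp_affinity   # internal LAN traffic
```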
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/18d512HGibCntbj2ic5VsL/c0639e640d7a45e8b1285e2f87089679/intel_x5645e.jpg.scaled500.jpg" />
            
            </figure><p>Each port on our NICs has a number of queues to handle incoming requests, known as RSS (Receive Side Scaling) queues. Each port has 8 RSS queues and we have six 1Gbps NIC ports per server, for a total of 48 RSS queues. These 48 RSS queues are allocated to the 24 cores, with 2 RSS queues per core. We divide the RSS queues between internal (LAN) traffic and external traffic and bind each type of traffic to one of the two server CPUs. This ensures that even large SYN floods that may affect a machine's ability to handle more external requests won't keep it from handling requests from other servers in the data center.</p>
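One way to confirm the RSS queues are actually sharing the load is to read the per-queue packet counters that ethtool -S exposes. A sketch (the counter naming, rx_queue_&lt;n&gt;_packets, is an assumption here; it varies by NIC driver):

```shell
#!/bin/sh
# Print per-RSS-queue packet counts from "ethtool -S"-style output on
# stdin. Counter naming (rx_queue_N_packets) is driver-dependent.
queue_packets() {
    awk -F': *' '/rx_queue_[0-9]+_packets/ {
        gsub(/^[ \t]+/, "", $1)   # strip indentation from the counter name
        print $1, $2
    }'
}

# On a live box: ethtool -S eth0 | queue_packets
```

Roughly even counts across the queues mean the hashing is spreading flows the way it should; one hot queue means one hot core.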
    <div>
      <h3>The Results</h3>
      <a href="#the-results">
        
      </a>
    </div>
    <p>The net effect of these changes is that we can handle 30% larger SYN floods per server, and our maximum throughput per site per server has increased by 300%. Equally important, custom tuning our IRQ handling lets us ensure that customers under attack are isolated from those who are not, while still delivering maximum performance by fully utilizing all the gear in our network.</p><p>Near the end of 2012, our ops and networking teams sat down to spec our next generation of gear, incorporating everything we'd learned over the previous year. One of the biggest changes we're making with G4 is the jump from 1Gbps network interfaces up to 10Gbps network interfaces on our switches and servers. Even without bonding, our tests of the new G4 gear show that it significantly increases both maximum throughput and IRQ handling. Or, put more succinctly: this next generation of servers is smokin' fast.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ZEiqmp0eo4Ac5dOIUGm6I/eb5daef87ff3901a85cb0c9d91a71aca/next-generation.jpg.scaled500.jpg" />
            
            </figure><p>The first installations of the G4 gear are now in testing in a handful of our facilities. After testing, we plan to roll it out worldwide over the coming months. We're already planning a detailed tour of the gear we chose, an explanation of the decisions we made, and performance benchmarks to show you how this next generation of gear is going to make CloudFlare's network even faster, safer, and smarter. That's a blog post I'm looking forward to writing. Stay tuned!</p> ]]></content:encoded>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Attacks]]></category>
            <category><![CDATA[SYN]]></category>
            <guid isPermaLink="false">12zTD9Yc0ZUsegDbEAm7hU</guid>
            <dc:creator>Matthew Prince</dc:creator>
        </item>
    </channel>
</rss>