
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Tue, 14 Apr 2026 18:47:17 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Migrating billions of records: moving our active DNS database while it’s in use]]></title>
            <link>https://blog.cloudflare.com/migrating-billions-of-records-moving-our-active-dns-database-while-in-use/</link>
            <pubDate>Tue, 29 Oct 2024 14:00:00 GMT</pubDate>
            <description><![CDATA[ DNS records have moved to a new database, bringing improved performance and reliability to all customers. ]]></description>
            <content:encoded><![CDATA[ <p>According to a survey done by <a href="https://w3techs.com/technologies/overview/dns_server"><u>W3Techs</u></a>, as of October 2024, Cloudflare is used as an <a href="https://www.cloudflare.com/en-gb/learning/dns/dns-server-types/"><u>authoritative DNS</u></a> provider by 14.5% of all websites. As an authoritative DNS provider, we are responsible for managing and serving all the DNS records for our clients’ domains. This means we have an enormous responsibility to provide the best service possible, starting at the data plane. As such, we are constantly investing in our infrastructure to ensure the reliability and performance of our systems.</p><p><a href="https://www.cloudflare.com/learning/dns/what-is-dns/"><u>DNS</u></a> is often referred to as the phone book of the Internet, and is one of its key components. If you have ever used a phone book, you know that they can become extremely large depending on the size of the physical area they cover. A <a href="https://www.cloudflare.com/en-gb/learning/dns/glossary/dns-zone/#:~:text=What%20is%20a%20DNS%20zone%20file%3F"><u>zone file</u></a> in DNS is no different from a phone book. It has a list of records that provide details about a domain, usually including critical information like what IP address(es) each hostname is associated with. For example:</p>
            <pre><code>example.com      59 IN A 198.51.100.0
blog.example.com 59 IN A 198.51.100.1
ask.example.com  59 IN A 198.51.100.2</code></pre>
            <p>It is not unusual for these zone files to reach millions of records in size, just for a single domain. The biggest single zone on Cloudflare holds roughly 4 million DNS records, but the vast majority of zones hold fewer than 100 DNS records. Given our scale according to W3Techs, you can imagine how much DNS data Cloudflare alone is responsible for. With this volume of data, and all the complexities that come at that scale, there needs to be a very good reason to move it from one database cluster to another.</p>
    <div>
      <h2>Why migrate </h2>
      <a href="#why-migrate">
        
      </a>
    </div>
    <p>When initially measured in 2022, DNS data took up approximately 40% of the storage capacity in Cloudflare’s main database cluster (<b>cfdb</b>). This database cluster, consisting of a primary system and multiple replicas, is responsible for storing DNS zones, propagated to our <a href="https://www.cloudflare.com/network/"><u>data centers in over 330 cities</u></a> via our distributed KV store <a href="https://blog.cloudflare.com/introducing-quicksilver-configuration-distribution-at-internet-scale/"><u>Quicksilver</u></a>. <b>cfdb</b> is accessed by most of Cloudflare's APIs, including the <a href="https://developers.cloudflare.com/dns/manage-dns-records/how-to/create-dns-records/"><u>DNS Records API</u></a>. Today, the DNS Records API is the API most used by our customers, with each request resulting in a query to the database. As such, it’s always been important to optimize the DNS Records API and its surrounding infrastructure to ensure we can successfully serve every request that comes in.</p><p>As Cloudflare scaled, <b>cfdb</b> was becoming increasingly strained under the pressures of several services, many unrelated to DNS. During spikes of requests to our DNS systems, other Cloudflare services experienced degradation in database performance. It was understood that in order to properly scale, we needed to optimize our database access and improve the systems that interact with it. However, it was evident that system-level improvements could only go so far, and the growing pains were becoming unbearable. In late 2022, the DNS team, with the help of 25 other teams, decided to detach from <b>cfdb</b> and move our DNS records data to another database cluster.</p>
    <div>
      <h2>Pre-migration</h2>
      <a href="#pre-migration">
        
      </a>
    </div>
    <p>From a DNS perspective, this migration to an improved database cluster was in the works for several years. Cloudflare initially relied on a single <a href="https://www.postgresql.org/"><u>Postgres</u></a> database cluster, <b>cfdb</b>. At Cloudflare's inception, <b>cfdb</b> was responsible for storing information about zones and accounts, and the majority of services on the Cloudflare control plane depended on it. Since around 2017, as Cloudflare grew, many services moved their data out of <b>cfdb</b> to be served by a <a href="https://en.wikipedia.org/wiki/Microservices"><u>microservice</u></a>. Unfortunately, the difficulty of these migrations is directly proportional to the number of services that depend on the data being migrated, and in this case, most services require knowledge of both zones and DNS records.</p><p>Although the term “zone” was born from the DNS point of view, it has since evolved into something more. Today, zones on Cloudflare store many different types of non-DNS related settings and help link several non-DNS related products to customers' websites. Therefore, it didn’t make sense to move both zone data and DNS record data together. This separation of two historically tightly coupled DNS concepts proved to be an incredibly challenging problem, involving many engineers and systems. In addition, it was clear that if we were going to dedicate the resources to solving this problem, we should also remove some of the legacy issues that came along with the original solution.</p><p>One of the main issues with the legacy database was that the DNS team had little control over which systems accessed exactly what data and at what rate. Moving to a new database gave us the opportunity to create a more tightly controlled interface to the DNS data.
This was manifested as an internal DNS Records <a href="https://blog.cloudflare.com/moving-k8s-communication-to-grpc/"><u>gRPC API</u></a> which allows us to make sweeping changes to our data while only requiring a single change to the API, rather than coordinating with other systems.  For example, the DNS team can alter access logic and auditing procedures under the hood. In addition, it allows us to appropriately rate-limit and cache data depending on our needs. The move to this new API itself was no small feat, and with the help of several teams, we managed to migrate over 20 services, using 5 different programming languages, from direct database access to using our managed gRPC API. Many of these services touch very important areas such as <a href="https://developers.cloudflare.com/dns/dnssec/"><u>DNSSEC</u></a>, <a href="https://developers.cloudflare.com/ssl/"><u>TLS</u></a>, <a href="https://developers.cloudflare.com/email-routing/"><u>Email</u></a>, <a href="https://developers.cloudflare.com/cloudflare-one/connections/connect-networks/"><u>Tunnels</u></a>, <a href="https://developers.cloudflare.com/workers/"><u>Workers</u></a>, <a href="https://developers.cloudflare.com/spectrum/"><u>Spectrum</u></a>, and <a href="https://developers.cloudflare.com/r2/"><u>R2 storage</u></a>. Therefore, it was important to get it right. </p><p>One of the last issues to tackle was the logical decoupling of common DNS database functions from zone data. Many of these functions expect to be able to access both DNS record data and DNS zone data at the same time. For example, at record creation time, our API needs to check that the zone is not over its maximum record allowance. Originally this check occurred at the SQL level by verifying that the record count was lower than the record limit for the zone. However, once you remove access to the zone itself, you are no longer able to confirm this. 
Our DNS Records API also made use of SQL functions to audit record changes, which requires access to both DNS record and zone data. Luckily, over the past several years, we have migrated this functionality out of our monolithic API and into separate microservices. This allowed us to move the auditing and zone setting logic to the application level rather than the database level. Ultimately, we are still taking advantage of SQL functions in the new database cluster, but they are fully independent of any other legacy systems, and are able to take advantage of the latest Postgres version.</p><p>Now that Cloudflare DNS was mostly decoupled from the zones database, it was time to proceed with the data migration. For this, we built what would become our <b>Change Data Capture and Transfer Service (CDCTS).</b></p>
    <div>
      <h2>Requirements for the Change Data Capture and Transfer Service</h2>
      <a href="#requirements-for-the-change-data-capture-and-transfer-service">
        
      </a>
    </div>
    <p>The Database team is responsible for all Postgres clusters within Cloudflare, and was tasked with executing the data migration of two tables that store DNS data: <i>cf_rec</i> and <i>cf_archived_rec</i>, from the original <b>cfdb</b> cluster to a new cluster we called <b>dnsdb</b>. We had several key requirements that drove our design:</p><ul><li><p><b>Don’t lose data. </b>This is the number one priority when handling any sort of data. Losing data means losing trust, and it is incredibly difficult to regain that trust once it’s lost. An important part of this is the ability to prove that no data has been lost; ideally, the migration process would be easily auditable.</p></li><li><p><b>Minimize downtime</b>. We wanted a solution with less than a minute of downtime during the migration, and ideally with just a few seconds of delay.</p></li></ul><p>These two requirements meant that we had to be able to migrate data changes in near real-time, meaning we either needed to implement logical replication or some custom method to capture changes, migrate them, and apply them in a table in a separate Postgres cluster.</p><p>We first looked at implementing logical replication via <a href="https://github.com/2ndQuadrant/pglogical"><u>pgLogical</u></a>, but had concerns about its performance and our ability to audit its correctness. Then some additional requirements emerged that made a pgLogical implementation of logical replication impossible:</p><ul><li><p><b>The ability to move data must be bidirectional.</b> We had to have the ability to switch back to <b>cfdb</b> without significant downtime in case of unforeseen problems with the new implementation.
</p></li><li><p><b>Partition the </b><b><i>cf_rec</i></b><b> table in the new database.</b> This was a long-desired improvement, and since most access to <i>cf_rec</i> is by zone_id, it was decided that <b>mod(zone_id, num_partitions)</b> would be the partition key.</p></li><li><p><b>Keep transferred data accessible from the original database. </b>In case any functionality still needed access to the data, a foreign table pointing to <b>dnsdb</b> would be available in <b>cfdb</b>. This could be used as emergency access to avoid needing to roll back the entire migration for a single missed process.</p></li><li><p><b>Only allow writes in one database. </b>Applications should know where the primary database is, and should be blocked from writing to both databases at the same time.</p></li></ul>
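    <p>Although this is not the exact production schema, the partitioning requirement above can be sketched in Postgres with list partitioning on a modulus expression. The column list and a partition count of 8 are illustrative assumptions:</p>
            <pre><code>-- Parent table partitioned on mod(zone_id, num_partitions)
CREATE TABLE cf_rec (
    rec_id  bigint NOT NULL,
    zone_id bigint NOT NULL,
    name    text   NOT NULL
    -- ... remaining DNS record columns ...
) PARTITION BY LIST ((zone_id % 8));

-- One child table per remainder
CREATE TABLE cf_rec_p0 PARTITION OF cf_rec FOR VALUES IN (0);
CREATE TABLE cf_rec_p1 PARTITION OF cf_rec FOR VALUES IN (1);
-- ... and so on, up to cf_rec_p7 FOR VALUES IN (7)</code></pre>
    <p>With this layout, any query that filters by zone_id only touches a single partition.</p>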
    <div>
      <h2>Details about the tables being migrated</h2>
      <a href="#details-about-the-tables-being-migrated">
        
      </a>
    </div>
    <p>The primary table, <i>cf_rec</i>, stores DNS record information, and its rows are regularly inserted, updated, and deleted. At the time of the migration, this table had 1.7 billion records, and with several indexes took up 1.5 TB of disk. Typical daily usage would observe 3-5 million inserts, 1 million updates, and 3-5 million deletes.</p><p>The second table, <i>cf_archived_rec</i>, stores obsolete copies of <i>cf_rec</i> rows; this table generally only has records inserted and is never updated or deleted. As such, it would see roughly 3-5 million inserts per day, corresponding to the records deleted from <i>cf_rec</i>. At the time of the migration, this table had roughly 4.3 billion records.</p><p>Fortunately, neither table made use of database triggers or foreign keys, which meant that we could insert/update/delete records in this table without triggering changes or worrying about dependencies on other tables.</p><p>Ultimately, both of these tables are highly active and are the source of truth for many highly critical systems at Cloudflare.</p>
    <div>
      <h2>Designing the Change Data Capture and Transfer Service</h2>
      <a href="#designing-the-change-data-capture-and-transfer-service">
        
      </a>
    </div>
    <p>There were two main parts to this database migration:</p><ol><li><p><b>Initial copy:</b> Take all the data from <b>cfdb </b>and put it in <b>dnsdb.</b></p></li><li><p><b>Change copy:</b> Take all the changes in <b>cfdb </b>since the initial copy and update <b>dnsdb</b> to reflect them. This is the more involved part of the process.</p></li></ol><p>Normally, logical replication replays every insert, update, and delete on a copy of the data in the same transaction order, making a single-threaded pipeline.  We considered using a queue-based system but again, speed and auditability were both concerns as any queue would typically replay one change at a time.  We wanted to be able to apply large sets of changes, so that after an initial dump and restore, we could quickly catch up with the changed data. For the rest of the blog, we will only speak about <i>cf_rec</i> for simplicity, but the process for <i>cf_archived_rec</i> is the same.</p><p>What we decided on was a simple change capture table. Rows from this capture table would be loaded in real-time by a database trigger, with a transfer service that could migrate and apply thousands of changed records to <b>dnsdb</b> in each batch. Lastly, we added some auditing logic on top to ensure that we could easily verify that all data was safely transferred without downtime.</p>
    <div>
      <h3>Basic model of change data capture </h3>
      <a href="#basic-model-of-change-data-capture">
        
      </a>
    </div>
    <p>For <i>cf_rec</i> to be migrated, we would create a change logging table, along with a trigger function and a table trigger to capture the new state of the record after any insert/update/delete.</p><p>The change logging table, named <i>log_cf_rec</i>, had the same columns as <i>cf_rec</i>, as well as four new columns:</p><ul><li><p><b>change_id</b>: a sequence-generated unique identifier of the change record</p></li><li><p><b>action</b>: a single character indicating whether this record represents an [i]nsert, [u]pdate, or [d]elete</p></li><li><p><b>change_timestamp</b>: the date/time when the change record was created</p></li><li><p><b>change_user:</b> the database user that made the change</p></li></ul><p>A trigger was placed on the <i>cf_rec</i> table so that each insert/update would copy the new values of the record into the change table, and for deletes, create a 'D' record with the primary key value. </p><p>Here is an example of the change logging where we delete, re-insert, update, and finally select from the <i>log_cf_rec</i> table. Note that the actual <i>cf_rec</i> and <i>log_cf_rec</i> tables have many more columns, but have been edited for simplicity.</p>
            <pre><code>dns_records=# DELETE FROM cf_rec WHERE rec_id = 13;

dns_records=# SELECT * from log_cf_rec;
change_id | action | rec_id | zone_id | name
----------------------------------------------
1         | D      | 13     |         |   

dns_records=# INSERT INTO cf_rec VALUES(13,299,'cloudflare.example.com');  

dns_records=# UPDATE cf_rec SET name = 'test.example.com' WHERE rec_id = 13;

dns_records=# SELECT * from log_cf_rec;
change_id | action | rec_id | zone_id | name
----------------------------------------------
1         | D      | 13     |         |  
2         | I      | 13     | 299     | cloudflare.example.com
3         | U      | 13     | 299     | test.example.com </code></pre>
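            <p>The trigger described above can be sketched as follows, trimmed to the simplified columns in the example; <i>change_id</i>, <i>change_timestamp</i>, and <i>change_user</i> are assumed to be filled in by column defaults, and the function and trigger names are illustrative:</p>
            <pre><code>CREATE FUNCTION fn_log_cf_rec_change() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'DELETE' THEN
        -- For deletes, record only the primary key value
        INSERT INTO log_cf_rec (action, rec_id) VALUES ('D', OLD.rec_id);
    ELSE
        -- For inserts and updates, capture the new state of the record
        INSERT INTO log_cf_rec (action, rec_id, zone_id, name)
        VALUES (left(TG_OP, 1), NEW.rec_id, NEW.zone_id, NEW.name);
    END IF;
    RETURN NULL; -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_log_cf_rec
    AFTER INSERT OR UPDATE OR DELETE ON cf_rec
    FOR EACH ROW EXECUTE FUNCTION fn_log_cf_rec_change();</code></pre>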
            <p>In addition to <i>log_cf_rec</i>, we also introduced 2 more tables in <b>cfdb </b>and 3 more tables in <b>dnsdb:</b></p><p><b>cfdb</b></p><ol><li><p><i>transferred_log_cf_rec</i>: Responsible for auditing the batches transferred to <b>dnsdb</b>.</p></li><li><p><i>log_change_action</i>:<i> </i>Responsible for summarizing the transfer size in order to compare with the <i>log_change_action </i>in <b>dnsdb.</b></p></li></ol><p><b>dnsdb</b></p><ol><li><p><i>migrate_log_cf_rec</i>:<i> </i>Responsible for collecting batch changes in <b>dnsdb</b>, which would later be applied to <i>cf_rec </i>in <b>dnsdb</b><i>.</i></p></li><li><p><i>applied_migrate_log_cf_rec</i>:<i> </i>Responsible for auditing the batches that had been successfully applied to cf_rec in <b>dnsdb.</b></p></li><li><p><i>log_change_action</i>:<i> </i>Responsible for summarizing the transfer size in order to compare with the <i>log_change_action </i>in <b>cfdb.</b></p></li></ol>
    <div>
      <h3>Initial copy</h3>
      <a href="#initial-copy">
        
      </a>
    </div>
    <p>With change logging in place, we were now ready to do the initial copy of the tables from <b>cfdb</b> to <b>dnsdb</b>. Because we were changing the structure of the tables in the destination database, and because of network timeouts, we wanted to bring the data over in small pieces and validate that it was brought over accurately, rather than doing a single multi-hour copy or <a href="https://www.postgresql.org/docs/current/app-pgdump.html"><u>pg_dump</u></a>. We also wanted to ensure a long-running read could not impact production, and that the process could be paused and resumed at any time. The basic model to transfer data was a simple psql copy statement piped into another psql copy statement. No intermediate files were used.</p><p><code>psql_cfdb -c "COPY (SELECT * FROM cf_rec WHERE id BETWEEN n AND n+1000000) TO STDOUT" | </code></p><p><code>psql_dnsdb -c "COPY cf_rec FROM STDIN"</code></p><p>Prior to a batch being moved, the count of records to be moved was recorded in <b>cfdb</b>, and after each batch was moved, a count was recorded in <b>dnsdb</b> and compared to the count in <b>cfdb</b> to ensure that a network interruption or other unforeseen error did not cause data to be lost. The bash script to copy data looked like this, where we included files that could be touched to pause or end the copy (in case it caused load on production or there was an incident). Once again, the code below has been heavily simplified.</p>
            <pre><code>#!/bin/bash
# Each argument is the starting id of a batch; psql_cfdb and psql_dnsdb are
# wrappers around psql that connect to the source and destination clusters.
for i in "$@"; do
   # Allow user to control whether this is paused or not via pause_copy file
   while [ -f pause_copy ]; do
      sleep 1
   done
   # Allow user to end migration by creating end_copy file
   if [ ! -f end_copy ]; then
      # Copy a batch of records from cfdb to dnsdb
      psql_cfdb -c "COPY (SELECT * FROM cf_rec WHERE id BETWEEN $i AND $i+1000000) TO STDOUT" |
         psql_dnsdb -c "COPY cf_rec FROM STDIN"
      # Get count of records from cfdb
      src=$(psql_cfdb -tAc "SELECT count(*) FROM cf_rec WHERE id BETWEEN $i AND $i+1000000")
      # Get count of records from dnsdb
      dst=$(psql_dnsdb -tAc "SELECT count(*) FROM cf_rec WHERE id BETWEEN $i AND $i+1000000")
      # Compare cfdb count with dnsdb count and alert if different
      if [ "$src" != "$dst" ]; then
         echo "ALERT: batch starting at $i: cfdb=$src dnsdb=$dst" >&2
      fi
   fi
done
</code></pre>
            <p><sup><i>Bash copy script</i></sup></p>
    <div>
      <h3>Change copy</h3>
      <a href="#change-copy">
        
      </a>
    </div>
    <p>Once the initial copy was completed, we needed to update <b>dnsdb</b> with any changes that had occurred in <b>cfdb</b> since the start of the initial copy. To implement this change copy, we created a function <i>fn_log_change_transfer_log_cf_rec</i> that could be passed a <i>batch_id</i> and <i>batch_size</i>, and did 5 things, all of which were executed in a single database <a href="https://www.postgresql.org/docs/current/tutorial-transactions.html"><u>transaction</u></a>:</p><ol><li><p>Select a <i>batch_size</i> of records from <i>log_cf_rec</i> in <b>cfdb</b>.</p></li><li><p>Copy the batch to <i>transferred_log_cf_rec</i> in <b>cfdb</b> to mark it as transferred.</p></li><li><p>Delete the batch from <i>log_cf_rec</i>.</p></li><li><p>Write a summary of the action to the <i>log_change_action</i> table. This will later be used to compare the transferred records with <b>dnsdb</b>.</p></li><li><p>Return the batch of records.</p></li></ol><p>We then took the returned batch of records and copied them to <i>migrate_log_cf_rec</i> in <b>dnsdb</b>. We used the same bash script as above, except this time, the copy command looked like this:</p><p><code>psql_cfdb -c "COPY (SELECT * FROM fn_log_change_transfer_log_cf_rec(&lt;batch_id&gt;, &lt;batch_size&gt;)) TO STDOUT" | </code></p><p><code>psql_dnsdb -c "COPY migrate_log_cf_rec FROM STDIN"</code></p>
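    <p>As a rough sketch, <i>fn_log_change_transfer_log_cf_rec</i> could be written with data-modifying CTEs so that all five steps commit or roll back together. The audit table layouts used here are assumptions, not the production definitions:</p>
            <pre><code>CREATE FUNCTION fn_log_change_transfer_log_cf_rec(p_batch_id bigint, p_batch_size int)
RETURNS SETOF log_cf_rec AS $$
    WITH batch AS (
        -- Steps 1 and 3: select a batch and delete it from the change log
        DELETE FROM log_cf_rec
        WHERE change_id IN (
            SELECT change_id FROM log_cf_rec
            ORDER BY change_id
            LIMIT p_batch_size)
        RETURNING *
    ), audit AS (
        -- Step 2: mark the batch as transferred
        INSERT INTO transferred_log_cf_rec
        SELECT p_batch_id, * FROM batch
    ), summary AS (
        -- Step 4: summarize the transfer for later comparison
        INSERT INTO log_change_action (batch_id, record_count)
        SELECT p_batch_id, count(*) FROM batch
    )
    -- Step 5: return the batch to the caller
    SELECT * FROM batch;
$$ LANGUAGE sql;</code></pre>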
    <div>
      <h3>Applying changes in the destination database</h3>
      <a href="#applying-changes-in-the-destination-database">
        
      </a>
    </div>
    <p>Now, with a batch of data in the <i>migrate_log_cf_rec</i> table, we called a newly created function <i>log_change_apply</i> to apply and audit the changes. Once again, this was all executed within a single database transaction. The function did the following:</p><ol><li><p>Move a batch from the <i>migrate_log_cf_rec</i> table to a new temporary table.</p></li><li><p>Write the counts for the batch_id to the <i>log_change_action</i> table.</p></li><li><p>Delete from the temporary table all but the latest record for each unique id (the last action). For example, an insert followed by 30 updates would have a single record left, the final update. There is no need to apply all the intermediate updates.</p></li><li><p>Delete any record from <i>cf_rec</i> that has a corresponding change.</p></li><li><p>Insert any [i]nsert or [u]pdate records into <i>cf_rec</i>.</p></li><li><p>Copy the batch to <i>applied_migrate_log_cf_rec</i> for a full audit trail.</p></li></ol>
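    <p>Steps 3 through 5 are the heart of the apply logic. Assuming the batch has been moved into a temporary table named <i>tmp_batch</i> (an illustrative name), they could look like:</p>
            <pre><code>-- Keep only the latest change per record; intermediate updates are
-- redundant because each change row carries the full new state.
DELETE FROM tmp_batch t
USING tmp_batch newer
WHERE newer.rec_id = t.rec_id
  AND newer.change_id &gt; t.change_id;

-- Remove the old version of every changed record
DELETE FROM cf_rec r
USING tmp_batch t
WHERE r.rec_id = t.rec_id;

-- Re-insert the final state for inserts and updates;
-- deleted records simply stay deleted
INSERT INTO cf_rec (rec_id, zone_id, name)
SELECT rec_id, zone_id, name
FROM tmp_batch
WHERE action IN ('I', 'U');</code></pre>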
    <div>
      <h3>Putting it all together</h3>
      <a href="#putting-it-all-together">
        
      </a>
    </div>
    <p>There were 4 distinct phases, each of which was part of a different database transaction:</p><ol><li><p>Call <i>fn_log_change_transfer_log_cf_rec </i>in <b>cfdb </b>to get a batch of records.</p></li><li><p>Copy the batch of records to <b>dnsdb.</b></p></li><li><p>Call <i>log_change_apply </i>in <b>dnsdb </b>to apply the batch of records.</p></li><li><p>Compare the <i>log_change_action</i> table in each respective database to ensure counts match.</p></li></ol>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2REIq71tc7M4jKPLZSJzS9/11f22f700300f2ad3a5ee5ca85a75480/Applying_changes_in_the_destination_database.png" />
          </figure><p>This process was run every 3 seconds for several weeks before the migration to ensure that we could keep <b>dnsdb</b> in sync with <b>cfdb</b>.</p>
    <div>
      <h2>Managing which database is live</h2>
      <a href="#managing-which-database-is-live">
        
      </a>
    </div>
    <p>The last major pre-migration task was the construction of the request locking system that would be used throughout the actual migration. The aim was to create a system through which the database could communicate with the DNS Records API, allowing the API to handle HTTP requests more gracefully. If done correctly, this could reduce downtime for DNS Records API users to nearly zero.</p><p>In order to facilitate this, a new table called <i>cf_migration_manager</i> was created. The table would be periodically polled by the DNS Records API, communicating two critical pieces of information:</p><ol><li><p><b>Which database was active.</b> Here we just used a simple A or B naming convention.</p></li><li><p><b>Whether the database was locked for writing</b>. In the event the database was locked for writing, the DNS Records API would hold HTTP requests until the lock was released by the database.</p></li></ol><p>Both pieces of information would be controlled by a migration manager script.</p><p>The benefit of migrating the 20+ internal services from direct database access to our internal DNS Records gRPC API was that we could control access to the database and ensure that no one else would be writing without going through the <i>cf_migration_manager</i>.</p>
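    <p>A minimal sketch of what the polled table might look like; the column names and types are assumptions:</p>
            <pre><code>CREATE TABLE cf_migration_manager (
    -- Which database is active: 'A' (cfdb) or 'B' (dnsdb)
    active_db    char(1)     NOT NULL CHECK (active_db IN ('A', 'B')),
    -- When true, the DNS Records API holds write requests
    -- until the lock is released
    write_locked boolean     NOT NULL DEFAULT false,
    updated_at   timestamptz NOT NULL DEFAULT now()
);</code></pre>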
    <div>
      <h2>During the migration </h2>
      <a href="#during-the-migration">
        
      </a>
    </div>
    <p>Although we aimed to complete this migration in a matter of seconds, we announced a DNS maintenance window that could last a couple of hours, just to be safe. Now that everything was set up, and both <b>cfdb</b> and <b>dnsdb</b> were roughly in sync, it was time to proceed with the migration. The steps were as follows:</p><ol><li><p>Lower the time between copies from 3s to 0.5s.</p></li><li><p>Lock <b>cfdb</b> for writes via <i>cf_migration_manager</i>. This would tell the DNS Records API to hold write connections.</p></li><li><p>Make <b>cfdb</b> read-only and migrate the last logged changes to <b>dnsdb</b>.</p></li><li><p>Enable writes to <b>dnsdb</b>.</p></li><li><p>Tell the DNS Records API, via the <i>cf_migration_manager</i>, that <b>dnsdb</b> is the new primary database and that write connections can proceed.</p></li></ol><p>Even though we needed to ensure that the last changes were copied to <b>dnsdb</b> before enabling writes, this entire process took no more than 2 seconds. During the migration we saw a spike in API latency as the migration manager locked writes and then worked through a backlog of queries. However, latencies returned to normal after several minutes.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6agUpD8BQVxgDupBrwtTw3/38c96f91879c6539011866821ad6f11a/image3.png" />
          </figure><p><sup><i>DNS Records API Latency and Requests during migration</i></sup></p><p>Unfortunately, due to the far-reaching impact that DNS has at Cloudflare, this was not the end of the migration. There were 3 lesser-used services that had slipped through our scan of services accessing DNS records via <b>cfdb</b>. Fortunately, the setup of the foreign table meant that we could very quickly fix any residual issues by simply changing the table name.</p>
    <div>
      <h2>Post-migration</h2>
      <a href="#post-migration">
        
      </a>
    </div>
    <p>Almost immediately, as expected, we saw a steep drop in usage across <b>cfdb</b>. This freed up a lot of resources for other services to take advantage of.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/Xfnbc9MZLwJB91ypItWsi/1eb21362893b31a1e3c846d1076a9f5b/image6.jpg" />
          </figure><p><sup><i><b>cfdb</b></i></sup><sup><i> usage dropped significantly after the migration period.</i></sup></p><p>Since the migration, the average <b>requests</b> per second to the DNS Records API has more than <b>doubled</b>. At the same time, our CPU usage across both <b>cfdb</b> and <b>dnsdb</b> has settled at below 10% as seen below, giving us room for spikes and future growth. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/39su35dkb5Pl8uwYfYjHLg/0eb26ced30b44efb71abb73830e01f3a/image2.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5AdlLKXtD68QWCsMVLKnkt/9137beee9c941827eb57c53825ffe209/image4.png" />
          </figure><p><sup><i><b>cfdb</b></i></sup><sup><i> and </i></sup><sup><i><b>dnsdb</b></i></sup><sup><i> CPU usage now</i></sup></p><p>As a result of this improved capacity, our database-related incident rate dropped dramatically.</p><p>As for query latencies, our latency post-migration is slightly lower on average, with fewer sustained spikes above 500ms. However, the performance improvement is most noticeable during high load periods, when our database handles spikes without significant issues. Many of these spikes come as a result of clients making calls to collect a large number of DNS records or making several changes to their zone in short bursts. Both of these actions are common use cases for large customers onboarding zones.</p><p>In addition to these improvements, the DNS team also has more granular control over <b>dnsdb</b> cluster-specific settings that can be tweaked for our needs rather than catering to all the other services. For example, we were able to make custom changes to replication lag limits to ensure that services using replicas were able to read with some amount of certainty that the data would exist in a consistent form. Measures like this reduce overall load on the primary because almost all read queries can now go to the replicas.</p><p>Although this migration was a resounding success, we are always working to improve our systems. As we grow, so do our customers, which means the need to scale never really ends. We have more exciting improvements on the roadmap, and we are looking forward to sharing more details in the future.</p><p>The DNS team at Cloudflare isn’t the only team solving challenging problems like the one above. If this sounds interesting to you, we have many more tech deep dives on our blog, and we are always looking for curious engineers to join our team — see open opportunities <a href="https://www.cloudflare.com/en-gb/careers/jobs/"><u>here</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[Database]]></category>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[Postgres]]></category>
            <category><![CDATA[Tracing]]></category>
            <category><![CDATA[Quicksilver]]></category>
            <guid isPermaLink="false">24rozMdbFQ7jmUgRNMF4RU</guid>
            <dc:creator>Alex Fattouche</dc:creator>
            <dc:creator>Corey Horton</dc:creator>
        </item>
        <item>
            <title><![CDATA[Making zone management more efficient with batch DNS record updates]]></title>
            <link>https://blog.cloudflare.com/batched-dns-changes/</link>
            <pubDate>Mon, 23 Sep 2024 13:00:00 GMT</pubDate>
            <description><![CDATA[ In response to customer demand, we now support the ability to DELETE, PATCH, PUT and POST multiple DNS records in a single API call, enabling more efficient and reliable zone management. ]]></description>
            <content:encoded><![CDATA[ <p>Customers that use Cloudflare to manage their DNS often need to create a whole batch of records, enable <a href="https://developers.cloudflare.com/dns/manage-dns-records/reference/proxied-dns-records/"><u>proxying</u></a> on many records, update many records to point to a new target at the same time, or even delete all of their records. Historically, customers had to resort to bespoke scripts to make these changes, which came with their own set of issues. In response to customer demand, we are excited to announce support for batched API calls to the <a href="https://developers.cloudflare.com/dns/manage-dns-records/how-to/create-dns-records/"><u>DNS records API</u></a> starting today. This lets customers make large changes to their zones much more efficiently than before. Whether sending a POST, PUT, PATCH, or DELETE, users can now combine these four <a href="https://en.wikipedia.org/wiki/HTTP#Request_methods"><u>HTTP methods</u></a>, and multiple requests, in a single API call.</p>
    <div>
      <h2>Efficient zone management matters</h2>
      <a href="#efficient-zone-management-matters">
        
      </a>
    </div>
    <p><a href="https://www.cloudflare.com/en-gb/learning/dns/dns-records/"><u>DNS records</u></a> are an essential part of most web applications and websites, and they serve many different purposes. The most common use case for a DNS record is to have a hostname point to an <a href="https://en.wikipedia.org/wiki/IPv4"><u>IPv4</u></a> address; this is called an <a href="https://www.cloudflare.com/en-gb/learning/dns/dns-records/dns-a-record/"><u>A record</u></a>:</p><p><b>example.com</b> 59 IN A <b>198.51.100.0</b></p><p><b>blog.example.com</b> 59 IN A <b>198.51.100.1</b></p><p><b>ask.example.com</b> 59 IN A <b>198.51.100.2</b></p><p>In its simplest form, this enables Internet users to connect to websites without needing to memorize their IP address. </p><p>Often, our customers need to be able to do things like create a whole batch of records, or enable <a href="https://developers.cloudflare.com/dns/manage-dns-records/reference/proxied-dns-records/"><u>proxying</u></a> on many records, or update many records to point to a new target at the same time, or even delete all of their records. Unfortunately, for most of these cases, we were asking customers to write their own custom scripts or programs to do these tasks for them, a number of which are open source and whose content has not been reviewed by us. These scripts are often used to avoid needing to repeatedly make the same API calls manually. This takes time, not only for the development of the scripts, but also to simply execute all the API calls, not to mention it can leave the zone in a bad state if some changes fail while others succeed.</p>
    <div>
      <h2>Introducing /batch</h2>
      <a href="#introducing-batch">
        
      </a>
    </div>
    <p>Starting today, everyone with a <a href="https://developers.cloudflare.com/dns/zone-setups/"><u>Cloudflare zone</u></a> will have access to this endpoint, with free tier customers getting access to 200 changes in one batch, and paid plans getting access to 3,500 changes in one batch. We have successfully tested up to 100,000 changes in one call. The API is simple, expecting a POST request to be made to the <a href="https://developers.cloudflare.com/api/operations/dns-records-for-a-zone-batch-dns-records"><u>new API endpoint</u></a> /dns_records/batch, which passes in a JSON object in the body in the format:</p>
            <pre><code>{
    deletes:[]Record
    patches:[]Record
    puts:[]Record
    posts:[]Record
}
</code></pre>
            <p>Each list of records []Record will follow the same requirements as the regular API, except that the record ID on deletes, patches, and puts will be required within the Record object itself. Here is a simple example:</p>
            <pre><code>{
    "deletes": [
        {
            "id": "143004ef463b464a504bde5a5be9f94a"
        },
        {
            "id": "165e9ef6f325460c9ca0eca6170a7a23"
        }
    ],
    "patches": [
        {
            "id": "16ac0161141a4e62a79c50e0341de5c6",
            "content": "192.0.2.45"
        },
        {
            "id": "6c929ea329514731bcd8384dd05e3a55",
            "name": "update.example.com",
            "proxied": true
        }
    ],
    "puts": [
        {
            "id": "ee93eec55e9e45f4ae3cb6941ffd6064",
            "content": "192.0.2.50",
            "name": "no-change.example.com",
            "proxied": false,
            "ttl": 1
        },
        {
            "id": "eab237b5a67e41319159660bc6cfd80b",
            "content": "192.0.2.45",
            "name": "no-change.example.com",
            "proxied": false,
            "ttl": 3000
        }
    ],
    "posts": [
        {
            "name": "@",
            "type": "A",
            "content": "192.0.2.45",
            "proxied": false,
            "ttl": 3000
        },
        {
            "name": "a.example.com",
            "type": "A",
            "content": "192.0.2.45",
            "proxied": true
        }
    ]
}</code></pre>
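<p>As a sketch of how a client might assemble and size-check such a payload before sending it (the helper name and the limits-as-constants below are our own, mirroring the per-plan caps mentioned above; they are not part of Cloudflare's API):</p>

```python
import json

# Per-plan batch size caps described above (free: 200 changes, paid: 3,500).
MAX_CHANGES = {"free": 200, "paid": 3500}

def build_batch(deletes=(), patches=(), puts=(), posts=(), plan="free"):
    """Assemble a /batch payload and enforce the per-plan change limit."""
    payload = {
        "deletes": list(deletes),
        "patches": list(patches),
        "puts": list(puts),
        "posts": list(posts),
    }
    total = sum(len(v) for v in payload.values())
    if total > MAX_CHANGES[plan]:
        raise ValueError(f"{total} changes exceeds the {plan} plan limit")
    return json.dumps(payload)

body = build_batch(
    deletes=[{"id": "143004ef463b464a504bde5a5be9f94a"}],
    posts=[{"name": "@", "type": "A", "content": "192.0.2.45", "ttl": 3000}],
)
```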
            <p>Our API will then parse this and execute these calls in the following order: </p><ol><li><p>deletes</p></li><li><p>patches</p></li><li><p>puts</p></li><li><p>posts</p></li></ol><p>Each of these respective lists will be executed in the order given. This ordering system is important because it removes the need for our clients to worry about conflicts, such as if they need to create a CNAME on the same hostname as a to-be-deleted A record, which is not allowed in <a href="https://datatracker.ietf.org/doc/html/rfc1912#section-2.4"><u>RFC 1912</u></a>. In the event that any of these individual actions fail, the entire API call will fail and return the first error it sees. The batch request will also be executed inside a single database <a href="https://en.wikipedia.org/wiki/Database_transaction"><u>transaction</u></a>, which will roll back in the event of failure.</p><p>After the batch request has been successfully executed in our database, we then propagate the changes to our edge via <a href="https://blog.cloudflare.com/introducing-quicksilver-configuration-distribution-at-internet-scale"><u>Quicksilver</u></a>, our distributed KV store. Each of the individual record changes inside the batch request is treated as a single key-value pair, and database transactions are not supported. As such, <b>we cannot guarantee that the propagation to our edge servers will be atomic</b>. For example, if replacing a <a href="https://developers.cloudflare.com/dns/manage-dns-records/how-to/subdomains-outside-cloudflare/"><u>delegation</u></a> with an A record, some resolvers may see the <a href="https://www.cloudflare.com/en-gb/learning/dns/dns-records/dns-ns-record/"><u>NS</u></a> record removed before the A record is added. </p><p>The response will follow the same format as the request. 
Patches and puts that result in no changes will be placed at the end of their respective lists.</p><p>We are also introducing some new changes to the Cloudflare dashboard, allowing users to select multiple records and subsequently:</p><ol><li><p>Delete all selected records</p></li><li><p>Change the proxy status of all selected records</p></li></ol>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ZU7nvMlcH2L51IqJrS1zC/db7ac600e503a72bb0c25679d63394e7/BLOG-2495_2.png" />
          </figure><p>We plan to continue improving the dashboard to support more batch actions based on your feedback.</p>
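<p>The fixed execution order (deletes, then patches, puts, and posts) and the all-or-nothing transaction behaviour described earlier can be sketched as follows, with a toy in-memory dict standing in for the real database (names here are illustrative only):</p>

```python
import copy

def apply_batch(records, batch):
    """Apply a batch to a dict of id -> record, in the fixed order
    deletes -> patches -> puts -> posts. If any action fails, the
    whole batch is rolled back, mimicking a single DB transaction."""
    snapshot = copy.deepcopy(records)  # stand-in for a transaction snapshot
    try:
        for change in batch.get("deletes", []):
            del records[change["id"]]           # KeyError if missing -> batch fails
        for change in batch.get("patches", []):
            records[change["id"]].update(
                {k: v for k, v in change.items() if k != "id"})
        for change in batch.get("puts", []):
            records[change["id"]] = {k: v for k, v in change.items() if k != "id"}
        for change in batch.get("posts", []):
            new_id = f"rec{len(records)}"       # toy ID generation
            records[new_id] = dict(change)
    except Exception:
        records.clear()
        records.update(snapshot)                # roll back on any failure
        raise
    return records
```

Performing the deletes first is what lets a later post create, say, a CNAME on a hostname whose A record is removed in the same batch.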
    <div>
      <h2>The journey</h2>
      <a href="#the-journey">
        
      </a>
    </div>
    <p>Although on the surface this batch endpoint may seem like a fairly simple change, behind the scenes it is the culmination of a multi-year, multi-team effort. Over the past several years, we have been working hard to improve the DNS pipeline that takes our customers' records and pushes them to <a href="https://blog.cloudflare.com/introducing-quicksilver-configuration-distribution-at-internet-scale"><u>Quicksilver</u></a>, our distributed database. As part of this effort, we have been improving our <a href="https://developers.cloudflare.com/api/operations/dns-records-for-a-zone-list-dns-records"><u>DNS Records API</u></a> to reduce the overall latency. The DNS Records API is Cloudflare's most used API externally, serving twice as many requests as any other API at peak. In addition, the DNS Records API supports over 20 internal services, many of which touch very important areas such as DNSSEC, TLS, Email, Tunnels, Workers, Spectrum, and R2 storage. Therefore, it was important to build something that scales. </p><p>To improve API performance, we first needed to understand the complexities of the entire stack. At Cloudflare, we use <a href="https://www.jaegertracing.io/"><u>Jaeger tracing</u></a> to debug our systems. It gives us granular insights into a sample of requests that are coming into our APIs. When looking at API request latency, the <a href="https://www.jaegertracing.io/docs/1.23/architecture/#span"><u>span</u></a> that stood out was the time spent on each individual database lookup. The latency here can vary anywhere from ~1ms to ~5ms. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/61f3sKGUs9oWMPT9P4au6R/a91d8291b626f4bab3ac1c69adf62a5d/BLOG-2495_3.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3L3OaTb9cTKKKcIjCm1RLq/86ffd63116988025fd52105e316c5b5a/BLOG-2495_4.png" />
          </figure><p><sub><i>Jaeger trace showing variable database latency</i></sub></p><p>Given this variability in database query latency, we wanted to understand exactly what was going on within each DNS Records API request. When we first started on this journey, the breakdown of database lookups for each action was as follows:</p><table><tr><th><p><b>Action</b></p></th><th><p><b>Database Queries</b></p></th><th><p><b>Reason</b></p></th></tr><tr><td><p>POST</p></td><td><p>2 </p></td><td><p>One to write and one to read the new record.</p></td></tr><tr><td><p>PUT</p></td><td><p>3</p></td><td><p>One to collect, one to write, and one to read back the new record.</p></td></tr><tr><td><p>PATCH</p></td><td><p>3</p></td><td><p>One to collect, one to write, and one to read back the new record.</p></td></tr><tr><td><p>DELETE</p></td><td><p>2</p></td><td><p>One to read and one to delete.</p></td></tr></table><p>The reason we needed to read the newly created records on POST, PUT, and PATCH was because the record contains information filled in by the database which we cannot infer in the API. </p><p>Let’s imagine that a customer needed to edit 1,000 records. If each database lookup took 3ms to complete, that was 3ms * 3 lookups * 1,000 records = 9 seconds spent on database queries alone, not taking into account the round trip time to and from our API or any other processing latency. It’s clear that we needed to reduce the number of overall queries and ideally minimize per query latency variation. Let’s tackle the variation in latency first.</p><p>Each of these calls is not a simple INSERT, UPDATE, or DELETE, because we have functions wrapping these database calls for sanitization purposes. In order to understand the variable latency, we enlisted the help of <a href="https://www.postgresql.org/docs/current/auto-explain.html"><u>PostgreSQL’s “auto_explain”</u></a>. 
This module gives a breakdown of execution times for each statement without needing to EXPLAIN each one by hand. We used the following settings:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2myvmIREh2Q9yl30HbRus/29f085d40ba7dde34e9a46c27e3c6ba2/BLOG-2495_5.png" />
          </figure><p>A handful of queries showed durations like the one below, which took an order of magnitude longer than other queries.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/557xg66x8OiHM6pcAG4svk/56157cd0e5b6d7fd47f0152798598729/BLOG-2495_6.png" />
          </figure><p>We noticed that in several locations we were doing queries like:</p><p><code>IF (EXISTS (SELECT id FROM table WHERE row_hash = __new_row_hash))</code></p><p>If you are trying to insert into very large zones, such queries could mean even longer database query times, potentially explaining the discrepancy between 1ms and 5ms in our tracing images above. Upon further investigation, we found that we already had a unique index on that exact hash. <a href="https://www.postgresql.org/docs/current/indexes-unique.html"><u>Unique indexes</u></a> in PostgreSQL enforce the uniqueness of one or more column values, which means we can safely remove those existence checks without risk of inserting duplicate rows.</p><p>The next task was to introduce database batching into our DNS Records API. In any API, external calls such as SQL queries are going to add substantial latency to the request. Database batching allows the DNS Records API to execute multiple SQL queries within one single network call, subsequently lowering the number of database round trips our system needs to make. </p><p>According to the table above, each database write was also followed by a read once the query had completed. This was needed to collect information like creation/modification timestamps and new IDs. To improve this, we tweaked our database functions to now return the newly created DNS record itself, removing a full round trip to the database. 
Here is the updated table:</p><table><tr><th><p><b>Action</b></p></th><th><p><b>Database Queries</b></p></th><th><p><b>Reason</b></p></th></tr><tr><td><p>POST</p></td><td><p>1 </p></td><td><p>One to write</p></td></tr><tr><td><p>PUT</p></td><td><p>2</p></td><td><p>One to read, one to write.</p></td></tr><tr><td><p>PATCH</p></td><td><p>2</p></td><td><p>One to read, one to write.</p></td></tr><tr><td><p>DELETE</p></td><td><p>2</p></td><td><p>One to read, one to delete.</p></td></tr></table><p>We have room for improvement here, however we cannot easily reduce this further due to some restrictions around auditing and other sanitization logic.</p><p><b>Results:</b></p><table><tr><th><p><b>Action</b></p></th><th><p><b>Average database time before</b></p></th><th><p><b>Average database time after</b></p></th><th><p><b>Percentage Decrease</b></p></th></tr><tr><td><p>POST</p></td><td><p>3.38ms</p></td><td><p>0.967ms</p></td><td><p>71.4%</p></td></tr><tr><td><p>PUT</p></td><td><p>4.47ms</p></td><td><p>2.31ms</p></td><td><p>48.4%</p></td></tr><tr><td><p>PATCH</p></td><td><p>4.41ms</p></td><td><p>2.24ms</p></td><td><p>49.3%</p></td></tr><tr><td><p>DELETE</p></td><td><p>1.21ms</p></td><td><p>1.21ms</p></td><td><p>0%</p></td></tr></table><p>These are some pretty good improvements! Not only did we reduce the API latency, we also reduced the database query load, benefiting other systems as well.</p>
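<p>The back-of-the-envelope arithmetic from earlier can be checked against the per-action query counts in the two tables (3 ms per query is the same illustrative figure used above):</p>

```python
MS_PER_QUERY = 3  # illustrative per-query latency from the example above

# Database queries per action, before and after the improvements.
QUERIES_BEFORE = {"POST": 2, "PUT": 3, "PATCH": 3, "DELETE": 2}
QUERIES_AFTER  = {"POST": 1, "PUT": 2, "PATCH": 2, "DELETE": 2}

def db_time_seconds(action, n_records, counts):
    """Total time spent on database queries alone, ignoring API round trips."""
    return counts[action] * n_records * MS_PER_QUERY / 1000

# Editing 1,000 records with PATCH: 3 queries each before the change...
print(db_time_seconds("PATCH", 1000, QUERIES_BEFORE))  # 9.0 seconds
# ...and 2 queries each afterwards.
print(db_time_seconds("PATCH", 1000, QUERIES_AFTER))   # 6.0 seconds
```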
    <div>
      <h2>Weren’t we talking about batching?</h2>
      <a href="#werent-we-talking-about-batching">
        
      </a>
    </div>
    <p>I previously mentioned that the /batch endpoint is fully atomic, making use of a single database transaction. However, a single transaction may still require multiple database network calls, and from the table above, that can add up to a significant amount of time when dealing with large batches. To optimize this, we are making use of <a href="https://pkg.go.dev/github.com/jackc/pgx/v4#Batch"><u>pgx/batch</u></a>, a Golang object that allows us to write and subsequently read multiple queries in a single network call. Here is a high-level view of how the batch endpoint works:</p><ol><li><p>Collect all the records for the PUTs, PATCHes and DELETEs.</p></li><li><p>Apply any per record differences as requested by the PATCHes and PUTs.</p></li><li><p>Format the batch SQL query to include each of the actions.</p></li><li><p>Execute the batch SQL query in the database.</p></li><li><p>Parse each database response and return any errors if needed.</p></li><li><p>Audit each change.</p></li></ol><p>This takes at most two database calls per batch: one to fetch, and one to write/delete. If the batch contains only POSTs, this will be further reduced to a single database call. Given all of this, we should expect to see a significant improvement in latency when making multiple changes, which we do when observing how these various endpoints perform: </p><p><i>Note: Each of these queries was run from multiple locations around the world and the median of the response times are shown here. The server responding to queries is located in Portland, Oregon, United States. Latencies are subject to change depending on geographical location.</i></p><p><b>Create only:</b></p><table><tr><th><p>
</p></th><th><p><b>10 Records</b></p></th><th><p><b>100 Records</b></p></th><th><p><b>1,000 Records</b></p></th><th><p><b>10,000 Records</b></p></th></tr><tr><td><p><b>Regular API</b></p></td><td><p>7.55s</p></td><td><p>74.23s</p></td><td><p>757.32s</p></td><td><p>7,877.14s</p></td></tr><tr><td><p><b>Batch API - Without database batching</b></p></td><td><p>0.85s</p></td><td><p>1.47s</p></td><td><p>4.32s</p></td><td><p>16.58s</p></td></tr><tr><td><p><b>Batch API - with database batching</b></p></td><td><p>0.67s</p></td><td><p>1.21s</p></td><td><p>3.09s</p></td><td><p>10.33s</p></td></tr></table><p><b>Delete only:</b></p><table><tr><th><p>
</p></th><th><p><b>10 Records</b></p></th><th><p><b>100 Records</b></p></th><th><p><b>1,000 Records</b></p></th><th><p><b>10,000 Records</b></p></th></tr><tr><td><p><b>Regular API</b></p></td><td><p>7.28s</p></td><td><p>67.35s</p></td><td><p>658.11s</p></td><td><p>7,471.30s</p></td></tr><tr><td><p><b>Batch API - without database batching</b></p></td><td><p>0.79s</p></td><td><p>1.32s</p></td><td><p>3.18s</p></td><td><p>17.49s</p></td></tr><tr><td><p><b>Batch API - with database batching</b></p></td><td><p>0.66s</p></td><td><p>0.78s</p></td><td><p>1.68s</p></td><td><p>7.73s</p></td></tr></table><p><b>Create/Update/Delete:</b></p><table><tr><th><p>
</p></th><th><p><b>10 Records</b></p></th><th><p><b>100 Records</b></p></th><th><p><b>1,000 Records</b></p></th><th><p><b>10,000 Records</b></p></th></tr><tr><td><p><b>Regular API</b></p></td><td><p>7.11s</p></td><td><p>72.41s</p></td><td><p>715.36s</p></td><td><p>7,298.17s</p></td></tr><tr><td><p><b>Batch API - without database batching</b></p></td><td><p>0.79s</p></td><td><p>1.36s</p></td><td><p>3.05s</p></td><td><p>18.27s</p></td></tr><tr><td><p><b>Batch API - with database batching</b></p></td><td><p>0.74s</p></td><td><p>1.06s</p></td><td><p>2.17s</p></td><td><p>8.48s</p></td></tr></table><p><b>Overall Average:</b></p><table><tr><th><p>
</p></th><th><p><b>10 Records</b></p></th><th><p><b>100 Records</b></p></th><th><p><b>1,000 Records</b></p></th><th><p><b>10,000 Records</b></p></th></tr><tr><td><p><b>Regular API</b></p></td><td><p>7.31s</p></td><td><p>71.33s</p></td><td><p>710.26s</p></td><td><p>7,548.87s</p></td></tr><tr><td><p><b>Batch API - without database batching</b></p></td><td><p>0.81s</p></td><td><p>1.38s</p></td><td><p>3.51s</p></td><td><p>17.44s</p></td></tr><tr><td><p><b>Batch API - with database batching</b></p></td><td><p>0.69s</p></td><td><p>1.02s</p></td><td><p>2.31s</p></td><td><p>8.85s</p></td></tr></table><p>We can see that on average, the new batching API is significantly faster than the regular API trying to do the same actions, and it’s also nearly twice as fast as the batching API without batched database calls. At 10,000 records, the batching API is a staggering 850x faster than the regular API. As mentioned above, these numbers are likely to change for a number of different reasons, but it’s clear that making several round trips to and from the API adds substantial latency, regardless of the region.</p>
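<p>The headline speed-up can be reproduced directly from the overall-average table:</p>

```python
# Overall-average response times (seconds) from the table above.
regular_api = {10: 7.31, 100: 71.33, 1000: 710.26, 10000: 7548.87}
batch_api   = {10: 0.69, 100: 1.02, 1000: 2.31, 10000: 8.85}

# Speed-up of the batch endpoint (with database batching) per batch size.
speedup = {n: regular_api[n] / batch_api[n] for n in regular_api}
for n, s in sorted(speedup.items()):
    print(f"{n:>6} records: {s:,.0f}x faster")
# At 10,000 records the batch endpoint comes out roughly 850x faster.
```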
    <div>
      <h2>Batch overload</h2>
      <a href="#batch-overload">
        
      </a>
    </div>
    <p>Making our API faster is awesome, but we don’t operate in an isolated environment. Each of these records needs to be processed and pushed to <a href="https://blog.cloudflare.com/introducing-quicksilver-configuration-distribution-at-internet-scale"><u>Quicksilver</u></a>, our distributed database. If we have customers creating tens of thousands of records every 10 seconds, we need to be able to handle this downstream so that we don’t overwhelm our system. In a May 2022 blog post titled <a href="https://blog.cloudflare.com/dns-build-improvement"><i><u>How we improved DNS record build speed by more than 4,000x</u></i></a>, I noted that:</p><blockquote><p><i>We plan to introduce a batching system that will collect record changes into groups to minimize the number of queries we make to our database and Quicksilver.</i></p></blockquote><p>This task has since been completed, and our propagation pipeline is now able to batch thousands of record changes into a single database query which can then be published to Quicksilver in order to be propagated to our global network. </p>
    <div>
      <h2>Next steps</h2>
      <a href="#next-steps">
        
      </a>
    </div>
    <p>We have a few more improvements planned for the API. We also intend to improve the dashboard UI to make managing zones easier. <a href="https://research.rallyuxr.com/cloudflare/lp/cm0zu2ma7017j1al98l1m8a7n?channel=share&amp;studyId=cm0zu2ma4017h1al9byak79iw"><u>We would love to hear your feedback</u></a>, so please let us know what you think and if you have any suggestions for improvements.</p><p>For more details on how to use the new /batch API endpoint, head over to our <a href="https://developers.cloudflare.com/dns/manage-dns-records/how-to/batch-record-changes/"><u>developer documentation</u></a> and <a href="https://developers.cloudflare.com/api/operations/dns-records-for-a-zone-batch-dns-records"><u>API reference</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Birthday Week]]></category>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[Database]]></category>
            <guid isPermaLink="false">op0CI3wllMcGjptdRb2Ce</guid>
            <dc:creator>Alex Fattouche</dc:creator>
        </item>
        <item>
            <title><![CDATA[Cloudflare Workers database integration with Upstash]]></title>
            <link>https://blog.cloudflare.com/cloudflare-workers-database-integration-with-upstash/</link>
            <pubDate>Wed, 02 Aug 2023 13:00:22 GMT</pubDate>
            <description><![CDATA[ Announcing the new Upstash database integrations for Workers. Now it is easier to use Upstash Redis, Kafka and QStash inside your Worker  ]]></description>
            <content:encoded><![CDATA[ <p><i>This blog post references a feature which has updated documentation. For the latest reference content, visit </i><a href="https://developers.cloudflare.com/workers/databases/third-party-integrations/"><i>https://developers.cloudflare.com/workers/databases/third-party-integrations/</i></a></p><p>During <a href="https://www.cloudflare.com/developer-week/">Developer Week</a> we announced <a href="/announcing-database-integrations/">Database Integrations on Workers</a>, a new and seamless way to connect with some of the most popular databases. You select the provider, authorize through an OAuth2 flow and automatically get the right configuration stored as encrypted environment variables on your Worker.</p><p>Today we are thrilled to announce that we have been working with Upstash to expand our integrations catalog. We are now offering three new integrations: Upstash Redis, Upstash Kafka and Upstash QStash. These integrations allow our customers to unlock new capabilities on Workers, providing them with a broader range of options to meet their specific requirements.</p>
    <div>
      <h3>Add the integration</h3>
      <a href="#add-the-integration">
        
      </a>
    </div>
    <p>We are going to show the setup process using the Upstash Redis integration.</p><p>Select your Worker, go to the Settings tab, then select the Integrations tab to see all the available integrations.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4PgG63i9pFA5GtOuhGAAeE/5580ef72388faa48bb274d81edfd16ba/2.png" />
            
            </figure><p>After selecting the Upstash Redis integration we will get the following page.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4oL9KEz7NUDqw16aXrk2g0/2708bd58089fa1e8abc503bfd7074649/3.png" />
            
            </figure><p>First, review and grant permissions so the integration can add secrets to your Worker. Second, connect to Upstash using the OAuth2 flow. Third, select the Redis database you want to use. The integration will then fetch the information it needs to generate the credentials. Finally, click “Add Integration” and it's done! You can now use the credentials as environment variables in your Worker.</p>
    <div>
      <h3>Implementation example</h3>
      <a href="#implementation-example">
        
      </a>
    </div>
    <p>On this occasion we are going to use the <a href="https://developers.cloudflare.com/fundamentals/get-started/reference/http-request-headers/#cf-ipcountry">CF-IPCountry</a> header to conditionally return a custom greeting message to visitors from Paraguay, the United States, Great Britain, and the Netherlands, while returning a generic message to visitors from other countries.</p><p>To begin, we are going to load the custom greeting messages using Upstash’s online CLI tool.</p>
            <pre><code>➜ set PY "Mba'ẽichapa 🇵🇾"
OK
➜ set US "How are you? 🇺🇸"
OK
➜ set GB "How do you do? 🇬🇧"
OK
➜ set NL "Hoe gaat het met u? 🇳🇱"
OK</code></pre>
            <p>We also need to install the <code>@upstash/redis</code> package on our Worker before we upload the following code.</p>
            <pre><code>import { Redis } from '@upstash/redis/cloudflare'
 
export default {
  async fetch(request, env, ctx) {
    const country = request.headers.get("cf-ipcountry");
    const redis = Redis.fromEnv(env);
    if (country) {
      const localizedMessage = await redis.get(country);
      if (localizedMessage) {
        return new Response(localizedMessage);
      }
    }
    return new Response("👋👋 Hello there! 👋👋");
  },
};</code></pre>
            <p>Just like that, we are returning a localized message from the Redis instance depending on the country from which the request originated. Furthermore, we have a couple of ways to improve performance: for write-heavy use cases we can use <a href="/announcing-workers-smart-placement/">Smart Placement</a> with no replicas, so the Worker code will be executed near the Redis instance provided by Upstash. Otherwise, creating a <a href="https://docs.upstash.com/redis/features/globaldatabase">Global Database</a> on Upstash with multiple read replicas across regions will help.</p>
    <div>
      <h3><a href="https://developers.cloudflare.com/workers/databases/native-integrations/upstash/">Try it now</a></h3>
      <a href="#">
        
      </a>
    </div>
    <p>Upstash Redis, Kafka and QStash are now available for all users! Stay tuned for more updates as we continue to expand our Database Integrations catalog.</p> ]]></content:encoded>
            <category><![CDATA[Cloudflare Workers]]></category>
            <category><![CDATA[Developers]]></category>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[Database]]></category>
            <category><![CDATA[Internship Experience]]></category>
            <category><![CDATA[Developer Platform]]></category>
            <guid isPermaLink="false">6PIdVuhR9PDMgFblDoqqfc</guid>
            <dc:creator>Joaquin Gimenez</dc:creator>
            <dc:creator>Shaun Persad</dc:creator>
        </item>
        <item>
            <title><![CDATA[Intelligent, automatic restarts for unhealthy Kafka consumers]]></title>
            <link>https://blog.cloudflare.com/intelligent-automatic-restarts-for-unhealthy-kafka-consumers/</link>
            <pubDate>Tue, 24 Jan 2023 14:00:00 GMT</pubDate>
            <description><![CDATA[ At Cloudflare, we take steps to ensure we are resilient against failure at all levels of our infrastructure. This includes Kafka, which we use for critical workflows such as sending time-sensitive emails and alerts. ]]></description>
            <content:encoded><![CDATA[
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7eWbGD5pEX9bKf2p58iqOw/b55ba4bfd305da7ed38cf66fe770585c/image3-8-2.png" />
            
            </figure><p>At Cloudflare, we take steps to ensure we are resilient against failure at all levels of our infrastructure. This includes Kafka, which we use for critical workflows such as sending time-sensitive emails and alerts.</p><p>We have learned a lot about keeping the applications that leverage Kafka healthy, so they can always be operational. Application health checks are notoriously hard to implement: what makes an application healthy? How can we keep services operational at all times?</p><p>Health checks can be implemented in many ways. We’ll talk about an approach that allows us to considerably reduce incidents with unhealthy applications while requiring less manual intervention.</p>
    <div>
      <h3>Kafka at Cloudflare</h3>
      <a href="#kafka-at-cloudflare">
        
      </a>
    </div>
    <p><a href="/using-apache-kafka-to-process-1-trillion-messages/">Cloudflare is a big adopter of Kafka</a>. We use Kafka as a way to decouple services due to its asynchronous nature and reliability. It allows different teams to work effectively without creating dependencies on one another. You can also read more about how other teams at Cloudflare use Kafka in <a href="/http-analytics-for-6m-requests-per-second-using-clickhouse/">this</a> post.</p><p>Kafka is used to send and receive messages. Messages represent some kind of event, like a credit card payment or details of a new user created in your platform. These messages can be represented in multiple ways: JSON, Protobuf, Avro and so on.</p><p>Kafka organises messages in topics. A topic is an ordered log of events in which each message is marked with a progressive offset. When an event is written by an external system, it is appended to the end of that topic. These events are not deleted from the topic by default (retention can be applied).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2KUYbqCCL74YZVU8NXOThl/4ec5024168993a2300add7221016af0d/1-4.png" />
            
            </figure><p>Topics are stored as log files on disk, which are finite in size. Partitions are a systematic way of breaking the one topic log file into many logs, each of which can be hosted on a separate server, enabling topics to scale.</p><p>Topics are managed by brokers, the nodes in a Kafka cluster. These are responsible for writing new events to partitions, serving reads and replicating partitions among themselves.</p><p>Messages can be consumed by individual consumers or co-ordinated groups of consumers, known as consumer groups.</p><p>Consumers use a unique id (consumer id) that allows them to be identified by the broker as an application which is consuming from a specific topic.</p><p>Each topic can be read by any number of different consumers, as long as they use a different id. Each consumer can replay the same messages as many times as they want.</p><p>When a consumer starts consuming from a topic, it will process all messages, starting from a selected offset, from each partition. With a consumer group, the partitions are divided amongst each consumer in the group. This division is determined by the consumer group leader. This leader will receive information about the other consumers in the group and will decide which consumers will receive messages from which partitions (partition strategy).</p>
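<p>The partition division described above, where the group leader assigns partitions to consumers, can be sketched as a simple round-robin assignment (real Kafka clients support several pluggable strategies; this is only an illustration):</p>

```python
def assign_partitions(partitions, consumers):
    """Round-robin partitions across the members of a consumer group,
    as a group leader might when computing a partition strategy."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Six partitions shared by two consumers in the same group:
print(assign_partitions(range(6), ["consumer-a", "consumer-b"]))
```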
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Qe2Qe5nQ5gcHyhV0zpTWw/5182eea9de66164a36a28e92270fdb3f/2-3.png" />
            
            </figure><p>The offset of a consumer’s commit can demonstrate whether the consumer is working as expected. Committing a processed offset is the way a consumer and its consumer group report to the broker that they have processed a particular message.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/29Y9mQiHkvGKUzc3RGF1sk/09d2987f53eef026c164e6c49cacc95c/unnamed-6.png" />
            
            </figure><p>A standard measurement of whether a consumer is processing fast enough is lag. We use this to measure how far behind the newest message we are. This tracks time elapsed between messages being written to and read from a topic. When a service is lagging behind, it means that the consumption is at a slower rate than new messages being produced.</p><p>Due to Cloudflare’s scale, message rates typically end up being very large and a lot of requests are time-sensitive so monitoring this is vital.</p><p>At Cloudflare, our applications using Kafka are deployed as microservices on Kubernetes.</p>
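<p>Lag, as described above, is essentially the distance between the newest offset written to a partition and the offset the consumer group last committed. A minimal sketch:</p>

```python
def consumer_lag(log_end_offsets, committed_offsets):
    """Per-partition and total lag: how many messages behind the
    newest message the consumer group is on each partition."""
    per_partition = {
        p: log_end_offsets[p] - committed_offsets.get(p, 0)
        for p in log_end_offsets
    }
    return per_partition, sum(per_partition.values())

# Newest offsets written by producers vs. offsets the group has committed:
per_p, total = consumer_lag({0: 1200, 1: 980}, {0: 1200, 1: 640})
print(per_p, total)  # partition 1 is 340 messages behind
```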
    <div>
      <h3>Health checks for Kubernetes apps</h3>
      <a href="#health-checks-for-kubernetes-apps">
        
      </a>
    </div>
    <p>Kubernetes uses <a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/">probes</a> to understand whether a service is healthy and is ready to receive traffic or to run. When a liveness probe fails and the bounds for retrying are exceeded, Kubernetes restarts the service.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4FagbTygES9L7dmEQ6ratD/0a6f0d4c5ac117b723ad726a12d3936a/4-3.png" />
            
</figure><p>When a readiness probe fails and the bounds for retrying are exceeded, Kubernetes stops sending HTTP traffic to the targeted pods. For Kafka applications this is not relevant, as they don’t run an HTTP server, so we’ll cover only liveness checks.</p><p>A classic Kafka liveness check on a consumer verifies the status of the connection with the broker. It’s often best practice to keep these checks simple and perform some basic operation, in this case something like listing topics. If this check fails consistently, for instance because the broker returns a TLS error, Kubernetes terminates the service and starts a new pod of the same service, forcing a new connection. Simple Kafka liveness checks do a good job of detecting when the connection with the broker is unhealthy.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6gNWb3Rit0MmTutsurm7sf/70355c422fab7ebce7d59d8c2c682d6d/5-2.png" />
            
            </figure>
    <div>
      <h3>Problems with Kafka health checks</h3>
      <a href="#problems-with-kafka-health-checks">
        
      </a>
    </div>
    <p>Due to Cloudflare’s scale, a lot of our Kafka topics are divided into multiple partitions (in some cases this can be hundreds!), and in many cases the replica count of our consuming service doesn’t match the number of partitions on the Kafka topic. In many scenarios, this simple approach to health checking is therefore not quite enough.</p><p>Microservices that consume from Kafka topics are healthy if they are consuming and committing offsets at regular intervals while messages are being published to a topic. When such a service stops committing offsets as expected, the consumer is in a bad state and will start accumulating lag. An approach we often take is to manually terminate and restart the service in Kubernetes, which causes a reconnection and a rebalance.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/N4YalYdgNRxYJK7PVAlzY/26b55fc38c53855a6c28c71b25cdac02/lag.png" />
            
</figure><p>When a consumer joins or leaves a consumer group, a rebalance is triggered and the consumer group leader must re-assign which consumers will read from which partitions.</p><p>When a rebalance happens, each consumer is notified to stop consuming, and some consumers may have their assigned partitions taken away and re-assigned to another consumer. We noticed that in our library implementation, if the consumer doesn’t acknowledge this command, it will wait indefinitely for new messages from a partition that it’s no longer assigned to, ultimately leading to a deadlock. Usually a manual restart of the faulty client-side app is needed to resume processing.</p>
    <div>
      <h3>Intelligent health checks</h3>
      <a href="#intelligent-health-checks">
        
      </a>
    </div>
    <p>As we were seeing consumers reporting as “healthy” but sitting idle, it occurred to us that maybe we were focusing on the wrong thing in our health checks. Just because the service is connected to the Kafka broker and can read from the topic, it does not mean the consumer is actively processing messages.</p><p>Therefore, we realised we should be focused on message ingestion, using the offset values to ensure that forward progress was being made.</p>
    <div>
      <h4>The PagerDuty approach</h4>
      <a href="#the-pagerduty-approach">
        
      </a>
    </div>
    <p>PagerDuty wrote an excellent <a href="https://www.pagerduty.com/eng/kafka-health-checks/">blog</a> on this topic which we used as inspiration when coming up with our approach.</p><p>Their approach used the current (latest) offset and the committed offset values. The current offset signifies the last message that was sent to the topic, while the committed offset is the last message that was processed by the consumer.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2fwem7NtBnO6M1RMhrezr8/af4cbbd7a63d3145f5c7fe9f405bd04d/pasted-image-0-4.png" />
            
</figure><p>We check that the consumer is moving forward by ensuring that the latest offset is changing (new messages are being received) and that the committed offset is changing as well (the new messages are being processed).</p><p>The solution we came up with:</p><ul><li><p>If we cannot read the current offset, fail the liveness probe.</p></li><li><p>If we cannot read the committed offset, fail the liveness probe.</p></li><li><p>If the committed offset == the current offset, pass the liveness probe.</p></li><li><p>If the value of the committed offset has not changed since the last run of the health check, fail the liveness probe.</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5r76n2Iew7pSqA8vYNZzIy/c9e0f6a113a34d0c36a216c054e4d840/pasted-image-0--1--3.png" />
            
</figure><p>To measure whether the committed offset is changing, we need to store its value from the previous run. We do this using an in-memory map keyed by partition number. This means each instance of our service only has a view of the partitions it is currently consuming from, and runs the health check for each.</p>
    <div>
      <h4>Problems</h4>
      <a href="#problems">
        
      </a>
    </div>
    <p>When we first rolled out our smart health checks, we started to notice cascading failures some time after release. Initial investigations revealed this happened during rebalances: it would initially affect one replica, then quickly cause the others to report as unhealthy.</p><p>Because we stored the previous committed offset in memory, when a rebalance happened the service could be re-assigned a different partition. When this occurred, our service incorrectly assumed that the committed offset for that partition had not changed (as this specific replica was no longer updating the latest value), and so it started reporting itself as unhealthy. The failing liveness probe would then cause a restart, which would in turn trigger another rebalance in Kafka, causing other replicas to face the same issue.</p>
    <div>
      <h4>Solution</h4>
      <a href="#solution">
        
      </a>
    </div>
    <p>To fix this issue we needed to ensure that each replica only kept track of the offsets for the partitions it was consuming from at that moment. Luckily, the Shopify Sarama library, which we use internally, has functionality to observe when a rebalancing happens. This meant we could use it to rebuild the in-memory map of offsets so that it would only include the relevant partition values.</p><p>This is handled by receiving the signal from the session context channel:</p>
            <pre><code>for {
  select {
  case message, ok := &lt;-claim.Messages(): // &lt;-- Message received
     if !ok {
        // Messages channel closed (e.g. session ending): stop consuming this claim
        return nil
     }

     // Store latest received offset in-memory
     offsetMap[message.Partition] = message.Offset

     // Handle message
     handleMessage(ctx, message)

     // Commit message offset
     session.MarkMessage(message, "")

  case &lt;-session.Context().Done(): // &lt;-- Rebalance happened

     // Remove rebalanced partition from in-memory map
     delete(offsetMap, claim.Partition())
     return nil
  }
}</code></pre>
            <p>Verifying this solution was straightforward: we just needed to trigger a rebalance. To test all possible scenarios, we spun up a single replica of a service consuming from multiple partitions, scaled up the number of replicas until it matched the partition count, then scaled back down to a single replica. By doing this we verified that the health checks could safely handle new partitions being assigned as well as partitions being taken away.</p>
    <div>
      <h3>Takeaways</h3>
      <a href="#takeaways">
        
      </a>
    </div>
    <p>Probes in Kubernetes are very easy to set up and can be a powerful tool to ensure your application is running as expected. Well-implemented probes can often be the difference between engineers being called out to fix trivial issues (sometimes outside of working hours) and a service that is self-healing.</p><p>However, without proper thought, “dumb” health checks can also lead to a false sense of security that a service is running as expected even when it’s not. One thing we have learnt from this is to think more about the specific behaviour of the service and decide what being unhealthy means in each instance, instead of just ensuring that dependent services are connected.</p>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[Observability]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Kubernetes]]></category>
            <guid isPermaLink="false">7s1ijlG7zMlxJPI6Hcs3zl</guid>
            <dc:creator>Chris Shepherd</dc:creator>
            <dc:creator>Andrea Medda</dc:creator>
        </item>
        <item>
            <title><![CDATA[Using Apache Kafka to process 1 trillion inter-service messages]]></title>
            <link>https://blog.cloudflare.com/using-apache-kafka-to-process-1-trillion-messages/</link>
            <pubDate>Tue, 19 Jul 2022 13:00:00 GMT</pubDate>
            <description><![CDATA[ We learnt a lot about Kafka on the way to 1 trillion messages, and built some interesting internal tools to ease adoption that will be explored in this blog post ]]></description>
            <content:encoded><![CDATA[ 
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3RvfaKNjkQcCDhobJWvu7o/e6fa0460c33b0250ceb61458d2d4bd8d/image3-8.png" />
            
            </figure><p>Cloudflare has been using Kafka in production since 2014. We have come a long way since then, and currently run 14 distinct Kafka clusters, across multiple data centers, with roughly 330 nodes. Between them, over a trillion messages have been processed over the last eight years.</p><p>Cloudflare uses Kafka to decouple microservices and communicate the creation, change or deletion of various resources via a common data format in a fault-tolerant manner. This decoupling is one of many factors that enables Cloudflare engineering teams to work on multiple features and products concurrently.</p><p>We learnt a lot about Kafka on the way to one trillion messages, and built some interesting internal tools to ease adoption that will be explored in this blog post. The focus in this blog post is on inter-application communication use cases alone and not logging (we have other Kafka clusters that power the dashboards where customers view statistics that handle more than one trillion messages <i>each day</i>). I am an engineer on the <a href="https://www.cloudflare.com/application-services/">Application Services</a> team and our team has a charter to provide tools/services to product teams, so they can focus on their core competency which is delivering value to our customers.</p><p>In this blog I’d like to recount some of our experiences in the hope that it helps other engineering teams who are on a similar journey of adopting Kafka widely.</p>
    <div>
      <h3>Tooling</h3>
      <a href="#tooling">
        
      </a>
    </div>
    <p>One of our Kafka clusters is creatively named Messagebus. It is the most general-purpose cluster we run, and was created to:</p><ul><li><p>Prevent data silos;</p></li><li><p>Enable services to communicate more clearly with basically zero integration cost (more on how we achieved this below);</p></li><li><p>Encourage the use of a self-documenting communication format, thereby removing the problem of out-of-date documentation.</p></li></ul><p>To make it as easy to use as possible and to encourage adoption, the Application Services team created two internal projects. The first is unimaginatively named Messagebus-Client. Messagebus-Client is a Go library that wraps the fantastic <a href="https://github.com/Shopify/sarama">Shopify Sarama</a> library with an opinionated set of configuration options and the ability to manage the rotation of mTLS certificates.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1oHafIQiSG7GPT5Vy4NK2o/fccef1fe85ff5d635975d7397a5d6299/unnamed1-2.png" />
            
</figure><p>The success of this project is also somewhat its downfall. By providing a ready-to-go Kafka client, we ensured teams got up and running quickly, but we also abstracted some core concepts of Kafka a little too much, meaning that small, unassuming configuration changes could have a big impact.</p><p>One such example led to partition skew (a large portion of messages being directed towards a single partition, meaning we were not processing messages in real time; see the chart below). One drawback of Kafka is that, within a consumer group, you can only have one consumer per partition, so when incidents do occur, you can’t trivially scale your way to faster throughput.</p><p>That also means that before your service hits production, it is wise to do some back-of-the-napkin math to figure out what throughput might look like; otherwise, you will need to add partitions later. We have since amended our library to make events like the one below less likely.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ZGNKq4vOi5vzkHe5FWqsV/f36dac854e7c8e67b5f66a75f16ddeda/image2-14.png" />
            
            </figure><p>The reception for the Messagebus-Client has been largely positive. We spent time as a team to understand what the predominant use cases were, and took the concept one step further to build out what we call the connector framework.</p>
    <div>
      <h3>Connectors</h3>
      <a href="#connectors">
        
      </a>
    </div>
    <p>The connector framework is based on Kafka-connectors and allows our engineers to easily spin up a service that can read from a system of record and push it somewhere else (such as Kafka, or even Cloudflare’s own <a href="/introducing-quicksilver-configuration-distribution-at-internet-scale/">Quicksilver</a>). To make this as easy as possible, we use Cookiecutter templating to allow engineers to enter a few parameters into a CLI and in return receive a ready to deploy service.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2zcySLWw14zCys58e2yJfn/5f0c05b47e424f5a9f3448fca01be0a1/unnamed2-3.png" />
            
            </figure><p>We provide the ability to configure data pipelines via environment variables. For simple use cases, we provide the functionality out of the box. However, extending the readers, writers and transformations is as simple as satisfying an interface and “registering” the new entry.</p><p>For example, adding the environment variables:</p>
            <pre><code>READER=kafka
TRANSFORMATIONS=topic_router:topic1,topic2|pf_edge
WRITER=quicksilver</code></pre>
            <p>will:</p><ul><li><p>Read messages from Kafka topic “topic1” and “topic2”;</p></li><li><p>Transform the message using a transformation function called “pf_edge” which maps the request from a Kafka protobuf to a Quicksilver request;</p></li><li><p>Write the result to Quicksilver.</p></li></ul><p>Connectors come readily baked with basic metrics and alerts, so teams know they can move to production quickly but with confidence.</p><p>Below is a diagram of how one team used our connector framework to read from the Messagebus cluster and write to various other systems. This is orchestrated by a system the Application Service team runs called Communication Preferences Service (CPS). Whenever a user opts in/out of marketing emails or changes their language preferences on cloudflare.com, they are calling CPS which ensures those settings are reflected in all the relevant systems.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/73VDgC81vyvClHrzGhC3ks/51f86d58fd8b9477c33a6d2808663d62/unnamed3-2.png" />
            
            </figure>
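<p>As a rough sketch of what “satisfying an interface and registering the new entry” could look like, consider the following. The interface and registry names are illustrative assumptions, not the connector framework's actual API:</p>

```go
package main

import (
	"fmt"
	"strings"
)

// Message is a simplified record flowing through a connector pipeline.
type Message struct{ Body string }

// Transformation rewrites a message on its way from a reader to a writer.
type Transformation interface {
	Transform(Message) Message
}

// registry maps a name (as it would appear in the TRANSFORMATIONS
// environment variable) to an implementation, so pipelines can be
// assembled purely from configuration.
var registry = map[string]Transformation{}

func register(name string, t Transformation) { registry[name] = t }

// upcase is a toy transformation standing in for something like pf_edge.
type upcase struct{}

func (upcase) Transform(m Message) Message {
	return Message{Body: strings.ToUpper(m.Body)}
}

func main() {
	register("upcase", upcase{})
	out := registry["upcase"].Transform(Message{Body: "hello"})
	fmt.Println(out.Body)
}
```

<p>The appeal of this shape is that adding a new reader, writer, or transformation is a local change: implement the interface, register it under a name, and it becomes available to every configured pipeline.</p>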
    <div>
      <h3>Strict Schemas</h3>
      <a href="#strict-schemas">
        
      </a>
    </div>
    <p>Alongside the Messagebus-Client library, we also provide a repo called Messagebus Schema. This is a schema registry for all message types that will be sent over our Messagebus cluster. For message format, we use protobuf and have been very happy with that decision. Previously, our team had used JSON for some of our Kafka schemas, but we found it much harder to enforce forward and backwards compatibility, and message sizes were substantially larger than the protobuf equivalent. Protobuf provides strict message schemas (including type safety), the forward and backwards compatibility we desired, the ability to generate code in multiple languages, and files that are very human-readable.</p><p>We encourage heavy commentary before approving a merge. Once merged, we use prototool to do breaking change detection, enforce some stylistic rules, and generate code for various languages (at time of writing it's just Go and Rust, but it is trivial to add more).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4gukMjNyXSWw57NnycOs7U/7137d7d445fd0e0980406cce99b7cb80/image6-8.png" />
            
            </figure><p><i>An example Protobuf message in our schema</i></p><p>Furthermore, in Messagebus Schema we store a mapping of proto messages to a team, alongside that team’s chat room in our internal communication tool. This allows us to escalate issues to the correct team easily when necessary.</p><p>One important decision we made for the Messagebus cluster is to only allow one proto message per topic. This is configured in Messagebus Schema and enforced by the Messagebus-Client. This was a good decision to enable easy adoption, but it has led to numerous topics existing. When you consider that for each topic we create, we add numerous partitions and replicate them with a replication factor of at least three for resilience, there is a lot of potential to optimize compute for our lower throughput topics.</p>
    <div>
      <h3>Observability</h3>
      <a href="#observability">
        
      </a>
    </div>
    <p>Making it easy for teams to observe Kafka is essential for our decoupled engineering model to be successful. We have therefore automated metrics and alert creation wherever we can, to ensure that all engineering teams have a wealth of information available to respond to any issues that arise in a timely manner.</p><p>We use Salt to manage our infrastructure configuration and follow a GitOps-style model, where our repo holds the source of truth for the state of our infrastructure. To add a new Kafka topic, our engineers make a pull request into this repo and add a couple of lines of YAML. Upon merge, the topic and an alert for high lag (where lag is the difference between the last produced offset and the last committed offset) will be created. Other alerts can (and should) be created, but this is left to the discretion of application teams. We automatically generate alerts for high lag because this simple alert is a great proxy for catching a wide range of issues, including:</p><ul><li><p>Your consumer isn’t running.</p></li><li><p>Your consumer cannot keep up with the throughput, or an anomalous number of messages is being produced to your topic at this time.</p></li><li><p>Your consumer is misbehaving and not acknowledging messages.</p></li></ul><p>For metrics, we use Prometheus and display them with Grafana. For each new topic created, we automatically provide a view into production rate, consumption rate and partition skew by producer/consumer. If an engineering team is called out, the alert message includes a link to this Grafana view.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2IVbiHGj7DRPdVUXorpOEQ/1be04566221051b8ec499ae265f083fb/image7-cropped.png" />
            
</figure><p>In our Messagebus-Client, we expose some metrics automatically, and users get the ability to extend them further. The metrics we expose by default are:</p><p>For producers:</p><ul><li><p>Messages successfully delivered.</p></li><li><p>Messages failed to deliver.</p></li></ul><p>For consumers:</p><ul><li><p>Messages successfully consumed.</p></li><li><p>Message consumption errors.</p></li></ul><p>Some teams use these for alerting on a significant change in throughput; others use them to alert if no messages are produced/consumed in a given time frame.</p>
    <div>
      <h3>A Practical Example</h3>
      <a href="#a-practical-example">
        
      </a>
    </div>
    <p>As well as providing the Messagebus framework, the Application Services team looks for common concerns within Engineering and looks to solve them in a scalable, extensible way which means other engineering teams can utilize the system and not have to build their own (thus meaning we are not building lots of disparate systems that are only slightly different).</p><p>One example is the Alert Notification System (ANS). ANS is the backend service for the “Notifications” tab in the Cloudflare dashboard. You may have noticed over the past 12 months that new alert and policy types have been made available to customers very regularly. This is because we have made it very easy for other teams to do this. The approach is:</p><ul><li><p>Create a new entry into ANS’s configuration YAML (We use CUE lang to validate the configuration as part of our continuous integration process);</p></li><li><p>Import our Messagebus-Client into your code base;</p></li><li><p>Emit a message to our alert topic when an event of interest takes place.</p></li></ul><p>That’s it! The producer team now has a means for customers to configure granular alerting policies for their new alert that includes being able to dispatch them via Slack, Google Chat or a custom webhook, PagerDuty or email (by both API and dashboard). Retrying and dead letter messages are managed for them, and a whole host of metrics are made available, all by making some very small changes.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4csJmS7UJAwNtrWvV0WKwx/73d4fd1ff6528c14d3ec42b9237f00a4/unnamed4.png" />
            
            </figure>
    <div>
      <h3>What’s Next?</h3>
      <a href="#whats-next">
        
      </a>
    </div>
    <p>Usage of Kafka (and our Messagebus tools) is only going to increase at Cloudflare as we continue to grow, and as a team we are committed to making the tooling around Messagebus easy to use, customizable where necessary and (perhaps most importantly) easy to observe. We regularly take feedback from other engineers to help improve the Messagebus-Client (we are on the fifth version now) and are currently experimenting with abstracting the intricacies of Kafka away completely and allowing teams to use gRPC to stream messages to Kafka. Blog post on the success/failure of this to follow!</p><p>If you're interested in building scalable services and solving interesting technical problems, we are hiring engineers on our team in <a href="https://boards.greenhouse.io/cloudflare/jobs/3252504?gh_jid=3252504"><i>Austin</i></a><i>, and </i><a href="https://boards.greenhouse.io/cloudflare/jobs/3252504?gh_jid=3252504"><i>Remote US</i></a><i>.</i></p> ]]></content:encoded>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[Salt]]></category>
            <guid isPermaLink="false">227GWCIbrOPz01S0w39hHZ</guid>
            <dc:creator>Matt Boyle</dc:creator>
        </item>
        <item>
            <title><![CDATA[How we improved DNS record build speed by more than 4,000x]]></title>
            <link>https://blog.cloudflare.com/dns-build-improvement/</link>
            <pubDate>Wed, 25 May 2022 12:59:04 GMT</pubDate>
            <description><![CDATA[ How we redesigned our DNS pipeline to significantly improve DNS propagation speed across all zones. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>Since my previous blog about <a href="/secondary-dns-deep-dive/">Secondary DNS</a>, <a href="https://www.cloudflare.com/dns/">Cloudflare's DNS</a> traffic has more than doubled from 15.8 trillion DNS queries per month to 38.7 trillion. Our network now spans over 270 cities in over 100 countries, interconnecting with more than 10,000 networks globally. According to <a href="https://w3techs.com/technologies/overview/dns_server">w3 stats</a>, “Cloudflare is used as a DNS server provider by 15.3% of all the websites.” This means we have an enormous responsibility to serve <a href="https://www.cloudflare.com/learning/dns/what-is-dns/">DNS</a> in the fastest and most reliable way possible.</p><p>Although the response time we have on DNS queries is the most important performance metric, there is another metric that sometimes goes unnoticed. DNS Record Propagation time is how long it takes changes submitted to our API to be reflected in our DNS query responses. Every millisecond counts here as it allows customers to quickly change configuration, making their systems much more agile. Although our DNS propagation pipeline was already known to be very fast, we had identified several improvements that, if implemented, would massively improve performance. In this blog post I’ll explain how we managed to drastically improve our DNS record propagation speed, and the impact it has on our customers.</p>
    <div>
      <h3>How DNS records are propagated</h3>
      <a href="#how-dns-records-are-propagated">
        
      </a>
    </div>
    <p>Cloudflare uses a multi-stage pipeline that takes our customers’ DNS record changes and pushes them to our global network, so they are available all over the world.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/621S7OYC5i9VEHUMnpruti/322ede366828e0669c16443d861608f8/image3-37.png" />
            
            </figure><p>The steps shown in the diagram above are:</p><ol><li><p>Customer makes a change to a record via our DNS Records API (or UI).</p></li><li><p>The change is persisted to the database.</p></li><li><p>The database event triggers a Kafka message which is consumed by the Zone Builder.</p></li><li><p>The Zone Builder takes the message, collects the contents of the zone from the database and pushes it to Quicksilver, our distributed KV store.</p></li><li><p>Quicksilver then propagates this information to the network.</p></li></ol><p>Of course, this is a simplified version of what is happening. In reality, our API receives thousands of requests per second. All POST/PUT/PATCH/DELETE requests ultimately result in a DNS record change. Each of these changes needs to be actioned so that the information we show through our API and in the <a href="https://dash.cloudflare.com/?to=/:account/:zone/dns">Cloudflare dashboard</a> is eventually consistent with the information we use to respond to DNS queries.</p><p>Historically, one of the largest bottlenecks in the DNS propagation pipeline was the Zone Builder, shown in step 4 above. Responsible for collecting and organizing records to be written to our global network, our Zone Builder often ate up most of the propagation time, especially for larger zones. As we continue to scale, it is important for us to remove any bottlenecks that may exist in our systems, and this was clearly identified as one such bottleneck.</p>
    <div>
      <h3>Growing pains</h3>
      <a href="#growing-pains">
        
      </a>
    </div>
    <p>When the pipeline shown above was <a href="/how-we-made-our-dns-stack-3x-faster/">first announced</a>, the Zone Builder received somewhere between 5 and 10 DNS record changes per second. Although the Zone Builder at the time was a massive improvement on the previous system, it was not going to last long given the growth that Cloudflare was and still is experiencing. Fast-forward to today, we receive on average 250 DNS record changes per second, a staggering 25x growth from when the Zone Builder was first announced.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/MlSZs6szqObivtM8cC1rV/3b50129c5d7a2691e893242d1b683a0a/image4-30.png" />
            
</figure><p>The way that the Zone Builder was initially designed was quite simple. When a zone changed, the Zone Builder would grab all the records from the database for that zone and compare them with the records stored in Quicksilver. Any differences were fixed to maintain consistency between the database and Quicksilver.</p><p>This is known as a full build. Full builds work great because each DNS record change corresponds to one zone change event. This means that multiple events can be batched and subsequently dropped if needed. For example, if a user makes 10 changes to their zone, this will result in 10 events. Since the Zone Builder grabs all the records for the zone anyway, there is no need to build the zone 10 times. We just need to build it once after the final change has been submitted.</p><p>What happens if the zone contains one million records or 10 million records? This is a very real problem, because not only is Cloudflare scaling, but our customers are scaling with us. Today, our largest zone has millions of records. Although our database is optimized for performance, even one full build containing one million records took up to <b>35 seconds</b>, largely due to database query latency. In addition, when the Zone Builder compares the zone contents with the records stored in Quicksilver, we need to fetch all the records for the zone from Quicksilver, adding time. However, the impact doesn’t stop at the single customer: a slow full build also eats up resources from other services reading from the database and slows down the rate at which our Zone Builder can build other zones.</p>
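<p>The batching behaviour described above, where ten changes to one zone collapse into a single full build, can be sketched as follows. This is a toy illustration with hypothetical types, not the Zone Builder's actual code:</p>

```go
package main

import "fmt"

// coalesce collapses a stream of zone-change events into the set of
// zones that need a full build. Ten events for one zone result in a
// single build, because a full build reads every record anyway.
func coalesce(events []string) []string {
	seen := map[string]bool{}
	var zones []string
	for _, zone := range events {
		if !seen[zone] {
			seen[zone] = true
			zones = append(zones, zone)
		}
	}
	return zones
}

func main() {
	events := []string{"example.com", "example.com", "example.org", "example.com"}
	// Four change events, but only two zones actually need building.
	fmt.Println(coalesce(events))
}
```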
    <div>
      <h3>Per-record build: a new build type</h3>
      <a href="#per-record-build-a-new-build-type">
        
      </a>
    </div>
    <p>Many of you might already have the solution to this problem in your head:</p><p><i>Why doesn’t the Zone Builder just query the database for the record that has changed and propagate just the single record?</i></p><p>Of course this is the correct solution, and the one we eventually ended up at. However, the road to get there was not as simple as it might seem.</p><p>Firstly, our database uses a series of functions that, at zone touch time, create a PostgreSQL Queue (PGQ) event that ultimately gets turned into a Kafka event. Initially, we had no distinction for individual record events, which meant our Zone Builder had no idea what had actually changed until it queried the database.</p><p>Next, the Zone Builder is still responsible for DNS zone settings in addition to records. Some examples of DNS zone settings include custom nameserver control and DNSSEC control. As a result, our Zone Builder needed to be aware of specific build types to ensure that they don’t step on each other. Furthermore, per-record builds cannot be batched in the same way that zone builds can because each event needs to be actioned separately.</p><p>As a result, a brand new scheduling system needed to be written. Lastly, Quicksilver interaction needed to be re-written to account for the different types of schedulers. These issues can be broken down as follows:</p><ol><li><p>Create a new Kafka event pipeline for record changes that contain information about the changed record.</p></li><li><p>Separate the Zone Builder into a new type of scheduler that implements some defined scheduler interface.</p></li><li><p>Implement the per-record scheduler to read events one by one in the correct order.</p></li><li><p>Implement the new Quicksilver interface for the per-record scheduler.</p></li></ol><p>Below is a high level diagram of how the new Zone Builder looks internally with the new scheduler types.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/23Y8EYi1qubiNN16dVq62j/338ca37bd56d92d157eb09c016841f9c/image6-20.png" />
            
            </figure><p>It is critically important that we lock between these two schedulers because it would otherwise be possible for the full build scheduler to overwrite the per-record scheduler’s changes with stale data.</p><p>It is important to note that none of this per-record architecture would be possible without the use of Cloudflare’s <a href="/black-lies/">black lie approach</a> to negative answers with DNSSEC. Normally, in order to properly serve negative answers with DNSSEC, all the records within the zone must be canonically sorted. This is needed in order to maintain a list of references from the apex record through all the records in the zone. With this normal approach to negative answers, a single record that has been added to the zone requires collecting all records to determine its insertion point within this sorted list of names.</p>
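<p>As an illustrative sketch (not Cloudflare's actual implementation), the locking between the two schedulers can be modeled as a per-zone mutex that both build paths must hold before writing:</p>

```python
import threading
from collections import defaultdict

# Hypothetical sketch: one lock per zone serializes the full-build and
# per-record schedulers, so a slow full build can never overwrite a newer
# per-record write with stale data.
zone_locks = defaultdict(threading.Lock)  # one mutex per zone name
applied = []  # stand-in for writes to Quicksilver

def per_record_build(zone: str, record: str) -> None:
    # Propagate a single changed record while holding the zone's lock.
    with zone_locks[zone]:
        applied.append(("record", zone, record))

def full_build(zone: str) -> None:
    # Rebuild the entire zone; blocks until any per-record build finishes.
    with zone_locks[zone]:
        applied.append(("full", zone))

per_record_build("example.com", "www")
full_build("example.com")
```

<p>In a real system the lock would be distributed rather than in-process, but the invariant is the same: the two build types never write to the same zone concurrently.</p>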
    <div>
      <h3>Bugs</h3>
      <a href="#bugs">
        
      </a>
    </div>
    <p>I would love to be able to write a Cloudflare blog where everything went smoothly; however, that is never the case. Bugs happen, but we need to be ready to react to them and set ourselves up so that next time this specific bug cannot happen.</p><p>In this case, the major bug we discovered was related to the cleanup of old records in Quicksilver. With the full Zone Builder, we have the luxury of knowing exactly what records exist in both the database and in Quicksilver. This makes writing and cleaning up a fairly simple task.</p><p>When the per-record builds were introduced, record events such as creates, updates, and deletes all needed to be treated differently. Creates and deletes are fairly simple because you are either adding or removing a record from Quicksilver. Updates introduced an unforeseen issue due to the way that our PGQ was producing Kafka events. Record updates only contained the new record information, which meant that when the record name was changed, we had no way of knowing what to query for in Quicksilver in order to clean up the old record. This meant that any time a customer changed the name of a record in the DNS Records API, the old record would not be deleted. Ultimately, this was fixed by replacing those specific update events with both a creation and a deletion event so that the Zone Builder had the necessary information to clean up the stale records.</p><p>None of this is rocket surgery, but we spend engineering effort to continuously improve our software so that it grows with the scaling of Cloudflare. And it’s challenging to change such a fundamental low-level part of Cloudflare when millions of domains depend on us.</p>
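<p>A minimal sketch of that fix (the event field names here are assumptions for illustration, not our actual schema): each update is expanded into a deletion carrying the old record's name plus a creation carrying the new one, so the consumer always knows which stale key to remove.</p>

```python
# Hypothetical sketch: expand a record "update" into a delete of the old
# record followed by a create of the new one. A bare update event carried
# only the new name, leaving the old Quicksilver key orphaned on rename.
def split_update(old: dict, new: dict) -> list:
    return [
        {"op": "delete", "name": old["name"], "type": old["type"]},
        {"op": "create", "name": new["name"], "type": new["type"],
         "content": new["content"]},
    ]

events = split_update(
    {"name": "old.example.com", "type": "A"},
    {"name": "new.example.com", "type": "A", "content": "192.0.2.1"},
)
# The delete event carries the *old* name, which an update event lacked.
```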
    <div>
      <h3>Results</h3>
      <a href="#results">
        
      </a>
    </div>
    <p>Today, all DNS Records API record changes are treated as per-record builds by the Zone Builder. As I previously mentioned, we have not been able to get rid of full builds entirely; however, they now represent about 13% of total DNS builds. This 13% corresponds to changes made to DNS settings that require knowledge of the entire zone's contents.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1mmDhQTWMR03iIWHL5oRNB/e953a2be78239e765d8986ecfa7fdf47/image1-56.png" />
            
</figure><p>When we compare the two build types as shown below, we can see that per-record builds are on average <b>150x</b> faster than full builds. The build time below includes both database query time and Quicksilver write time.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/48Umk7PrqhCT7J1kBSjXCg/6adab8baa50d861e5852d841357f31eb/image2-51.png" />
            
            </figure><p>From there, our records are propagated to our global network through Quicksilver.</p><p>The 150x improvement above is with respect to averages, but what about that 4000x that I mentioned at the start? As you can imagine, as the size of the zone increases, the difference between full build time and per-record build time also increases. I used a test zone of one million records and ran several per-record builds, followed by several full builds. The results are shown in the table below:</p><table><tr><td><p><b>Build Type</b></p></td><td><p><b>Build Time (ms)</b></p></td></tr><tr><td><p>Per Record #1</p></td><td><p>6</p></td></tr><tr><td><p>Per Record #2</p></td><td><p>7</p></td></tr><tr><td><p>Per Record #3</p></td><td><p>6</p></td></tr><tr><td><p>Per Record #4</p></td><td><p>8</p></td></tr><tr><td><p>Per Record #5</p></td><td><p>6</p></td></tr><tr><td><p>Full #1</p></td><td><p>34032</p></td></tr><tr><td><p>Full #2</p></td><td><p>33953</p></td></tr><tr><td><p>Full #3</p></td><td><p>34271</p></td></tr><tr><td><p>Full #4</p></td><td><p>34121</p></td></tr><tr><td><p>Full #5</p></td><td><p>34093</p></td></tr></table><p>We can see that, given five per-record builds, the build time was no more than 8ms. When running a full build however, the build time lasted on average 34 seconds. That is a build time reduction of <b>4250x</b>!</p><p>Given the full build times for both average-sized zones and large zones, it is apparent that all Cloudflare customers are benefitting from this improved performance, and the benefits only improve as the size of the zone increases. In addition, our Zone Builder uses less database and Quicksilver resources meaning other Cloudflare systems are able to operate at increased capacity.</p>
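<p>The arithmetic behind that figure can be checked directly from the table above:</p>

```python
# Recomputing the speedup from the measurements in the table above.
per_record_ms = [6, 7, 6, 8, 6]
full_ms = [34032, 33953, 34271, 34121, 34093]

avg_full = sum(full_ms) / len(full_ms)  # 34094 ms, i.e. ~34 seconds
worst_per_record = max(per_record_ms)   # 8 ms
speedup = avg_full / worst_per_record   # ~4260x, in line with the ~4250x figure
```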
    <div>
      <h3>Next Steps</h3>
      <a href="#next-steps">
        
      </a>
    </div>
    <p>The results here have been very impactful, though we think that we can do even better. In the future, we plan to get rid of full builds altogether by replacing them with zone setting builds. Instead of fetching the zone settings in addition to all the records, the zone setting builder would just fetch the settings for the zone and propagate that to our global network via Quicksilver. Similar to the per-record builds, this is a difficult challenge due to the complexity of zone settings and the number of actors that touch it. Ultimately if this can be accomplished, we can officially retire the full builds and leave it as a reminder in our git history of the scale at which we have grown over the years.</p><p>In addition, we plan to introduce a batching system that will collect record changes into groups to minimize the number of queries we make to our database and Quicksilver.</p><p>Does solving these kinds of technical and operational challenges excite you? Cloudflare is always hiring for talented specialists and generalists within our <a href="https://www.cloudflare.com/careers/jobs/?department=Engineering&amp;location=default">Engineering</a> and <a href="https://www.cloudflare.com/careers">other teams</a>.</p> ]]></content:encoded>
            <category><![CDATA[DNS]]></category>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Product News]]></category>
            <guid isPermaLink="false">1TgZJPuWF9cbCw5YAo3emU</guid>
            <dc:creator>Alex Fattouche</dc:creator>
        </item>
        <item>
            <title><![CDATA[Getting to the Core: Benchmarking Cloudflare’s Latest Server Hardware]]></title>
            <link>https://blog.cloudflare.com/getting-to-the-core/</link>
            <pubDate>Fri, 20 Nov 2020 12:00:00 GMT</pubDate>
            <description><![CDATA[ A refresh of the hardware that Cloudflare uses to run analytics provided big efficiency improvements. ]]></description>
            <content:encoded><![CDATA[ <p>Maintaining a server fleet the size of Cloudflare’s is an operational challenge, to say the least. Anything we can do to lower complexity and improve efficiency has effects for our SRE (Site Reliability Engineer) and Data Center teams that can be felt throughout a server’s 4+ year lifespan.</p><p>At the Cloudflare Core, we process logs to analyze attacks and compute analytics. In 2020, our Core servers were in need of a refresh, so we decided to redesign the hardware to be more in line with our Gen X edge servers. We designed two major server variants for the core. The first is Core Compute 2020, an AMD-based server for analytics and general-purpose compute paired with solid-state storage drives. The second is Core Storage 2020, an Intel-based server with twelve spinning disks to run database workloads.</p>
    <div>
      <h2>Core Compute 2020</h2>
      <a href="#core-compute-2020">
        
      </a>
    </div>
    <p>Earlier this year, we blogged about our 10th generation edge servers or Gen X and the <a href="/technical-details-of-why-cloudflare-chose-amd-epyc-for-gen-x-servers/">improvements</a> they delivered to our edge in <a href="/an-epyc-trip-to-rome-amd-is-cloudflares-10th-generation-edge-server-cpu/">both</a> performance and <a href="/securing-memory-at-epyc-scale/">security</a>. The new Core Compute 2020 server leverages many of our learnings from the edge server. The Core Compute servers run a variety of workloads including Kubernetes, Kafka, and various smaller services.</p>
    <div>
      <h3>Configuration Changes (Kubernetes)</h3>
      <a href="#configuration-changes-kubernetes">
        
      </a>
    </div>
    <table><tr><td><p>
</p></td><td><p><b>Previous Generation Compute</b></p></td><td><p><b>Core Compute 2020</b></p></td></tr><tr><td><p>CPU</p></td><td><p>2 x Intel Xeon Gold 6262</p></td><td><p>1 x AMD EPYC 7642</p></td></tr><tr><td><p>Total Core / Thread Count</p></td><td><p>48C / 96T</p></td><td><p>48C / 96T</p></td></tr><tr><td><p>Base / Turbo Frequency</p></td><td><p>1.9 / 3.6 GHz</p></td><td><p>2.3 / 3.3 GHz</p></td></tr><tr><td><p>Memory</p></td><td><p>8 x 32GB DDR4-2666</p></td><td><p>8 x 32GB DDR4-2933</p></td></tr><tr><td><p>Storage</p></td><td><p>6 x 480GB SATA SSD</p></td><td><p>2 x 3.84TB NVMe SSD</p></td></tr><tr><td><p>Network</p></td><td><p>Mellanox CX4 Lx 2 x 25GbE</p></td><td><p>Mellanox CX4 Lx 2 x 25GbE</p></td></tr></table><p><b>Configuration Changes (Kafka)</b></p><table><tr><td><p>
</p></td><td><p><b>Previous Generation (Kafka)</b></p></td><td><p><b>Core Compute 2020</b></p></td></tr><tr><td><p>CPU</p></td><td><p>2 x Intel Xeon Silver 4116</p></td><td><p>1 x AMD EPYC 7642</p></td></tr><tr><td><p>Total Core / Thread Count</p></td><td><p>24C / 48T</p></td><td><p>48C / 96T</p></td></tr><tr><td><p>Base / Turbo Frequency</p></td><td><p>2.1 / 3.0 GHz</p></td><td><p>2.3 / 3.3 GHz</p></td></tr><tr><td><p>Memory</p></td><td><p>6 x 32GB DDR4-2400</p></td><td><p>8 x 32GB DDR4-2933</p></td></tr><tr><td><p>Storage</p></td><td><p>12 x 1.92TB SATA SSD</p></td><td><p>10 x 3.84TB NVMe SSD</p></td></tr><tr><td><p>Network</p></td><td><p>Mellanox CX4 Lx 2 x 25GbE</p></td><td><p>Mellanox CX4 Lx 2 x 25GbE</p></td></tr></table><p>Both previous generation servers were Intel-based platforms, with the Kubernetes server based on Xeon 6262 processors, and the Kafka server based on Xeon 4116 processors. One goal with these refreshed versions was to converge the configurations in order to simplify spare parts and firmware management across the fleet.</p><p>As the above tables show, the configurations have been converged with the only difference being the number of NVMe drives installed depending on the workload running on the host. In both cases we moved from a dual-socket configuration to a single-socket configuration, and the number of cores and threads per server either increased or stayed the same. In all cases, the base frequency of those cores was significantly improved. We also moved from SATA SSDs to NVMe SSDs.</p>
    <div>
      <h3>Core Compute 2020 Synthetic Benchmarking</h3>
      <a href="#core-compute-2020-synthetic-benchmarking">
        
      </a>
    </div>
    <p>The heaviest user of the SSDs was determined to be Kafka. The majority of the time Kafka is sequentially writing 2MB blocks to the disk. We created a simple FIO script with 75% sequential write and 25% sequential read, scaling the block size from a standard page table entry size of 4096B to Kafka’s write size of 2MB. The results aligned with what we expected from an NVMe-based drive.</p>
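<p>A fio job file along these lines reproduces that access pattern (an illustrative sketch with assumed runtime and file-size parameters, not the exact script we ran):</p>

```ini
; kafka-sim.fio -- illustrative job file, not the exact script used
[global]
ioengine=libaio
direct=1
size=10g
time_based=1
runtime=60

[kafka-sim]
rw=rw            ; mixed sequential read/write
rwmixwrite=75    ; 75% writes, 25% reads
bs=2m            ; Kafka's 2MB write size; rerun with bs=4k for the other end
```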
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6gqCJeXAL1sUcfVmBW3tVx/7b8a3a9a233086a321967ebb20878434/image5-5.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6jvSoQBw4BnPCkDYGyeqBH/b2c0a79f10afbdd73ee73d2545f5700f/image4-9.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1N6j9iEmdVaGiomZaoT1wH/cea3efb6d4781f8c0856743869feeb39/image3-8.png" />
            
            </figure>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3KfgNVcwPiNMcb9Gv3T2KU/74c8932b2f69bacfa32754c4c7c1e2d8/image6-5.png" />
            
            </figure>
    <div>
      <h3>Core Compute 2020 Production Benchmarking</h3>
      <a href="#core-compute-2020-production-benchmarking">
        
      </a>
    </div>
    <p>Cloudflare runs many of our Core Compute services in Kubernetes containers, some of which are multi-core. By transitioning to a single socket, we eliminated the problems associated with dual sockets, and we are guaranteed to have all cores allocated for any given container on the same socket.</p><p>Another heavy workload that is constantly running on Compute hosts is the Cloudflare <a href="/the-csam-scanning-tool/">CSAM Scanning Tool</a>. Our Systems Engineering team isolated a Compute 2020 compute host and a previous generation compute host, had them run just this workload, and measured the time to compare the fuzzy hashes for images to the NCMEC hash lists and verify that they are a “miss”.</p><p>Because the CSAM Scanning Tool is very compute intensive, we specifically isolated it to take a look at its performance with the new hardware. We’ve spent a great deal of effort on software optimization and improved algorithms for this tool, but investing in faster, better hardware is also important.</p><p>In these heatmaps, the X axis represents time, and the Y axis represents “buckets” of time taken to verify that it is not a match to one of the NCMEC hash lists. For a given time slice in the heatmap, the red point is the bucket with the most times measured, the yellow point the second most, and the green points the least. The red points on the Compute 2020 graph are all in the 5 to 8 millisecond bucket, while the red points on the previous Gen heatmap are all in the 8 to 13 millisecond bucket, which shows that on average, the Compute 2020 host is verifying hashes significantly faster.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6JYLpJjG9bAUuQLyeC2kyH/9168b62067b8776a7521e7d831472f74/image2-10.png" />
            
            </figure>
    <div>
      <h2>Core Storage 2020</h2>
      <a href="#core-storage-2020">
        
      </a>
    </div>
    <p>Another major workload we identified was <a href="/clickhouse-capacity-estimation-framework/">ClickHouse</a>, which performs analytics over large datasets. The last time we upgraded our servers running ClickHouse was back in <a href="/http-analytics-for-6m-requests-per-second-using-clickhouse/">2018</a>.</p>
    <div>
      <h3>Configuration Changes</h3>
      <a href="#configuration-changes">
        
      </a>
    </div>
    <table><tr><td><p>
</p></td><td><p><b>Previous Generation</b></p></td><td><p><b>Core Storage 2020</b></p></td></tr><tr><td><p>CPU</p></td><td><p>2 x Intel Xeon E5-2630 v4</p></td><td><p>1 x Intel Xeon Gold 6210U</p></td></tr><tr><td><p>Total Core / Thread Count</p></td><td><p>20C / 40T</p></td><td><p>20C / 40T</p></td></tr><tr><td><p>Base / Turbo Frequency</p></td><td><p>2.2 / 3.1 GHz</p></td><td><p>2.5 / 3.9 GHz</p></td></tr><tr><td><p>Memory</p></td><td><p>8 x 32GB DDR4-2400</p></td><td><p>8 x 32GB DDR4-2933</p></td></tr><tr><td><p>Storage</p></td><td><p>12 x 10TB 7200 RPM 3.5” SATA</p></td><td><p>12 x 10TB 7200 RPM 3.5” SATA</p></td></tr><tr><td><p>Network</p></td><td><p>Mellanox CX4 Lx 2 x 25GbE</p></td><td><p>Mellanox CX4 Lx 2 x 25GbE</p></td></tr></table><p><b>CPU Changes</b></p><p>For ClickHouse, we use a 1U chassis with 12 x 10TB 3.5” hard drives. At the time we were designing Core Storage 2020 our server vendor did not yet have an AMD version of this chassis, so we remained on Intel. However, we moved Core Storage 2020 to a single 20 core / 40 thread Xeon processor, rather than the previous generation’s dual-socket 10 core / 20 thread processors. By moving to the single-socket Xeon 6210U processor, we were able to keep the same core count, but gained 17% higher base frequency and 26% higher max turbo frequency. Meanwhile, the total CPU thermal design profile (TDP), which is an approximation of the maximum power the CPU can draw, went down from 165W to 150W.</p><p>On a dual-socket server, remote memory accesses, which are memory accesses by a process on socket 0 to memory attached to socket 1, incur a latency penalty, as seen in this table:</p><table><tr><td><p>
</p></td><td><p><b>Previous Generation</b></p></td><td><p><b>Core Storage 2020</b></p></td></tr><tr><td><p>Memory latency, socket 0 to socket 0</p></td><td><p>81.3 ns</p></td><td><p>86.9 ns</p></td></tr><tr><td><p>Memory latency, socket 0 to socket 1</p></td><td><p>142.6 ns</p></td><td><p>N/A</p></td></tr></table><p>An additional advantage of having a CPU with all 20 cores on the same socket is the elimination of these remote memory accesses, which take about 75% longer than local memory accesses.</p>
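<p>That penalty falls directly out of the latency table above:</p>

```python
# Remote-access penalty on the previous generation dual-socket host.
local_ns = 81.3    # socket 0 -> socket 0
remote_ns = 142.6  # socket 0 -> socket 1
penalty = remote_ns / local_ns - 1  # ~0.75, i.e. about 75% longer
```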
    <div>
      <h3>Memory Changes</h3>
      <a href="#memory-changes">
        
      </a>
    </div>
    <p>The memory in the Core Storage 2020 host is rated for operation at 2933 MHz; however, in the 8 x 32GB configuration we need on these hosts, the Intel Xeon 6210U processor clocks them at 2666 MHz. Compared to the previous generation, this gives us a 13% boost in memory speed. While we would get a slightly higher clock speed with a balanced 6-DIMM configuration, we determined that we are willing to sacrifice the slightly higher clock speed in order to have the additional RAM capacity provided by the 8 x 32GB configuration.</p>
    <div>
      <h3>Storage Changes</h3>
      <a href="#storage-changes">
        
      </a>
    </div>
    <p>Data capacity stayed the same, with 12 x 10TB SATA drives in a RAID 0 configuration for best throughput. Unlike the previous generation, the drives in the Core Storage 2020 host are helium filled. Helium produces less drag than air, resulting in potentially lower latency.</p>
    <div>
      <h3>Core Storage 2020 Synthetic benchmarking</h3>
      <a href="#core-storage-2020-synthetic-benchmarking">
        
      </a>
    </div>
    <p>We performed synthetic four corners benchmarking: IOPS measurements of random reads and writes using 4k block size, and bandwidth measurements of sequential reads and writes using 128k block size. We used the fio tool to see what improvements we would get in a lab environment. The results show a 10% latency improvement and 11% IOPS improvement in random read performance. Random write testing shows 38% lower latency and 60% higher IOPS. Write throughput is improved by 23%, and read throughput is improved by a whopping 90%.</p><p></p><table><tr><td><p>
</p></td><td><p><b>Previous Generation</b></p></td><td><p><b>Core Storage 2020</b></p></td><td><p><b>% Improvement</b></p></td></tr><tr><td><p>4k Random Reads (IOPS)</p></td><td><p>3,384</p></td><td><p>3,758</p></td><td><p>11.0%</p></td></tr><tr><td><p>4k Random Read Mean Latency (ms, lower is better)</p></td><td><p>75.4</p></td><td><p>67.8</p></td><td><p>10.1% lower</p></td></tr><tr><td><p>4k Random Writes (IOPS)</p></td><td><p>4,009</p></td><td><p>6,397</p></td><td><p>59.6%</p></td></tr><tr><td><p>4k Random Write Mean Latency (ms, lower is better)</p></td><td><p>63.5</p></td><td><p>39.7</p></td><td><p>37.5% lower</p></td></tr><tr><td><p>128k Sequential Reads (MB/s)</p></td><td><p>1,155</p></td><td><p>2,195</p></td><td><p>90.0%</p></td></tr><tr><td><p>128k Sequential Writes (MB/s)</p></td><td><p>1,265</p></td><td><p>1,558</p></td><td><p>23.2%</p></td></tr></table>
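<p>The improvement percentages in the table follow directly from the raw measurements:</p>

```python
# Recomputing the percentage improvements from the fio results above.
prev = {"rand_read_iops": 3384, "rand_write_iops": 4009,
        "seq_read_mbps": 1155, "seq_write_mbps": 1265}
new = {"rand_read_iops": 3758, "rand_write_iops": 6397,
       "seq_read_mbps": 2195, "seq_write_mbps": 1558}
improvement = {k: 100 * (new[k] - prev[k]) / prev[k] for k in prev}
# e.g. sequential reads: 100 * (2195 - 1155) / 1155 = ~90.0
```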
    <div>
      <h3>CPU frequencies</h3>
      <a href="#cpu-frequencies">
        
      </a>
    </div>
    <p>The higher base and turbo frequencies of the Core Storage 2020 host’s Xeon 6210U processor allowed that processor to achieve higher average frequencies while running our production ClickHouse workload. A recent snapshot of two production hosts showed the Core Storage 2020 host being able to sustain an average of 31% higher CPU frequency while running ClickHouse.</p><table><tr><td><p>
</p></td><td><p><b>Previous generation (average core frequency)</b></p></td><td><p><b>Core Storage 2020 (average core frequency)</b></p></td><td><p><b>% improvement</b></p></td></tr><tr><td><p>Mean Core Frequency</p></td><td><p>2441 MHz</p></td><td><p>3199 MHz</p></td><td><p>31%</p></td></tr></table>
    <div>
      <h3>Core Storage 2020 Production benchmarking</h3>
      <a href="#core-storage-2020-production-benchmarking">
        
      </a>
    </div>
    <p>Our ClickHouse database hosts are continually performing merge operations to optimize the database data structures. Each individual merge operation takes just a few seconds on average, but since they’re constantly running, they can consume significant resources on the host. We sampled the average merge time every five minutes over seven days, and then aggregated the data to find the average, minimum, and maximum merge times reported by a Core Storage 2020 host and by a previous generation host. Results are summarized below.</p>
    <div>
      <h3>ClickHouse merge operation performance improvement</h3>
      <a href="#clickhouse-merge-operation-performance-improvement">
        
      </a>
    </div>
    <table><tr><td><p><b>Time</b></p></td><td><p><b>Previous generation</b></p></td><td><p><b>Core Storage 2020</b></p></td><td><p><b>% improvement</b></p></td></tr><tr><td><p>Mean time to merge</p></td><td><p>1.83</p></td><td><p>1.15</p></td><td><p>37% lower</p></td></tr><tr><td><p>Maximum merge time</p></td><td><p>3.51</p></td><td><p>2.35</p></td><td><p>33% lower</p></td></tr><tr><td><p>Minimum merge time</p></td><td><p>0.68</p></td><td><p>0.32</p></td><td><p>53% lower</p></td></tr></table><p>Our lab-measured CPU frequency and storage performance improvements on Core Storage 2020 have translated into significantly reduced times to perform this database operation.</p>
    <div>
      <h2>Conclusion</h2>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>With our Core 2020 servers, we were able to realize significant performance improvements, both in synthetic benchmarking outside production and in the production workloads we tested. This will allow Cloudflare to run the same workloads on fewer servers, saving CapEx costs and data center rack space. The similarity of the configuration of the Kubernetes and Kafka hosts should help with fleet management and spare parts management. For our next redesign, we will try to further converge the designs on which we run the major Core workloads to further improve efficiency.</p><p>Special thanks to Will Buckner and Chris Snook for their help in the development of these servers, and to Tim Bart for validating CSAM Scanning Tool’s performance on Compute.</p> ]]></content:encoded>
            <category><![CDATA[Hardware]]></category>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[Kubernetes]]></category>
            <category><![CDATA[ClickHouse]]></category>
            <category><![CDATA[Gen X]]></category>
            <guid isPermaLink="false">4fzfrcZ8XekQ1ykkMxltp1</guid>
            <dc:creator>Brian Bassett</dc:creator>
        </item>
        <item>
            <title><![CDATA[Tracing System CPU on Debian Stretch]]></title>
            <link>https://blog.cloudflare.com/tracing-system-cpu-on-debian-stretch/</link>
            <pubDate>Sun, 13 May 2018 16:00:00 GMT</pubDate>
            <description><![CDATA[ How an innocent OS upgrade triggered a cascade of issues and forced us into tracing Linux networking internals. ]]></description>
            <content:encoded><![CDATA[ <p><i>This is a heavily truncated version of an internal blog post from August 2017. For more recent updates on Kafka, check out </i><a href="/squeezing-the-firehose/"><i>another blog post on compression</i></a><i>, where we optimized throughput 4.5x for both disks and network.</i></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6o0r4Jk1oqG6ncMv8xWNb3/08bd0256a7509447b87aa08a7e7305f5/photo-1511971523672-53e6411f62b9" />
            
            </figure><p>Photo by <a href="https://unsplash.com/@alex_povolyashko?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Alex Povolyashko</a> / <a href="https://unsplash.com/?utm_source=ghost&amp;utm_medium=referral&amp;utm_campaign=api-credit">Unsplash</a></p>
    <div>
      <h3>Upgrading our systems to Debian Stretch</h3>
      <a href="#upgrading-our-systems-to-debian-stretch">
        
      </a>
    </div>
    <p>For quite some time we've been rolling out Debian Stretch, to the point where we have reached ~10% adoption in our core datacenters. As part of upgrading the underlying OS, we also evaluate the higher level software stack, e.g. taking a look at our ClickHouse and Kafka clusters.</p><p>During our upgrade of Kafka, we successfully migrated two smaller clusters, <code>logs</code> and <code>dns</code>, but ran into issues when attempting to upgrade one of our larger clusters, <code>http</code>.</p><p>Thankfully, we were able to roll back the <code>http</code> cluster upgrade relatively easily, due to heavy versioning of both the OS and the higher level software stack. If there's one takeaway from this blog post, it's to take advantage of consistent versioning.</p>
    <div>
      <h3>High level differences</h3>
      <a href="#high-level-differences">
        
      </a>
    </div>
    <p>We upgraded one Kafka <code>http</code> node, and it did not go as planned:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2hbPsT1ahBYgS806ztIpG3/070d402c9a5c1d38f257d65d87252f6c/1.png" />
            </figure><p>Having 5x CPU usage was definitely an unexpected outcome. For control datapoints, we compared to a node where no upgrade happened, and an intermediary node that received a software stack upgrade, but not an OS upgrade. Neither of these two nodes experienced the same CPU saturation issues, even though their setups were practically identical.</p><p>For debugging CPU saturation issues, we call on <code>perf</code> to fish out details:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2FsdAiwWMDh5t6VTTLuSV9/14c09b1b8fc3053a3dd4d49ff467f19a/2-3.png" />
            </figure><p><i>The command used was: </i><code><i>perf top -F 99</i></code><i>.</i></p>
    <div>
      <h3>RCU stalls</h3>
      <a href="#rcu-stalls">
        
      </a>
    </div>
    <p>In addition to higher system CPU usage, we found secondary slowdowns, including <a href="http://www.rdrop.com/~paulmck/RCU/whatisRCU.html">read-copy update (RCU)</a> stalls:</p>
            <pre><code>[ 4909.110009] logfwdr (26887) used greatest stack depth: 11544 bytes left
[ 4909.392659] oom_reaper: reaped process 26861 (logfwdr), now anon-rss:8kB, file-rss:0kB, shmem-rss:0kB
[ 4923.462841] INFO: rcu_sched self-detected stall on CPU
[ 4923.462843]  13-...: (2 GPs behind) idle=ea7/140000000000001/0 softirq=1/2 fqs=4198
[ 4923.462845]   (t=8403 jiffies g=110722 c=110721 q=6440)</code></pre>
            <p>We've seen RCU stalls before, and our (suboptimal) solution was to reboot the machine.</p><p>However, one can only handle so many reboots before the problem becomes severe enough to warrant a deep dive. During our deep dive, we noticed in <code>dmesg</code> that we had issues allocating memory while trying to write errors:</p>
            <pre><code>Aug 15 21:51:35 myhost kernel: INFO: rcu_sched detected stalls on CPUs/tasks:
Aug 15 21:51:35 myhost kernel:         26-...: (1881 ticks this GP) idle=76f/140000000000000/0 softirq=8/8 fqs=365
Aug 15 21:51:35 myhost kernel:         (detected by 0, t=2102 jiffies, g=1837293, c=1837292, q=262)
Aug 15 21:51:35 myhost kernel: Task dump for CPU 26:
Aug 15 21:51:35 myhost kernel: java            R  running task    13488  1714   1513 0x00080188
Aug 15 21:51:35 myhost kernel:  ffffc9000d1f7898 ffffffff814ee977 ffff88103f410400 000000000000000a
Aug 15 21:51:35 myhost kernel:  0000000000000041 ffffffff82203142 ffffc9000d1f78c0 ffffffff814eea10
Aug 15 21:51:35 myhost kernel:  0000000000000041 ffffffff82203142 ffff88103f410400 ffffc9000d1f7920
Aug 15 21:51:35 myhost kernel: Call Trace:
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff814ee977&gt;] ? scrup+0x147/0x160
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff814eea10&gt;] ? lf+0x80/0x90
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff814eecb5&gt;] ? vt_console_print+0x295/0x3c0
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff810b1193&gt;] ? call_console_drivers.isra.22.constprop.30+0xf3/0x100
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff810b1f51&gt;] ? console_unlock+0x281/0x550
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff810b2498&gt;] ? vprintk_emit+0x278/0x430
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff810b27ef&gt;] ? vprintk_default+0x1f/0x30
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff811588df&gt;] ? printk+0x48/0x50
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff810b30ee&gt;] ? dump_stack_print_info+0x7e/0xc0
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff8142d41f&gt;] ? dump_stack+0x44/0x65
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff81162e64&gt;] ? warn_alloc+0x124/0x150
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff81163842&gt;] ? __alloc_pages_slowpath+0x932/0xb80
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff81163c92&gt;] ? __alloc_pages_nodemask+0x202/0x250
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff811ae9c2&gt;] ? alloc_pages_current+0x92/0x120
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff81159d2f&gt;] ? __page_cache_alloc+0xbf/0xd0
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff8115cdfa&gt;] ? filemap_fault+0x2ea/0x4d0
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff8136dc95&gt;] ? xfs_filemap_fault+0x45/0xa0
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff8118b3eb&gt;] ? __do_fault+0x6b/0xd0
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff81190028&gt;] ? handle_mm_fault+0xe98/0x12b0
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff8110756b&gt;] ? __seccomp_filter+0x1db/0x290
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff8104fa5c&gt;] ? __do_page_fault+0x22c/0x4c0
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff8104fd10&gt;] ? do_page_fault+0x20/0x70
Aug 15 21:51:35 myhost kernel:  [&lt;ffffffff819bea02&gt;] ? page_fault+0x22/0x30</code></pre>
            <p>This suggested that we were logging too many errors, and that the actual failure may have occurred earlier in the process. Armed with this hypothesis, we looked at the very beginning of the error chain:</p>
            <pre><code>Aug 16 01:14:51 myhost systemd-journald[13812]: Missed 17171 kernel messages
Aug 16 01:14:51 myhost kernel:  [&lt;ffffffff81171754&gt;] shrink_inactive_list+0x1f4/0x4f0
Aug 16 01:14:51 myhost kernel:  [&lt;ffffffff8117234b&gt;] shrink_node_memcg+0x5bb/0x780
Aug 16 01:14:51 myhost kernel:  [&lt;ffffffff811725e2&gt;] shrink_node+0xd2/0x2f0
Aug 16 01:14:51 myhost kernel:  [&lt;ffffffff811728ef&gt;] do_try_to_free_pages+0xef/0x310
Aug 16 01:14:51 myhost kernel:  [&lt;ffffffff81172be5&gt;] try_to_free_pages+0xd5/0x180
Aug 16 01:14:51 myhost kernel:  [&lt;ffffffff811632db&gt;] __alloc_pages_slowpath+0x31b/0xb80</code></pre>
            <p>As much as <code>shrink_node</code> may scream "NUMA issues", the truly telling line is this one:</p>
            <pre><code>Aug 16 01:14:51 myhost systemd-journald[13812]: Missed 17171 kernel messages</code></pre>
            <p>We also found memory allocation issues:</p>
            <pre><code>[78972.506644] Mem-Info:
[78972.506653] active_anon:3936889 inactive_anon:371971 isolated_anon:0
[78972.506653]  active_file:25778474 inactive_file:1214478 isolated_file:2208
[78972.506653]  unevictable:0 dirty:1760643 writeback:0 unstable:0
[78972.506653]  slab_reclaimable:1059804 slab_unreclaimable:141694
[78972.506653]  mapped:47285 shmem:535917 pagetables:10298 bounce:0
[78972.506653]  free:202928 free_pcp:3085 free_cma:0
[78972.506660] Node 0 active_anon:8333016kB inactive_anon:989808kB active_file:50622384kB inactive_file:2401416kB unevictable:0kB isolated(anon):0kB isolated(file):3072kB mapped:96624kB dirty:3422168kB writeback:0kB shmem:1261156kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:15744 all_unreclaimable? no
[78972.506666] Node 1 active_anon:7414540kB inactive_anon:498076kB active_file:52491512kB inactive_file:2456496kB unevictable:0kB isolated(anon):0kB isolated(file):5760kB mapped:92516kB dirty:3620404kB writeback:0kB shmem:882512kB shmem_thp: 0kB shmem_pmdmapped: 0kB anon_thp: 0kB writeback_tmp:0kB unstable:0kB pages_scanned:9080974 all_unreclaimable? no
[78972.506671] Node 0 DMA free:15900kB min:100kB low:124kB high:148kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB writepending:0kB present:15996kB managed:15900kB mlocked:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB
** 9 printk messages dropped ** [78972.506716] Node 0 Normal: 15336*4kB (UMEH) 4584*8kB (MEH) 2119*16kB (UME) 775*32kB (MEH) 106*64kB (UM) 81*128kB (MH) 29*256kB (UM) 25*512kB (M) 19*1024kB (M) 7*2048kB (M) 2*4096kB (M) = 236080kB
[78972.506725] Node 1 Normal: 31740*4kB (UMEH) 3879*8kB (UMEH) 873*16kB (UME) 353*32kB (UM) 286*64kB (UMH) 62*128kB (UMH) 28*256kB (MH) 20*512kB (UMH) 15*1024kB (UM) 7*2048kB (UM) 12*4096kB (M) = 305752kB
[78972.506726] Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[78972.506727] Node 1 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
[78972.506728] 27531091 total pagecache pages
[78972.506729] 0 pages in swap cache
[78972.506730] Swap cache stats: add 0, delete 0, find 0/0
[78972.506730] Free swap  = 0kB
[78972.506731] Total swap = 0kB
[78972.506731] 33524975 pages RAM
[78972.506732] 0 pages HighMem/MovableOnly
[78972.506732] 546255 pages reserved
[78972.620129] ntpd: page allocation stalls for 272380ms, order:0, mode:0x24000c0(GFP_KERNEL)
[78972.620132] CPU: 16 PID: 13099 Comm: ntpd Tainted: G           O    4.9.43-cloudflare-2017.8.4 #1
[78972.620133] Hardware name: Quanta Computer Inc D51B-2U (dual 1G LoM)/S2B-MB (dual 1G LoM), BIOS S2B_3A21 10/01/2015
[78972.620136]  ffffc90022f9b6f8 ffffffff8142d668 ffffffff81ca31b8 0000000000000001
[78972.620138]  ffffc90022f9b778 ffffffff81162f14 024000c022f9b740 ffffffff81ca31b8
[78972.620140]  ffffc90022f9b720 0000000000000010 ffffc90022f9b788 ffffc90022f9b738
[78972.620140] Call Trace:
[78972.620148]  [&lt;ffffffff8142d668&gt;] dump_stack+0x4d/0x65
[78972.620152]  [&lt;ffffffff81162f14&gt;] warn_alloc+0x124/0x150
[78972.620154]  [&lt;ffffffff811638f2&gt;] __alloc_pages_slowpath+0x932/0xb80
[78972.620157]  [&lt;ffffffff81163d42&gt;] __alloc_pages_nodemask+0x202/0x250
[78972.620160]  [&lt;ffffffff811aeae2&gt;] alloc_pages_current+0x92/0x120
[78972.620162]  [&lt;ffffffff8115f6ee&gt;] __get_free_pages+0xe/0x40
[78972.620165]  [&lt;ffffffff811e747a&gt;] __pollwait+0x9a/0xe0
[78972.620168]  [&lt;ffffffff817c9ec9&gt;] datagram_poll+0x29/0x100
[78972.620170]  [&lt;ffffffff817b9d48&gt;] sock_poll+0x48/0xa0
[78972.620172]  [&lt;ffffffff811e7c35&gt;] do_select+0x335/0x7b0</code></pre>
            <p>This specific error message did seem fun:</p>
            <pre><code>[78991.546088] systemd-network: page allocation stalls for 287000ms, order:0, mode:0x24200ca(GFP_HIGHUSER_MOVABLE)</code></pre>
            <p>You don't want your page allocations to stall for almost 5 minutes, especially when it's an order-0 allocation (the smallest possible: a single 4 KiB page).</p><p>Comparing to our control nodes, there were only two possible explanations: the kernel upgrade, and the switch from Debian Jessie to Debian Stretch. We suspected the former, since the elevated CPU usage pointed at a kernel issue. However, just to be safe, we rolled the kernel back to 4.4.55 and also downgraded the affected nodes back to Debian Jessie. This was a reasonable compromise, since we needed to minimize downtime on production nodes.</p>
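<p>For reference, an order-<i>n</i> request asks the buddy allocator for 2^n contiguous base pages. A quick sketch of the sizes involved (assuming 4 KiB base pages, as on x86):</p>

```shell
# Size of an order-n buddy allocation: order 0 is a single 4 KiB page,
# and each order above it doubles the size.
page_size=4096
for order in 0 1 2 3; do
  printf 'order %d: %d KiB\n' "$order" $(( (page_size << order) / 1024 ))
done
```

<p>The stalls in the logs above were all order-0, so the kernel was struggling to hand out even a single page.</p>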
    <div>
      <h3>Digging a bit deeper</h3>
      <a href="#digging-a-bit-deeper">
        
      </a>
    </div>
    <p>Keeping servers on an older kernel and distribution is not a viable long-term solution. Through bisection we found that, contrary to our initial hypothesis, the issue lay in the Jessie to Stretch upgrade.</p><p>Now that we knew what the problem was, we proceeded to investigate why. With the help of existing automation around <code>perf</code> and Java, we generated the following flamegraphs:</p><ul><li><p>Jessie</p></li></ul>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3fhMCSmQj4IC8MLxPN2d1V/60a107967bdede0ba8c4465090fb6ec4/9.png" />
            </figure><ul><li><p>Stretch</p></li></ul>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6b523AB48TNUF2jj6OYhxi/9cadde05d8cf89187f182b56c48b3c1b/10.png" />
            </figure><p>At first it looked like Jessie was doing <code>writev</code> instead of <code>sendfile</code>, but the full flamegraphs revealed that Stretch was executing <code>sendfile</code> a lot more slowly.</p><p>If you highlight <code>sendfile</code>:</p><ul><li><p>Jessie</p></li></ul>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6XXts4Hvy58nwa8FG3ZNfT/2e36ce3aa2b111059bcff6a21e3da712/11.png" />
            </figure><ul><li><p>Stretch</p></li></ul>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/32BB3vFsnbl7ul6b6Aa5MP/10788cfa90962c2034e7be7fc6b76a1f/12.png" />
            </figure><p>And zoomed in:</p><ul><li><p>Jessie</p></li></ul>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/75TU9Q58iCRcCKAt6eZxf3/4cdf2f2038bb4e7ef813ba7e21562121/13.png" />
            </figure><ul><li><p>Stretch</p></li></ul>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/fgUsgL2hUrhJHg3ns5HeE/3b733ecc2ed4ebee2a751394a81804fb/14.png" />
            </figure><p>These two look very different.</p><p>Some colleagues suggested that the differences in the graphs may be due to TCP offload being disabled, but upon checking our NIC settings, we found that the feature flags were identical.</p><p>We'll dive into the differences in the next section.</p>
    <div>
      <h3>And deeper</h3>
      <a href="#and-deeper">
        
      </a>
    </div>
    <p>To trace latency distributions of <code>sendfile</code> syscalls between Jessie and Stretch, we used <a href="https://github.com/iovisor/bcc/blob/master/tools/funclatency_example.txt"><code>funclatency</code></a> from <a href="https://iovisor.github.io/bcc/">bcc-tools</a>:</p><ul><li><p>Jessie</p></li></ul>
            <pre><code>$ sudo /usr/share/bcc/tools/funclatency -uTi 1 do_sendfile
Tracing 1 functions for "do_sendfile"... Hit Ctrl-C to end.
23:27:25
     usecs               : count     distribution
         0 -&gt; 1          : 9        |                                        |
         2 -&gt; 3          : 47       |****                                    |
         4 -&gt; 7          : 53       |*****                                   |
         8 -&gt; 15         : 379      |****************************************|
        16 -&gt; 31         : 329      |**********************************      |
        32 -&gt; 63         : 101      |**********                              |
        64 -&gt; 127        : 23       |**                                      |
       128 -&gt; 255        : 50       |*****                                   |
       256 -&gt; 511        : 7        |                                        |</code></pre>
            <ul><li><p>Stretch</p></li></ul>
            <pre><code>$ sudo /usr/share/bcc/tools/funclatency -uTi 1 do_sendfile
Tracing 1 functions for "do_sendfile"... Hit Ctrl-C to end.
23:27:28
     usecs               : count     distribution
         0 -&gt; 1          : 1        |                                        |
         2 -&gt; 3          : 20       |***                                     |
         4 -&gt; 7          : 46       |*******                                 |
         8 -&gt; 15         : 56       |********                                |
        16 -&gt; 31         : 65       |**********                              |
        32 -&gt; 63         : 75       |***********                             |
        64 -&gt; 127        : 75       |***********                             |
       128 -&gt; 255        : 258      |****************************************|
       256 -&gt; 511        : 144      |**********************                  |
       512 -&gt; 1023       : 24       |***                                     |
      1024 -&gt; 2047       : 27       |****                                    |
      2048 -&gt; 4095       : 28       |****                                    |
      4096 -&gt; 8191       : 35       |*****                                   |
      8192 -&gt; 16383      : 1        |                                        |</code></pre>
            <p>In the flamegraphs, you can see timers being set at the tip (<code>mod_timer</code> function), with these timers taking locks. On Stretch we installed 3x more timers, resulting in 10x the amount of contention:</p><ul><li><p>Jessie</p></li></ul>
            <pre><code>$ sudo /usr/share/bcc/tools/funccount -T -i 1 mod_timer
Tracing 1 functions for "mod_timer"... Hit Ctrl-C to end.
00:33:36
FUNC                                    COUNT
mod_timer                               60482
00:33:37
FUNC                                    COUNT
mod_timer                               58263
00:33:38
FUNC                                    COUNT
mod_timer                               54626</code></pre>
            
            <pre><code>$ sudo /usr/share/bcc/tools/funccount -T -i 1 lock_timer_base
Tracing 1 functions for "lock_timer_base"... Hit Ctrl-C to end.
00:32:36
FUNC                                    COUNT
lock_timer_base                         15962
00:32:37
FUNC                                    COUNT
lock_timer_base                         16261
00:32:38
FUNC                                    COUNT
lock_timer_base                         15806</code></pre>
            <ul><li><p>Stretch</p></li></ul>
            <pre><code>$ sudo /usr/share/bcc/tools/funccount -T -i 1 mod_timer
Tracing 1 functions for "mod_timer"... Hit Ctrl-C to end.
00:33:28
FUNC                                    COUNT
mod_timer                              149068
00:33:29
FUNC                                    COUNT
mod_timer                              155994
00:33:30
FUNC                                    COUNT
mod_timer                              160688</code></pre>
            
            <pre><code>$ sudo /usr/share/bcc/tools/funccount -T -i 1 lock_timer_base
Tracing 1 functions for "lock_timer_base"... Hit Ctrl-C to end.
00:32:32
FUNC                                    COUNT
lock_timer_base                        119189
00:32:33
FUNC                                    COUNT
lock_timer_base                        196895
00:32:34
FUNC                                    COUNT
lock_timer_base                        140085</code></pre>
            <p>The Linux kernel includes debugging facilities for timers, which <a href="https://elixir.bootlin.com/linux/v4.9.43/source/kernel/time/timer.c#L1010">call</a> the <code>timer:timer_start</code> <a href="https://elixir.bootlin.com/linux/v4.9.43/source/include/trace/events/timer.h#L44">tracepoint</a> on every timer start. This allowed us to pull up timer names:</p><ul><li><p>Jessie</p></li></ul>
            <pre><code>$ sudo perf record -e timer:timer_start -p 23485 -- sleep 10 &amp;&amp; sudo perf script | sed 's/.* function=//g' | awk '{ print $1 }' | sort | uniq -c
[ perf record: Woken up 54 times to write data ]
[ perf record: Captured and wrote 17.778 MB perf.data (173520 samples) ]
      6 blk_rq_timed_out_timer
      2 clocksource_watchdog
      5 commit_timeout
      5 cursor_timer_handler
      2 dev_watchdog
     10 garp_join_timer
      2 ixgbe_service_timer
     36 reqsk_timer_handler
   4769 tcp_delack_timer
    171 tcp_keepalive_timer
 168512 tcp_write_timer</code></pre>
            <ul><li><p>Stretch</p></li></ul>
            <pre><code>$ sudo perf record -e timer:timer_start -p 3416 -- sleep 10 &amp;&amp; sudo perf script | sed 's/.* function=//g' | awk '{ print $1 }' | sort | uniq -c
[ perf record: Woken up 671 times to write data ]
[ perf record: Captured and wrote 198.273 MB perf.data (1988650 samples) ]
      6 clocksource_watchdog
      4 commit_timeout
     12 cursor_timer_handler
      2 dev_watchdog
     18 garp_join_timer
      4 ixgbe_service_timer
      1 neigh_timer_handler
      1 reqsk_timer_handler
   4622 tcp_delack_timer
      1 tcp_keepalive_timer
1983978 tcp_write_timer
      1 writeout_period</code></pre>
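<p>A quick sanity check of the ratio between the two <code>tcp_write_timer</code> counts (168,512 installs on Jessie vs. 1,983,978 on Stretch, each over a 10-second capture):</p>

```shell
# Ratio of tcp_write_timer installs between the two captures above.
# prints: 11.8x
awk -v jessie=168512 -v stretch=1983978 'BEGIN { printf "%.1fx\n", stretch / jessie }'
```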
            <p>So basically, on Stretch we install roughly 12x more <code>tcp_write_timer</code> timers, resulting in higher kernel CPU usage.</p><p>Taking flamegraphs of the timers specifically revealed the differences in their operation:</p><ul><li><p>Jessie</p></li></ul>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4PJjYK3FzgAeQxpbHPGn5i/06f546c8ea1cda3d58c4c54dd3618a15/15.png" />
            </figure><ul><li><p>Stretch</p></li></ul>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7M8XyRvy7vHDytdpWJQXAr/784aa92acf4f92c8896d08e2fede9bcd/16.png" />
            </figure><p>We then traced the functions that were different:</p><ul><li><p>Jessie</p></li></ul>
            <pre><code>$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_sendmsg
Tracing 1 functions for "tcp_sendmsg"... Hit Ctrl-C to end.
03:33:33
FUNC                                    COUNT
tcp_sendmsg                             21166
03:33:34
FUNC                                    COUNT
tcp_sendmsg                             21768
03:33:35
FUNC                                    COUNT
tcp_sendmsg                             21712</code></pre>
            
            <pre><code>$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-C to end.
03:37:14
FUNC                                    COUNT
tcp_push_one                              496
03:37:15
FUNC                                    COUNT
tcp_push_one                              432
03:37:16
FUNC                                    COUNT
tcp_push_one                              495</code></pre>
            
            <pre><code>$ sudo /usr/share/bcc/tools/trace -p 23485 'tcp_sendmsg "%d", arg3' -T -M 100000 | awk '{ print $NF }' | sort | uniq -c | sort -n | tail
   1583 4
   2043 54
   3546 18
   4016 59
   4423 50
   5349 8
   6154 40
   6620 38
  17121 51
  39528 44</code></pre>
            <ul><li><p>Stretch</p></li></ul>
            <pre><code>$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_sendmsg
Tracing 1 functions for "tcp_sendmsg"... Hit Ctrl-C to end.
03:33:30
FUNC                                    COUNT
tcp_sendmsg                             53834
03:33:31
FUNC                                    COUNT
tcp_sendmsg                             49472
03:33:32
FUNC                                    COUNT
tcp_sendmsg                             51221</code></pre>
            
            <pre><code>$ sudo /usr/share/bcc/tools/funccount -T -i 1 tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-C to end.
03:37:10
FUNC                                    COUNT
tcp_push_one                            64483
03:37:11
FUNC                                    COUNT
tcp_push_one                            65058
03:37:12
FUNC                                    COUNT
tcp_push_one                            72394</code></pre>
            
            <pre><code>$ sudo /usr/share/bcc/tools/trace -p 3416 'tcp_sendmsg "%d", arg3' -T -M 100000 | awk '{ print $NF }' | sort | uniq -c | sort -n | tail
    396 46
    409 4
   1124 50
   1305 18
   1547 40
   1672 59
   1729 8
   2181 38
  19052 44
  64504 4096</code></pre>
            <p>The traces showed huge variations in <code>tcp_sendmsg</code> and <code>tcp_push_one</code> call counts within <code>sendfile</code>.</p><p>To introspect further, we leveraged a kernel feature available since 4.9: the ability to count stacks. This let us measure what calls <code>tcp_push_one</code>:</p><ul><li><p>Jessie</p></li></ul>
            <pre><code>$ sudo /usr/share/bcc/tools/stackcount -i 10 tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-C to end.
  tcp_push_one
  inet_sendmsg
  sock_sendmsg
  sock_write_iter
  do_iter_readv_writev
  do_readv_writev
  vfs_writev
  do_writev
  SyS_writev
  do_syscall_64
  return_from_SYSCALL_64
    1
  tcp_push_one
  inet_sendpage
  kernel_sendpage
  sock_sendpage
  pipe_to_sendpage
  __splice_from_pipe
  splice_from_pipe
  generic_splice_sendpage
  direct_splice_actor
  splice_direct_to_actor
  do_splice_direct
  do_sendfile
  sys_sendfile64
  do_syscall_64
  return_from_SYSCALL_64
    4950</code></pre>
            <ul><li><p>Stretch</p></li></ul>
            <pre><code>$ sudo /usr/share/bcc/tools/stackcount -i 10 tcp_push_one
Tracing 1 functions for "tcp_push_one"... Hit Ctrl-C to end.
  tcp_push_one
  inet_sendmsg
  sock_sendmsg
  sock_write_iter
  do_iter_readv_writev
  do_readv_writev
  vfs_writev
  do_writev
  SyS_writev
  do_syscall_64
  return_from_SYSCALL_64
    123
  tcp_push_one
  inet_sendmsg
  sock_sendmsg
  sock_write_iter
  __vfs_write
  vfs_write
  SyS_write
  do_syscall_64
  return_from_SYSCALL_64
    172
  tcp_push_one
  inet_sendmsg
  sock_sendmsg
  kernel_sendmsg
  sock_no_sendpage
  tcp_sendpage
  inet_sendpage
  kernel_sendpage
  sock_sendpage
  pipe_to_sendpage
  __splice_from_pipe
  splice_from_pipe
  generic_splice_sendpage
  direct_splice_actor
  splice_direct_to_actor
  do_splice_direct
  do_sendfile
  sys_sendfile64
  do_syscall_64
  return_from_SYSCALL_64
    735110</code></pre>
            <p>If you diff the most popular stacks, you'll get:</p>
            <pre><code>--- jessie.txt  2017-08-16 21:14:13.000000000 -0700
+++ stretch.txt 2017-08-16 21:14:20.000000000 -0700
@@ -1,4 +1,9 @@
 tcp_push_one
+inet_sendmsg
+sock_sendmsg
+kernel_sendmsg
+sock_no_sendpage
+tcp_sendpage
 inet_sendpage
 kernel_sendpage
 sock_sendpage</code></pre>
            <p>Let's look closer at <a href="https://elixir.bootlin.com/linux/v4.9.43/source/net/ipv4/tcp.c#L1012"><code>tcp_sendpage</code></a>:</p>
            <pre><code>int tcp_sendpage(struct sock *sk, struct page *page, int offset,
         size_t size, int flags)
{
    ssize_t res;

    if (!(sk-&gt;sk_route_caps &amp; NETIF_F_SG) ||
        !sk_check_csum_caps(sk))
        return sock_no_sendpage(sk-&gt;sk_socket, page, offset, size,
                    flags);

    lock_sock(sk);

    tcp_rate_check_app_limited(sk);  /* is sending application-limited? */

    res = do_tcp_sendpages(sk, page, offset, size, flags);
    release_sock(sk);
    return res;
}</code></pre>
            <p>Judging by the stacks, on Stretch we take the early <code>return</code> into <code>sock_no_sendpage</code>, which means the socket's route capabilities lack <a href="https://elixir.bootlin.com/linux/v4.9.43/source/include/linux/netdev_features.h#L115">NETIF_F_SG</a>. We looked up what that flag enables: scatter-gather, a prerequisite for <a href="https://en.wikipedia.org/wiki/Large_send_offload">segmentation offload</a>. This is peculiar, since both OS'es should have it enabled.</p>
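<p>To make the branch explicit, here is a toy shell mock of that dispatch (purely illustrative; the real check is the C above, driven by the capability bits in <code>sk_route_caps</code>):</p>

```shell
# Mock of tcp_sendpage's dispatch: without scatter-gather (NETIF_F_SG)
# and usable checksum offload on the route, the kernel falls back to the
# copying sock_no_sendpage path instead of zero-copy do_tcp_sendpages.
sendpage_path() {
  sg=$1
  csum=$2
  if [ "$sg" -eq 0 ] || [ "$csum" -eq 0 ]; then
    echo sock_no_sendpage
  else
    echo do_tcp_sendpages
  fi
}
sendpage_path 0 1   # no scatter-gather -> slow path
sendpage_path 1 1   # both capabilities -> fast path
```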
    <div>
      <h3>Even deeper, to the crux</h3>
      <a href="#even-deeper-to-the-crux">
        
      </a>
    </div>
    <p>It turned out that we had segmentation offload enabled for only a few of our NICs: <code>eth2</code>, <code>eth3</code>, and <code>bond0</code>. Our network setup can be described as follows:</p>
            <pre><code>eth2 --&gt;|              |--&gt; vlan10
        |---&gt; bond0 --&gt;|
eth3 --&gt;|              |--&gt; vlan100</code></pre>
            <p><b>The missing piece: segmentation offload was disabled on the VLAN interfaces, where the actual IPs live.</b></p><p>Here's the diff from <code>ethtool -k vlan10</code>:</p>
            <pre><code>$ diff -rup &lt;(ssh jessie sudo ethtool -k vlan10) &lt;(ssh stretch sudo ethtool -k vlan10)
--- /dev/fd/63  2017-08-16 21:21:12.000000000 -0700
+++ /dev/fd/62  2017-08-16 21:21:12.000000000 -0700
@@ -1,21 +1,21 @@
 Features for vlan10:
 rx-checksumming: off [fixed]
-tx-checksumming: off
+tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
-       tx-checksum-ip-generic: off
+       tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
        tx-checksum-fcoe-crc: off
        tx-checksum-sctp: off
-scatter-gather: off
-       tx-scatter-gather: off
+scatter-gather: on
+       tx-scatter-gather: on
        tx-scatter-gather-fraglist: off
-tcp-segmentation-offload: off
-       tx-tcp-segmentation: off [requested on]
-       tx-tcp-ecn-segmentation: off [requested on]
-       tx-tcp-mangleid-segmentation: off [requested on]
-       tx-tcp6-segmentation: off [requested on]
-udp-fragmentation-offload: off [requested on]
-generic-segmentation-offload: off [requested on]
+tcp-segmentation-offload: on
+       tx-tcp-segmentation: on
+       tx-tcp-ecn-segmentation: on
+       tx-tcp-mangleid-segmentation: on
+       tx-tcp6-segmentation: on
+udp-fragmentation-offload: on
+generic-segmentation-offload: on
 generic-receive-offload: on
 large-receive-offload: off [fixed]
 rx-vlan-offload: off [fixed]</code></pre>
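<p>Eyeballing these diffs gets tedious across many interfaces, so the check can be scripted. A sketch that pulls the top-level features which are off but togglable (i.e. not marked <code>[fixed]</code>) out of <code>ethtool -k</code> output, run here on an abbreviated sample of the Stretch output above:</p>

```shell
# Abbreviated `ethtool -k vlan10` output; indented lines are sub-features.
cat > ethtool-k.txt <<'EOF'
Features for vlan10:
rx-checksumming: off [fixed]
tx-checksumming: off
       tx-checksum-ip-generic: off
scatter-gather: off
       tx-scatter-gather: off
tcp-segmentation-offload: off
generic-receive-offload: on
EOF
# Unindented features that are "off" without "[fixed]" can be re-enabled
# (ethtool -K uses short names for them, e.g. sg, tx, tso).
# prints: tx-checksumming, scatter-gather, tcp-segmentation-offload (one per line)
awk '/^[^ \t]/ && /: off$/ { sub(/:.*/, ""); print }' ethtool-k.txt
```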
            <p>So we enthusiastically enabled segmentation offload:</p>
            <pre><code>$ sudo ethtool -K vlan10 sg on</code></pre>
            <p>And it didn't help! Will the suffering ever end? Let's also enable TCP transmission checksumming offload:</p>
            <pre><code>$ sudo ethtool -K vlan10 tx on
Actual changes:
tx-checksumming: on
        tx-checksum-ip-generic: on
tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: on
        tx-tcp-mangleid-segmentation: on
        tx-tcp6-segmentation: on
udp-fragmentation-offload: on</code></pre>
            <p>Nothing. The diff is essentially empty now:</p>
            <pre><code>$ diff -rup &lt;(ssh jessie sudo ethtool -k vlan10) &lt;(ssh stretch sudo ethtool -k vlan10)
--- /dev/fd/63  2017-08-16 21:31:27.000000000 -0700
+++ /dev/fd/62  2017-08-16 21:31:27.000000000 -0700
@@ -4,11 +4,11 @@ tx-checksumming: on
        tx-checksum-ipv4: off [fixed]
        tx-checksum-ip-generic: on
        tx-checksum-ipv6: off [fixed]
-       tx-checksum-fcoe-crc: off [requested on]
-       tx-checksum-sctp: off [requested on]
+       tx-checksum-fcoe-crc: off
+       tx-checksum-sctp: off
 scatter-gather: on
        tx-scatter-gather: on
-       tx-scatter-gather-fraglist: off [requested on]
+       tx-scatter-gather-fraglist: off
 tcp-segmentation-offload: on
        tx-tcp-segmentation: on
        tx-tcp-ecn-segmentation: on</code></pre>
            <p>The last missing piece we found was that offload changes are applied only during connection initiation, so we restarted Kafka, and we immediately saw a performance improvement (green line):</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/73uyXt5y4F1L6AUULX8S9g/e80494d09daf7d0b87884c62fd5341e6/17.png" />
            </figure><p>Not enabling offload features when possible seems like a pretty bad regression, so we filed a ticket for <code>systemd</code>:</p><ul><li><p><a href="https://github.com/systemd/systemd/issues/6629">https://github.com/systemd/systemd/issues/6629</a></p></li></ul><p>In the meantime, we worked around the upstream issue by automatically enabling offload features on boot if they are disabled on VLAN interfaces.</p><p>With the fix in place, we rebooted our <code>logs</code> Kafka cluster to upgrade to the latest kernel, and the 5-day CPU usage history showed positive results:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/VBuiRySNEfFiN8LQ9nUg5/a5a1881b229cb1e173663af52f3eb136/18.png" />
            </figure><p>The DNS cluster also yielded positive results, with just 2 nodes rebooted (purple line going down):</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4CWuJQCmMt7QarvAdU0b3g/c35ad9f7a9ab6113614f736f0e682d64/19.png" />
            </figure>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>It was a mistake on our part to let a performance regression like this slip through without a good framework in place to catch it. Luckily, thanks to our heavy use of version control, we managed to bisect the issue rather quickly and keep a temporary rollback in place while we tracked down the root cause.</p><p>In the end, enabling offload also removed the RCU stalls. It's not really clear whether the missing offload was the cause or just a catalyst, but the end result speaks for itself.</p><p>On the bright side, we dug pretty deep into Linux kernel internals, and although there were fleeting moments of giving up, moving to the woods, and becoming a park ranger, we persevered and came out of the forest successful.</p><hr /><p><i>If deep diving from high level symptoms to kernel/OS issues makes you excited, </i><a href="https://www.cloudflare.com/careers/"><i>drop us a line</i></a><i>.</i></p><hr /> ]]></content:encoded>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[eBPF]]></category>
            <category><![CDATA[Linux]]></category>
            <guid isPermaLink="false">29dWe9XJa54DvzHbTBAzEk</guid>
            <dc:creator>Ivan Babrou</dc:creator>
        </item>
        <item>
            <title><![CDATA[HTTP Analytics for 6M requests per second using ClickHouse]]></title>
            <link>https://blog.cloudflare.com/http-analytics-for-6m-requests-per-second-using-clickhouse/</link>
            <pubDate>Tue, 06 Mar 2018 13:00:00 GMT</pubDate>
            <description><![CDATA[ One of our large scale data infrastructure challenges here at Cloudflare is around providing HTTP traffic analytics to our customers. HTTP Analytics is available to all our customers via two options: ]]></description>
            <content:encoded><![CDATA[ <p>One of our large scale data infrastructure challenges here at Cloudflare is around providing HTTP traffic analytics to our customers. HTTP Analytics is available to all our customers via two options:</p><ul><li><p>Analytics tab in Cloudflare dashboard</p></li><li><p>Zone Analytics API with 2 endpoints</p><ul><li><p><a href="https://api.cloudflare.com/#zone-analytics-dashboard">Dashboard endpoint</a></p></li><li><p><a href="https://api.cloudflare.com/#zone-analytics-analytics-by-co-locations">Co-locations endpoint</a> (Enterprise plan only)</p></li></ul></li></ul><p>In this blog post I'm going to talk about the exciting evolution of the Cloudflare analytics pipeline over the last year. I'll start with a description of the old pipeline and the challenges that we experienced with it. Then, I'll describe how we leveraged ClickHouse to form the basis of a new and improved pipeline. In the process, I'll share details about how we went about schema design and performance tuning for ClickHouse. Finally, I'll look forward to what the Data team is thinking of providing in the future.</p><p>Let's start with the old data pipeline.</p>
    <div>
      <h3>Old data pipeline</h3>
      <a href="#old-data-pipeline">
        
      </a>
    </div>
    <p>The previous pipeline was built in 2014. It has been mentioned previously in <a href="https://blog.cloudflare.com/scaling-out-postgresql-for-cloudflare-analytics-using-citusdb/">Scaling out PostgreSQL for CloudFlare Analytics using CitusDB</a> and <a href="https://blog.cloudflare.com/more-data-more-data/">More data, more data</a> blog posts from the Data team.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5MY5UJgdkM35pwDy4mMOxt/2c08b73e37001788547db620f00a5a92/Old-system-architecture.jpg" />
          </figure><p>It had the following components:</p><ul><li><p><b>Log forwarder</b> - collected Cap'n Proto formatted logs from the edge, notably DNS and Nginx logs, and shipped them to Kafka in Cloudflare's central datacenter.</p></li><li><p><b>Kafka cluster</b> - consisted of 106 brokers with a x3 replication factor and 106 partitions, and ingested Cap'n Proto formatted logs at an average rate of 6M logs per second.</p></li><li><p><b>Kafka consumers</b> - each of the 106 partitions had a dedicated Go consumer (a.k.a. Zoneagg consumer), which read logs, produced aggregates per partition per zone per minute, and wrote them into Postgres.</p></li><li><p><b>Postgres database</b> - a single-instance PostgreSQL database (a.k.a. RollupDB), which accepted aggregates from Zoneagg consumers and wrote them into temporary tables per partition per minute. An aggregation cron then rolled the aggregates up into further aggregates. More specifically:</p><ul><li><p>Aggregates per partition, minute, zone → aggregates per minute, zone</p></li><li><p>Aggregates per minute, zone → aggregates per hour, zone</p></li><li><p>Aggregates per hour, zone → aggregates per day, zone</p></li><li><p>Aggregates per day, zone → aggregates per month, zone</p></li></ul></li><li><p><b>Citus Cluster</b> - consisted of a Citus main node and 11 Citus workers with a x2 replication factor (a.k.a. Zoneagg Citus cluster); it was the storage behind the Zone Analytics API and our internal BI tools. A replication cron remotely copied tables from the Postgres instance into Citus worker shards.</p></li><li><p><b>Zone Analytics API</b> - served queries from the internal PHP API. It consisted of 5 API instances written in Go that queried the Citus cluster, and it was not visible to external users.</p></li><li><p><b>PHP API</b> - 3 instances of a proxying API, which forwarded public API queries to the internal Zone Analytics API and carried some business logic around zone plans, error messages, etc.</p></li><li><p><b>Load Balancer</b> - an nginx proxy that forwarded queries to the PHP API/Zone Analytics API.</p></li></ul><p>Cloudflare has grown tremendously since this pipeline was originally designed in 2014. It started off processing under 1M requests per second and grew to the current level of 6M requests per second. The pipeline had served us and our customers well over the years, but began to split at the seams. 
Any system should be re-engineered after some time, as requirements change.</p><p>Some specific disadvantages of the original pipeline were:</p><ul><li><p><b>Postgres SPOF</b> - the single PostgreSQL instance was a SPOF (Single Point of Failure): it had no replicas or backups, and if we were to lose this node, the whole analytics pipeline could be paralyzed, producing no new aggregates for the Zone Analytics API.</p></li><li><p><b>Citus main SPOF</b> - the Citus main node was the entrypoint for all Zone Analytics API queries, and if it went down, all of our customers' Analytics API queries would return errors.</p></li><li><p><b>Complex codebase</b> - thousands of lines of Bash and SQL for aggregations, plus thousands of lines of Go for the API and Kafka consumers, made the pipeline difficult to maintain and debug.</p></li><li><p><b>Many dependencies</b> - the pipeline consisted of many components, and a failure in any individual component could halt the entire pipeline.</p></li><li><p><b>High maintenance cost</b> - due to its complex architecture and codebase, there were frequent incidents, which sometimes took engineers from the Data team and other teams many hours to mitigate.</p></li></ul><p>Over time, as our request volume grew, the challenges of operating this pipeline became more apparent, and we realized that the system was being pushed to its limits. This realization inspired us to think about which components would be ideal candidates for replacement, and led us to build a new data pipeline.</p><p>Our first design of an improved analytics pipeline centred around the use of the <a href="https://flink.apache.org/">Apache Flink</a> stream processing system. We had previously used Flink for other data pipelines, so it was a natural choice for us. 
However, those pipelines had operated at much lower rates than the 6M requests per second we needed to process for HTTP Analytics, and we struggled to get Flink to scale to this volume - it simply couldn't keep up with the per-partition ingestion rate across all 6M HTTP requests per second.</p><p>Our colleagues on the DNS team had already built and productionized a DNS analytics pipeline atop ClickHouse. They wrote about it in the <a href="https://blog.cloudflare.com/how-cloudflare-analyzes-1m-dns-queries-per-second/">"How Cloudflare analyzes 1M DNS queries per second"</a> blog post. So, we decided to take a deeper look at ClickHouse.</p>
    <div>
      <h3>ClickHouse</h3>
      <a href="#clickhouse">
        
      </a>
    </div>
    <blockquote><p>"ClickHouse не тормозит."
Translation from Russian: ClickHouse doesn't have brakes (or isn't slow)
© ClickHouse core developers</p></blockquote><p>When exploring additional candidates for replacing some of the key infrastructure of our old pipeline, we realized that a column-oriented database might be well suited to our analytics workloads. We wanted to identify a column-oriented database that was horizontally scalable and fault tolerant, to help us deliver good uptime guarantees, and extremely performant and space efficient, such that it could handle our scale. We quickly realized that ClickHouse could satisfy these criteria, and then some.</p><p><a href="https://clickhouse.yandex/">ClickHouse</a> is an open source column-oriented database management system capable of generating analytical data reports in real time using SQL queries. It is blazing fast, linearly scalable, hardware efficient, fault tolerant, feature rich, highly reliable, simple, and handy. The ClickHouse core developers provide great help with resolving issues and with merging and maintaining our PRs into ClickHouse. For example, engineers from Cloudflare have contributed a whole bunch of code back upstream:</p><ul><li><p>Aggregate function <a href="https://clickhouse.com/docs/en/sql-reference/aggregate-functions/reference/topk">topK</a> by <a href="https://github.com/vavrusa">Marek Vavruša</a></p></li><li><p>IP prefix dictionary by Marek Vavruša</p></li><li><p>SummingMergeTree engine optimizations by Marek Vavruša</p></li><li><p><a href="https://clickhouse.com/docs/en/engines/table-engines/integrations/kafka">Kafka table engine</a> by Marek Vavruša. We're considering replacing our Go Kafka consumers with this engine once it is stable enough, so we can ingest from Kafka into ClickHouse directly.</p></li><li><p>Aggregate function <a href="https://clickhouse.yandex/docs/en/single/index.html#summapkey-value">sumMap</a> by <a href="https://github.com/bocharov">Alex Bocharov</a>. 
Without this function it would be impossible to build our new Zone Analytics API.</p></li><li><p><a href="https://github.com/yandex/ClickHouse/pull/1636">Mark cache fix</a> by Alex Bocharov</p></li><li><p><a href="https://github.com/yandex/ClickHouse/pull/1844">uniqHLL12 function fix</a> for big cardinalities by Alex Bocharov. The description of the issue and its fix makes for interesting reading.</p></li></ul>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6mN4cq6YrpiHjBfbP6vpCh/7181a8e68b4a63cd48e42e0eaf807191/ClickHouse-uniq-functions.png" />
          </figure><p>Along with filing many bug reports, we also report every issue we face in our cluster, which we hope will help to improve ClickHouse in the future.</p><p>Even though DNS analytics on ClickHouse had been a great success, we were still skeptical that we would be able to scale ClickHouse to the needs of the HTTP pipeline:</p><ul><li><p>The Kafka DNS topic averages 1.5M messages per second vs 6M messages per second for the HTTP requests topic.</p></li><li><p>The Kafka DNS topic's average uncompressed message size is 130B vs 1630B for the HTTP requests topic.</p></li><li><p>A DNS query ClickHouse record consists of 40 columns vs 104 columns for an HTTP request ClickHouse record.</p></li></ul><p>After our unsuccessful attempts with Flink, we were skeptical that ClickHouse would be able to keep up with the high ingestion rate. Luckily, an early prototype showed promising performance, and we decided to proceed with replacing the old pipeline. The first step was to design a schema for the new ClickHouse tables.</p>
    <div>
      <h3>ClickHouse schema design</h3>
      <a href="#clickhouse-schema-design">
        
      </a>
    </div>
    <p>Once we identified ClickHouse as a potential candidate, we began exploring how we could port our existing Postgres/Citus schemas to make them compatible with ClickHouse.</p><p>For our <a href="https://api.cloudflare.com/#zone-analytics-dashboard">Zone Analytics API</a> we need to produce many different aggregations for each zone (domain) and time period (minutely / hourly / daily / monthly). For a deeper dive into the specifics of these aggregates, please see the Zone Analytics API documentation or this handy <a href="https://docs.google.com/spreadsheets/d/1zQ3yI3HB2p8hiM-Jwvq1-MaeEyIouix2I-iUAPZtJYw/edit#gid=1788221216">spreadsheet</a>.</p><p>These aggregations should be available for any time range within the last 365 days. While ClickHouse is a really great tool for working with non-aggregated data, at our volume of 6M requests per second we simply cannot yet afford to store non-aggregated data for that long.</p><p>To give you an idea of how much data that is, here is some "napkin-math" capacity planning. I'm going to use an average insertion rate of 6M requests per second and $100 as a cost estimate for 1 TiB to calculate the storage cost for 1 year in different message formats:</p><table><tr><th><p><b>Metric</b></p></th><th><p><b>Cap'n Proto</b></p></th><th><p><b>Cap'n Proto (zstd)</b></p></th><th><p><b>ClickHouse</b></p></th></tr><tr><td><p>Avg message/record size</p></td><td><p>1630 B</p></td><td><p>360 B</p></td><td><p>36.74 B</p></td></tr><tr><td><p>Storage requirements for 1 year</p></td><td><p>273.93 PiB</p></td><td><p>60.5 PiB</p></td><td><p>18.52 PiB (RF x3)</p></td></tr><tr><td><p>Storage cost for 1 year</p></td><td><p>$28M</p></td><td><p>$6.2M</p></td><td><p>$1.9M</p></td></tr></table><p>And that assumes the insertion rate stays the same, when in fact it's growing fast all the time.</p><p>Even though the storage requirements are quite scary, we're still considering storing raw (non-aggregated) request logs in ClickHouse for 1 month or longer. See the "Future of Data APIs" section below.</p>
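<p>The napkin math in the table above can be checked with a few lines of Go, using the average record sizes and the $100/TiB cost estimate from the table (these are estimates, not measured values):</p>

```go
package main

import "fmt"

func main() {
	const (
		reqPerSec  = 6_000_000       // average insertion rate
		secPerYear = 365 * 24 * 3600 // seconds in a year
		tib        = 1 << 40         // bytes in a TiB
		pib        = 1 << 50         // bytes in a PiB
		costPerTiB = 100.0           // rough $ cost of 1 TiB
	)
	recordsPerYear := float64(reqPerSec) * secPerYear

	// Average record size in bytes and replication factor per format.
	formats := []struct {
		name string
		size float64
		rf   float64
	}{
		{"Cap'n Proto", 1630, 1},
		{"Cap'n Proto (zstd)", 360, 1},
		{"ClickHouse (RF x3)", 36.74, 3},
	}
	for _, f := range formats {
		bytes := recordsPerYear * f.size * f.rf
		fmt.Printf("%-20s %7.2f PiB for 1 year, ~$%.1fM\n",
			f.name, bytes/pib, bytes/tib*costPerTiB/1e6)
	}
}
```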
    <div>
      <h4>Non-aggregated requests table</h4>
      <a href="#non-aggregated-requests-table">
        
      </a>
    </div>
    <p>We store over <a href="https://docs.google.com/spreadsheets/d/1zQ3yI3HB2p8hiM-Jwvq1-MaeEyIouix2I-iUAPZtJYw/edit?usp=sharing">100 columns</a>, collecting lots of different kinds of metrics about each request that passes through Cloudflare. Some of these columns are also available in our <a href="https://support.cloudflare.com/hc/en-us/articles/216672448-Enterprise-Log-Share-Logpull-REST-API">Enterprise Log Share</a> product; however, the ClickHouse non-aggregated requests table has more fields.</p><p>With so many columns to store and such huge storage requirements, we decided to proceed with the aggregated-data approach, which had worked well for us in the old pipeline and which would provide backward compatibility.</p>
    <div>
      <h4>Aggregations schema design #1</h4>
      <a href="#aggregations-schema-design-1">
        
      </a>
    </div>
    <p>According to the <a href="https://api.cloudflare.com/#zone-analytics-dashboard">API documentation</a>, we need to provide lots of different request breakdowns, and to satisfy these requirements we decided to test the following approach:</p><ol><li><p>Create ClickHouse <a href="https://clickhouse.com/docs/en/sql-reference/statements/create/view">materialized views</a> with the <a href="https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/aggregatingmergetree">ReplicatedAggregatingMergeTree</a> engine, pointing to the non-aggregated requests table and containing minutely aggregated data for each of the breakdowns:</p><ul><li><p><b>Requests totals</b> - containing numbers like total requests, bytes, threats, uniques, etc.</p></li><li><p><b>Requests by colo</b> - containing requests, bytes, etc. broken down by edgeColoId - each of 120+ Cloudflare datacenters</p></li><li><p><b>Requests by http status</b> - containing a breakdown by HTTP status code, e.g. 200, 404, 500, etc.</p></li><li><p><b>Requests by content type</b> - containing a breakdown by response content type, e.g. HTML, JS, CSS, etc.</p></li><li><p><b>Requests by country</b> - containing a breakdown by client country (based on IP)</p></li><li><p><b>Requests by threat type</b> - containing a breakdown by threat type</p></li><li><p><b>Requests by browser</b> - containing a breakdown by browser family, extracted from the user agent</p></li><li><p><b>Requests by ip class</b> - containing a breakdown by client IP class</p></li></ul></li><li><p>Write code that gathers data from all 8 materialized views, using two approaches:</p><ul><li><p>Querying all 8 materialized views at once using a JOIN</p></li><li><p>Querying each of the 8 materialized views separately, in parallel</p></li></ul></li><li><p>Run a performance testing benchmark against common Zone Analytics API queries</p></li></ol>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3iQvOQSik5xLDR7lkG0X9H/ee505f826947f6844c3e001bb83c7e7b/Schema-design--1-1.jpg" />
          </figure><p>Schema design #1 didn't work out well. ClickHouse's JOIN syntax forced us to write a monstrous query of over 300 lines of SQL, repeating the selected columns many times, because you can only do <a href="https://github.com/yandex/ClickHouse/issues/873">pairwise joins</a> in ClickHouse.</p><p>As for querying each of the materialized views separately in parallel, the benchmark showed promising but moderate results - query throughput would be a little better than with our Citus-based old pipeline.</p>
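<p>The parallel fan-out approach can be sketched with goroutines. Here, <code>queryView</code> is a hypothetical stand-in for a real ClickHouse client call, just to show the merge pattern:</p>

```go
package main

import (
	"fmt"
	"sync"
)

// queryView is a hypothetical stand-in for a ClickHouse query against
// one materialized view; a real version would issue SQL over the wire.
func queryView(view string) map[string]uint64 {
	return map[string]uint64{view + "_requests": 42}
}

func main() {
	views := []string{
		"totals", "colo", "http_status", "content_type",
		"country", "threat_type", "browser", "ip_class",
	}

	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		merged = make(map[string]uint64)
	)
	// Query each materialized view in parallel and merge the results,
	// instead of one monstrous pairwise JOIN.
	for _, v := range views {
		wg.Add(1)
		go func(v string) {
			defer wg.Done()
			res := queryView(v)
			mu.Lock()
			defer mu.Unlock()
			for k, val := range res {
				merged[k] += val
			}
		}(v)
	}
	wg.Wait()
	fmt.Println(len(merged), "metrics merged") // 8 metrics merged
}
```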
    <div>
      <h4>Aggregations schema design #2</h4>
      <a href="#aggregations-schema-design-2">
        
      </a>
    </div>
    <p>In our second iteration of the schema design, we strove to keep a structure similar to our existing Citus tables. To do this, we experimented with the SummingMergeTree engine, which is described in detail in the excellent ClickHouse <a href="https://clickhouse.com/docs/en/engines/table-engines/mergetree-family/summingmergetree">documentation</a>:</p><blockquote><p>In addition, a table can have nested data structures that are processed in a special way. If the name of a nested table ends in 'Map' and it contains at least two columns that meet the following criteria... then this nested table is interpreted as a mapping of key =&gt; (values...), and when merging its rows, the elements of two data sets are merged by 'key' with a summation of the corresponding (values...).</p></blockquote><p>We were pleased to find this feature, because the SummingMergeTree engine allowed us to significantly reduce the number of tables required compared to our initial approach, while letting us match the structure of our existing Citus tables. The reason was that the ClickHouse Nested structure ending in 'Map' is similar to the <a href="https://www.postgresql.org/docs/9.6/static/hstore.html">Postgres hstore</a> data type, which we used extensively in the old pipeline.</p><p>However, there were two issues with ClickHouse maps:</p><ul><li><p>SummingMergeTree aggregates all records with the same primary key, but the final aggregation across all shards needs to be done using some aggregate function, which didn't exist in ClickHouse.</p></li><li><p>For storing uniques (unique visitors based on IP), we need to use the AggregateFunction data type, and although SummingMergeTree allows you to create a column with such a data type, it will not perform aggregation on it for records with the same primary key.</p></li></ul><p>To resolve problem #1, we had to create a new aggregate function, <a href="https://clickhouse.yandex/docs/en/single/index.html#summapkey-value">sumMap</a>. Luckily, the ClickHouse source code is of excellent quality, and its core developers are very helpful in reviewing and merging requested changes.</p><p>As for problem #2, we had to put the uniques into a separate materialized view, which uses the ReplicatedAggregatingMergeTree engine and supports merging AggregateFunction states for records with the same primary key. We're considering adding the same functionality to SummingMergeTree, which would simplify our schema even more.</p><p>We also created a separate materialized view for the Colo endpoint, because it has much lower usage (5% of queries are for the Colo endpoint, 95% for the Zone dashboard), so its more dispersed primary key will not affect the performance of Zone dashboard queries.</p>
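<p>The merge semantics we needed from sumMap can be illustrated with a small Go sketch. This mirrors the behaviour (sum values per key, carry over keys present on only one side), not the ClickHouse implementation:</p>

```go
package main

import "fmt"

// sumMap merges two key => value maps the way the sumMap aggregate
// combines Nested 'Map' columns: values are summed per key, and keys
// present on only one side are carried over as-is.
func sumMap(a, b map[uint16]uint64) map[uint16]uint64 {
	out := make(map[uint16]uint64, len(a)+len(b))
	for k, v := range a {
		out[k] = v
	}
	for k, v := range b {
		out[k] += v
	}
	return out
}

func main() {
	// e.g. per-shard breakdowns of requests by HTTP status code
	shard1 := map[uint16]uint64{200: 900, 404: 20}
	shard2 := map[uint16]uint64{200: 800, 500: 3}
	fmt.Println(sumMap(shard1, shard2)) // map[200:1700 404:20 500:3]
}
```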
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2ZmeMrcqIgAI99UJgfx1HK/80923c6f467d95a1bd8d74e838f91d99/Schema-design--2.jpg" />
          </figure><p>Once the schema design was acceptable, we proceeded to performance testing.</p>
    <div>
      <h3>ClickHouse performance tuning</h3>
      <a href="#clickhouse-performance-tuning">
        
      </a>
    </div>
    <p>We explored a number of avenues for performance improvement in ClickHouse. These included tuning index granularity and improving the merge performance of the SummingMergeTree engine.</p><p>By default, ClickHouse recommends an index granularity of 8192. There is a <a href="https://medium.com/@f1yegor/clickhouse-primary-keys-2cf2a45d7324">nice article</a> explaining ClickHouse primary keys and index granularity in depth.</p><p>While the default index granularity might be an excellent choice for most use cases, in our case we chose the following index granularities:</p><ul><li><p>For the main non-aggregated requests table we chose an index granularity of 16384. For this table, the number of rows read in a query is typically on the order of millions to billions, so a large index granularity does not make a huge difference to query performance.</p></li><li><p>For the aggregated requests_* tables, we chose an index granularity of 32. A low index granularity makes sense when we only need to scan and return a few rows. It made a huge difference in API performance - query latency decreased by 50% and throughput increased by ~3x when we changed the index granularity from 8192 to 32.</p></li></ul><p>Not relevant to performance, but we also disabled the min_execution_speed setting, so that queries scanning just a few rows won't raise an exception because of a "slow" rows-per-second scanning speed.</p><p>On the aggregation/merge side, we've made some ClickHouse optimizations as well, such as <a href="https://github.com/yandex/ClickHouse/pull/1330">increasing SummingMergeTree map merge speed by 7x</a>, which we contributed back to ClickHouse for everyone's benefit.</p><p>Once we had completed the performance tuning for ClickHouse, we could bring it all together into a new data pipeline. Next, we describe the architecture of our new, ClickHouse-based data pipeline.</p>
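<p>Why the low granularity helps can be seen with some napkin math: data is read in whole granules, so a query that only needs a handful of aggregate rows still scans at least one full granule. A simplified sketch (not the exact ClickHouse read algorithm):</p>

```go
package main

import "fmt"

// rowsScanned estimates how many rows must be read to return `match`
// rows when reads happen in whole granules of size g. Simplified
// napkin math: at least one granule is always touched.
func rowsScanned(match, g int) int {
	granules := (match + g - 1) / g // granules touched, rounded up
	return granules * g
}

func main() {
	// An API query for one zone and a short time range may only need
	// a handful of aggregate rows.
	for _, g := range []int{8192, 32} {
		fmt.Printf("granularity %5d: ~%d rows scanned for 10 rows returned\n",
			g, rowsScanned(10, g))
	}
}
```

<p>With granularity 8192, returning 10 rows still costs a scan of ~8192 rows; with granularity 32, only ~32.</p>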
    <div>
      <h3>New data pipeline</h3>
      <a href="#new-data-pipeline">
        
      </a>
    </div>
    <p>The new pipeline architecture re-uses some of the components from the old pipeline, but replaces its weakest components.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2UaoX6fPvFuhN1ecJpbe6v/8aee3bd1ab395cb4a4ac2b8db9575a23/New-system-architecture.jpg" />
          </figure><p>New components include:</p><ul><li><p><b>Kafka consumers </b>- 106 Go consumers, one per partition, consume Cap'n Proto raw logs and extract/prepare the needed 100+ ClickHouse fields. The consumers no longer do any aggregation logic.</p></li><li><p><b>ClickHouse cluster</b> - 36 nodes with an x3 replication factor. It handles ingestion of non-aggregated request logs and then produces aggregates using materialized views.</p></li><li><p><b>Zone Analytics API</b> - a rewritten and optimized version of the API in Go, with many meaningful metrics, healthchecks, and failover scenarios.</p></li></ul><p>As you can see, the architecture of the new pipeline is much simpler and more fault-tolerant. It provides analytics for all of our 7M+ customers' domains, totalling more than 2.5 billion monthly unique visitors and over 1.5 trillion monthly page views.</p><p>On average we process 6M HTTP requests per second, with peaks of up to 8M requests per second.</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/zjubA3CgetykRE8Angxln/7509dd5dea82ad7408339a5f78561b41/HTTP-Logfwdr-throughput.png" />
          </figure><p>The average log message size in <a href="https://capnproto.org/">Cap’n Proto</a> format used to be ~1630B, but thanks to the amazing job our Platform Operations Team did on Kafka compression, it has decreased significantly. Please see the <a href="https://blog.cloudflare.com/squeezing-the-firehose/">"Squeezing the firehose: getting the most from Kafka compression"</a> blog post for a deeper dive into those optimisations.</p>
    <div>
      <h4>Benefits of new pipeline</h4>
      <a href="#benefits-of-new-pipeline">
        
      </a>
    </div>
    <ul><li><p><b>No SPOF</b> - removed all SPOFs and bottlenecks; everything now has at least an x3 replication factor.</p></li><li><p><b>Fault-tolerant</b> - even if a Kafka consumer, ClickHouse node, or Zone Analytics API instance fails, the service is not impacted.</p></li><li><p><b>Scalable</b> - we can add more Kafka brokers or ClickHouse nodes and scale ingestion as we grow. We are less confident about query performance if the cluster grows to hundreds of nodes. However, the Yandex team managed to scale their cluster to 500+ nodes, distributed geographically between several datacenters, using two-level sharding.</p></li><li><p><b>Reduced complexity</b> - by removing the messy crons and consumers that did aggregations and <a href="https://www.cloudflare.com/learning/cloud/how-to-refactor-applications/">refactoring</a> the API code, we were able to:</p><ul><li><p>Shut down the Postgres RollupDB instance and free it up for reuse.</p></li><li><p>Shut down the 12-node Citus cluster and free it up for reuse. As we won't use Citus for a serious workload anymore, we can reduce our operational and support costs.</p></li><li><p>Delete tens of thousands of lines of old Go, SQL, Bash, and PHP code.</p></li><li><p>Remove the WWW PHP API dependency and its extra latency.</p></li></ul></li><li><p><b>Improved API throughput and latency </b>- with the previous pipeline, the Zone Analytics API struggled to serve more than 15 queries per second, so we had to introduce temporary hard rate limits for the largest users. With the new pipeline we were able to remove the hard rate limits, and we now serve ~40 queries per second. We went further and did intensive load testing of the new API: with the current setup and hardware we are able to serve up to ~150 queries per second, and this scales with additional nodes.
</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1TCHFdGAndAank2N3agHQj/5db6811ef0918e8fc66b8543531f9733/Zone-Analytics-API-requests-latency-quantiles.png" />
          </figure><p></p></li><li><p><b>Easier to operate</b> - with the shutdown of many unreliable components, we are finally at the point where it's relatively easy to operate this pipeline. ClickHouse's quality helps us a lot in this matter.</p></li><li><p><b>Fewer incidents</b> - with the new, more reliable pipeline we now have fewer incidents than before, which has ultimately reduced the on-call burden. Finally, we can sleep peacefully at night :-).</p></li></ul><p>Recently, we've improved the throughput and latency of the new pipeline even further with better hardware. I'll provide details about this cluster below.</p>
    <div>
      <h4>Our ClickHouse cluster</h4>
      <a href="#our-clickhouse-cluster">
        
      </a>
    </div>
    <p>In total we have 36 ClickHouse nodes. The new hardware is a big upgrade for us:</p><ul><li><p><b>Chassis</b> - Quanta D51PH-1ULH chassis instead of Quanta D51B-2U chassis (half the physical space)</p></li><li><p><b>CPU</b> - 40 logical cores E5-2630 v3 @ 2.40 GHz instead of 32 cores E5-2630 v4 @ 2.20 GHz</p></li><li><p><b>RAM</b> - 256 GB RAM instead of 128 GB RAM</p></li><li><p><b>Disks</b> - 12 x 10 TB Seagate ST10000NM0016-1TT101 disks instead of 12 x 6 TB Toshiba MG04ACA600E disks</p></li><li><p><b>Network</b> - 2 x 25G Mellanox ConnectX-4 in MC-LAG instead of 2 x 10G Intel 82599ES</p></li></ul><p>Our Platform Operations team noticed that ClickHouse is not yet great at running heterogeneous clusters, so we need to gradually replace all 36 nodes in the existing cluster with the new hardware. The process is fairly straightforward; it's no different from replacing a failed node. The problem is that <a href="https://github.com/yandex/ClickHouse/issues/1821">ClickHouse doesn't throttle recovery</a>.</p><p>Here is more information about our cluster:</p><ul><li><p><b>Avg insertion rate</b> - all our pipelines together insert 11M rows per second.</p></li><li><p><b>Avg insertion bandwidth</b> - 47 Gbps.</p></li><li><p><b>Avg queries per second</b> - on average the cluster serves ~40 queries per second, with frequent peaks of up to ~80 queries per second.</p></li><li><p><b>CPU time</b> - after the recent hardware upgrade and all the optimizations, our cluster's CPU time is quite low.
</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/61sIThxxM4s9mgA8nQSibn/4e09df1a744f0c1e2cd92b8bf5bfdd5f/ClickHouse-CPU-usage.png" />
          </figure><p></p></li><li><p><b>Max disk IO</b> (device time) - it's low as well.
</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1rNrJvGvXd6VNvPRH2qe0l/f7813d61e834b420515047062e25e791/Max-disk-IO.png" />
          </figure><p></p><p></p></li></ul><p>In order to make the switch to the new pipeline as seamless as possible, we performed a transfer of historical data from the old pipeline. Next, I discuss the process of this data transfer.</p>
    <div>
      <h4>Historical data transfer</h4>
      <a href="#historical-data-transfer">
        
      </a>
    </div>
    <p>As we have a 1-year storage requirement, we had to do a one-time ETL (Extract, Transform, Load) from the old Citus cluster into ClickHouse.</p><p>At Cloudflare we love Go and its goroutines, so it was quite straightforward to write a simple ETL job, which:</p><ul><li><p>For each minute/hour/day/month, extracts data from the Citus cluster</p></li><li><p>Transforms the Citus data into ClickHouse format and applies the needed business logic</p></li><li><p>Loads the data into ClickHouse</p></li></ul><p>The whole process took a couple of days, and over 60 billion rows of data were transferred successfully, with consistency checks. The completion of this process finally led to the shutdown of the old pipeline. However, our work does not end there, and we are constantly looking to the future. In the next section, I'll share some details about what we are planning.</p>
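<p>The goroutine fan-out for such a job can be sketched as follows; <code>extract</code>, <code>transform</code>, and <code>load</code> are hypothetical placeholders for the Citus and ClickHouse specifics:</p>

```go
package main

import (
	"fmt"
	"sync"
)

type row struct {
	zone     string
	requests uint64
}

// Hypothetical stages: the real job read from Citus, applied business
// logic, and inserted into ClickHouse.
func extract(period string) []row      { return []row{{"example.com", 100}} }
func transform(rs []row) []row         { return rs }
func load(period string, rs []row) int { return len(rs) }

func main() {
	periods := []string{"minute", "hour", "day", "month"}

	var (
		wg     sync.WaitGroup
		mu     sync.Mutex
		loaded int
	)
	// One goroutine per aggregation period, each running the full
	// extract -> transform -> load cycle independently.
	for _, p := range periods {
		wg.Add(1)
		go func(p string) {
			defer wg.Done()
			n := load(p, transform(extract(p)))
			mu.Lock()
			loaded += n
			mu.Unlock()
		}(p)
	}
	wg.Wait()
	fmt.Println("rows loaded:", loaded) // rows loaded: 4
}
```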
    <div>
      <h3>Future of Data APIs</h3>
      <a href="#future-of-data-apis">
        
      </a>
    </div>
    
    <div>
      <h4>Log Push</h4>
      <a href="#log-push">
        
      </a>
    </div>
    <p>We're currently working on something called "Log Push". Log Push allows you to specify a desired data endpoint and have your HTTP request logs sent there automatically at regular intervals. At the moment, it's in private beta and is going to support sending logs to:</p><ul><li><p>Amazon S3 bucket</p></li><li><p>Google Cloud Service bucket</p></li><li><p>Other storage services and platforms</p></li></ul><p>It's expected to be generally available soon, but if you are interested in this new product and want to try it out, please contact our Customer Support team.</p>
    <div>
      <h4>Logs SQL API</h4>
      <a href="#logs-sql-api">
        
      </a>
    </div>
    <p>We're also evaluating the possibility of building a new product called Logs SQL API. The idea is to give customers access to their logs via a flexible API that supports standard SQL syntax and responses in JSON/CSV/TSV/XML format.</p><p>Queries can extract:</p><ul><li><p><b>Raw request log fields</b> (e.g. SELECT field1, field2, ... FROM requests WHERE ...)</p></li><li><p><b>Aggregated data from request logs</b> (e.g. SELECT clientIPv4, count() FROM requests GROUP BY clientIPv4 ORDER BY count() DESC LIMIT 10)</p></li></ul><p>Google BigQuery provides a similar <a href="https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs/query">SQL API</a>, and Amazon has a product called <a href="https://docs.aws.amazon.com/kinesisanalytics/latest/sqlref/analytics-sql-reference.html">Kinesis Data Analytics</a> with SQL API support as well.</p><p>Another option we're exploring is to provide syntax similar to the <a href="https://api.cloudflare.com/#dns-analytics-properties">DNS Analytics API</a>, with filters and dimensions.</p><p>We're excited to hear your feedback and to learn more about your analytics use cases. That can help us a lot in building new products!</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>All this would not have been possible without hard work across multiple teams! First of all, thanks to the other Data team engineers for their tremendous efforts to make this all happen. The Platform Operations Team made significant contributions to this project, especially Ivan Babrou and Daniel Dao. Contributions from Marek Vavruša on the DNS team were also very helpful.</p><p>Finally, the Data team at Cloudflare is a small team, so if you're interested in building and operating distributed services, you stand to have some great problems to work on. Check out the <a href="https://boards.greenhouse.io/cloudflare/jobs/613800">Distributed Systems Engineer - Data</a> and <a href="https://boards.greenhouse.io/cloudflare/jobs/688056">Data Infrastructure Engineer</a> roles in London, UK and San Francisco, US, and let us know what you think.</p> ]]></content:encoded>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[Data]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[Cap'n Proto]]></category>
            <category><![CDATA[API]]></category>
            <category><![CDATA[php]]></category>
            <category><![CDATA[Load Balancing]]></category>
            <category><![CDATA[NGINX]]></category>
            <guid isPermaLink="false">6VEE3i8wXN2CDKWJJ16uXS</guid>
            <dc:creator>Alex Bocharov</dc:creator>
        </item>
        <item>
            <title><![CDATA[Squeezing the firehose: getting the most from Kafka compression]]></title>
            <link>https://blog.cloudflare.com/squeezing-the-firehose/</link>
            <pubDate>Mon, 05 Mar 2018 16:17:03 GMT</pubDate>
            <description><![CDATA[ How Cloudflare was able to save hundreds of gigabits of network bandwidth and terabytes of storage from Kafka. ]]></description>
            <content:encoded><![CDATA[ <p>We at Cloudflare are long-time <a href="https://kafka.apache.org/">Kafka</a> users; the first mentions of it date back to the beginning of 2014, when the most recent version was 0.8.0. We use Kafka as a log to power analytics (both HTTP and DNS), <a href="https://www.cloudflare.com/learning/ddos/ddos-mitigation/">DDoS mitigation</a>, logging, and metrics.</p><p>While the idea of the log as a unifying abstraction has remained the same since then (<a href="https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying">read this classic blog post</a> from Jay Kreps if you haven't), Kafka has evolved in other areas. One of those improved areas was compression support. Back in the old days we tried enabling it a few times and ultimately gave up on the idea because of <a href="https://github.com/Shopify/sarama/issues/805">unresolved</a> <a href="https://issues.apache.org/jira/browse/KAFKA-1718">issues</a> in the protocol.</p>
    <div>
      <h3>Kafka compression overview</h3>
      <a href="#kafka-compression-overview">
        
      </a>
    </div>
    <p>Just last year, Kafka 0.11.0 came out with a new, improved protocol and log format.</p><p>The naive approach to compression would be to compress each message in the log individually:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4hIZ5PFDUxsm8R48jv6TfL/2d066d744d9e89775c35424db5b9f6d5/Screen-Shot-2018-03-05-at-12.10.00-PM.png" />
            
            </figure><p>Edit: originally we said this is how Kafka worked before 0.11.0, but that appears to be false.</p><p>Compression algorithms work best when they have more data, so in the new log format messages (now called records) are packed back to back and compressed in batches. In the previous log format messages were recursive (a compressed set of messages is itself a message); the new format makes things more straightforward: a compressed batch of records is just a batch.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6WMDU1akGFipMtPWFPXXJB/e0bc3c5bffc3bb215251c4bc33598fda/Screen-Shot-2018-03-05-at-12.10.13-PM.png" />
            
            </figure><p>Now compression has a lot more space to do its job. There's a high chance that records in the same Kafka topic share common parts, which means they can be compressed better. At the scale of thousands of messages, the difference becomes enormous. The downside is that if you want to read record3 in the example above, you have to fetch records 1 and 2 as well, whether the batch is compressed or not. In practice this doesn't matter much, because consumers usually read all records sequentially, batch after batch.</p><p>The beauty of compression in Kafka is that it lets you trade CPU for disk and network usage. The protocol itself is designed to minimize overhead as well, by requiring decompression in only a few places:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/572hmHvuR97iuBVjOqD5SS/a5a10aceffd450e8b6563e94966cc53c/Screen-Shot-2018-03-05-at-12.10.19-PM.png" />
            
            </figure><p>On the receiving side of the log only consumers need to decompress messages:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5Jyhzp3FFqxOEUML1Zw3Cv/85d1d134fd87dc74820da7afe99c090e/Screen-Shot-2018-03-05-at-12.10.25-PM.png" />
            
            </figure><p>In reality, if you don't use encryption, data can be copied between NIC and disks with <a href="https://www.ibm.com/developerworks/linux/library/j-zerocopy/">zero copies to user space</a>, lowering the cost to some degree.</p>
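<p>The batching effect described above is easy to demonstrate outside of Kafka. Here is a small sketch using Go's standard <code>compress/flate</code> as a stand-in for Kafka's codecs (which are not in the Go standard library; the record contents below are made up) that compresses a thousand similar records individually and as one batch:</p>

```go
package main

import (
	"bytes"
	"compress/flate"
	"fmt"
)

// deflateSize returns the DEFLATE-compressed size of data.
func deflateSize(data []byte) int {
	var buf bytes.Buffer
	w, _ := flate.NewWriter(&buf, flate.DefaultCompression)
	w.Write(data)
	w.Close()
	return buf.Len()
}

// sizes compresses 1000 similar records one by one and as a single batch.
func sizes() (individual, batch int) {
	var records [][]byte
	for i := 0; i < 1000; i++ {
		// Hypothetical records: same shape, small variations,
		// much like records within a single Kafka topic.
		records = append(records, []byte(fmt.Sprintf(
			`{"host":"example.com","status":200,"bytes":%d}`, 1000+i)))
	}
	for _, r := range records {
		individual += deflateSize(r)
	}
	return individual, deflateSize(bytes.Join(records, nil))
}

func main() {
	individual, batch := sizes()
	fmt.Printf("individual: %d bytes, batch: %d bytes\n", individual, batch)
}
```

<p>The exact numbers depend on the codec and the data, but the batched blob comes out several times smaller, because the compressor can reference repeated substrings across record boundaries.</p>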
    <div>
      <h3>Kafka bottlenecks at Cloudflare</h3>
      <a href="#kafka-bottlenecks-at-cloudflare">
        
      </a>
    </div>
    <p>Having less network and disk usage was a big selling point for us. Back in 2014 we started with spinning disks under Kafka and never had issues with disk space. However, at some point we started having issues with random IO. Most of the time consumers and replicas (which are just another type of consumer) read from the very tip of the log, and that data resides in the page cache, meaning you don't need to read from disk at all:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/LFhjE9d3QMnMvDRS4mtlJ/3e47d9695821f3374c4ad326cb32cce3/Screen-Shot-2018-03-01-at-13.59.06.png" />
            
            </figure><p>In this case the only time you touch the disk is during writes, and sequential writes are cheap. However, things start to fall apart when you have multiple lagging consumers:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2b3aYKJupVoKGdx3tLmTjq/1b740bb0dda4e74025d8299b9d808abc/Screen-Shot-2018-03-01-at-13.59.29.png" />
            
            </figure><p>Each consumer wants to read a different part of the log from the physical disk, which means seeking back and forth. One lagging consumer was okay to have, but multiple of them would start fighting for disk IO and just increase lag for all of them. To work around this problem we upgraded to SSDs.</p><p>Consumers were no longer fighting for disk time, but it felt terribly wasteful most of the time, when consumers were not lagging and there was zero read IO. We were not bored for too long, as other problems emerged:</p><ul><li><p>Disk space became a problem. SSDs are much more expensive, and usable disk space shrank considerably.</p></li><li><p>As we grew, we started saturating the network. We used 2x10Gbit NICs, and imperfect balance meant that we sometimes saturated network links.</p></li></ul><p>Compression promised to solve both of these problems, so we were eager to try again with improved support from Kafka.</p>
    <div>
      <h3>Performance testing</h3>
      <a href="#performance-testing">
        
      </a>
    </div>
    <p>At Cloudflare, we use Go extensively, which means that a lot of our Kafka consumers and producers are in Go. This means we can't just take the off-the-shelf Java client provided by the Kafka team with every server release and start enjoying the benefits of compression. We had to get support from our Kafka client library first (we use <a href="https://github.com/Shopify/sarama">sarama from Shopify</a>). Luckily, support was added at the end of 2017. With more fixes from our side we were able to get the test setup working.</p><p>Kafka supports 4 compression codecs: <code>none</code>, <code>gzip</code>, <code>lz4</code> and <code>snappy</code>. We had to figure out how these would work for our topics, so we wrote a simple producer that copied data from an existing topic into a destination topic. With four destination topics, one for each compression type, we were able to get the following numbers.</p><p>Each destination topic was getting roughly the same number of messages:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7MOMWEQrQVLcd9DeOpignN/81ee9337ba0266c03770c6684237e476/1.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/48H1RMFi0oltn0zAHDtmzD/4e5dab19dc2d95d53563e331bcf60923/2.png" />
          </figure>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2w92h2jHYHdce63bWv0gJR/2c810303800bbd073569c3f086326c31/3.png" />
          </figure><p>To make it even more obvious, this was the disk usage of these topics:</p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/FQ5TGkcRqPR63TBPiLW85/cc2c905abd0af052b3db6b4df2660b89/4.png" />
          </figure><p>This looked amazing, but this was a rather low-throughput nginx errors topic, containing literal error message strings from nginx. Our main target was the <code>requests</code> HTTP log topic, with <a href="https://capnproto.org/">capnp</a>-encoded messages that are much harder to compress. Naturally, we moved on to try out one <code>partition</code> of the requests topic. The first results were insanely good:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1ziqP8gINUJ6THlURi989m/28ecfd3e722da5fdbe2cb5ab048cb80c/5.png" />
            </figure><p>They were so good because they were lies. With nginx error logs we were pushing under 20Mbps of uncompressed logs; here we jumped 30x to 600Mbps, and compression wasn't able to keep up. Still, as a starting point, this experiment gave us some expectations in terms of compression ratios for the main target.</p><table><tr><td><p><b>Compression</b></p></td><td><p><b>Messages consumed</b></p></td><td><p><b>Disk usage</b></p></td><td><p><b>Average message size</b></p></td></tr><tr><td><p>None</p></td><td><p>30.18M</p></td><td><p>48106MB</p></td><td><p>1594B</p></td></tr><tr><td><p>Gzip</p></td><td><p>3.17M</p></td><td><p>1443MB</p></td><td><p>455B</p></td></tr><tr><td><p>Snappy</p></td><td><p>20.99M</p></td><td><p>14807MB</p></td><td><p>705B</p></td></tr><tr><td><p>LZ4</p></td><td><p>20.93M</p></td><td><p>14731MB</p></td><td><p>703B</p></td></tr></table><p>Gzip sounded too expensive from the beginning (especially in Go), but Snappy should have been able to keep up. We profiled our producer, and it was spending just 2.4% of CPU time in Snappy compression, never saturating a single core:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6X1yjXYICcPKzHxKKdQ76K/73278df1829440a7456f32d574d93f1a/6.png" />
            </figure>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/57E71JTk6DPA6oo8SOLghQ/83edd6f2db81a83db4bd8762117eeb1d/7.png" />
            </figure><p>For Snappy we were able to get the following thread stacktrace from Kafka with <code>jstack</code>:</p>
            <pre><code>"kafka-request-handler-3" #87 daemon prio=5 os_prio=0 tid=0x00007f80d2e97800 nid=0x1194 runnable [0x00007f7ee1adc000]
   java.lang.Thread.State: RUNNABLE
    at org.xerial.snappy.SnappyNative.rawCompress(Native Method)
    at org.xerial.snappy.Snappy.rawCompress(Snappy.java:446)
    at org.xerial.snappy.Snappy.compress(Snappy.java:119)
    at org.xerial.snappy.SnappyOutputStream.compressInput(SnappyOutputStream.java:376)
    at org.xerial.snappy.SnappyOutputStream.write(SnappyOutputStream.java:130)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    - locked &lt;0x00000007a74cc8f0&gt; (a java.io.DataOutputStream)
    at org.apache.kafka.common.utils.Utils.writeTo(Utils.java:861)
    at org.apache.kafka.common.record.DefaultRecord.writeTo(DefaultRecord.java:203)
    at org.apache.kafka.common.record.MemoryRecordsBuilder.appendDefaultRecord(MemoryRecordsBuilder.java:622)
    at org.apache.kafka.common.record.MemoryRecordsBuilder.appendWithOffset(MemoryRecordsBuilder.java:409)
    at org.apache.kafka.common.record.MemoryRecordsBuilder.appendWithOffset(MemoryRecordsBuilder.java:442)
    at org.apache.kafka.common.record.MemoryRecordsBuilder.appendWithOffset(MemoryRecordsBuilder.java:595)
    at kafka.log.LogValidator$.$anonfun$buildRecordsAndAssignOffsets$1(LogValidator.scala:336)
    at kafka.log.LogValidator$.$anonfun$buildRecordsAndAssignOffsets$1$adapted(LogValidator.scala:335)
    at kafka.log.LogValidator$$$Lambda$675/1035377790.apply(Unknown Source)
    at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:52)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at kafka.log.LogValidator$.buildRecordsAndAssignOffsets(LogValidator.scala:335)
    at kafka.log.LogValidator$.validateMessagesAndAssignOffsetsCompressed(LogValidator.scala:288)
    at kafka.log.LogValidator$.validateMessagesAndAssignOffsets(LogValidator.scala:71)
    at kafka.log.Log.liftedTree1$1(Log.scala:654)
    at kafka.log.Log.$anonfun$append$2(Log.scala:642)
    - locked &lt;0x0000000640068e88&gt; (a java.lang.Object)
    at kafka.log.Log$$Lambda$627/239353060.apply(Unknown Source)
    at kafka.log.Log.maybeHandleIOException(Log.scala:1669)
    at kafka.log.Log.append(Log.scala:624)
    at kafka.log.Log.appendAsLeader(Log.scala:597)
    at kafka.cluster.Partition.$anonfun$appendRecordsToLeader$1(Partition.scala:499)
    at kafka.cluster.Partition$$Lambda$625/1001513143.apply(Unknown Source)
    at kafka.utils.CoreUtils$.inLock(CoreUtils.scala:217)
    at kafka.utils.CoreUtils$.inReadLock(CoreUtils.scala:223)
    at kafka.cluster.Partition.appendRecordsToLeader(Partition.scala:487)
    at kafka.server.ReplicaManager.$anonfun$appendToLocalLog$2(ReplicaManager.scala:724)
    at kafka.server.ReplicaManager$$Lambda$624/2052953875.apply(Unknown Source)
    at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:234)
    at scala.collection.TraversableLike$$Lambda$12/187472540.apply(Unknown Source)
    at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:138)
    at scala.collection.mutable.HashMap$$Lambda$25/1864869682.apply(Unknown Source)
    at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:236)
    at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:229)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
    at scala.collection.mutable.HashMap.foreach(HashMap.scala:138)
    at scala.collection.TraversableLike.map(TraversableLike.scala:234)
    at scala.collection.TraversableLike.map$(TraversableLike.scala:227)
    at scala.collection.AbstractTraversable.map(Traversable.scala:104)
    at kafka.server.ReplicaManager.appendToLocalLog(ReplicaManager.scala:708)
    at kafka.server.ReplicaManager.appendRecords(ReplicaManager.scala:459)
    at kafka.server.KafkaApis.handleProduceRequest(KafkaApis.scala:466)
    at kafka.server.KafkaApis.handle(KafkaApis.scala:99)
    at kafka.server.KafkaRequestHandler.run(KafkaRequestHandler.scala:65)
    at java.lang.Thread.run(Thread.java:748)</code></pre>
            <p>This pointed us to <a href="https://github.com/apache/kafka/blob/1.0.0/core/src/main/scala/kafka/log/LogValidator.scala#L70-L71">this piece of code</a> in the Kafka repository.</p><p>There wasn't enough logging to figure out why Kafka was recompressing, but we were able to get this information out with a patched Kafka broker:</p>
            <pre><code>diff --git a/core/src/main/scala/kafka/log/LogValidator.scala b/core/src/main/scala/kafka/log/LogValidator.scala
index 15750e9cd..5197d0885 100644
--- a/core/src/main/scala/kafka/log/LogValidator.scala
+++ b/core/src/main/scala/kafka/log/LogValidator.scala
@@ -21,6 +21,7 @@ import java.nio.ByteBuffer
 import kafka.common.LongRef
 import kafka.message.{CompressionCodec, NoCompressionCodec}
 import kafka.utils.Logging
+import org.apache.log4j.Logger
 import org.apache.kafka.common.errors.{InvalidTimestampException, UnsupportedForMessageFormatException}
 import org.apache.kafka.common.record._
 import org.apache.kafka.common.utils.Time
@@ -236,6 +237,7 @@ private[kafka] object LogValidator extends Logging {
   
       // No in place assignment situation 1 and 2
       var inPlaceAssignment = sourceCodec == targetCodec &amp;&amp; toMagic &gt; RecordBatch.MAGIC_VALUE_V0
+      logger.info("inPlaceAssignment = " + inPlaceAssignment + ", condition: sourceCodec (" + sourceCodec + ") == targetCodec (" + targetCodec + ") &amp;&amp; toMagic (" + toMagic + ") &gt; RecordBatch.MAGIC_VALUE_V0 (" + RecordBatch.MAGIC_VALUE_V0 + ")")
   
       var maxTimestamp = RecordBatch.NO_TIMESTAMP
       val expectedInnerOffset = new LongRef(0)
@@ -250,6 +252,7 @@ private[kafka] object LogValidator extends Logging {
         // Do not compress control records unless they are written compressed
         if (sourceCodec == NoCompressionCodec &amp;&amp; batch.isControlBatch)
           inPlaceAssignment = true
+          logger.info("inPlaceAssignment = " + inPlaceAssignment + ", condition: sourceCodec (" + sourceCodec + ") == NoCompressionCodec (" + NoCompressionCodec + ") &amp;&amp; batch.isControlBatch (" + batch.isControlBatch + ")")
   
         for (record &lt;- batch.asScala) {
           validateRecord(batch, record, now, timestampType, timestampDiffMaxMs, compactedTopic)
@@ -261,21 +264,26 @@ private[kafka] object LogValidator extends Logging {
           if (batch.magic &gt; RecordBatch.MAGIC_VALUE_V0 &amp;&amp; toMagic &gt; RecordBatch.MAGIC_VALUE_V0) {
             // Check if we need to overwrite offset
             // No in place assignment situation 3
-            if (record.offset != expectedInnerOffset.getAndIncrement())
+            val off = expectedInnerOffset.getAndIncrement()
+            if (record.offset != off)
               inPlaceAssignment = false
+              logger.info("inPlaceAssignment = " + inPlaceAssignment + ", condition: record.offset (" + record.offset + ") != expectedInnerOffset.getAndIncrement() (" + off + ")")
             if (record.timestamp &gt; maxTimestamp)
               maxTimestamp = record.timestamp
           }
   
           // No in place assignment situation 4
-          if (!record.hasMagic(toMagic))
+          if (!record.hasMagic(toMagic)) {
+            logger.info("inPlaceAssignment = " + inPlaceAssignment + ", condition: !record.hasMagic(toMagic) (" + !record.hasMagic(toMagic) + ")")
             inPlaceAssignment = false
+          }
   
           validatedRecords += record
         }
       }
   
       if (!inPlaceAssignment) {
+        logger.info("inPlaceAssignment = " + inPlaceAssignment + "; recompressing")
         val (producerId, producerEpoch, sequence, isTransactional) = {
           // note that we only reassign offsets for requests coming straight from a producer. For records with magic V2,
           // there should be exactly one RecordBatch per request, so the following is all we need to do. For Records</code></pre>
            <p>And the output was:</p>
            <pre><code>Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = true, condition: sourceCodec (SnappyCompressionCodec) == targetCodec (SnappyCompressionCodec) &amp;&amp; toMagic (2) &gt; RecordBatch.MAGIC_VALUE_V0 (0) (kafka.log.LogValidator$)
Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = true, condition: sourceCodec (SnappyCompressionCodec) == NoCompressionCodec (NoCompressionCodec) &amp;&amp; batch.isControlBatch (false) (kafka.log.LogValidator$)
Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = true, condition: record.offset (0) != expectedInnerOffset.getAndIncrement() (0) (kafka.log.LogValidator$)
Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = false, condition: record.offset (0) != expectedInnerOffset.getAndIncrement() (1) (kafka.log.LogValidator$)
Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = false, condition: record.offset (0) != expectedInnerOffset.getAndIncrement() (2) (kafka.log.LogValidator$)
Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = false, condition: record.offset (0) != expectedInnerOffset.getAndIncrement() (3) (kafka.log.LogValidator$)
Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = false, condition: record.offset (0) != expectedInnerOffset.getAndIncrement() (4) (kafka.log.LogValidator$)
Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = false, condition: record.offset (0) != expectedInnerOffset.getAndIncrement() (5) (kafka.log.LogValidator$)
Dec 29 23:18:59 mybroker kafka[33461]: INFO inPlaceAssignment = false, condition: record.offset (0) != expectedInnerOffset.getAndIncrement() (6) (kafka.log.LogValidator$)</code></pre>
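<p>The log output made the bug obvious: every record inside our compressed batches carried an inner offset of 0, while the broker expected inner offsets to increment 0, 1, 2, and so on; anything else forces the broker to decompress, rewrite offsets, and recompress the whole batch. A Go sketch of that broker-side check (the <code>Record</code> type and function name here are hypothetical, mirroring the LogValidator logic above):</p>

```go
package main

import "fmt"

// Record is a minimal stand-in for a record inside a compressed batch;
// OffsetDelta is the record's offset relative to the batch base offset.
type Record struct {
	OffsetDelta int64
}

// canKeepBatchInPlace mirrors the broker-side check from LogValidator:
// inner offsets must increment 0, 1, 2, ... or the broker has to
// decompress the batch, reassign offsets, and recompress it.
func canKeepBatchInPlace(batch []Record) bool {
	for expected, r := range batch {
		if r.OffsetDelta != int64(expected) {
			return false
		}
	}
	return true
}

func main() {
	// Our buggy producer set every inner offset to 0 (as seen in the
	// broker logs above), forcing recompression on every produce request:
	buggy := []Record{{0}, {0}, {0}}
	fixed := []Record{{0}, {1}, {2}}
	fmt.Println(canKeepBatchInPlace(buggy), canKeepBatchInPlace(fixed))
}
```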
            <p>We promptly <a href="https://github.com/Shopify/sarama/pull/1015">fixed the issue</a> and resumed the testing. These were the results:</p><table><tr><td><p><b>Compression</b></p></td><td><p><b>User time</b></p></td><td><p><b>Messages</b></p></td><td><p><b>Time per 1m</b></p></td><td><p><b>CPU ratio</b></p></td><td><p><b>Disk usage</b></p></td><td><p><b>Avg. message size</b></p></td><td><p><b>Compression ratio</b></p></td></tr><tr><td><p>None</p></td><td><p>209.67s</p></td><td><p>26.00M</p></td><td><p>8.06s</p></td><td><p>1x</p></td><td><p>41448MB</p></td><td><p>1594B</p></td><td><p>1x</p></td></tr><tr><td><p>Gzip</p></td><td><p>570.56s</p></td><td><p>6.98M</p></td><td><p>81.74s</p></td><td><p>10.14x</p></td><td><p>3111MB</p></td><td><p>445B</p></td><td><p>3.58x</p></td></tr><tr><td><p>Snappy</p></td><td><p>337.55s</p></td><td><p>26.02M</p></td><td><p>12.97s</p></td><td><p>1.61x</p></td><td><p>17675MB</p></td><td><p>679B</p></td><td><p>2.35x</p></td></tr><tr><td><p>LZ4</p></td><td><p>525.82s</p></td><td><p>26.01M</p></td><td><p>20.22s</p></td><td><p>2.51x</p></td><td><p>22922MB</p></td><td><p>881B</p></td><td><p>1.81x</p></td></tr></table><p>Now we were able to keep up with both Snappy and LZ4. Gzip was still out of the question and LZ4 had incompatibility issues between Kafka versions and our Go client, which left us with Snappy. This was a winner in terms of compression ratio and speed too, so we were not very disappointed by the lack of choice.</p>
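<p>For reference, once the client library supports it, enabling compression from a Go producer is a one-line configuration change. A minimal sarama setup along these lines (the broker address and topic are placeholders, and this is a sketch against the sarama API of that era, not our actual producer):</p>

```go
package main

import (
	"log"
	"time"

	"github.com/Shopify/sarama"
)

func main() {
	config := sarama.NewConfig()
	// The codec is applied per batch on the producer side.
	config.Producer.Compression = sarama.CompressionSnappy
	// Batch for up to a second so the compressor sees more data.
	config.Producer.Flush.Frequency = time.Second
	config.Producer.Return.Successes = true

	producer, err := sarama.NewSyncProducer([]string{"broker:9092"}, config)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	_, _, err = producer.SendMessage(&sarama.ProducerMessage{
		Topic: "requests",
		Value: sarama.StringEncoder("hello"),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

<p>The broker then stores and replicates the compressed batches as-is, as long as the recompression pitfalls described above are avoided.</p>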
    <div>
      <h3>Deploying into production</h3>
      <a href="#deploying-into-production">
        
      </a>
    </div>
    <p>In production, we started small with Java-based consumers and producers. Our first production topic was just 1Mbps and 600rps of nginx error logs. Messages there were very repetitive, and we were able to get a whopping 8x decrease in size by batching records for just 1 second across 2 partitions.</p><p>This gave us some confidence to move on to the next topic, with <code>journald</code> logs encoded as JSON. Here we were able to reduce ingress from 300Mbps to just 50Mbps (the yellow line on the graph):</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2MBwqqBM9fKfK5etewI3K8/a6c3e5154769721ace34e35448e0e9d8/8.png" />
            </figure>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1VRIDnquD9cRDxx9EpNliX/abed839321dde455854bea1c93a0fd76/10.png" />
            </figure>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1S1GiVnOPIEaZi20vDdu7B/4e0b4a064af5e576f615b18618e22a14/11.png" />
            </figure>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6F0xiuQrUHNiU2Q0VgaYtC/2176de2b39c4ec3493f06dae8a995ff2/12.png" />
            </figure>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/lvIkKRy1yrsTJNe4n4DEP/d2da438e3fbe1d9f40eb2e1347b684a7/13.png" />
            </figure><p>With all major topics in the DNS cluster switched to Snappy, we saw an even better picture in terms of broker CPU usage:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7ERRHln9aNu0pSoKSFFQkA/197e14bebbe7642b690f23363a5cdf1e/14.png" />
            </figure>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7cfmeyd2Jch2ZfTquu14rW/63fd83ad46a6484dd8b7ebe446b21e09/15.png" />
            </figure><p>On the next graph you can see Kafka CPU usage as the purple line and producer CPU usage as the green line:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5QWGl1NhwkYWMHl5Q5Lius/a4085495612071ae2a16a8aa9a675a05/16.png" />
            </figure><p>CPU usage of the producer did not go up substantially, which means most of the work is spent in tasks unrelated to compression. Consumers did not see any increase in CPU usage either, which means we got our 2.6x decrease in size practically for free.</p><p>It was time to hunt the biggest beast of all: the <code>requests</code> topic with HTTP access logs. There we were doing up to 100Gbps and 7.5Mrps of ingress at peak (a lot more when big attacks are happening, but this was a quiet week):</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/695gJ4I1AIqDHKIH5phRKh/36940318a0b87aa698b4e088c60fc0c2/17.png" />
            </figure><p>With many smaller topics switched to Snappy already, we did not need to do anything special here. This is how it went:</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2b39JEnQP0w68PhPDkGjOz/b4f74fb90ed73a44aa8b05fca03a9de3/18.png" />
            </figure>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7KvLGzI0Rnk44SXDJb6JXT/cdb6ee681318809a65b8ef8a61229589/19.png" />
            </figure><p>That's a 2.25x decrease in ingress bandwidth and average message size. We have multiple replicas and consumers, which means egress is a multiple of ingress. We were able to cut hundreds of gigabits of internal in-DC traffic and save terabytes of flash storage. With network and disks being the bottlenecks, this meant we'd need less than half of the hardware we had. Kafka was one of the main hardware hogs in this datacenter, so this was a large-scale win.</p><p>Yet, 2.25x seemed a bit on the low side.</p>
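<p>Even a "low" 2.25x compounds, because every byte of ingress is re-sent to followers and read by every consumer group. A back-of-the-envelope sketch (the replication factor and consumer count here are illustrative assumptions, not our real topology):</p>

```go
package main

import "fmt"

// totalTraffic returns the total Kafka traffic (ingress plus egress) in Gbps
// before and after compression, for a hypothetical topology.
func totalTraffic(ingress, ratio float64, replicas, consumers int) (before, after float64) {
	// Every byte produced is re-sent to replicas-1 followers and read by
	// every consumer group, so egress is a multiple of ingress.
	multiplier := float64(replicas - 1 + consumers)
	before = ingress * (1 + multiplier)
	return before, before / ratio
}

func main() {
	// 100 Gbps of peak ingress, 2.25x compression, assumed replication
	// factor of 3 and 2 consumer groups.
	before, after := totalTraffic(100, 2.25, 3, 2)
	fmt.Printf("total traffic: %.0f -> %.0f Gbps, saving %.0f Gbps\n",
		before, after, before-after)
}
```

<p>Under these assumed numbers, the savings land in the hundreds of gigabits, which is why an ingress-side ratio pays off several times over.</p>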
    <div>
      <h3>Looking for more</h3>
      <a href="#looking-for-more">
        
      </a>
    </div>
    <p>We wanted to see if we could do better. To do that, we extracted one batch of records from Kafka and ran some benchmarks on it. Our batches are around 1MB uncompressed, with 600 records in each on average.</p><p>To run the benchmarks we used <a href="https://github.com/inikep/lzbench">lzbench</a>, which runs lots of different compression algorithms and provides a summary. Here's what we saw, with results sorted by compression ratio (a heavily filtered list):</p>
            <pre><code>lzbench 1.7.3 (64-bit MacOS)   Assembled by P.Skibinski
Compressor name         Compress. Decompress. Compr. size  Ratio Filename
memcpy                  33587 MB/s 33595 MB/s      984156 100.00
...
lz4 1.8.0                 594 MB/s  2428 MB/s      400577  40.70
...
snappy 1.1.4              446 MB/s  1344 MB/s      425564  43.24
...
zstd 1.3.3 -1             409 MB/s   844 MB/s      259438  26.36
zstd 1.3.3 -2             303 MB/s   889 MB/s      244650  24.86
zstd 1.3.3 -3             242 MB/s   899 MB/s      232057  23.58
zstd 1.3.3 -4             240 MB/s   910 MB/s      230936  23.47
zstd 1.3.3 -5             154 MB/s   891 MB/s      226798  23.04</code></pre>
            <p>This looked too good to be true. <a href="https://facebook.github.io/zstd/">Zstandard</a> is a fairly new (released 1.5 years ago) compression algorithm from Facebook. In benchmarks on the project's home page you can see this:</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4RmnIQ4k2nqI5jabUBQLMq/035633fbc37b15c7fb80d2ed72de01f4/zstd.png" />
            
            </figure><p>In our case we were getting this:</p><table><tr><td><p><b>Compressor name</b></p></td><td><p><b>Ratio</b></p></td><td><p><b>Compression</b></p></td><td><p><b>Decompression</b></p></td></tr><tr><td><p>zstd</p></td><td><p>3.794</p></td><td><p>409 MB/s</p></td><td><p>844 MB/s</p></td></tr><tr><td><p>lz4</p></td><td><p>2.475</p></td><td><p>594 MB/s</p></td><td><p>2428 MB/s</p></td></tr><tr><td><p>snappy</p></td><td><p>2.313</p></td><td><p>446 MB/s</p></td><td><p>1344 MB/s</p></td></tr></table><p>Clearly, results are very dependent on the kind of data you are trying to compress. For our data zstd was giving amazing results even on the lowest compression level. The compression ratio was better even than gzip's at its maximum level, while throughput was a lot higher. For posterity, this is how DNS logs compressed (HTTP logs compressed similarly):</p>
            <pre><code>$ ./lzbench -ezstd/zlib rrdns.recordbatch
lzbench 1.7.3 (64-bit MacOS)   Assembled by P.Skibinski
Compressor name         Compress. Decompress. Compr. size  Ratio Filename
memcpy                  33235 MB/s 33502 MB/s      927048 100.00 rrdns.recordbatch
zstd 1.3.3 -1             430 MB/s   909 MB/s      226298  24.41 rrdns.recordbatch
zstd 1.3.3 -2             322 MB/s   878 MB/s      227271  24.52 rrdns.recordbatch
zstd 1.3.3 -3             255 MB/s   883 MB/s      217730  23.49 rrdns.recordbatch
zstd 1.3.3 -4             253 MB/s   883 MB/s      217141  23.42 rrdns.recordbatch
zstd 1.3.3 -5             169 MB/s   869 MB/s      216119  23.31 rrdns.recordbatch
zstd 1.3.3 -6             102 MB/s   939 MB/s      211092  22.77 rrdns.recordbatch
zstd 1.3.3 -7              78 MB/s   968 MB/s      208710  22.51 rrdns.recordbatch
zstd 1.3.3 -8              65 MB/s  1005 MB/s      204370  22.05 rrdns.recordbatch
zstd 1.3.3 -9              59 MB/s  1008 MB/s      204071  22.01 rrdns.recordbatch
zstd 1.3.3 -10             44 MB/s  1029 MB/s      202587  21.85 rrdns.recordbatch
zstd 1.3.3 -11             43 MB/s  1054 MB/s      202447  21.84 rrdns.recordbatch
zstd 1.3.3 -12             32 MB/s  1051 MB/s      201190  21.70 rrdns.recordbatch
zstd 1.3.3 -13             31 MB/s  1050 MB/s      201190  21.70 rrdns.recordbatch
zstd 1.3.3 -14             13 MB/s  1074 MB/s      200228  21.60 rrdns.recordbatch
zstd 1.3.3 -15           8.15 MB/s  1171 MB/s      197114  21.26 rrdns.recordbatch
zstd 1.3.3 -16           5.96 MB/s  1051 MB/s      190683  20.57 rrdns.recordbatch
zstd 1.3.3 -17           5.64 MB/s  1057 MB/s      191227  20.63 rrdns.recordbatch
zstd 1.3.3 -18           4.45 MB/s  1166 MB/s      187967  20.28 rrdns.recordbatch
zstd 1.3.3 -19           4.40 MB/s  1108 MB/s      186770  20.15 rrdns.recordbatch
zstd 1.3.3 -20           3.19 MB/s  1124 MB/s      186721  20.14 rrdns.recordbatch
zstd 1.3.3 -21           3.06 MB/s  1125 MB/s      186710  20.14 rrdns.recordbatch
zstd 1.3.3 -22           3.01 MB/s  1125 MB/s      186710  20.14 rrdns.recordbatch
zlib 1.2.11 -1             97 MB/s   301 MB/s      305992  33.01 rrdns.recordbatch
zlib 1.2.11 -2             93 MB/s   327 MB/s      284784  30.72 rrdns.recordbatch
zlib 1.2.11 -3             74 MB/s   364 MB/s      265415  28.63 rrdns.recordbatch
zlib 1.2.11 -4             68 MB/s   342 MB/s      269831  29.11 rrdns.recordbatch
zlib 1.2.11 -5             48 MB/s   367 MB/s      258558  27.89 rrdns.recordbatch
zlib 1.2.11 -6             32 MB/s   376 MB/s      247560  26.70 rrdns.recordbatch
zlib 1.2.11 -7             24 MB/s   409 MB/s      244623  26.39 rrdns.recordbatch
zlib 1.2.11 -8           9.67 MB/s   429 MB/s      239659  25.85 rrdns.recordbatch
zlib 1.2.11 -9           3.63 MB/s   446 MB/s      235604  25.41 rrdns.recordbatch</code></pre>
            <p>For our purposes we picked level 6 as the compromise between compression ratio and CPU cost. It is possible to be even more aggressive, as real-world usage later proved.</p><p>One great property of zstd is that decompression speed is more or less the same across levels, which means there is only one knob connecting the CPU cost of compression to the compression ratio.</p><p>Armed with this knowledge, we dug up a <a href="https://issues.apache.org/jira/browse/KAFKA-4514">forgotten Kafka ticket</a> to add zstd, along with a <a href="https://cwiki.apache.org/confluence/display/KAFKA/KIP-110%3A+Add+Codec+for+ZStandard+Compression">KIP</a> (Kafka Improvement Proposal) and even a <a href="https://github.com/apache/kafka/pull/2267">PR on GitHub</a>. Sadly, these did not get traction back in the day, but this work saved us a lot of time.</p><p>We <a href="https://github.com/bobrik/kafka/commit/8b17836efda64dba1ebdc080e30ee2945793aef3">ported</a> the patch to the Kafka 1.0.0 release and pushed it to production. After another round of smaller-scale testing and with <a href="https://github.com/bobrik/sarama/commit/c36187fbafab5afe5c152d2012b05b9306196cdb">patched</a> clients, we pushed Zstd into production for the requests topic.</p><p>The graphs below include the switch from no compression (before 2/9) to Snappy (2/9 to 2/17) to Zstandard (after 2/17):</p>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/466acpqo8HHXc9mml41SGp/178fefc55aa2e9d29a484169ee47c0ed/20.png" />
            </figure>
            <figure>
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6HSguckI7mJln65401u5Y2/cd9e39e86898c214637e0cb0c865ae02/21.png" />
            </figure><p>The decrease in size was <b>4.5x</b> compared to no compression at all. On next-generation hardware with 2.4x more storage and 2.5x higher network throughput, we suddenly made our bottleneck more than 10x wider and shifted it from storage and network to CPU cost. We even got to cancel a pending hardware order for Kafka expansion because of this.</p>
    <div>
      <h3>Conclusion</h3>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>Zstandard is a great modern compression algorithm, promising a high compression ratio and high throughput, tunable in small increments. Whenever you consider using compression, you should check out zstd. If you haven't considered compression, it's worth seeing if you can get benefits from it. Run benchmarks with your own data in either case.</p><p>Testing in a real-world scenario showed how benchmarks, even those coming from zstd itself, can be misleading. Going beyond the codecs built into Kafka allowed us to improve our compression ratio 2x at very low cost.</p><p>We hope that the data we gathered can be a catalyst for making Zstandard an official compression codec in Kafka to benefit other people. There are 3 bits allocated for the codec type and only 2 of them are used so far, which means there are 4 more vacant slots.</p><p>If you were skeptical of compression benefits in Kafka because of old flaws in the Kafka protocol, this may be the time to reconsider.</p><p>If you enjoy benchmarking, profiling and optimizing large scale services, come <a href="https://www.cloudflare.com/careers/">join us</a>.</p> ]]></content:encoded>
            <category><![CDATA[Compression]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <category><![CDATA[Kafka]]></category>
            <guid isPermaLink="false">50nksAmOM8KO1t9ihhnJxe</guid>
            <dc:creator>Ivan Babrou</dc:creator>
        </item>
        <item>
            <title><![CDATA[Scaling out PostgreSQL for CloudFlare Analytics using CitusDB]]></title>
            <link>https://blog.cloudflare.com/scaling-out-postgresql-for-cloudflare-analytics-using-citusdb/</link>
            <pubDate>Thu, 09 Apr 2015 17:32:05 GMT</pubDate>
            <description><![CDATA[ When I joined CloudFlare about 18 months ago, we had just started to build out our new Data Platform. At that point, the log processing and analytics pipeline built in the early days of the company had reached its limits.  ]]></description>
            <content:encoded><![CDATA[ <p>When I joined CloudFlare about 18 months ago, we had just started to build out our new Data Platform. At that point, the log processing and analytics pipeline built in the early days of the company had reached its limits. This was due to the rapidly increasing log volume from our Edge Platform, where we’ve had to deal with traffic growth in excess of 400% annually.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4AxkHPDBZrwj6QJQVWuQcX/fec02af530de1ab2f8a1f516ece59057/keepcalm_scaled.png" />
            
            </figure><p>Our log processing pipeline started out like most everybody else’s: compressed log files shipped to a central location for aggregation by a motley collection of Perl scripts and C++ programs with a single PostgreSQL instance to store the aggregated data. Since then, CloudFlare has grown to serve millions of requests per second for millions of sites. Apart from the hundreds of terabytes of log data that has to be aggregated every day, we also face some unique challenges in providing detailed analytics for each of the millions of sites on CloudFlare.</p><p>For the next iteration of our Customer Analytics application, we wanted to get something up and running quickly, try out Kafka, write the aggregation application in Go, and see what could be done to scale out our trusty go-to database, PostgreSQL, from a single machine to a cluster of servers without requiring us to deal with sharding in the application.</p><p>As we were analyzing our scaling requirements for PostgreSQL, we came across <a href="https://www.citusdata.com/">Citus Data</a>, one of the companies to launch out of <a href="https://www.ycombinator.com/">Y Combinator</a> in the summer of 2011. Citus Data builds a database called CitusDB that scales out PostgreSQL for real-time workloads. Because CitusDB enables both real-time data ingest and sub-second queries across billions of rows, it has become a crucial part of our analytics infrastructure.</p>
    <div>
      <h4>Log Processing Pipeline for Analytics</h4>
      <a href="#log-processing-pipeline-for-analytics">
        
      </a>
    </div>
    <p>Before jumping into the details of our database backend, let’s review the pipeline that takes a log event from CloudFlare’s Edge to our analytics database.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4I3yKJFMKyL4M3gtS2SlTy/b34a4a58d3da74e788950e6af1699582/image01.png" />
            
            </figure><p>An HTTP access log event proceeds through the CloudFlare data pipeline as follows:</p><ol><li><p>A web browser makes a request (e.g., an HTTP GET request).</p></li><li><p>An Nginx web server running <a href="/pushing-nginx-to-its-limit-with-lua/">Lua code</a> handles the request and generates a binary log event in <a href="https://capnproto.org">Cap’n Proto format</a>.</p></li><li><p>A Go program akin to <a href="https://github.com/mozilla-services/heka">Heka</a> receives the log event from Nginx over a UNIX socket, batches it with other events, compresses the batch using a fast algorithm like <a href="https://github.com/google/snappy">Snappy</a> or <a href="https://github.com/Cyan4973/lz4">LZ4</a>, and sends it to our data center over a TLS-encrypted TCP connection.</p></li><li><p>Another Go program (the Kafka shim) receives the log event stream, decrypts it, decompresses the batches, and produces the events into a Kafka topic with partitions replicated on many servers.</p></li><li><p>Go aggregators (one process per partition) consume the topic-partitions and insert aggregates (not individual events) with 1-minute granularity into the CitusDB database. Further rollups to 1-hour and 1-day granularity occur later to reduce the amount of data to be queried and to speed up queries over intervals spanning many hours or days.</p></li></ol>
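<p>Steps 3 and 4 above hinge on batching: compressing many events together is what makes fast codecs pay off. A toy sketch of that framing-and-batching idea in Python (gzip stands in for Snappy/LZ4, and the length-prefix framing is a hypothetical simplification, not CloudFlare's actual wire format):</p>

```python
import gzip
import struct

def pack_batch(events):
    """Frame each event with a 4-byte length prefix, then compress the
    whole batch -- batching amortizes compression overhead across many
    small events, as the Go shipper does before sending over TLS."""
    framed = b"".join(struct.pack(">I", len(e)) + e for e in events)
    return gzip.compress(framed)

def unpack_batch(blob):
    """Decompress and walk the length-prefixed frames back into events,
    as the Kafka shim would before producing them to a topic."""
    framed = gzip.decompress(blob)
    events, offset = [], 0
    while offset < len(framed):
        (size,) = struct.unpack_from(">I", framed, offset)
        offset += 4
        events.append(framed[offset:offset + size])
        offset += size
    return events
```

<p>In the real pipeline the compressed batch then travels over a TLS-encrypted TCP connection and is produced into a replicated Kafka topic-partition.</p>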
    <div>
      <h4>Why Go?</h4>
      <a href="#why-go">
        
      </a>
    </div>
    <p>Previous blog <a href="/what-weve-been-doing-with-go/">posts</a> and <a href="https://www.youtube.com/watch?v=8igk2ylk_X4">talks</a> have covered <a href="/go-at-cloudflare/">various CloudFlare projects that have been built using Go</a>. We’ve found that Go is a great language for teams to use when building the kinds of distributed systems needed at CloudFlare, regardless of an engineer’s level of experience with Go. Our Customer Analytics team is made up of engineers who have been using Go since before its 1.0 release as well as complete Go newbies. Team members who were new to Go were able to spin up quickly, and the code base has remained maintainable even as we’ve continued to build many more data processing and aggregation applications, such as a new version of <a href="https://www.hakkalabs.co/articles/optimizing-go-3k-requestssec-480k-requestssec">our Layer 7 DDoS attack mitigation system</a>.</p><p>Another factor that makes Go great is the ever-expanding ecosystem of third-party libraries. We used <a href="https://github.com/glycerine/go-capnproto">go-capnproto</a> to generate Go code to handle binary log events in Cap’n Proto format from a common schema shared between Go, C++, and <a href="/introducing-lua-capnproto-better-serialization-in-lua/">Lua projects</a>. Go’s support for Kafka (via <a href="https://godoc.org/github.com/Shopify/sarama">Shopify’s Sarama</a> library), for ZooKeeper (via <a href="https://github.com/samuel/go-zookeeper">go-zookeeper</a>), and for PostgreSQL/CitusDB (through <a href="http://golang.org/pkg/database/sql/">database/sql</a> and the <a href="https://github.com/lib/pq">lib/pq driver</a>) is all very good.</p>
    <div>
      <h4>Why Kafka?</h4>
      <a href="#why-kafka">
        
      </a>
    </div>
    <p>As we started building our new data processing applications in Go, we had some additional requirements for the pipeline:</p><ol><li><p>Use a queue with persistence to allow short periods of downtime for downstream servers and/or consumer services.</p></li><li><p>Make the data available for processing in real time by <a href="https://github.com/mumrah/kafka-python">scripts</a> written by members of our Site Reliability Engineering team.</p></li><li><p>Allow future aggregators to be built in other languages like Java, <a href="https://github.com/edenhill/librdkafka">C or C++</a>.</p></li></ol><p>After extensive testing, we selected <a href="https://kafka.apache.org/">Kafka</a> as the first stage of the log processing pipeline.</p>
    <div>
      <h4>Why Postgres?</h4>
      <a href="#why-postgres">
        
      </a>
    </div>
    <p>As we mentioned when <a href="http://www.postgresql.org/about/press/presskit93/">PostgreSQL 9.3 was released</a>, PostgreSQL has long been an important part of our stack, and for good reason.</p><p>Foreign data wrappers and other extension mechanisms make PostgreSQL an excellent platform for storing lots of data, or as a gateway to other NoSQL data stores, without having to give up the power of SQL. PostgreSQL also has great performance and documentation. Lastly, PostgreSQL has a large and active community, and we've had the privilege of meeting many of the PostgreSQL contributors at meetups held at the CloudFlare office and elsewhere, organized by <a href="http://www.meetup.com/postgresql-1/">The San Francisco Bay Area PostgreSQL Meetup Group</a>.</p>
    <div>
      <h4>Why CitusDB?</h4>
      <a href="#why-citusdb">
        
      </a>
    </div>
    <p>CloudFlare has been using PostgreSQL since day one. We trust it, and we wanted to keep using it. However, CloudFlare's data has been growing rapidly, and we were running into the limitations of a single PostgreSQL instance. Our team was tasked with scaling out our analytics database in a short time, so we started by defining the criteria that were important to us:</p><ol><li><p><b>Performance</b>: Our system powers the Customer Analytics dashboard, so typical queries need to return in less than a second even when dealing with data from many customer sites over long time periods.</p></li><li><p><b>PostgreSQL</b>: We have extensive experience running PostgreSQL in production. We also find several extensions useful, e.g., Hstore enables us to store semi-structured data and HyperLogLog (HLL) makes unique count approximation queries fast.</p></li><li><p><b>Scaling</b>: We need to dynamically scale out our cluster, both for performance and to store huge amounts of data. That is, if we realize that our cluster is becoming overutilized, we want to solve the problem by just adding new machines.</p></li><li><p><b>High availability</b>: This cluster needs to be highly available. As such, the cluster needs to automatically recover from failures like disks dying or servers going down.</p></li><li><p><b>Business intelligence queries</b>: In addition to sub-second responses for customer queries, we need to be able to perform business intelligence queries that may need to analyze billions of rows of analytics data.</p></li></ol><p>At first, we evaluated what it would take to build an application that deals with sharding on top of stock PostgreSQL. 
We investigated using the <a href="http://www.postgresql.org/docs/9.4/static/postgres-fdw.html">postgres_fdw</a> extension to provide a unified view on top of a number of independent PostgreSQL servers, but this solution did not deal well with servers going down.</p><p>Research into the major players in the PostgreSQL space indicated that CitusDB had the potential to be a great fit for us. On the performance point, they already had customers running real-time analytics, with queries executing in parallel across a large cluster in tens of milliseconds.</p><p>CitusDB has also maintained compatibility with PostgreSQL, not by forking the code base like other vendors, but by extending it to plan and execute distributed queries. Furthermore, CitusDB uses the concept of many logical shards, so that if we were to add new machines to our cluster, we could easily rebalance the shards in the cluster by calling a simple PostgreSQL user-defined function.</p><p>With CitusDB, we could replicate logical shards to independent machines in the cluster and automatically fail over between replicas, even during queries. In case of a hardware failure, we could also use the rebalance function to re-replicate shards in the cluster.</p>
    <div>
      <h4>CitusDB Architecture</h4>
      <a href="#citusdb-architecture">
        
      </a>
    </div>
    
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/74Tl2hP3Tfk1HNzpF1MV4i/121b31b62700289180653b647a653edb/image00.png" />
            
            </figure><p>CitusDB follows an architecture similar to Hadoop to scale out Postgres: one primary node holds authoritative metadata about shards in the cluster and parallelizes incoming queries. The worker nodes then do all the actual work of running the queries.</p><p>In CloudFlare's case, the cluster holds about 1 million shards and each shard is replicated to multiple machines. When the application sends a query to the cluster, the primary node first prunes away unrelated shards and finds the specific shards relevant to the query. The primary node then transforms the query into many smaller queries for <a href="http://www.citusdata.com/blog/19-ozgun/114-how-to-build-your-distributed-database">parallel execution</a> and ships those smaller queries to the worker nodes.</p><p>Finally, the primary node receives intermediate results from the workers, merges them, and returns the final results to the application. This takes anywhere from 25 milliseconds to 2 seconds for queries in the CloudFlare analytics cluster, depending on whether some or all of the data is available in the page cache.</p><p>From a high availability standpoint, when a worker node fails, the primary node automatically fails over to the replicas, even during a query. The primary node holds slowly changing metadata, making it a good fit for continuous backups or PostgreSQL's streaming replication feature. Citus Data is currently working on further improvements to make it easy to replicate the primary metadata to all the other nodes.</p><p>At CloudFlare, we love the CitusDB architecture because it enabled us to continue using PostgreSQL. Our analytics dashboard and BI tools connect to Citus using standard PostgreSQL connectors, and tools like <code>pg_dump</code> and <code>pg_upgrade</code> just work. Two features that stand out for us are CitusDB’s PostgreSQL extensions that power our analytics dashboards, and CitusDB’s ability to parallelize the logic in those extensions out of the box.</p>
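<p>The prune/fan-out/merge flow works because aggregates like SUM can be computed from per-shard partial results and combined at the primary. A toy illustration of that scatter-gather shape in Python (plain threads and dicts, not CitusDB's actual executor; the shard data is made up):</p>

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shards: each worker holds (site_id, request_count) rows.
shards = [
    [("site-a", 10), ("site-b", 5)],
    [("site-a", 7), ("site-c", 2)],
    [("site-b", 1), ("site-c", 9)],
]

def shard_query(rows):
    """The 'smaller query' each worker runs: a per-shard partial SUM."""
    partial = {}
    for site, count in rows:
        partial[site] = partial.get(site, 0) + count
    return partial

def distributed_sum(shards):
    """The primary node's job: fan out, then merge partial results."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(shard_query, shards)
    totals = {}
    for partial in partials:
        for site, count in partial.items():
            totals[site] = totals.get(site, 0) + count
    return totals

print(distributed_sum(shards))  # {'site-a': 17, 'site-b': 6, 'site-c': 11}
```

<p>Shard pruning would simply skip shards whose metadata shows they cannot contain rows relevant to the query, before the fan-out step.</p>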
    <div>
      <h4>Postgres Extensions on CitusDB</h4>
      <a href="#postgres-extensions-on-citusdb">
        
      </a>
    </div>
    <p>PostgreSQL extensions are pieces of software that add functionality to the core database itself. Some examples are data types, user-defined functions, operators, aggregates, and custom index types. PostgreSQL has more than 150 publicly available official extensions. We’d like to highlight two of these extensions that might be of general interest. It’s worth noting that with CitusDB all of these extensions automatically scale to many servers without any changes.</p>
    <div>
      <h4>HyperLogLog</h4>
      <a href="#hyperloglog">
        
      </a>
    </div>
    <p><a href="https://en.wikipedia.org/wiki/HyperLogLog">HyperLogLog</a> is a sophisticated algorithm developed for doing unique count approximations quickly. Since an <a href="https://github.com/aggregateknowledge/postgresql-hll">HLL implementation for PostgreSQL</a> was open sourced by the good folks at Aggregate Knowledge, we could use it with CitusDB unchanged, because CitusDB is compatible with most (if not all) Postgres extensions.</p><p>HLL was important for our application because we needed to compute unique IP counts across various time intervals in real time, and we didn’t want to store the unique IPs themselves. With this extension, we could, for example, count the number of unique IP addresses accessing a customer site in a minute, but still have an accurate count when further rolling up the aggregated data into a 1-hour aggregate.</p>
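<p>The reason HLL sketches survive rollups is that two sketches merge by taking a register-wise maximum, which gives the same estimate as if the union of inputs had been fed to one sketch. A toy HyperLogLog in Python (an illustration of the algorithm only, not the postgresql-hll extension):</p>

```python
import hashlib
import math

class HLL:
    """Toy HyperLogLog sketch (illustrative only, not postgresql-hll)."""

    def __init__(self, p=10):
        self.p = p
        self.m = 1 << p              # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash: the first p bits pick a register, the rest feed
        # the "position of the leftmost 1-bit" rank.
        h = int.from_bytes(hashlib.sha1(item.encode()).digest()[:8], "big")
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Register-wise max: this is why 1-minute sketches roll up
        # into 1-hour sketches without double-counting repeat IPs.
        for i in range(self.m):
            self.registers[i] = max(self.registers[i], other.registers[i])

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        est = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if est <= 2.5 * self.m and zeros:  # small-range correction
            est = self.m * math.log(self.m / zeros)
        return int(est)
```

<p>Merging is the property the rollups above rely on: summing per-minute unique counts would overcount IPs seen in several minutes, while merged sketches do not.</p>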
    <div>
      <h4>Hstore</h4>
      <a href="#hstore">
        
      </a>
    </div>
    <p>The <a href="http://www.postgresql.org/docs/9.4/static/hstore.html">hstore data type</a> stores sets of key/value pairs within a single PostgreSQL value. This can be helpful in various scenarios such as with rows with many attributes that are rarely examined, or to represent semi-structured data. We use the hstore data type to hold counters for sparse categories (e.g. country, HTTP status, data center).</p><p>With the hstore data type, we save ourselves from the burden of denormalizing our table schema into hundreds or thousands of columns. For example, we have one hstore data type that holds the number of requests coming in from different data centers per minute per CloudFlare customer. With millions of customers and hundreds of data centers, this counter data ends up being very sparse. Thanks to hstore, we can efficiently store that data, and thanks to CitusDB, we can efficiently parallelize queries of that data.</p><p>For future applications, we are also investigating other extensions such as the Postgres columnar store extension <a href="https://github.com/citusdata/cstore_fdw">cstore_fdw</a> that Citus Data has open sourced. This will allow us to compress and store even more historical analytics data in a smaller footprint.</p>
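<p>The rollup-friendliness of these sparse counters comes from key-wise addition: absent keys simply count as zero. A toy illustration in Python (dictionaries standing in for hstore values; the data-center keys are made up):</p>

```python
from collections import Counter

# Hypothetical 1-minute rows for one customer: sparse per-data-center
# request counters, in the spirit of the hstore column described above.
# Keys only exist where a counter is non-zero.
minute_rows = [
    Counter({"SJC": 120, "LHR": 30}),
    Counter({"SJC": 95, "NRT": 4}),
    Counter({"LHR": 41}),
]

def rollup(rows):
    """Key-wise addition: sparse counters merge into coarser
    (e.g. 1-hour) aggregates without needing a dense column per
    data center in the table schema."""
    total = Counter()
    for row in rows:
        total += row
    return total

print(rollup(minute_rows))  # Counter({'SJC': 215, 'LHR': 71, 'NRT': 4})
```
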
    <div>
      <h4>Conclusion</h4>
      <a href="#conclusion">
        
      </a>
    </div>
    <p>CitusDB has been working very well for us as the new backend for our Customer Analytics system. We have also found many uses for the analytics data in a business intelligence context. The ease with which we can run distributed queries on the data allows us to quickly answer new questions about the CloudFlare network that arise from anyone in the company, from the SRE team through to Sales.</p><p>We are looking forward to features available in the recently released <a href="https://www.citusdata.com/citus-products/citusdb-software">CitusDB 4.0</a>, especially the performance improvements and the new shard rebalancer. We’re also excited about using the JSONB data type with CitusDB 4.0, along with all the other improvements that come standard as part of <a href="http://www.postgresql.org/docs/9.4/static/release-9-4.html">PostgreSQL 9.4</a>.</p><p>Finally, if you’re interested in building and operating distributed services like Kafka or CitusDB and writing Go as part of a dynamic team dealing with big (nay, gargantuan) amounts of data, <a href="https://www.cloudflare.com/join-our-team">CloudFlare is hiring</a>.</p> ]]></content:encoded>
            <category><![CDATA[Analytics]]></category>
            <category><![CDATA[SQL]]></category>
            <category><![CDATA[Postgres]]></category>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[LUA]]></category>
            <category><![CDATA[DDoS]]></category>
            <guid isPermaLink="false">4WkjJAXrP1iZH5uthDDnAh</guid>
            <dc:creator>Albert Strasheim</dc:creator>
        </item>
    </channel>
</rss>