
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Tue, 07 Apr 2026 17:20:01 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Automatic Audit Logs: new updates deliver increased transparency and accountability]]></title>
            <link>https://blog.cloudflare.com/introducing-automatic-audit-logs/</link>
            <pubDate>Thu, 13 Feb 2025 14:00:00 GMT</pubDate>
            <description><![CDATA[ We’re excited to announce the beta release of Automatic Audit Logs, offering greater transparency and control. ]]></description>
            <content:encoded><![CDATA[ <p></p>
    <div>
      <h2>What are audit logs and why do they matter?</h2>
      <a href="#what-are-audit-logs-and-why-do-they-matter">
        
      </a>
    </div>
    <p>Audit logs are a critical tool for tracking and recording changes, actions, and resource access patterns within your Cloudflare environment. They provide visibility into who performed an action, what the action was, when it occurred, where it happened, and how it was executed. This enables security teams to identify vulnerabilities, ensure regulatory compliance, and assist in troubleshooting operational issues. Audit logs provide critical transparency and accountability. That's why we're making them "automatic" — eliminating the need for individual Cloudflare product teams to manually send events. Instead, audit logs are generated automatically in a standardized format when an action is performed, providing complete visibility and ensuring comprehensive coverage across all our products.</p>
    <div>
      <h2>What's new?</h2>
      <a href="#whats-new">
        
      </a>
    </div>
    <p>We're excited to announce the beta release of Automatic Audit Logs — a system that unifies audit logging across Cloudflare products. This new system is designed to give you a complete and consistent view of your environment’s activity. Here’s how we’ve enhanced our audit logging capabilities:</p><ul><li><p><b>Standardized logging: </b>Previously, audit log generation was dependent on separate internal teams, which could lead to gaps and inconsistencies. Now, audit logs are automatically produced in a seamless and standardized way, eliminating reliance on individual teams and ensuring consistency across all Cloudflare services.</p></li><li><p><b>Expanded product coverage: </b>Automatic Audit Logs now extend our coverage from 62 to 111 products, boosting overall coverage from 75% to 95%. We now capture actions from key endpoints such as the <code>/accounts</code>, <code>/zones</code>, and <code>/organizations</code> APIs.</p></li><li><p><b>Granular filtering: </b>With uniformly formatted logs, you can quickly pinpoint specific actions, users, methods, and resources, making investigations faster and more efficient.</p></li><li><p><b>Enhanced context and transparency: </b>Each log entry includes detailed context, such as the authentication method used, whether the action was performed via the API or the Dashboard, and mappings to Cloudflare Ray IDs for better traceability.</p></li><li><p><b>Comprehensive activity capture: </b>In addition to create, edit, and delete actions, the system now records GET requests and failed attempts, ensuring that no critical activity goes unnoticed.</p></li></ul><p>This new system reflects Cloudflare's commitment to building a safer, more transparent Internet. 
It also supports Cloudflare's pledge to <a href="https://blog.cloudflare.com/secure-by-design-principles/"><u>CISA’s Cybersecurity Commitment</u></a>, reinforcing our dedication to increase our customers’ ability to gather evidence of cybersecurity intrusions.</p><p>Automatic Audit Logs (beta release) is available exclusively through the <a href="https://developers.cloudflare.com/api/resources/audit_logs/methods/list/"><u>API</u></a>. </p>
    <div>
      <h2>The journey of an audit log: how Cloudflare creates reliable, secure records</h2>
      <a href="#the-journey-of-an-audit-log-how-cloudflare-creates-reliable-secure-records">
        
      </a>
    </div>
    <p>At Cloudflare, we’ve always made audit logs available through the <a href="https://developers.cloudflare.com/api/resources/audit_logs/methods/list/"><u>Audit Log API</u></a>, but the experience has not always been consistent.</p><p>Why? Individual product teams were responsible for creating and maintaining their audit logs. This resulted in inconsistencies, gaps in coverage, and a fragmented user experience.</p><p>Recognizing the importance of reliable audit logs, we set out to improve coverage across all Cloudflare products. Our goal was to standardize, secure, and automate the process, giving users comprehensive insights into user-initiated actions while enhancing visibility and usability. Let’s take a closer look at how an audit log is created at Cloudflare.</p>
    <div>
      <h3><b>Which APIs are audit logged?</b> </h3>
      <a href="#which-apis-are-audit-logged">
        
      </a>
    </div>
    <p>Audit logs are generated for all user requests made via the public API or the Cloudflare dashboard. While a few exceptions exist, such as GraphQL requests and static assets, the majority of user actions are captured.</p><p>When a user action occurs, the request is forwarded to our audit logging pipeline. This ensures logs are generated automatically for all products, close to the source of the action, capturing the most relevant details.</p><p>For <a href="https://en.wikipedia.org/wiki/REST"><u>RESTful</u></a> APIs that produce JSON, sanitized request bodies are logged to prevent any sensitive information from being included in the audit logs. For GET requests, which are typically read-only and may generate large responses, only the action performed and the resource accessed are logged, avoiding unnecessary overhead while still maintaining essential visibility.</p>
    <div>
      <h3>Streaming HTTP requests</h3>
      <a href="#streaming-http-requests">
        
      </a>
    </div>
    <p>Any user-initiated action on Cloudflare, whether through the API or the Dashboard, is handled by the API Gateway. The HTTP request, along with its corresponding request and response data, is then forwarded to a <a href="https://www.cloudflare.com/en-gb/developer-platform/products/workers/"><u>Worker</u></a> called the Audit Log Redactor. This allows audit logging to happen automatically without relying on internal teams to send events.</p><p>To minimise latency, the API Gateway streams these requests to the redactor Worker via <a href="https://developers.cloudflare.com/workers/runtime-apis/rpc/"><u>RPC (Remote Procedure Calls)</u></a> using service bindings. This approach ensures the requests are successfully sent without going through a publicly accessible URL.</p>
    <div>
      <h3>Redacting sensitive information</h3>
      <a href="#redacting-sensitive-information">
        
      </a>
    </div>
    <p>Once the Worker receives the HTTP request, it references the <a href="https://blog.cloudflare.com/open-api-transition/"><u>Cloudflare OpenAPI Schema</u></a> to handle sensitive information. OpenAPI is a widely adopted, machine-readable, and human-friendly specification format that is used to define HTTP APIs. It relies on <a href="https://blog.postman.com/what-is-json-schema/"><u>JSON Schema</u></a> to describe the API’s underlying data.</p><p>Using the <a href="https://github.com/cloudflare/api-schemas/"><u>OpenAPI Schema</u></a>, the redactor Worker identifies the corresponding API schema for the HTTP request. It then redacts any sensitive information, leaving only the fields explicitly marked as <b>auditable</b> in the schema. This redaction process ensures that no sensitive data progresses further down the pipeline while retaining enough information to debug and analyze how an action impacted a resource’s value.</p><p>Each Cloudflare product team defines its APIs within the OpenAPI schema and marks specific fields as auditable. This provides visibility into resource changes while safeguarding sensitive data.</p><p>Once redacted, the data moves through Cloudflare’s data pipeline. This <a href="https://blog.cloudflare.com/cloudflare-incident-on-november-14-2024-resulting-in-lost-logs/#system-architecture"><u>pipeline</u></a> includes several key components, including Logfwdr, Logreceiver, and Buftee buffers, where the sanitized data is eventually pushed, awaiting further processing.</p>
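    <p>The redaction step can be illustrated with a minimal sketch. The schema representation below (a set of field names marked auditable) is a simplified, hypothetical stand-in for the auditable markers in the OpenAPI schemas; it is not Cloudflare's actual redactor code:</p>

```go
package main

import (
	"encoding/json"
	"fmt"
)

// redact keeps only the top-level fields marked auditable and drops
// everything else, so no sensitive data continues down the pipeline.
// The auditable set stands in for markers defined in an API schema.
func redact(body []byte, auditable map[string]bool) ([]byte, error) {
	var fields map[string]json.RawMessage
	if err := json.Unmarshal(body, &fields); err != nil {
		return nil, err
	}
	out := make(map[string]json.RawMessage)
	for name, value := range fields {
		if auditable[name] {
			out[name] = value
		}
	}
	return json.Marshal(out)
}

func main() {
	// Hypothetical request body: the token field is sensitive.
	body := []byte(`{"name":"My Policy","enabled":true,"secret_token":"abc123"}`)
	auditable := map[string]bool{"name": true, "enabled": true}
	redacted, err := redact(body, auditable)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(redacted))
}
```

    <p>The redacted body still shows how the action changed the resource (name, enabled state) while the secret never reaches storage.</p>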
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6aWb850BBQPt7iRNZk0rs9/2d18bd6e22f6f28e352666015ae15c1e/image1.png" />
          </figure>
    <div>
      <h3>Ingesting and building the audit log</h3>
      <a href="#ingesting-and-building-the-audit-log">
        
      </a>
    </div>
    <p>The Ingestor service consumes messages from Buftee buffers and transforms individual requests into audit log records. Using a fixed schema, the Ingestor ensures that audit logs remain standardized across all Cloudflare products, regardless of scale.</p><p>Because API Gateway — the system from which the majority of Automatic Audit Logs are recorded, as noted above — handles tens of thousands of requests per second, the Ingestor was designed to process multiple requests concurrently. </p>
          <figure>
          <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5ZdjrAiIP6Eu9DgsfDiRX5/6b19819a78911440b173e685ae9b6224/image2.png" />
          </figure><p><sup><i>Plot of audit requests rate. x-axis indicates the time and y-axis indicates the total number of audit requests handled per second.</i></sup></p>
    <div>
      <h3>Enriching and storing the logs</h3>
      <a href="#enriching-and-storing-the-logs">
        
      </a>
    </div>
    <p>From a security perspective, it is critical to capture who initiated a change and how they were authenticated. To achieve this, the audit log is enriched with user details and authentication information extracted from custom response headers.</p><p>Additional contextual details, such as the account name, are retrieved by making calls to internal services. To enhance performance, a read-through caching mechanism is used. The system checks the cache first and, if the data is unavailable, fetches it from internal services and caches it for future use.</p><p>Once the audit logs are fully transformed and enriched, they are stored in a database in batches to prevent overwhelming the system. For the beta release, we are storing 30 days of audit logs in the database. This will be extended to 18 months for our GA (General Availability) release in the second half of 2025.</p>
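    <p>A read-through cache can be sketched in a few lines. This is a deliberately simplified illustration (the account-name use case and single mutex are assumptions, not the production design):</p>

```go
package main

import (
	"fmt"
	"sync"
)

// readThroughCache answers lookups from memory first and only calls
// the (possibly slow) fetch function on a miss, caching the result.
// Holding the lock across fetch is a simplification for the sketch.
type readThroughCache struct {
	mu    sync.Mutex
	data  map[string]string
	fetch func(key string) string
}

func (c *readThroughCache) Get(key string) string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if v, ok := c.data[key]; ok {
		return v // cache hit: no internal service call
	}
	v := c.fetch(key) // cache miss: ask the internal service
	c.data[key] = v
	return v
}

func main() {
	calls := 0
	cache := &readThroughCache{
		data: make(map[string]string),
		fetch: func(key string) string {
			calls++ // stands in for a call to an internal service
			return "Example account"
		},
	}
	cache.Get("acct-1")
	cache.Get("acct-1") // served from cache; fetch not called again
	fmt.Println(calls)
}
```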
    <div>
      <h3>Sample audit log</h3>
      <a href="#sample-audit-log">
        
      </a>
    </div>
    <p>Here is a complete sample audit log generated when an alert notification policy is updated. It provides all the essential details to answer the who, what, when, where, and how of the action.</p><p>Audit logs are always associated with an account, and some actions also include user and zone information when relevant. The action section outlines what changed and when, while the actor section provides context on who made the change and how it was performed, including whether it was done via the API or through the UI.</p><p>Information about the resource is also included, so you can easily identify what was altered (in this case, the <a href="https://developers.cloudflare.com/waf/reference/alerts/"><u>Advanced Security Events Alert</u></a> was updated). Additionally, raw API request details are provided, allowing users to trace the audit log back to a specific API call.</p>
            <pre><code>curl -X PUT https://api.cloudflare.com/client/v4/accounts/&lt;account_id&gt;/alerting/v3/policies/&lt;policy_id&gt; --data-raw '{...}'</code></pre>
            
            <pre><code>{
    "account": {
        "id": "&lt;account_id&gt;",
        "name": "Example account"
    },
    "action": {
        "description": "Update a Notification policy",
        "result": "success",
        "time": "2025-01-23T18:25:14.749Z",
        "type": "update"
    },
    "actor": {
        "context": "dash",
        "email": "test@example.com",
        "id": "&lt;actor-id&gt;",
        "ip_address": "127.0.0.1",
        "token": {},
        "type": "user"
    },
    "id": "&lt;audit_log_id&gt;",
    "raw": {
        "cf_ray_id": "&lt;ray_id&gt;",
        "method": "PUT",
        "status_code": 200,
        "uri": "/accounts/&lt;account_id&gt;/alerting/v3/policies/&lt;policy_id&gt;",
        "user_agent": "Postman"
    },
    "resource": {
        "id": "&lt;resource-id&gt;",
        "product": "alerting",
        "request": {
            "alert_type": "clickhouse_alert_fw_ent_anomaly",
            "enabled": false,
            "filters": {
                "services": [
                    "securitylevel",
                    "ratelimit",
                    "firewallrules"
                ],
                "zones": [
                    "&lt;zone_id&gt;"
                ]
            },
            "name": "Advanced Security Events Alert"
        },
        "response": {
            "id": "&lt;resource_id&gt;"
        },
        "scope": "accounts",
        "type": "policies"
    }
}</code></pre>
            
    <div>
      <h2>Upcoming enhancements</h2>
      <a href="#upcoming-enhancements">
        
      </a>
    </div>
    <p>For General Availability (GA) we are focusing on developing a new user interface in the Dashboard for Automatic Audit Logs, extracting additional auditable fields for the audit logs — including system-initiated actions and user-level actions such as login events — and enabling audit log export via <a href="https://developers.cloudflare.com/logs/about/"><u>Logpush</u></a>. In the longer term, we plan to introduce dashboards, trend analysis, and alerting features for audit logs to further enhance their utility and ease of use. By enhancing our audit log system, Cloudflare is taking another step toward empowering users to manage their environments with greater transparency, security, and efficiency. </p>
    <div>
      <h2>Get started with Automatic Audit Logs</h2>
      <a href="#get-started-with-automatic-audit-logs">
        
      </a>
    </div>
    <p><b>Automatic Audit Logs</b> are now available for testing. We encourage you to explore the new features and provide your valuable feedback.</p><p>Retrieve audit logs using the following endpoint:</p><p><code>/accounts/&lt;account_id&gt;/logs/audit?since=&lt;date&gt;&amp;before=&lt;date&gt;</code></p><p>You can access detailed documentation for the Automatic Audit Logs beta API release <a href="https://developers.cloudflare.com/api/resources/accounts/subresources/logs/subresources/audit/"><u>here</u></a>.</p><p><i>Please note that the beta release does not include updates to the Audit Logs UI in the Cloudflare Dashboard. The existing UI and API for the current audit logs will remain available until Automatic Audit Logs reaches General Availability.</i></p><p><b>We want your feedback</b>: Your feedback is essential to improving Automatic Audit Logs. Please consider filling out a <a href="https://docs.google.com/forms/d/e/1FAIpQLSfXGkJpOG1jUPEh-flJy9B13icmcdBhveFwe-X0EzQjJQnQfQ/viewform?usp=sharing"><u>short survey</u></a>.</p> ]]></content:encoded>
            <category><![CDATA[Audit Logs]]></category>
            <category><![CDATA[Beta]]></category>
            <guid isPermaLink="false">3E22tesFNZps8Sqk8VPCan</guid>
            <dc:creator>Sahidya Devadoss</dc:creator>
            <dc:creator>Arti Phugat</dc:creator>
            <dc:creator>Chris Shepherd</dc:creator>
        </item>
        <item>
            <title><![CDATA[Intelligent, automatic restarts for unhealthy Kafka consumers]]></title>
            <link>https://blog.cloudflare.com/intelligent-automatic-restarts-for-unhealthy-kafka-consumers/</link>
            <pubDate>Tue, 24 Jan 2023 14:00:00 GMT</pubDate>
            <description><![CDATA[ At Cloudflare, we take steps to ensure we are resilient against failure at all levels of our infrastructure. This includes Kafka, which we use for critical workflows such as sending time-sensitive emails and alerts. ]]></description>
            <content:encoded><![CDATA[ <p></p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/7eWbGD5pEX9bKf2p58iqOw/b55ba4bfd305da7ed38cf66fe770585c/image3-8-2.png" />
            
            </figure><p>At Cloudflare, we take steps to ensure we are resilient against failure at all levels of our infrastructure. This includes Kafka, which we use for critical workflows such as sending time-sensitive emails and alerts.</p><p>We have learned a lot about keeping the applications that leverage Kafka healthy, so that they can always be operational. Application health checks are notoriously hard to implement: what makes an application healthy? How can we keep services operational at all times?</p><p>Health checks can be implemented in many ways. We’ll talk about an approach that allows us to considerably reduce incidents with unhealthy applications while requiring less manual intervention.</p>
    <div>
      <h3>Kafka at Cloudflare</h3>
      <a href="#kafka-at-cloudflare">
        
      </a>
    </div>
    <p><a href="/using-apache-kafka-to-process-1-trillion-messages/">Cloudflare is a big adopter of Kafka</a>. We use Kafka as a way to decouple services due to its asynchronous nature and reliability. It allows different teams to work effectively without creating dependencies on one another. You can also read more about how other teams at Cloudflare use Kafka in <a href="/http-analytics-for-6m-requests-per-second-using-clickhouse/">this</a> post.</p><p>Kafka is used to send and receive messages. Messages represent some kind of event, like a credit card payment or the details of a new user created in your platform. These messages can be represented in multiple ways: JSON, Protobuf, Avro, and so on.</p><p>Kafka organises messages in topics. A topic is an ordered log of events in which each message is marked with a progressive offset. When an event is written by an external system, it is appended to the end of that topic. These events are not deleted from the topic by default (retention can be applied).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2KUYbqCCL74YZVU8NXOThl/4ec5024168993a2300add7221016af0d/1-4.png" />
            
            </figure><p>Topics are stored as log files on disk, which are finite in size. Partitions are a systematic way of breaking one topic’s log file into many logs, each of which can be hosted on a separate server, enabling topics to scale.</p><p>Topics are managed by brokers, the nodes in a Kafka cluster. These are responsible for writing new events to partitions, serving reads, and replicating partitions among themselves.</p><p>Messages can be consumed by individual consumers or co-ordinated groups of consumers, known as consumer groups.</p><p>Consumers use a unique id (consumer id) that allows them to be identified by the broker as an application which is consuming from a specific topic.</p><p>Each topic can be read by an infinite number of different consumers, as long as they use a different id. Each consumer can replay the same messages as many times as they want.</p><p>When a consumer starts consuming from a topic, it will process all messages, starting from a selected offset, from each partition. With a consumer group, the partitions are divided amongst the consumers in the group. This division is determined by the consumer group leader, which receives information about the other consumers in the group and decides which consumers will receive messages from which partitions (the partition strategy).</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6Qe2Qe5nQ5gcHyhV0zpTWw/5182eea9de66164a36a28e92270fdb3f/2-3.png" />
            
            </figure><p>A consumer’s committed offset can indicate whether the consumer is working as expected. Committing a processed offset is the way a consumer and its consumer group report to the broker that they have processed a particular message.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/29Y9mQiHkvGKUzc3RGF1sk/09d2987f53eef026c164e6c49cacc95c/unnamed-6.png" />
            
            </figure><p>A standard measurement of whether a consumer is processing fast enough is lag. We use this to measure how far behind the newest message we are, tracking the time elapsed between a message being written to a topic and it being read. When a service is lagging behind, it means that messages are being consumed at a slower rate than new messages are being produced.</p><p>Due to Cloudflare’s scale, message rates typically end up being very large and a lot of requests are time-sensitive, so monitoring this is vital.</p><p>At Cloudflare, our applications using Kafka are deployed as microservices on Kubernetes.</p>
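    <p>In offset terms, lag on a partition is simply the distance between the newest offset written and the offset the consumer last committed. A minimal sketch (the function name and figures are illustrative):</p>

```go
package main

import "fmt"

// consumerLag returns how many messages a consumer is behind on one
// partition: the newest offset written minus the last committed offset.
func consumerLag(newestOffset, committedOffset int64) int64 {
	return newestOffset - committedOffset
}

func main() {
	// The broker has written up to offset 1500; we last committed 1200,
	// so the consumer is 300 messages behind on this partition.
	fmt.Println(consumerLag(1500, 1200))
}
```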
    <div>
      <h3>Health checks for Kubernetes apps</h3>
      <a href="#health-checks-for-kubernetes-apps">
        
      </a>
    </div>
    <p>Kubernetes uses <a href="https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/">probes</a> to understand if a service is healthy and is ready to receive traffic or to run. When a liveness probe fails and the bounds for retrying are exceeded, Kubernetes restarts the services.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/4FagbTygES9L7dmEQ6ratD/0a6f0d4c5ac117b723ad726a12d3936a/4-3.png" />
            
            </figure><p>When a readiness probe fails and the bounds for retrying are exceeded, Kubernetes stops sending HTTP traffic to the targeted pods. In the case of Kafka applications this is not relevant, as they don’t run an HTTP server. For this reason, we’ll cover only liveness checks.</p><p>A classic Kafka liveness check on a consumer checks the status of the connection with the broker. It’s often best practice to keep these checks simple and perform some basic operations, in this case something like listing topics. If this check fails consistently, for instance because the broker returns a TLS error, Kubernetes terminates the service and starts a new pod of the same service, therefore forcing a new connection. Simple Kafka liveness checks do a good job of detecting when the connection with the broker is unhealthy.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/6gNWb3Rit0MmTutsurm7sf/70355c422fab7ebce7d59d8c2c682d6d/5-2.png" />
            
            </figure>
    <div>
      <h3>Problems with Kafka health checks</h3>
      <a href="#problems-with-kafka-health-checks">
        
      </a>
    </div>
    <p>Due to Cloudflare’s scale, a lot of our Kafka topics are divided into multiple partitions (in some cases this can be hundreds!) and in many cases the replica count of our consuming service doesn’t necessarily match the number of partitions on the Kafka topic. This means that in a lot of scenarios this simple approach to health checking is not quite enough!</p><p>Microservices that consume from Kafka topics are healthy if they are consuming and committing offsets at regular intervals when messages are being published to a topic. When such services are not committing offsets as expected, it means that the consumer is in a bad state and will start accumulating lag. An approach we often take is to manually terminate and restart the service in Kubernetes; this causes a reconnection and a rebalance.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/N4YalYdgNRxYJK7PVAlzY/26b55fc38c53855a6c28c71b25cdac02/lag.png" />
            
            </figure><p>When a consumer joins or leaves a consumer group, a rebalance is triggered and the consumer group leader must re-assign which consumers will read from which partitions.</p><p>When a rebalance happens, each consumer is notified to stop consuming. Some consumers might get their assigned partitions taken away and re-assigned to another consumer. We noticed that, in our library implementation, if the consumer doesn’t acknowledge this command, it will wait indefinitely for new messages from a partition that it’s no longer assigned to, ultimately leading to a deadlock. Usually a manual restart of the faulty client-side app is needed to resume processing.</p>
    <div>
      <h3>Intelligent health checks</h3>
      <a href="#intelligent-health-checks">
        
      </a>
    </div>
    <p>As we were seeing consumers reporting as “healthy” but sitting idle, it occurred to us that maybe we were focusing on the wrong thing in our health checks. Just because the service is connected to the Kafka broker and can read from the topic, it does not mean the consumer is actively processing messages.</p><p>Therefore, we realised we should be focused on message ingestion, using the offset values to ensure that forward progress was being made.</p>
    <div>
      <h4>The PagerDuty approach</h4>
      <a href="#the-pagerduty-approach">
        
      </a>
    </div>
    <p>PagerDuty wrote an excellent <a href="https://www.pagerduty.com/eng/kafka-health-checks/">blog</a> on this topic which we used as inspiration when coming up with our approach.</p><p>Their approach used the current (latest) offset and the committed offset values. The current offset signifies the last message that was sent to the topic, while the committed offset is the last message that was processed by the consumer.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/2fwem7NtBnO6M1RMhrezr8/af4cbbd7a63d3145f5c7fe9f405bd04d/pasted-image-0-4.png" />
            
            </figure><p>We check that the consumer is moving forward by ensuring that the latest offset is changing (new messages are being received) and that the committed offset is changing as well (the new messages are being processed).</p><p>Therefore, the solution we came up with:</p><ul><li><p>If we cannot read the current offset, fail liveness probe.</p></li><li><p>If we cannot read the committed offset, fail liveness probe.</p></li><li><p>If the committed offset == the current offset, pass liveness probe.</p></li><li><p>If the value for the committed offset has not changed since the last run of the health check, fail liveness probe.</p></li></ul>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/5r76n2Iew7pSqA8vYNZzIy/c9e0f6a113a34d0c36a216c054e4d840/pasted-image-0--1--3.png" />
            
            </figure><p>To measure whether the committed offset is changing, we need to store the value from the previous run. We do this using an in-memory map where the partition number is the key. This means each instance of our service only has a view of the partitions it is currently consuming from, and will run the health check for each.</p>
    <div>
      <h4>Problems</h4>
      <a href="#problems">
        
      </a>
    </div>
    <p>When we first rolled out our smart health checks, we started to notice cascading failures some time after release. After initial investigations, we realised this happened whenever a rebalance occurred. It would initially affect one replica, then quickly result in the others reporting as unhealthy.</p><p>Because we stored the previous committed offset in-memory, a rebalance could re-assign one of a replica’s partitions to another consumer. When that happened, our service incorrectly assumed that the committed offset for that partition had not changed (as this specific replica was no longer updating the latest value), so it would start to report itself as unhealthy. The failing liveness probe would then cause it to restart, which would in turn trigger another rebalance in Kafka, causing other replicas to face the same issue.</p>
    <div>
      <h4>Solution</h4>
      <a href="#solution">
        
      </a>
    </div>
    <p>To fix this issue we needed to ensure that each replica only kept track of the offsets for the partitions it was consuming from at that moment. Luckily, the Shopify Sarama library, which we use internally, has functionality to observe when a rebalancing happens. This meant we could use it to rebuild the in-memory map of offsets so that it would only include the relevant partition values.</p><p>This is handled by receiving the signal from the session context channel:</p>
            <pre><code>for {
  select {
  case message, ok := &lt;-claim.Messages(): // &lt;-- Message received
     if !ok {
        // Messages channel closed (e.g. after a rebalance): stop consuming
        return nil
     }

     // Store latest received offset in-memory
     offsetMap[message.Partition] = message.Offset

     // Handle message
     handleMessage(ctx, message)

     // Commit message offset
     session.MarkMessage(message, "")

  case &lt;-session.Context().Done(): // &lt;-- Rebalance happened

     // Remove rebalanced partition from in-memory map
     delete(offsetMap, claim.Partition())
  }
}</code></pre>
            <p>Verifying this solution was straightforward: we just needed to trigger a rebalance. To test that it worked in all possible scenarios, we spun up a single replica of a service consuming from multiple partitions, scaled up the number of replicas until it matched the partition count, then scaled back down to a single replica. By doing this, we verified that the health checks could safely handle new partitions being assigned as well as partitions being taken away.</p>
    <div>
      <h3>Takeaways</h3>
      <a href="#takeaways">
        
      </a>
    </div>
    <p>Probes in Kubernetes are very easy to set up and can be a powerful tool to ensure your application is running as expected. Well-implemented probes can often be the difference between engineers being called out to fix trivial issues (sometimes outside of working hours) and a service which is self-healing.</p><p>However, without proper thought, “dumb” health checks can also lead to a false sense of security that a service is running as expected even when it’s not. One thing we have learnt from this is to think more about the specific behaviour of the service and decide what being unhealthy means in each instance, instead of just ensuring that dependent services are connected.</p> ]]></content:encoded>
            <category><![CDATA[Kafka]]></category>
            <category><![CDATA[Observability]]></category>
            <category><![CDATA[Go]]></category>
            <category><![CDATA[Kubernetes]]></category>
            <guid isPermaLink="false">7s1ijlG7zMlxJPI6Hcs3zl</guid>
            <dc:creator>Chris Shepherd</dc:creator>
            <dc:creator>Andrea Medda</dc:creator>
        </item>
    </channel>
</rss>