
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:media="http://search.yahoo.com/mrss/">
    <channel>
        <title><![CDATA[ The Cloudflare Blog ]]></title>
        <description><![CDATA[ Get the latest news on how products at Cloudflare are built, technologies used, and join the teams helping to build a better Internet. ]]></description>
        <link>https://blog.cloudflare.com</link>
        <atom:link href="https://blog.cloudflare.com/" rel="self" type="application/rss+xml"/>
        <language>en-us</language>
        <image>
            <url>https://blog.cloudflare.com/favicon.png</url>
            <title>The Cloudflare Blog</title>
            <link>https://blog.cloudflare.com</link>
        </image>
        <lastBuildDate>Mon, 13 Apr 2026 19:48:57 GMT</lastBuildDate>
        <item>
            <title><![CDATA[Always Online v.2]]></title>
            <link>https://blog.cloudflare.com/always-online-v2/</link>
            <pubDate>Sun, 05 Aug 2012 19:33:00 GMT</pubDate>
            <description><![CDATA[ The video on CloudFlare's home page promises that we will keep your web page online "even if your server goes down." It's a feature we dubbed "Always Online" and, when it works, it's magical. The problem is, Always Online doesn't always work. ]]></description>
            <content:encoded><![CDATA[ <p></p><p>The video on CloudFlare's home page promises that we will keep your web page online "even if your server goes down." It's a feature we dubbed "Always Online" and, when it works, it's magical. The problem is, Always Online doesn't always work.</p><p>This blog post is to announce that we've just released a new version of Always Online which we believe will make the feature significantly better. But, before I get to that, let me tell you a bit about the history of Always Online, how it has worked up until recently, and why it didn't always work. Then I'll turn to what we've done to create Always Online v.2.</p>
    <div>
      <h3>An Accidental Feature</h3>
      <a href="#an-accidental-feature">
        
      </a>
    </div>
    <p>Prior to starting CloudFlare, Lee and I ran Project Honey Pot. The Project Honey Pot website is database driven and contains a virtually infinite number of pages. One of the biggest challenges we had wasn't human traffic, which followed a predictable browsing pattern and could therefore reliably be cached, but traffic from automated crawlers.</p><p>These crawlers, whether legitimate (e.g., Google's bot) or illegitimate (e.g., spam harvesters), tend to crawl very "deep" into sites. As a result, they hit pages that are unlikely to have been crawled in a while and, in doing so, can impose significant load on a database. I've previously written about the <a href="/cloudflare-uses-intelligent-caching-to-avoid">hidden tax web crawlers impose on web performance</a>.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/17kYRtdkfisOWpqTYxMWL8/ead50fb49c8590520111e28df06a2e85/phpot_logo_white.jpg.scaled500.jpg" />
            
            </figure><p>At Project Honey Pot, Lee built a number of sophisticated caching strategies in order to help lessen the load of automated crawlers on the site's database. At CloudFlare, he realized that we could provide the same type of caching in order to cut the burden bots placed on backends. In essence, we automatically cache content for a short amount of time and, if it hasn't changed since the last request from a bot, deliver it without having to burden your web application. It works great.</p><p>In the process of building the bot content cache, Lee realized he could implement something else: a system to serve static versions of pages if an origin server fails. Using human traffic to build such a cache is dangerous because you don't want to expose one user's private information to another user (e.g., we can't cache when one user visits their bank's website to view their statement and then show that statement to another user). However, search engine crawlers are the perfect anonymous user to build a site's cache. The logic was: if it's in Google, then it's already effectively cached.</p>
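<p>The stale-on-failure idea described above, building a cache from anonymous search-crawler traffic and falling back to it when the origin is unreachable, can be sketched in a few lines. This is an illustrative Python sketch, not CloudFlare's actual implementation; all class and method names are hypothetical:</p>

```python
# Sketch of the "Always Online" idea: known search-engine crawler requests
# feed a snapshot cache (crawler traffic carries no user-specific state, so
# the copies are safe to reuse); when the origin fails, serve the snapshot.
# All names here are illustrative, not CloudFlare's real implementation.

class AlwaysOnlineCache:
    def __init__(self):
        self.snapshots = {}  # url -> last page body served to a crawler

    def on_crawler_response(self, url, body):
        # Only recognized search-engine crawler traffic populates the cache.
        self.snapshots[url] = body

    def fetch(self, url, origin_fetch):
        try:
            # Normal path: ask the origin server for a fresh page.
            return origin_fetch(url)
        except ConnectionError:
            # Origin is down: fall back to the crawler-built snapshot.
            if url in self.snapshots:
                return self.snapshots[url]
            # No snapshot available -> the "site offline" error page.
            raise
```

<p>The weakness the post goes on to describe follows directly from this design: pages the crawlers never revisit never land in <code>snapshots</code>, so the fallback fails exactly where it is needed.</p>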
    <div>
      <h3>Good, Not Perfect</h3>
      <a href="#good-not-perfect">
        
      </a>
    </div>
    <p>The approach of using known search engine bot traffic to build CloudFlare's cache was clever, but it had some problems. The first was that CloudFlare runs multiple data centers around the world and the cache in each is different. The solution was to find the data center with the most search engine crawler traffic and, if a copy of the page didn't exist in the local data center's cache, fall back on the "primary" data center. In our case, our Ashburn, Virginia data center received the most crawl traffic, so we added a lot more disks there and used it to build up the Always Online cache.</p><p>That worked great for some sites, but for others we still would not have content in our cache when the server went offline. Seemingly bizarrely, the more static the page, the less likely it was to be in our cache. The explanation was the source of the cache data: search engine crawlers. These crawlers are generally set up to visit frequently changing pages often, and rarely changing pages only occasionally. If a page returned a 304 "Not Modified" response, the content didn't get recached. We didn't help things by automatically expiring items in our cache after a period of time.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/1nEbRUnOgBaQTBhhZbcwUx/cc442158eaf075ea5a8a24d4c0a378ca/website_offline.png.scaled500.png" />
            
            </figure><p>The net result was, far too often, when someone's site would go offline their visitors wouldn't see a cached version of the page but, instead, a CloudFlare error page telling them that the site was offline and no cached version was available. This became one of the top complaints from our users and the visitors to their sites. When our support team dubbed the feature "Always Offline" we knew it was time to make it better.</p>
    <div>
      <h3>Version 2</h3>
      <a href="#version-2">
        
      </a>
    </div>
    <p>We made a number of improvements in how we cache pages in order to improve Always Online, but the biggest change we made was to begin to actively crawl pages ourselves. CloudFlare now runs a crawler which periodically crawls our customers' pages if they have the Always Online feature enabled. The crawler's user agent is:</p><blockquote><p>Mozilla/5.0 (compatible; CloudFlare-AlwaysOnline/1.0; +<a href="http://www.cloudflare.com/always-online">http://www.cloudflare.com/always-online</a> )</p></blockquote><p>You can learn more about the crawler's behavior by visiting: <a href="http://www.cloudflare.com/always-online">www.cloudflare.com/always-online</a>. The frequency with which we refresh pages in the Always Online cache depends on your plan. We crawl free customers once every 9 days, Pro customers once every 3 days, and Business and Enterprise customers daily. We are tinkering with the amount of time we spend crawling each site as well as tuning the crawler to ensure it doesn't visit sites when they're under load or otherwise impose any additional burden.</p><p>Given that we can now control exactly what is in our Always Online cache, our next iteration will be to turn that control over to our users and allow you to both "pin" the pages you want to ensure are always available and "exclude" any pages you never want cached. In the meantime, we're using data we have about the most popular portions of each site in order to choose what pages to prioritize in the cache.</p><p>Our goal is to make the Site Offline error a thing of the past. We started building the new cache a couple of days ago and expect everyone with Always Online to have a more robust cache available within the next few days. While everyone hopes their origin server will never go down, with Always Online v.2 we're happy to provide better peace of mind in case it ever does.</p> ]]></content:encoded>
            <category><![CDATA[Reliability]]></category>
            <category><![CDATA[Cache]]></category>
            <category><![CDATA[Always Online]]></category>
            <category><![CDATA[Product News]]></category>
            <category><![CDATA[Project Honey Pot]]></category>
            <guid isPermaLink="false">466leeS8DWGaJTucsCU2RX</guid>
            <dc:creator>Matthew Prince</dc:creator>
        </item>
        <item>
            <title><![CDATA[CloudFlare Uses Intelligent Caching to Avoid the Bot Performance Tax]]></title>
            <link>https://blog.cloudflare.com/cloudflare-uses-intelligent-caching-to-avoid/</link>
            <pubDate>Fri, 16 Dec 2011 20:28:00 GMT</pubDate>
            <description><![CDATA[ In 2004, Lee Holloway and I started Project Honey Pot. The site, which tracks online fraud and abuse, primarily consists of web pages that report the reputation of IP addresses.  ]]></description>
            <content:encoded><![CDATA[ <p><i>I originally wrote this article for the </i><a href="http://calendar.perfplanet.com/2011/using-intelligent-caching-to-avoid-the-bot-performance-tax/"><i>Web Performance Calendar website</i></a><i>, which is a terrific resource of expert opinions on making your website as fast as possible. We thought CloudFlare users would be interested so we reproduced it here. Enjoy!</i></p><hr /><p>In 2004, <a href="https://twitter.com/icqheretic">Lee Holloway</a> and I started <a href="http://www.projecthoneypot.org/">Project Honey Pot</a>. The site, which tracks online fraud and abuse, primarily consists of web pages that report the reputation of IP addresses. While we had limited resources and tried to get the most out of them, I just checked Google, which lists more than 31 million pages in its index that make up the <a href="http://www.projecthoneypot.org/">www.projecthoneypot.org</a> site.</p><p>Project Honey Pot's pages are relatively simple and asset light, but like many sites today they include significant dynamic content that is regularly updated at unpredictable intervals. To deliver near realtime updates, the pages need to be database driven.</p><p>To maximize performance of the site, from the beginning we used a number of different caching layers to store the most frequently accessed pages. Lee, whose background is high-performance database design, studied reports from services like Google Analytics to understand how visitors moved through the site and built caching to keep regularly accessed pages from needing to hit the database.</p><p>We thought we were pretty smart but, in spite of following the best practices of web application performance design, with alarming frequency the site would grind to a halt. 
The culprit turned out to be something unexpected and hidden from the view of many people optimizing web performance: automated bots.</p><p>The average website sees more than 20% of its requests coming from some sort of automated bot. These bots include the usual suspects like search engine crawlers, but also include malicious bots scanning for vulnerabilities or harvesting data. We've been tracking this data at CloudFlare across hundreds of thousands of sites on our network and have found that on average approximately <a href="/do-hackers-take-the-holidays-off">15% of total web requests originate from a web threat of one form or another</a>, with swings up and down depending on the day.</p>
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/53KzltkP9pJxEmlCP7TuVU/c2bdf7cab7148cbc48f0045783e879ca/web_attacks_on_the_holidays.png.scaled500.png" />
            
            </figure><p>In Project Honey Pot's case, the traffic from these bots had a significant performance impact. Because they did not follow the typical human visitation pattern, they were often triggering pages that weren't hot in our cache. Moreover, since the bots typically didn't fire JavaScript beacons like those used in systems like Google Analytics, their traffic and its impact weren't immediately obvious.</p><p>To solve the problem, we implemented two different systems to deal with two different types of bots. Because we had great data on web threats, we were able to leverage that to restrict known malicious crawlers from requesting dynamic pages on the site. Just taking off the threat traffic had an immediate impact and freed up database resources for legitimate visitors.</p><p>The same approach didn't make sense for the other type of automated bots: search engine crawlers. We wanted Project Honey Pot's pages to be found through online searches, so we didn't want to block search engine crawlers entirely. However, in spite of removing the threat traffic, Google, Yahoo, and Microsoft's crawlers all accessing the site at the same time would sometimes cause the web server and database to slow to a crawl.</p><p>The solution was a modification of our caching strategy. While we wanted to deliver the latest results to human visitors, we began serving search crawlers from a cache with a longer time to live (TTL). We experimented with the right TTLs for pages, but eventually settled on 1 day as being optimal for the Project Honey Pot site. If a page is crawled by Google today and then Baidu requests the same page in the next 24 hours, we return the cached version without regenerating the page from the database.</p><p>Search engines, by their nature, see a snapshot of the Internet. 
While it is important not to serve deceptively different content to their crawlers, modifying your caching strategy to minimize their performance impact on your web application is well within the bounds of good web practices.</p><p>Since starting <a href="https://www.cloudflare.com/">CloudFlare</a>, we've taken the caching strategy we developed at Project Honey Pot and made it more intelligent and dynamic to optimize performance. We automatically tune the search crawler TTL to the characteristics of the site, and are very good at keeping malicious crawlers from ever hitting your web application. On average, we're able to offload 70% of the requests from a web application — which is stunning given the entire CloudFlare configuration process takes about 5 minutes. While some of this performance benefit comes from traditional CDN-like caching, some of the biggest cache wins actually come from handling bots' deep page views that aren't alleviated by traditional caching strategies.</p><p>The results can be dramatic. For example, SXSW's website employs extensive traditional web application and database caching systems but was able to <a href="/cloudflare-powers-the-sxsw-panel-picker">reduce the load on their web servers and database machines by more than 50%</a> in large part because of CloudFlare's bot-aware caching.</p>
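<p>The bot-aware TTL strategy described above amounts to a small branch in the request path: crawlers can be served a day-old snapshot, while humans always get a fresh render. Here is a hypothetical Python sketch (the one-day crawler TTL follows the post; the function and variable names are illustrative, not CloudFlare's or Project Honey Pot's real code):</p>

```python
import time

# Bot-aware caching sketch: human visitors always get a freshly rendered
# page, while search-engine crawlers are served from a longer-lived cache.
# The post settled on a 1-day TTL for crawlers on projecthoneypot.org.
CRAWLER_TTL = 24 * 60 * 60  # one day, in seconds

_cache = {}  # url -> (body, cached_at)

def handle_request(url, is_search_crawler, render_page, now=None):
    now = time.time() if now is None else now
    if is_search_crawler and url in _cache:
        body, cached_at = _cache[url]
        if now - cached_at < CRAWLER_TTL:
            # Crawler hit within the TTL: serve the snapshot and skip
            # the database entirely (Google today, Baidu tomorrow).
            return body
    # Human visitor, cache miss, or expired entry: hit the application.
    body = render_page(url)
    _cache[url] = (body, now)
    return body
```

<p>The content served to crawlers is identical to what humans see, just up to a day stale, which keeps the approach on the right side of the "deceptively different content" line mentioned above.</p>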
            <figure>
            
            <img src="https://cf-assets.www.cloudflare.com/zkvhlag99gkb/3tbbF1KL2pxN8hV0keeuPW/08606f1367522c62c843e0268541bc0d/SXSW_load_graph.png.scaled500.png" />
            
            </figure><p>When you're tuning your web application for maximum performance, if you're only looking at a beacon-based analytics tool like Google Analytics you may be missing one of the biggest sources of web application load. This is why CloudFlare's analytics reports the visits from all visitors to your site. Even without CloudFlare, digging through your raw server logs, being bot-aware, and building caching strategies that differentiate between the behaviors of different classes of visitors can be an important aspect of any site's web performance strategy.</p> ]]></content:encoded>
            <category><![CDATA[Project Honey Pot]]></category>
            <category><![CDATA[Cloudflare History]]></category>
            <category><![CDATA[Speed & Reliability]]></category>
            <guid isPermaLink="false">1DhGjiHa4OFE1hPaUvxYCT</guid>
            <dc:creator>Matthew Prince</dc:creator>
        </item>
        <item>
            <title><![CDATA[CloudFlare, Now With Faster Stats!]]></title>
            <link>https://blog.cloudflare.com/cloudflare-now-with-faster-stats/</link>
            <pubDate>Fri, 17 Dec 2010 19:14:00 GMT</pubDate>
            <description><![CDATA[ We generate a lot of logs. To give you some sense, across the CloudFlare network every minute we write a half a Gigabyte of just log data. Collecting, reducing, sorting, and displaying that data back to you is one of the biggest challenges of running a service like CloudFlare. ]]></description>
            <content:encoded><![CDATA[ <p>We generate a lot of logs. To give you some sense, across the CloudFlare network every minute we write half a gigabyte of just log data. Collecting, reducing, sorting, and displaying that data back to you is one of the biggest challenges of running a service like CloudFlare. Over the last few days we've been upgrading our core logging infrastructure to make the displaying portion a lot faster. We want our stats page to be lightning quick whenever you load it or run a query. We were hitting an I/O bottleneck with some queries that produced what we felt were unacceptably slow results. While we have some long-term plans on how to enhance the core storage architecture to get around this, in the short term we needed a fix we knew would work so we turned to our friends at <a href="http://fusionio.com/">Fusion-IO</a>.</p><p>Lee had used Fusion-IO cards at <a href="http://www.projecthoneypot.org/">Project Honey Pot</a> to get around file system limitations, and at CloudFlare we had become even more familiar with them because some of our investors were also investors in Fusion. Plus Steve Wozniak is on their Board of Directors and, as Damon has pointed out on Twitter, we have a <a href="http://www.facebook.com/photo.php?fbid=434528210431">Woz ninja</a>. If you're not familiar with Fusion-IO, here's all you need to know: they'll make something other than I/O the bottleneck on your system.</p><p>Over the last few days we've moved the parts of the stats database that need to be fast onto Fusion-IO cards. Moving that much data and rebuilding indexes took longer than we anticipated. However, as we anticipated, the bottleneck was the old slow media and CPU, not the new Fusion cards. Going forward, we'll be evaluating their performance and deciding whether to incorporate them into our core storage architecture more broadly. In the meantime, sorry for the downtime with stats being displayed. 
Rest assured that we've been gathering data all that time and what's not already back online will trickle in over the next day. And, best of all, the page to view your stats should now be a lot faster.</p> ]]></content:encoded>
            <category><![CDATA[Logs]]></category>
            <category><![CDATA[Project Honey Pot]]></category>
            <category><![CDATA[Cloudflare History]]></category>
            <guid isPermaLink="false">34vxeKvN6BHUJkNu1L6wKW</guid>
            <dc:creator>Matthew Prince</dc:creator>
        </item>
    </channel>
</rss>