We then compiled our Fuzz function with go-fuzz and launched the fuzzer on a lab server. The first thing go-fuzz does is minimize the corpus by throwing away packets that trigger the same code paths; then it starts mutating the inputs and passing them to Fuzz() in a loop. The mutations that don't fail (return 1) and expand code coverage are kept and iterated on. When the program panics, a small report (input and output) is saved and the program is restarted. If you want to learn more about go-fuzz, watch the author's GopherCon talk or read the README.
Crashes, mostly "index out of bounds", started to surface. go-fuzz becomes pretty slow and ineffective when the program crashes often, so while the CPUs burned I started fixing the bugs.
In some cases I just decided to change some parser patterns, for example reslicing and using len() instead of keeping offsets. However, these can be potentially disruptive changes (I'm far from perfect), so I adapted the Fuzz function to keep an eye on the differences between the old parser and the new, fixed one, and to crash if the new parser started refusing good packets or changed its behavior:

    func Fuzz(rawMsg []byte) int {
        var (
            msg, msgOld = &dns.Msg{}, &old.Msg{}
            buf, bufOld = make([]byte, 100000), make([]byte, 100000)
            res, resOld []byte

            unpackErr, unpackErrOld error
            packErr, packErrOld     error
        )

        unpackErr = msg.Unpack(rawMsg)
        unpackErrOld = ParseDNSPacketSafely(rawMsg, msgOld)

        if unpackErr != nil && unpackErrOld != nil {
            return 0
        }

        if unpackErr != nil && unpackErr.Error() == "dns: out of order NSEC block" {
            // 97b0a31 - rewrite NSEC bitmap [un]packing to account for out-of-order
            return 0
        }

        if unpackErr != nil && unpackErr.Error() == "dns: bad rdlength" {
            // 3157620 - unpackStructValue: drop rdlen, reslice msg instead
            return 0
        }

        if unpackErr != nil && unpackErr.Error() == "dns: bad address family" {
            // f37c7ea - Reject a bad EDNS0_SUBNET family on unpack (not only on pack)
            return 0
        }

        if unpackErr != nil && unpackErr.Error() == "dns: bad netmask" {
            // 6d5de0a - EDNS0_SUBNET: refactor netmask handling
            return 0
        }

        if unpackErr != nil && unpackErrOld == nil {
            println("new code fails to unpack valid packets")
            panic(unpackErr)
        }

        res, packErr = msg.PackBuffer(buf)

        if packErr != nil {
            println("failed to pack back a message")
            spew.Dump(msg)
            panic(packErr)
        }

        if unpackErrOld == nil {

            resOld, packErrOld = msgOld.PackBuffer(bufOld)

            if packErrOld == nil && !bytes.Equal(res, resOld) {
                println("new code changed behavior of valid packets:")
                println()
                println(hex.Dump(res))
                println(hex.Dump(resOld))
                os.Exit(1)
            }

        }

        return 1
    }

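The "reslicing and using len()" pattern mentioned above can be illustrated with a toy parser. This is only a minimal sketch, not code from the library: readLabel and errShort are hypothetical names, and the format is a bare length-prefixed label. The point is that once the parser advances by reslicing, every bounds check becomes a comparison against len(msg), with no separate offset to keep in sync.

```go
package main

import (
	"errors"
	"fmt"
)

// errShort is a hypothetical error used only by this sketch.
var errShort = errors.New("short buffer")

// readLabel is a hypothetical helper: it reads one length-prefixed label
// from the front of msg and returns the label plus the unread remainder.
// Because it reslices instead of tracking an offset, each bounds check is
// a simple comparison against len(msg).
func readLabel(msg []byte) (label, rest []byte, err error) {
	if len(msg) < 1 {
		return nil, nil, errShort
	}
	n := int(msg[0])
	msg = msg[1:] // consume the length byte
	if len(msg) < n {
		return nil, nil, errShort
	}
	return msg[:n], msg[n:], nil
}

func main() {
	label, rest, err := readLabel([]byte{3, 'w', 'w', 'w', 0})
	fmt.Println(string(label), len(rest), err)

	// A length byte that claims more data than is present is rejected
	// with an error instead of an index-out-of-range panic.
	_, _, err = readLabel([]byte{9, 'x'})
	fmt.Println(err)
}
```

The caller keeps feeding rest back into readLabel, so "how far have we read" is always implied by the slice itself.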
I was pretty happy about the robustness gain, but since we used the ParseDNSPacketSafely wrapper in RRDNS I didn't expect to find security vulnerabilities. I was wrong!
DNS names are made of labels, usually shown separated by dots. In a space-saving effort, labels can be replaced by pointers to other names, so that if we know we encoded example.com at offset 15, www.example.com can be packed as www. + PTR(15). What we found is a bug in the handling of pointers to empty names: when encountering the end of a name (0x00), if no labels had been read, "." (the empty name) was returned as a special case. The problem is that this special case was unaware of pointers, and it would instruct the parser to resume reading from the end of the pointed-to empty name instead of the end of the original name.
For example, if the parser encountered at offset 60 a pointer to offset 15, and msg[15] == 0x00, parsing would then resume from offset 16 instead of 61, causing an infinite loop. This is a potential Denial of Service vulnerability.

    A) Parse up to position 60, where a DNS name is found

    | ... |  15  |  16  |  17  | ... |  58  |  59  |  60  |  61  |
    | ... | 0x00 |      |      | ... |      |      | ->15 |      |

    ------------------------------------------------->

    B) Follow the pointer to position 15

    | ... |  15  |  16  |  17  | ... |  58  |  59  |  60  |  61  |
    | ... | 0x00 |      |      | ... |      |      | ->15 |      |

             ^                                         |
             -------------------------------------------

    C) Return an empty name ".", special case triggers

    D) Erroneously resume from position 16 instead of 61

    | ... |  15  |  16  |  17  | ... |  58  |  59  |  60  |  61  |
    | ... | 0x00 |      |      | ... |      |      | ->15 |      |

                    -------------------------------->

    E) Rinse and repeat

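To make the correct behavior concrete, here is a simplified name decoder. It is a sketch under stated assumptions, not the library's implementation: unpackName, maxPointers and the error values are hypothetical names, and only plain labels and compression pointers are handled. The key detail is that the resume offset is recorded when the first pointer is followed and is never overwritten afterwards, even when the pointed-to name is empty. Note that on the wire a pointer occupies two bytes, so in this sketch a pointer at offset 60 resumes at 62 (the diagram above draws it as a single cell).

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errTruncated  = errors.New("truncated name")
	errTooManyPtr = errors.New("too many compression pointers")
)

// maxPointers is an arbitrary limit chosen for this sketch.
const maxPointers = 10

// unpackName is a hypothetical, simplified decoder for a compressed DNS
// name starting at off. It returns the name and the offset at which
// parsing of the enclosing message should resume.
func unpackName(msg []byte, off int) (name string, next int, err error) {
	ptrs := 0
	followed := false // true once at least one pointer has been followed
	for {
		if off >= len(msg) {
			return "", 0, errTruncated
		}
		c := int(msg[off])
		off++
		switch c & 0xC0 {
		case 0x00:
			if c == 0 { // end of name
				if !followed {
					next = off // only valid if we never jumped
				}
				if name == "" {
					name = "."
				}
				return name, next, nil
			}
			if off+c > len(msg) {
				return "", 0, errTruncated
			}
			name += string(msg[off:off+c]) + "."
			off += c
		case 0xC0: // compression pointer: 14-bit offset
			if off >= len(msg) {
				return "", 0, errTruncated
			}
			target := (c&0x3F)<<8 | int(msg[off])
			off++
			if !followed {
				next = off // resume right after the first pointer
			}
			followed = true
			if ptrs++; ptrs > maxPointers {
				return "", 0, errTooManyPtr
			}
			off = target
		default:
			return "", 0, errors.New("unsupported label type")
		}
	}
}

func main() {
	msg := make([]byte, 62)     // msg[15] is 0x00: an empty name
	msg[60], msg[61] = 0xC0, 15 // pointer to offset 15
	fmt.Println(unpackName(msg, 60))

	full := []byte("\x07example\x03com\x00\x03www\xc0\x00")
	fmt.Println(unpackName(full, 13))
}
```

Capping the number of pointer jumps is a separate safeguard: it turns a pointer that refers to itself into an error rather than a hang.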
We sent the fixes privately to the library maintainer while we patched our servers and we opened a PR once done. (Two bugs were independently found and fixed by Miek while we released our RRDNS updates, as it happens.)
Thanks to its flexible fuzzing API, go-fuzz lends itself nicely not only to the mere search for crashing inputs, but also to exploring any scenario where edge cases are troublesome.
Useful applications range from checking output validation by adding crashing assertions to your Fuzz() function, to comparing the two ends of an unpack-pack chain, and even to comparing the behavior of two different versions or implementations of the same functionality.
For example, while preparing our DNSSEC engine for launch, I faced a weird bug that would happen only in production or under stress tests: NSEC records that were supposed to have only a couple of bits set in their types bitmap would sometimes look like this:
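As an illustration of the "two ends of an unpack-pack chain" idea, here is a minimal round-trip check. It is a sketch built on a toy codec, not on the DNS library: unpack and pack are hypothetical stand-ins that treat the input as a list of big-endian 16-bit values. The Fuzz function follows the convention described earlier in the post: return 0 for inputs the parser rejects, return 1 for inputs that parse, and crash when an assertion fails.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"errors"
	"fmt"
)

// unpack is a toy parser standing in for a real one: it reads the input
// as a sequence of big-endian uint16 values.
func unpack(data []byte) ([]uint16, error) {
	if len(data)%2 != 0 {
		return nil, errors.New("odd length")
	}
	vals := make([]uint16, 0, len(data)/2)
	for ; len(data) > 0; data = data[2:] {
		vals = append(vals, binary.BigEndian.Uint16(data))
	}
	return vals, nil
}

// pack is the inverse of unpack.
func pack(vals []uint16) []byte {
	out := make([]byte, 2*len(vals))
	for i, v := range vals {
		binary.BigEndian.PutUint16(out[2*i:], v)
	}
	return out
}

// Fuzz asserts that packing an unpacked message reproduces the input.
func Fuzz(data []byte) int {
	vals, err := unpack(data)
	if err != nil {
		return 0 // rejected input, nothing to compare
	}
	if out := pack(vals); !bytes.Equal(out, data) {
		panic(fmt.Sprintf("round trip changed the message: %x -> %x", data, out))
	}
	return 1
}

func main() {
	fmt.Println(Fuzz([]byte{0x00, 0x2f, 0x00, 0x2e}), Fuzz([]byte{0x01}))
}
```

For a real format the equality check may need to be looser, since several encodings can be valid for the same message (DNS name compression is one example).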

    deleg.filippo.io.  IN  NSEC  3600  \000.deleg.filippo.io. NS WKS HINFO TXT AAAA LOC SRV CERT SSHFP RRSIG NSEC TLSA HIP TYPE60 TYPE61 SPF

The catch was that our "pack and send" code pools []byte buffers to reduce GC and allocation churn, so buffers passed to dns.Msg.PackBuffer(buf []byte) can be "dirty" from previous uses.

    var bufpool = sync.Pool{
        New: func() interface{} {
            return make([]byte, 0, 2048)
        },
    }

    [...]

        data := bufpool.Get().([]byte)
        defer bufpool.Put(data)

        if data, err = r.Response.PackBuffer(data); err != nil {

However, some github.com/miekg/dns packers did not handle buf being anything other than an array of zeroes. Among them was the NSEC rdata packer, which would just OR in the bits that are present, without clearing the ones that are supposed to be absent.

    case `dns:"nsec"`:
        lastwindow := uint16(0)
        length := uint16(0)
        for j := 0; j < val.Field(i).Len(); j++ {
            t := uint16((fv.Index(j).Uint()))
            window := uint16(t / 256)
            if lastwindow != window {
                off += int(length) + 3
            }
            length = (t - window*256) / 8
            bit := t - (window * 256) - (length * 8)

            msg[off] = byte(window)       // window #
            msg[off+1] = byte(length + 1) // octets length

            // Setting the bit value for the type in the right octet
            msg[off+2+int(length)] |= byte(1 << (7 - bit)) // <--- the offending OR

            lastwindow = window
        }
        off += 2 + int(length)
        off++
    }

The fix was clear and easy: we benchmarked a few different ways to zero a buffer and updated the code like this:

    // zeroBuf is a big buffer of zero bytes, used to zero out the buffers passed
    // to PackBuffer.
    var zeroBuf = make([]byte, 65535)

    var bufpool = sync.Pool{
        New: func() interface{} {
            return make([]byte, 0, 2048)
        },
    }

    [...]

        data := bufpool.Get().([]byte)
        defer bufpool.Put(data)
        copy(data[0:cap(data)], zeroBuf)

        if data, err = r.Response.PackBuffer(data); err != nil {

Note: a recent optimization turns zeroing range loops into memclr calls, so once Go 1.5 lands that will be much faster than copy().
But this was a boring fix! Wouldn't it be nicer if we could trust our library to work with any buffer we pass it? Luckily, this is exactly what coverage-based fuzzing is good for: making sure all code paths behave in a certain way.
What I did then was write a Fuzz() function that would first parse a message, and then pack it to two different buffers: one filled with zeroes and one filled with 0xff. Any differences between the two results would signal cases where the underlying buffer is leaking into the output.
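For reference, the two zeroing approaches look like this side by side. This is a minimal sketch: zeroCopy and zeroRange are hypothetical helper names, and no timings are claimed here. zeroRange is written in the exact "range loop that assigns zero" shape that the compiler optimization applies to.

```go
package main

import "fmt"

// zeroBuf mirrors the big buffer of zero bytes used in the fix above.
var zeroBuf = make([]byte, 65535)

// zeroCopy clears b by copying from zeroBuf, looping so that buffers
// larger than zeroBuf are fully cleared too.
func zeroCopy(b []byte) {
	for len(b) > 0 {
		b = b[copy(b, zeroBuf):]
	}
}

// zeroRange clears b with the range-loop idiom that the compiler can
// turn into a single memclr call.
func zeroRange(b []byte) {
	for i := range b {
		b[i] = 0
	}
}

func main() {
	data := make([]byte, 0, 2048) // same shape as a buffer from the pool
	data = data[:cap(data)]       // a zero-length slice must be extended before clearing
	for i := range data {
		data[i] = 0xff // simulate leftovers from a previous use
	}
	zeroRange(data)
	fmt.Println(data[0], data[len(data)-1])
}
```

Either way, the slice has to be extended to its capacity first, because a pooled buffer with length 0 would otherwise have nothing in range to clear.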

    func Fuzz(rawMsg []byte) int {
        var (
            msg         = &dns.Msg{}
            buf, bufOne = make([]byte, 100000), make([]byte, 100000)
            res, resOne []byte

            unpackErr, packErr error
        )

        if unpackErr = msg.Unpack(rawMsg); unpackErr != nil {
            return 0
        }

        if res, packErr = msg.PackBuffer(buf); packErr != nil {
            return 0
        }

        // Dirty the second buffer: set every bit in the region the packer will use.
        for i := range res {
            bufOne[i] = 0xff
        }

        resOne, packErr = msg.PackBuffer(bufOne)
        if packErr != nil {
            println("Pack failed only with a filled buffer")
            panic(packErr)
        }

        if !bytes.Equal(res, resOne) {
            println("buffer bits leaked into the packed message")
            println(hex.Dump(res))
            println(hex.Dump(resOne))
            os.Exit(1)
        }

        return 1
    }

I wish I could show a PR fixing all the bugs here, too, but go-fuzz did its job almost too well and we are still triaging and fixing what it finds.
Anyway, once the fixes are done and go-fuzz falls silent, we will be free to drop the buffer zeroing step without worry, with no need to audit the whole codebase!
Do you fancy fuzzing the libraries that serve 43 billion queries per day? We are hiring in London, San Francisco and Singapore!
Published 2015-08-06 by Filippo Valsorda (@filosottile). Tags: RRDNS, DNS, Reliability, Tools, Go.