Down to the Wire

A DefCon 2016 Retrospective

Defcon CTF 2016 was held from August 5th to 7th during the annual Defcon conference. This year DARPA chose to host their Cyber Grand Challenge (CGC) — a CTF-style competition between fully autonomous Cyber Reasoning Systems (CRSs) — at Defcon as well, so the Legitimate Business Syndicate oriented their competition around it, allowing the winning machine to compete against the human teams. The new format brought with it several interesting gameplay mechanics as well as a couple of issues, resulting in a fun but occasionally problematic contest. During the competition I played with the Plaid Parliament of Pwning (PPP), with whom I placed first. This is a brief reflection on how the game operated, what succeeded, and what did not.

Overview of the CGC Game Format

The Cyber Grand Challenge game, as designed by DARPA, was meant to be played by autonomous machines, and the design reflects this well. It is an attack-defense style CTF in which each team can throw exploits, submit patches, and view traffic. However, it relies on a rigid API that is well suited to autonomous play. One unique aspect of this game structure is the exploits. Strictly speaking, exploitation is not required in CGC. Instead, teams submit Proofs of Vulnerability (PoVs) that demonstrate the ability to compromise an opponent's service, and these PoVs are verified by an automated referee. They can take one of two forms: a Type-1 PoV requires the attacker to crash the opponent's service with a segmentation fault while controlling EIP and one other general-purpose register at the time of the crash. A Type-2 PoV instead requires the attacker to leak four consecutive bytes from the "secret page," a region of memory spanning 0x4347C000-0x4347D000 (CG\xC0\x00). Submitting either one counts as "proving" the vulnerability and nets the attacking team a set amount of points.
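To make the two proof types concrete, here is a toy sketch (not from any actual challenge; the buffer size, offsets, and values are all invented) of what each one has to accomplish:

```python
import struct

# Toy Type-1 payload for a hypothetical stack overflow. The referee
# negotiates the values it wants to see; the PoV must make the service
# fault with those values in EIP and one other general-purpose register.
NEGOTIATED_EIP = 0xdeadbeef   # hypothetical negotiated EIP value
NEGOTIATED_REG = 0xcafebabe   # hypothetical negotiated register value

payload  = b"A" * 64                          # fill the made-up buffer
payload += struct.pack("<I", NEGOTIATED_REG)  # lands in a saved register
payload += struct.pack("<I", NEGOTIATED_EIP)  # lands in the return address

# A Type-2 proof instead reports four consecutive bytes read out of the
# secret page, which sits at a fixed range in every DECREE process.
SECRET_PAGE_START = 0x4347C000
SECRET_PAGE_END   = 0x4347D000
```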

Another important difference between CGC and a traditional CTF is its use of the DARPA Experimental Cyber Research Evaluation Environment, or DECREE. DECREE is a custom build of Linux that exposes only seven syscalls: terminate, transmit, receive, fdwait, allocate, deallocate, and random. These cover only the most basic functionality, and no other syscalls are available to binaries running on DECREE. The other, less significant, change is to the executable file format. Almost all aspects of it match the ELF format, except that the three bytes spelling "ELF" in the header are replaced with "CGC". These changes are significant enough to ensure that DECREE binaries will not run properly in another environment without some modification.
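Those format changes are shallow enough that a small amount of preprocessing goes a long way. As a minimal sketch of the usual trick, assuming you only want static analysis (the syscall differences still keep the binary from actually running on stock Linux):

```python
import shutil
import sys

def cgc_to_elf(src_path, dst_path):
    """Copy a DECREE binary and rewrite its magic so ELF tools accept it."""
    shutil.copy(src_path, dst_path)
    with open(dst_path, "r+b") as f:
        if f.read(4) != b"\x7fCGC":
            raise ValueError("not a DECREE binary")
        f.seek(1)
        f.write(b"ELF")  # \x7fCGC -> \x7fELF

if __name__ == "__main__":
    cgc_to_elf(sys.argv[1], sys.argv[2])
```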

The actual structure of the game is round-based. An attacker may field one exploit per challenge, per opposing team, per round, and it is run up to ten times; if any of those ten attempts succeed, the attacking team earns the points for that challenge. Conversely, a team earns fewer points for a challenge if it is successfully attacked that round, if its service does not respond correctly to non-exploit poller traffic, or if its service performs poorly. Finally, if a team wishes to patch or otherwise change one of their binaries, they cannot earn any points from that service for a round. As a result, it is important for teams not only to find vulnerabilities quickly, but also to actively defend and patch against attacks. Rounds last anywhere from 5 to 15 minutes, with approximately 160 rounds played in the DefCon finals.
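For intuition, the per-challenge, per-round score multiplies an availability term by a defensive term and an offensive term. The sketch below is modeled on DARPA's published CFE scoring; the exact formula LegitBS used may have differed in its details, but the multiplicative structure is what drives the strategy discussed later:

```python
def round_score(availability, was_proven_against, opponents_proven, num_opponents):
    """Sketch of per-challenge, per-round scoring (CFE-style; details may differ)."""
    # availability: 0.0-1.0, how well the (possibly patched) service handled
    # poller traffic, including any performance penalty from a slow patch
    security = 1 if was_proven_against else 2                   # halved if anyone PoV'd you
    evaluation = 1 + opponents_proven / float(num_opponents)    # credit for your own PoVs
    return availability * security * evaluation
```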

Pre-Competition Preparation

Since CGC is designed to be played by machines, all of the information about the game is published through an open API. However, this API is unwieldy for humans to interact with by hand, so we needed to design tools ahead of time to allow for easier interaction.

Among the most important pieces of software we developed was Hydra, a well-designed and easy-to-use interface to the core CGC API. It served as a way to view both offensive and defensive information about individual binaries, track published actions taken by other teams, and manage our own PoVs and patches. On the whole, it was our main hub for all CGC-related management.

We also developed several tools to aid in the PoV and patching processes. Perhaps the most noteworthy of these was Python-POV, a custom build of Python that could target DECREE. The CGC game format requires that Proofs of Vulnerability be DECREE executables that interact with the challenge binary through file descriptors 0, 1, and 2, and with the game referee through file descriptor 3. By rewriting certain core Python functions and trimming down the available packages, we were able to package a 5-megabyte static Python interpreter with all of our PoVs and write the actual proof functionality in Python. The benefit was that Python's quick scripting syntax and libraries made PoV development incredibly fast and accessible.
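Structurally, a PoV built this way was just a small script shipped next to the interpreter. As a rough sketch (the fd-3 negotiation below is paraphrased from memory of the CGC PoV spec, and leak_secret is a hypothetical stand-in for the challenge-specific logic), a Type-2 PoV looked something like this:

```python
import os
import struct

NEG_FD = 3  # referee negotiation descriptor; fds 0/1 talk to the challenge binary

def negotiate_type2():
    """Declare a Type-2 PoV and receive the region we are expected to leak from."""
    os.write(NEG_FD, struct.pack("<I", 2))
    addr, size, length = struct.unpack("<III", os.read(NEG_FD, 12))
    return addr, size, length

def leak_secret(addr, length):
    """Hypothetical challenge-specific logic: drive the vulnerable service
    over fds 0/1 until it discloses `length` bytes at `addr`."""
    os.write(1, b"input that triggers the leak\n")
    return os.read(0, length)

if __name__ == "__main__":
    addr, size, length = negotiate_type2()
    os.write(NEG_FD, leak_secret(addr, length))  # submit the leaked bytes
```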

Another remarkably helpful tool we developed was Butcher, a general-purpose tool for interacting with the provided network captures. Its primary function was as a replacement for the cb-packet-log program that produces packet capture (PCAP) files from incoming connections. Instead of producing one large conglomerate file for the round, Butcher created a PCAP for every connection to every challenge. On top of this, we included several analysis tools that could provide a color-coded transcript of a connection, allow grep-like searching, and replay a capture against the original binary while checking for discrepancies. Butcher, unlike many of the tools we used, underwent significant development during the CTF as our needs became clearer. Many of these changes are documented further on.
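The per-connection splitting itself is simple enough to sketch. This is not Butcher, just an illustration of the idea using scapy: group packets by their TCP endpoint pair and write each conversation to its own file.

```python
from collections import defaultdict
from scapy.all import rdpcap, wrpcap, IP, TCP

def split_by_connection(pcap_path, out_prefix):
    """Write one pcap per TCP conversation found in a larger capture."""
    buckets = defaultdict(list)
    for pkt in rdpcap(pcap_path):
        if IP in pkt and TCP in pkt:
            a = (pkt[IP].src, pkt[TCP].sport)
            b = (pkt[IP].dst, pkt[TCP].dport)
            buckets[tuple(sorted((a, b)))].append(pkt)
    for i, pkts in enumerate(buckets.values()):
        wrpcap("{}_{:04d}.pcap".format(out_prefix, i), pkts)
```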

In addition to these tools, we also had a number of single purpose tools that aided us in everything from reversing to patching. They acted as an interface to the CGC environment and allowed us to use non-CGC tools nearly transparently. These tools formed the backbone of our personal infrastructure and allowed us to focus on the competition.

Gameplay and Strategy

While the PPP has a number of members who are strong at binary reversing and exploitation, we find it difficult to compete on a purely exploit-driven level. Instead, we relied heavily on teamwork and distribution of labor to facilitate the process of developing active PoVs and maintaining the necessary patches. Toward this end, we had team members who took on semi-dedicated roles in the competition. Among these jobs were PoV development, infrastructure management and information dissemination, exploit reverse engineering and reflection, network analysis, and patch construction.

PoV development was easily our strongest group, with most people in other roles pitching in whenever their primary role was not needed. We typically prefer to develop new exploits whenever possible, as this allows us more time to earn points while people patch the more commonly used vulnerabilities. In a similar fashion, we prefer non-reflectable attacks so that other teams cannot copy our attack and use it for themselves. The result is that we would often not use a vulnerability as soon as we found it, but instead build on it and turn it into a more secure PoV. We also preferred to develop Type-2 PoVs, since they would not log a crash in the CGC database and the defending team would have no indication that they were being exploited other than the point differential. Following this, one of our best exploits was developed by a member who spent all night writing shellcode in PPC — which was being emulated by the program — that could communicate back to the PoV and could not be easily reflected. This ended up being one of our most useful exploits, as we were the only team to develop a patch for it, and to the best of our knowledge, no team managed to reflect it in any form. Over the course of the game we found and developed a large number of PoVs using these general guidelines.

On the defensive side, we developed a tight pipeline for detecting and reflecting exploits thrown at us. As mentioned previously, one tool that underwent significant development over the course of the competition was Butcher, our packet analysis tool. Since each team could throw an exploit at each challenge up to ten times, and the referee had to poll each application roughly 300 times a round, we saw approximately 400 to 450 PCAPs per challenge per round. Given that there tended to be about four challenges available at any given time, and each round lasted, on average, 8 minutes, we were receiving roughly 300 connections every minute. This is a tremendous amount of data to go through, and we realized early on that we needed a better way to process it. Starting shortly after the official competition start and continuing through the last day, we wrote b-suspect, a sub-tool of Butcher that would automatically classify and sort PCAPs. For every round and challenge, it would look at all of the connections we received and coalesce those that came from a single source into a bucket. Once it had organized them, it used a heuristic based on a number of different characteristics to rank the buckets by likelihood of containing an exploit. The result was a command-line interface that could print a ranked list of buckets, with indicators explaining how each rank was assigned, and that exposed many of the other tools built into Butcher from the same REPL. The final version allowed people to effectively analyze every single connection that touched our system. Once a suspicious connection was found, the client data could be passed off to another team member who would analyze it and verify whether it exercised a vulnerability. If it was in fact a PoV, they would begin the analysis process and try to develop a reflection and a patch. If we were able to reflect it successfully (no attempts were made to obfuscate reflected PoVs), it would be deployed against every team that was not already being hit. By the end of the competition, we could go from being attacked to reflecting the PoV in about 15 minutes.
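To give a flavor of the ranking idea (the real heuristics in b-suspect were more involved; the features and weights below are invented purely for illustration), the core of it boils down to scoring each bucket's client data and sorting:

```python
import string

PRINTABLE = set(string.printable.encode())

def suspicion_score(client_bytes):
    """Score one bucket's client-to-server data; higher means more exploit-like."""
    score, reasons = 0.0, []
    nonprintable = sum(1 for b in client_bytes if b not in PRINTABLE)
    if nonprintable > 0.3 * max(len(client_bytes), 1):
        score += 2.0
        reasons.append("mostly binary data")
    if b"\x47\x43" in client_bytes:  # little-endian tail of a 0x4347xxxx address
        score += 3.0
        reasons.append("possible secret-page address")
    if any(len(line) > 512 for line in client_bytes.split(b"\n")):
        score += 1.5
        reasons.append("unusually long input line")
    return score, reasons

def rank_buckets(buckets):
    """buckets: {bucket_id: client bytes}; returns buckets sorted by suspicion."""
    scored = [(bid,) + suspicion_score(data) for bid, data in buckets.items()]
    return sorted(scored, key=lambda entry: entry[1], reverse=True)
```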

One of our biggest surprises came in the form of patches. Despite a running joke that everyone would steal PPP's patches, we didn't really expect it to happen. We had developed excellent patching infrastructure, and so with all of our patches we shipped a relatively unhidden backdoor. We surmised that it would serve as a deterrent against patch theft and, in a few rare cases, provide some free points. However, much to our shock, once we started shipping the backdoored patches, teams began applying them without modification to their own challenge binaries. Some teams modified them just enough to change the checksum, but thanks to a tool we developed that could automatically test PoVs against patched binaries, this did not affect our ability to use the backdoor. Talking to teams after the competition, it seems that several of them actually discovered the backdoor but decided to deploy the patch regardless, reasoning that it was better to ensure only one team had access. This may have been well-founded: in the few cases where teams did notice the backdoor and reverted to an earlier patch, we almost always had an actual PoV ready to use against them. It is less clear whether other teams employed backdoors, since the only times we looked at other teams' patches were when we needed a team-specific exploit; a cursory analysis suggested that two other teams did produce them. Factoring in how quickly we could patch both the bugs we discovered and the PoVs thrown against us, using our mega-patches may well have worked better for those teams than relying on their own or another team's patch.

Lessons Learned from a First Time DefConer

Given that I joined the PPP only a year ago, this was my first opportunity to play with them at an event like DefCon. As such, it was incredible for me to see the team working together at full capacity, and a few things stood out. This was the first CTF I have played in where the preparations lasted longer than the competition itself. We began work on infrastructure about three weeks before DefCon and continued working on it through the end of the competition. While none of the tools did our job for us, they freed us from the slow, menial tasks that tend to consume so much time. One rather emphatic member of our team kept insisting that the future of security lies in good tools, and after this competition I believe he may have a point. For me certainly, as a lead developer on Butcher, this competition was as much an engineering challenge as it was a security one.

It was also interesting to see how this unique format accentuated some of the more compelling aspects of the CGC game style. In a CGC game, players have to take much greater care to balance offense and defense. While this is a requirement in any attack-defense CTF, CGC reduces your offensive score when your defense is failing, so it becomes much harder to decouple the two. As a result, it encourages tight teamwork and the integration of every aspect of gameplay into the decisions that are made. Encouraging a team to work together like a fully-fledged Cyber Reasoning System ends up uniting the group in a way few other CTFs do. In this sense, I really liked the CGC format.

However, as a system designed for massively parallel computers, CGC has many drawbacks when played by humans. Forcing teams to eat a round of downtime when they apply a patch adds a tremendous amount of meta-gaming around patch deployment. In the original Cyber Grand Challenge, CRSs won and lost on this availability score, which accounts not only for downtime between patches but also for service unavailability caused by poor patching. Having watched the DARPA contest, we had a good feel for the importance of not over-patching, but many teams did not. While any attack-defense CTF balances CTF challenges against the meta-game, this heavy reliance on availability shifted the game further toward the meta than I would prefer. Each person will have their own preferences, but I enjoy CTF puzzles more than CTF game theory.

Another common complaint about this format is the lack of true exploitation. While my descriptions of the game have been rather lax about calling PoVs "exploits," they are ultimately indicators of a potential exploit rather than exploits themselves. The end result is a set of problems that leans heavily toward buffer overflows and other kinds of unbounded memory access. For many bugs, in fact, there was no need for shellcode or arbitrary execution: the attacker could simply persuade the program to give up the flag. For a CRS, this type of problem makes sense; it can easily be fuzzed and the end goal is well defined. For humans, however, it often results in problems that are unexciting to exploit and rely primarily on reversing. This was not universally true, and many of the challenges had deeper bugs that did require significant exploit development, yet the competition was short enough that many teams did not find even the shallow bugs, so searching for the deeper ones yielded diminishing returns. Unfortunately, this seems to be inherent to CGC, not simply a poor choice by the organizers.

The most direct issue with using CGC came down to sheer numbers. Every round, the referee had to execute all teams' PoVs, deploy all of the poller traffic, push updated files, serve the network captures, and store everything for later review. By the end of the game, eight problems had been released, five of which were still actively available. As mentioned previously, every challenge binary received about 400 incoming connections per team, each of which was allowed to run for several seconds. Even if we assume an average of 750 ms per connection (to account for the many programs that loop forever instead of quitting), that works out to over 6 CPU-hours per round. To run their competition, DARPA brought in seven water-cooled supercomputers; for Defcon CTF the organizers could only acquire two small racks. Rounds that were intended to last 5 minutes were taking roughly 13 minutes to simulate by the end of the game. In one instance, an uploaded binary crashed the system and a single round had to be re-simulated 16 times. For their part, the Legitimate Business Syndicate made the best of what they had. Unfortunately for them and for the competitors, it simply was not enough.
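For a rough sense of scale, assuming fifteen teams on the floor (fourteen human teams plus Mayhem; the team count is my assumption, the other numbers come from above), the back-of-the-envelope math works out like this:

```python
teams = 15          # assumption: fourteen human teams plus Mayhem
challenges = 5      # challenges still live at the end of the game
connections = 400   # incoming connections per team, per challenge, per round
avg_seconds = 0.75  # generous average runtime per connection

cpu_hours = teams * challenges * connections * avg_seconds / 3600.0
print(cpu_hours)    # ~6.25 CPU-hours of simulation for a single round
```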

Finally, while I do not wish to spend much time discussing operational failures, there were a number of them, and they serve as a good reminder to test and verify everything before connecting players to infrastructure. While these bugs only minimally affected the human teams, they ended up crippling Mayhem, the competing CRS.

In spite of this, I would like to congratulate the ForAllSecure team. Even though Mayhem was not receiving a significant amount of information about the game, it still managed to throw exploits and develop patches faster than many teams (ourselves included) were able to.

Final Thoughts

While scores have not yet been published, and we do not know the full details about how teams performed, we do know that Defkor placed third after having taken an early lead, B1o0p placed second after a strong showing on the second day, and the Plaid Parliament of Pwning finished in first with what we perceived to be a narrow lead. Without a doubt, Defcon CTF 2016 was one of the most fun CTFs that I have played in. Using CGC was frustrating at times, but with the proper preparation and team unity we were able to overcome many of the challenges that it presented as a format. Furthermore, I am excited by the fact that our team played against not only other human teams, but also a fully functioning Cyber Reasoning System which, despite being crippled for most of the weekend, performed remarkably well. This game was incredibly close, and all of the teams played exceptionally well. I am thrilled that I was able to compete with and against so many amazing teams, and I look forward to Defcon 2017.

Update (September 6th)

LegitBS has posted scores from throughout the competition. In a few days they should release the full data set, but for now they have provided this graph of scores over time.

One interesting thing this graph demonstrates is that around rounds 60 and 135, the starts of days two and three respectively, there is a noticeable uptick in the slope of our score. It shows how productive our evenings were, and the crucial difference they made in the outcome of the competition.