DoSing a Public NetHack Server

What AFL found

During fuzzing (as described here), AFL reported a hang on this .nethackrc line:

MENUCOLOR=0+++++++++++++++++++++++++++++++++++++++++=blue

This line is saying that if a menu item matches the regular expression (in this case, one or more ‘0’ characters), then that line should be colored blue in the UI. Only one ‘+’ character is needed to express this, but if you string many ‘+’ characters together, NetHack will hang. I was also able to reproduce the hang with MSGTYPE and AUTOPICKUP_EXCEPTION lines.

So why does it hang? NetHack 3.6/3.7 has a pluggable API for regular expression engines in sys/share. Three implementations of this API are included in the source:

  1. posixregex.c – Uses the POSIX API from /usr/include/regex.h, which is implemented via the platform’s libc. This is the default on Unix.
  2. cppregex.cpp – Uses the C++ standard library available in <regex>, which is implemented via the platform’s libc++. This the default on Windows. Obviously, a C++ compiler is required.
  3. pmatchregex.c – This is a very limited implementation (internally named pmatch) which does not rely on any external libraries. It’s not anywhere close to a real regular expression engine, but it does support the * and ? special characters.

I was testing on Linux, so the regular expression was passing through posixregex.c to regcomp(3) in glibc. glibc’s regular expression engine was consuming all available system memory and tying up the CPU indefinitely. I eventually tracked this down to glibc bug 20095, which has been open since 2016. As described in the bug report, for every repeated + in the regular expression, parse_dup_op duplicates the parse tree. This of course leads to an exponential increase in memory consumption. I had a brief discussion about this with glibc developer Andreas Schwab in bug 25814. Schwab pointed out that POSIX says that multiple adjacent duplication symbols produce undefined results. He’s right, but the behavior still seems pathological.


Variants

NetHack is a heavily forked game, and many of those forks are available on public servers just like the base game is. So when looking at a potential security issue, it’s worth checking these variants too. There are three main branches of variants:

  • Descendants of NetHack 3.6 (EvilHack, SpliceHack, xNetHack)
  • Descendants of NetHack 3.4.3 (SLASH’EM, GruntHack, UnNetHack, etc.)
  • Nitrohack and its descendants (NetHack 4, NetHack Fourk, FIQHack, DynaHack, etc.)

The NetHack 3.6 descendants use nearly identical .nethackrc parsing code and exhibit the bug.

Most NetHack 3.4.3 descendants are also susceptible to the MENUCOLOR bug, though not the MSGTYPE and AUTOPICKUP_EXCEPTION versions. Menu coloring was a popular source code patch and it was applied to most variants in the 10 years between official 3.4.3 and official 3.6. Most of those variants use the GNU API (re_comp(3) and re_exec(3)) instead of the POSIX API, but the underlying glibc implementation is the same.

Curiously, the Nitrohack variants are not susceptible to the bug, despite supporting regular expressions in their configuration files. That’s because they never integrated with an external regular expression engine. They use the limited pmatch code exclusively.


Is this a security bug?

With a one line .nethackrc file, an attacker can cause a public server to consume all available memory and CPU. That’s the classic definition of a Denial of Service (DoS) attack. Turns out DoS attacks via regular expressions are so common that they have their own name: ReDoS attacks. And it gets worse. As described in the Wikipedia article, there are many “evil” regular expressions, so a targeted patch for just this input would not be enough. Looking through the glibc bug database I quickly found two more inputs that can cause a hang on pattern compilation, or a crash on pattern matching (during gameplay): bug 12896 and bug 17070. Undoubtedly many more exist.


RE2

So this is a security bug, and clearly it’s in glibc not NetHack, so why hasn’t the glibc team fixed it? Well again, even if they did fix this one bug, many more exist. And because of that prevalence of ReDoS bugs, the glibc team has washed their hands of the security issues entirely:

Implementing regular expressions efficiently, in a standard-conforming way, and without denial-of-service vulnerabilities is very difficult and impossible for Basic Regular Expressions. Most implementation strategies have issues dealing with certain classes of patterns.

 

Consequently, resource exhaustion issues which can be triggered only with crafted patterns (either during compilation or execution) are not treated as security bugs.

In other words, if you as a client of glibc consider a regular expression input as a security bug, it is your responsibility to mitigate that bug in the application code. Frustratingly, this information does not even appear in the glibc manual! I filed glibc bug 25974 requesting this. So now what? Certainly pmatch is still an option, but can we do better?

The main problem, as described in this paper, is the use of backtracking in most regular expression engines, including glibc’s. A follow-on paper describes how Google faced this problem when developing Code Search, and ultimately created a new regular expression engine named RE2 that avoids backtracking, and so can safely accept arbitrary regular expressions over the internet. A caveat is that RE2 does not support backreferences, which are a POSIX requirement, because they can never be implemented efficiently and would always be a ReDoS attack vector. The glibc hand-washing text wasn’t lying when it said that DoS vulnerabilities are impossible to avoid for standards-conforming engines. Fortunately for NetHack, backreferences are not an important use case, so RE2 looks like a great solution. I mapped RE2 to NetHack’s regular expression API here.


Counterpoint

I shared my analysis of this bug with the DevTeam and many variant authors on April 18, 2020. To date, however, none have adopted RE2. And, perhaps surprisingly, I don’t necessarily disagree with their decisions, though I still think RE2 is a great option and more people should know about it.

First of all, a DoS attack that targets a server’s memory and CPU can be mitigated with modern OS resource management, like cgroups(7). Indeed configuring cgroups is probably a good idea for any public server for any game or service, even if there are no known vulnerabilities. It’s another layer of protection.

And if other mitigations (like cgroups) do exist, it then may be more important to prioritize portability and backward-compatibility. NetHack endeavors to run on many different systems and in the smallest possible footprint. RE2 may not easily build on those systems, or may be too large to be practical. And although I can’t really imagine a NetHack regular expression that really requires backreferences or another regex feature RE2 doesn’t support, those features are available today and removing them could invalidate existing .nethackrc files.

Second, what’s the real-world impact of DoSing a public NetHack server? (I hesitate to use the phrase “business impact” because there is no business here.) Many servers are hosted on AWS EC2, so initially at least the servers would stay up, but the AWS bill rates would increase. Multiple, simultaneous attacks would be necessary to bring down a server, and each individual attack would require a separate registration. The server admins could disable the problematic accounts, or even block new account creation entirely for a period of time. Even if the servers did come down, NetHack is not a business-critical service the way Code Search was intended to be. My previous NetHack security reports have been buffer and integer overflows that led to arbitrary remote code execution. Such vulnerabilities could make a public server part of a botnet. Nothing like that is possible here.

Finally, I want to share (with permission) some feedback I received from ais523, the author of NetHack 4:

…there are denial of service attacks available against NetHack and its variants even without exploits, simply via giving a very large number of game commands; because this is part of the program's intended functionality, it's something that is inherently quite hard to fix at the software level. Instead, server administrators tend to deal with such problems via diagnosing excessive resource usage. (An illustrative example is the game on nethack.alt.org that attempted to gain a 1-turn ascension via overflowing the turn counter (involving playing 232 turns); it was killed either by the server administrators or by a program that they had installed, and then recovered, several times. I can't remember what happened to it eventually, but simulating 4 billion turns is inherently going to use up a lot of server resources.)

If you're interested in the "denial of service using legitimate in-game commands" angle of things, you may want to read the NetHackWiki page <https://nethackwiki.com/wiki/1-turn_ascension#Bug_exploit>. It estimates that if the in-game state is set up to maximise the ratio of in-game turns elapsed to the amount of effort for the user playing, overflowing the turn counter will require 505200 copies of a 22-byte string, i.e. 11114400 bytes or about 10.6 MiB, and will take about 19 days of CPU time to execute. Thus, in this case, the resources required by the attacker will be trivial compared to the resources expended by the server to deal with them.

It’s a fair point. I’ve focused my research almost exclusively on .nethackrc and argv parsing. The game accepts a lot more inputs than that! There are standard tools for filtering out DoS attacks via network saturation, such as AWS Shield, but something hand-crafted like this sounds like it would defeat them. Perhaps we have to accept that DoS attacks are inevitable and just focus on second-order mitigations. (Ironically, this sounds a lot like the glibc team’s hand-washing!)


So why again are we talking about this?

I deliberated a long time before contacting the DevTeam and variant authors on this issue because I sensed many of the counterpoint arguments upfront. Eventually, I decided that it’s important to disclose and discuss security issues even if taking no action is the appropriate response. If nothing else, more people should know about the prevalence of ReDoS attacks and how the computing community has seemingly come to accept them as a cost of doing business. I also think that RE2 is an amazing library that solves a really hard problem and should be more widely used.


Acknowledgements

Many thanks to Dan Price who talked through this problem with me at length, pointed me at Russ Cox’s research on RE2, and reviewed a draft of this post. Thanks to ais523 for sharing his insights and agreeing to be quoted in this post.

Comments

Popular posts from this blog

Dungeon Crawl Stone Soup

NetHack 3.6.6, or, How to Glitch NetHack

Why NetHack?