
Time-Travel Debugging in PANDA

During a weekend hackathon with some of the Lincoln Lab maintainers of PANDA, I implemented a really useful feature — time-travel debugging!

As has been discussed in Ret2Systems’ great blog post, time-travel debugging is an invaluable tool in the reverse engineer’s arsenal. While Mozilla’s brilliant rr is the dominant choice for Linux user-space binaries and WinDbg Preview covers Windows binaries, PANDA can debug both user and kernel space on either system.

In this blog post, I’ll talk about the simple design behind reverse execution and demonstrate its utility by root-causing a Linux kernel n-day.


Design

PANDA, which is built on the QEMU emulator, can record and replay a full system across a variety of guest architectures. Non-deterministic events such as hardware interrupts, timestamp reads, etc. are recorded in a logfile and read back during replay, synchronized by the guest instruction index. This lets you capture and analyze a recording of a whole system, but until now, only in the forward direction.

QEMU also contains a GDB stub that can talk to a remote GDB client. We extended this debug mechanism to enable reverse execution, and added other useful debugging commands inspired by Mozilla rr.

Reverse execution

We implemented the two commands reverse-step and reverse-continue in the simplest possible way given existing mechanisms.

To be able to time-travel, we wrote a checkpointing API built on QEMU’s savevm. Checkpoints are just snapshots of all guest state. When checkpointing is enabled, checkpoints are taken periodically, with spacing that depends on how much space you make available in RAM or on disk to store them.
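
As a rough sketch of that policy (the names below are illustrative, not the plugin’s real interface):

/* Hypothetical sketch: snapshot full guest state every N instructions. */
#include <stdint.h>

void save_snapshot(uint64_t instr_index); /* savevm-backed full-state snapshot */

#define CHECKPOINT_INTERVAL 100000000ULL /* made-up spacing; really set by the RAM/disk budget */

void maybe_checkpoint(uint64_t instr_index) {
    if (instr_index % CHECKPOINT_INTERVAL == 0)
        save_snapshot(instr_index);
}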

When reverse-step is invoked, PANDA sets an internal breakpoint on the previous guest instruction count, restores the latest checkpoint before that point, and runs forward until the breakpoint is hit.
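
In rough pseudocode (again, the helper names are illustrative, not PANDA’s actual API):

#include <stdint.h>

uint64_t current_instr_index(void);           /* current guest instruction count */
uint64_t nearest_checkpoint_before(uint64_t); /* latest snapshot at or before an index */
void restore_checkpoint(uint64_t);            /* loadvm-style state restore */
void run_until(uint64_t);                     /* replay forward to an instruction index */

void reverse_step(void) {
    uint64_t target = current_instr_index() - 1;
    /* Rewind to the last snapshot before the target... */
    restore_checkpoint(nearest_checkpoint_before(target));
    /* ...then replay forward, stopping on the target instruction count. */
    run_until(target);
}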

reverse-continue is a little more complex. It requires at least two restores to reach the most recent breakpoint behind the current position: on the first run forward from the latest checkpoint, PANDA doesn’t stop at any breakpoints, but records the last breakpoint encountered. Then, it restores to that checkpoint again and runs until that last breakpoint.

Here’s this case, represented pictorially:

[Figure: reverse-continue restores the latest checkpoint, runs forward once to record the last breakpoint hit, then restores and runs forward again to stop on it.]

However, if no breakpoints were encountered, PANDA restores to the previous checkpoint and repeats the procedure, until it either finds a breakpoint to halt on or reaches the beginning of the replay.
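
Putting the two cases together, the search loop looks roughly like this (same illustrative helpers as in the reverse-step sketch above):

#include <stdint.h>

#define NO_HIT UINT64_MAX

uint64_t current_instr_index(void);
uint64_t nearest_checkpoint_before(uint64_t);
uint64_t previous_checkpoint(uint64_t);
void restore_checkpoint(uint64_t);
void run_until(uint64_t);
/* Pass 1: replay forward to `stop` without halting, returning the index of
 * the last breakpoint/watchpoint hit along the way (or NO_HIT if none). */
uint64_t run_recording_last_hit(uint64_t stop);

void reverse_continue(void) {
    uint64_t stop = current_instr_index();
    uint64_t ckpt = nearest_checkpoint_before(stop);
    for (;;) {
        restore_checkpoint(ckpt);
        uint64_t last_hit = run_recording_last_hit(stop);
        if (last_hit != NO_HIT) {
            restore_checkpoint(ckpt); /* pass 2: stop exactly on the hit */
            run_until(last_hit);
            return;
        }
        if (ckpt == 0)                /* reached the beginning of the replay */
            return;
        stop = ckpt;                  /* no hits: search the previous window */
        ckpt = previous_checkpoint(ckpt);
    }
}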

This means that you can use watchpoints and breakpoints in reverse just like normal! We also added several GDB utility commands, including when, which prints the current guest instruction count, and rrbreakpoint, which sets a breakpoint on a guest instruction count. You can peruse the full list here.
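
For example, a session with these commands might look like the following (the instruction counts and output format here are made up for illustration):

(gdb) when
Current instruction count: 140220515
(gdb) rrbreakpoint 140000000
(gdb) reverse-continue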

Using time-travel to root cause a Linux kernel n-day

To demonstrate an exciting use of PANDA time-travel, I figured it would be hip and cool to show off this capability with a root-cause analysis of a Linux kernel bug.

CVE-2014-3153

CVE-2014-3153 was famously used in geohot’s towelroot exploit to root the Samsung Galaxy S5. Broadly, it involves a bug in Linux’s fast userspace mutexes, or futexes: a certain sequence of a few interleaved futex syscalls will cause memory corruption. Futexes are basically locks that can have a list of waiters with different priority levels. Each futex is associated with a userspace int, uaddr, which stores the thread id of the thread, if any, that holds the lock. See the references for more background info, but it’s not necessary for this post.

Using a PoC borrowed from this fantastic writeup, I’ll show how to record it under PANDA and use reverse-debugging to identify the vulnerability. Here’s the simplified PoC, with operations ordered by the numbers in the comments (in a real run you’d have to enforce this ordering with synchronization between the threads).

int uaddrA = 0, uaddrB = 0;

void* requeue_thread(void* arg){
    int ret = futex_wait_requeue_pi(&uaddrA, &uaddrB); // 2. waits until uaddrB is freed
    return NULL; // 6.
}

int main(){
    int ret = futex_lock_pi(&uaddrB); // 1. Associates uaddrB with a futex and locks it
    pthread_t t;
    pthread_create(&t, NULL, requeue_thread, NULL);

    futex_requeue_pi(&uaddrA, &uaddrB, uaddrA); // 3. Adds thread 2 to list of waiters on uaddrB

    uaddrB = 0; // 4. ??

    futex_requeue_pi(&uaddrB, &uaddrB, uaddrB); // 5. ??

    return 0; // 7.
    // 8. eventually crash?!
}

The futex_wait_requeue_pi man page basically says that the caller will wait on uaddrA and then, once requeued, wait to acquire uaddrB. futex_requeue will take care of either adding the waiting thread to uaddrB's wait list or waking it up.
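
As an aside, glibc exposes no futex wrappers, so helpers like the ones in the PoC presumably wrap the raw syscall. Here's a minimal sketch of what they might look like (the exact flags and arguments in the original PoC may differ):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static int futex_lock_pi(int *uaddr) {
    return syscall(SYS_futex, uaddr, FUTEX_LOCK_PI, 0, NULL, NULL, 0);
}

static int futex_wait_requeue_pi(int *uaddr, int *uaddr2) {
    /* Wait on uaddr (expected to hold 0), to be requeued onto the PI futex uaddr2. */
    return syscall(SYS_futex, uaddr, FUTEX_WAIT_REQUEUE_PI, 0, NULL, uaddr2, 0);
}

static int futex_requeue_pi(int *uaddr, int *uaddr2, int val) {
    /* Requeue up to one waiter from uaddr onto uaddr2; val is the expected *uaddr. */
    return syscall(SYS_futex, uaddr, FUTEX_CMP_REQUEUE_PI, 1, (void *)1, uaddr2, val);
}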

Making a recording

First, let’s make a PANDA recording of the PoC and the resulting crash. I’m using an x86 qcow2 image running Linux kernel 3.2.51.
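
For completeness, a recording like this can be captured from the QEMU monitor with PANDA’s record commands, roughly as follows (the recording name matches the replay command below):

# boot the guest, then, in the QEMU monitor:
(qemu) begin_record towelroot_poc
# ... run the PoC inside the guest until it panics ...
(qemu) end_record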

As you can see, we got an invalid opcode kernel panic in rt_mutex_top_waiter, which grabs the waiter at the head of a futex’s wait list. rt_mutexes are just another type of lock that’s used to help implement futexes. We can use time-travel on the replay to work out how this happened.

Loading kernel symbols and isolating the panic

To start the replay and wait for a GDB attach, we just need to run:

{path_to_qemu}/qemu-system-i386 {path_to_qcow}/wheezy_panda2.qcow2 -replay towelroot_poc -S -s -panda checkpoint

Next, connect GDB with gdb -x ~/panda/panda/scripts/gdbinit -ex 'target remote localhost:1234'. To get the kernel source and debug symbols, you can download them from this mirror and this link, then run these commands in GDB:

dir /home/raywang/linux-3.2.51  
set substitute-path /build/linux-n2St39/ /home/raywang/
symbol-file /home/raywang/vmlinux-3.2.0-4-686-pae

If you have symbols for a user binary, you can add more symbol files if you specify the load address, like add-symbol-file towelroot_poc 0x080486c0.

Let’s begin by setting a breakpoint on rt_mutex_top_waiter(). We can see exactly where the illegal ud2 op is generated, resulting in the panic we saw above.
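
Concretely, that’s the usual GDB workflow, which works unchanged on the replay:

(gdb) break rt_mutex_top_waiter
(gdb) continue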

So, the offending line is

=> rtmutex_common.h:72
        w = plist_first_entry(&lock->wait_list, struct rt_mutex_waiter,
                              list_entry); // grabs first waiter on lock's wait_list
        BUG_ON(w->lock != lock); // sanity check

So why does the sanity check w->lock != lock fail? If we print the rt_mutex_waiter object w, we can clearly see that it’s corrupted.

74              BUG_ON(w->lock != lock);  
=> 0xc105cd06 <rt_mutex_top_waiter+3>: cmp DWORD PTR [edx+0x20],eax
0xc105cd09 <rt_mutex_top_waiter+6>: je 0xc105cd0d <rt_mutex_top_waiter+10>
0xc105cd0b <rt_mutex_top_waiter+8>: ud2

(gdb) p/x $eax
$4 = 0xc75e9da8
(gdb) p/x *w
$6 = {list_entry = {prio = 0xc5230000, prio_list = {next = 0xc7550ae0, prev = 0xc5d0a1c0}, node_list = {next = 0x0,
prev = 0xc5230000}}, pi_list_entry = {prio = 0xc103b7fc, prio_list = {next = 0x8, prev = 0xc5231ef4}, node_list = {
next = 0xc5d0a1c0, prev = 0xc5230000}}, task = 0xc104565c, lock = 0xc5391580}

Notice the nonsensical priority list fields, such as the int priority prio = 0xc5230000 and the list pointer next = 0x8, as well as the incorrect lock value.

Let’s set a watchpoint on this rt_waiter to see where it came from and how it’s modified. We also set breakpoints on the kernel’s futex handlers, futex_requeue() and futex_wait_requeue_pi().
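
In GDB terms, that’s something like the following (the watched expression is illustrative; at the faulting compare above, $edx held the pointer to w):

(gdb) watch -l *(struct rt_mutex_waiter *)$edx
(gdb) break futex_requeue
(gdb) break futex_wait_requeue_pi
(gdb) reverse-continue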

Returning to source

Going backwards, we can see that several unrelated kernel functions, like apic_timer_interrupt, __do_fault, etc., are overwriting the rt_waiter, which suggests it’s a dangling object.

But if we reverse-continue (rc) a few more times, we can get back to the first call to futex_requeue, which actually iterates over all queued waiters and adds them to uaddrB's wait_list, via rt_mutex_start_proxy_lock and task_blocks_on_rt_mutex. But where is the waiter initialized? We can rewind even further, using a watchpoint on the entire waiter object, to see where it is declared.

Aha! We’ve found the origin of the waiter! It’s allocated on the stack of futex_wait_requeue_pi, which makes sense, because futex_wait_requeue_pi’s thread will be the waiter for the uaddrB futex. So this stack object must be dangling after futex_wait_requeue_pi returns.

Pinning down the bug

So, we now need to figure out what caused futex_wait_requeue_pi to return while the waiter it created is still in use. To do that, let’s set a breakpoint at the end of futex_wait_requeue_pi and step backwards.
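
Using the futex.c line numbers that appear in the snippet below, that might look like:

(gdb) break futex.c:2353
(gdb) continue
(gdb) print q.rt_waiter
(gdb) reverse-step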

As you can see, after this thread is awakened from futex_wait_queue_me, its q.rt_waiter is NULL, but before it went to sleep, it was set with q.rt_waiter = &rt_waiter. This is, in fact, very bad: because q.rt_waiter is now NULL, futex_wait_requeue_pi thinks that its waiter has already been cleaned up, so it doesn’t bother to remove the waiter itself and leaves the dangling rt_waiter on the wait_list.

/* Here q->rt_waiter is NULL, so this branch is taken */
if (!q.rt_waiter) { // line 2353
    ...
} else { // line 2363
    ...
    /* Skips this branch, which would have removed rt_waiter from the wait list */
    ret = rt_mutex_finish_proxy_lock(pi_mutex, to, &rt_waiter, 1);
    ...
}

Finally, let’s figure out who set q.rt_waiter to NULL.
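
Once more, a data watchpoint plus reverse-continue does the trick:

(gdb) watch -l q.rt_waiter
(gdb) reverse-continue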

It’s the call to requeue_pi_wake_futex, which is called when the main thread’s second futex_requeue attempts to wake thread 2.

Reverse-stepping into the calling function, we can determine exactly why futex_requeue decides to wake thread 2: futex_lock_pi_atomic checks whether the uaddrB futex is free with an atomic cmpxchg. If uaddrB is 0, the cmpxchg succeeds and futex_lock_pi_atomic thinks it has acquired the futex, so the main thread thinks it is time to wake up thread 2.
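
In rough outline, the fast path looks like this (paraphrasing the 3.2-era logic, not the kernel source verbatim):

#include <stdint.h>

/* Stand-in for the kernel's atomic compare-exchange on the futex word;
 * returns the value that was in *uaddr before the exchange. */
uint32_t cmpxchg_futex_value(uint32_t *uaddr, uint32_t oldval, uint32_t newval);

/* If the futex word is 0 (apparently unowned), atomically claim it with the
 * waiter's TID; success convinces futex_requeue to wake the waiter directly. */
int futex_lock_pi_atomic_sketch(uint32_t *uaddr, uint32_t waiter_tid) {
    if (cmpxchg_futex_value(uaddr, 0, waiter_tid) == 0)
        return 1; /* lock "acquired": wake the requeued waiter */
    return 0;     /* still owned: the waiter must block */
}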

And of course, our PoC did exactly that: it set uaddrB to 0 in step 4, right before re-calling futex_requeue in step 5, hijacking the control flow of futex management.

Summary

To summarize, we used time-travel debugging under PANDA to record a kernel bug and explore it deterministically. We found that that mysterious uaddrB = 0 line caused all this trouble. By making futex_requeue think that uaddrB was free and had no more waiters, that line caused thread 2 to wake up in an inconsistent state. This left a dangling rt_waiter on uaddrB's wait list that eventually got corrupted, triggering a kernel panic.

You can take a look at the patch for this bug, which simply checks for a dangling waiter instead of blindly trusting the userspace uaddr value.

Future improvements

The above example demonstrated a powerful application of time-travel debugging, and more features should make it even more valuable. Support for ARM and PowerPC is on the horizon, and I’d like to reach command parity with Mozilla rr.

Performance

The most expensive part of time-travel in terms of space is checkpointing, aka making snapshots. Every checkpoint takes up as much memory as the guest VM has access to. It would be nice to add a delta-checkpoint mechanism to QEMU that stores smaller diffs between checkpoints, which could greatly reduce the space requirements.

The slowest part is the forward pass through every checkpoint, which could take a while depending on how widely spaced your checkpoints are. Also, if you give rsi a repeat count like rsi 20, GDB will naïvely run the rsi command 20 times, which involves 20 checkpoint restores. These perf issues should be addressed in future iterations.

While it’s still a work in progress and does have some instability, I invite you to try it out on your next full-system debugging task. Happy reversing!


References

https://appdome.github.io/2017/11/23/towelroot.html
https://github.com/SecWiki/linux-kernel-exploits/tree/master/2014/CVE-2014-3153