Hacking the CPython virtual machine to support bytecode debugging

As you may know, Python is an interpreted programming language. By Python, I am referring to the standard implementation, i.e. CPython. Being interpreted means that Python code is never executed directly by the processor. Instead, the Python compiler converts the source code into an intermediate representation called bytecode. The bytecode consists of instructions which are interpreted at runtime by the CPython virtual machine. For the nitty-gritty details, refer to ceval.c.

Unfortunately, the standard Python implementation does not provide a way to debug the bytecode while it is being executed on the virtual machine. You may ask why that is even needed when you can already debug Python source code using pdb and similar tools. Also, gdb 7 and above supports debugging the virtual machine itself, so bytecode debugging may seem unnecessary.

However, that is only one side of the coin. Pdb can be used for debugging only when the source code is available. Gdb can certainly debug without the source, as we are dealing directly with the virtual machine, but it is too low-level for our tasks. This is akin to finding bugs in your C code by using an In-Circuit Emulator on the processor. Sure, you would find bugs if you had the time and patience, but it is unusable for most of us. What we need is something in between: a tool which can debug without source, is not too low-level, and understands the Python-specific implementation details. Further, it would be the icing on the cake if this system could be implemented directly in Python code.

Implementation details of a source code debugger

Firstly, we need to know how a source code debugger is implemented with respect to Python. The de facto Python debugger is the pdb module. Its Pdb class is derived from the Bdb class in the bdb module. Pdb provides a command-line user interface and is a wrapper around bdb. Now, both pdb and bdb are coded in Python. The main debugging harness in CPython is implemented right within the sys module.

Among the many utility functions in the sys module, settrace allows us to provide a callback function which, just as its name suggests, can trace code execution. Python will call this function in response to various events: when a function is called, when a function is about to return, when an exception is raised, or when a new line of code is about to be executed. Refer to the documentation of settrace for the specifics.
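To make the callback mechanism concrete, here is a minimal sketch of a trace function that simply records every event it receives. It is written for Python 3 rather than the Python 2 used in the rest of this post, and the names tracer, square and events are made up for illustration.

```python
import sys

events = []

def tracer(frame, event, arg):
    # Record the kind of event, the function it happened in and the line number.
    events.append((event, frame.f_code.co_name, frame.f_lineno))
    # Returning the tracer asks Python to keep sending 'line', 'return'
    # and 'exception' events for this frame.
    return tracer

def square(x):
    y = x * x
    return y

sys.settrace(tracer)
try:
    square(4)
finally:
    # Always switch tracing off again, even if the traced code raises.
    sys.settrace(None)

for entry in events:
    print(entry)
```

The recorded list starts with a 'call' event for square, followed by one 'line' event per executed line and a final 'return' event.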

However, there are a couple of gotchas. Unlike a physical processor, the CPython virtual machine has no concept of breakpoints. There is no instruction like INT 3 on x86 or BKPT on ARM that automatically switches the processor to a debug state. Instead, the breakpoint mechanism must be implemented in the trace callback function. The trace function will be called whenever a new line of code is about to be executed. We need to check whether the user has requested a break on this line and, if so, yield control. This mechanism has its downside. As the callback is invoked for every line, and for every other important event, execution speed is severely reduced. To speed things up, the callback may be implemented in C as an extension module, as cpdb does.
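A breakpoint check inside the trace callback can be sketched roughly as follows. This is a Python 3 illustration and not how pdb actually stores its breakpoints: here a breakpoint is a hypothetical (function name, line offset from the def line) pair, and "yielding control" is reduced to calling an on_break function.

```python
import sys

def make_tracer(breakpoints, on_break):
    """Build a trace function that calls on_break(frame) at each breakpoint."""
    def tracer(frame, event, arg):
        if event == "line":
            code = frame.f_code
            # Identify the line relative to the 'def' line of its function.
            key = (code.co_name, frame.f_lineno - code.co_firstlineno)
            if key in breakpoints:
                on_break(frame)
        return tracer
    return tracer

def countdown(n):
    while n > 0:   # relative line 1
        n -= 1     # relative line 2
    return n       # relative line 3

hits = []
tracer = make_tracer({("countdown", 2)},
                     lambda frame: hits.append(frame.f_locals["n"]))
sys.settrace(tracer)
try:
    countdown(3)
finally:
    sys.settrace(None)

print(hits)
```

Every 'line' event pays for the dictionary lookup whether or not a breakpoint is set there, which is exactly the slowdown described above.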

So far so good, and it seems line tracing is just the functionality we require. However, this works only at the source code level. The finest granularity at which tracing works is the line level, not the instruction level that we require.

How does line tracing work?

Python code objects have a special member called co_lnotab, also known as the line number table. It contains a series of unsigned bytes wrapped up in a string. This is used to map bytecode offsets back to the source code line from which the particular instruction originated.
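The table is a sequence of (offset increment, line increment) byte pairs. The sketch below decodes such a table, assuming the classic layout described in Objects/lnotab_notes.txt for Python 2.7; newer Python versions use different formats. It is written in Python 3, and the function name line_starts and the sample table are for illustration only.

```python
def line_starts(lnotab, first_lineno):
    """Yield (bytecode offset, line number) pairs from a classic line table."""
    last_line = None
    line = first_lineno
    offset = 0
    # The table is a flat run of (offset increment, line increment) byte pairs.
    for offset_incr, line_incr in zip(lnotab[0::2], lnotab[1::2]):
        if offset_incr:
            if line != last_line:
                yield (offset, line)
                last_line = line
            offset += offset_incr
        line += line_incr
    if line != last_line:
        yield (offset, line)

# Sample table: increments larger than 255 are split across several pairs.
TABLE = bytes([6, 1, 44, 5, 255, 0, 45, 255, 0, 45, 11, 1])

for start in line_starts(TABLE, first_lineno=1):
    print(start)
```

For this sample table, the decoder reports that lines begin at offsets 0, 6, 50, 350 and 361.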

When the CPython virtual machine interprets the bytecode, before executing each instruction it checks whether the current bytecode offset is the starting point of some source code line; if so, it calls the trace function. An example trace function taken from the bdb module is shown below.

def trace_dispatch(self, frame, event, arg):
    if self.quitting:
        return # None
    if event == 'line':
        return self.dispatch_line(frame)
    if event == 'call':
        return self.dispatch_call(frame, arg)
    if event == 'return':
        return self.dispatch_return(frame, arg)
    if event == 'exception':
        return self.dispatch_exception(frame, arg)
    if event == 'c_call':
        return self.trace_dispatch
    if event == 'c_exception':
        return self.trace_dispatch
    if event == 'c_return':
        return self.trace_dispatch
    print 'bdb.Bdb.dispatch: unknown debugging event:', repr(event)
    return self.trace_dispatch

The trace function is provided with the currently executing frame as an argument. The frame is a data structure that encapsulates the context under which a code object is executing. We can query the frame using the inspect module. We can change the currently executing line by changing f_lineno of the frame object. Similarly, we can inspect or modify variables by using eval or exec in the context of the globals and locals obtained from the frame.
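As a small illustration of what the frame gives us, the following Python 3 sketch evaluates an expression in the context of a traced frame just before it returns. The names tracer, area and snapshots are invented for this example.

```python
import sys

snapshots = []

def tracer(frame, event, arg):
    if event == "return" and frame.f_code.co_name == "area":
        # Copy the local variables of the frame that is about to return.
        local_vars = dict(frame.f_locals)
        # Evaluate an expression using the frame's own globals and locals.
        value = eval("width * height", frame.f_globals, local_vars)
        # For a 'return' event, arg holds the value being returned.
        snapshots.append((local_vars, value, arg))
    return tracer

def area(width, height):
    result = width * height
    return result

sys.settrace(tracer)
try:
    area(3, 4)
finally:
    sys.settrace(None)

print(snapshots)
```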

Bytecode Tracing Techniques

Listed below are some existing techniques for tracing python bytecode execution.

Extending co_lnotab

We have seen that co_lnotab, the line number table, is used for determining when to call the trace function. Ned Batchelder (2008) showed that it is possible to modify the line number table to include an entry for each instruction offset in the bytecode. To the Python VM, this implies that every instruction corresponds to a different line of source, and hence it calls the trace function for every instruction executed. This technique is very easy to implement and requires no modification to Python. We only need to alter the line number table of each code object to include an entry for each instruction. The downside of this approach is that it increases the pyc file size. The increase is larger if the bytecode is obfuscated, because then we have no idea which bytes are instructions and which are junk. To be on the safe side, we can add an entry for each byte, no matter whether it is a real instruction or a junk byte.
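Generating such a table is straightforward. The sketch below builds one in the classic (offset increment, line increment) layout, giving every listed offset a line number of its own. It is my own illustration and not Batchelder's code; the function name is made up, and installing the table back into a code object is not shown.

```python
def per_offset_lnotab(offsets):
    """Build a classic line table in which each offset starts a new line.

    offsets must begin at 0 and be strictly increasing; the first offset
    implicitly maps to the code object's first line number.
    """
    table = bytearray()
    previous = offsets[0]
    for offset in offsets[1:]:
        delta = offset - previous
        if not 0 < delta <= 255:
            raise ValueError("offsets must increase by 1 to 255 bytes")
        # Advance the bytecode offset by delta and the line number by one.
        table += bytes((delta, 1))
        previous = offset
    return bytes(table)

# One entry per instruction, for instructions starting at 0, 3, 6 and 7.
print(list(per_offset_lnotab([0, 3, 6, 7])))

# One entry per byte, the safe choice for obfuscated code.
print(list(per_offset_lnotab(list(range(4)))))
```

The per-byte variant needs no knowledge of instruction boundaries, at the cost of two table bytes for every bytecode byte.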

Compiling python with LLTRACE

An undocumented way to trace bytecode execution is to compile Python from source with the LLTRACE flag enabled. At execution time, Python prints every instruction it executes on the console. This method is not without its flaws. Printing every executed instruction to the console is expensive and slows execution down. Further, we have no control over the execution, i.e. we cannot modify the operation of the code in any way, and it is not possible to turn this feature off when we do not need it.

Introducing a new opcode

Yet another way to implement tracing is to introduce a new opcode altogether (Rouault, 2015). This is a complicated process and requires a lot of modifications to Python. The entire process, with all its gory details, is described in Rouault's article. The gist of the approach is that we create a new opcode, which Rouault (2015) calls DEBUG_OP. Whenever the Python VM encounters this opcode, it calls a function previously supplied by the user, passing the execution context consisting of the frame and the evaluation stack as the arguments.

Undoubtedly, this method is superior to the pre-existing methods, although it requires a lot of changes in the implementation of Python. However, the main drawback of this approach is that it requires modifying the instruction stream to slip a DEBUG_OP opcode in between. This is feasible for normal bytecode generated by Python but definitely not for obfuscated bytecode. When the instructions are obfuscated, it is not possible to insert the DEBUG_OP opcode in advance, as we cannot differentiate between normal instructions and junk instructions.

The proposed method

Keeping the limitations of the above techniques in mind, our proposed method must overcome them. Specifically, it must be resistant to obfuscation and should not require any changes to the bytecode itself. It would be ideal if we could reuse or extend existing functionality to support bytecode tracing and debugging.

As said before, the Python VM consults co_lnotab, the line number table, before the execution of each instruction to determine when to call the trace function. It looks like we can modify this check so that our tracing function is called right before the execution of every individual instruction, without consulting the line number table. This is the approach we will take.

The function responsible for calling the tracing function is maybe_call_line_trace at line #4054 within ceval.c.

/* See Objects/lnotab_notes.txt for a description of how tracing works. */
static int
maybe_call_line_trace(Py_tracefunc func, PyObject *obj,
                      PyFrameObject *frame, int *instr_lb, int *instr_ub,
                      int *instr_prev)
{
    int result = 0;
    int line = frame->f_lineno;

    /* If the last instruction executed isn't in the current
       instruction window, reset the window.
    */
    if (frame->f_lasti < *instr_lb || frame->f_lasti >= *instr_ub) {
        PyAddrPair bounds;
        line = _PyCode_CheckLineNumber(frame->f_code, frame->f_lasti,
                                       &bounds);
        *instr_lb = bounds.ap_lower;
        *instr_ub = bounds.ap_upper;
    }
    /* If the last instruction falls at the start of a line or if
       it represents a jump backwards, update the frame's line
       number and call the trace function. */
    if (frame->f_lasti == *instr_lb || frame->f_lasti < *instr_prev) {
        frame->f_lineno = line;
        result = call_trace(func, obj, frame, PyTrace_LINE, Py_None);
    }
    *instr_prev = frame->f_lasti;
    return result;
}

Those if statements mostly check whether the current bytecode instruction maps to the beginning of some line. We can simply remove them so that our trace function is called per executed instruction rather than per source line.

static int
maybe_call_line_trace(Py_tracefunc func, PyObject *obj,
                      PyFrameObject *frame, int *instr_lb, int *instr_ub,
                      int *instr_prev)
{
    int result = 0;
    result = call_trace(func, obj, frame, PyTrace_LINE, Py_None);
    *instr_prev = frame->f_lasti;
    return result;
}
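To see what the change means in practice, here is a rough Python 3 model of the two versions of maybe_call_line_trace. It is only a sketch of the decision logic and not the interpreter's real code: line_starts stands in for the information in co_lnotab (and is assumed to contain offset 0), executed_offsets is a made-up sequence of f_lasti values, and the function name is hypothetical.

```python
import bisect

def line_event_offsets(line_starts, executed_offsets, patched=False):
    """Return the offsets at which a 'line' event would be reported."""
    starts = sorted(line_starts)
    fired = []
    # Initial state chosen so that the first offset forces a window lookup.
    lower, upper, previous = 0, -1, -1
    for lasti in executed_offsets:
        if patched:
            # Patched version: report every executed instruction.
            fired.append(lasti)
        else:
            if lasti < lower or lasti >= upper:
                # Outside the current window: find the line containing lasti.
                index = bisect.bisect_right(starts, lasti) - 1
                lower = starts[index]
                if index + 1 < len(starts):
                    upper = starts[index + 1]
                else:
                    upper = float("inf")
            # Report only at the start of a line or after a backward jump.
            if lasti == lower or lasti < previous:
                fired.append(lasti)
        previous = lasti
    return fired

starts = [0, 12, 16]
path = [0, 3, 6, 9, 16, 19]
print(line_event_offsets(starts, path))
print(line_event_offsets(starts, path, patched=True))
```

With the original logic only the offsets that begin a line are reported, while the patched logic reports every offset on the path.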

After building Python from source with those teeny-tiny changes in place, we have implemented an execution tracer that reuses the existing settrace functionality. We now need to code the callback function which will be registered through settrace. This can be realized either in Python or in C as an extension (like cpdb), but we choose the former for ease of development.

The Tracer

The code of the tracer is listed below and can also be found on GitHub at https://github.com/extremecoders-re/bytecode_tracer

import sys
import dis
import marshal
import argparse

tracefile = None
options = None

# List of valid python opcodes
valid_opcodes = dis.opmap.values()

def trace(frame, event, arg):
    global tracefile, valid_opcodes, options
    if event == 'line':
        # Get the code object
        co_object = frame.f_code

        # Retrieve the name of the associated code object
        co_name = co_object.co_name

        if options.name is None or co_name == options.name:
            # Get the code bytes
            co_bytes = co_object.co_code

            # f_lasti is the offset of the last bytecode instruction executed
            # w.r.t the current code object
            # For the very first instruction this is set to -1
            ins_offs = frame.f_lasti

            if ins_offs >= 0:
                opcode = ord(co_bytes[ins_offs])

                # Check if it is a valid opcode
                if opcode in valid_opcodes:
                    if opcode >= dis.HAVE_ARGUMENT:
                        # Fetch the operand
                        operand = arg = ord(co_bytes[ins_offs+1]) | (ord(co_bytes[ins_offs+2]) << 8)

                        # Resolve instruction arguments if specified
                        if options.resolve:
                            try:
                                if opcode in dis.hasconst:
                                    operand = co_object.co_consts[arg]
                                elif opcode in dis

To demonstrate the usage, I have chosen the following piece of code taken from Programiz.

# Python program to find the factorial of a number using recursion

def recur_factorial(n):
   """Function to return the factorial
   of a number using recursion"""
   if n == 1:
       return n
   else:
       return n*recur_factorial(n-1)

# Change this value for a different result
num = 7

# check if the number is negative
if num < 0:
   print("Sorry, factorial does not exist for negative numbers")
elif num == 0:
   print("The factorial of 0 is 1")
else:
   print("The factorial of",num,"is",recur_factorial(num))

Suppose we want to trace the execution of the recur_factorial function. We can do so by running the following:

$ python tracer.py -t=only -n=recur_factorial -r factorial.pyc trace.txt

We are tracing the execution of only those code objects named recur_factorial.
The -r flag means that the operands of instructions should be resolved. Instructions in Python can take an argument. For some instructions like LOAD_CONST, the argument is an integer specifying the index of an item within the co_consts table which will be pushed onto the evaluation stack. If resolution (-r flag) is enabled, the item will be written to the trace instead of the integer argument.
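For comparison, Python 3's dis module can do this kind of operand resolution for us. The sketch below is not part of the tracer and produces a static listing rather than an execution trace; the function names and the output format, loosely modelled on the trace file, are my own choices.

```python
import dis

def resolved_listing(func):
    """List the instructions of func with their operands resolved."""
    code = func.__code__
    lines = []
    for ins in dis.get_instructions(code):
        if ins.arg is None:
            # Instruction without an argument.
            lines.append("%s> %d %s" % (code.co_name, ins.offset, ins.opname))
        else:
            # argrepr is the resolved form (constant, name, ...); fall back
            # to the raw integer argument when there is nothing to resolve.
            shown = ins.argrepr or ins.arg
            lines.append("%s> %d %s (%s)"
                         % (code.co_name, ins.offset, ins.opname, shown))
    return lines

def greet(name):
    return "hello " + name

for line in resolved_listing(greet):
    print(line)
```

The exact opcodes and offsets printed depend on the Python version, but the string constant and the variable name show up in place of their table indices.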

The input file name is factorial.pyc and the trace file name is trace.txt. Running this, we get an execution trace like the following:

recur_factorial> 0 LOAD_FAST (n)
recur_factorial> 3 LOAD_CONST (1)
recur_factorial> 6 COMPARE_OP (==)
recur_factorial> 16 LOAD_FAST (n)
recur_factorial> 19 LOAD_GLOBAL (recur_factorial)
recur_factorial> 22 LOAD_FAST (n)
recur_factorial> 25 LOAD_CONST (1)
recur_factorial> 28 BINARY_SUBTRACT
recur_factorial> 29 CALL_FUNCTION (1)
recur_factorial> 0 LOAD_FAST (n)
recur_factorial> 3 LOAD_CONST (1)
recur_factorial> 6 COMPARE_OP (==)
recur_factorial> 16 LOAD_FAST (n)
recur_factorial> 19 LOAD_GLOBAL (recur_factorial)
recur_factorial> 22 LOAD_FAST (n)
recur_factorial> 25 LOAD_CONST (1)
recur_factorial> 28 BINARY_SUBTRACT
recur_factorial> 29 CALL_FUNCTION (1)
recur_factorial> 0 LOAD_FAST (n)
recur_factorial> 3 LOAD_CONST (1)
recur_factorial> 6 COMPARE_OP (==)
recur_factorial> 16 LOAD_FAST (n)
recur_factorial> 19 LOAD_GLOBAL (recur_factorial)
recur_factorial> 22 LOAD_FAST (n)
recur_factorial> 25 LOAD_CONST (1)
recur_factorial> 28 BINARY_SUBTRACT
recur_factorial> 29 CALL_FUNCTION (1)
recur_factorial> 0 LOAD_FAST (n)
recur_factorial> 3 LOAD_CONST (1)
recur_factorial> 6 COMPARE_OP (==)
recur_factorial> 16 LOAD_FAST (n)
recur_factorial> 19 LOAD_GLOBAL (recur_factorial)
recur_factorial> 22 LOAD_FAST (n)
recur_factorial> 25 LOAD_CONST (1)
recur_factorial> 28 BINARY_SUBTRACT
recur_factorial> 29 CALL_FUNCTION (1)
recur_factorial> 0 LOAD_FAST (n)
recur_factorial> 3 LOAD_CONST (1)
recur_factorial> 6 COMPARE_OP (==)
recur_factorial> 16 LOAD_FAST (n)
recur_factorial> 19 LOAD_GLOBAL (recur_factorial)
recur_factorial> 22 LOAD_FAST (n)
recur_factorial> 25 LOAD_CONST (1)
recur_factorial> 28 BINARY_SUBTRACT
recur_factorial> 29 CALL_FUNCTION (1)
recur_factorial> 0 LOAD_FAST (n)
recur_factorial> 3 LOAD_CONST (1)
recur_factorial> 6 COMPARE_OP (==)
recur_factorial> 16 LOAD_FAST (n)
recur_factorial> 19 LOAD_GLOBAL (recur_factorial)
recur_factorial> 22 LOAD_FAST (n)
recur_factorial> 25 LOAD_CONST (1)
recur_factorial> 28 BINARY_SUBTRACT
recur_factorial> 29 CALL_FUNCTION (1)
recur_factorial> 0 LOAD_FAST (n)
recur_factorial> 3 LOAD_CONST (1)
recur_factorial> 6 COMPARE_OP (==)
recur_factorial> 12 LOAD_FAST (n)
recur_factorial> 15 RETURN_VALUE
recur_factorial> 32 BINARY_MULTIPLY
recur_factorial> 33 RETURN_VALUE
recur_factorial> 32 BINARY_MULTIPLY
recur_factorial> 33 RETURN_VALUE
recur_factorial> 32 BINARY_MULTIPLY
recur_factorial> 33 RETURN_VALUE
recur_factorial> 32 BINARY_MULTIPLY
recur_factorial> 33 RETURN_VALUE
recur_factorial> 32 BINARY_MULTIPLY
recur_factorial> 33 RETURN_VALUE
recur_factorial> 32 BINARY_MULTIPLY
recur_factorial> 33 RETURN_VALUE

That's pretty cool! Now we know the exact opcodes that are executing. Tracing obfuscated bytecode is no longer a problem.

Extending tracing to full-fledged debugging

The tracer developed here does not have advanced debugging capabilities. For instance, we cannot interact with the operand stack, tamper with the values stored on it, or modify the opcodes dynamically at run time. We do have access to the frame object, but the evaluation stack is not accessible from Python. However, everything is accessible to a C extension. We can develop a C extension which, when given a frame object, allows Python code to interact with the objects stored on the operand stack.

This will be the topic for another blog post. I also intend to show how we can use such an advanced tracer to unpack and deobfuscate the layers of a PjOrion-protected Python application.

References

Batchelder, N. (2008, April 11). Wicked hack: Python bytecode tracing. Retrieved March 15, 2017, from https://nedbatchelder.com/blog/200804/wicked_hack_python_bytecode_tracing.html

Rouault, C. (2015, May 7). Understanding Python execution from inside: A Python assembly tracer. Retrieved March 15, 2017, from http://web.archive.org/web/20160830181828/http://blog.hakril.net/articles/2-understanding-python-execution-tracer.html?

Thursday, 16 March 2017