Symbolic Analysis With angr

I came across an interesting cybersecurity challenge which I think is an excellent example of the power of symbolic analysis.

The challenge is a binary which looks for a base64 encoded file and then simply checks that the file is the correct one by traversing it. The file is not provided, so presumably one is supposed to recreate the file based on the check routine. The interesting part is that the traversal is not a loop, but hardcoded in a very long routine which breaks some traditional tools.

  • Disassembling with IDA works fine and is quick, but trying to build a graph view takes a whole day on my machine, followed by a warning that the maximum number of nodes is limited to 1000. We can increase that, but at somewhere between 100000 and 1000000 ida crashes instead of making a graph. In any case it's probably pointless since the function is so long.
  • Trying HexRays to decompile the check routine results in a program crash.
  • The RetDec IDA plugin seems to keep running forever. I don't know if it will ever finish
  • RetDec command line version fails after taking a long time:
Running phase: Initialization ( 0.01s )
Running phase: LLVM ( 0.02s )
Running phase: Providers initialization ( 0.02s )
Running phase: Input binary to LLVM IR decoding ( 0.28s )
Running phase: LLVM ( 207465.34s )
Running phase: x87 fpu register analysis ( 207467.14s )
Running phase: Main function identification optimization ( 207467.63s )
Running phase: Libgcc idioms optimization ( 207467.92s )
Running phase: LLVM instruction optimization ( 207467.92s )
Running phase: Conditional branch optimization ( 207468.81s )
Running phase: Syscalls optimization ( 207477.50s )
Running phase: Stack optimization ( 207477.50s )
Running phase: Constants optimization ( 207495.33s )
Running phase: Function parameters and returns optimization ( 207519.06s )
Error: Decompilation to LLVM IR failed
  • Cutter, interestingly, works fairly quickly in the pseudocode section, but limits the length of the code to 127 basic blocks as it's using the pdc decompilation command
  • radare2 on the command line does a pretty good job decompiling with the pdd command:
$ r2pm -i r2dec
$ r2 a.out
[0x000007c0]> aa
[0x000007c0]> afl
... snip ...
0x000008f0    1 33           sym.badboy
0x00000911    1 31           sym.goodboy
0x00000930   10 405          main
0x00000ac5  127 3047         sym.check
... snip ...
[0x000007c0]> s sym.check 
/* r2dec pseudo code output */
/* ch30.bin @ 0xac5 */
#include <stdint.h>
 
int64_t check (int32_t arg1) {
    int32_t var_18h;
    int32_t var_8h;
    int32_t var_4h;
    var_18h = rdi;// this is pointer to base64 encoded string from file
    var_4h = 0;// initialize counter
    var_8h = 0xf7;// copy constant to local variable
    eax = var_4h;// copy counter value to eax
    rdx = (int64_t) eax; //sign extend eax to rdx
    rax = var_18h; // point rax to base64 encoded string from file
    rax += rdx;// increment pointer by counter
    eax = *(rax);// copy first character from string into eax
    eax = (int32_t) al;// just keep low byte of character
    al ^= 0xa3; // xor character with 0xa3
    if (eax != var_8h) { // check if (first char xor 0xa3) == 0xf7 -- 'T' 
        eax = var_4h;
        edi = var_4h;
        badboy ();
    }

... snip ...

    var_4h++;
    var_8h = 0x54;
    eax = var_4h;
    rdx = (int64_t) eax;
    rax = var_18h;
    rax += rdx;
    eax = *(rax);
    eax ^= 0x15;
    eax = (int32_t) al;
    /* Beware that this jump is a conditional jump.
     * r2dec transformed it as a return, due being the
     * last instruction. Please, check 'pdda' output
     * for more hints. */
    return void (*0x16b6)() ();
  • Ghidra gobbles up resources and crashes when trying to create the pseudocode

The function goes character by character through the file, checks if it is the correct one and then moves on to the next. There are thousands of checks. So one approach is to take the pseudocode or disassembly listing and use that to create the proper string. I tried it and it works. But this is also a good case for using a symbolic analysis engine.

Angr is "angr is a python framework for analysing binaries. It combines both static and dynamic symbolic ("concolic") analysis, making it applicable to a variety of tasks." And the very interesting task we are going to use it for here is static symbolic analysis. In essence, angr can disassemble a binary and we analyse what series of actions will result in a desired outcome. The binary under analysis here has a pretty simple structure. After making some initial checks on the input file, such as length, it will pass the file contents to a check routine. The check routine then calls one of two functions, goodboy or badboy. So what we can do with angr is determine which file contents result in the check routine calling goodboy instead of badboy.

I find that a lot of the example content on how to use angr online does not function any more due to deprecation of API calls. Angr is being developed at a fast pace.....

Here are some resources on the details of symbolic execution:

So I think it's good to have another example out there.

Here I will use Ubuntu 18.10 in a clean virtual machine and set it up for angr with the following commands:

sudo apt update
sudo apt upgrade
sudo apt install python3-dev libffi-dev build-essential virtualenvwrapper
source /usr/share/virtualenvwrapper/virtualenvwrapper.sh
mkvirtualenv --python=$(which python3) angr && pip install angr

one can then run a script just by calling it like this:

python angr angrscript

There is a useful cheat sheet on the angr github:

https://github.com/angr/angr-doc/blob/master/CHEATSHEET.md

I think it's worth going through a simple example from angr.io in order to have a good starting point for the thing we are going to do here:

consider the code given:

import os
import angr
project = angr.Project("defcamp_quals_2015_r100", auto_load_libs=False)
path_group = project.factory.path_group()
path_group.explore(find=lambda path: 'Nice!' in path.state.posix.dumps(1))
print path_group.found[0].state.posix.dumps(0)

The imports are pretty self-explanatory.

project = angr.Project("defcamp_quals_2015_r100", auto_load_libs=False)

loads an executable into the current project. The executable is called defcamp_quals_2015_r100 and this executable takes a command line argument as a password. If the correct password is passed, then it prints a message: "Nice!"

auto_load_libs=False keeps angr from loading and analysing the libraries associated with the executable to save time. In this case all the logic we are interested in happens in the program, so the libraries need not be analysed.

After loading the executable, we have to decide on what analysis to perform, and in this case it's a path group. path_group = project.factory.path_group()

Then, we can look for the path group that causes the program to print "Nice!" on stdout, path_group.explore(find=lambda path: 'Nice!' in path.state.posix.dumps(1))

Finally, print the stdin of the path group that leads us there: print path_group.found[0].state.posix.dumps(0)

This is very powerful! The executable at hand uses a file input, rather than stdin, so we need an angr script that works on that.


import angrimport claripyproject = angr.Project('ch30.bin', auto_load_libs=False) # load executable, don't analyse libaries data = claripy.BVS('data', 0x100000*8) # declare how long the file can be; this is used by the concretize() function laterfilename = 'test.txt' # name for simulated file to be passed to the executablesimfile = angr.SimFile(filename, content=bytestring) # declare contents for simulated file to be passed to the executablestate = project.factory.entry_state(args=['./executable', filename], fs={filename: simfile}) # initialize state machine to executable's entry pointsimulationManager = project.factory.simulation_manager(state)addr = project.loader.find_symbol('goodboy').rebased_addr # function we want to get toavoid_addr = project.loader.find_symbol('badboy').rebased_addr # functin we want to avoid - allows angr to cull execution paths.simulationManager.explore(find=[addr, addr2,addr3, addr4,addr5, addr6,addr7, addr8,addr9, addr10,addr11, addr12],avoid=avoid_addr, step_func=lambda lsimulationManager: lsimulationManager.drop(stash='avoid')) # find a way through to the goodboy routinetest = simulationManager.one_found.fs.get(filename).concretize() # actually patch together the file input that results in the desired pathprint(repr(test))

This works, but it very very slow and memory intensive. I found that it was much faster to just simulate some way into the check routine, copy the concretized file input up to that point and start from that point:

import angrimport claripyimport loggingimport timet0 = time.time()project = angr.Project('ch30.bin', auto_load_libs=False) # load executable, don't analyse libaries bytestring = None data = 'TVqQAAMAAAAEAAAA' # this is the start of the file as we know it up to nowstringAsList = list(data); # turn string into liststringAsList.append(claripy.BVS('data', 0x1000*8)) # declare how long the file can be; this is used by the concretize() function laterbytestring = claripy.Concat(*stringAsList) # turn list into bytestringfilename = 'test.txt' # name for simulated file to be passed to the executablesimfile = angr.SimFile(filename, content=bytestring) # declare contents for simulated file to be passed to the executablestate = project.factory.entry_state(args=['./ch30.bin', filename], fs={filename: simfile}) # initialize state machine to executable's entry pointsimulationManager = project.factory.simulation_manager(state)simulationManager.activeprint("start exploration")logging.getLogger('angr.sim_manager').setLevel('INFO')addr = project.loader.find_symbol('check').rebased_addr # + 51 * len(data) # go a little further into the check routine than last timeaddr2 = addr + 1addr3 = addr + 2addr4 = addr + 3addr5 = addr + 4addr6 = addr + 5addr7 = addr + 6addr8 = addr + 7addr9 = addr + 8addr10 = addr + 9addr11 = addr + 10addr12 = addr + 11avoid_addr = project.loader.find_symbol('badboy').rebased_addrsimulationManager.explore(find=[addr, addr2,addr3, addr4,addr5, addr6,addr7, addr8,addr9, addr10,addr11, addr12],avoid=avoid_addr, step_func=lambda lsimulationManager: lsimulationManager.drop(stash='avoid'))print("start concretize")test = simulationManager.one_found.fs.get(filename).concretize()print(repr(test))t1 = time.time()print (t1-t0)

This results in a base64 encoded executable:

start concretize
b'TVqQAAMAAAAEAAAA
[.....]
nAA==\n\x00\x00\x00
[.....]
\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00'
1452.8494970798492
(angr) :~$

The rest of the challenge was straight forward - the executable was a windows PE file with no surprises.