Using the LLVM MC Disassembly API

In this post, I’ll walk through how to link an application against LLVM and show a simple use of the LLVM MCDisassembler API. It’s a little more complex than it seems, probably because there aren’t many good resources for using this API.

Linking a program with LLVM

The handy llvm-config utility, which comes with LLVM, can be used to determine the compiler/linker flags you need for LLVM. The relevant options are --ldflags, --cxxflags, and --libs. Let’s see what the output of these will be.

> llvm-config --ldflags
-L/usr/local/lib -lz -lpthread -ldl -lm

> llvm-config --cxxflags
-I/usr/local/include -D_DEBUG -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS -O3 -fomit-frame-pointer -fvisibility-inlines-hidden -fno-exceptions -fno-rtti -fPIC -Woverloaded-virtual -Wcast-qual

> llvm-config --libs
-lLLVMLTO -lLLVMLinker -lLLVMipo -lLLVMVectorize -lLLVMBitWriter -lLLVMTableGen -lLLVMDebugInfo -lLLVMOption -lLLVMX86Disassembler -lLLVMX86AsmParser -lLLVMX86CodeGen -lLLVMSelectionDAG
-lLLVMAsmPrinter -lLLVMX86Desc -lLLVMX86Info -lLLVMX86AsmPrinter -lLLVMX86Utils -lLLVMIRReader
-lLLVMBitReader -lLLVMAsmParser -lLLVMMCDisassembler -lLLVMMCParser -lLLVMInstrumentation -lLLVMInterpreter -lLLVMMCJIT -lLLVMJIT -lLLVMCodeGen -lLLVMObjCARCOpts -lLLVMScalarOpts
-lLLVMInstCombine -lLLVMTransformUtils -lLLVMipa -lLLVMAnalysis -lLLVMRuntimeDyld -lLLVMExecutionEngine -lLLVMTarget -lLLVMMC -lLLVMObject -lLLVMCore -lLLVMSupport

If you have a C++ file that includes LLVM headers, first compile it to a .o object file with g++’s -c option, which tells the compiler not to run the linker.

g++ -std=c++11 -I/home/raywang/panda -c -o panda/tools/slice_analyzer.o panda/tools/slice_analyzer.cpp

Now, we want to link this .o file against LLVM.

The tricky thing is that you can’t just add llvm-config --ldflags --cxxflags --libs to g++, because the order of these flags matters. When linking libraries, the linker goes from left to right through the libraries, building up a list of missing symbols and resolving symbols as it encounters new libraries. However, it does not search backwards for symbols!

So, if you specify a library too early in the command, before anything has asked for its symbols, it will never get used, even when libraries later in the command depend on it!

The correct way to link is to first specify the cxxflags and libs, then the ldflags like so:

g++ panda/tools/slice_analyzer.o -o slice_analyzer `llvm-config --cxxflags --libs` `llvm-config --ldflags`

Now, all the missing symbols are filled in by libraries further to the right, so the linker can work correctly!
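To make the left-to-right rule concrete, here is a toy model of a single-pass linker. It is only a sketch of the behaviour described above, not how any real linker is implemented; the LinkInput struct, the unresolvedAfterLink function, and the symbol names used with it are all made up for illustration.

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical model of one linker input: the symbols it defines and the
// symbols it needs from elsewhere.
struct LinkInput {
    std::string name;
    bool isArchive;                  // libraries are only pulled in on demand
    std::set<std::string> defines;
    std::set<std::string> needs;
};

// Scan the inputs left to right, the way the text describes.
// Returns the symbols that are still undefined at the end.
std::set<std::string> unresolvedAfterLink(const std::vector<LinkInput> &inputs) {
    std::set<std::string> defined;
    std::set<std::string> undefined;
    for (const LinkInput &in : inputs) {
        if (in.isArchive) {
            // A library is used only if it defines something currently missing.
            bool useful = false;
            for (const std::string &sym : in.defines)
                if (undefined.count(sym)) useful = true;
            if (!useful) continue;   // skipped, and never revisited
        }
        for (const std::string &sym : in.defines) {
            defined.insert(sym);
            undefined.erase(sym);
        }
        for (const std::string &sym : in.needs)
            if (!defined.count(sym)) undefined.insert(sym);
    }
    return undefined;
}
```

With an object file that needs a symbol from library A, and library A needing a symbol from library B, the order object, A, B leaves nothing unresolved in this model, while putting either library before the thing that depends on it leaves a symbol missing.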

McDisassembly

The LLVM MC (Machine Code) library is well-suited to large-scale disassembly applications. Let’s see the most basic way to use it.

We start with a buffer of x86 machine code formatted as a std::string of hex characters: 89e5. We want to disassemble this to the mov ebp, esp instruction.

You’ll need to include these header files:

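// C API: LLVMCreateDisasm, LLVMSetDisasmOptions, LLVMDisasmInstruction
#include "llvm-c/Disassembler.h"
// C API: the LLVMInitializeAll* functions
#include "llvm-c/Target.h"
#include "llvm/MC/MCAsmInfo.h"
#include "llvm/MC/MCContext.h"
#include "llvm/MC/MCDisassembler.h"
#include "llvm/MC/MCInst.h"
#include "llvm/MC/MCInstPrinter.h"
#include "llvm/MC/MCInstrInfo.h"
#include "llvm/MC/MCRegisterInfo.h"
#include "llvm/MC/MCSubtargetInfo.h"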

Now, initialize everything:

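// Register every target this LLVM build supports
LLVMInitializeAllAsmPrinters();
LLVMInitializeAllTargets();
LLVMInitializeAllTargetInfos();
LLVMInitializeAllTargetMCs();
LLVMInitializeAllDisassemblers();

LLVMDisasmContextRef dcr = LLVMCreateDisasm(
    "i386-unknown-linux-gnu", // TripleName
    NULL,                     // DisInfo: opaque pointer handed to the callbacks
    0,                        // TagType
    NULL,                     // GetOpInfo callback (not needed here)
    NULL                      // SymbolLookUp callback (not needed here)
);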

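LLVMCreateDisasm and the other disassembler functions used below are declared in llvm-c/Disassembler.h and defined in llvm/lib/MC/MCDisassembler/Disassembler.cpp; the LLVMInitializeAll* functions come from llvm-c/Target.h.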

We need an LLVMDisasmContextRef handle for all subsequent calls, and LLVMCreateDisasm creates one (it returns NULL if LLVM has no disassembler for the requested target). The first argument to LLVMCreateDisasm is a TripleName, formatted as archType-vendor-OS, optionally followed by an environment such as gnu. I think the default vendor is "unknown". Some valid triples include x86_64-unknown-linux-gnu, i486--linux-gnu, etc.
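To illustrate the triple layout, here is a small sketch that splits a triple string into its fields. TripleParts and splitTriple are hypothetical names for this post only; LLVM has its own llvm::Triple class that also normalizes and validates triples, which this toy version does not attempt.

```cpp
#include <string>
#include <vector>

// Hypothetical helper type: the fields of a target triple.
struct TripleParts {
    std::string arch, vendor, os, environment;
};

// Split "arch-vendor-os[-environment]" on '-'. Empty fields are preserved,
// so "i486--linux-gnu" yields an empty vendor. Missing fields come back empty.
TripleParts splitTriple(const std::string &triple) {
    std::vector<std::string> fields;
    std::string::size_type start = 0;
    while (fields.size() < 3) {
        std::string::size_type dash = triple.find('-', start);
        if (dash == std::string::npos) break;
        fields.push_back(triple.substr(start, dash - start));
        start = dash + 1;
    }
    fields.push_back(triple.substr(start));  // whatever is left over
    fields.resize(4);                        // pad missing fields with ""
    return TripleParts{fields[0], fields[1], fields[2], fields[3]};
}
```

For example, splitTriple("i386-unknown-linux-gnu") gives arch i386, vendor unknown, os linux, and environment gnu.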

If you want Intel syntax, you need to use LLVMSetDisasmOptions(). Its second argument is a bitmask, and each flag you set toggles one of three options:

/* The option to produce marked up assembly. */
#define LLVMDisassembler_Option_UseMarkup 1
/* The option to print immediates as hex. */
#define LLVMDisassembler_Option_PrintImmHex 2
/* The option use the other assembler printer variant */
#define LLVMDisassembler_Option_AsmPrinterVariant 4

By default, the asm printer is AT&T syntax, so we need to toggle option flag 4 for Intel:

LLVMSetDisasmOptions(dcr, LLVMDisassembler_Option_AsmPrinterVariant);
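Because the option values are distinct bits, several of them should be combinable with a bitwise OR into one mask. The sketch below shows the idea without pulling in LLVM headers: the kOption* constants are hypothetical stand-ins that copy the values from the definitions quoted above, and buildDisasmOptions is a made-up helper.

```cpp
#include <cstdint>

// Values copied from the LLVMDisassembler_Option_* definitions; the kOption*
// names are hypothetical so this compiles without LLVM headers.
const uint64_t kOptionUseMarkup = 1;
const uint64_t kOptionPrintImmHex = 2;
const uint64_t kOptionAsmPrinterVariant = 4;

// Build an options mask from two yes/no choices.
uint64_t buildDisasmOptions(bool intelSyntax, bool hexImmediates) {
    uint64_t options = 0;
    if (intelSyntax) options |= kOptionAsmPrinterVariant;  // switch away from AT&T
    if (hexImmediates) options |= kOptionPrintImmHex;      // print immediates as hex
    return options;
}
```

In a real program the resulting mask would be passed as the second argument of LLVMSetDisasmOptions, using the LLVMDisassembler_Option_* macros rather than these stand-ins.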

Getting disassembly

Finally, let’s disassemble our hex string. For this, we’ll use the LLVMDisasmInstruction function.

size_t LLVMDisasmInstruction(LLVMDisasmContextRef DC, uint8_t *Bytes,
                             uint64_t BytesSize, uint64_t PC,
                             char *OutString, size_t OutStringSize);

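This function takes an input buffer of uint8_t, an output buffer of chars, the lengths of both buffers, and a program counter PC. It returns the number of bytes in the decoded instruction, or 0 if the bytes do not form a valid instruction.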

Here’s a routine to convert a std::string of hex characters to a uint8_t buffer.

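// Needs <cstdio> for sscanf and <string> for std::string.
// Returns the number of bytes written to outBytes.
int hex2bytes(const std::string &hex, unsigned char outBytes[]) {
    // On most platforms uint8_t is a typedef for unsigned char

    // Get raw chars from std::string
    const char *pos = hex.c_str();
    size_t count = hex.length() / 2;
    for (size_t ct = 0; ct < count; ct++) {
        // Parse two hex digits into one byte
        sscanf(pos, "%2hhx", &outBytes[ct]);
        pos += 2;
    }
    return static_cast<int>(count);
}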
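The sscanf version trusts its input. If the hex string might come from an untrusted source, a variant that validates every digit could look like the sketch below. hexDigitValue and hexToBytes are hypothetical names, and returning the bytes in a std::vector is just one possible design.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: value of one hex digit, or -1 if it is not one.
static int hexDigitValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
}

// Convert a hex string to bytes. Returns false and leaves `out` empty if the
// string has odd length or contains a non-hex character.
bool hexToBytes(const std::string &hex, std::vector<uint8_t> &out) {
    out.clear();
    if (hex.length() % 2 != 0) return false;
    for (std::string::size_type i = 0; i < hex.length(); i += 2) {
        int hi = hexDigitValue(hex[i]);
        int lo = hexDigitValue(hex[i + 1]);
        if (hi < 0 || lo < 0) {
            out.clear();
            return false;
        }
        out.push_back(static_cast<uint8_t>(hi * 16 + lo));
    }
    return true;
}
```

The vector’s data() and size() can then be handed to the disassembler in place of a manually allocated array.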

Now, we’re all set to use LLVMDisasmInstruction!

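// targetAsm is a std::string of hex chars;
// the resulting byte array is half the length of the hex string
size_t inputLen = targetAsm.length() / 2;
unsigned char *input = new unsigned char[inputLen];
hex2bytes(targetAsm, input);
// This is the output buffer for the human-readable instruction
char outstring[50];
// Returns the number of bytes consumed, or 0 for an invalid instruction
size_t consumed = LLVMDisasmInstruction(dcr, input, inputLen, 0,
                                        outstring, sizeof(outstring));
if (consumed > 0)
    printf("%s\n", outstring);
// Prints mov EBP, ESP (exact whitespace and register case vary by LLVM version)
delete[] input;
LLVMDisasmDispose(dcr); // release the context once you are done with it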
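A buffer usually holds more than one instruction. Since the disassembly call reports how many bytes it consumed, the usual pattern is to advance by that amount and call it again. The sketch below shows only that loop: DecodeOne is a hypothetical callback that stands in for a wrapper around LLVMDisasmInstruction, so the code compiles without LLVM, and instructionOffsets is a made-up name.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for the real decoder: given the remaining bytes and
// the current PC, return the instruction length, or 0 if nothing decodes.
using DecodeOne =
    std::function<std::size_t(const uint8_t *bytes, std::size_t size, uint64_t pc)>;

// Walk a buffer one instruction at a time and return the starting offset of
// each instruction that decoded. Stops at the first undecodable position.
std::vector<std::size_t> instructionOffsets(const std::vector<uint8_t> &code,
                                            uint64_t basePc,
                                            const DecodeOne &decodeOne) {
    std::vector<std::size_t> offsets;
    std::size_t offset = 0;
    while (offset < code.size()) {
        std::size_t remaining = code.size() - offset;
        std::size_t consumed = decodeOne(code.data() + offset, remaining, basePc + offset);
        if (consumed == 0 || consumed > remaining) break;  // invalid or bogus length
        offsets.push_back(offset);
        offset += consumed;                                // move to the next instruction
    }
    return offsets;
}
```

In real use, the callback would call LLVMDisasmInstruction with the context, the byte pointer, the remaining size, the PC, and an output buffer, then print or store the resulting text before returning the length.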

To see more uses of the LLVM MC disassembler API, check out the LLVM Project Blog.

References

https://stackoverflow.com/questions/18267803/how-to-correctly-convert-a-hex-string-to-byte-array-in-c
http://blog.llvm.org/2010/01/x86-disassembler.html