IDAStealth v1.3.1 - Bugfixes

A little update for the IDAStealth plugin. Bugs were fixed concerning the improved NtClose hook and the context emulation for the advanced hardware breakpoint protection. Apart from that, the new version allows you to start the stealth drivers with a custom name to evade detection by tools that search for common stealth driver names such as (rdtsc*|stealth*).sys.
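To make the name-based detection concrete, here is a minimal sketch of such a scan. The two patterns come from the text above; the scanner function and the sample driver names are hypothetical and only serve as illustration.

```python
import fnmatch

# Patterns from the text above: (rdtsc*|stealth*).sys
SUSPICIOUS_PATTERNS = ("rdtsc*.sys", "stealth*.sys")


def is_suspicious(driver_name):
    """Return True if a driver file name matches a common stealth driver pattern."""
    name = driver_name.lower()  # driver names are compared case-insensitively here
    return any(fnmatch.fnmatchcase(name, pattern) for pattern in SUSPICIOUS_PATTERNS)


# Hypothetical names: the first two use default-looking names, the last a custom one.
for candidate in ("rdtscemu.sys", "StealthDriver.sys", "netfilter2.sys"):
    print(candidate, is_suspicious(candidate))
```

A driver started under a custom name simply never matches such a pattern list, which is all the new option relies on.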

IDAStealth v1.3 - VMProtect and improved Debugger Support

The new IDAStealth version v1.3 includes support for anti-debugging techniques used e.g. by VMProtect. More specifically, the stealth driver as well as the user-mode hooks now support two additional process information class parameters: ProcessDebugObjectHandle and ProcessDebugFlags for the NtQueryInformationProcess API. VMProtect actually uses the ProcessDebugFlags information class to detect a debugger (and a few additional techniques).
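The following sketch models the kind of check a protector can build on these information classes. The class numbers and the status code are the values commonly documented for NtQueryInformationProcess; the two query functions are simplified, hypothetical models of what an undebugged and a debugged process would see, not code taken from IDAStealth or VMProtect.

```python
# Information class numbers used with NtQueryInformationProcess.
PROCESS_DEBUG_PORT = 7
PROCESS_DEBUG_OBJECT_HANDLE = 0x1E
PROCESS_DEBUG_FLAGS = 0x1F

STATUS_SUCCESS = 0x00000000
STATUS_PORT_NOT_SET = 0xC0000353


def query_undebugged(info_class):
    """Model of the (status, value) pairs seen when no debugger is attached.

    These are also the answers a stealth hook would want to return.
    """
    if info_class == PROCESS_DEBUG_PORT:
        return STATUS_SUCCESS, 0
    if info_class == PROCESS_DEBUG_OBJECT_HANDLE:
        return STATUS_PORT_NOT_SET, 0
    if info_class == PROCESS_DEBUG_FLAGS:
        return STATUS_SUCCESS, 1
    raise ValueError("information class not modelled")


def query_debugged(info_class):
    """Model of the (status, value) pairs seen while a debugger is attached."""
    if info_class == PROCESS_DEBUG_PORT:
        return STATUS_SUCCESS, 0xFFFFFFFF
    if info_class == PROCESS_DEBUG_OBJECT_HANDLE:
        return STATUS_SUCCESS, 0x1234  # some valid handle value (made up)
    if info_class == PROCESS_DEBUG_FLAGS:
        return STATUS_SUCCESS, 0
    raise ValueError("information class not modelled")


def looks_debugged(query):
    """Anti-debug check; `query` is any callable returning (status, value)."""
    _, port = query(PROCESS_DEBUG_PORT)
    status, handle = query(PROCESS_DEBUG_OBJECT_HANDLE)
    _, flags = query(PROCESS_DEBUG_FLAGS)
    return port != 0 or (status == STATUS_SUCCESS and handle != 0) or flags == 0


print(looks_debugged(query_debugged), looks_debugged(query_undebugged))
```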

Apart from that, v1.3 now has some features to support you while debugging. It can halt the debugger automatically as soon as an SEH handler is invoked by the OS. Additionally, the debugger can be halted after the SEH handler has run, at the point where the OS resumes the faulting thread with a (possibly) changed instruction pointer (or context in general). Both events can also just be logged without stopping the debugger.

Debugging Support

To illustrate the new debugger support features a bit, let's analyze the following assembler program and see to what extent IDAStealth can assist us in doing so.
The code uses a few tricks to complicate the analysis a bit. In fact, the code is quite trivial but it's still good enough to fool IDA with standard settings. Naturally, you can just turn off analysis and generate code at two places and IDA disassembles everything correctly. Also note that these tricks target recursive disassemblers, so OllyDbg isn't heavily affected since it apparently uses an algorithm based on the simple linear-sweep disassembling technique (I might be wrong about this, though).

  1. call lbl_1
  2. dd 25360ABEh ; random junk
  3. ; seh handler starts here
  4. xor eax, eax
  5. jnz lbl_1 + 3 ; fake jump
  6. jz lbl_2
  7. dd 25ff6645h ; random junk
  8.  
  9. lbl_1:
  10. add dword ptr [esp], 4 ; skip 4 junk bytes
  11. push fs:[0]
  12. mov fs:[0], esp
  13. int 3
  14. jmp strings + 2 ; fake jump
  15.  
  16. strings:
  17. szText ttl,"1337"
  18. szText msg, "Hello!"
  19.  
  20. lbl_2:
  21. mov ecx, [esp+12] ; get context pointer
  22. assume ecx:ptr CONTEXT
  23. mov [ecx].regEip, offset quit-123456789
  24. add [ecx].regEip, 123456789 ; never use direct offsets :)
  25. ret
  26. jmp lbl_2 + 2 ; fake jump
  27.  
  28. quit:
  29. pop fs:[0]
  30. add esp, 4
  31. invoke MessageBox, 0, addr msg, addr ttl, 0
  32. invoke ExitProcess, 0
  33. ret

Actually, this code is really simple, but nevertheless it is able to confuse most (recursive) disassemblers (and specifically IDA).
The first call implicitly pushes the address of the 4 junk bytes onto the stack and dispatches to lbl_1. The code at lbl_1 fixes the address pushed by the call to lbl_1 to point past the junk bytes. Then it adds an SEH record to the thread environment block, using the corrected address as the SEH handler. Finally, the code raises an exception by means of the int 3 instruction in line 13. As a consequence, the OS redirects control flow to the recently installed SEH handler.

The handler in turn uses a very simple trick to confuse the disassembly process employed by recursive traversal disassemblers: one of the successor addresses points to invalid code by jumping "into the middle" of an instruction, thereby confusing the disassembler.
The important point here is to know how a recursive traversal disassembler explores a binary program: it starts to disassemble until an instruction is encountered which causes a change in control flow. Then, the disassembler has to determine the successors, i.e. the possible branch targets of this instruction. After the successor addresses have been determined, the algorithm starts over.
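The exploration strategy just described can be sketched in a few lines. The instruction set below is a made-up toy encoding (the opcode bytes merely resemble x86), and the worklist loop is a simplified illustration rather than IDA's actual algorithm. The sample program contains a conditional jump whose target lies in the middle of a mov instruction.

```python
# Toy instruction set (hypothetical, for illustration only):
#   0x90        nop          1 byte
#   0xB8 lo hi  mov imm16    3 bytes
#   0xEB rel8   jmp target   2 bytes, unconditional
#   0x75 rel8   jnz target   2 bytes, conditional
#   0xC3        ret          1 byte
LENGTHS = {0x90: 1, 0xB8: 3, 0xEB: 2, 0x75: 2, 0xC3: 1}


def decode(code, addr):
    """Return (length, successors) for the instruction at addr, or None if invalid."""
    length = LENGTHS.get(code[addr])
    if length is None or addr + length > len(code):
        return None
    op, nxt = code[addr], addr + length
    if op == 0xC3:
        return length, []                      # ret: no known successor
    if op in (0xEB, 0x75):
        rel = code[addr + 1]
        rel = rel - 256 if rel >= 128 else rel  # sign-extend the 8-bit offset
        target = nxt + rel
        # An unconditional jump has one successor, a conditional jump has two.
        return length, [target] if op == 0xEB else [nxt, target]
    return length, [nxt]                        # plain fall-through


def recursive_traversal(code, entry):
    """Follow every possible successor; returns {address: instruction length}."""
    listing, worklist = {}, [entry]
    while worklist:
        addr = worklist.pop()
        if addr in listing or not 0 <= addr < len(code):
            continue
        decoded = decode(code, addr)
        if decoded is None:
            listing[addr] = None                # undecodable byte
            continue
        listing[addr], successors = decoded
        worklist.extend(successors)
    return listing


# jnz +1 jumps into the middle of the following 3-byte mov.
program = bytes([0x75, 0x01, 0xB8, 0x90, 0xC3, 0xC3])
print(sorted(recursive_traversal(program, 0).items()))
# [(0, 2), (2, 3), (3, 1), (4, 1), (5, 1)]
```

The listing contains the mov at address 2 (covering bytes 2 to 4) as well as bogus instructions at addresses 3 and 4, because the traversal cannot tell that the branch target is never reached.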

So the disassembly process can be disturbed if there is one successor which points to invalid code but this code is actually never reached at runtime.
The simple trick here consists of xor'ing eax with itself, which always sets the zero flag. Although the jnz branch is therefore never taken, the disassembler has to assume that it might be taken. Since the successor address of that instruction points into the middle of an instruction, the disassembler creates an incorrect disassembly listing because it tries to disassemble invalid code but does not know that this program point is actually unreachable.
In this simple case however, a disassembler could determine that this is indeed unreachable code by performing a simple data flow analysis, e.g. conditional constant propagation.

The junk bytes right behind the jump instructions (line 7) are crafted in a way such that if disassembled, the resulting instruction overlaps with the next valid instruction starting at line 10.
The last trick used at lbl_1 is the jump instruction after the interrupt. Again, this is unreachable code since the control flow won't reach the instruction following the interrupt. The effect however is, that the strings are treated as code.

After the exception, control flow resumes at line 4 and branches to lbl_2.
The exception handler then modifies the CPU context and adjusts the instruction pointer. In order to further complicate the disassembling process, the instruction pointer is indirectly modified by computing its value instead of simply assigning it.
Finally, the handler returns zero which signals the OS to continue execution with the modified context.
So in the end, the code reaches the quit label, uninstalls the SEH handler, displays a message box and terminates the process gracefully.
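The indirect computation of the new instruction pointer is plain 32-bit wraparound arithmetic. The sketch below replays lines 23 and 24 of the listing; the address 401054h for the quit label is an assumption taken from the debugger output of the test executable, and the function name is made up.

```python
MASK32 = 0xFFFFFFFF
QUIT = 0x401054   # assumed address of the quit label in the test executable
KEY = 123456789   # constant used in lines 23 and 24 of the listing


def handler_new_eip(quit_addr, key):
    """Replay the two instructions that write CONTEXT.regEip."""
    eip = (quit_addr - key) & MASK32   # mov [ecx].regEip, offset quit-123456789
    eip = (eip + key) & MASK32         # add [ecx].regEip, 123456789
    return eip


print(hex((QUIT - KEY) & MASK32))        # the value that appears in the binary
print(hex(handler_new_eip(QUIT, KEY)))   # the value the thread resumes at
```

Only the first value is visible as an immediate operand in the binary, and it does not point into the code section, which is why a static disassembler has no obvious reason to continue at quit.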

If we assemble the code and let IDA analyze the executable we can see that the tricks worked pretty well:

So how do we analyze this code? By manually creating an instruction at 401013h we can at least analyze the functionality used to install the SEH handler and to generate the int 3 exception. Note the impact of the fake jump after the interrupt: the instructions from 40102Ah to 401038h are actually the embedded strings which are passed later to the MessageBox API.

However, we still need to manually create new code at 401009h in order to reveal the instructions belonging to the installed exception handler. Without doing so, we are unable to follow the flow of execution.
The picture below shows the first few instructions of the SEH handler.

The end of the SEH handler modifies the instruction pointer in order to let the OS continue the thread after the exception at the quit label. Obviously, IDA cannot easily figure out that these calculations actually result in a valid code section offset and consequently completely fails to disassemble any further instructions.

In order to follow the control flow from here we either need a good understanding of how SEH works or guess that these calculations yield some interesting address and start over from there. By doing so, we end up with the situation shown in the picture below.

Seems like a lot of work - and this is still an extremely trivial example. Wouldn't it be nice if we could automatically halt the debugger at the beginning of the SEH handler, or better yet, at the new position after the handler modified the instruction pointer?
Well, just download the new IDAStealth and enable the two corresponding options in the tab Other Options.

If you let the debugger run, it will automatically halt at 401009h (line 4) and at 401054h (line 29), respectively. You can also tell IDAStealth to just log SEH events to IDA's output window without having the debugger stopped each time.
You can download the test executable along with the source code from here.

IDAStealth v1.2 - Themida Support

Finally, IDAStealth is able to successfully hide the IDA debugger from Themida. The previous version of IDAStealth failed to provide "enough stealth" because Themida creates private mappings of various system DLLs, particularly of ntdll.
This effectively bypasses any user mode hooks, so I had to resort to a kernel mode driver which replaces some function pointers in the System Service Descriptor Table [1, 2].

The driver actually replaces two functions: NtQueryInformationProcess and NtSetInformationThread.
The former provides a query mechanism for the ProcessDebugPort, ProcessDebugObjectHandle and ProcessDebugFlags information classes.
The latter is used by Themida to detach threads from the debugger. As a consequence, the debugger does not receive events for the detached threads anymore. Furthermore, the debugger is unable to stop or suspend the process and you have to kill the debugger itself to stop the process.

Naturally, the source code for the stealth driver is also included in the package.

[1] Rootkits: Subverting the Windows Kernel
[2] Uninformed vol. 8

Hex-Rays Plugin Contest (Updated)

I placed 3rd in the Hex-Rays plugin contest, thanks to the Hex-Rays people :)

There are still some issues with the IDAStealth plugin, though. I hope to fix them asap, but I'm currently writing my diploma thesis and don't have much spare time at the moment. I'm working on it.

However, if you find bugs, please report them to me so I can improve the plugin. Thanks :)

I didn't have time yet to look into the other submissions more deeply, but the DWARF plugin as well as the flash disassembler plugin look very promising - kudos to the winners and all other contestants.

Update:
IDAStealth currently has problems (at least) with Themida and executables with fake TLS callbacks. I'm working on it and trying to fix these issues asap.

Update2:
The new version restores the complete IMAGE_NT_HEADERS upon injection, so it works with packers which checksum the PE header or rely upon the original representation of e.g. the import directory.
Fake TLS callbacks are triggered as a side effect of an additional DLL (i.e. the HideDebugger.dll) being present in the address space of the process; otherwise the Windows loader does not attempt to invoke the fake TLS callbacks (if you happen to know why this is the case, please let me know).
However, the TLS callback invocation in ntdll_LdrpRunInitializeRoutines is guarded by an SEH frame, so it is safe to just pass any exception back to the process to let it resume gracefully.

Project Updates

IDAStealth

First of all, the new IDAStealth v1.1 supports remote debugging, has a new WTL based GUI and supports profiles.
As requested by some people, the source now builds out of the box, given you have the required libraries in your include path (see readme for instructions). Some minor bugfixes also made it into the new version.

IDA Stealth v1.0 final

What's new?

It's been some time since the last update, so here it is.
Finally, a driver to emulate the return value of the RDTSC instruction has been added, errors in the debug register handling were fixed and the stealthiness of the GetTickCount hook has been improved.

RDTSC emulation

Well, RDTSC emulation is actually rather widely used, so I wanted to include this technique, before releasing a final version.
A common anti debugging trick is to use the RDTSC instruction, which returns a 64 bit time stamp value in the EDX:EAX register pair. Under the assumption that tracing or even halting a program introduces a noticeable delay, an attached debugger can be caught easily by evaluating the deltas between consecutive time stamps. So how can we address this issue?
Those who read the Intel manuals carefully might have noticed that the CR4 register has a time stamp disable flag (TSD), which can (only) be modified from ring 0. If this bit is set, the CPU raises a #GP (general protection fault) every time the RDTSC instruction is executed from a privilege level > 0, i.e. user mode.
#GP is simply an interrupt (actually fault nr. 13), which means that as soon as it is raised, the CPU switches to privilege level 0 (i.e. kernel mode) and grabs the associated entry from position 13 in the interrupt descriptor table (IDT). This entry holds the address of the respective interrupt handler, which is responsible for dealing with this kind of exception.
Now you see how we can hide our debugger from the aforementioned detection technique: we can write a simple driver which replaces the #GP fault handler with its own handler. All this handler has to do is check whether the #GP has actually been caused by an RDTSC instruction (a #GP can occur due to many other reasons) and, if that's the case, return a fake value accordingly. IDAStealth currently allows you to force the driver to always return zero or to let it start with a random value and increase it every time by a given delta.
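The logic of such a handler can be modelled in user-mode Python. This is only a sketch of the two modes described above: the class and function names are made up, and the real driver does this in ring 0 inside the #GP handler. The only hard fact used is that RDTSC is encoded as the two bytes 0F 31.

```python
import random

RDTSC_OPCODE = b"\x0f\x31"  # encoding of the RDTSC instruction
MASK64 = 0xFFFFFFFFFFFFFFFF


def faulted_on_rdtsc(code, eip):
    """Check whether the faulting instruction pointer points at an RDTSC."""
    return code[eip:eip + 2] == RDTSC_OPCODE


class FakeTsc:
    """Model of the two emulation modes: always zero, or random start plus delta."""

    def __init__(self, mode, delta=0, seed=None):
        if mode not in ("zero", "increasing"):
            raise ValueError("unknown mode")
        self.mode = mode
        self.delta = delta
        self.value = 0 if mode == "zero" else random.Random(seed).getrandbits(64)

    def read(self):
        """Return (edx, eax) for one emulated RDTSC and advance the counter."""
        current = self.value
        if self.mode == "increasing":
            self.value = (self.value + self.delta) & MASK64
        return current >> 32, current & 0xFFFFFFFF


tsc = FakeTsc("increasing", delta=100, seed=1)
first, second = tsc.read(), tsc.read()
print(first, second)
```

After returning the fake value, a real handler would also have to advance the saved instruction pointer by two bytes so that the faulting RDTSC is skipped.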
To additionally increase stealthiness, the driver is optionally given a random name each time it is loaded, so it cannot be easily unmasked by scanning all loaded driver modules for suspicious names.
Important note: when using this option, you must be sure to NOT have two or more instances of this driver running at the same time, because there is no way for IDAStealth to check if another instance already started this driver (this is the whole purpose of driver object name randomization!). If those drivers aren't unloaded in the exact opposite order they were started, the system will crash when each driver tries to restore the original handler in the IDT. This is due to the fact that each driver doesn't know (or can't know) that the respective IDT entry has already been hooked by another instance.
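Why the unload order matters can be shown with a tiny simulation. The IDT is modelled as a dictionary and each driver as an object that saves whatever handler it finds; all names are hypothetical.

```python
GP_VECTOR = 13  # IDT slot of the general protection fault


class HookDriver:
    """Model of a driver that hooks the #GP entry and restores it on unload."""

    def __init__(self, name):
        self.handler = "handler_" + name
        self.saved = None

    def load(self, idt):
        self.saved = idt[GP_VECTOR]      # remember whatever was installed before
        idt[GP_VECTOR] = self.handler

    def unload(self, idt):
        idt[GP_VECTOR] = self.saved      # blindly restore the remembered entry


def final_handler(unload_order):
    """Load drivers A then B, unload in the given order, return the final IDT entry."""
    idt = {GP_VECTOR: "original"}
    drivers = {"A": HookDriver("A"), "B": HookDriver("B")}
    drivers["A"].load(idt)
    drivers["B"].load(idt)
    for name in unload_order:
        drivers[name].unload(idt)
    return idt[GP_VECTOR]


print(final_handler("BA"))  # original
print(final_handler("AB"))  # handler_A
```

With the wrong order the IDT ends up pointing at the handler of driver A, which has already been unloaded, so the next #GP jumps into freed memory.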

If you want to build the driver by yourself, you will need the DDK and the ddkbuild script. The source code is heavily commented and I tried to use as little inline assembler as possible.

Armadillo 4.x

IDAStealth had a bug in the GetThreadContext hook, causing the emulation routine to always return the complete thread context for the given thread, even if the caller only asked for the debug registers. Some versions of Armadillo detect this misbehavior and eventually enter an endless loop if a call to the GetThreadContext API returns "more information than requested".
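A corrected emulation has to honor the ContextFlags field of the request. The sketch below models that with dictionaries. The flag values are the x86 CONTEXT_* constants from the Windows headers; the register grouping is abbreviated, and zeroing the debug registers is an assumption about what a stealth hook would want to return, not something taken from the IDAStealth source.

```python
CONTEXT_i386 = 0x00010000
CONTEXT_CONTROL = CONTEXT_i386 | 0x01
CONTEXT_INTEGER = CONTEXT_i386 | 0x02
CONTEXT_DEBUG_REGISTERS = CONTEXT_i386 | 0x10

DEBUG_REGS = ("Dr0", "Dr1", "Dr2", "Dr3", "Dr6", "Dr7")
GROUPS = {
    CONTEXT_CONTROL: ("Ebp", "Eip", "SegCs", "EFlags", "Esp", "SegSs"),
    CONTEXT_INTEGER: ("Edi", "Esi", "Ebx", "Edx", "Ecx", "Eax"),
    CONTEXT_DEBUG_REGISTERS: DEBUG_REGS,
}


def emulate_get_thread_context(real_context, context_flags):
    """Return only the register groups the caller asked for.

    Debug registers are reported as zero (assumption: hide hardware breakpoints).
    """
    result = {"ContextFlags": context_flags}
    for group_flag, registers in GROUPS.items():
        if context_flags & (group_flag & ~CONTEXT_i386):  # test the group bit only
            for reg in registers:
                result[reg] = 0 if reg in DEBUG_REGS else real_context[reg]
    return result


# Made-up thread state with a hardware breakpoint in Dr0.
real = {reg: 0x1000 + i for i, regs in enumerate(GROUPS.values()) for reg in regs}
real["Dr0"] = 0x401009
print(emulate_get_thread_context(real, CONTEXT_DEBUG_REGISTERS))
```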

GetTickCount hook

The implementation of the GetTickCount hook was flawed, because after a few iterations the returned value would drop to zero and never change that value again.
The new GetTickCount hook now always mimics the original algorithm of the API and uses the performance counter to initialize an internal 64 bit value. This internal tick counter is fed into the original algorithm. The delta by which this counter is increased every time the handler is executed can be adjusted from the IDAStealth GUI.
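As a rough sketch of the idea, the class below keeps an internal 64-bit counter and derives the returned tick value from it. The formula (counter times multiplier, shifted right by 24) is an assumption about how the XP-era GetTickCount computes its result from the shared tick data; the class name, the start value and the multiplier parameter are made up for illustration.

```python
MASK32 = 0xFFFFFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


class FakeTickCount:
    """Internal 64-bit counter that is advanced by a fixed delta per call."""

    def __init__(self, start, delta, multiplier):
        self.counter = start & MASK64   # e.g. seeded from the performance counter
        self.delta = delta              # adjustable step per call
        self.multiplier = multiplier    # assumed tick multiplier of the original API

    def get_tick_count(self):
        # Assumed original algorithm: (ticks * multiplier) >> 24, truncated to 32 bits.
        ticks = ((self.counter * self.multiplier) >> 24) & MASK32
        self.counter = (self.counter + self.delta) & MASK64
        return ticks


fake = FakeTickCount(start=1000, delta=5, multiplier=1 << 24)
print([fake.get_tick_count() for _ in range(3)])  # [1000, 1005, 1010]
```

Because the counter only ever moves forward by the configured delta, the returned value cannot get stuck at zero the way the old hook did.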

ToDo

The new version passes all tests in xADT besides the "find tool complex" (which I don't consider to have any meaning in practice) and the new NtSystemDebugControl test. This test seems to be very unreliable: sometimes the debugger was detected, sometimes not (tested on XP SP3).
The rootkit based test always made the application crash and I didn't investigate why the crash occurred anyway.
If your favorite packer is still able to detect the IDA debugger or if anything isn't working as expected, please let me know and I will try to fix it.
Everything else as usual on the IDAStealth site.

Compiler optimizations for constant divisors

Optimizing for speed

Today's compilers do a decent job of optimizing high level code for execution speed. As compilers get more and more complex over time, so does their emitted code. This tutorial will focus on a specific arithmetic optimization done by an optimizing compiler to avoid costly instructions such as div and idiv by transforming the calculations to fixed point arithmetic.
Actually, I just wanted to write some random stuff to test the new DruTeX plugin ;-)

To give you an idea of what such an optimized block of code looks like, here's a block of code I came across some time ago while analyzing some binary with IDA:

mov ebx, some_value
mov eax, 0CCCCCCCDh
mul ebx
mov eax, edx
shr eax, 3

where some_value represents an arbitrary integer value. At first sight the purpose of this code doesn't seem to be that obvious, does it? Before you think about it, recap how the mul instruction works: mul ebx computes EAX*EBX and stores the 64-bit result in EDX:EAX, i.e. the higher 32 bits in EDX and the lower 32 bits in EAX.

Magic constants and fixed point arithmetic

First of all, let me clarify that the code above just represents an optimized version of a regular (integer) division. We will see in a few seconds how to make sense of this code.

So what is fixed point arithmetic anyway?
Fixed point arithmetic allocates a fixed set of bits to the integral as well as the fractional part of a rational number.
In this situation the compiler exploits the fact that the x86 CPU is capable of handling 64 bit numbers by means of the register pair EDX:EAX, that is, we reserve 32 bits for the integral and another 32 bits for the fractional part of our rational number.
So if we want to represent a fractional number in terms of CPU registers, we could say: EDX is our integral part whereas EAX is our fractional part. And that's exactly what's going on here.
This means that the high 32 bits (EDX) represent a number between 0 and $2^{32}$-1, and EAX represents a fraction between 0 and 1 (exclusive).
Let's say you want to represent the number $\frac{1}{3} \approx 0.333333333$.
The registers would look like this:
EDX=0 and EAX=55555555h. Now let's see how we get to that magic constant.

We know that we have 32 bits to represent a value between 0 and 1. A one would correspond to $2^{32}$ (all 32 bits set is just below one), so we scale the fractional part by $2^{32}$, which means that if we compute $\frac{1}{3}*2^{32}$ and drop the remainder, we get our magic constant: 55555555h.
The important fact is that $2^{32}$ represents a one, so in order to represent $\frac{1}{3}$ we have to divide $2^{32}$ by 3.
If we now calculate X*55555555h we get the result in EDX:EAX! And now you see the magic:
EDX contains the integral part of $\frac{x}{3}$. We just divided x by 3 without using the div instruction! (Strictly speaking, because 55555555h is rounded down, EDX comes out one too small whenever x is an exact multiple of 3, e.g. x = 3 yields 0. Real compilers therefore emit a rounded-up constant plus a shift, as in the division by 10 code revisited below.)
Notice that this code is generated by an optimizing compiler because the div instruction is usually much slower than the mul instruction. Of course this can only be done if the compiler knows the divisor at compile time because the magic constant and the surrounding code are generated from it.
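The construction of the constant can be checked quickly in Python. The helper names are made up. The rounded-up constant 0AAAAAAABh with one extra shift is the variant compilers commonly emit for an unsigned division by 3, and it is shown here next to the rounded-down 55555555h from the text.

```python
def mulhi32(a, b):
    """Upper 32 bits (EDX) of the 64-bit product that mul leaves in EDX:EAX."""
    return ((a & 0xFFFFFFFF) * (b & 0xFFFFFFFF)) >> 32


ONE = 1 << 32                      # fixed point representation of 1.0
MAGIC_FLOOR = ONE // 3             # 1/3 rounded down
MAGIC_CEIL = -(-2 * ONE // 3)      # 2/3 rounded up, i.e. ceil(2^33 / 3)


def div3_naive(x):
    """x * (1/3) with the rounded-down constant from the text."""
    return mulhi32(x, MAGIC_FLOOR)


def div3_exact(x):
    """x * (2/3) with the rounded-up constant, followed by a shift by one."""
    return mulhi32(x, MAGIC_CEIL) >> 1


print(hex(MAGIC_FLOOR), hex(MAGIC_CEIL))   # 0x55555555 0xaaaaaaab
print(div3_naive(100), div3_naive(3))      # 33 0
print(div3_exact(100), div3_exact(3))      # 33 1
```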

Example revisited

Now let's look at the code above again:
It's easy to see that CCCCCCCDh represents $\frac{4}{5}=0.8$.
Let some_value be denoted by x for now: we compute x*CCCCCCCD which corresponds to $x*\frac{4}{5}$ and store the result in EDX:EAX. Then the code grabs the integral part from EDX, puts it in EAX and performs a right shift by 3, which is equivalent to dividing by 8. So the complete calculation is:
$x*\frac{4}{5}*\frac{1}{8} = \frac{x}{10}$.
So in the end we just divided by 10, but with ridiculous - err ludicrous speed!
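The instruction sequence can be replayed step by step and compared against a regular integer division. This is just a verification sketch; the function name is made up.

```python
def div10(x):
    """Replay of: mov eax, 0CCCCCCCDh / mul ebx / mov eax, edx / shr eax, 3."""
    edx = ((x & 0xFFFFFFFF) * 0xCCCCCCCD) >> 32   # upper half of EDX:EAX after mul
    return edx >> 3                               # shr eax, 3


for x in (9, 10, 99, 12345, 0xFFFFFFFF):
    print(x, div10(x), x // 10)
```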
Before we reach the end of this little tutorial, let me give you one last example:

mov ebx, some_value
mov eax, 24924925h
mul ebx
mov eax, ebx
sub eax, edx
shr eax, 1
add eax, edx
shr eax, 2

Let some_value be denoted by x again:
First we have to find out what the magic constant is, so we calculate:
$\frac{613566757}{2^{32}} \approx \frac{1}{7}$. By translating the above asm statements it's easy to see that we get to the following equations:

$\frac{\frac{x-\frac{x}{7}}{2}+\frac{x}{7}}{4} = \frac{\frac{\frac{6x}{7}}{2}+\frac{x}{7}}{4} = \frac{\frac{4x}{7}}{4} = \frac{x}{7}$.
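The additional operations are emitted by the compiler to avoid rounding errors: a multiplier that is exact for every 32-bit x does not fit into 32 bits for the divisor 7, so the compiler corrects the result with the sub/shr/add sequence, which computes $\frac{x+\frac{x}{7}}{2}$ without overflowing a 32-bit register. This becomes significant for very large numbers. If you want to understand what I mean, have a look at the division by 5 code produced by an optimizing compiler.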
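The following sketch replays the division by 7 sequence with explicit 32-bit truncation. For comparison it also includes a shortcut that adds x and the product directly in a 32-bit register. Both function names are made up.

```python
MASK32 = 0xFFFFFFFF


def div7(x):
    """Replay of the eight-instruction sequence shown above."""
    x &= MASK32
    edx = (x * 0x24924925) >> 32      # mul ebx: upper half, roughly x/7
    eax = (x - edx) & MASK32          # mov eax, ebx / sub eax, edx
    eax >>= 1                         # shr eax, 1
    eax = (eax + edx) & MASK32        # add eax, edx
    return eax >> 2                   # shr eax, 2


def div7_overflowing(x):
    """Shortcut (x + x/7) >> 3 done in 32 bits: the addition can overflow."""
    x &= MASK32
    edx = (x * 0x24924925) >> 32
    return ((x + edx) & MASK32) >> 3


for x in (6, 7, 1000, 0xFFFFFFFF):
    print(x, div7(x), div7_overflowing(x), x // 7)
```

For small inputs both variants agree with x // 7; for 0FFFFFFFFh only the full sequence does, because x plus the product no longer fits into 32 bits.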

Common constants

Here's a table with some randomly chosen constants and the corresponding fractions:

Magic constant    Resulting fraction
55555555h         $\frac{1}{3}$
66666667h         $\frac{2}{5}$
CCCCCCCDh         $\frac{4}{5}$
24924925h         $\frac{1}{7}$
38E38E39h         $\frac{2}{9}$

The complexity of the assembly code varies depending on the divisor, e.g. the code for $\frac{1}{7}$ is more complex than the one for $\frac{1}{3}$.
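When you come across an unknown constant, the fraction it stands for can be recovered with a few lines of Python. The sketch uses the standard fractions module; the helper name and the denominator limit of 100 are arbitrary choices.

```python
from fractions import Fraction

MAGICS = [0x55555555, 0x66666667, 0xCCCCCCCD, 0x24924925, 0x38E38E39]


def magic_to_fraction(magic, max_denominator=100):
    """Interpret a 32-bit constant as a fixed point fraction of 2^32."""
    return Fraction(magic, 1 << 32).limit_denominator(max_denominator)


for magic in MAGICS:
    print(f"{magic:08X}h  {magic_to_fraction(magic)}")
```

Keep in mind that the shifts following the multiplication still have to be taken into account to get the actual divisor, e.g. 4/5 shifted right by 3 gives 1/10.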

Conclusion

As you saw, an optimizing compiler can make a reverse engineer's life a little bit harder, but also more interesting if you know what's going on under the hood.

Evil Client v1.5.2

This is a minor update: some crashes were fixed and the GUI of the console window now also has the EC theme - no nasty dos box anymore! Besides, the setup will allow you to keep your settings upon uninstall. That's all :)
Details and download on the Evil Client page.

Evil Client v1.5.1

Some minor fixes made it into the new version. First of all, the --reconnect command line switch now uses a possibly specified reconnect delay, so the EC is better suited for scripting. Other than that, the setup is now multi user aware, which means that it detects if the current user has Administrator rights. If that's the case, an option appears to install the shortcuts for all users, otherwise only shortcuts for the current user are created. Keep in mind though, that you might need to clear your EC configuration (found under %APPDATA%\Evil Client\) manually when uninstalling with Administrator rights, because the uninstaller will only be able to remove the configuration for the user who is currently executing the uninstaller. So this issue only appears when you're using the EC from a restricted account, but install from an Admin account (applies to XP and Vista). If you have Administrator rights on Vista however, UAC jumps in and the current user is elevated.
Although this solution is probably not 100% satisfactory, it basically resembles the situation you have under any *nix based system, where the super user installs applications and all the user data reside in the respective home dirs. If you've any suggestions, pls let me know.

Evil Client v1.5 available!

It appeared to be dead, but the Evil Client strikes back again ;)
The new version comes with a bunch of new features, still has the drop-dead gorgeous GUI, occupies very few resources and keeps your VPN connection up 24/7. Besides, I added some documentation, which contains some usage instructions and explains the new features. Everything else on the Evil Client site. Cheers!
