Here’s another Reddit stub dealing with a topic that is near and dear to my heart: debugging! Unfortunately, the comments on that article seem to focus more on the “fuzzy” aspects of debugging – the “go home and mull it over while watching TV” kind of stuff, rather than more concrete debugging techniques.
Whenever I run into a bug whose cause is not immediately obvious, I have a standard bag of tricks that I fall back upon. Every programmer has a toolbox like this – I figured I would write about some of the techniques that I use, and why I use them. Some of them are not applicable to every situation, but there are still many that can be applied to any given bug. These are presented in no particular order.
- Change the inputs.
Many times, by changing the inputs to a function, you can cause a recognizable change to occur in the output. This helps you to envision what’s actually happening in the function, and where things might be going wrong. This can include changing parameter values, input files, textures, etc.
- Do things in a different sequence, or with different timing.
Pretty self-explanatory, and related to the first item. The idea is to observe differences in the behavior of the program in similar circumstances. This is mostly useful for interactive programs.
- Run all of your automated test code, even the slow tests, and examine any issues that are reported.
This is helpful if for no other reason than as a sanity check.
- Check the logs.
This one is pretty standard. Even though most debug logs tend to be overflowing with spam messages, you might still find a smoking gun in there.
- Ensure that you’re validating all of the return values of function calls.
This is also known as the “be paranoid” rule, or perhaps the “re-check all of your assumptions” rule. It’s easy to forget to check return values, but it’s crucial to do so. Code that silently ignores failure can cause problems or symptoms unrelated to the function call that actually failed.
A related problem is returning pointers to objects on the stack. This results in havoc, because the storage for those objects becomes invalid as soon as the function returns, and later calls will overwrite it.
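Here’s a minimal sketch of the return-value rule in Python. The function names and the `key=value` line format are made up for illustration. `str.find` returns -1 when nothing is found, and ignoring that produces a wrong answer somewhere downstream instead of an error at the call that actually failed.

```python
def extract_value_unchecked(line: str) -> str:
    """Return the text after '=' without checking whether '=' was found."""
    idx = line.find("=")   # returns -1 if there is no '='
    return line[idx + 1:]  # with idx == -1 this is line[0:], i.e. the whole line


def extract_value_checked(line: str) -> str:
    """Same lookup, but fail loudly at the point where it actually failed."""
    idx = line.find("=")
    if idx == -1:
        raise ValueError(f"no '=' separator in {line!r}")
    return line[idx + 1:]
```

With the input `"timeout"`, the unchecked version quietly hands back `"timeout"` as if it were a value, while the checked version raises right away.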
- Ensure that pointer values get cleared out when the struct or object to which they point is freed.
Using stale pointer values is a surefire way to get in trouble. Clearing them out when the associated object is freed (except in very, very special circumstances) will help keep you sane.
- Check for any masked exceptions in managed code.
I wrote about this the other day. Ensure that no unexpected exceptions are being silently masked in your code.
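The same trap exists outside of .NET. As a rough Python analogue (the function names, the logger name, and the default port are all hypothetical), compare a handler that swallows everything with one that only handles the failure it expects and leaves a trace when it does:

```python
import logging

log = logging.getLogger("config")  # hypothetical logger name


def parse_port_masked(text, default=8080):
    """Anti-pattern: every failure, expected or not, is silently swallowed."""
    try:
        return int(text)
    except Exception:
        return default


def parse_port_visible(text, default=8080):
    """Handle only the expected failure, and log it when it happens."""
    try:
        return int(text)
    except ValueError:
        log.warning("bad port %r, falling back to %d", text, default)
        return default
```

Passing `None` to the masked version returns the default and hides the fact that the caller never supplied a value. The second version lets the `TypeError` escape, so the real problem surfaces.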
- Check the data.
Make sure that the data you’re trying to use is actually valid! The GIGO principle is as true as ever. Check for data out of the expected range, QNaNs being generated (which are infamous for screwing up subsequent floating-point operations), the ordering of data, legacy data, and correct offsets/sizes of data.
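A quick scan along these lines can be sketched in a few lines of Python. The function name and the idea of a single `[lo, hi]` range are assumptions for the example, and real data usually needs per-field rules.

```python
import math


def find_suspect_values(values, lo, hi):
    """Return (index, value) pairs that are NaN, infinite, or outside [lo, hi]."""
    suspects = []
    for i, v in enumerate(values):
        if math.isnan(v) or math.isinf(v) or not (lo <= v <= hi):
            suspects.append((i, v))
    return suspects
```

The explicit `math.isnan` check spells out the intent. Every ordered comparison with a NaN is false, so a check written as `v < lo or v > hi` would let NaNs straight through.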
- Put in additional logging or debug visualization code.
This can help provide additional information, but this strategy can also backfire, as it can significantly change the timing of your code. (Disk, socket, and/or pipe I/O are relatively expensive operations.) Use with caution. If you have general-purpose code for validating the state of the application, sprinkle calls to that code throughout the application – this can be useful in determining when things go off the rails.
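To make the “sprinkle validation calls” idea concrete, here’s a small sketch. Everything in it is invented for the example: the invariant (the total of all balances never changes), the function names, and the deliberately planted bug.

```python
def validate_state(balances, expected_total, where):
    """General-purpose check: the sum of all balances must never change."""
    total = sum(balances.values())
    if total != expected_total:
        raise AssertionError(f"{where}: total is {total}, expected {expected_total}")


def deposit_fee(balances, src, dst, amount):
    """Deliberately buggy step, used to show the check firing."""
    balances[src] -= amount
    balances[dst] += amount - 1  # bug: one unit goes missing


def run(balances):
    expected = sum(balances.values())
    validate_state(balances, expected, "start")
    balances["a"], balances["b"] = balances["b"], balances["a"]  # harmless swap
    validate_state(balances, expected, "after swap")
    deposit_fee(balances, "a", "b", 10)
    validate_state(balances, expected, "after deposit_fee")  # raises here
```

The label passed to each call tells you which step broke the invariant, which narrows the search to the code between two consecutive checks.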
- Check the crash dump, if you have one.
Post-mortem analysis of core dumps is often extremely useful in tracking down bugs that you didn’t personally witness, or for which the steps to reproduce are lengthy or time-consuming.
- Step through the code in the debugger.
It can sometimes be quite slow if you’re processing large data sets, but it’s often the easiest way to monitor the control flow of a function.
- Inspect the disassembly.
This sounds hardcore, but being able to do this is invaluable in some cases. It’s useful not only for checking the compiler’s output, but also for cases where you’re examining a minidump of managed code. Unless you’re working with CLR4 and Visual Studio 2010, opening these minidumps in WinDbg results in callstacks that don’t include line numbers. You do have the instruction pointer value, though, so you can actually print out the disassembly with the !u SOS command, and compare the disassembly with the original source code of the function to figure out the exact point at which the crash occurred.
- Inspect memory.
If you have pointer problems, look at the contents of memory in the debugger to try and figure out what’s going on. A frequent problem is an invalid offset, which results in struct member values “shifting” forwards or backwards. It helps to be familiar with memory representations of things like floating-point numbers – knowing a couple of common values (0x3f800000 == 1.0f, etc.) can be very handy.
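If you want to build up that familiarity, a couple of helper functions make it easy to go back and forth between floats and their 32-bit patterns. This is just a sketch using Python’s `struct` module; the helper names are my own.

```python
import struct


def float_to_bits(value: float) -> int:
    """Return the IEEE 754 single-precision bit pattern of value."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def bits_to_float(bits: int) -> float:
    """Interpret a 32-bit pattern as an IEEE 754 single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


# A few patterns worth recognizing in a hex dump.
for bits in (0x3F800000, 0x40000000, 0xBF800000, 0x3F000000):
    print(f"0x{bits:08x} -> {bits_to_float(bits)}")
```

The loop prints 1.0, 2.0, -1.0, and 0.5 for the four patterns, in that order.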
- Use conditional or memory breakpoints to isolate the bug.
If you know that a particular object or memory address is related to the bug, you can set up breakpoints to pick out a particular loop iteration or write to a memory location. In cases where you’re interacting with a large body of unknown code, memory breakpoints can be particularly useful for tracking state changes.
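Python has no hardware memory breakpoints, but you can get a rough analogue by hooking attribute writes. The sketch below is only an illustration of the idea (the class, the `health` field, and `apply_damage` are all hypothetical): it records which function wrote to a watched field, and with what value.

```python
import traceback


class WatchedPlayer:
    """Hypothetical object that records every write to its `health` attribute."""

    def __init__(self):
        object.__setattr__(self, "writes", [])  # bypass the hook for bookkeeping
        self.health = 100

    def __setattr__(self, name, value):
        if name == "health":
            # The last two stack entries are the caller and this method.
            caller = traceback.extract_stack(limit=2)[0]
            self.writes.append((caller.name, value))
        object.__setattr__(self, name, value)


def apply_damage(player, amount):
    player.health -= amount
```

After creating a player and calling `apply_damage(player, 30)`, `player.writes` holds `("__init__", 100)` followed by `("apply_damage", 70)`. Adding a condition inside the hook (for example, only recording when `value < 0`) gives you the conditional-breakpoint flavor of the same trick.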
- Try a different build configuration (or turn on asserts).
This is intended to provoke behavior changes, add validation, and otherwise provide additional data points for determining exactly what’s going on. Turning on validation such as array bounds checks and heap checking can help find some tough bugs (albeit at a tremendous cost in execution speed).
- Run on a different platform, or build with a different compiler.
There are often significant differences in timing and other behavior when you run software on a different platform. Endianness and word size also often differ between platforms, which can expose problems or bad assumptions about data in code. Like changing the inputs, careful observation of these differences can help you get an understanding of what’s actually happening. Additionally, if you are using a different compiler, you may see different warnings or code behavior due to optimization.
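As a tiny illustration of the endianness point, here are the same four bytes read with both byte orders. The byte values are arbitrary. Code that assumes one order will read the other platform’s data as a wildly different number.

```python
raw = bytes([0x01, 0x00, 0x00, 0x00])  # four bytes as they might appear in a file

as_little = int.from_bytes(raw, byteorder="little")
as_big = int.from_bytes(raw, byteorder="big")

print(as_little)  # 1
print(as_big)     # 16777216
```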
- Check the version control history for anything suspicious.
Examining changes to the source code can give you a good idea of how the behavior of the program has changed (even if you don’t know much about the changed code to begin with), and can shed some light on a bug. Checking all of the cases where a function is called to ensure that they take into account any changed behavior is essential.
- If your codebase easily allows you to do so, try running the buggy code synchronously instead of asynchronously, for testing purposes.
This can help determine if a race condition is what’s causing the bug to occur. (It should be noted that I’m not crazy about just adding random sleeps into an asynchronous function to determine this – it’s too unreliable for my tastes.)
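If your code already takes a `concurrent.futures` executor as a parameter, one way to do this in Python is to swap in an executor that runs each task immediately on the calling thread. This is a sketch under that assumption; `InlineExecutor` and `process_all` are names I made up, not part of the standard library.

```python
from concurrent.futures import Executor, Future


class InlineExecutor(Executor):
    """Executor that runs each task immediately on the calling thread."""

    def submit(self, fn, /, *args, **kwargs):
        future = Future()
        try:
            future.set_result(fn(*args, **kwargs))
        except Exception as exc:
            future.set_exception(exc)
        return future


def process_all(executor, items):
    """Hypothetical workload that accepts any Executor."""
    return list(executor.map(lambda x: x * x, items))
```

Calling `process_all(InlineExecutor(), data)` instead of passing a `ThreadPoolExecutor` runs the same code path with no concurrency at all. If the bug disappears, a race condition becomes a much more likely suspect.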
- Check (and re-check) your data dependencies in asynchronous code.
Don’t fall into the trap of trying to envision multithreaded code by imagining each possible combination of instruction pointer values. Instead, when trying to prove correctness, focus entirely on data dependencies and ensuring that locks are used correctly and respected. (For deadlock bugs, inspect the order in which locks are taken, and check for proper use of back-off and other algorithms for avoiding deadlock.)
Note that applying these techniques won’t help you write optimal multithreaded code – this requires much broader insight into the particular algorithm in question, and the overall architecture of the code. However, they will help in tracking down correctness issues.
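For the lock-ordering check in particular, the usual fix is to make every code path take its locks in one agreed-upon order. Here’s a minimal Python sketch of that idea. The `Account` class and the choice of ordering by name are assumptions for the example, and it requires `src` and `dst` to be different accounts.

```python
import threading


class Account:
    def __init__(self, name, balance):
        self.name = name
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src, dst, amount):
    """Move amount from src to dst; src and dst must be different accounts."""
    # Always lock in name order, regardless of transfer direction, so two
    # opposite transfers can never each hold the lock the other needs.
    first, second = sorted((src, dst), key=lambda acct: acct.name)
    with first.lock, second.lock:
        src.balance -= amount
        dst.balance += amount
```

If `transfer` instead locked `src` first and `dst` second, one thread moving money from a to b and another moving it from b to a could each grab one lock and wait forever on the other.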
- Turn off chunks of the code, or switch to a different implementation of an interface.
If you have multiple providers of an interface available, try using a different one, and see how the behavior of the program changes. (Using null/echo interfaces is a common debugging technique.) Additionally, you can try disabling features of your application to see if they are somehow related to the bug.
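In case the null/echo idea is unfamiliar, here’s roughly what it looks like. The audio interface and `run_frame` are invented for the example; the point is only that both stand-ins satisfy the same interface as the real provider.

```python
class NullAudio:
    """Provider that accepts every call and does nothing."""

    def play(self, sound_name):
        pass


class EchoAudio:
    """Provider that records each call so it can be inspected afterwards."""

    def __init__(self):
        self.calls = []

    def play(self, sound_name):
        self.calls.append(sound_name)


def run_frame(audio, events):
    """Hypothetical game step: plays one sound per event, returns events handled."""
    for event in events:
        audio.play(event + ".wav")
    return len(events)
```

If the bug vanishes with `NullAudio`, the real provider (or how you call it) is implicated. `EchoAudio` then shows you exactly which calls were made, and in what order.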
- Use third-party validation tools and/or debugging information.
This includes things like the Direct3D debug runtime, FxCop, lint, valgrind, the Visual C++ runtime debug heap functions, the Application Verifier, and the checked build of Windows. The more debugging aids you have active, the more likely that you’ll get a clue upon which you can act.
- If the code is unfamiliar, find out who wrote it, and start asking them basic questions about it.
This is similar to the “be paranoid” rule, except that by asking all of these basic questions of the author, you’re forcing them to re-check all of their assumptions. It’s not uncommon to have a eureka moment while explaining a bit of code to someone else.
- Try turning off optimizations for a chunk of code.
My experience has been that people tend to fall back on the “it must be a compiler bug” explanation way earlier than they really should. Nevertheless, turning off optimizations for a section of code might help you debug a problem that occurs in optimized builds. (Whether it’s a genuine optimizer bug, or, say, a misuse of the C99 restrict type qualifier, is for you to find out. Anyone interested in using the latter, incidentally, should really read this excellent article by Mike Acton on the topic.) If it does turn out to be a genuine optimizer bug, performing a quasi-“binary search” when turning off optimizations can help minimize the time spent hunting for the offending code snippet.
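The binary search itself can be sketched as a small driver. This assumes exactly one compilation unit is responsible, and that you can supply a callback (hypothetical here) that rebuilds with only a given subset optimized, runs the repro, and reports whether the bug still shows up.

```python
def find_culprit(units, bug_reproduces):
    """Return the single unit whose optimization triggers the bug.

    bug_reproduces(optimized) is a caller-supplied callback that rebuilds with
    only the units in `optimized` optimized, runs the repro, and returns True
    if the bug still shows up. Assumes exactly one culprit and a non-empty list.
    """
    candidates = list(units)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        if bug_reproduces(half):
            candidates = half
        else:
            candidates = candidates[len(half):]
    return candidates[0]
```

Each round halves the candidate list, so even a few hundred source files need fewer than ten rebuilds. The same loop works at a finer grain if your compiler lets you toggle optimization per function.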
- Try running on a different machine, or piece of hardware.
Hardware failure is another bug explanation of which programmers tend to be a little too fond. However, it does happen occasionally, so it’s definitely something worth testing if you run out of other ideas.
That’s pretty much all I can think of at this point. There are a few more tips that spring to my mind, but they are pretty specific to Windows or Visual Studio development, so I won’t recount them here.