What Just Happened?

Sometimes, my computer freezes for a moment. It might be after I pressed a button or typed some text. It might not have an obvious cause – just a background daemon doing something at its scheduled time. It’s not usually a problem – the system goes back to normal after a few seconds. However, it’s annoying, and it takes away my ability to use my computer as a tool for those few seconds. It’s a tiny usability bug.

However, this bug is harder to fix than most. For starters, there’s not always a clear reason for the freeze. If I just clicked on something, I might assume that the program that received the click is doing some processing. But is it really, or is it doing blocking I/O and waiting for the kernel to load a fragmented file from disk? Or is it trying to make a network connection and waiting for my (sometimes slow) Internet connection to return? If there’s no obvious cause, the problem is even worse.

This post is part of a new idea I had: blogging about things I wish I could do but probably never will, because I don’t have that much time. In the ideal case, someone else would take the idea and do it, but at the very least I’ll have them down in bits so I can look back years later and say, “yeah, that was a good idea. Too bad I was busy.”

Today’s idea is a way to debug problems like the ones above. Like any nice open source software developer, I want to be able to either fix programs that are causing problems or send in good bug reports. But with strange bugs like the above, that’s hard. First of all, I don’t really know which program is causing the problem, although in some cases I can make educated guesses. But even if I did know which program was the problem, how should I tell someone to reproduce it? “Run the program for 2 hours or so, then click this button and see if you get a pause. If not, keep clicking.” I think the best way to get information on something like this is to already be profiling the code when the slowdown happens.

Therefore, the way to find and catch all of the small, subtle performance bugs like that is to be profiling all the time. What I want is a daemon that sits on my system just running oprofile.

To keep its memory usage down, it only needs to keep logs for the last, say, 30 seconds of my computer usage. But when I hit a pause, I want to be able to press a button to save those last 30 seconds of logs to a file somewhere.
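Here is a rough sketch of what that rolling buffer might look like. It has nothing to do with oprofile's real interface. The `RollingLog` class, its method names and the one-line-per-event text format are all made up for illustration, and a real daemon would be storing profiler samples rather than strings.

```python
import time
from collections import deque


class RollingLog:
    """Keep only the events recorded within the last `window` seconds.

    Hypothetical helper: the names and the log format are assumptions,
    not part of oprofile or any existing tool.
    """

    def __init__(self, window=30.0, clock=time.monotonic):
        self.window = window
        self.clock = clock  # injectable so the buffer can be tested
        self._events = deque()  # (timestamp, text) pairs, oldest first

    def record(self, text):
        """Append an event and drop anything older than the window."""
        now = self.clock()
        self._events.append((now, text))
        self._trim(now)

    def _trim(self, now):
        cutoff = now - self.window
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def snapshot(self):
        """Return the events still inside the window, oldest first."""
        self._trim(self.clock())
        return list(self._events)

    def dump(self, path):
        """The "button": write the current window to a text file."""
        events = self.snapshot()
        with open(path, "w", encoding="utf-8") as f:
            for timestamp, text in events:
                f.write(f"{timestamp:.3f} {text}\n")
        return len(events)
```

Memory stays bounded because old events are thrown away every time a new one arrives, so the buffer never holds much more than one window's worth of data.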

Basically, what I want is a “what just happened?” button for my computer. It should include profiling information and enough information about input events that I know what might plausibly have caused it. It should also know something about inter-process communication and networking, for the same reason. I want to be able to get all of this output in text form, so I can send it as an email to the mailing list of my favorite open source project, or in graph form, if I just want to visualize my own system.

Once you think of it as a “what just happened?” button, other ideas pop into your head. For instance, what was the top function on the call stack in every program during the times we’re logging? I’m not sure if we can even record that, but it would be interesting information.
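If the stacks could be recorded, summarizing them would be the easy part. The sketch below assumes each sample arrives as a (program, stack) pair with the innermost frame last. That layout and the `top_functions` name are my own invention for illustration, not anything oprofile produces.

```python
from collections import Counter, defaultdict


def top_functions(samples):
    """Map each program to its most often sampled top-of-stack function.

    `samples` is an iterable of (program, stack) pairs, where `stack` is a
    list of function names with the innermost frame last. This layout is
    an assumption, not a real profiler format.
    """
    counts = defaultdict(Counter)
    for program, stack in samples:
        if stack:  # skip samples with no recorded frames
            counts[program][stack[-1]] += 1
    # most_common(1) yields a single (function, hit_count) pair
    return {prog: c.most_common(1)[0] for prog, c in counts.items()}
```

A program that spends the whole logged window with the same function on top of its stack would stand out right away in a report like this.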

It is a key feature of the idea that the daemon is always running, because you don’t know when a small pause like that will happen. The button is worthwhile because it lets you look backwards in time to find bugs that you don’t know how to reproduce yet. A daemon that never stops has to be pretty efficient, though. I’ve run oprofile and haven’t noticed any performance problems, so I’m not worried about that.

How many bugs would this really catch? I don’t know. There’s only one way to find that out. But this project would lower the barrier to finding and fixing hard bugs, which can only be a good thing. It might even make it easier for non-programmers to file useful bug reports, which would be really good. And most importantly, I think it has enough potential to be worth a try.