What Just Happened?

Sometimes, my computer freezes for a moment. It might be after I pressed a button or typed some text. It might not have an obvious cause – just a background daemon doing something at its scheduled time. It’s not usually a problem – the system goes back to normal after a few seconds. However, it’s annoying, and it takes away my ability to use my computer as a tool for those few seconds. It’s a tiny usability bug.

However, this bug is harder to fix than most. For starters, there’s not always a clear reason for the freeze. If I just clicked on something, I might assume that the program that received the click is doing some processing. But is it really, or is it doing blocking I/O and waiting for the kernel to load a fragmented file from disk? Or is it trying to make a network connection and waiting for my (sometimes slow) Internet connection to return? If there’s no obvious cause, the problem is even worse.

This post is part of a new idea I had: blogging about things I wish I could do but probably never will, because I don’t have that much time. In the ideal case, someone else would take an idea and do it, but at the very least I’ll have the ideas down in bits so I can look back years later and say, “yeah, that was a good idea. Too bad I was busy.”

Today’s idea is a way to debug problems like the ones above. Like any nice open source software developer, I want to be able to either fix programs that are causing problems or send in good bug reports. But with strange bugs like the above, that’s hard. First of all, I don’t really know what program is causing the problem, although in some cases I can make educated guesses. But even if I did know which program was the problem, how should I tell someone to reproduce it? “Run the program for 2 hours or so, then click this button and see if you get a pause. If not, keep clicking.” I think the best way to get information on something like this is to be profiling the code at the time the slowdown happens.

Therefore, the way to find and catch all of the small, subtle performance bugs like that is to be profiling all the time. What I want is a daemon that sits on my system just running oprofile.

To keep its memory usage down, it only needs to keep logs for the last, say, 30 seconds of my computer usage. But when I hit a pause, I want to be able to press a button to save those last 30 seconds of logs to a file somewhere.
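Here is a minimal sketch of that rolling log in Python, to make the idea concrete. Everything in it is hypothetical: the class name RecentEventLog, its methods, and the plain-text output format are made up for illustration, and it records arbitrary strings rather than real oprofile samples. It only shows the "keep the last 30 seconds, save on demand" part.

```python
import collections
import time


class RecentEventLog:
    """Hypothetical rolling log: keeps only events from the last `window_seconds`."""

    def __init__(self, window_seconds=30.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the log can be exercised without waiting
        self._events = collections.deque()  # (timestamp, description), oldest first

    def record(self, description):
        """Append one event and drop anything that has aged out of the window."""
        now = self.clock()
        self._events.append((now, description))
        self._discard_old(now)

    def _discard_old(self, now):
        cutoff = now - self.window_seconds
        while self._events and self._events[0][0] < cutoff:
            self._events.popleft()

    def snapshot(self):
        """Return the events still inside the window, oldest first."""
        self._discard_old(self.clock())
        return list(self._events)

    def dump(self, path):
        """The 'button': write the current window to a text file, one event per line."""
        events = self.snapshot()
        with open(path, "w") as f:
            for timestamp, description in events:
                f.write(f"{timestamp:.3f} {description}\n")
        return len(events)
```

In a real daemon, record() would presumably be fed by the profiler and dump() would be bound to a hotkey. Memory use stays bounded by the event rate times the window length, since old entries are dropped on every write.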

Basically, what I want is a “what just happened?” button for my computer. It should include profiling information and enough information about input events that I know what might plausibly have caused the pause. It should also know something about inter-process communication and networking, for the same reason. I want to be able to get all of this output in text form, so I can send it as an email to the mailing list of my favorite open source project, or in graph form, if I just want to visualize my own system.

Once you think of it as a “what just happened?” button, other ideas pop into your head. For instance, what was the top function on the call stack in every program during the times we’re logging? I’m not sure if we can even record that, but it would be interesting information.

It is a key feature of the idea that the daemon is always running, because you don’t know when a small pause like that will happen. The button is worthwhile because it lets you look backwards in time to find bugs that you don’t know how to reproduce yet. Because the daemon runs all the time, it would have to be pretty efficient. I’ve run oprofile and haven’t noticed any performance problems, so I’m not worried about that.

How many bugs would this really catch? I don’t know. There’s only one way to find that out. But this project would lower the barrier to finding and fixing hard bugs, which can only be a good thing. It might even make it easier for non-programmers to file useful bug reports, which would be really good. And most importantly, I think it has enough potential to be worth a try.


16 Comments on “What Just Happened?”

  1. ASDFGuy says:

    If your computer (especially the mouse) freezes for seconds at a time out of the blue, I would say you have an issue with your hard disk….

    This is 95% likely.

  2. thargol says:

    SystemTap is your friend here. Or possibly DTrace, depending on which OS you’re using.

  3. Michael Gebis says:

    I agree with the main point of your post: It really would be cool to have the tool you envision.

    But in the meantime, you should look at the tool “LatencyMon”, which might give you insight into the system pauses. (It really will only help you if the pauses are related to long device driver ISR or DPC times, but in that case, it will tell you which one is the culprit.) Good luck.

    http://www.resplendence.com/latencymon

  4. brucedawson says:

    I rely on continuous profiling on Windows. I leave xperf recording profiles to a circular memory buffer at all times on all of my Windows machines. It is amazing and I am able to diagnose one-off performance problems most of the time by retroactively saving the buffers to disk.

    http://randomascii.wordpress.com/category/xperf/

    If there is a similar setup on Linux then I would like to use it.

    Note that the tricky part of this problem is analyzing the trace. The causes can be extremely difficult to disentangle and there is no substitute for an expert analyzing the trace.

    • noahlavine says:

      That is interesting. I didn’t think much about the problem of analyzing the traces, but if it’s too difficult to do then only a few people could use such a tool anyway. (Or you could send your trace to a service that analyzed it for you, as someone suggested on Reddit.)

      • brucedawson says:

        As with crash analysis, not everybody needs to be an expert. A centralized system for uploading traces for the community to look at is nice, but the main thing is to have a way of recording sufficiently rich traces. I find and investigate perf-problems in all of the software that I use. I then either fix the bug or report the problem. If the team I report the bug to fixes it then many people benefit.

        Microsoft has a rich perf-analysis system that, if people have opted in, automatically uploads perf-reports for millions of people and perf-traces for thousands. Ubuntu could do something similar.

    • noahlavine says:

      Thanks for the pointer! I read your most recent article; it was very interesting.

      It appears that SystemTap may be that setup, but I haven’t learned much about it yet.

  5. starwed says:

    Mozilla is battling exactly this problem in Firefox. With telemetry enabled, Firefox will even report back to Mozilla when a pause above a certain threshold occurs. You can read more about it on Taras Glek’s blog.

    (Here’s a recent post: https://blog.mozilla.org/tglek/2012/09/18/snappy-in-warsaw-pierogy-fueled-hackfest/)

  6. Michael says:

    On Windows you get this for free via the tracing infrastructure built into the kernel, see for example this blog post for someone using it to debug some slowdown: http://randomascii.wordpress.com/2012/09/04/windows-slowdown-investigated-and-identified/

  7. Have you tried upgrading to an SSD yet? I used to see that a lot, now all gone after upgrading. I think it was just page faults when my older hard drive was already busy. E.g., the system goes to run part of a program, doesn’t have that part in memory, goes to retrieve it from disk, and the disk is already working on a bunch of other stuff and is a super slow spinning magnetic platter with a large seek time, etc.

    That said, we do have something called strict mode in Android development where doing networking and the like on the same thread as the UI is running will flat out crash your program. Then you can go examine the stack trace. So that’s another approach.

    • noahlavine says:

      No, I haven’t tried that. It should be possible to make the system responsive even with a hard drive, but then I guess you get into performance trade-offs – should you keep your GUI libraries/programs always in memory, even if some other program would be faster if it could use more memory? I think you could make this work well with good policy, but it is hard to do.

    • brucedawson says:

      Oddly enough the last slowdown that I investigated that was caused by slow disk I/O was from a customer with an SSD. A single read took about ten seconds to be satisfied. The read got stuck behind a lot of writes, and I suspect those got slowed down because there weren’t any clear pages. I suspect either an overly full disk or a bad driver. It’s sad when an SSD actually fails to help.

      More memory would have helped in this case, by avoiding much of the disk I/O pressure.

