Wednesday, August 3, 2011

Why Microsoft Can't Write Good Error Messages

One of my pet peeves is that there is so much software out there with bad error messages, and much of it seems to come from Microsoft. To be sure, there are many other companies just as guilty of writing poor diagnostics, but Microsoft is such a big part of all our lives, and it is fun to pick on them. Here's a 20-year-old joke I still love to tell...

There is this helicopter flying towards Seattle Airport, but it is very foggy. Eventually the pilot sees a tall building projecting above the clouds and flies over for a look. He spots someone on the roof and hovers near, opens the window and shouts "WHERE AM I?" The person on the roof shouts back "YOU ARE IN A HELICOPTER!" The pilot immediately takes off west and in a few minutes lands safely at the airport. The passenger looks at the pilot and says "How did you know where to go?" The pilot says "Well, his answer was 100% correct, and 100% useless, so I figured he must work for Microsoft. From there I knew which way Seattle was."

I really hate doing software development in Microsoft-land: Visual Studio, .NET, COM, Microsoft C++, and all that crap. I find I am far more productive using Eclipse, Java, and open source artifacts. What I really hate is that when something is not working, the diagnostic messages are incredibly poor or even nonexistent. One day I was working on a hard problem and could make no headway, so I asked a teammate with more Visual Studio experience for some help. He said to just step through the program with the debugger. I took his advice, and eventually I found the problem because I reached a point in the debugger where an error result appeared that I had never seen emitted before. The point is, the only way to solve the problem was with the debugger, because the error result was not logged or emitted anywhere outside of the debugger.

My teammate told me that when developing Microsoft applications you have to spend a lot of time in the debugger. Everyone does; it's just what you have to do.

This was quite alien to me. I have used debuggers before, but I only used them as a last resort. I prefer to rely on logging messages, because when you are troubleshooting you do not always have a debugger - for example at a customer site.

It finally occurred to me there are two camps of thought on this: one camp, the one I am in, only uses debuggers as a last resort, while the other camp always uses the debugger as a first resort. Here is what happens:
  • When you avoid using the debugger you tend to write a lot of logging messages for diagnostic purposes. When troubleshooting you write even more messages to zero in on the problem, until it becomes clear what the code is actually doing. What you have done is to codify your diagnostic process into the software itself. The more you do this, the more experienced you get at writing better and better messages. When you are really experienced, your messages not only tell you clearly what the problem is, but often how to fix the problem as well. For example, a message that says "can't find configuration file foo" is like that guy standing on the roof of the Microsoft building. On the other hand, a message that says "MyApp.Configurator cannot find the file C:\Program Files\My Application\web\data\foo.xml" is a lot more meaningful. When it comes to writing user-facing error messages I also find that the people in this camp are much better at producing them, because the more diagnostic messages you write, the more logs you read, and the more crappy messages you find, the better you get at writing clear and meaningful messages.
  • When the debugger is your first resort at solving a problem, you step through the code, you think about the problem, you reason stuff out, and eventually you find the solution and move on. All that reasoning and problem-solving wisdom from that moment does not get written down anywhere for anyone else to see or learn from. Even if you have to revisit the same problem months later, you have likely paged out how you figured out the problem in the first place, and have to reinvent the reasoning from scratch. Also, because you are never writing any diagnostic messages, you don't get any good at writing diagnostic messages. When you are forced to write some user-facing diagnostic messages because the requirements mandate it - well, you are still a neophyte moron when it comes to writing diagnostic messages - you are just that guy standing on the roof of the Microsoft building.
To be fair, I reiterate that Microsoft is not the only one guilty of this practice; I have seen this time and time again over the years in Unix, Mac OS, open source software, etc. Also to be fair, when I am working in Java culture, I do notice the diagnostic messages generally are better than those I see in other cultures.
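
To make the contrast between the two messages concrete, here is a minimal Java sketch. The class and method names (ConfigDiagnostics, vagueMessage, specificMessage) are made up for illustration, the relative path web/data/foo.xml is an assumption based on the example above, and the "how to fix it" hint at the end of the message is just one possible wording.

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ConfigDiagnostics {

    // The "guy on the roof" message: correct, but it does not say
    // which component was looking or where it looked.
    static String vagueMessage(String fileName) {
        return "can't find configuration file " + fileName;
    }

    // A more useful message: names the component, the full path that was
    // tried, and a hint about how to fix it (the hint text is an assumption).
    static String specificMessage(String component, Path attempted) {
        return component + " cannot find the file " + attempted.toAbsolutePath()
                + "; check that the file exists and is readable by this process";
    }

    public static void main(String[] args) {
        // Hypothetical location of the configuration file.
        Path config = Paths.get("web", "data", "foo.xml");
        if (!Files.isReadable(config)) {
            System.err.println(vagueMessage("foo"));
            System.err.println(specificMessage("MyApp.Configurator", config));
        }
    }
}
```

Resolving the path to an absolute one before printing it is the important part: it removes any doubt about which directory the program was actually looking in, which is usually the first thing you need to know at a customer site.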

Architecture Astronauts

The first time I heard the term Architecture Astronaut was when someone forwarded me a blog article from Joel on Software, "Don't Let Architecture Astronauts Scare You".

I have always enjoyed Joel Spolsky's articles and interesting insight on things, so this one was particularly interesting because various people accuse me of being an Architecture Astronaut, and I wanted to find out what that meant.

I can certainly appreciate his warnings about too much abstraction and being too far removed from the problem, especially if you don't actually write any code. The funny thing is, I do consider myself an Architecture Astronaut, but I do write a lot of code, especially lately. Joel has a lot of good points, but he fails to address the problem of "Code Monkeys" who have no appreciation for design, let alone architecture - these are the ones who scare me the most. They create a lot of bad code, bad APIs, terrible diagnostics (or no diagnostics), and bad documentation (or no documentation).

In his article "Silos and Architecture Astronauts", Patrick Dubroy makes a good point about the balance between working code and perfect code. It is a very good point, but in my experience, code and solutions are increasingly copied. Someone trying to solve a problem looks through the code base for a similar solution, and does a lot of copy and paste. If what they found was bad code, then they are propagating even more bad code around. Even worse, if what they found was a bad solution, then they are propagating more bad solutions, and junior software developers come to think these are normal and acceptable solutions.

There is an old saying: "the hurrier I go, the behinder I get." I have often found myself spending hours, days, even weeks, refactoring terrible code that was insanely incomprehensible and unmaintainable. In almost every case, the new code I leave behind is less code, sometimes significantly less code. After these exercises, when I look back I always ask "what the fuck were they thinking?" The problem is obvious: they were in a hurry to just get things working, and left it for everyone else to get further and further behind just trying to maintain their crappy code.

A few years ago we got a new team member and she started working on some legacy code. After a couple of weeks she phoned me to say "I feel so stupid, I cannot understand this code." I had to reassure her: "you are not stupid, it truly is bad code. When I first started I felt exactly as you do, I felt so stupid, like I was missing some important methodology or design practice." It did not help much that one of our junior software developers (not her) kept gleefully propagating more and more of this bad code, bad design, and bad solutions throughout our code base faster than I could repair the damage.

So what am I going to talk about in this blog? I am going to talk about computers, computer technology, software and programming; design and architecture; attitudes, practices and methodology. I am not always going to be polite or politically correct - this blog is for adults.