Sunday, December 30, 2007

Yet Another New Year's Resolution

Because this is the time of year when people make resolutions about... ahem... losing weight, and because I recently read in a blog how various programming languages are inherently bloated, I decided to write down my own opinions on fat code.

In a post back in June, I reported that the ZeroBUGS project had 135,601 lines of code, of which 120,815 (89.10%) were C++. Tonight, having just finished a big round of changes, the project totals 142,992 lines of code, 126,779 (88.66%) of them C++.

So one could argue that my project got inflated by roughly 6k lines of (C++) code. And what do I have to show for six months of fattening up?

Bug fixes aside: I have ported the code to the PowerPC platform, added support for visualizing wide strings and Qt strings, added a feature that allows debug events to be ignored on a per-thread basis, and (hot from the oven and about to be released) added support for the D programming language's associative arrays. All in only six thousand lines of code. Not too bad, but that's just my opinion.

What, in general, are the factors that cause source code to bloat?

System Refactoring
Back when I was working for Amazon.com, we had to rewrite the ordering system, one of Amazon's many software components, the one in charge of all the magic that happens from the time you click Proceed To Checkout until your items are shipped. When I joined the team tasked with the rewrite, we had a subsystem consisting mainly of a few tens (or maybe hundreds?) of C functions.

This thing was not very flexible, and it was already coming apart at the seams whenever the business people requested new functionality. It also had static dependencies on almost everything else in the system. The decision was to replace it with a middle-tier service with clean APIs.

My boss at the time (an ex-Bell Labs guy) had a good plan (sketched in code after the list):
  • design a set of object-oriented, abstract interfaces;
  • implement them in terms of the legacy C code;
  • then rewrite all client code to use this new C++ API;
  • and finally, once there is no more coupling to the C implementation, change the implementation, one small piece at a time.
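
To make the first two steps concrete, here is a minimal sketch of the idea, in the C++98 style of the day; the legacy functions and the class names are invented for illustration, not the actual ordering code:

    #include <string>

    // Hypothetical legacy C API (names invented for illustration).
    extern "C" int legacy_create_order(const char* customer);
    extern "C" int legacy_add_item(int order_id, const char* sku, int qty);

    // Step one: an abstract, object-oriented interface.
    class Order
    {
    public:
        virtual ~Order() { }
        virtual void add_item(const std::string& sku, int qty) = 0;
    };

    // Step two: implement it in terms of the legacy C code. Clients
    // program against Order; later the implementation can be swapped
    // out one small piece at a time without touching any client.
    class LegacyOrder : public Order
    {
        int id_;

    public:
        explicit LegacyOrder(const std::string& customer)
            : id_(legacy_create_order(customer.c_str()))
        { }

        virtual void add_item(const std::string& sku, int qty)
        {
            legacy_add_item(id_, sku.c_str(), qty);
        }
    };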
And this is what we did. I think it was a successful project, though with a few wrinkles:
  • the migration effort took a couple of years to complete; meanwhile, the old system co-existed with the new one and was being actively changed and maintained;
  • do not forget the people factor: some of the middle managers (I hear they have since been promoted, in accordance with the Peter Principle) had personal political agendas that caused the project to take longer than necessary.

The overall effect was that we in fact had two parallel systems: the legacy one and the "new" one.

The problem is that by the time you are ready to throw away the legacy system, the new system is already old enough to be called "legacy" itself.

Work on yet another system designed to replace the "new legacy" may start before the "old legacy" is completely retired. So the company may end up with three or more systems being maintained in parallel (sure, the plan is to eventually phase out the legacy ones, but that may not happen as soon as we wish). And here is one place where bloat and its first cousin, needless redundancy, thrive: when code bases are unnaturally kept alive. Two systems in parallel, an old one and its replacement, are fine. Three or more systems trying to solve the same problem are not a Good Thing. And just in case you did not catch it: "needless redundancy" is itself a needlessly redundant association of words.

Supporting multiple platforms
Another reason for code to grow in size is portability. In order to make your code portable, you need abstractions and indirections. I started ZeroBUGS in late 2003 because I wanted to best GDB.

My first debugger prototype was less than three thousand lines of code, and I was quite enthused: the code was stable as a rock, but that is pretty much all I can say about it. It did not work with multiple threads; it did not load core dumps. Support for STAB was sketchy, and there was no support for DWARF at all. There was no expression interpreter. And GUI? What GUI?

It took another four years or so to add all these features. Maybe a third to a half of that time was spent making said features portable, and I do not even mean across OSes or CPU architectures. You see, when I started writing the GUI I went with Gtk-1.2 and the corresponding C++ wrapper, Gtk--. By the time I was done, the world had already moved to Gtk-2.x, and the Gtkmm C++ wrapper was a standard package in most distributions. I had to write an adaptation layer, not unlike the ordering system adaptation API back at Amazon. That bloated my source code, but today I can compile against either Gtkmm or Gtk-1.2 (and 95% of the details are transparent to my client GUI code).
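
Just to illustrate the shape of such a layer (the interface, header names, and configuration macro below are hypothetical, not the actual ZeroBUGS code):

    #include <string>

    // Client GUI code programs against this abstract interface and
    // never names the toolkit version directly.
    class ButtonAdapter
    {
    public:
        virtual ~ButtonAdapter() { }
        virtual void set_label(const std::string& text) = 0;
    };

    // One concrete implementation per toolkit, selected at build time.
    #if HAVE_GTKMM_2                   // hypothetical config macro
     #include "gtkmm2/button_impl.h"   // implements ButtonAdapter on Gtkmm
    #else
     #include "gtk12/button_impl.h"    // implements it on Gtk-- (Gtk-1.2)
    #endif

The point is not this specific widget, of course; multiply it by every widget and signal the GUI uses, and you can see where the extra lines of code come from.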

But Gtk-1.2 may no longer be relevant, some people may say. And I think they are right. But let's look at Professor Tanenbaum's MINIX for a second, shall we? Not only because it is a lean and robust system (easy when you do not have many features; multi-threading, anyone?) but because if you read the source code you notice a strange thing: a lot of macros are dedicated to ensuring compatibility with Kernighan & Ritchie C compilers. In the 3rd edition of the book, published in 2006. What the heck?
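
If memory serves, the trick looks something like the macro below (modeled on MINIX's _PROTOTYPE; the exact names are from memory, so treat this as a sketch):

    /* A K&R compiler does not understand ANSI prototypes, so every
     * declaration goes through a macro that strips the parameter list
     * when the compiler cannot handle it.
     */
    #if ANSI_COMPILER                     /* hypothetical config macro */
     #define PROTOTYPE(function, params)  function params
    #else
     #define PROTOTYPE(function, params)  function()
    #endif

    /* Expands to "int do_fork(int parent, int child);" under ANSI,
     * and to "int do_fork();" for a K&R compiler.
     */
    PROTOTYPE( int do_fork, (int parent, int child) );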

I guess the lesson here is that writing for portability is fine, but keep an eye on things that may become obsolete sooner than you think. The code that deals with one particular OS, compiler, etc. will then turn into dead weight.

So my 2008 resolution is to get ZeroBUGS on the treadmill.
Happy New Year!