Wednesday, November 16, 2005

Source Code Archaeology

WARNING: This blog entry was imported from my old blog on (which used different blogging software), so formatting and links may not be correct.

As should be obvious from my recurring blog topic of
coding style, I care a lot about
code readability and style.

One criticism I've heard frequently is that you shouldn't reformat code to be style-compliant, since that
introduces lots of diffs into the version history of a file. Oh no, diffs! What horror! Think of the children! Somebody, The children!

I have two responses to that line of thinking:

  • First, you clearly don't agree with me on the importance of consistent formatting and readability. Perhaps
    I should just grow up, but when I see code like this I cringe:

    void myfunc(int a_foo, int a_bar)
    foo = a_foo;
    myfunc (foo);

    Source formatting typically doesn't rename variables (to wipe out the underscores ( _ ) from variable names) but
    it certainly could put the opening brace back up where it belongs, as well as remove the space between the method name and the opening parenthesis, etc.

    Reading the above code is the common operation. Tracking down line history is the occasional operation.

  • Second, and this is the more important point: You're probably not familiar with, or at least well versed in,
    source code archaeology.

Source code archaeology is very similar to the archaeology you're familiar with: you dig down, layer after layer, uncovering artifacts from a particular time period. You essentially get to see the state of affairs at one particular point in time. Then historical events occur and layers upon layers are deposited on top.

With source code archaeology, you can dig back into the history of a file. Let's pretend we agreed on a particular coding style for the NetBeans source code base, and I went and ran a filter over the entire code base, reformatted it, and checked the code in with the comment "Source code reformatted to spec using codestyle ide/formatsettings.xml". Reformatting the entire source base (yes, I know, blasphemy!) is usually criticized as bad because you can't find out why a particular line was added or changed, since now the version information for the line points to the reformatting delta. But but but - remember the layers! (Insert mandatory Shrek reference here...) It's easy to go to the layer below the reformat, and see what the version of the file looked like prior to the reformatting delta. So, once you've done a giant reformat of the source base, looking for the origin of a particular source file line may involve multiple operations. First you determine what delta corresponds to the current version of the line. If you see that it was for a reformat, you peek at the version below it, locate the same line, and look at its original comment.

Now, you might argue that that's an extra step. You would be right - but think about it, what's more common, reading code, or reading the checkin comment for a particular source line? As long as you can get the team to use the same source formatting conventions (perhaps even performed or enforced automatically at checkin) you'll only need to do this global operation once.

Tools should make this job easy. It's not all that hard on the command line either. You may have heard of the "cvs blame" command.

tor:1% cvs blame
Unknown command: `blame'

CVS commands are:
add Add a new file/directory to the repository

Ok, so that's not really the command, but it's the common name for the cvs annotate method! It lets you figure out who to blame for the ridiculously stupid bug you just found in the code. The real name is cvs annotate.

With cvs annotate you can get annotations for the current version. Let's say that you discovered that a reformat happened in revision 1.43 of file To see what the file looked like before that, simply do

tor:2% cvs annotate -r 1.42

1.1 (tor 24-Mar-03): /*
1.1 (tor 24-Mar-03): * Copyright 2003 Sun Microsystems, Inc. All rights reserved.
1.1 (tor 24-Mar-03): * SUN PROPRIETARY/CONFIDENTIAL. Use is subject to license terms.
1.1 (tor 24-Mar-03): */
1.1 (tor 24-Mar-03):
1.1 (tor 24-Mar-03): package;
1.1 (tor 24-Mar-03):
1.1 (tor 24-Mar-03): import java.awt.geom.Rectangle2D;

As I mentioned, tools should make this easy. I'm not sure what the current state of this is in NetBeans' new CVS support, or in other IDEs. I had a strong need for this back when I worked with SCCS and Teamware on Sun's C++ development environment. I built a prototype for doing this kind of exploration; I think it's still relevant today, seven years later, so I'll include a couple of screenshots of it. Consider this a wishlist for $favorite_ide, in my case NetBeans.

The idea is that you get to open your source file in a version-annotated view, where all the source lines are colored according to some source code metrics. For example, really old lines have a dark background, and recent lines have a bright red background. (To show which lines belong to the same delta, look for the < and > signs in the left margin showing contiguous blocks, since code about the same age can have roughly the same color but should be logically distinguishable.

The point is that you want to be able to click on a line to be able to see its checkin comment - and to quickly cycle through previous versions of the file (and diff between versions).

Another possible coloration assigns a different color to each user who has touched the file. The overview gives you an indication of who's most responsible for this source file.

There are some other minor things here too, but I think I've made my point: Source Code Archaeology - it's important. It needs better tools support. But it's possible today, and should help you make the transition to a properly formatted source code base.


  1. Move that curly up, and I'll reformat the code to put it again down where it Belongs..=)
    Seriously speaking, what about a coder having differentn coding styles?

  2. I have to agree with Ken Arnold. The basic idea he puts forward is that we need to consider whitespace and style as part of the language design rather than leaving that up for each person to decide. This would insure that all code by all team members is similar ( at least in regards to style ). A modern IDE should have little trouble setting and helping to format any code that requires such, we are already almost there as it is now....

  3. > ... think about it, what's more common, reading code, or reading the checkin comment for a particular source line?
    This really doesn’t have much to do with what's more common, or with CVS in particular. It has to do with the software lifecycle; Software quality assurance, configuration management and in the end risk management. Presumably if you are fixing someone else’s code, it’s baselined code. There are more reasons not to reformat than to reformat it.
    It is often the case when tracking defects and compiling SQA metrics that several metrics are tied to the number of lines of code changed between updates/releases/baselines. Software that counts the changes to compile the metric doesn’t know that you changed an entire file just to reformat it, and counts every changed line against the software team’s metrics.
    A change control board wanting to review code changes prior to accepting a change to a baseline can’t be expected to do the archeology to figure out what really changed. Automated tools document what’s changed between the baseline and the proposed change, but not if there are intermediate steps where the entire file has been reformatted.
    Unless you can prove that the reformatting module/program has zero defects and therefore no chance whatsoever of breaking code that works, you run the risk of introducing bugs along with the intended fix.
    CVS in particular has a very nasty problem with bulk code reformatting in a distributed team environment, namely conflicts. You can never be certain that someone else isn’t fixing something else in the same file unless you are going to force locking ala SCCS. If you commit a reformat, you will force the other person to waste time resolving white space conflicts.
    I am very pedantic about coding style, but I have also been coding long enough to have gained quite a bit of flexibility in reading other people’s code.

  4. Ug. Paragraph breaks were there when I previewed, but gone when I posted. Sorry, hope it's still readable.