My SCM Philosophy (take 2)


I started to congeal some of my thoughts about source control into a
more concrete form. This is take 2; it's still very much a work in
progress.

============================

This paper outlines some facts about source control usage, and goes on
to describe some of the key problems I see with existing systems.

** Fact Review 

1) A set of source revisions plot a linear path from the first
revision to the last revision.

2) A branch is a location in this linear path where there are two
revisions derived from the same earlier revision. This can be thought
of as two linear paths from start to finish which share some points
along the way.

3) Merging (otherwise known as 'the dreaded merge') occurs when you
want to merge two or more of these paths back into a single path. The
only way to avoid merging is to prevent two changes from being made to
the same file at the same time, that is, to prevent two separate paths
from being followed (sometimes called strict locking).

4) Losing data is bad. 
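
To make the first three facts concrete, here is a minimal sketch in
Python. It is only an illustration: the revision ids and the function
names are hypothetical and do not come from any real tool. Each
revision records its parent revisions, a branch shows up as a revision
with more than one child, and a merge as a revision with more than one
parent.

```python
from collections import defaultdict

# Hypothetical revision graph: each revision id maps to its parent ids.
PARENTS = {
    "r1": [],
    "r2": ["r1"],
    "r3a": ["r2"],         # r3a and r3b both derive from r2: a branch
    "r3b": ["r2"],
    "r4": ["r3a", "r3b"],  # r4 has two parents: a merge
}


def branch_points(parents):
    """Return the revisions that have more than one child."""
    children = defaultdict(list)
    for rev, parent_ids in parents.items():
        for parent in parent_ids:
            children[parent].append(rev)
    return sorted(rev for rev, kids in children.items() if len(kids) > 1)


def merge_points(parents):
    """Return the revisions that have more than one parent."""
    return sorted(rev for rev, parent_ids in parents.items() if len(parent_ids) > 1)


def paths_to(parents, rev):
    """Enumerate every linear path from a first revision to `rev`."""
    if not parents[rev]:
        return [[rev]]
    return [
        path + [rev]
        for parent in parents[rev]
        for path in paths_to(parents, parent)
    ]
```

Calling paths_to(PARENTS, "r4") yields the two linear paths which share
r1 and r2 along the way, which is the picture fact 2 describes.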

** Where are all the SCM systems?

There is much confusion surrounding Source Control, Source
Configuration Management, and Software Configuration Management. Most
products claim to be in one of those categories but have some features
from each of them.

I'd like to start with my concept of what the ideal Software
Configuration Management system should do. Keep in mind that this is
an ideal, and as such, no system will achieve it. However, I feel it
most accurately describes the angle from which I approach this
tricky problem.

Software Configuration Management is the ability to store and later
reproduce the state of a software system. Think of the best automated
build system you have ever come in contact with. It probably had the
ability to tag or label all the software which was built for that
release. Hopefully the process of reproducing that build later was
fairly automated. However, how much of the underlying system did it
pay attention to? Did it record the version of the compiler you were
using when the build occurred? Did it record the versions of all the
dependent libraries used during the build? How about the version of
the operating system being used, or versions of minor (but still
significant) tools like shells and scripts used?

In a true Software Configuration Management system, I argue that I
should be able to (1) buy a brand new machine from a local retailer,
(2) go through some simple setup process to install my SCM agent on
it, and (3) have the SCM system install all necessary software to
recreate any build or state which occurred during our development
process. This includes installing operating systems, tools, scripts,
and all dependent files.
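
As a rough sketch of what 'recording the underlying system' could look
like, the Python below stores a checkpoint containing both source
revisions and facts about the build machine. Everything here is an
assumption made for illustration: the function names, the shape of the
checkpoint, and the idea that tool versions are handed in as a simple
name-to-version mapping. A real system would also have to be able to
install what is missing, which this sketch does not attempt.

```python
import platform


def capture_environment(tool_versions=None):
    """Collect facts about the build machine that a checkpoint could store."""
    return {
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        # Hypothetical: the caller supplies compiler, library, shell versions.
        "tools": dict(tool_versions or {}),
    }


def make_checkpoint(label, source_revisions, tool_versions=None):
    """Record source revisions together with the environment they built in."""
    return {
        "label": label,
        "sources": dict(source_revisions),
        "environment": capture_environment(tool_versions),
    }


def missing_tools(checkpoint, installed):
    """Return the recorded tools that a machine lacks or has at another version."""
    wanted = checkpoint["environment"]["tools"]
    return {
        name: version
        for name, version in wanted.items()
        if installed.get(name) != version
    }
```

On a brand new machine, missing_tools(checkpoint, {}) would return every
recorded tool, which is the list an SCM agent would need to install
before the build could be recreated.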

Of course there is the question of granularity. Should it be capable
of reproducing _any_ state? Certainly not. Even if you save the entire
state of every machine every minute, there are still states in between
those minutes which would not be saved. There are many reasons for
deciding how often state should be saved. If we call the operation of
saving state 'checkpointing', I'd like to make the statement that we
should be able to get back to any checkpoint made. 

Clearly few tools have even _attempted_ to achieve this goal, and I
don't know of any integrated tool which has.  In the real world,
products are shooting for _Source_ configuration management: they
are concerned with recreating the consistent state of a specific
piece of source code. The dependencies involved in turning that
source code back into the shippable release binary are left 'as an
exercise to the reader'. 

** Source Configuration Management

Now that we've established some concepts for Software Configuration
Management, let's apply them to Source Configuration Management. We
should be shooting for the same goals as I explained above, but they
should be confined to versions of your source code. In short, a Source
Configuration Management system should be capable of recreating the
state of the source code as of a 'checkpoint'. Logical places which
checkpoints should occur include "file checkins" and "builds". We'll
consider the simpler case where 'checkpoints' only occur during a
checkin, simply because it better relates to most available source
configuration management systems.

ASIDE: It's significant to remember that when a user is working on his
files and building his software repeatedly, there is significant
utility in each build being a checkpoint (i.e. someplace he can go
back to) even if he's not ready to 'check in' those changes in the
traditional sense. However, we'll save this concept for later.

Let's first look at two features which are required in order to
achieve our goals:

1) Trackability of _all_ changes (and as a result, undo-redo-ability
   of all changes)

2) Sequencability of all changes in the context they were applied in.

These are two simple points, yet most existing SCM systems violate
them both in some way or another. Keep in mind also that SCM products
are just tools. It's the entire development, build, release system
which comprises the entire SCM system.

* Trackability

Trackability is the easier of these two features to understand and
provide. Trackability is simply the ability to record all changes
which are made. In the real world, this really means recording all
changes which users want to record. As we discussed above when talking
about checkpointing, there are always changes which go unrecorded
because the user decided not to save the file, or perhaps the system
decided not to save for him by crashing. The best we can do is assure
that every checkpoint is recorded.

Trackability is necessary because without trackability it's not
possible to make sure you can get back to a state you had at one
point. 

I know this all sounds somewhat self-evident. Of course you have to
record all changes to restore those changes. However, I hope the
importance of this will become clear in the next paragraph.

The most popular feature which breaks trackability is 'non-versioned
movable labels'. Many people will argue that movable labels are fine,
because they allow you to try a few times to get the label where you
want it. However, the truth is that you can try a few times to get
your software the way you need it whether or not labels are
tracked/versioned. However, if they are not versioned, not only can
you not guarantee you can get back to a particular 'temporary point'
in the development process, but someone can later move a label (even
accidentally) and remove your ability to get somewhere which was
critical for your company's survival.

Does this mean non-versioned labels are useless? Not exactly. However,
it does mean that they are not reliable. Information tracked with
these sloppy labels should be considered disposable. Build systems and
versioning systems which rely on non-versioned movable labels are poor
SCM systems.
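
To illustrate the difference, here is a small Python sketch of a
versioned movable label. The class and method names are hypothetical. A
non-versioned label behaves like a plain dictionary entry, where moving
the label overwrites the old target; the versioned form below appends
to a history instead, so every place the label has ever pointed stays
reachable.

```python
class VersionedLabels:
    """Movable labels whose every move is recorded rather than overwritten."""

    def __init__(self):
        self._history = {}  # label name -> list of revisions, oldest first

    def move(self, name, revision):
        """Point the label at `revision`; return the label's new version number."""
        self._history.setdefault(name, []).append(revision)
        return len(self._history[name])

    def current(self, name):
        """Return where the label points now."""
        return self._history[name][-1]

    def at_version(self, name, version):
        """Return where the label pointed at a given (1-based) label version."""
        return self._history[name][version - 1]


labels = VersionedLabels()
labels.move("release", "rev-41")
labels.move("release", "rev-45")  # someone moves the label, perhaps by accident
```

After the second move, labels.current("release") is "rev-45", but
labels.at_version("release", 1) still answers "rev-41". With a plain
dictionary the first target would simply be gone.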

* Sequencability

Creating sequencability of all changes is something done so
infrequently that I don't believe most people in the SCM industry even
know what it is. Think of your favorite source control system. Imagine
the process of checking in a change you've made. Does that change get
applied in the context of the local files you were using it with, or
does it get applied in the context of the 'latest and greatest' in the
source control tree? To better understand this question, consider this
scenario:

  (1) You check out the latest project files from your team's repository. 
  (2) You work for a week making changes to your local files. 
  (3) Because you want trackability, you check in your changes periodically 
      as the week moves along. However, you have not been pulling down 
      the 'latest and greatest' every day because frankly, you don't 
      want someone else's broken stuff to interfere with your work. 
  (4) You reach the end of the week, at which time you update all your 
      files to the 'latest and greatest'. 

Now, let's return to my original question. Did those changes you
checked in during the week get applied in the context of the files
they were being built against (your local files) or did they get
applied in the context of the 'latest and greatest'? For nearly every
system in active use today (RCS, CVS, Perforce, PVCS, StarTeam,
SourceSafe ...and the list goes on), the answer is 'they get applied
in the context of the latest and greatest'. Are you beginning to
understand the problem?  What happens when you want to get back to
something you were doing in the middle of the week? If there is no
record of what local file versions you were working with while these
changes were committed, it is impossible for the system to get you
back there.
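
Here is a minimal sketch, in Python, of what recording that context
could look like. It is an illustration under simplifying assumptions,
not a description of how any existing tool works: file versions are
plain integers, and the names Checkin, Workspace and changed_between
are made up for the example. Each checkin stores the versions of every
local file at that moment, so the mid-week state can be looked up
later and compared against the 'latest and greatest'.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkin:
    path: str
    new_version: int
    context: dict  # version of every local file when the checkin was made


class Workspace:
    """A developer's local files, tracked as path -> version number."""

    def __init__(self, versions):
        self.versions = dict(versions)
        self.log = []

    def checkin(self, path):
        """Record a change to `path` together with its local context."""
        self.versions[path] = self.versions.get(path, 0) + 1
        entry = Checkin(path, self.versions[path], dict(self.versions))
        self.log.append(entry)
        return entry


def changed_between(before, after):
    """List files whose version differs between two contexts."""
    paths = sorted(set(before) | set(after))
    return {
        path: (before.get(path), after.get(path))
        for path in paths
        if before.get(path) != after.get(path)
    }


ws = Workspace({"a.c": 3, "b.c": 7})
midweek = ws.checkin("a.c")
head = {"a.c": 4, "b.c": 9, "c.c": 1}  # hypothetical 'latest and greatest'
```

Here changed_between(midweek.context, head) reports that b.c moved from
7 to 9 and that c.c is new, which is exactly the list of "other" files
to suspect when the end-of-week update breaks something.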


  ASIDE: The most common argument I hear against this idea is 'why do
         you need to get back there?' It should be pretty apparent that
         an argument like that isn't an argument at all, but rather some
         kind of diversion tactic to help you think it's not important.
         However, there are many reasons you may want to get back there.
         The most common one, and one which I'll bet most people reading 
         this have come across at least once, if not often, is when the
         software breaks after you perform that "update" in step #4 at the
         end of the week. You sit there hitting your head against the wall
         that you didn't save your local files. What you need at this point
         is a list of all the "other" files (i.e. dependent files) which
         changed between 'a minute ago' and after the update, so you can
         narrow down what caused the problem. However, if the source 
         control system does not remember your local file versions,
         and sequence your changes according to your local file versions,
         then there is simply no way to return to that point after the fact.

However, there is another evil which was lurking all week that nobody
warned you about either. If anyone else on your project checked out
the project files in the middle of the week, he gets a jumble of
changes which are known to work in the context of each developer's
local build, but which have never been tested or built against the
latest head revisions, because nobody else has pulled the latest head
revisions yet.

  ASIDE: Another evil workaround exists in existing systems to
         'solve' this problem. Have you ever heard of the source control
         policy "don't check in changes until they work", or "don't
         break the tree"? Every time I hear one of these statements, my 
         brain translates it into this: "Our SCM does not provide 
         sequencability. As a result, if you check in changes which don't
         work, you will affect other people's ability to work. We have
         decided that it's more important to keep the group working than
         to allow trackability of individual developer changes." To which
         the other part of my brain responds, "I need trackability of
         my changes, that's the whole reason I want a SCM system!".

         How many times have you felt like you wanted to have your own little
         private RCS tree to keep track of local changes which were not
         ready to be shown to others, but which are too complicated to keep
         untracked? How many times have you created numbered save files to
         track your work privately, because you were not ready to check 
         it into the public tree?

If we look at the cycle described above, we see that the software went
through a cycle of stability. In the beginning of the week you could
check out the tree and it would build correctly. By the middle of the
week, everyone's un-sequenced changes have crept into the tree and
made it unstable. By the end of the week, everyone has updated to the
latest and greatest, and presumably they have fixed any problems their
new code had with other new code.

These observations are all well and good, but now I'd like to talk
about how I think this process _should_ work. First some simple facts:

1) It's much easier to keep working software working than it is to fix
   broken software. As a result, developers can't efficiently make
   changes to a moving (and potentially broken) source-base. They need
   a stable platform (i.e. a control group) where their changes
   (i.e. an experiment) are the only changes which are responsible for
   causing changes and problems.

2) The process of 'fixing' the head of the tree cannot be removed; it
   must occur when multiple people are working on software simultaneously.

3) The 'dirty tree' scenario described above loses information. The state
   of local files which a developer had at any point during the week is lost.

I propose that instead of using this sloppy 'dirty tree' method, we
should be sequencing all changes correctly. When a developer checks
in, he should be checking into the context of his local files. The
state of his tree as of a 'checkpoint' should be reproducible at a
later date. When he's ready to bring his work into the live tree, he
can deal with point #2 above where he must bring in all the new 'latest
and greatest' changes. Only by making his code work with the 'latest
and greatest' can he make his code BE the 'latest and greatest'.
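
The proposal can be sketched in a few lines of Python. This is a toy
model under stated assumptions, not a design for a real system: a tree
is a mapping of file name to contents, the live tree carries a single
revision counter, and the class and method names (LiveTree,
PrivateLine, checkpoint, integrate, promote) are hypothetical. Each
developer checkpoints into his own context, and may only promote his
work once it has been made to work against the current live tree.

```python
class LiveTree:
    """The shared 'latest and greatest'."""

    def __init__(self, files):
        self.revision = 1
        self.files = dict(files)


class PrivateLine:
    """One developer's sequence of checkpoints, starting from a live revision."""

    def __init__(self, live):
        self.base_revision = live.revision
        self.files = dict(live.files)
        self.checkpoints = []

    def checkpoint(self, changes):
        """Save the full state of the local tree; return the checkpoint number."""
        self.files.update(changes)
        self.checkpoints.append(dict(self.files))
        return len(self.checkpoints)

    def restore(self, number):
        """Return the local tree exactly as it was at a checkpoint."""
        return dict(self.checkpoints[number - 1])

    def integrate(self, live, resolved):
        """Adopt the current live revision; `resolved` is the fixed-up tree."""
        self.base_revision = live.revision
        self.files = dict(resolved)
        self.checkpoints.append(dict(self.files))

    def promote(self, live):
        """Make this tree the live tree, but only if it is based on the latest."""
        if self.base_revision != live.revision:
            raise RuntimeError("integrate the latest live tree first")
        live.files = dict(self.files)
        live.revision += 1
        self.base_revision = live.revision
```

The design choice worth noticing is that promote refuses to run against
a stale base. The 'fixing' work from point 2 still happens, but it
happens in the developer's own line, as another checkpoint, rather than
in the tree everyone else is building against.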