My SCM Philosophy (take 1)
These are some thoughts about SCM/Source Control that I have finally
started to write down coherently.
From jeske@... Mon Mar 1 20:17:05 1999
Date: Mon, 1 Mar 1999 20:17:02 -0800
From: David Jeske
To: shr@...
Subject: [SCM writing..]
Hey Scott,
I started to congeal some of my thoughts about source control into a
more concrete form. I'd really appreciate your feedback on
this... mostly the first section. Do you understand it? Do you agree?
Is my position clear?
============================
This paper outlines some facts about source control usage, and then
rolls these ideas together with a new and simpler perspective on
modeling revision relationships called the 'leader-follower
model'. This model, and the requirements it aims to satisfy, are goals
I have created for my personal use of source control systems after
having been cornered in frustrating situations because of incomplete
source control tools and practices.
** Fact Review
1) A set of source revisions plots a linear path from the first
revision to the last revision.
2) A branch is a location in this linear path where there are two
revisions derived from the same earlier revision. In a sense, there
are two direction choices in the path.
3) Merging (otherwise known as 'the dreaded merge') occurs as a result
of two changes being made to the same file at the same time. The only
way to avoid merging is to prevent the possibility of two changes
being made to the same file at the same time (sometimes called
strict locking).
4) Losing data is bad.
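The first two facts can be pictured with a small Python sketch. The
names here (Revision, derive, branch_points) and the revision numbers
are hypothetical, chosen only for illustration; they are not taken from
any real tool.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(eq=False)
class Revision:
    """One recorded state; `parent` is the revision it was derived from."""
    name: str
    parent: Optional["Revision"] = None
    children: list = field(default_factory=list)

    def derive(self, name: str) -> "Revision":
        """Create a new revision derived from this one."""
        child = Revision(name, parent=self)
        self.children.append(child)
        return child


def branch_points(root: Revision) -> list:
    """Return names of revisions with two or more direct descendants."""
    found, stack = [], [root]
    while stack:
        rev = stack.pop()
        if len(rev.children) >= 2:
            found.append(rev.name)
        stack.extend(rev.children)
    return found


r1 = Revision("1.1")
r2 = r1.derive("1.2")
r3 = r2.derive("1.3")          # the linear path continues
side = r2.derive("1.2.1.1")    # a second revision derived from 1.2
print(branch_points(r1))
```

In this sketch the only branch is at 1.2, the one place where the path
offers two directions.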
** Where are all the SCM systems?
There is much confusion surrounding Source Control, Source
Configuration Management, and Software Configuration Management. To
reduce confusion I'd like to discuss my private idea of what SCM
really means.
Software Configuration Management is the idea of being able to store
and later reproduce the state of a software system. Think of the best
automated build system you have ever come in contact with. It probably
had the ability to tag or label all the software which was built for
that release. Hopefully the process of reproducing that build later
was fairly automated. However, how much of the underlying system did
it pay attention to? Did it record the version of the compiler you
were using when the build occurred? Did it record the versions of all
the dependent libraries used during the build? How about the version
of the operating system being used, or versions of minor (but still
significant) tools like shells and scripts used?
In a true Software Configuration Management system, I argue that I
should be able to (1) buy a brand new machine from a local retailer,
(2) go through some simple setup process to install my SCM agent on
it, and (3) have the SCM system install all necessary software to
recreate any build or state which occurred during our development
process. This includes installing operating systems, tools, scripts,
and all dependent files.
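As a rough illustration of the kind of information such a system would
need to capture at build time, here is a minimal Python sketch. The
function name build_manifest, the label, and the dependency versions are
all made up for the example, and a real system would have to record far
more than this (and be able to reinstall it).

```python
import json
import platform


def build_manifest(label: str, dependencies: dict) -> dict:
    """Collect facts about the build environment alongside the source label.

    `dependencies` maps a library or tool name to the version used; the
    caller has to supply it, since nothing here can discover it reliably.
    """
    return {
        "label": label,
        "os": platform.system(),
        "os_release": platform.release(),
        "machine": platform.machine(),
        "interpreter": platform.python_version(),
        "dependencies": dict(sorted(dependencies.items())),
    }


# Hypothetical label and versions, for illustration only.
manifest = build_manifest("release-1.0", {"zlib": "1.1.3", "gcc": "2.8.1"})
print(json.dumps(manifest, indent=2))
```

Storing a record like this next to every label is only the 'remember'
half of the problem; the 'reinstall it on a new machine' half is the
part no sketch this small can cover.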
Clearly few tools have even _attempted_ to achieve this goal, and I
don't know of any integrated tool which has. How is it then that we
continue to use the term Software Configuration Management?
Apparently, it's considered SCM to be capable of recreating the
consistent state of a specific piece of source code. The dependencies
involved in turning that source code back into the shippable release
binary are left 'as an exercise for the reader'. However, let's look at
two features which are required in order to achieve this goal:
1) Trackability of _all_ changes (and as a result, undo-redo-ability
of all changes)
2) Sequencability of all changes in the context they were applied in.
These are two simple points, yet most existing SCM systems violate
them both.
* Trackability
Trackability is the easier of these two features to understand and
provide. Trackability is simply the ability to record all changes
which are made. In the real world, this really means recording all
changes which users want to record. There are always changes which go
unrecorded because the user decided not to save the file, or perhaps
the system decided not to save for him by crashing. The best we can do
is assure that every point at which the user requests tracking is
recorded.
Trackability is necessary because without trackability it's not
possible to make sure you can get back to a state you had at one
point. The most popular feature which breaks trackability is
'non-versioned movable labels'. Many people will argue that movable
labels are fine, because they allow you to try a few times to get the
label where you want it. However, the truth is that you can try a few
times to get your software the way you need it whether or not labels
are versioned. However, if they are not versioned, not only can you
not guarantee you can get back to a particular 'temporary point' in
the development process, but someone can later move a label (even
accidentally) and remove your ability to get somewhere which was
critical for your company's survival.
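A minimal sketch of what 'versioned movable labels' could look like,
in Python. The class name VersionedLabels and the revision names are
hypothetical; the only point is that moving a label appends to its
history instead of overwriting it.

```python
class VersionedLabels:
    """Labels that can be moved, but whose every position is remembered."""

    def __init__(self):
        self._history = {}  # label name -> list of revisions, oldest first

    def set(self, label: str, revision: str) -> None:
        """Move (or create) a label; earlier positions are kept."""
        self._history.setdefault(label, []).append(revision)

    def current(self, label: str) -> str:
        """Return the revision the label points at right now."""
        return self._history[label][-1]

    def history(self, label: str) -> list:
        """Return every revision the label has ever pointed at."""
        return list(self._history[label])


labels = VersionedLabels()
labels.set("beta", "rev-41")
labels.set("beta", "rev-44")   # second attempt at placing the label
labels.set("beta", "rev-12")   # an accidental move later on
print(labels.current("beta"))
print(labels.history("beta"))
```

Even after the accidental move, rev-44 is still reachable through the
label's history, so the label can be put back.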
* Sequencability
Creating sequencability of all changes is something done so
infrequently that I don't believe most people in the SCM industry even
know what it is. Think of your favorite source control system. Imagine
the process of checking in a change you've made. Does that change get
applied in the context of the local files you were using it with, or
does it get applied in the context of the 'latest and greatest' in the
source control tree? To better understand this question, consider this
scenario:
(1) You check out the latest project files from your team's repository.
(2) You work for a week making changes to your local files.
(3) Because you want trackability, you check in your changes periodically
as the week moves along. However, you have not been pulling down
the 'latest and greatest' every day because frankly, you don't
want someone else's broken stuff to interfere with your work.
(4) You reach the end of the week, at which time you update all your
files to the 'latest and greatest'.
Now, let's return to my original question. Did those changes you
checked in during the week get applied in the context of the files
they were being built against (your local files) or did they get
applied in the context of the 'latest and greatest'? For nearly every
system in active use today (RCS, CVS, Perforce, PVCS, StarTeam, ...and
the list goes on), the answer is 'they get applied in the context of
the latest and greatest'. Are you beginning to understand the problem?
What happens when you want to get back to something you were doing in
the middle of the week? If there is no record of which local file
versions you were working with while these changes were committed, it
is impossible for the system to get you back there.
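One way to picture the missing record is a check-in that carries its
own context. The sketch below is only an illustration under assumed
names (Checkin, workspace_after) with made-up file names and version
numbers; it is not how any of the systems named above store changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkin:
    """A change plus the file versions it was made and built against."""
    changed: dict   # file name -> new version created by this check-in
    context: dict   # file name -> version present in the work area


def workspace_after(checkin: Checkin) -> dict:
    """Rebuild the exact work area state that existed after `checkin`."""
    state = dict(checkin.context)
    state.update(checkin.changed)
    return state


# Midweek: parser.c is changed while the work area still holds the
# lexer.c from the start of the week, even though the repository head
# of lexer.c may have moved on since then.
wednesday = Checkin(changed={"parser.c": 15},
                    context={"parser.c": 14, "lexer.c": 7})
print(workspace_after(wednesday))
```

With the context stored, 'what I was doing in the middle of the week'
becomes a question the system can answer rather than one the developer
has to remember.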
ASIDE: The most common argument I hear against this idea is 'why do
you need to get back there?' It should be pretty apparent that
an argument like that isn't an argument at all, but rather some
kind of derision tactic to help you think it's not important.
However, there are many reasons you may want to get back there.
The most common one, and one which I'll bet most people reading
this have come across at least once, if not often, is when the
software breaks after you perform that "update" in step #4 at the
end of the week. You sit there hitting your head against the wall
that you didn't save your local files. What you need at this point
is a list of all the "other" files (i.e. dependent files) which
changed between 'a minute ago' and after the update. However, if
the source control system does not remember your local file versions,
and sequence your changes according to your local file versions,
then there is simply no way to return to that point after the fact.
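If the work area state before the update had been recorded, the list
of 'other' files that changed underneath you is a simple comparison.
A minimal Python sketch, assuming a work area can be described as a
mapping from file name to version (the function name and the sample
data are hypothetical):

```python
def changed_underneath(before: dict, after: dict) -> dict:
    """Map each file whose version differs to its (before, after) pair.

    A file missing on one side is reported with None for that side.
    """
    report = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            report[name] = (old, new)
    return report


before_update = {"parser.c": 15, "lexer.c": 7, "util.c": 3}
after_update = {"parser.c": 15, "lexer.c": 9, "util.c": 3, "hash.c": 1}
print(changed_underneath(before_update, after_update))
```

Here the report names lexer.c (7 to 9) and the newly arrived hash.c,
which is exactly the list of suspects you want after a bad update.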
However, there is another evil which was lurking all week that nobody
warned you about either. If anyone else on your project checked out
the project files in the middle of the week, he gets a jumble of
changes which are known to work in the context of each developer's
local build, but which have never been tested or built against the
latest head revisions, because nobody else has pulled the latest head
revisions yet.
ASIDE: Another evil workaround exists in existing systems to
'solve' this problem. Have you ever heard of the source control
policy "don't check in changes until they work", or "don't
break the tree"? Every time I hear one of these statements, my
brain translates it into this: "Our SCM does not provide
sequencability. As a result, if you check in changes which don't
work, you will affect other people's ability to work. We have
decided that it's more important to keep the group working than
to allow trackability of individual developer changes." To which
the other part of my brain responds, "I need trackability of
my changes, that's the whole reason I want a SCM system!".
How many times have you felt like you wanted to have your own little
private RCS tree to keep track of local changes which were not
ready to be shown to others, but which are too complicated to keep
untracked? How many times have you created numbered save files to
track your work privately, because you were not ready to check
it into the public tree?
======================================================================
WARNING:
It gets lengthy and 'less well thought out' below this line.. enter
at your own risk.
======================================================================
** The Leader/Follower Model
This model is based on the idea that developers know where they are
headed, and they know who they are. However, they shouldn't have to
know whether to branch early or branch late, or whether to branch at
all.
In this model, every project is thought of as a 'Leader' headed in
some direction. This might be instead thought of as a trunk, or a
'Line of Development'. You've seen the pictures before, some file,
several revisions neatly checked into a single line. However, consider
what is happening when a user is working on the private files in his
local work directory. Certainly he's not spending his time following
the project while he's doing his own local changes. He is moving in
his own direction. In our terms, 'he is his own leader and
follower'. At some point, he will decide that he's happy with the
result of his work, and he wants to get back in sync with the rest of
the people on his project. In a sense, he's going to start following the
project for a while, so he can get his changes incorporated, and so he
can get the new things they've been doing while he's been away.
- working on private files
Traditional source control systems ignore this private part of the
development process and provide no mechanism for recording changes made
during this important time. Furthermore, traditional systems which
allow concurrent changes by multiple users force users either to
perform a lossy-merge of their changes, or to go through a troublesome
branch process first. By lossy-merge, I mean that the state of the
files as they existed in the local workspace is lost, because it is
only recorded as a product of a merge with the current project head
revisions.
The Leader/Follower Model provides a powerful paradigm for recording
this private work.
- minimize branch work
We all want to minimize the work of creating branches and the work of
merging branches. The best way to minimize the work of merging
branches is to create them as seldom as possible. Some simple
observation leads us to the conclusion that the latest possible time a
branch can occur is at the exact point where two different versions
'collide'. (By collide we mean that two different versions are
provided which are based on the same head revision).
The best way to minimize the work of creating branches is to never
have to explicitly create them at all. The Leader/Follower Model
promotes the idea that branches are created automatically when needed,
and only when needed.
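To show what 'branches are created automatically, and only when needed'
might mean in practice, here is a small Python sketch. The class name
Line and the r0, r1, ... revision names are assumptions made for the
example; it models only the bookkeeping, not file contents.

```python
class Line:
    """A line of development that branches only when check-ins collide."""

    def __init__(self):
        self.parent_of = {"r0": None}   # revision -> revision it is based on
        self.head = "r0"                # tip of the leader's line
        self.branch_tips = []           # tips of automatically created branches
        self._next = 1

    def checkin(self, based_on: str) -> str:
        """Record a new revision derived from `based_on`; return its name."""
        rev = f"r{self._next}"
        self._next += 1
        self.parent_of[rev] = based_on
        if based_on == self.head:
            self.head = rev             # no collision: the line just advances
        elif based_on in self.branch_tips:
            # continuing an existing branch: move that branch's tip forward
            self.branch_tips[self.branch_tips.index(based_on)] = rev
        else:
            self.branch_tips.append(rev)  # collision: a branch appears here
        return rev


line = Line()
a = line.checkin("r0")      # r1, advances the line
b = line.checkin("r0")      # r2, also based on r0: collides with r1
c = line.checkin(a)         # r3, advances the line again
print(line.head, line.branch_tips)
```

Nobody asked for a branch here. The second check-in based on r0 simply
became one, at the latest possible moment, and the head moved on to r3.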
- Keep all your data
Every merge should occur on two revisions which actually exist in the
source control system.
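A minimal sketch of that rule in Python, under assumed names (Store,
record): the work area state is recorded as its own revision first, and
the merge result then names both of its inputs. Recording a revision
whose parent was never stored is refused.

```python
class Store:
    """Append-only revision store where a merge names both of its inputs."""

    def __init__(self):
        self.revisions = {}   # revision id -> (content, tuple of parent ids)

    def record(self, rev_id: str, content: str, parents: tuple = ()) -> str:
        """Store a revision; every parent must already be in the store."""
        for parent in parents:
            if parent not in self.revisions:
                raise KeyError(f"parent {parent!r} was never recorded")
        self.revisions[rev_id] = (content, parents)
        return rev_id


store = Store()
store.record("base", "a\n")
head = store.record("head", "a\nb\n", ("base",))
local = store.record("local", "a\nc\n", ("base",))   # work area state, saved first
store.record("merged", "a\nb\nc\n", (head, local))    # merge of two real revisions
print(store.revisions["merged"][1])
```

Because "local" exists in the store, the pre-merge state of the work
area can still be retrieved after the merge, which is what a
lossy-merge throws away.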
------------------
As yet unused verbiage:
Most people know how to merge during an update, but few people know
how to merge between branches.
Branching is normally an explicit mechanism by which a developer
decides he wants to split one development path into two. Much debate
is centered around when to branch (i.e. branch early vs. branch
late). Branching is often too complicated. Seldom do individual
developers learn enough to do proper branching when they should.
-------------------
Personal Experiences
[currently this is just ripped from an email I sent to bitkeeper-users
some time ago, but I want to work-it-over and include it.]
These are the things which I believe are considerably helped by the
staging area/work area/multi-level repository model of BitKeeper:
1) Trackability
2) There is a 'first class' model for working offline
3) It formalizes operations performed by non-synchronized groups
1) Trackability
One of the goals for any SCM system (for me) is complete
trackability. The words "Software Configuration Management" conjure up
a picture in my mind of wanting to recreate a build which happened a
year ago, and having the SCM system install whatever is necessary on
my machine, OS, applications, whatever... to recreate the environment
of the build. There may be some combinations of systems which can make
this possible, however, currently the troubles of doing something
which has such complete control are many. So most SCM systems settle
out at something less than the scenario I describe above. The simplest
of them is really more of a source control system than an SCM, but it
is able to recreate the state of a source tree consistently, and that
makes it different than something like RCS which basically just
records per-file deltas.
At any rate, it strikes me as ironic for an SCM or Source Control
system to have a mechanism where policy like "don't checkin until it
works" is commonly used. The whole point of source control (for me) is
complete trackability.
I can express the need for trackability with this simple scenario:
Forget about checkins and checkouts for a moment. Trackability is
about being able to say "damn, what I had a minute ago worked, and I
just broke it.... SCM take me back to that working version I had a
minute ago".
This doesn't work very well if you have to wait until you get something
working in order to check it in.
In single repository systems, in order to provide a work area, users
must go off and create branches, or worse, the policy of the
environment as a whole must follow a 'label promotion' style. For
example, I tried this with Perforce, where I created a branch view
which was "mine" and I did all work against the branch view. If I
wanted to work on something, I would branch it into my personal
branch.. I could then submit against it as often as I liked, and when
I was ready, I would push those changes back to where they came
from. However, because of how Perforce (and most systems) deal with
branching, I basically created a whole lot more work for
myself. Perforce does not find the original source of the change to
retrieve a description, so you have to re-include the description of
the change each time, and it basically turns out to be a whole lot
more work just to get your changes into the tree. This is not to say
that it didn't work, it did... and it achieved my goal. However, it
created a whole lot more work for me in the process, because it just
wasn't designed to be used this way.
2) There is a 'first class' model for working offline
Let me recount a recent scenario I had with working offline. At work,
we currently run SourceSafe (we're switching soon). However, all that's
relevant here is that it's a 'single repository' system.
I went offsite to make some changes, which isn't something I do very
often. It was a special case brought on by a special customer need. So
I checked out the source code onto a laptop, made sure everything was
buildable and took off for the customer site. While there I
changed the software and left them with a custom build, satisfying
their need.
When I returned, I had no clean record of exactly what I changed,
because obviously SourceSafe wouldn't work while I was offline. I
could have created an ad hoc policy of recording exactly what files I
changed, but I didn't. So I went through the tree, mostly remembering
what changes I made, and applying them to my 'real' tree on my
stationary development machine. Fortunately, there were pretty few
changes. I couldn't just do an 'after the fact' checkout and checkin,
because my coworkers were working on the files during my
trip. SourceSafe just didn't have any way to understand what I had
been doing while I was offline.
Not only did this whole process waste my time and energy, but I ended
up missing one line of one file. It was a week later before I found
the one line, and I found it because I had sent off an 'official
build' of the changes I had made while I was offsite to that
customer. (Just trust me when I say that this particular deliverable
is hard to QA internally, and that it's not for lack of resources..).
At any rate, if there were an offline model for submitting changes,
this never would have happened.
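Short of a real offline model, the ad hoc policy of 'recording exactly
what files changed' can at least be automated. The Python sketch below
takes a content snapshot of a tree before and after the trip and lists
the files that differ. The function names (snapshot, touched) and the
sample files are hypothetical, and this only says which files changed,
not how to merge them with what coworkers did in the meantime.

```python
import hashlib
import os
import tempfile


def snapshot(root: str) -> dict:
    """Map each file path (relative to `root`) to a SHA-256 of its bytes."""
    result = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(folder, name)
            with open(path, "rb") as handle:
                digest = hashlib.sha256(handle.read()).hexdigest()
            result[os.path.relpath(path, root)] = digest
    return result


def touched(before: dict, after: dict) -> list:
    """List files that were added, removed, or modified between snapshots."""
    names = set(before) | set(after)
    return sorted(n for n in names if before.get(n) != after.get(n))


with tempfile.TemporaryDirectory() as tree:
    samples = {"main.c": "int main(void) { return 0; }\n", "io.c": "/* io */\n"}
    for name, text in samples.items():
        with open(os.path.join(tree, name), "w") as handle:
            handle.write(text)
    leaving = snapshot(tree)                      # taken before going offsite
    with open(os.path.join(tree, "io.c"), "a") as handle:
        handle.write("/* customer fix */\n")      # the one-line change
    returning = snapshot(tree)                    # taken back at the office
    print(touched(leaving, returning))
```

In this run only io.c is reported, so a one-line change like the one
that went missing would at least have shown up on the list.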
3) It formalizes operations performed by non-synchronized groups.
This is most relevant for the free source model, and is likely much of
the intent of BitKeeper's use with Linux. Although it's certainly not
limited to this case, I'll discuss it in terms of free source.
Free source projects, by their nature, have many developers who are
not always in constant contact. Often times, a developer will start
working on his own addition, and not complete it until several
'releases' later. It takes people time to learn how to correctly
create patches, particularly in situations where the version they
changed is an older version. As a result, with my own project, I end
up having new developers send me their whole tree, and I have to
perform the diffing process myself. This is extra work that I don't
want to do. So, I consider the currently available (in a world without
BitKeeper) alternative... I consider setting up a public repository
(CVS or Perforce).
However, that solution is fraught with its own set of problems. I
don't want to let all developers just submit whatever changes they
like, frankly because I don't trust that they will all follow the
goals I have in mind. (If they want to establish their own goals, they
should split off and establish their own coordinated
distribution). So, I can compromise and let trusted developers have
access to the repository (which is what many projects do), but then
we're back to the same informal issues above when it comes time for
developers to submit patches.
Don't get me wrong, it's not that carefully made patches don't work
well. It's just that the developers basically have to invent their own
systems for making sure they get trackability with some local source
control system, and that they can generate appropriate patches.
It gets much worse when someone _wants_ to stay out of sync, because
they have their own private changes that aren't going to be included in
the main distribution (whether that's for a short period of time, or
forever).
--
David Jeske (N9LCA) + http://www.chat.net/~jeske/ + jeske@...