
As I've started working on this small CPU project, one of the first decisions I've been considering is which version control system to use. I've used Subversion for most of my personal projects for several years now and am currently using it at a client. As a result, I'm quite familiar with the ins and outs of using it on projects of various sizes. I've become more aware of distributed systems, such as Git and Mercurial, over the last year, but haven't really been able to get my head around their advantages. In particular, the quote below from Linus Torvalds has been in the back of my mind.
"The slogan of Subversion for a while was 'CVS done right', or something like that, and if you start with that kind of slogan, there's nowhere you can go. There is no way to do CVS right."
- Linus Torvalds
One of the main source control issues I've seen on several of the projects I've worked on has been the aversion to branches that most users have. Typically there is a big central source repository that everyone checks out from. You then develop in your own little world. When the particular piece of work is complete, you check it back in. Usually, there is a fairly high barrier or cost to those commits, with sets of test suites that you must pass before you can commit your code back to the central repository. The checks take hours to run and you cannot check back in until your code passes all the tests. Otherwise everyone else is at risk. But I always found that if I was working on something non-trivial, I'd really like to make some progress and checkpoint it halfway, committing it to just a local branch and then working further. That would give me the confidence to make larger changes, safe in the knowledge that I could revert to a midway working point. That's what a branch would be for, after all, but not if branches are hard to make and not if the commit cost is so high. So we never did that, working for days or weeks before committing any changes.
The second common frustration I've seen with a centralized repository occurs when two people are working closely together on a piece of the system. This happens to characterise almost every verification endeavor, for example. By common definition, the verification and design work should be done by two different people, just to get extra eyes on the spec. This avoids duplicating erroneous assumptions about the design and is fundamental to the whole process. As a consequence, we are almost always faced with the situation where changes need to be made by two or more people, in distinct parts of the code (e.g., testbench and RTL), but cannot be checked in because of mutual dependencies. The changes depend on each other and all the commit checks will fail for either change on its own. Various ways around this exist: disabling the affected checks in the commit scripts, copying files into each other's workspaces and other hacks. All because, fundamentally, the centralised server approach, with costly branches and high commit costs, doesn't really let this sort of work proceed in an effective way.
The third frustration is the general speed of the repository: the time to check things out, to do merges, to run a diff or an update. These operations can often mean a break for coffee or a walk around while the tool fetches the changes, compares them and attempts to merge it all together. Working at remote sites or across multiple geographic locations compounds the problem.
Git claims to solve these problems and be a whole lot faster at the same time.
The key is in breaking away from a centralised server. The database is distributed to every developer. As a result, everyone works on their own branch by default. Making further branches is trivial, because they don't get sent to every other developer. There are fewer issues with namespace collisions when naming a branch, and no real concern about checking code in and someone else getting your partially finished work. Earlier today I'd listened to Joel Spolsky and Jeff Atwood talking about the fact that Git makes branching trivial, but I didn't really understand why until I watched a really interesting presentation from Linus Torvalds on the subject. It is supposed to be a talk about Git, but really he focuses almost exclusively on the advantages of a distributed repository. I'd initially thought the real advantage was the 'always available' nature of a distributed repository, so that you could work on a plane or generally away from a network and still be able to check in, look at histories and do all the things you normally need central server access for. That's certainly part of the reason why it is interesting, but the branching and merging cost reduction that Git claims to offer is a much bigger deal.
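One way to picture why branching can be cheap in this model is to treat a branch as nothing more than a name pointing at a commit. The sketch below is a toy illustration under that assumption, not Git's actual storage format or API; the class `ToyRepo`, its methods and the branch names are all hypothetical.

```python
# Toy model only: this is not Git's real storage format or API.
import hashlib


class ToyRepo:
    """A minimal commit graph with branches stored as name -> commit id."""

    def __init__(self):
        self.commits = {}                  # commit id -> (parent id, message)
        self.branches = {"master": None}   # branch name -> tip commit id
        self.current = "master"

    def commit(self, message):
        parent = self.branches[self.current]
        # Derive the id from the parent id and the message.
        commit_id = hashlib.sha1(f"{parent}:{message}".encode()).hexdigest()
        self.commits[commit_id] = (parent, message)
        self.branches[self.current] = commit_id
        return commit_id

    def branch(self, name):
        # Creating a branch copies one pointer; no commit data is duplicated.
        self.branches[name] = self.branches[self.current]

    def checkout(self, name):
        self.current = name

    def log(self, name):
        """Return the messages on a branch, newest first."""
        messages = []
        commit_id = self.branches[name]
        while commit_id is not None:
            parent, message = self.commits[commit_id]
            messages.append(message)
            commit_id = parent
        return messages


repo = ToyRepo()
repo.commit("initial design")
repo.branch("experiment")
repo.checkout("experiment")
repo.commit("half-finished rework")
```

In this toy, the half-finished commit exists only on the `experiment` branch of the local repository, while `master` still points at the last known good state.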
For my second source of frustration above, Git also provides a solution. As there is no central repository, everyone can pull and push data to each other. The verification engineer and designer can exchange files more easily, through a tracked, version-controlled system, rather than the usual sideband exchanges or hacks to the check-in scripts. Git also addresses that third issue, because all of the files are local and it has been designed for performance. Network overhead isn't an issue for a diff or history request, as you have all the data locally. Merges are similarly less painful. The claimed performance is impressive and part of the reason why I want to try Git out.
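The peer-to-peer exchange described above can be sketched with two in-memory repositories that pull from each other directly. This is a toy model, assuming the simplest case where one history is a straight continuation of the other; the helper functions, the commit ids `c1` and `c2`, and the commit messages are hypothetical and do not correspond to any real Git command.

```python
# Toy model only: plain dictionaries stand in for repositories.
def new_repo():
    return {"commits": {}, "tip": None}


def commit(repo, commit_id, message):
    """Record a commit whose parent is the repository's current tip."""
    repo["commits"][commit_id] = (repo["tip"], message)
    repo["tip"] = commit_id


def pull(dest, src):
    """Copy commits that dest lacks from src, then move dest's tip forward.

    Only handles the case where src's history already contains dest's tip;
    diverged histories would need a merge, which this sketch does not attempt.
    """
    # Walk src's history from its tip back to the first commit.
    ancestors = []
    commit_id = src["tip"]
    while commit_id is not None:
        ancestors.append(commit_id)
        commit_id = src["commits"][commit_id][0]
    if dest["tip"] is not None and dest["tip"] not in ancestors:
        raise ValueError("histories diverged; a merge would be needed")
    copied = [c for c in ancestors if c not in dest["commits"]]
    for c in copied:
        dest["commits"][c] = src["commits"][c]
    dest["tip"] = src["tip"]
    return copied


designer = new_repo()
commit(designer, "c1", "rtl: add fetch stage")

verifier = new_repo()
pull(verifier, designer)              # verification engineer starts from the designer's work
commit(verifier, "c2", "testbench: cover fetch stage")

received = pull(designer, verifier)   # designer pulls the testbench change back directly
```

No third, central repository is involved at any point: the two developers end up with the same tracked history purely by exchanging commits with each other.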
Now, the most glaring problem with all this is that it sounds like anarchy. There is no central organisation and check-ins can happen at any time, so where did all the quality assurance go? Linus talks about the network of trust relationships in his presentation. But you can still apply acceptance tests at the point where you actually pull data from a particular user or set of users. You can require them to run a battery of tests before they are allowed to share their work with the rest of the project. The usual checks and balances can be put back in place for when the whole database gets reassembled, but the individual developers or groups of designers can work more efficiently in a sub-repository. Git also supports hierarchical projects that combine various blocks of code; in fact, that seems to be the preferred use model. Each sub-system of a design would be a unique Git repository. It could even be broken down further, with each IP block in its own repository. The general approach that has been used in the past, with quality checks, can still be used with some changes, as a gate on when larger merges take place. This probably requires some trusted people in the organisation to act as gatekeepers or guardians for each level, but the basic methodology shouldn't be too difficult to layer on top.
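The gatekeeper idea can be sketched as a function that only accepts a set of changes into the shared mainline if every acceptance check passes. This is an illustrative sketch, not a description of any Git feature; the function `gatekeeper_pull`, the two checks and the change descriptions are hypothetical placeholders for a real test battery.

```python
# Sketch only: the checks and change names here are hypothetical placeholders.
def gatekeeper_pull(mainline, candidate, checks):
    """Accept a candidate change set into mainline only if every check passes.

    Returns the names of the failed checks; an empty list means accepted.
    """
    failures = [name for name, check in checks.items() if not check(candidate)]
    if failures:
        return failures          # rejected: mainline is left untouched
    mainline.extend(candidate)
    return []


checks = {
    "has_testbench": lambda changes: any(c.startswith("testbench") for c in changes),
    "has_rtl": lambda changes: any(c.startswith("rtl") for c in changes),
}

mainline = []

# A design change offered on its own fails the gate.
rejected = gatekeeper_pull(mainline, ["rtl: widen ALU"], checks)

# The design and verification changes offered together pass it.
accepted = gatekeeper_pull(
    mainline, ["rtl: widen ALU", "testbench: ALU width tests"], checks
)
```

The point of the sketch is that the gate sits at the moment of pulling into the shared line, so mutually dependent changes can be developed and exchanged freely beforehand and then offered together.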
You can read a lot more about Git on the homepage, including conversion documents from other common source control systems and details on the actual commands to use. Looking through the SVN conversion document, the Git command syntax appears a bit cleaner and generally more intuitive to me. I also played around with the merge and diff tools and they seem powerful. It was very easy to create and populate a repository, for example. I plan on using it for the next few projects I work on to get a feel for how useful it really is and where the issues are hidden.
Edit to add: I found this draft version of the differences between Git and Subversion quite useful.
Another distributed VC system we were testing out was Bazaar (http://bazaar-vcs.org). It's written in Python and has a very similar command set to Subversion, except that it understands branch history and easily merges branches (push and pull). I think we decided against it in favor of Subversion only due to it not being ready for prime time two years ago, and not having support for Trac (http://trac.edgewall.org).
One big downside with the distributed approach is that it's hard to see which changes are in each branch (everyone is in their own sandbox basically), and since there's no central repository, you can't just get an entire listing of each branch that exists and what's changed in that branch. On the other hand, some people like this invisibility...
Posted by: Payton Quackenbush | June 27, 2008 at 07:56 PM
Hi Payton,
Thanks for taking the time to comment. I'd looked at Bazaar a few months ago, too. It doesn't appear to have quite the name recognition at the moment that Git and Hg are attracting, not sure why. Git gets the attention no doubt because of the Linux kernel using it, as well as Wine and X. There seems to be a lot of cross pollination of ideas between the 3 tools.
I'd be curious to see how the performance of the three compares, given how much of a big deal Torvalds makes about the need for Git to be very fast. It looks like both Mercurial and Bazaar are written in Python, compared to Git being coded in C. It is a bit of an open question in my mind whether the performance really makes a difference for chip-sized projects (which may differ from the Linux kernel's issues and requirements in terms of the scale of the problems).
Your comment about the ability to hide changes is true. In general, the whole issue of how you ensure everything gets rolled up correctly to the final chip, and how the tape-out repository gets assembled, seems like an interesting problem for a distributed system that I haven't really got quite straight in my mind. The processes that need to be layered on top of it would be key, rather than just which tool you pick. I've been aware of similar issues with centralised repositories, where the wrong revision IDs listed in a configuration file meant that the wrong chip was taped out. Probably similar checks and controls need to be in place for a distributed system as are needed for a centralised approach, to avoid this sort of fundamental error. The SCM tool is just that, a tool, after all.
There does seem to be the potential for entire sections of code or revisions to be 'lost' in someone's local directory, never to make it to the final production version, but that could happen with the centralised server too.
Posted by: GordonMcGregor | June 27, 2008 at 11:56 PM
You are correct, Bazaar performance two years ago wasn't great. Apparently they've spent a bunch of their cycles improving that, but I'm sure it's nowhere close to git, which was designed from the ground up with performance in mind (and I like how persistent Linus is on never getting corrupted data into the system). I found bzr had an easy to learn usage model though.
Posted by: Payton Quackenbush | June 29, 2008 at 10:01 AM
What is the advantage that you are looking for in a distributed versus centralized system? Is it just better fine-grain branching capabilities? I've used one centralized system (Perforce, non-free) that has more robust branching capabilities than svn. So far {git, hg, bzr} seem like feature overkill, at least for traditional commercial projects that can maintain a central server.
Posted by: Josh Scheid | October 20, 2008 at 07:00 PM
Hello Josh, thanks for taking the time to comment. Easier branching is certainly a welcome feature, in any source control tool. My general experience is that most users are scared of branching and avoid it - mainly because of the complexities in the process for most tools.
The other significant advantage that Git claims is performance. Similar to a recent post on compiler performance, a slow version control tool can reduce the amount of work you get done. If it takes 20 minutes to do an update and compare, you switch off to do something else, rather than staying engaged. Git at least claims to address this, even for large projects. I haven't had a chance to evaluate that, but I know that slow source control eats away at the time you have to work each day.
Posted by: GordonMcGregor | October 22, 2008 at 07:59 AM
Hi Gordon,
Thanks for your angles on git. I found the hardest part to be the re-thinking of concepts while re-using the old terms. It doesn't help that git branches carry their name more deservedly than svn branches. That's nice for git, but the term is tainted by the technicalities one has come to expect, and with branches, tags, revisions, commits, heads etc. mostly being the same but then again in some important aspects not at all ... that's why it's so hard to get one's head around it, I guess. Anyway, I found your article one of the more helpful ones for doing just that.
Posted by: P. Row | November 10, 2009 at 05:34 AM