Peer Code Reviews in a Mercurial World

Mercurial, or Hg, is a brilliant DVCS (Distributed Version Control System). Personally I think it’s much better than Git, but that’s a whole religious war in itself. If you’re not familiar with at least one of these systems, do yourself a favour and take a read of hginit.com.

Massive disclaimer: This worked for our team. Your team is different. Be intelligent. Use what works for you.

The Need for Peer Code Reviews

I’ve previously worked in a number of environments which required peer code reviews before check-ins. For those not familiar, the principle is simple – get somebody else on the team to come and validate your work before you hit the check-in button. Now, before anybody jumps up and says this is too controlling, let me highlight that this was back in the unenlightened days of centralised VCSs like Subversion and TFS.

This technique is just another tool in the toolbox for finding problems early. The earlier you find them, the quicker and cheaper they are to fix.

  • If you’ve completely misinterpreted the task, the reviewer is likely to pick up on this. If they’ve completely misinterpreted the task, it spurs a discussion between the two of you that’s likely to help them down the track.
  • Smaller issues like typos can be found and fixed immediately, rather than being relegated to P4 bug status and left festering for months on end.
  • Even if there aren’t any issues, it’s still useful as a way of sharing knowledge around the team about how various features have been implemented.

On my current project we’d started to encounter all three of these issues – we were reworking code to fit the originally intended task, letting small issues creep into our codebase and using techniques that not everybody understood. We identified these in our sprint retrospective and identified the introduction of peer code reviews as one of the techniques we’d use to counter them.

Peer Code Reviews in a DVCS World

One of the most frequently touted benefits for DVCS is that you can check-in anywhere, anytime, irrespective of network access. Whilst you definitely can, and this is pretty cool, it’s less applicable for collocated teams.

Instead, the biggest benefit I perceive is how frictionless commits enables smaller but more frequent commits. Smaller commits provide a clearer history trail, easier merging, easier reviews, and countless other benefits. That’s a story for a whole other post though. If you don’t already agree, just believe me that smaller commits are a good idea.

Introducing a requirement for peer review before each check-in would counteract these benefits by introducing friction back into the check-in process. This was definitely not an idea we were going to entertain.

The solution? We now perform peer reviews prior to pushing. Developers still experience frictionless commits, and can pull and merge as often as possible (also a good thing), yet we’ve been able to bring in the benefits of peer reviews. This approach has been working well for us for 3 weeks so far (1.5 sprints).

It’s a DVCS. Why Not Forks?

We’ve modelled our source control as a hub and spoke pattern. BitBucket has been nominated as our ‘central’ repository that is the source of truth. Generally, we all push and pull from this one central repository. Because our team is collocated, it’s easy enough to just grab the person next to you to perform the review before you push to the central repository.

Forks do have their place though. One team member worked from home this week to avoid infecting us all. He quickly spun up a private fork on BitBucket and started pushing to there instead. At regular intervals he’d ask one of us in the office for a review via Skype. Even just using the BitBucket website, it was trivial to review his pending changesets.

The forking approach could also be applied in the office. On the surface it looks like a nice idea because it means you’re not blocked waiting on a review. In practice though, it just becomes another queue of work which the other developer is unlikely to get to in as timely a manner. “Sure, I’ll take a look just after I finish this.” Two hours later, the code still hasn’t hit the central repository. The original developer has moved on to other tasks. By the time a CI build picks up any issues, ownership and focus has long since moved on. An out-of-band review also misses the ‘let’s sit and have a chat’ mentality and knowledge sharing we were looking for.

What We Tried

To kick things off, we started with hg out. The outgoing command lists all of the changesets that would be pushed if you ran hg push right now. By default it only lists the header detail of each changeset, so we’d then run though hg exp 1234, hg exp 1235, hg exp 1236, etc to review each one. The downsides to this approach were that we didn’t get colored diff outputs, we had to review them one at a time and it didn’t exclude things like merge changesets.

Next we tried hg out -p. This lists all of the outgoing changesets, in order, with their patches and full colouring. This is good progress, but we still wanted to filter out merges.

One of the cooler things about Mercurial is revsets. If you’re not familiar with them, it’d pay to take a look at hg help revsets. This allows us to use the hg log command, but pass in a query that describes which changesets we want to see listed: hg log -pr "outgoing() and not merge()".

Finally, we added a cls to the start of the command so that it was easy to just scroll back and see exactly what was in the review. This took the full command to cls && hg log -pr "outgoing() and not merge()". It’d be nice to be able to do hg log -pr "outgoing() and not merge()" | more but the more command drops the ANSI escape codes used for coloring.

What We Do Now

To save everybody from having to remember and type this command, we added a file called review.cmd to the root of our repository. It just contains this one command.

Whenever we want a review we just type review and press enter. Too easy!

One Final Tweak

When dealing with multiple repositories, you need to specify which path outgoing() applies to in the revset. We updated the contents of review.cmd to cls && hg log -pr "outgoing(%1) and not merge()". If review.cmd is called with an argument, %1 carries it through to the revset. That way we can run review or review myfork as required.

Released: SnowMaker – a unique id generator for Azure (or any other cloud hosting environment)

What it solves

Imagine you’re building an e-commerce site on Azure.

You need to generate order numbers, and they absolutely must be unique.

A few options come to mind initially:

  • Let SQL Azure generate the numbers for you. The downside to this approach is that you’re now serializing all of your writes down to a single thread, and throwing away all of the possible benefits from something like a queuing architecture. (Sidenote: on my current project we’re using a NoSQL graph DB with eventual consistency between nodes, so this wouldn’t work for us anyway.)
  • Use a GUID. These are far from human friendly. Seriously, can you imagine seeing an order form with a GUID on the top?
  • Prefix numbers with some form of machine specific identifier. This now requires some way to uniquely identify each node, which isn’t very cloud-like.

As you can see, this gets complex quickly.

SnowMaker is here to help.

What it does

SnowMaker generates unique ids for you in a highly distributed and highly performant way.

  • Ids are guaranteed to be unique, even if your web/worker role crashes.
  • Ids are longs (and thus human readable).
  • It requires absolutely no node-specific configuration.
  • Most id generation doesn’t even require any off-box communication.

How to get it

Library: Install-Package SnowMaker

(if you’re not using NuGet already, start today)

Source code: hg.tath.am/snowmaker or github.com/tathamoddie/snowmaker

How to use it

var generator = new UniqueIdGenerator(cloudStorageAccount);
var orderNumber = generator.NextId("orderNumbers");

The only caveat not shown here is that you need to take responsibility for the lifecycle of the generator. You should only have one instance of the generator per app domain. This can easily be done via an IoC container or a basic singleton. (Multiple instances still won’t generate duplicates, you’ll just see wasted ids and reduced performance.) Don’t create a new instance every time you want an id.

Other interesting tidbits

The name is inspired by Twitter’s id generator, snowflake. (Theirs is more scalable because it is completely distributed, but in doing so it requires node-specific configuration.)

Typical id generation doesn’t even use any locks, let alone off-box communication. It will only lock and talk to blob storage when the id pool has been exhausted. You can control how often this happens by tweaking the batch size (a property on the generator). For example, if you are generating 200 order ids per server per second, set the batch size to 2000 and it’ll only lock every 10 seconds.

Node synchronisation is done via Azure blob storage. Other than that, it can run anywhere. You could quite easily use this library from AppHarbor or on premise hosting too, you’d just wear the cost of slightly higher latency when acquiring new ids batches.

The data persistence is swappable. Feel free to build your own against S3, Ninefold Storage, or any other blob storage API you can dream up.

The original architecture and code came from an excellent MSDN article by Josh Twist. We’ve brushed it off, packaged it up for NuGet and made it production ready.

Under the covers

SnowMaker allocates batches of ids to each running instance. Azure Blob Storage is used to coordinate these batches. It’s particularly good for this because it has optimistic concurrency checks supported via standard HTTP headers. At a persistence level, we just create a small text file for each id scope. (eg, the contents of /unique-ids/some-id-scope would just be “4”.)

One issue worth noting is that not all ids will always be used. Once a batch is checked out, none of the ids in it can ever be reallocated by SnowMaker. If a batch is checked out, only one id is used, then the process terminates, the remaining ids in that batch will be lost forever.

Here’s a sequence diagram for one client:

SequenceDiagram

Here’s a more complex sequence diagram that shows two clients interacting with the store, each using a different batch size:

Multiple clients

Homomorphic Encryption + Cloud

There’s an interesting article in this month’s MIT Technology Review about ‘homomorphic encryption’. It has been around in principle for some 30 years but is seemingly back in vogue thanks to cloud computing.

The simple run down:

  • You want to use a cloud service to perform some computation (add numbers together)
  • You don’t want to give the cloud compute provider your original data (numbers) though
  • You take your original data (1 and 2), encrypt it locally (33 and 54), then upload it
  • The cloud service performs the computation on the encrypted data (33 + 54 = 87)
  • You download the encrypted result (87) and decrypt it locally to find the answer (3)

Obviously the complexity sky rockets when you start talking about something like full text indexing, document parsing, etc … and may not even be possible without influencing the encryption process to the point that it becomes predictable … but it’s a fascinating idea none-the-less.

I can see this being useful with something like table storage. If someone like MSR could scale the algorithms sufficiently to handle clustered + non-clustered indexes – you could have Azure table storage with client side encryption and all the algorithms magically buried away by the fabric. How cool would that be?

The article: http://www.technologyreview.com/computing/37197/

Released: FormsAuthenticationExtensions

What it does

Think about a common user table. You probably have a GUID for each user, but you want to show their full name and maybe their email address in the header of each page. This commonly ends up being an extra DB hit (albeit hopefully cached).

There is a better way though! A little known gem of the forms authentication infrastructure in .NET is that it lets you embed your own arbitrary data in the ticket. Unfortunately, setting this is quite hard – upwards of 15 lines of rather undiscoverable code.

Sounds like a perfect opportunity for another NuGet package.

How to get it

Library: Install-Package FormsAuthenticationExtensions

(if you’re not using NuGet already, start today)

Source code: formsauthext.codeplex.com

How to use it

Using this library, all you need to do is add:

 using FormsAuthenticationExtensions; 

then change:

 FormsAuthentication.SetAuthCookie(user.UserId, true); 

to:

 var ticketData = new NameValueCollection {
    { "name", user.FullName },
    { "emailAddress", user.EmailAddress }
 };
new FormsAuthentication().SetAuthCookie(user.UserId, true, ticketData);

Those values will now be encoded and persisted into the authentication ticket itself. No need to store it in any form of session state, custom cookies or extra DB calls.

To read the data out at a later time:

 var ticketData = ((FormsIdentity) User.Identity).Ticket.GetStructuredUserData();
var name = ticketData["name"];
var emailAddress = ticketData["emailAddress"];

If you want something even simpler, you can also just pass a string in:

 new FormsAuthentication().SetAuthCookie(user.UserId, true, "arbitrary string here"); 

and read it back via:

 var userData = ((FormsIdentity) User.Identity).Ticket.UserData; 

Things to Consider

Any information you store this way will live for as long as the ticket.

That can be quite a while if users are active on your application for long periods of time, or if you give out long-term persistent sessions.

Whenever one of the values stored in the ticket needs to change, all you need to do is call SetAuthCookie again with the new data and the cookie will be updated accordingly. In our user name / email address example, this is actually quite advantageous. If the user was to update their display name or email address, we’d just update the ticket with new values. This updated ticket would then be supplied for future requests. In web farm environments this is about as perfect as it gets – we don’t need to go back to the DB to load this information for each request, yet we don’t need to worry about invalidating the cache across machines. (Any form of shared, invalidatable cache in a web farm is generally bad.)

Size always matters.

The information you store this way is embedded in the forms ticket, which is then encrypted and sent back to the users browser. On every single request after this, that entire cookie gets sent back up the wire and decrypted. Storing any significant amount of data here is obviously going to be an issue. Keep it to absolutely no more than a few simple values.

Twavatar – coming to a NuGet server near you

Yet another little micro-library designed to do one thing, and do it well:

twavatar.codeplex.com

Install-Package twavatar

I’ve recently been working on a personal project that lets me bookmark physical places.

To avoid having to build any of the authentication infrastructure, I decided to build on top of Twitter’s identity ecosystem. Any user on my system has a one-to-one mapping back to a Twitter account. Twitter get to deal with all the infrastructure around sign ups, forgotten passwords and so forth. I get to focus on features.

The other benefit I get is being able to easily grab an avatar image and display it on the ‘mark’ page like this:

image

(Sidenote: You might also notice why I recently built relativetime and crockford-base32.)

Well, it turns out that grabbing somebody’s Twitter avatar isn’t actually as easy as one might hope. The images are stored on Amazon S3 under a URL structure that requires you to know the user’s Twitter Id (the numeric one) and the original file name of the image they uploaded. To throw another spanner in the works, if the user uploads a new profile image, the URL changes and the old one stops working.

For most Twitter clients this isn’t an issue because the image URL is returned as part of the JSON blob for each status. In our case, it’s a bit annoying though.

Joe Stump set out to solve this problem by launching tweetimag.es. This service lets you use a nice URL like http://img.tweetimag.es/i/tathamoddie_n and let them worry about all the plumbing to make it work. Thanks Joe!

There’s a risk though … This is a free service, with no guarantees about its longevity. As such, I didn’t want to hardcode too many dependencies on it into my website.

This is where we introduce Twavatar. Here’s what my MVC view looks like:

 @Html.TwitterAvatar(Model.OwnerHandle) 

Ain’t that pretty?

We can also ask for a specific size:

 @Html.TwitterAvatar(Model.OwnerHandle, Twavatar.Size.Bigger) 

The big advantage here is that if / when tweetimag.es disappears, I can just push an updated version of Twavatar to NuGet and everybody’s site can keep working. We’ve cleanly isolated the current implementation into its own library.

It’s scenarios like this where NuGet really shines.

Update 1: Paul Jenkins pointed out a reasonably sane API endpoint offered by Twitter in the form of http://api.twitter.com/1/users/profile_image/tathamoddie?size=bigger. There are two problems with this API. First up, it issues a 302 redirect to the image resource rather than returning the data itself. This adds an extra DNS resolution and HTTP round trip to the page load. Second, the documentation for it states that it “must not be used as the image source URL presented to users of your application” (complete with the bold). To meet this requirement you’d need to call it from your application server-side, implement your own caching and so forth.

The tweetimag.es service most likely uses this API under the covers, but they do a good job of abstracting all the mess away from us. If the tweetimag.es service was ever to be discontinued, I imagine I’d update Twavatar to use this API directly.

Node.js on Windows

Thanks to Sharkie’s ongoing organisation efforts, SydJS is a thriving monthly JavaScript meeting here in Sydney. This evening they welcomed me along to talk about Node.js on Windows. Afraid of a mostly non-Microsoft crowd I rocked up with all the anti-Unix jokes I had but they turned out to be all quite friendly and it was a fun little talk.

Here’s what I ran through…

Update 18th July 2011: The latest official builds of node.js now come with a Windows executable. This is thanks to support from Microsoft.

Cygwin

Cygwin gives you a full POSIX environment on Windows. It’s great for running apps designed for Unix, but it’s pretty heavy and not very … Windows-ey. It’d be like creating a “My Documents” folder on Ubuntu.

All that being said, it’s the simplest and most reliable way of getting node running on Windows.

Works for 0.2.6 -> 0.3.1 and 0.4.0+. Anything between 0.3.1 and 0.4.0 won’t compile.

The steps (and common pitfalls) are well documented at https://github.com/joyent/node/wiki/Building-node.js-on-Cygwin-(Windows)

Once you’ve got it running in Cygwin, if you jump out to a standard Windows command prompt and run c:\Cygwin\usr\local\bin\node.exe you’ll get a nice big error. More on this later.

MinGW

The next step up from Cygwin is to compile it under MinGW. MinGW (Minimal GNU for Windows) provides the bare minimum set of libraries required to make it possible to compile Unix-y apps on Windows, avoiding the full POSIX strangehold environment that Cygwin provides.

Works for 0.3.6+.

Again, there are well documented steps for this: https://github.com/joyent/node/wiki/Building-node.js-on-mingw

Once you’ve got it running in MinGW, jump out to a standard Windows command prompt again and run c:\wherever-you-put-your-git-clone\node.exe you’ll get a nice big error.

Standalone

Now that we’ve compiled it with MinGW (you did that in the last step, right?) we’re ready to run it on Windows natively.

From a native Windows command prompt:

  1. Create a new folder (mkdir node-standalone)
  2. Copy in the node.exe you compiled in MinGW (xcopy c:\wherever-you-put-your-git-clone\node.exe node-standalone)
  3. Copy in the MinGW helper libraries (xcopy c:\mingw\bin\lib*.dll node-standalone)
  4. Run node-standalone\node
  5. Voila! It works!

Running as a Service

Next up, I wanted to host node as a service, just like IIS. This way it’d start up with my machine, run in the background, restart automatically if it crashes and so forth.

This is where nssm, the non-sucking service manager, enters the picture. This tool lets you host a normal .exe as a Windows service.

Here are the commands I used to setup an instance of the SydJS website as a service:

nssm.exe install sydjs-node c:\where-i-put-node-standalone\node.exe c:\code\SydJS\server.js
net start sydjs-node

What We Achieved

We now have node.js, running natively on Windows, as a service. Enjoy!

(Just please don’t use this in production – it’s really not ready for that yet.)