New Talks: “Neo4j in a .NET World (Graph DBs)” and “You’re in production. Now what?”

Last week I was lucky enough to join an array of great speakers at the NDC Oslo conference.

The recordings of both of my talks are now online, along with 141 other excellent talks you should watch.

Neo4j in a .NET World (Graph DBs)

This year, a small team of developers delivered an ASP.NET MVC app with a Neo4j backend, all running in Azure. This isn’t a POC; it’s a production system. Also, unlike most graph DB talks, it’s not a social network!

https://vimeo.com/43676873

You’re in production. Now what?

A tiny subset of your users can’t log in: they get no error message yet have both cookies and JavaScript enabled. They’ve phoned up to report the problem but aren’t capable of getting a Fiddler trace. You’re serving a million hits a day. How do you trace their requests and determine the problem without drowning in logs?

Marketing have requested that the new site section your team has built goes live at the same time as a radio campaign kicks off. This needs to happen simultaneously across all 40 front-end web servers, and you don’t want to break your regular deployment cadence while the campaign gets perpetually delayed. How do you do it?

Users are experiencing 500 errors for a few underlying reasons, some with workarounds and some without. The customer service call centre need to be able to rapidly triage incoming calls and provide the appropriate workaround where possible, without displaying sensitive exception detail to end users or requiring synchronous logging. At the same time, your team needs to prioritize which bugs to fix first. What’s the right balance of logging, error numbers and correlation IDs?

These are all real scenarios that Tatham Oddie and his fellow consultants have solved on large-scale, public websites. The lessons, though, are applicable to websites of all sizes and audiences.

https://vimeo.com/43624434

Remembering Why We Undertake ICT Projects

I’ve recently been reading Standards Australia’s publication HB280-2006: “How Boards and Senior Management Have Governed ICT Projects to Succeed (or Fail)”¹. Just yesterday, Pat Weaver blogged some related analysis which ultimately spurred this post.

Both sources draw similar conclusions about the need to identify the delivery aspect of a project as just one component of a larger game. Ultimately, both sources then attribute this responsibility, and thus the commonality of project failure, to senior management.

I particularly like this quote from section 2.2.1 of the handbook:

The case studies provide quite strong evidence, that in general, ICT projects deliver benefits by enabling process change, and project management, user support and all the other traditional prescriptions are less important than senior management support. Only senior management can resolve the political issues that arise as a result of conflicts in objectives caused by change.

Within the software development community, we almost always view software projects as the change itself, rather than simply an enabler in a wider organisational change program. While we talk about needing product owners from the business to elicit requirements and resolve implementation questions, we don’t look to them to act as change champions in anywhere near as structured a way.

Food for thought: perhaps we need to reduce the number of people tasked with gathering requirements (business analysts and subject matter experts) to make way for some people to be actively pushing change back on the business. Both Pat’s post and the handbook describe this as the responsibility of senior management; however, I think senior managers can reasonably be assisted in a structured way, similar to how we employ business analysts rather than expecting the project champion to understand and document all of the requirements.

Certainly, the measure of overall project success needs to shift away from on-time/budget/scope delivery towards an assessment of organisational change and benefit realisation. This is a core principle of any form of Lean-based delivery, yet it has still to make its way into organisations addicted to Waterfall-derived delivery models or, in most cases, even Scrum.

Finally, it is encouraging to note that I came across Pat’s post via a discussion thread in the Australian Institute of Company Directors LinkedIn group. This group is composed very heavily of senior managers, and is consequently a great place to see these questions being raised.


¹ Warning: The publication mechanism for this handbook is positively horrible. After handing over AU$114.27 for a legitimate license, you receive it as a rights-stripped PDF that requires a third-party DRM plugin for Adobe Reader, which only lets you open it on one computer ever, only lets you launch the print dialog once ever, prevents you from highlighting even a single word, and prevents the accessibility functions from working (in breach of the Australian Disability Discrimination Act). To top it all off, they still feel the need to print your full license details down the side of every single page. You’ll need to be downright persistent to even make it past page 1 as a legitimate user, which is sad considering it’s otherwise interesting content.

The Checklist Manifesto: How to Get Things Right

I’ve just finished reading Atul Gawande’s somewhat self-assuredly titled The Checklist Manifesto: How to Get Things Right. Trepidatious about reading an entire book dedicated to the unassuming concept of checklists, I’d let it slip down my reading queue. The result, however, was a pleasantly educational and entertaining surprise – he’s a good writer, with an extensive repertoire of experience. There are even three hours of flight time left for me to knock this post out.

The foundation is simple: we’ve entered an era encumbered by our failure to apply knowledge correctly (ineptitude), as opposed to lacking it in the first place (ignorance).

Half a century ago, heart attack treatment was non-existent; patients would be given morphine for the pain and some oxygen, then sent home as what Atul describes as “a cardiac cripple”. In contrast, responders are now faced with a wide gamut of therapies and the new challenge of implementing the right one in each scenario. When they fail, beyond the obvious downsides, blame is frequently attributed to the professional who ‘failed’ to apply the body of knowledge they had been given. As applying it all correctly becomes an unachievable task, we need to adopt better solutions.

Without detracting from the value of investing a few hours to read the book yourself, I wanted to tease out a few of the points I found interesting. If you find these even vaguely interesting, I really do suggest that you grab a copy.

An Emphasis on Process

Atul observes that the master builder approach to construction has been replaced with specialized roles to such a degree that we really need to call them super-specializations. It started with dividing the architects from the builders, then splitting off the engineers, and so on and so forth. As a surgeon, he jokes that in the medical world he’s expecting to start seeing left-ear surgeons and right-ear surgeons, and has to keep checking that this isn’t already the case whenever somebody mentions the idea.

In the wake of this, certification processes have also evolved. Where a building inspector may historically have re-run critical calculations themselves, modern building projects involve too many distinct engineering disciplines, drawing on too many bodies of knowledge, for this to be practical. We could build a team of specialized inspectors, except this rapidly becomes unwieldy itself. Instead, building inspectors have taken to focusing on ensuring that due process has been followed. Has a particular assessment been completed by the relevant parties? Did it have the appropriate information going in? Did it produce a satisfactory outcome? Great, move on.

An almost identical construct exists in Australian employment law. It doesn’t matter if somebody is completely incompetent (or inept?); you still have to follow due process in order to disengage them. Employment courts, despite already being a form of specialization themselves, are not interested in or capable of assessing an employee’s performance. They are however capable of asserting that the correct steps were followed in issuing warnings, conducting performance management, and so forth.

Here was my first face-palm moment: I’d made the mistake of thinking of a checklist as just a list with checkboxes. There’s a whole set of gate, check and review processes which I’ve always mentally separated from the concept of checklists. Beyond the semantics, this was a valuable light-bulb moment when considering some of the other ideas.

Communication

Atul’s passion for checklists comes from leading the World Health Organisation’s Safe Surgery Saves Lives program. In trying to solve the general problem of ‘how do we make surgery safer?’, the program ended up rolling out a 19-point checklist, with amazing results. It’s no small feat to cause behavioural change across literally thousands of hospitals around the world.

There were actually two behavioural changes required. First, they had to get people to actually adopt the checklists as a useful contributor to the surgical process. They had to be short, add demonstrable value, and so forth.

The second challenge was getting people to talk to each other. Some of the statistics he quotes about the number of people involved in the surgical environment are amazing. One Boston clinic employs “some six hundred doctors and a thousand other health professionals covering fifty-nine specialties.” The result of this is that operating teams have rarely worked together prior to any particular case. Having clear specialities makes it functional for an unacquainted collection of professionals to achieve an outcome; however, it doesn’t facilitate an environment of teamwork when something goes awry. Instead, these autonomous professionals become focused on achieving their individual goals.

To combat this, one of the checklist points is actually as simple as making sure everyone in the room knows everyone’s name and role before the surgery begins.

Fly the Airplane

Some Cessna emergency checklists have an obvious first step: fly the airplane. While we wait for evolution to catch up, our brains are still wired for a burst of physical exertion to combat panic. Otherwise-common mental processes go by the wayside and we do something stupid.

I like the simplicity of this point, and can see it being useful in an operations environment.

Pause Points

In early trials of their new safe surgery checklist, participants found it unclear who was meant to be completing the list, and when. A similar problem plagues most development ‘done criteria’ I’ve worked with. Yes, everything is meant to be checked off eventually, but when?

Airline checklists instead occur at distinct pause points. Before starting the engines. Before taxiing. Before takeoff. In each of these scenarios, there’s a clear pause to execute the checklist. The list is kept short (less than a minute) and relevant to that particular pause point.

The next time I work on defining done criteria, I think I’ll try to split them into distinct lists: these points must be completed before you push the code; these points must be completed before the task is closed.

“Cleared for Takeoff”

Surgical environments have a clear pecking order that starts with the surgeon. Major challenges of the safe surgery campaign were getting everyone to apply the process as a team, and ensuring individual members of the team were empowered enough to call a halt if something was about to be done incorrectly. To achieve this, nurses had to be empowered to stop a surgeon.

In one hospital, a series of metal covers were designed for the scalpels. These were engraved with “Cleared for Takeoff”. The scalpel couldn’t be handed over for an incision until the cover was removed, and that didn’t happen until the checklist was completed. This changed the conversation to again be about the process (‘we haven’t completed the checklist yet’) instead of individual actions (‘you missed a step’).

I think points like this are small but important. And definitely interesting.

Now, go and read the book.

The book is an extension of a 2007 article by Atul, published in The New Yorker. I haven’t read the article, but some Amazon reviews suggest it covers the same concepts with less text. Most of the book is just stories, but I found them all interesting nonetheless.

Code: Request Correlation in ASP.NET

I’ve been involved in some tracing work today where we wanted to make sure each request had a correlation id.

Rather than inventing our own number, I wanted to use the request id that IIS already uses internally. This allows us to correlate across even more log files.

Here’s the totally unintuitive code that you need to use to retrieve this value:

var serviceProvider = (IServiceProvider)HttpContext.Current;
var workerRequest = (HttpWorkerRequest)serviceProvider.GetService(typeof(HttpWorkerRequest));
var traceId = workerRequest.RequestTraceIdentifier;

(A major motivator for this post was to save me having to trawl back to my sent emails from 2009 the next time I need this code.)

Update 28th Feb 2013: Some people have been seeing Guid.Empty when querying this property. The trace identifier is only available if IIS’s ETW mechanism is enabled. See http://www.iis.net/configreference/system.webserver/httptracing for details on how to enable IIS tracing. Thanks to Levi Broderick from the ASP.NET team for adding this detail.
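
If you want a usable correlation id even when IIS tracing is disabled, here’s a small illustrative wrapper – my own sketch, not a framework API – that falls back to a freshly generated guid:

using System;
using System.Web;

public static class RequestCorrelation
{
    // Illustrative helper: returns IIS's request trace identifier when ETW
    // tracing is enabled, otherwise falls back to a new guid so that every
    // request still gets a correlation id you can log.
    public static Guid GetRequestId(HttpContext context)
    {
        var serviceProvider = (IServiceProvider)context;
        var workerRequest = (HttpWorkerRequest)serviceProvider.GetService(typeof(HttpWorkerRequest));
        var traceId = workerRequest.RequestTraceIdentifier;
        return traceId != Guid.Empty ? traceId : Guid.NewGuid();
    }
}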

A Business in a Day: giveusaminute.com

Lately, my business partner and I have wanted to try some shorter ‘build days’. The idea of these is to start with a blank canvas and an idea, then deliver a working product by the end of the day. This is a very different approach to the months of effort that we generally invest to launch something.

Today we undertook our first build day and delivered Give Us A Minute, an iPad-targeted web app for managing wait lists:

[Image: the Give Us A Minute site]

It was a fun experience trying to achieve everything required in one day, but I think we did pretty well. We managed everything from domain name registration to deployment in just under 9 hours. One of the biggest unplanned tasks was actually building the website to advertise the app; we hadn’t even thought of factoring that in when we started the day. The photography also took up a bit of time, but we needed to do it to tell the story properly on the site. Also, it was nice to be required to go and find a beer garden with a tax-deductible beer each so we could get that bottom-left photo.

As part of staying focussed on the idea of a minimum viable product, we dropped the idea of accounts very early on. At some point we’ll have to start charging for the text messages, but that then implies logins, registration, forgotten passwords, account balances and a whole host of other infrastructure pieces. In the meantime we’ll just absorb the cost of message delivery. If it starts to become prohibitive, it’ll be a pretty high-quality problem to have.

The next step is to get this out on the road and into some businesses. We’ll start by approaching some businesses directly so we can be part of the on-boarding experience. Based on how that goes, we’ll start scaling out our marketing efforts.

We also need to get ourselves listed in the Apple App Store for the sake of discoverability. The ‘app’ is already designed with PhoneGap in mind, but we’re waiting on our Apple Developer enrolment to come through before we can finalise all of this.

Released: ReliabilityPatterns – a circuit breaker implementation for .NET

How to get it

Library: Install-Package ReliabilityPatterns (if you’re not using NuGet already, start today)

Source code: hg.tath.am/reliability-patterns

What it solves

In our homes, we use circuit breakers to quickly isolate an electrical circuit when there’s a known fault.

Michael T. Nygard introduces this concept as a programming pattern in his book Release It!: Design and Deploy Production-Ready Software.

The essence of the pattern is that when one of your dependencies stops responding, you need to stop calling it for a little while. A file system that has exhausted its operation queue is not going to recover while you keep hammering it with new requests. A remote web service is not going to come back any faster if you keep opening new TCP connections and mindlessly waiting for the 30 second timeout. Worse yet, if your application normally expects that web service to respond in 100ms, suddenly starting to block for 30s is likely to degrade the performance of your own application and trigger a cascading failure.

Electrical circuit breakers ‘trip’ when a high current condition occurs. They then need to be manually ‘reset’ to close the circuit again.

Our programmatic circuit breaker will trip after an operation has more consecutive failures than a predetermined threshold. While the circuit breaker is open, operations will fail immediately without even attempting to be executed. After a reset timeout has elapsed, the circuit breaker will enter a half-open state. In this state, only the next call will be allowed to execute. If it fails, the circuit breaker will go straight back to the open state and the reset timer will be restarted. If it succeeds, the circuit breaker will close again, and once the service has recovered, calls will start flowing normally.
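
To make those state transitions concrete, here’s a minimal, illustrative sketch of the state machine. It’s deliberately simplified (and not thread-safe) – this isn’t the library’s actual code, and the names are made up for the example:

using System;

public class SimpleCircuitBreaker
{
    enum State { Closed, Open, HalfOpen }

    readonly int failureThreshold;
    readonly TimeSpan resetTimeout;
    State state = State.Closed;
    int consecutiveFailures;
    DateTime openedAt;

    // Usage (hypothetical numbers): trip after 5 consecutive failures,
    // then allow a trial call every 30 seconds:
    // var breaker = new SimpleCircuitBreaker(5, TimeSpan.FromSeconds(30));
    public SimpleCircuitBreaker(int failureThreshold, TimeSpan resetTimeout)
    {
        this.failureThreshold = failureThreshold;
        this.resetTimeout = resetTimeout;
    }

    public void Execute(Action operation)
    {
        if (state == State.Open)
        {
            if (DateTime.UtcNow - openedAt < resetTimeout)
                throw new InvalidOperationException("Circuit open: failing fast without calling the dependency.");
            state = State.HalfOpen; // reset timeout elapsed: allow one trial call
        }

        try
        {
            operation();
            state = State.Closed;    // success closes the circuit...
            consecutiveFailures = 0; // ...and resets the failure count
        }
        catch
        {
            consecutiveFailures++;
            if (state == State.HalfOpen || consecutiveFailures >= failureThreshold)
            {
                state = State.Open;          // trip (or re-trip) the breaker
                openedAt = DateTime.UtcNow;  // restart the reset timer
            }
            throw;
        }
    }
}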

Writing all this extra management code would be painful. This library manages it for you instead.

How to use it

Taking advantage of the library is as simple as wrapping your outgoing service call with circuitBreaker.Execute:

// Note: you'll need to keep this instance around
var breaker = new CircuitBreaker();

var client = new SmtpClient();
var message = new MailMessage();
breaker.Execute(() => client.Send(message));

The only caveat is that you need to manage the lifetime of the circuit breaker(s). You should create one instance for each distinct dependency, then keep this instance around for the life of your application. Do not create different instances for different operations that occur on the same system.

(Managing multiple circuit breakers via a container can be a bit tricky. I’ve published a separate example for how to do it with Autofac.)
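
If you’re not using a container at all, one simple approach – just a sketch, with names of my own invention – is a static registry that hands out one shared breaker per named dependency:

using System.Collections.Concurrent;
using ReliabilityPatterns; // namespace assumed to match the package name

public static class CircuitBreakers
{
    // One breaker per named dependency, kept for the life of the application
    static readonly ConcurrentDictionary<string, CircuitBreaker> Instances =
        new ConcurrentDictionary<string, CircuitBreaker>();

    public static CircuitBreaker For(string dependencyName)
    {
        return Instances.GetOrAdd(dependencyName, _ => new CircuitBreaker());
    }
}

Every call site that talks to the same system then shares a single instance, e.g. CircuitBreakers.For("smtp").Execute(() => client.Send(message)).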

It’s generally safe to add this pattern to existing code because it will only throw an exception in a scenario where your existing code would anyway.

You can also take advantage of built-in retry logic:

breaker.ExecuteWithRetries(() => client.Send(message), 10, TimeSpan.FromSeconds(20));

Why is the package named ReliabilityPatterns instead of CircuitBreaker?

Because I hope to add more useful patterns in the future.

This blog post in picture form

[Image: sequence diagram]

5 Minute Screencast: Stop helping your users. Help yourself.

A few weeks ago I spoke at Web Directions’ What Do You Know event. The event consisted of 10 speakers each doing a 5 minute presentation about some technique or idea that they find useful in web development.

Here’s my talk:

Ros Hodgekis won the night with her awesome email-related talk (all delivered with champagne glass still in hand):

.NET Rocks! #687: ‘Tatham Oddie Makes HTML 5 and Silverlight Play Nice Together’

I spoke to Carl + Richard on .NET Rocks! last week about using HTML5 and Silverlight together. We also covered a bit of Azure toward the end.

The episode is now live here:

http://www.dotnetrocks.com/default.aspx?showNum=687

JT – Sorry, I referred to you as “the other guy I presented with” and never intro-ed you. 😦

Everyone else – JT is awesome.

Peer Code Reviews in a Mercurial World

Mercurial, or Hg, is a brilliant DVCS (Distributed Version Control System). Personally I think it’s much better than Git, but that’s a whole religious war in itself. If you’re not familiar with at least one of these systems, do yourself a favour and take a read of hginit.com.

Massive disclaimer: This worked for our team. Your team is different. Be intelligent. Use what works for you.

The Need for Peer Code Reviews

I’ve previously worked in a number of environments which required peer code reviews before check-ins. For those not familiar, the principle is simple – get somebody else on the team to come and validate your work before you hit the check-in button. Now, before anybody jumps up and says this is too controlling, let me highlight that this was back in the unenlightened days of centralised VCSs like Subversion and TFS.

This technique is just another tool in the toolbox for finding problems early. The earlier you find them, the quicker and cheaper they are to fix.

  • If you’ve completely misinterpreted the task, the reviewer is likely to pick up on this. If they’ve completely misinterpreted the task, it spurs a discussion between the two of you that’s likely to help them down the track.
  • Smaller issues like typos can be found and fixed immediately, rather than being relegated to P4 bug status and left festering for months on end.
  • Even if there aren’t any issues, it’s still useful as a way of sharing knowledge around the team about how various features have been implemented.

On my current project we’d started to encounter all three of these issues – we were reworking code to fit the originally intended task, letting small issues creep into our codebase, and using techniques that not everybody understood. We called these out in our sprint retrospective and identified the introduction of peer code reviews as one of the techniques we’d use to counter them.

Peer Code Reviews in a DVCS World

One of the most frequently touted benefits of DVCS is that you can check in anywhere, anytime, irrespective of network access. Whilst you definitely can, and this is pretty cool, it’s less applicable for collocated teams.

Instead, the biggest benefit I perceive is how frictionless commits enable smaller but more frequent commits. Smaller commits provide a clearer history trail, easier merging, easier reviews, and countless other benefits. That’s a story for a whole other post though. If you don’t already agree, just believe me that smaller commits are a good idea.

Introducing a requirement for peer review before each check-in would counteract these benefits by introducing friction back into the check-in process. This was definitely not an idea we were going to entertain.

The solution? We now perform peer reviews prior to pushing. Developers still experience frictionless commits, and can pull and merge as often as possible (also a good thing), yet we’ve been able to bring in the benefits of peer reviews. This approach has been working well for us for 3 weeks so far (1.5 sprints).

It’s a DVCS. Why Not Forks?

We’ve modelled our source control as a hub-and-spoke pattern. BitBucket has been nominated as our ‘central’ repository that is the source of truth. Generally, we all push and pull from this one central repository. Because our team is collocated, it’s easy enough to just grab the person next to you to perform the review before you push to the central repository.

Forks do have their place though. One team member worked from home this week to avoid infecting us all. He quickly spun up a private fork on BitBucket and started pushing to there instead. At regular intervals he’d ask one of us in the office for a review via Skype. Even just using the BitBucket website, it was trivial to review his pending changesets.

The forking approach could also be applied in the office. On the surface it looks like a nice idea because it means you’re not blocked waiting on a review. In practice though, it just becomes another queue of work which the other developer is unlikely to get to in as timely a manner. “Sure, I’ll take a look just after I finish this.” Two hours later, the code still hasn’t hit the central repository. The original developer has moved on to other tasks. By the time a CI build picks up any issues, ownership and focus has long since moved on. An out-of-band review also misses the ‘let’s sit and have a chat’ mentality and knowledge sharing we were looking for.

What We Tried

To kick things off, we started with hg out. The outgoing command lists all of the changesets that would be pushed if you ran hg push right now. By default it only lists the header detail of each changeset, so we’d then run through hg exp 1234, hg exp 1235, hg exp 1236, etc. to review each one. The downsides to this approach were that we didn’t get coloured diff output, we had to review changesets one at a time, and it didn’t exclude things like merge changesets.

Next we tried hg out -p. This lists all of the outgoing changesets, in order, with their patches and full colouring. This is good progress, but we still wanted to filter out merges.

One of the cooler things about Mercurial is revsets. If you’re not familiar with them, it’d pay to take a look at hg help revsets. This allows us to use the hg log command, but pass in a query that describes which changesets we want to see listed: hg log -pr "outgoing() and not merge()".

Finally, we added a cls to the start of the command so that it was easy to just scroll back and see exactly what was in the review. This took the full command to cls && hg log -pr "outgoing() and not merge()". It’d be nice to be able to do hg log -pr "outgoing() and not merge()" | more but the more command drops the ANSI escape codes used for coloring.

What We Do Now

To save everybody from having to remember and type this command, we added a file called review.cmd to the root of our repository. It just contains this one command:
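
cls && hg log -pr "outgoing() and not merge()"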

Whenever we want a review we just type review and press enter. Too easy!

One Final Tweak

When dealing with multiple repositories, you need to specify which path outgoing() applies to in the revset. We updated the contents of review.cmd to cls && hg log -pr "outgoing(%1) and not merge()". If review.cmd is called with an argument, %1 carries it through to the revset. That way we can run review or review myfork as required.