An Approach for More Structured Enums

The Need

I encountered a scenario today where a team needed to add more structure to their logging data. Currently, logging is unstructured text – log.Error("something broke") – whereas the operations team would like clearer information about error codes, descriptions and accompanying guidance.

The first proposed solution was a fairly typical one: we would define error codes, use them in the code, then document them in a spreadsheet somewhere. This is a very common solution, and one that demonstrably works, but I wanted to table an alternative.

This blog post is written in the context of logging, but you can potentially extend this idea to anywhere that you’re using an enum right now.

My Goals

I wanted to:

  • support the operations team with clear guidance
  • keep the guidance in the codebase, so that it ages at the same rate as the code
  • keep it easy for developers to write log entries
  • make it easy for developers to invent new codes, so that we don’t just re-use previous ones

A Proposed Solution

Instead of an enum, let’s define our logging events like this:

public static class LogEvents
{
    public const long ExpiredAuthenticationContext = 1234;
    public const long CorruptAuthenticationContext = 5678;
}

So far, we haven’t added any value with this approach, but now let’s change the type and add some more information:

public static class LogEvents
{
    public static readonly LogEvent ExpiredAuthenticationContext = new LogEvent
    {
        EventId = 1234,
        ShortDescription = "The authentication context is beyond its expiry date and can't be used.",
        OperationalGuidance = "Check the time coordination between the front-end web servers and the authentication tier."
    };
 
    public static readonly LogEvent CorruptAuthenticationContext = new LogEvent
    {
        EventId = 5678,
        ShortDescription = "The authentication token failed checksum prior to decryption.",
        OperationalGuidance = "Use the authentication test helper script to validate the raw tokens being returned by the authentication tier."
    };
}
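
The LogEvent type itself is just a simple data holder. Here's a minimal sketch of what it might look like (in the sample code it appears to be nested inside LogEvents, hence the LogEvents.LogEvent reference in the test further down):

public class LogEvent
{
    public long EventId { get; set; }
    public string ShortDescription { get; set; }
    public string OperationalGuidance { get; set; }
}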

From a consumer perspective, we can still refer to these individual items much as we would with enums – logger.Error(LogEvents.CorruptAuthenticationContext) – however we can now get more detail with simple calls like LogEvents.CorruptAuthenticationContext.EventId and LogEvents.CorruptAuthenticationContext.OperationalGuidance.
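
If your logging library doesn't accept an event object directly, a small extension method keeps call sites terse. This is only a hedged sketch: ILogger here is a stand-in for whatever logging interface you actually use, assumed to expose an Error(string) method.

public static class LoggerExtensions
{
    // ILogger is a stand-in for your real logging interface.
    public static void Error(this ILogger logger, LogEvent logEvent)
    {
        // Keep the event id at the front so it's easy to grep for and correlate with documentation.
        logger.Error(string.Format("[{0}] {1}", logEvent.EventId, logEvent.ShortDescription));
    }
}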

More Opportunities

Adding some simple reflection code, we can expose a LogEvents.AllEvents property:

public static IEnumerable<LogEvent> AllEvents
{
    get
    {
        return typeof(LogEvents)
            .GetFields(BindingFlags.Static | BindingFlags.Public | BindingFlags.DeclaredOnly)
            .Where(f => f.FieldType == typeof(LogEvent))
            .Select(f => (LogEvent)f.GetValue(null));
    }
}

This then allows us to enforce conventions as unit tests, like saying that all of our log events should have at least a sentence or so of operational guidance:

[Test]
[TestCaseSource(typeof(LogEvents), "AllEvents")]
public void AllEventsShouldHaveAtLeast50CharactersOfOperationalGuidance(LogEvents.LogEvent logEvent)
{
    Assert.IsTrue(logEvent.OperationalGuidance.Length >= 50);
}

Finally, it’s incredibly easy to either list the guidance on an admin page, or generate it to static documentation during build: just enumerate the LogEvents.AllEvents property.
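
As a hedged sketch of the build-time option (the names here are illustrative, not taken from the sample repository):

public static class LogEventDocumentation
{
    // Requires System.Linq and System.Text.
    public static string GenerateMarkdown()
    {
        var builder = new StringBuilder();
        builder.AppendLine("# Log Events");
        builder.AppendLine();

        foreach (var logEvent in LogEvents.AllEvents.OrderBy(e => e.EventId))
        {
            builder.AppendLine(string.Format("## {0}: {1}", logEvent.EventId, logEvent.ShortDescription));
            builder.AppendLine();
            builder.AppendLine(logEvent.OperationalGuidance);
            builder.AppendLine();
        }

        return builder.ToString();
    }
}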

The Code

I’ve posted some sample code to https://github.com/tathamoddie/LoggingPoc

Some interesting things in that repository are:

  • I’ve split the ‘framework’ code like the AllEvents property into a partial class so that LogEvents.cs stays cleaner.
  • I’ve written some convention tests that cover uniqueness of ids and validation of operational guidance (a sketch of the uniqueness test follows below).
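
The uniqueness test is the one that protects the "invent new codes, don’t re-use old ones" goal. A sketch of what it might look like (the actual test in the repository may differ):

[Test]
public void AllEventsShouldHaveUniqueEventIds()
{
    var duplicateIds = LogEvents.AllEvents
        .GroupBy(e => e.EventId)
        .Where(g => g.Count() > 1)
        .Select(g => g.Key)
        .ToArray();

    Assert.IsEmpty(duplicateIds, "Duplicate event ids: " + string.Join(", ", duplicateIds));
}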

Wrap Up

There’s absolutely nothing about this solution that is technically interesting. It’s flat out boring, but sometimes those are the most elegant solutions. Jimmy already wrote about enumeration classes 5 years ago.

Dead vs. Done

Back through 2009 and 2010, Damian Edwards and I worked on a project called Web Forms MVP.

It grew out of a consistent problem that we saw across multiple consulting engagements. We got tired of solving it multiple times, so we wrote it as a framework, released it open source, and then implemented it on client projects (with client understanding). Clients got free code. We got to use it in the wild, with real load and challenges. We got to re-use it. The community got to use it too.

It has been downloaded 20k+ times, which is pretty big considering it was around before NuGet was. (Although, we were one of the first 25 packages on the feed too.)

In the last 12 – 18 months, I’ve started seeing “Is Web Forms MVP dead?” being asked. This blog post both answers that question directly in the context of Web Forms MVP, but also discusses the idea of dead vs. done.

Here’s a specific question I was asked:

I am a little bit worried about the fact there is not code commit since sept 2011. Will you continue the project or will it fall in the forgotten ones?

And here was the answer I wrote:

I have a mixed answer for you here.

On the one hand, we cut seven CTP builds, then a v1.0, then a 1.1, 1.2, 1.3 and 1.4. That means we developed the library, tested hundreds of millions of production web requests through it, reached a feature point we wanted to call 1.0, then iterated on top of it. In the 15 months since cutting 1.0, there are only six issues on the Codeplex site: one is a question, one a misunderstanding, one a demo request, and one a feature request that I don’t really agree with anyway.

At this point, Web Forms MVP 1 is “done”, not “dead”. I’m just slack about closing off issues.

Now, that opens up some new questions:

If you find a bug with 1.4 that you can’t work around, are you left out in the cold? No. First up, you have all the code and build scripts (yay for open source!) so there’s nothing we can do to prevent you from making a fix even if we wanted to (which we never would). Secondly, if you send a pull request via Codeplex we’ll be happy to accept your contribution and push it to the official package.

Will there be a Web Forms MVP 2? At this time, from my personal perspective, I’ll say ‘highly unlikely’. As a consultant, I haven’t been on a Web Forms engagement in over 2 years. That’s not to say there isn’t still a place for Web Forms and Web Forms MVP, but that I’m just not personally working in that area so I’m not well placed to innovate on the library. Damian has lots of great ideas of things to do, and since starting Web Forms MVP has actually become the Program Manager for ASP.NET Web Forms at Microsoft. That being said, his open source efforts of late are heavily focussed on SignalR.

Should there be a Web Forms MVP 2? Maybe. It’d be nice to bring it in line with ASP.NET 4.5, but I’m hard pressed to know what is needed in this area considering I’m not on a Web Forms engagement. Without a clear need, I get rather confused by people calling for a new version of something just so they can feel comfortable that the version number incremented.

I hope that gives you some clarity and confidence around what is there today, what will stay, and where we’re going (or not going).

Some projects definitely die. They start out as a great idea, and never make it to a release. I find it a little sad that that’s the only categorisation that seems to be available though.

I hope I’m not just blindly defending my project, but I do genuinely believe that we hit ‘done’.

From here, Web Forms MVP might disappear into the background (it kind of has). The community might kick off a v2. A specific consumer might make their own fork with a bug fix they need. Those are all next steps, now that we’ve done a complete lifecycle.

In the meantime, people are asking if the project is dead, yet not raising any bugs or asking for any features. This just leaves me confused.

Upcoming Manila Presentation: Your website is in production. It’s broken. Now what?

Next month I’ll be spending some time in the Readify Manila office catching up with some local colleagues. While there, I’ll be presenting a user group session at the local Microsoft office. If you’re in the area, I’d love to see you there. (Did you know that we’re hiring in Manila too?)

Manila, Philippines, Wed 12th Dec 2012. Free. Register now.

Big websites fail in spectacular ways. In this session, five-time awarded ASP.NET/IIS MVP and ASP Insider Tatham Oddie will share the problems that he and fellow Readify consultants have solved on large scale, public websites. The lessons are applicable to websites of all sizes and audiences, and also include some funny stories. (Not so funny at the time.)

A tiny subset of your users can’t login: they get no error message yet have both cookies and JavaScript enabled. They’ve phoned up to report the problem and aren’t capable of getting a Fiddler trace. You’re serving a million hits a day. How do you trace their requests and determine the problem without drowning in logs?

Marketing have requested that the new site section your team has built goes live at the same time as a radio campaign kicks off. This needs to happen simultaneously across all 40 front-end web servers, and you don’t want to break your regular deployment cycle while the marketing campaign gets perpetually delayed. How do you do it?

Users are experiencing HTTP 500 responses for a few underlying reasons, some with workarounds and some without. The customer service call centre need to be able to rapidly evaluate incoming calls and provide the appropriate workaround where possible, without displaying sensitive exception detail to end users. At the same time, your team needs to prioritize which bugs to fix first. What’s the right balance of logging, error numbers and correlation ids?

Your application is running slow in production, causing major delays for all users. You don’t have any tools on the production servers, and aren’t allowed to install any. How do you get to the root of the problem?

New Talks: “Neo4j in a .NET World (Graph DBs)” and “You’re in production. Now what?”

Last week I was lucky enough to join an array of great speakers at the NDC Oslo conference.

The recordings of both of my talks are now online, along with 141 other excellent talks you should watch.

Neo4j in a .NET World (Graph DBs)

This year, a small team of developers delivered an ASP.NET MVC app, with a Neo4j backend, all running in Azure. This isn’t a POC; it’s a production system. Also, unlike most graph DB talks, it’s not a social network!

https://vimeo.com/43676873

You’re in production. Now what?

A tiny subset of your users can’t login: they get no error message yet have both cookies and JavaScript enabled. They’ve phoned up to report the problem and aren’t capable of getting a Fiddler trace. You’re serving a million hits a day. How do you trace their requests and determine the problem without drowning in logs?

Marketing have requested that the new site section your team has built goes live at the same time as a radio campaign kicks off. This needs to happen simultaneously across all 40 front-end web servers, and you don’t want to break your regular deployment cadence while the campaign gets perpetually delayed. How do you do it?

Users are experiencing 500 errors for a few underlying reasons, some with workarounds and some without. The customer service call centre need to be able to rapidly triage incoming calls and provide the appropriate workaround where possible, without displaying sensitive exception detail to end users or requiring synchronous logging. At the same time, your team needs to prioritize which bugs to fix first. What’s the right balance of logging, error numbers and correlation ids?

These are all real scenarios that Tatham Oddie and his fellow consultants have solved on large scale, public websites. The lessons though are applicable to websites of all sizes and audiences.

https://vimeo.com/43624434

Remembering Why We Undertake ICT Projects

I’ve recently been reading Standards Australia’s publication HB280-2006: "How Boards and Senior Management Have Governed ICT Projects to Succeed (or Fail)"1. Just yesterday, Pat Weaver blogged some related analysis which ultimately spurred this post.

Both sources draw similar conclusions about the need to identify the delivery aspect of a project as just one component of a larger game. Ultimately, both sources then attribute this responsibility, and thus a common cause of project failure, to senior management.

I particularly like this quote from section 2.2.1 of the handbook:

The case studies provide quite strong evidence, that in general, ICT projects deliver benefits by enabling process change, and project management, user support and all the other traditional prescriptions are less important than senior management support. Only senior management can resolve the political issues that arise as a result of conflicts in objectives caused by change.

Within the software development community, we almost always view software projects as changes themselves rather than simply an enabler in a wider organisational change program. While we talk about needing product owners from the business to elicit requirements and resolve implementation questions, we don’t look to them to act as a change champion in anywhere near as structured a way.

Food for thought: Perhaps we need to move towards reducing the number of people tasked with gathering requirements (business analysts and subject matter experts) to make way for some people to be actively pushing change back on the business? Both Pat’s post and the handbook talk about this as the responsibility of senior management, however I think they can reasonably be assisted in a structured way, similar to how we employ business analysts rather than expecting the project champion to understand and document all of the requirements.

Certainly, the measure of overall project success needs to shift away from on-time/budget/scope delivery towards an assessment of organisational change and benefit realisation. This is a core principle of any form of Lean-based delivery, however it is yet to make its way into the world of organisations still addicted to Waterfall-derived delivery models, or in most cases, even Scrum.

Finally, it is encouraging to note that I came across Pat’s post via a discussion thread in the Australian Institute of Company Directors LinkedIn group. This is a group very heavily comprised of senior managers, and consequently a great place to see these questions being raised.


1 Warning: The publication mechanism for this handbook is positively horrible. After handing over AU$114.27 for a legitimate license, you receive it as a rights-stripped PDF, that requires a third-party DRM plugin for Adobe Reader, which only lets you open it on one computer ever, only lets you launch the print dialog once ever, prevents you from highlighting even a single word, and prevents the accessibility functions from working (in breach of the Australian Disability Discrimination Act). To top it all off, they still feel the need to print your full license details down the side of every single page. You’ll need to be downright persistent to even make it past page 1 as a legitimate user, which is sad considering it’s otherwise interesting content.

The Checklist Manifesto: How to Get Things Right

I’ve just finished reading Atul Gawande’s somewhat self-assuredly titled The Checklist Manifesto: How to Get Things Right. Trepidatious about reading an entire book dedicated to the unassuming concept of checklists, I’d let it slip down my reading queue. The result, however, was a pleasantly educational and entertaining surprise – he’s a good writer, with an extensive repertoire of experience. There’s even three hours of flight time left for me to knock this post out.

The foundation is simple: we’ve entered an era where failure is more often caused by our inability to correctly apply knowledge (ineptitude) than by lacking it in the first place (ignorance).

Half a century ago, heart attack treatment was non-existent; patients would be given morphine for the pain, some oxygen, then sent home as, what Atul describes, “a cardiac cripple”. In contrast, responders are now faced with a wide gamut of therapies and the new challenge of implementing the right one in each scenario. When they fail, beyond the obvious downsides, blame is frequently attributed to the professional who ‘failed’ to apply a body of knowledge they have been given. As this becomes an unachievable task, we need to adopt better solutions.

Without detracting from the value of investing a few hours to read the book yourself, I wanted to tease out a few of the points I found interesting. If you find these even vaguely interesting, I really do suggest that you grab a copy.

An Emphasis on Process

Atul cites that the master builder approach to construction has been replaced with specialized roles to such a degree that we really need to call them super-specializations. It started with dividing the architects from the builders, then splitting off the engineers, and so on and so forth. As a surgeon, he jokes that in the medical world he’s expecting to start seeing left-ear surgeons and right-ear surgeons, and has to keep checking that this isn’t already the case whenever somebody mentions the idea.

In the wake of this, certification processes have also evolved. Where a building inspector may have historically re-run critical calculations themselves, modern building projects involve too many distinct engineering disciplines, drawing on too many bodies of knowledge, for this to be practical. We could build a team of specialized inspectors, except this rapidly becomes unwieldy itself. Instead, building inspectors have taken to focusing on ensuring that due process has been followed. Has a particular assessment been completed by the relevant parties? Did it have the appropriate information going in? Did it produce a satisfactory outcome? Great, move on.

An almost identical construct exists in Australian employment law. It doesn’t matter if somebody is completely incompetent (or inept?); you still have to follow due process in order to disengage them. Employment courts, despite already being a form of specialization themselves, are not interested in or capable of assessing an employee’s performance. They are however capable of asserting that the correct steps were followed in issuing warnings, conducting performance management, and so forth.

Here was my first face-palm moment: I’d made the mistake of considering a checklist as a list, with checkboxes. There’s a whole set of gate, check and review processes which I’ve always mentally separated from the concept of checklists. Beyond the semantics, I found this to be a valuable light bulb moment when considering some of the other ideas.

Communication

Atul’s passion for checklists comes from leading the World Health Organisation’s Safe Surgery Saves Lives program. In trying to solve the general problem of ‘how do we make surgery safer?’, the program ended up rolling out a 19-point checklist, with amazing results. It’s no small feat to cause behavioural change across literally thousands of hospitals around the world.

There were actually two behavioural changes required. First, they had to get people to actually adopt the checklists as a useful contributor to the surgical process. They had to be short, add demonstrable value, and so forth.

The second challenge was getting people to talk to each other. Some of the statistics he quotes about the number of people involved in the surgical environment are amazing. One Boston clinic employs “some six hundred doctors and a thousand other health professionals covering fifty-nine specialties.” The result of this is that operating teams have rarely worked together prior to any particular case. Having clear specialities makes it functional to have an unacquainted collection of professionals achieve an outcome, however it doesn’t facilitate an environment of team work when something goes awry. Instead, these autonomous professionals become focussed-in on achieving their individual goals.

To combat this, one of the checklist points is actually as simple as making sure everyone in the room knows everyone’s name and role before the surgery begins.

Fly the Airplane

Some Cessna emergency checklists have an obvious first step: fly the airplane. While we wait for evolution to catch up, our brains are still wired for a burst of physical exertion to combat panic. Otherwise common mental processes go by the wayside and we do something stupid.

I like the simplicity of this point, and see it being useful in an operations environment.

Pause Points

In early trials of their new safe surgery checklist, participants found it unclear who was meant to be completing the list, and when. A similar problem plagues most development ‘done criteria’ I’ve worked with. Yes, everything is meant to be checked off eventually, but when?

Airline checklists instead occur at distinct pause points. Before starting the engines. Before taxiing. Before takeoff. In each of these scenarios, there’s a clear pause to execute the checklist. The list is kept short (less than a minute) and relevant to that particular pause point.

The next time I work on defining a done criteria, I think I’ll try and split it into distinct lists. These points must be completed before you push the code. These points must be completed before the task is closed.

“Cleared for Takeoff”

Surgical environments have a clear pecking order that starts with the surgeon. Major challenges of the safe surgery campaign were getting everyone to apply the process as a team, and ensuring individual members of the team were empowered enough to call a halt if something was about to be done incorrectly. To achieve this, nurses had to be empowered to stop a surgeon.

In one hospital, a series of metal covers were designed for the scalpels. These were engraved with “Cleared for Takeoff”. The scalpel couldn’t be handed over for an incision until the cover was removed, and that didn’t happen until the checklist was completed. This changed the conversation to again be about the process (‘we haven’t completed the checklist yet’) instead of individual actions (‘you missed a step’).

I think points like this are small but important. And definitely interesting.

Now, go and read the book.

The book is an extension of a 2007 article by Atul, published in The New Yorker. I haven’t read the article, but some Amazon reviews suggest it covers the same concepts with less text. Most of the book is just stories, but I found them all interesting nonetheless.

Code: Request Correlation in ASP.NET

I’ve been involved in some tracing work today where we wanted to make sure each request had a correlation id.

Rather than inventing our own number, I wanted to use the request id that IIS already uses internally. This allows us to correlate across even more log files.

Here’s the totally unintuitive code that you need to use to retrieve this value:

var serviceProvider = (IServiceProvider)HttpContext.Current;
var workerRequest = (HttpWorkerRequest)serviceProvider.GetService(typeof(HttpWorkerRequest));
var traceId = workerRequest.RequestTraceIdentifier;
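
If you’re going to use this in more than one place, it’s worth wrapping the casts in a small helper so they only live in one spot. A sketch, assuming you’re happy to fall back to Guid.Empty outside of a request (the names here are my own, not from any framework):

public static class RequestCorrelation
{
    public static Guid CurrentRequestTraceId
    {
        get
        {
            // Outside of an ASP.NET request there's nothing to correlate against.
            var context = HttpContext.Current;
            if (context == null)
                return Guid.Empty;

            var serviceProvider = (IServiceProvider)context;
            var workerRequest = (HttpWorkerRequest)serviceProvider.GetService(typeof(HttpWorkerRequest));

            // Also comes back as Guid.Empty when IIS ETW tracing isn't enabled (see the update below).
            return workerRequest == null ? Guid.Empty : workerRequest.RequestTraceIdentifier;
        }
    }
}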

(A major motivator for this post was to save me having to trawl back to my sent emails from 2009 the next time I need this code.)

Update 28th Feb 2013: Some people have been seeing Guid.Empty when querying this property. The trace identifier is only available if IIS’s ETW mechanism is enabled. See http://www.iis.net/configreference/system.webserver/httptracing for details on how to enable IIS tracing. Thanks to Levi Broderick from the ASP.NET team for adding this.

A Business in a Day: giveusaminute.com

Lately, my business partner and I have wanted to try some shorter ‘build days’. The idea of these is to start with a blank canvas and an idea, then deliver a working product by the end of the day. This is a very different approach to the months of effort that we generally invest to launch something.

Today we undertook our first build day and delivered Give Us A Minute, an iPad-targeted web app for managing wait lists:

[Image: the Give Us A Minute site]

It was a fun experience trying to achieve everything required in one day, but I think we did pretty well. We managed everything from domain name registration to deployment in just under 9 hours. One of the biggest unplanned tasks was actually building the website to advertise the app; we hadn’t even thought of factoring that in when we started the day. The photography also took up a bit of time, but we needed to do it to tell the story properly on the site. Also, it was nice to be required to go and find a beer garden with a tax-deductible beer each so we could get that bottom-left photo.

As part of staying focussed on the idea of a minimum viable product, we dropped the idea of accounts very early on. At some point we’ll have to start charging for the text messages, but that then implies logins, registration, forgotten passwords, account balances and a whole host of other infrastructure pieces. In the meantime we’ll just absorb the cost of message delivery. If it starts to become prohibitive, it’ll be a pretty high quality problem to have.

The next step is to get this out on the road and into some businesses. We’ll start by approaching some businesses directly so we can be part of the on-boarding experience. Based on how that goes, we’ll start scaling out our marketing efforts.

We also need to get ourselves listed in the Apple App Store for the sake of discoverability. The ‘app’ is already designed with PhoneGap in mind, but we’re waiting on our Apple Developer enrolment to come through before we can finalise all of this.

Released: ReliabilityPatterns – a circuit breaker implementation for .NET

How to get it

Library: Install-Package ReliabilityPatterns (if you’re not using NuGet already, start today)

Source code: hg.tath.am/reliability-patterns

What it solves

In our homes, we use circuit breakers to quickly isolate an electrical circuit when there’s a known fault.

Michael T. Nygard introduces this concept as a programming pattern in his book Release It!: Design and Deploy Production-Ready Software.

The essence of the pattern is that when one of your dependencies stops responding, you need to stop calling it for a little while. A file system that has exhausted its operation queue is not going to recover while you keep hammering it with new requests. A remote web service is not going to come back any faster if you keep opening new TCP connections and mindlessly waiting for the 30 second timeout. Worse yet, if your application normally expects that web service to respond in 100ms, suddenly starting to block for 30s is likely to deteriorate the performance of your own application and trigger a cascading failure.

Electrical circuit breakers ‘trip’ when a high current condition occurs. They then need to be manually ‘reset’ to close the circuit again.

Our programmatic circuit breaker will trip after an operation has more consecutive failures than a predetermined threshold. While the circuit breaker is open, operations will fail immediately without even attempting to be executed. After a reset timeout has elapsed, the circuit breaker will enter a half-open state. In this state, only the next call will be allowed to execute. If it fails, the circuit breaker will go straight back to the open state and the reset timer will be restarted. Once the service has recovered, calls will start flowing normally again.
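
For a feel of what that management code looks like, here’s a deliberately simplistic sketch of the state machine described above. This is an illustration only, not the actual ReliabilityPatterns implementation; among other things, a real implementation also needs to limit the half-open state to a single trial call and be safe under concurrent use.

public class SimplisticCircuitBreaker
{
    private readonly int failureThreshold;
    private readonly TimeSpan resetTimeout;
    private int consecutiveFailures;
    private DateTime lastFailureUtc;
    private bool isOpen;

    public SimplisticCircuitBreaker(int failureThreshold, TimeSpan resetTimeout)
    {
        this.failureThreshold = failureThreshold;
        this.resetTimeout = resetTimeout;
    }

    public void Execute(Action operation)
    {
        if (isOpen && DateTime.UtcNow - lastFailureUtc < resetTimeout)
        {
            // Open, and the reset timeout hasn't elapsed: fail fast without touching the dependency.
            throw new InvalidOperationException("Circuit breaker is open");
        }

        // Either closed, or half-open (open but the reset timeout has elapsed, so allow a trial call).
        try
        {
            operation();

            // Success: close the circuit and clear the failure count.
            consecutiveFailures = 0;
            isOpen = false;
        }
        catch
        {
            consecutiveFailures++;
            lastFailureUtc = DateTime.UtcNow;

            // A failure while half-open reopens the circuit and restarts the reset timer;
            // a failure while closed only trips the breaker once the threshold is reached.
            if (isOpen || consecutiveFailures >= failureThreshold)
                isOpen = true;

            throw;
        }
    }
}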

Writing all this extra management code would be painful. This library manages it for you instead.

How to use it

Taking advantage of the library is as simple as wrapping your outgoing service call with circuitBreaker.Execute:

// Note: you'll need to keep this instance around
var breaker = new CircuitBreaker();

var client = new SmtpClient();
var message = new MailMessage();
breaker.Execute(() => client.Send(message));

The only caveat is that you need to manage the lifetime of the circuit breaker(s). You should create one instance for each distinct dependency, then keep this instance around for the life of your application. Do not create different instances for different operations that occur on the same system.

(Managing multiple circuit breakers via a container can be a bit tricky. I’ve published a separate example for how to do it with Autofac.)
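
If you’re not using a container at all, the simplest thing that satisfies the one-instance-per-dependency rule is often just a static holder. A sketch only; the dependency names here are purely illustrative:

public static class CircuitBreakers
{
    // One breaker per downstream dependency, kept for the life of the application.
    public static readonly CircuitBreaker Smtp = new CircuitBreaker();
    public static readonly CircuitBreaker PaymentGateway = new CircuitBreaker();
}

// Usage:
// CircuitBreakers.Smtp.Execute(() => client.Send(message));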

It’s generally safe to add this pattern to existing code because it will only throw an exception in a scenario where your existing code would anyway.

You can also take advantage of built-in retry logic:

breaker.ExecuteWithRetries(() => client.Send(message), 10, TimeSpan.FromSeconds(20));

Why is the package named ReliabilityPatterns instead of CircuitBreaker?

Because I hope to add more useful patterns in the future.

This blog post in picture form

[Sequence diagram]