Why sign an SLA in the first place?
So you’ve taken the path of developing your software startup with an external development team: you’ve agreed on the design and the functional and non-functional requirements, signed off on a schedule and a formal software development contract (whether the process is agile or waterfall), and secured your intellectual property rights.
The development is underway, more or less on schedule.
According to the contract and specs, the app/software is to be delivered functional, bug-free*, and tested.
Seems like all bases have been covered, doesn’t it?
If your app is a short-term or small-scale commitment (a proof of concept, or a low-budget MVP with no premium/paying users), then yes. You can hope for the best, and even if a major software or hardware incident brings the site/app to its knees, most likely nobody will get hurt.
This is not the case with paying customers or an app/site handling significant production-grade traffic. There will be complaints, refund demands and customer service issues that need to be handled.
Not to mention the customers losing confidence in your product, which hurts all the more the more you charge (or plan to charge) for it.
Once the product is live and you’re charging the clients, you need to be certain that there’s always a competent/knowledgeable person ready to step into action whenever necessary, regardless of time & day.
It might be unexpectedly heavy traffic bringing your site down, a critical security update that needs to be applied ASAP, or a hardware failure that needs to be identified and (if necessary) addressed by a third party.
That’s what the SLA is for.
What should an SLA cover?
To secure your business and be a meaningful, binding agreement, the SLA needs to cover at least the basics of the following matters:
#1 Clear definition of technological (hardware/software) layers each side is responsible for
If multiple parties are responsible for maintaining the software (application) and the system (OS, database/web server software), each of them needs to have its roles and responsibilities very clearly defined to avoid confusion, both during incident management/resolution and after.
The blame game that might follow an incident is something you want to avoid.
In practice this might result in separate SLAs for the software and sysops maintenance.
#2 Definition of critical/major/minor bugs, issues or incidents
Not all incidents/bugs are created equal. Some might be minor, hardly visible annoyances, while others shut your service down as soon as they appear.
The SLA should define each level of incident (e.g. minor/major/critical), together with examples or explanations of the types of incidents/scenarios that belong to each group for your software. For example, users being unable to log in might be either a critical or a major incident, depending on the profile of your site/app.
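To make the idea of incident levels concrete, here is a minimal Python sketch of how the agreed classification could be written down. The incident descriptions, the `SEVERITY_EXAMPLES` table and the `classify` helper are all hypothetical illustrations; the real list has to come out of your own negotiations.

```python
from enum import Enum


class Severity(Enum):
    MINOR = "minor"
    MAJOR = "major"
    CRITICAL = "critical"


# Hypothetical examples only; each SLA should list scenarios for its own app.
SEVERITY_EXAMPLES = {
    "site unreachable for all users": Severity.CRITICAL,
    "users unable to log in": Severity.CRITICAL,  # could be MAJOR for another app
    "payment confirmation emails delayed": Severity.MAJOR,
    "wrong font size on a page": Severity.MINOR,
}


def classify(incident: str) -> Severity:
    """Look up the agreed level for a known scenario.

    Unlisted incidents default to MAJOR (an assumption of this sketch),
    so that a person reviews them instead of them being silently ignored.
    """
    return SEVERITY_EXAMPLES.get(incident, Severity.MAJOR)
```

Writing the examples down in this kind of table, even just as an appendix to the agreement, leaves less room for arguing about the level once an incident is underway.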
#3 Agreed reaction times
These should be set separately for each incident level, as there’s no need to rush in during the early morning hours to fix a wrong font size.
So: shorter (minutes/hours) or immediate for critical issues, more flexible and forgiving for the minor ones.
In every case, they need to be clearly defined in the agreement to avoid confusion and, as a result, a reaction to an incident that isn’t swift enough.
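As an illustration, per-level reaction times can be expressed as a simple lookup plus a deadline check. The limits below (30 minutes, 4 hours, 2 days) and the function names are assumptions made up for this sketch, not recommended values.

```python
from datetime import datetime, timedelta

# Hypothetical reaction-time limits per incident level.
REACTION_LIMITS = {
    "critical": timedelta(minutes=30),
    "major": timedelta(hours=4),
    "minor": timedelta(days=2),
}


def reaction_deadline(level: str, reported_at: datetime) -> datetime:
    """Latest moment by which the maintainer must react to the report."""
    return reported_at + REACTION_LIMITS[level]


def reaction_breached(level: str, reported_at: datetime, reacted_at: datetime) -> bool:
    """True if the first reaction came after the agreed deadline."""
    return reacted_at > reaction_deadline(level, reported_at)
```

For example, a critical incident reported at 03:00 and picked up at 03:45 would count as a breach under these sample limits, while the same 45-minute delay on a minor issue would not.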
#4 Service uptime requirements and how they will be measured
What is the agreed uptime level (as a percentage, e.g. 99.5%) you’re paying for, and how will it be measured (e.g. with an externally set-up Zabbix instance, by reviewing its monthly uptime reports)?
An SLA without a clear and actionable way of measuring whether it’s fulfilled might not protect you well enough in case of a dispute, leaving you without solid proof that its terms have been breached.
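It helps to translate the percentage into actual time. The sketch below assumes a 30-day measurement period and a list of non-overlapping outage windows taken from your monitoring reports; the function names are hypothetical and this is not tied to any particular monitoring tool.

```python
from datetime import datetime, timedelta


def allowed_downtime(period: timedelta, uptime_target_pct: float) -> timedelta:
    """Downtime budget implied by an uptime target over a given period."""
    return period * (1 - uptime_target_pct / 100)


def measured_uptime_pct(period: timedelta, outages: list[tuple[datetime, datetime]]) -> float:
    """Uptime percentage over the period, given (start, end) outage windows.

    Assumes the windows do not overlap and all fall inside the period.
    """
    downtime = sum((end - start for start, end in outages), timedelta())
    return 100 * (1 - downtime / period)


month = timedelta(days=30)
budget = allowed_downtime(month, 99.5)  # roughly 3 hours 36 minutes
```

In other words, 99.5% over a 30-day month leaves about 3.6 hours of downtime; a single 5-hour outage in that month already puts the measured uptime below the target.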
#5 Communication channels (and escalation paths) in case of incidents
How will the issues be reported? Via phone, email, or perhaps a separate issue-management system? Who should be contacted if there’s no reaction to a reported issue within the defined timeframe?
Who should the incident report be escalated to if there’s no reaction from the person/party defined as the first contact?
Having a readily available history of reports and reactions (which reporting by phone won’t give you) will allow for an easier, evidence-based assessment of the service level you receive.
If you do report by phone, remember to always follow up with at least an email.
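An escalation path is easiest to follow when it is written down as an ordered list with waiting times. The following Python sketch shows one possible shape; the contact roles, the waiting times and the `current_contact` helper are purely hypothetical.

```python
from datetime import datetime, timedelta

# Hypothetical escalation chain: (contact, how long to wait for a reaction
# before moving on to the next one).
ESCALATION_PATH = [
    ("on-call engineer", timedelta(minutes=30)),
    ("maintenance team lead", timedelta(minutes=30)),
    ("vendor account manager", timedelta(hours=1)),
]


def current_contact(reported_at: datetime, now: datetime) -> str:
    """Who should be chased at `now` if the report is still unanswered."""
    elapsed = now - reported_at
    for contact, wait in ESCALATION_PATH:
        if elapsed < wait:
            return contact
        elapsed -= wait
    # End of the chain reached: stay with the last contact.
    return ESCALATION_PATH[-1][0]
```

With these sample values, a report filed at 03:00 and still unanswered at 03:40 should already be with the team lead rather than the on-call engineer.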
#6 The price for the service as well as financial implications for breaching agreed uptime, reaction times or service level
This one is fairly obvious. Simply put, how much will it cost the maintainer to exceed the agreed reaction times or fail to resolve the incident? Will the compensation be monetary or paid in work (time & material, e.g. 2h worth of work for each 1h of delay)?
Under what circumstances is the client authorized to terminate the SLA immediately with no additional severance? (e.g. failure to fix the issue within 3 days, or failure to react properly more than 3 times)
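The example figures above (2h of work per 1h of delay, 3 days to fix, 3 failed reactions) can be turned into a small calculation. This is only a sketch using those sample numbers; the constant and function names are hypothetical, and the actual terms are whatever your agreement says.

```python
from datetime import timedelta

COMPENSATION_RATIO = 2                   # hours of work owed per hour of delay
MAX_RESOLUTION_TIME = timedelta(days=3)  # fix must land within this time
MAX_REACTION_FAILURES = 3                # tolerated number of failed reactions


def compensation_hours(delay: timedelta) -> float:
    """Hours of work owed for reacting `delay` later than agreed (never negative)."""
    return COMPENSATION_RATIO * max(delay, timedelta()) / timedelta(hours=1)


def may_terminate(time_to_fix: timedelta, reaction_failures: int) -> bool:
    """True if either sample termination condition is met."""
    return (
        time_to_fix > MAX_RESOLUTION_TIME
        or reaction_failures > MAX_REACTION_FAILURES
    )
```

Under these sample terms, a reaction that comes 1.5 hours late is worth 3 hours of work, and a fourth failed reaction is enough to walk away from the agreement.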
You can see some example SLA agreements at Docracy.
Keep in mind, though, that no single one is perfect for every service/app, so the final result is a matter of multiple reviews and mutual discussions specific to the service it is to cover.