Eliminate Crashes & Incidents in Mobile Development Teams
As an iOS engineer, I have spent a fair amount of time dealing with crash log reports and incidents reported by customers. Especially if you're working on enterprise apps in large, distributed teams, resolving crashes and incidents should be a prioritized part of your daily work. I have worked on apps with >1 million users where this wasn't the case - such defects would get out of hand extremely quickly. And the thing is, the more you neglect this type of work, the more it's going to hurt to get rid of. Not to mention that failures in your app can interact with each other, resulting in new failures. Before you know it, you will suffer from the Broken Windows syndrome - the more defects you have, the less energy developers have to get rid of them.
So, let me share what has worked in projects I've been on, and how we went from a large pile of crash log and incident reports to almost none in a matter of months.
First, let's clear up some terminology:
- Crash log: a log containing reports of crashes in production
- Incident: an issue reported by a customer - something that went wrong in the app
- Defect: a term I'll use to cover both crash log and incident reports
Hero of the Day
The first challenge in eliminating defects is to figure out how to encourage and enforce better discipline around them. To accelerate this, start by making a person or a team responsible for distributing defects to the teams that own them, by creating issues in those teams' backlogs. If you have a support team in your organization that normally provides support to other teams, it makes sense to give them this responsibility.
In our support team, we have a role known as the "Hero of the Day", who is responsible for spending 1-2 hours per day going through the list of new and existing defects. This role rotates amongst team members each day. The Hero of the Day will either fix the "easy" defects or identify the team that owns a specific defect and create an issue in their backlog. Related defects can be grouped so that only one issue is created for them.
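To make the grouping step concrete, here is a minimal Swift sketch of how related defects could be collapsed into one backlog issue per owning team. The `Defect` and `BacklogIssue` types, their fields, and the idea of using a short "signature" string to decide what counts as related are all assumptions made for illustration, not part of any tool mentioned in this post.

```swift
import Foundation

// Hypothetical model of a defect; the field names are assumptions for illustration.
struct Defect {
    let id: String
    let owningTeam: String   // team that owns the affected code
    let signature: String    // e.g. a top stack frame or a short symptom label
}

// One backlog issue covering every defect that shares a team and a signature.
struct BacklogIssue {
    let team: String
    let signature: String
    let defectIDs: [String]
}

func groupIntoIssues(_ defects: [Defect]) -> [BacklogIssue] {
    // Key by team and signature so related defects collapse into a single issue.
    let grouped = Dictionary(grouping: defects) { "\($0.owningTeam)|\($0.signature)" }
    return grouped.values
        .map { group in
            // Every group produced by Dictionary(grouping:) has at least one element.
            BacklogIssue(team: group[0].owningTeam,
                         signature: group[0].signature,
                         defectIDs: group.map { $0.id }.sorted())
        }
        .sorted { ($0.team, $0.signature) < ($1.team, $1.signature) }
}

// Example: three defects, two of them related, become two issues.
let issues = groupIntoIssues([
    Defect(id: "D-1", owningTeam: "Checkout", signature: "nil cart on launch"),
    Defect(id: "D-2", owningTeam: "Checkout", signature: "nil cart on launch"),
    Defect(id: "D-3", owningTeam: "Search", signature: "timeout on slow network")
])
for issue in issues {
    print("\(issue.team): \(issue.signature) \(issue.defectIDs)")
}
```

In practice the signature would probably come from whatever your crash reporting or incident tool exposes; the point is only that the Hero of the Day creates one issue per group rather than one per report.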
When this role was first introduced, it proved to be very effective, and the list of defects shrank significantly. After a few months, we reached an acceptable point where only minor defects were left. These were mostly the least frequent defects, ones that only happened on a specific model of older devices, or ones that the developers simply could not reproduce.
Incident management process
I'll provide a little more detail on the process we use for handling incident reports. The process is heavily inspired by the ITIL framework, so the incident lifecycle looks like this:
- Incident reported
- Incident resolution
- Incident analysis
The first phase is where the customer reports an incident. Many companies have a service desk to which incidents are reported by phone call, email, a form on a website, etc. If the service desk agent can't resolve the incident, it is escalated so that developers will handle it.
The service desk should have an incident template listing the information to gather, which makes resolution easier. Here are some important points to include in the template:
- The name of the person involved in the incident.
- The date and time the incident is reported.
- A description of the incident (what is not working properly).
- A unique identification number assigned to the incident, for tracking.
- A logical, intuitive category (and subcategory, as needed), for grouping.
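As a rough illustration, the template above could be modeled in Swift along these lines. The type name, the field names, and the `missingFields` check are assumptions for the sake of the sketch; a real service desk tool will have its own schema.

```swift
import Foundation

// Hypothetical incident template; one field per point in the list above.
struct IncidentReport {
    let id: UUID                      // unique identification number, for tracking
    let reporterName: String          // the person involved in the incident
    let reportedAt: Date              // date and time the incident is reported
    let incidentDescription: String   // what is not working properly
    let category: String              // logical category, for grouping
    let subcategory: String?          // optional subcategory, as needed

    // Names of the text fields that are still empty, so the service desk
    // can complete the report before escalating it to developers.
    var missingFields: [String] {
        let required = [
            ("reporterName", reporterName),
            ("incidentDescription", incidentDescription),
            ("category", category)
        ]
        return required
            .filter { $0.1.trimmingCharacters(in: .whitespacesAndNewlines).isEmpty }
            .map { $0.0 }
    }
}

// Example: a report that was filed without a category.
let report = IncidentReport(id: UUID(),
                            reporterName: "Jane Doe",
                            reportedAt: Date(),
                            incidentDescription: "Login button does nothing",
                            category: "",
                            subcategory: nil)
print(report.missingFields)
```

Checking for empty fields at the service desk is one possible way to keep incomplete reports from reaching developers; it is a design choice for this sketch rather than something the process requires.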
The second phase, resolution, starts when the incident lands in the hands of the developers. If you have a Hero of the Day, that person should be the first one to pick it up and do the initial technical investigation. The person may even successfully diagnose or resolve the incident without the involvement of other teams. Otherwise, the incident is distributed to the team that owns it as an issue in their backlog.
After resolution, the incident is then passed back to the service desk to be closed. To maintain quality and ensure a smooth process, only service desk employees are allowed to close incidents, and the service desk agent should check with the customer who reported the incident to confirm that the resolution is satisfactory and the incident can, in fact, be closed.
The third phase is analysis. To reduce incident volume and mitigate risk, analyze the existing incident reports once in a while. The point is to look for trends, patterns, and potential underlying problems behind the incidents. The category on each incident report is useful for this, as it lets you more easily identify weak areas in the codebase.
Set up an incident analysis meeting in your team that recurs once a month to make analysis a regular part of your incident management process.
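A simple starting point for such an analysis is a tally of incidents per category. The Swift sketch below assumes you can export the category of each incident as a plain string; the function name and the sample categories are made up for illustration.

```swift
// Tally incidents per category and list the categories with the most reports first.
// Ties are broken alphabetically so the ordering is stable between runs.
func categoryCounts(_ categories: [String]) -> [(category: String, count: Int)] {
    var counts: [String: Int] = [:]
    for category in categories {
        counts[category, default: 0] += 1
    }
    return counts
        .map { (category: $0.key, count: $0.value) }
        .sorted { lhs, rhs in
            lhs.count != rhs.count ? lhs.count > rhs.count : lhs.category < rhs.category
        }
}

// Example with made-up categories from one month of incident reports.
let ranked = categoryCounts(["Login", "Payments", "Login", "Sync", "Login", "Payments"])
for entry in ranked {
    print("\(entry.category): \(entry.count)")
}
// Prints:
// Login: 3
// Payments: 2
// Sync: 1
```

A category that stays at the top month after month is a reasonable hint that the underlying area of the codebase deserves a closer look.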
Having the right tools can help your team raise awareness and resolve defects more efficiently. Here are a few tools to help you do this.
Crash log reports
Both iOS and Android provide a basic overview of the crashes being logged in production.
For iOS apps, open App Store Connect in a browser, click on "App Analytics" and select the app you are interested in; a graph of crash reports will then be shown at the bottom right. The title of this graph, "Crashes", is a link that will take you to a more detailed crash report graph. However, to see the actual call stack for an individual crash, you will need to use the Organizer in Xcode.
For Android apps, the Google Play Console provides much more detail for reported app crashes in the browser and makes it much easier to view user reviews. So Apple certainly has room for improvement here.
There are more sophisticated tools for monitoring crash log reports. The visibility and transparency that effective monitoring gives you are invaluable, so it is worth investing in. AppDynamics is the tool I've had the most success with. It also has great dashboard functionality that makes it easy to share the status on a monitor with your team members.
There are also different tools available for incident management. One of the most commonly used is ServiceNow, which allows for tracking of the incident state and communication between developers and the service desk.
It's also very practical to implement an in-app incident report tool. Antoine van der Lee from the iOS community has recently been working on an open-source tool called Diagnostics that serves this exact purpose. I haven't had a chance to try it out myself, but it definitely looks promising. It's only for iOS, but hopefully, there's something similar available for Android.
That's it. We discussed how you can ultimately bring down the defects in your app by setting up processes in your daily workflow. We also covered what a solid incident management process looks like as well as different tools to make it all easier.
If you liked this post, make sure to check out the other posts on this site. I cover everything related to mobile app development. Feel free to contact me or tweet to me on Twitter if you have any additional tips or feedback.