
Predicting Incident Resolution Time from Incident Response Log

Updated: Jan 14, 2021

What is an Incident?


Short Answer: Any disruption in service.

Now, to put these three words into context: any organization with an IT department has a set of teams that keep the business and IT needs covered. From the perspective of business users, these teams keep their computers, and the internal and external applications they use daily, available at all times.


Why do we need Incidents?


To keep track of each individual business user's IT issues, organizations use an ITSM (IT Service Management) tool. One of the tools with the leading market share is ServiceNow. The dataset I have used in this project is a Kaggle dataset (https://www.kaggle.com/vipulshinde/incident-response-log), extracted from ServiceNow.


As every department works by certain guidelines and deadlines, the IT support team's major guidelines and deadlines are set through this tool. The basic idea is to attend to and resolve any user issue as soon as possible. I would encourage you to go through the details of each feature on Kaggle.


Why solve this Problem? (Business Objective)


As the performance of a team is measured through this tool, any team lead or manager would find the ETR (Estimated Time of Resolution) for a raised incident very useful. The incidents that take more days to resolve would be the focus for those managers or team leads, and accordingly they would identify areas of improvement.





In the case of outsourcing, these ETRs are what earn profit for the vendors, and they are also what penalize a vendor when missed. So by solving this problem, a vendor would be able to avoid paying penalties.


Now, let's dig into the problem at hand.


For those who don't want to wait: please go through my notebook here (https://github.com/Nath9319/Incident-Resolution-Prediction-for-IT-Support).


Data Understanding:


The first observation about the data is that there is no actual information about the issue itself. As the data is open source, all the categories have been anonymized, and the description, which is the major part, is not present. However, proceeding with the data in hand still ended up giving good results.


The Incident State counts are as follows. The distribution looks like this because the data records every state each incident has been through.

The incident reopen count reflects that most incidents have not been reopened. This means that the issue has not reoccurred for most users.


In the case of Impact, we see that most incidents are medium impact. The expected time of resolution and the SLA differ by priority, so understanding how the users in an organization prioritize their issues can be a key factor in a vendor's contract.


There are some features where a category is "?"; how to handle it is up to each individual. I kept it as a separate category, which gave me good results in the end.
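A minimal sketch of keeping "?" as its own category and label-encoding with pandas (the column names and values here are illustrative, not the exact Kaggle schema):

```python
import pandas as pd

# Toy frame standing in for the incident log; columns are illustrative.
df = pd.DataFrame({
    "category": ["network", "?", "hardware", "?", "network"],
    "impact": ["2 - Medium", "2 - Medium", "1 - High", "?", "3 - Low"],
})

# Keep "?" as a category of its own rather than dropping or imputing it;
# label-encode each column so models can consume the features.
encoded = df.apply(lambda col: col.astype("category").cat.codes)
```

This way the classifier is free to learn whether a missing value is itself informative.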

Target Data/ Dependent Variable:


This being an open-ended problem, instead of approaching it as a regression problem, where we predict the exact day by which the incident would be resolved, I approached it as a classification problem, where I binned the days together.


Below is the count of days until resolution.

To me, binning makes more sense, as a manager would expect to get the incident numbers that belong to the groups with higher resolution times.


Outlier Removal: I removed all incidents with a resolution time of less than zero or more than 30 days [after CDF analysis]. Any incident resolved within a day should not show up in an open-incident report, and any incident taking more than a month to resolve is already out of hand.
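The filtering and binning steps can be sketched like this; the toy durations and bin edges are illustrative, since the notebook derives the actual cut-offs from the CDF analysis:

```python
import pandas as pd

# Hypothetical resolution times in days (in the real data these come
# from the incident's open and close timestamps).
days = pd.Series([-1, 0.5, 2, 6, 12, 25, 45])

# Drop outliers: negative durations and anything over 30 days.
days = days[(days >= 0) & (days <= 30)]

# Bin the remaining durations into classes; these edges are
# illustrative, not the exact ones used in the notebook.
labels = pd.cut(days, bins=[0, 1, 5, 10, 30], labels=[0, 1, 2, 3],
                include_lowest=True)
```

Each class then represents a band of resolution times rather than an exact day.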


Relations among features: multicollinearity inflates the variance of the estimates, and the end results become unreliable. Hence I tested for multicollinearity and removed the offending features.
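One common multicollinearity check is the variance inflation factor (VIF), which can be read off the diagonal of the inverse correlation matrix; a sketch on toy data, not necessarily the exact test used in the notebook:

```python
import numpy as np

# Toy feature matrix; x2 is nearly a copy of x1, so they are collinear.
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.01, size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

# VIF for each feature = diagonal of the inverse correlation matrix;
# values well above ~5-10 flag multicollinearity.
vif = np.diag(np.linalg.inv(np.corrcoef(X, rowvar=False)))
```

Features with very large VIFs are candidates for removal before modeling.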




Metric Selection: this section decides how well our model is performing. For this problem I chose the confusion matrix and ROC-AUC.
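With scikit-learn, both metrics can be computed as below; note that for a multi-class target, `roc_auc_score` needs predicted probabilities and an averaging scheme. The labels and probabilities here are toy values, not actual model output:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Toy true labels and predicted probabilities for a 4-class problem.
y_true = np.array([0, 1, 2, 3, 3, 1, 0, 2])
proba = np.eye(4)[y_true] * 0.7 + 0.075  # rows sum to 1, peak at true class

cm = confusion_matrix(y_true, proba.argmax(axis=1))
# Multi-class ROC-AUC needs a one-vs-rest (or one-vs-one) scheme.
auc = roc_auc_score(y_true, proba, multi_class="ovr")
```

The confusion matrix shows per-class misclassification, while ROC-AUC summarizes ranking quality across thresholds.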


Model Selection:


To validate how well our models are performing, we first need to set a threshold. I selected a Naive Bayes classifier as the baseline model, against which I would be able to say how well our other models perform. As expected, GaussianNB performs terribly on our data.


Below we can see that this model classifies most of the data as class 4.
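A minimal sketch of such a baseline on synthetic, imbalanced data (a stand-in for the actual incident features):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Synthetic, imbalanced 4-class data standing in for the incident log.
X, y = make_classification(n_samples=2000, n_classes=4, n_informative=6,
                           weights=[0.1, 0.15, 0.25, 0.5], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Fit the baseline and record its test accuracy as the bar to beat.
baseline = GaussianNB().fit(X_tr, y_tr)
acc = baseline.score(X_te, y_te)
```

Any candidate model then only counts as useful if it clearly beats this baseline score.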


For the next model I tried KNN. Although KNN is expensive at test time, it performed well on our data. After hyperparameter tuning, we got the results below.





The performance of our KNN is significantly better than our baseline model; however, there is still some misclassification on our data.
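The tuning step can be sketched with `GridSearchCV`; the parameter grid and the synthetic data are illustrative, not the exact setup from the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic 4-class data standing in for the incident features.
X, y = make_classification(n_samples=1000, n_classes=4, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# Tune the number of neighbors and the vote weighting with
# cross-validation; this grid is illustrative.
grid = GridSearchCV(KNeighborsClassifier(),
                    {"n_neighbors": [3, 5, 11, 21],
                     "weights": ["uniform", "distance"]},
                    cv=5)
grid.fit(X_tr, y_tr)
best_knn = grid.best_estimator_
test_acc = best_knn.score(X_te, y_te)
```

The test-time cost mentioned above comes from KNN computing distances to the stored training set for every prediction.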



For the next model, I tried a boosting algorithm to check how well it can classify the data. XGBoost is not able to classify the data better than KNN. The details are below.




To improve the classification power of the model, I tried sampling methods. As the data is imbalanced, the usual approach is resampling to generate synthetic data, which gives the classifier more examples of the minority class labels. However, the models performed worse on the resampled data. This can happen when the synthetic data created does not help the classification, so the model does not perform well on the test set.
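A minimal sketch of the idea using naive random oversampling with scikit-learn's `resample` (SMOTE from the imbalanced-learn package, which generates synthetic points rather than duplicating rows, is the more common choice for this):

```python
import numpy as np
from sklearn.utils import resample

# Toy imbalanced data; one class dominates, as in the incident log.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = np.array([3] * 70 + [0] * 10 + [1] * 10 + [2] * 10)

# Duplicate minority-class rows until every class matches the
# majority count, then stack the balanced parts back together.
majority = max(np.bincount(y))
parts_X, parts_y = [], []
for cls in np.unique(y):
    Xc, yc = X[y == cls], y[y == cls]
    Xr, yr = resample(Xc, yc, n_samples=majority, random_state=0)
    parts_X.append(Xr)
    parts_y.append(yr)
X_bal = np.vstack(parts_X)
y_bal = np.concatenate(parts_y)
```

Crucially, resampling is applied only to the training split, so the test set still reflects the true class balance.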


Hence, as of now, KNN is performing best among all the models we have tried.


To improve the classification performance further, I tried a Multilayer Perceptron classifier. The neural network model does a very good job at the classification.
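A sketch of such an MLP with scikit-learn; the layer sizes and the synthetic data are illustrative, not the exact architecture from the notebook:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic 4-class data standing in for the incident features.
X, y = make_classification(n_samples=1000, n_classes=4, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# MLPs are sensitive to feature scale, so standardize first;
# the hidden layer sizes here are illustrative.
scaler = StandardScaler().fit(X_tr)
mlp = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                    random_state=0).fit(scaler.transform(X_tr), y_tr)
acc = mlp.score(scaler.transform(X_te), y_te)
```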




As there are not many business constraints like interpretability and time, MLPs could be used. This model could be applied to any incident at its inception.


However, I believe that if the issue details were available, this model's performance could be drastically improved, as the ETR ultimately depends on the issue more than on the other ITSM tool parameters. Nevertheless, the model is able to classify this data very well.


 

Thanks a lot, everyone, for taking the time to look into this small project of mine. If you have anything to add, please reach out to me; my email and LinkedIn URL are available on my blog.
