Recurrent Neural Networks for Churn Prediction

I just posted a simple implementation of WTTE-RNNs in Keras on GitHub: Keras Weibull Time-to-event Recurrent Neural Networks. I'll let you read up on the details in the linked information, but suffice it to say that this is a specific type of neural net that handles time-to-event prediction in a super intuitive way. If you're thinking of building a model to predict (rather than understand) churn, I'd definitely consider giving this a shot. And with Keras, implementing the model is pretty darn easy.

As proof of the model's effectiveness, here's my demo model (with absolutely no optimization) predicting the remaining useful life of jet engines. It's not perfect by any means, but it's definitely giving a pass to engines that are in the clear, and flagging ones that are more likely to fail (plus a few false positives). I have no idea how much better it could do with some tweaking:

Demo WTTE-RNN Performance

Also, if anybody's curious as to why I've been in a bit of a post desert lately, my wife and I recently had a baby and I haven't been giving as much thought to the blog. However, I have some ideas brewing!

56 Responses

  1. Jimmy March 24, 2017 / 4:32 am

    Dear blogger,
    What is the relation between test_x and test_y? I thought the data in the test_y were life, but obviously they were not.

    • daynebatten March 24, 2017 / 9:52 am

      In the demo code I posted, test_y is a (samples, 2) tensor containing time-to-event (y) and a 0/1 event indicator (u), specifically for the testing data (as opposed to the training data). The event indicator for test_y is always 1, because this is the true remaining lifetime for all these engines.

      test_x is simply a history of sensor readings for each engine, used to make predictions about its remaining useful life.

      If the model is working well, when it’s fed test_x and asked to make predictions, it should generate predictions roughly equivalent to the lifetimes in test_y.

      Hope that helps.

      • Jimmy March 28, 2017 / 3:09 am

        Thanks for your reply. It has been very helpful. I want to try to predict the real-time remaining life of the engines. Do you have any suggestion?

        • daynebatten March 28, 2017 / 7:40 am

          That’s actually exactly what the demo code is for… the alpha and beta parameters in the output describe a Weibull distribution showing the propensity for each engine to fail over future time.

          If you need a point estimate of life remaining (like I’ve used for the graph above), you can simply set the Weibull survivor function equal to .5 using the output alpha and beta and solve for time. The survivor function is exp(-(t/a)^b). Set it equal to .5 and solve for t and you get a(-ln(.5))^(1/b).

          Hope that helps.

  2. Jimmy March 30, 2017 / 2:53 am

    I calculated the remaining useful life and I found the values are quite far from the actual one. Take engine 1 as an example, the predicted result was [ 112. 1. 218.93415833 4.13228035]. The t I solved is about
    but the actual value is 31. Why is it so different?

    • daynebatten March 30, 2017 / 6:14 am

      If you look at the code, you’ll see that the output prints the test_y values next to the predictions, so that 112 in your output is the actual remaining life of that engine. Not sure where you got 31 from.

      A few other things to remember:

      1) This prediction is saying the engine has a 50% chance of surviving to day 196. We’d expect some engines to fail before then and some to fail after then.

      2) This is machine learning, not black magic. No predictions are going to be perfect.

      3) This is a demo and I made no attempts to optimize this model whatsoever. It’s entirely possible that making some tweaks to the model (additional layers of neurons, for example) could dramatically improve performance. Feel free to mess around.

      • Jimmy March 30, 2017 / 10:31 am

        Sorry about my stupid questions. XD
        I have been learning LSTM recently and I have not completed a single model of RNN (LSTM). Your example is wonderful and I am sure I will learn a lot from it. The value 31 was learned from the data test_x. According to the data, engine 1 ran 31 cycles and failed, so I think TTE value should be 31.(Am I right?) However, the value is 112 in test_y.

        • daynebatten March 30, 2017 / 1:19 pm

          No problem at all. The best way to learn is by taking on new challenges and muddling through.

          Best of luck!

          (Don’t have time to go hunt down the 31 right now, but if the 31 came from test_x, it’s some kind of sensor reading from the engine.)

  3. Phillip April 1, 2017 / 4:28 am

    Mr. Batten,

    I am a doctoral student at NC state interested in discussing a section of your recent dissertation with you. If willing, please email me at your earliest convenience.

    • daynebatten April 3, 2017 / 7:43 am

      If you’re looking to talk about this model, I’d be happy to have a conversation with you, but I haven’t written a dissertation on the subject (or any subject, for that matter)… Perhaps you’re referring to Egil Martinsson’s thesis?

      I’ll send you an email offline, though I’m also mildly concerned this is trolling what with it being April 1 and all?

  4. Natalia Connolly April 12, 2017 / 1:17 pm

    Hi Dayne,

    Thank you for making your code available on github! This is really cool stuff.

    A quick question: where did you get the true time-to-event numbers (the ones in test_y)? Was it part of the NASA dataset?

    • daynebatten April 12, 2017 / 1:46 pm

      Hi Natalia,

      Yes, it was part of the NASA data set. The files starting with RUL indicate the remaining useful life for each engine in the test data files, which start with test_.

      The organization of the data is admittedly not the best…


  5. Ashwin May 20, 2017 / 9:43 am

    Mr. Batten,

    Thanks for the simple write up and code! I am still learning the workings and have a basic question –
    The RUL = a(-ln(.5))^(1/b) that you mentioned to Jimmy’s post, is it absolute time or a percentage of the max time of the engine?


    • daynebatten May 22, 2017 / 3:22 pm

      Thanks for writing. That should be absolute time!

  6. Alberto Albiol May 22, 2017 / 5:22 am

    Excellent work, I have been digging into your code to get some light on recurrent models.
    I have two questions:
    1- Do you think that the result would be very different without masking?, you have already left-padded the sequences with zeros

    2- You are using a sliding window of 100 timesteps, but another option is using a stateful char-rnn, so the network can remember the whole history and reset the state for each engine?, have you tried something like this?. I think that the advantage of your approach is that you can shuffle the training samples….

    • daynebatten May 22, 2017 / 3:26 pm

      Good questions.

      1) I’m not an expert on NNs, so I’m not 100% sure, but I expect it wouldn’t make a super-big difference. I’d imagine the network would learn that having everything set to 0s is basically useless information and begin to ignore it. Not ideal to have some of your model’s knowledge dedicated to that, though, if it’s not necessary.

      2) Yes, you could use a stateful char-rnn, but there’d be two downsides. First, you could only back-propagate through time for as many time steps as you had in each batch. If you kept, say, 10 time-steps per batch, this might not matter… but I don’t know. I haven’t played with this dummy model much. Second, there’s just some Keras challenges involved in managing batches when you have time series of differing lengths. It’s not anything that can’t be accomplished, but it would be some extra coding.

      Thanks for writing!

  7. Jack May 24, 2017 / 1:46 am

    Thanks for posting this. On which hardware did you train the model and how much time did it take?

    • daynebatten May 24, 2017 / 8:09 am

      I trained the dummy model on my laptop (core i5). I don’t remember exactly how long it took, but definitely no more than an hour or so.

      I’ve scaled this up to a different data set with 52 observations of 67 features for ~100k individuals. As with the jet engine data, this expanded to a full historical look-back for each period in which the individual was observed. It was a pretty big model. Trains on a p2.8xlarge in ec2 (8 Tesla k80s) in a couple hours…

  8. DM June 26, 2017 / 12:44 pm

    Hi Dayne, thanks so much for taking the time to put this together. I’m trying to walk through the work you’ve done with the goal of spinning up my own churn prediction model. Downloading the data from the NASA site and running the code from your github page, I’ve plotted the predicted vs actual time to survival, similar to the graph you have near the bottom of your post.

    I see similar patterns and clusters between our two plots, except the vertical bar in my plot is centered at around 90 days whereas yours it at 60. Without asking you to debug my code, and being aware of the randomness inherent in the model, is it possible that your plot was generated from a different model than what you posted?

    Like I said, I pasted your code in its entirety so I want to verify the input to the chart before I begin diving into other possible discrepancies. Again, thanks a ton for the all the work you’ve done, it’s a great simplified version of Egil’s work and has been really helpful for me to wrap my head around this.

    • DM June 26, 2017 / 5:10 pm

      Nevermind! It was an issue in my code, somehow 🙂

    • Sven Thoms May 8, 2019 / 4:51 am

      Also, Egil is using Theano, which is not supported anymore. Your version, Dayne, can be adapted better to recent Tensorflow Versions and tf.keras.
      About the predicted values: that vertical var is around 75 in my plot.
      75, 60, 90 … what was the issue in your code that led to the predicted time shifted to the right?

  9. slaw June 29, 2017 / 2:44 pm

    Great post. I had a question about the normalization. If I train a model on a single feature that has raw values between 0-0.5 and the test data has values between 0-1.0 then wouldn’t the separate normalizations of the training and test sets be inconsistent? It would be even worse if the new incoming unlabeled data had values between 0-2.0 and so the normalization would be inconsistent across all three data sets. Is there a good way to address this?

    • daynebatten July 6, 2017 / 7:33 am

      This is just a bug… thanks for pointing it out. I actually noticed it in a private project a couple weeks ago and haven’t gotten around to fixing it in the public repo. I’ll knock that out now.

  10. pup August 16, 2017 / 7:35 am

    Hi Dayne – How do you think this will extrapolate to a retail customer churn model which is not subscription based. Instead customers simply stop coming. Would the analogue to RUL be the number of days to the last transaction?


    • daynebatten August 28, 2017 / 8:52 pm

      Thanks for writing! I think it would map on quite well. In fact, in Egil Martinsson’s original work (cited above), he was envisioning just such a scenario!

  11. Alexander September 4, 2017 / 12:36 pm

    Do you know how to deal with irregular observations? For example during let’s say 12 months period there could be various number of observations for each subject at various dates during this 12 month period (pretty much random dates and number of events per subject). Number, frequency, and parameters taken during each observation could be related to the probability of the event – for example if device is brought in for repair multiple times during last 12 months it is obviously will affect its chance of a catastrophic failure in the next 30 days.

    • daynebatten September 6, 2017 / 11:16 am

      Hmmm… that’s an interesting question and, unfortunately, I don’t know that I have a great answer. It may be worth looking into the application of RNNs in situations with varying amounts of time between observations? I’d imagine somebody else has faced this issue before, though maybe in a different context.

  12. Ed September 22, 2017 / 5:14 pm

    First of all GREAT THANKS!
    I work with surv analysis in R (rfsrs etc)
    i’ve read your post

    please, tell where i can read description of the dataset?
    jet engine failure
    I cann’t find this

    I guess column 0 means # of engine
    column 1 means # of timestep
    Do I understand correctly?
    Where in train data we can see sensored or unsensored event?
    thanks a lot

    • daynebatten September 25, 2017 / 8:16 am

      Thanks for the kind words… The jet engine failure data is #6 on the linked page… “Turbofan Engine Degradation Simulation Data Set.” There’s an included readme and other resources.

      I believe (though I didn’t go explicitly check, so you may want to verify) that all of the engines in the data set were followed until the end of their useful life, so there are no censored observations.

  13. Ayub Quadri November 17, 2017 / 12:11 am

    Hello Daynebatten,

    Its a wonderful work with a beautiful dataset to understand the Time to a failure event, I want to know more about some of the parameters in the test_results which it generates, as per my understanding.

    Test_results -> [ TTE, Event Indicator, Alpha, Beta]
    where TTE -> actual value
    Event Indicator -> yes or no [1/0]
    Alpha -> some value
    Beta -> some value

    to calculate the predicted Time to failure do we need to feed it to the formula inverse CDF

    1. The inverse cumulative distribution function is I(p) = alpha (-ln(P))^(1/beta)


    2. Probability density function (PDF)

    3. Cumulative density function (CDF)?

    the formula for Weibull distribution function f(x), PDF, CDF, ICDF(p) is provided in

    Please correct me if my assumption is wrong.

    • daynebatten November 17, 2017 / 9:14 am

      There isn’t necessarily a “predicted time to failure” as a point value. Instead, alpha and beta give us information about the probability of failure over time. If you want to get a point estimate (as I did to make the graph in the GitHub ReadMe), you could look for the point in time where the survival function for the Weibull distribution crosses 50%. This is the point where the chance of having failed by that point goes over 50%, so it’s a defensible estimate for remaining useful life.

      • Ayub Quadri November 20, 2017 / 10:00 am

        I want to know more how did you calculate the ‘Predicted Remaining Useful life’
        The outcome of mode is 4 fields
        1. Actual RUL
        2. Event Indicator
        3. Alpha
        4. Beta

        it might be a silly question but please explain me how did you evaluate the Predicted RUL from the above outcomes.

  14. Ayub Quadri November 21, 2017 / 2:17 pm

    please help me out to understand the outcome generated by the model.
    it gave me alpha, beta values but what would be the TTE?

    can I feed that to Weibull function to predict the Remaining Useful Life (RLU)?

    • daynebatten November 22, 2017 / 8:45 pm

      Please see my previous reply. There is no point value for predicted time to failure… instead there’s a distribution across future time.

      For the purposes of that graph on GitHub, I did something quick and dirty – I looked for the point where the Weibull survival function crossed 50%. This is not exactly correct, but, as I said in the last post, it’s reasonable, and it gave me what I wanted – a quick and dirty way to plot my results against actual values and prove that the model learned something. If you’re looking for me to lay the math out for you, the survivor function is exp(-(t/a)^b). Set it equal to .5 and solve for t and you get a(-ln(.5))^(1/b). Plug in a and b and you get your estimate.

      As it turns out (see pages 3-4 of the document linked below), the correct way to estimate expected future life is to integrate the survival function from 0 to infinity. Please, please, though… understand that neither of these are a prediction that the subject will experience a failure on that particular day, it’s just our best estimate for how long it will survive.

      • rory cawley April 5, 2018 / 4:58 am

        Just to clarify, when you mean:
        survivor function is exp(-(t/a)^b)

        is this the code you are saying to change?
        hazard0 = k.pow((y_ + 1e-35) / a_, b_)

        • daynebatten August 10, 2018 / 12:04 pm

          Not saying to change that code, just saying that that’s the formula for calculating time until 50% likelihood of having failed. Just plug alpha and beta into that formula and you get your estimate.

  15. Gianmario Spacagna January 9, 2018 / 8:56 am

    Thanks for the work.
    I tried to re-implement your architecture using same data and following your code.
    I think there is a weird effect if you train the network for more epochs or if you switch to adam which converges faster than RMSprop. What happens is that weights of the input connections of the output layer tend close to zero while the bias increase. Aka, the output become constant. Thus one value of alpha and beta for any test sequence.
    I am still investigating the problem. It could be something related to the L2 normalization (default in the normalize scikit-learn function). I am also experimenting with batch normalization but I need to get rid of zero-variance features (that are: op_setting_3, sensor_measurement_18 and sensor_measurement_19).

    Have you tried to run exactly the same code but increasing epochs to something like 500 or with a different optimizer?


    • daynebatten January 30, 2018 / 11:27 am

      Yes, we ran into this problem as well in certain scenarios, but I got pulled into other work before I could solve it. Have you found anything helpful?

    • cristiana corno July 19, 2019 / 8:51 am

      same here and the problem is in the training set too,

  16. rory cawley April 5, 2018 / 4:53 am

    Thank you Dayne for putting this together! I’ve a question about interpreting the results. Below is an example of the output from the code. I see 112 vs 194, 98 vs 155, 69 vs 86 and 82 vs 86. I’m unsure how to use these numbers to understand whether the model is effective or not. Would you mind taking a couple of examples from the output and explain how they show work. Should it be that the engines are ranked by predicted TTE so you can then prioritize the schedule maintenance for them? Sorry to ask you to tease out what these findings really mean.

    [112. 1. 194.55757141 2.05106115]
    [ 98. 1. 155.65660095 1.96848094]
    [ 69. 1. 86.60446167 1.54581654]
    [ 82. 1. 86.64717102 1.54790354]
    [ 91. 1. 86.62908173 1.54897189]
    [ 93. 1. 86.11849976 1.54517996]
    [ 91. 1. 86.29958344 1.54640079]
    [ 95. 1. 86.54808044 1.55025446]
    [111. 1. 141.9490509 1.92031062]
    [ 96. 1. 86.59703827 1.54538286]

    • daynebatten August 10, 2018 / 12:01 pm

      Hi Rory,

      As discussed above, the third and fourth values in the output for each observation are alpha and beta parameters of a weibull distribution describing the likelihood of engine failure over future time. You can use these parameters to calculate a distribution function or cumulative distribution function of likely failures for each engine. You’d then have to choose some statistic to help you decide how to prioritize which engines to focus on. Maybe it’s time until the CDF goes past 50%, for example… (in other words, time at which the engine is predicted to have a >50% chance of having failed). Ultimately, this is going to be up to the discretion of the researcher, and you’ll have to choose thresholds that give you the kind of precision/recall performance you want.


      • rory cawley September 14, 2018 / 9:22 pm

        Thanks Dayne

  17. Nicolai Bjerre Pedersen May 6, 2018 / 11:36 am

    Hi Dayne,

    Really interesting work here. One question though; why are every observation marked as uncensored? I agree that the last day for each engine should be uncensored, but the days up till failure should be censored? I mean; you assume you dont know the future, right?

    Hopefully you understand my question.

    One other thing; what about using cox regression instead – do you have pro and cons for that approach?


    • daynebatten August 10, 2018 / 12:05 pm

      Hi Nicolai,

      Thanks for commenting. Every observation is marked as uncensored because every engine in this data set did fail eventually. Not every data set will be that way, obviously.


  18. Jeff May 18, 2018 / 4:51 am

    Hi, if I am using this WTTE-LSTM for medical data where some of them has no event occuring within the recording data some do, what would my y-axis be? In your case you put all 1s because you’re assuming the engine will fail sometime in the future. What would you recommend in my case since it is not necessarily true that everyone will get, for example, cancer? Thanks!

    • daynebatten August 10, 2018 / 12:09 pm

      Unfortunately, most of the survival analysis field is predicated on the assumption that everybody experiences the event of interest eventually. In medical contexts, the event of interest is often death, and everybody dies. If you’re looking at something that everybody may not experience (e.g., getting cancer), the assumptions fail. People will argue about how much of a problem that is in practice (and there are plenty of papers published that, for example, use vanilla survival analysis to look at things like divorce rates), but the assumptions fail nonetheless.

      You can choose to ignore the problem and use the model and just treat people that never got cancer as censored, but do so at your own risk.

      There’s also models specifically designed for handling data where not everybody experiences the event of interest. Google “cure models” if you’re interested. Incorporating the math behind those models into this project would be a pretty big undertaking, though.

  19. Srujana Takkallapally September 4, 2018 / 2:19 pm

    Hi, I am trying to use this model for retail industry where time sequences are not continuous (as customers do not come to retailer everyday). For example for customer 1 – My time sequences are – 1, 14, 22, 52, 100…. Can I still use this model for predicting customer churn using np.zeros for days in between?

    • daynebatten April 19, 2019 / 7:59 am

      You could certainly use this model, but the challenge would be figuring out when a “churn” even occurs. You could likely define it as a customer not coming in for X days, where X is longer than the gap you’d expect a normal customer to have in their shopping behavior. For example, if you’re working in grocery retail, maybe a gap of a few weeks is normal (somebody’s on vacation), but a 60 day gap is probably a good indication that somebody’s moved on.

      I hope that’s helpful?

  20. alice April 19, 2019 / 1:11 pm

    Thanks a lot for the wrapping up code for WTTE-RNN. It makes much easier to understand.However, I am confused about how you transform the features in the line “xtemp[:, max_time-min(j, 99)-1:max_time, :] = engine_x[max(0, j-max_time+1):j+1, :]”. Could you give some explanations? Many thanks.

    • daynebatten May 8, 2019 / 3:58 pm

      No problem! I’m glad it’s helpful. Yes, I’d be happy to explain a bit.

      xtemp is a (0, max_time, 24) numpy array where max_time is the look-back window for building the data set (in other words, we’re predicting using the last [max_time] days of data) and 24 is just the number of features in the dataset.

      We fill that [max_time]-day look-back window with as much data as we have for a given engine on a given day. In some cases, that’s [max_time] days, but sometimes it’s less (for the second day we’ve observed an engine, for example). So, the max_time-min(j, 99)-1:max_time slices that window out of the xtemp data, and the max(o, j-max_time+1):j+1 slices the same period out of the engine data so we can assign it to the xtemp data.

      Does that help?

  21. Ishita April 23, 2019 / 1:02 pm

    Hey Dayne,
    Thanks for the amazing post. Just a quick and naive question since I am totally new in the field.
    I was actually confused from where are we getting the alpha and beta value initially, as ab_pred has not been initialized anywhere.

    • daynebatten May 8, 2019 / 3:49 pm

      No problem! We actually aren’t getting the alpha and beta value initially, if I’m understanding your question correctly. The model just learns to predict them…

  22. Andrew June 28, 2019 / 6:29 am

    Thank you very much for this work!
    I have a question relating to using this in a retail scenario: as I understand it, I can either define my own churn date (e.g. no activity for X days) and treat this as a time-to-failure problem (where “failure”=churn), or I can treat it as a time-to-event problem, where each order is an event, and the as-yet-unseen orders are censored data. Is this correct?

  23. Dominik July 19, 2019 / 5:09 am

    Hi Dayne,

    Thanks for your work, it’s really concise and very helpful! I’m also a fan of Egil’s thesis and blog post, and I’ve actually started to try and implement this at work on real customer data. Your Keras implementation is very helpful. The difference is, I’m doing this in R so I need to check how everything works. For example, I’m not sure about being able to do custom activation functions in R; I found some explanations about using custom loss functions though. So, I completely understand what you mean with wrapping one’s head around everything – I’m still wrapping, haha! Something I’m kind of paranoid about is the input shape. I have multiple customers per time step, similar to the jet engine example where there are multiple engines per time step. Do you maybe have any helpful tipps how to set batch size (time steps and data dimension are clear)? I have it boiled down to two situations I think: one batch is the data for all n customers at the j-th time step, or one batch is the data across all time steps for the i-th customer. Or do you have any other tipps? I saw that your input shape didn’t take 3 arguments, what was the reasoning here?

    Sorry for the intense questions and again, thanks for your work it’s a great addition to Egil’s work. If you have some thoughts, any and all help is appreciated if you have the time.

    All the best,

  24. cristiana corno July 19, 2019 / 8:48 am

    ciao Dayne
    I am a little confused on the performance of the network (actually got better result with a linear regression on the log of tte on the engine dataset) .
    Can i write you in private?

  25. alice October 3, 2019 / 4:26 pm

    Do we need to set up the fixed look back window (in this case 100) for each engine or is it OK to use different look back window for each engine? Thanks

Leave a Reply

Your email address will not be published. Required fields are marked *