Please don’t stop the Music — Sparkify

Predicting churn using Apache Spark. Analyzing data of the fictional company Sparkify.


You offer the best music streaming service on the market, yet customers are churning left and right?

One possible solution would be to make win-back offers before the customer churns.

Psychologically, customers are much more likely to stay with your service if they have not yet made the final decision to cancel their contract, even if the thought has already crossed their mind. That is why you should try to predict churn in advance.

In the following, I will demonstrate how to predict customers cancelling their contracts using Apache Spark. Since streaming services gather, depending on their size, tons of rows of customer logs per day, it makes sense to use a large-scale analytics engine like Spark.

The Data

We are running our analysis on a small extract of the dataset containing 286,500 rows in total. The following columns are available for analysis:

Thanks to meaningful naming, most of them should be self-explanatory. All information used in the classification will be discussed further in the Feature Engineering section.

After cleaning missing or empty values in the most important columns, we first and foremost have to define churn. For this we use the page column, which provides a lot of useful information. Possible values are given as

The churn event is triggered by the value Cancellation Confirmation. We additionally introduce a column that indicates, for all entries of a single user, whether that user churned within the time frame of our dataset.

Exploratory Data Analysis

Looking at the distribution of active and churned customers, we can see that slightly below 25% of the customers in the dataset did churn.

Next, we take a look at whether there is any difference in the actions the two groups perform. This information is contained in the page column.

It seems that users who did not churn engage more positively with Sparkify, e.g. they

  • give more thumbs up
  • add more friends
  • add more songs to playlists

Users who churn, in contrast,

  • give more thumbs down
  • are more likely to roll advert

Additionally, we compare the average number of songs listened to by subscription status and find that users who churned listened to fewer songs on average.

Feature Engineering

From the previously discussed exploratory analysis we derive the following features per user:

  • Number of thumbs up and down
  • Number of playlist and friend adds
  • Number of roll adverts
  • Number of songs played

With further discussion in the Jupyter notebook, we derive two additional features:

  • The total amount of time spent listening to songs
  • The number of different artists listened to

The notebook also shows how feature engineering is done in PySpark, the format required for machine learning models in Spark, and information regarding scaling. In brief, we used PySpark's ML feature capabilities, in the form of VectorAssembler and StandardScaler, to prepare the features.


Next, we can finally start predicting churn. Two candidate models are fitted, testing different parameters: on the one hand a Logistic Regression model, on the other hand a Gradient-Boosted Tree classifier. For both models we use PySpark's machine learning classification capabilities.

The code for the initialization of the models is given by:

We set maxIter = 10 for both models, mostly for runtime reasons. Although I did not encounter any real problems, runtime was already quite long even for this small dataset. Additionally, we tune regParam and elasticNetParam for the Logistic Regression model, and maxDepth as an important parameter for the Gradient-Boosted Tree model.

Final parameters were given by (maxIter, regParam, elasticNetParam) = (10, 0.0, 0.0) for the Logistic Regression and (maxIter, maxDepth) = (10, 5) for the Gradient-Boosted Tree model.

We choose the F1 score as the optimization metric since, as seen above, the ratio between active and churned customers is approximately 3:1, so the data can be considered imbalanced. The F1 score handles imbalanced datasets better than, for example, accuracy.

Different parameters are tested using PySpark's cross-validation functionality in combination with a parameter grid.


Evaluating only the F1 score, the Logistic Regression performs slightly better (0.684 vs. 0.657). Both models deliver the same accuracy of 0.78.

Furthermore, I want to analyze the confusion matrices of both models.

You can clearly see that neither model performs especially well. Unfortunately, the Logistic Regression does not predict a single churn. This is not only a bad model; its results would have no impact on Sparkify at all, because if we do not predict churn, we cannot target customers.

The Gradient-Boosted Tree model, on the other hand, at least classifies some customers as churning. This behavior is far more useful for Sparkify and its win-back possibilities.

In this case, a customer falsely classified as churning costs Sparkify only some marketing and win-back expenses. Failing to predict that a customer will churn, on the other hand, costs a full customer and therefore that customer's full contribution.

Therefore, the Gradient-Boosted Tree classifier is chosen as the predictive model. The feature importances of the model are displayed in the following figure.

We can see that friends and ratings in particular play an important role in the predictive model, which is consistent with what the Exploratory Data Analysis indicated.


In this article we took a look at music streaming customer logs and analyzed them using Apache Spark.

We discussed why it is important to predict churn, which implications it has for the company, and which actions it enables.

We found that customers who cancel the service show different user behavior, which clearly indicates that they are not as satisfied as the average customer base.

We implemented and compared two candidate models in PySpark and, even more importantly, discussed how to properly evaluate churn models in this case, taking into account the monetary effect on the company.

Following this discussion, it would make sense to build a model that specifically penalizes False Negatives, since missed churners are the expensive errors. Additionally, when using the whole dataset on a larger server cluster or in the cloud, it could make sense to apply some kind of downsampling approach in order to balance out the dataset.

Is your music still playing?

The underlying code and analysis can be found in my GitHub repository, available here.