liacs.leidenuniv.nl/~takesfw/DSPM/assignment2.html


Introduction


You spent the past few weeks at Thunderstorm Entertainment visualizing their data. Company management is impressed with your performance and has decided to continue working with you. As part of "Operation Thunderstorm", the board is seeking your continued advice regarding the use of their data to increase future revenue. Management has requested that you dive into predictive analytics, specifically the question of indicating which customers are most likely to make a purchase in the near future. In general, this helps decide which customers are loyal and which ones are not, allowing for more targeted marketing activities. To do this, you will use data mining techniques on the sales data, aggregated to a customer level. Your job is to perform a pilot study in which you ultimately convince management of the use of data science techniques for understanding customer behavior.

Assignment format

This is Assignment 2 of the Data Science and Process Modelling course taught at Leiden University.
It builds on the same data as used in Assignment 1.

For each part of the assignment, the number of points awarded for a perfect answer is listed between brackets; these sum to a total of 100 points. Answer each question as precisely as possible; not addressing parts of a question means that fewer points are awarded. Your assignment grade (between 1 and 10, bounds included) is computed by dividing your number of points by 10 and rounding the result to the nearest half.

If you get an insufficient grade for the assignment, you can retake it by meeting the assignment retake deadline. Please hand in your work on time: handing in late means that you failed the assignment and automatically use the retake deadline. Retake assignment grades have 2 points subtracted from the total.

You are allowed to work in teams consisting of exactly two people. For each question, clearly describe how you obtained your answer, and write down any non-trivial assumptions. All practical exercises can be done on the student workstations. Be sure to hand in digitally:

  1. Your final assignment report (in PDF, generated using LaTeX)
  2. All relevant source code in one zip-file or tarball

Questions or remarks? Preferably ask them during one of the weekly lectures or lab sessions. In case of urgent questions outside these hours, contact one of the course assistants via e-mail, or ask the lecturer in person.

Assignment contents

Assignment goal

The goal of the assignment is to perform predictive analytics on the sales data to predict customer behavior, deriving different customer-based features and employing these in different data mining algorithms. You will write a scientific report on your results and specifically report on experiments that employ a neural network, including a discussion of relevant parameters. Finally, you will write a 250-word management summary in which you summarize your main findings for a non-CS audience.

Note that the main output for this assignment is the report, which should of course meet academic empirical research standards. For example, based on the report, it should be possible to repeat your experiments and obtain the same results. It should follow the structure of a typical scientific report, should pass a standard plagiarism check, and should have proper references to relevant literature.


The data

See the comments regarding the data in Assignment 1 for details.


Part 1: Feature extraction

First, we are going to convert the sales data to customer-based data and determine how we are going to verify our future data mining results.

  1. [12p] Determine (and describe in the report) at least eight customer-based features that you think can be used to predict which customer will make a purchase in the future. Consult literature and use common sense. Write code to extract these features from the sales data. Hint: think of customer lifetime, spending behavior, order frequency, etc.
  2. [13p] Determine and describe (in the report) how you verify your data mining results. What will be the training, test and validation data? How do you prevent overfitting? Which performance metrics do you use, and why? Think of the typical steps and caveats of data mining as discussed during the lectures.
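
One possible verification setup, as a minimal sketch: hold out a stratified test set once, then use stratified k-fold cross-validation on the remaining data for model selection. The synthetic data from `make_classification`, the 80/20 split, the five folds, and ROC AUC as the metric are all assumptions made for illustration; they are not requirements of the assignment.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customers table (assumption: 8 features, ~30% positives).
X, y = make_classification(n_samples=500, n_features=8, n_informative=4,
                           weights=[0.7, 0.3], random_state=0)

# Hold out a test set that is never touched during model selection.
X_dev, X_test, y_dev, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

# Scaling lives inside the pipeline, so it is re-fitted on each training fold only
# and no information from the validation fold leaks into preprocessing.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class balance similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
cv_auc = cross_val_score(model, X_dev, y_dev, cv=cv, scoring="roc_auc")

# Final check: fit on all development data, evaluate once on the held-out test set.
model.fit(X_dev, y_dev)
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"CV AUC: {cv_auc.mean():.3f} +/- {cv_auc.std():.3f}, test AUC: {test_auc:.3f}")
```

A large gap between the cross-validated score and the test score is one sign of overfitting worth discussing in the report.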

An easy way to do this part is to write some code to create a new dataset called customers (either in MySQL or in .csv format), so that you will have a flat table to perform data mining on. A row in this new table could look like this:

customerId feature1 feature2 ... featureN class

The class attribute can refer to whether or not the customer converted in the future, the duration of the future lifetime, or the future sales volume. Explain and motivate your choice(s).
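
As a minimal sketch of how such a flat table could be built with pandas: features are computed only from sales before a cutoff date, and the class records whether the customer bought anything on or after it. The column names (`customerId`, `orderDate`, `amount`), the toy records, the cutoff date, and the helper `build_customers` are hypothetical; the real schema is the one described in Assignment 1.

```python
import pandas as pd

# Toy sales records; the column names are hypothetical, not the Assignment 1 schema.
sales = pd.DataFrame({
    "customerId": [1, 1, 1, 2, 2, 3],
    "orderDate": pd.to_datetime(["2016-01-05", "2016-03-10", "2016-08-01",
                                 "2016-02-20", "2016-04-15", "2016-05-30"]),
    "amount": [20.0, 35.0, 15.0, 50.0, 10.0, 80.0],
})


def build_customers(sales: pd.DataFrame, cutoff: str) -> pd.DataFrame:
    """Aggregate sales to one row per customer, using only pre-cutoff data for features."""
    cutoff_ts = pd.Timestamp(cutoff)
    past = sales[sales["orderDate"] < cutoff_ts]
    future = sales[sales["orderDate"] >= cutoff_ts]

    customers = past.groupby("customerId").agg(
        n_orders=("amount", "size"),
        total_spent=("amount", "sum"),
        avg_order_value=("amount", "mean"),
        first_order=("orderDate", "min"),
        last_order=("orderDate", "max"),
    )
    # Lifetime: days between first and last order; recency: days since last order.
    customers["lifetime_days"] = (customers["last_order"] - customers["first_order"]).dt.days
    customers["recency_days"] = (cutoff_ts - customers["last_order"]).dt.days
    customers = customers.drop(columns=["first_order", "last_order"])

    # Class: 1 if the customer bought anything on or after the cutoff, else 0.
    customers["class"] = customers.index.isin(future["customerId"]).astype(int)
    return customers.reset_index()


customers = build_customers(sales, cutoff="2016-07-01")
print(customers)
```

Note that this sketch drops customers without any purchase before the cutoff, since no features can be computed for them. Keeping the feature window strictly before the class window avoids leaking future information into the features.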


Part 2: Data mining

  • [10p] Use a visualization toolchain such as matplotlib (perhaps automagically enhanced by seaborn) in combination with pandas to investigate the influence of your features on future conversion. For example, generate a correlation matrix of the features and your class attribute. Report on your findings in the report, and include the matrix as a figure.
  • [24p] Use, for example, scikit-learn to apply at least three different machine learning algorithms (e.g., logistic regression, decision trees and support vector machines) to predict future conversion. Report on your findings in the report. Mention how you set parameters (even if you use the default ones), and reflect on what happens when you do change a parameter. Include at least one relevant diagram as a figure, comparing the performance of the three methods.
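
The two steps above could be sketched as follows: a correlation matrix computed with pandas, followed by a cross-validated comparison of three scikit-learn classifiers. The data is a synthetic stand-in generated by `make_classification`, and the feature names, the `max_depth=5` setting, the five folds, and accuracy as the metric are assumptions chosen for illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the customers table (assumption, not the real data).
X, y = make_classification(n_samples=400, n_features=8, n_informative=4, random_state=1)
features = [f"feature{i + 1}" for i in range(X.shape[1])]
customers = pd.DataFrame(X, columns=features)
customers["class"] = y

# Pearson correlation matrix of all features and the class attribute.
corr = customers.corr()
print(corr["class"].drop("class").sort_values(ascending=False))

# Three classifiers; parameters are scikit-learn defaults except where set explicitly.
models = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "decision tree": DecisionTreeClassifier(max_depth=5, random_state=1),
    "SVM (RBF kernel)": make_pipeline(StandardScaler(), SVC()),
}

# Mean 5-fold cross-validated accuracy per model.
scores = {
    name: cross_val_score(model, customers[features], customers["class"],
                          cv=5, scoring="accuracy").mean()
    for name, model in models.items()
}
for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

The `corr` DataFrame can be passed to `seaborn.heatmap` to produce the figure, and the `scores` dictionary maps directly onto a bar chart comparing the three methods.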

To use Python in your own environment:
wget http://repo.continuum.io/archive/Anaconda3-4.3.0-Linux-x86_64.sh
bash Anaconda3-4.3.0-Linux-x86_64.sh

Answer the installer's questions with 'yes', then activate the environment:
source anaconda3/bin/activate
You should now have your own environment in which you can install packages as you wish.


Part 3: Neural network

[16p] Management has recently heard about neural networks and deep learning, and wants you to employ neural networks to predict customer loyalty.
You are not immediately convinced that a neural network approach is per se beneficial, and promise to critically assess whether a neural network is useful in this context.

In addition to the three approaches from Part 2 of this assignment, assess the performance of a neural network on this prediction task. Experiment with at least two relevant parameters, such as the number of layers and the number of neurons, and report on your findings using at least one diagram.
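
A minimal sketch of such an experiment, assuming scikit-learn's `MLPClassifier` as the neural network: it varies the number of hidden layers and the number of neurons per layer and records the cross-validated accuracy for each combination. The synthetic data, the specific grid values, three folds, and accuracy as the metric are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the customers table (assumption, not the real data).
X, y = make_classification(n_samples=300, n_features=8, n_informative=4, random_state=2)

results = {}
for n_layers in (1, 2):
    for n_neurons in (4, 16):
        # hidden_layer_sizes is a tuple with one entry per hidden layer.
        hidden = (n_neurons,) * n_layers
        model = make_pipeline(
            StandardScaler(),  # neural networks are sensitive to feature scale
            MLPClassifier(hidden_layer_sizes=hidden, max_iter=1000, random_state=2),
        )
        results[hidden] = cross_val_score(model, X, y, cv=3, scoring="accuracy").mean()

for hidden, score in results.items():
    print(f"hidden_layer_sizes={hidden}: {score:.3f}")
```

Plotting the scores against the number of neurons, with one line per layer count, gives a diagram that shows the effect of both parameters at once.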


Part 4: Management summary

Write a summary of your steps in at most 250 words that convinces management of the use of data science techniques [17p]. Make sure that this is a picture-perfect text that you would feel comfortable sending to a high-value customer. Remember that the reader does not have a computer science background. Think of explaining at least what problem you solved (from a domain expert perspective), what data mining methods you used to solve it, what data you used, and what the main findings/results are. Discuss potential business-related implications for the company.
Add one picture-perfect diagram (including proper axis descriptions and a caption) to the textual summary to support your argument [8p].


Good luck with the assignment! Ask questions. A lot, if you have to. The deadline is posted on the course website.