You spent the past few weeks at Thunderstorm Entertainment visualizing their data. Company management is impressed with your performance and has decided to continue working with you. As part of "Operation Thunderstorm", the board is seeking your continued advice regarding the use of their data to increase future revenue. Management has requested that you dive into predictive analytics, specifically the question of indicating which customers are most likely to make a purchase in the near future. In general, this helps decide which customers are loyal and which ones are not, allowing for more targeted marketing activities. To do this, you will use data mining techniques on the sales data, aggregated to a customer level. Your job is to perform a pilot study in which you ultimately convince management of the use of data science techniques for understanding customer behavior.
This is Assignment 2 of the Data Science and Process Modelling course taught at Leiden University.
It builds on the same data as used in Assignment 1.
For each part of the assignment, the number of points awarded for a 100% perfect answer is listed between brackets and sums to a total of 100 points. You should answer each question as precisely as possible; not addressing parts of the question means that fewer points are awarded. Your assignment grade (between 1 and 10, bounds included) is computed by dividing your number of points by 10 and rounding it to the nearest half. If you get an insufficient grade for the assignment, you can retake the assignment by meeting the assignment retake deadline. Please do not be late with handing in your work. If you are late with handing in your work, it means that you failed the assignment and that you are automatically using the retake deadline for the assignment. Retake assignment grades have 2 points subtracted from the total. You are allowed to work in teams consisting of exactly two people. For each question, clearly describe how you obtained your answer, and write down any non-trivial assumptions. All practical exercises can be done on the student workstations. Be sure to hand in digitally:
Questions or remarks? Preferably ask them during one of the weekly lectures or lab sessions. In case of urgent questions outside these hours, contact one of the course assistants via e-mail, or ask the lecturer in person.
The goal of the assignment is to perform predictive analytics on the sales data to predict customer behavior, deriving different customer-based features and employing these in different data mining algorithms. You will write a scientific report on your results and specifically report on experiments that employ a neural network, including a discussion of relevant parameters. Finally, you will write a 250 word management summary in which you summarize your main findings for a non-CS audience.
Note that the main output for this assignment is the report, which should of course meet academic empirical research standards. For example, based on the report, it should be possible to repeat your experiments and obtain the same results. It should follow the structure of a typical scientific report, should pass a standard plagiarism check, and should have proper references to relevant literature.
See the comments regarding the data in Assignment 1 for details.
First, we are going to convert the sales data to customer-based data and determine how we are going to verify our future data mining results.
An easy way to do this part is to write some code to create a new dataset called customers (either in MySQL or in .csv-format), so that you will have a flat table to perform data mining on. A row in this new table could look like this:
customerId feature1 feature2 ... featureN class
The class attribute can refer to whether or not the customer converted in the future, the duration of the future lifetime, or the future sales volume. Explain and motivate your choice(s).
To use Python in your own environment:
Accept questions with 'yes'.
You should now have your own environment in which you can install packages as you wish.
[16p] Management has recently heard about neural networks and deep learning, and wants you to employ neural networks to predict customer loyalty.
You are not immediately convinced that a neural network approach is per se beneficial, and promise to critically assess whether a neural network is useful in this context.
In addition to the three approaches from Part 2 of this assignment, assess the performance of a neural network on this prediction task. Experiment with at least two relevant parameters, such as the number of layers and the number of neurons, and report on your findings using at least one diagram.
Write a summary of your steps in at most 250 words that convinces management of the use of data science techniques [17p]. Make sure that this is a picture perfect text that you would feel comfortable sending to a high value customer. Remember that the reader does not have a computer science background. Think of explaining at least what problem you solved (from a domain expert perspective), what data mining methods you used to solve it, what data you used, and what the main findings/results are. Discuss potential business-related implications for the company.
Add one picture-perfect diagram (including proper axis descriptions and a caption) to the textual summary to support your argument [8p].
Good luck with the assignment! Ask questions. A lot, if you have to. The deadline is posted on the course website.