liacs.leidenuniv.nl/~takesfw/DSPM/assignment1.html


Introduction

CookieDestroyer

Congratulations! For the next 3 weeks you are employed as all-round Data Science expert of the largest Leiden-based gaming business "Thunderstorm Entertainment". One of their games is CookieDestroyer, a popular game played by people of all ages and genders worldwide, primarily making money from selling in-game "coins" that allow the gamer to progress in the game more quickly. Your job is to provide the company's management with information on the status, growth and trends with respect to the monetization of CookieDestroyer. To do this, you will use data from the company's sales logs, containing all purchases of coins made by the customers. Ultimately, your data analysis, visualization, interpretation and tools should allow company management to make decisions on future steps to expand the game.


Assignment format

This is Assignment 1 of the Data Science and Process Modelling course taught at Leiden University. It is also the first assignment of the Data Science course.

For each part of the assignment, the number of points awarded for a 100% perfect answer is listed between brackets and sums to a total of 100 points. You should answer each question as precisely as possible; not addressing parts of the question means that fewer points are awarded. Your assignment grade (between 1 and 10, bounds included) is computed by dividing your number of points by 10 and rounding it to the nearest half. If you get an insufficient grade for the assignment, you can retake the assignment by meeting the assignment retake deadline. Please do not be late with handing in your work. If you are late with handing in your work, it means that you failed the assignment and that you are automatically using the retake deadline for the assignment. Retake assignment grades have 2 points subtracted from the total. You are allowed to work in teams consisting of exactly two people. For each question, clearly describe how you obtained your answer, and write down any non-trivial assumptions. All practical exercises can be done on the student workstations. Be sure to hand in digitally:

  1. Your final assignment report (in PDF, generated using LaTeX)
  2. Your Dashboard URL
  3. All relevant server-side source code in one zip-file or tarball

Questions or remarks? Preferably ask them during one of the weekly lectures or lab sessions. In case of urgent questions outside these hours, contact one of the course assistants via e-mail, or ask the lecturer in person.


Assignment contents

Assignment goal

The goal of the assignment is to:

  1. Create a web-based dashboard which can be used by management to understand and interpret the sales data of the CookieDestroyer game.
  2. Answer a few strategic questions on the game using the data and/or the dashboard in a short assignment report.

The report that you hand in for this assignment should contain a short introduction to the data, the company and the dashboard, as well as the answers to the strategic questions.


Data

The data for this assignment comes in three files: sales, methods and countries. The main data table to be studied is sales, which contains a few hundred thousand sales records spanning a time period from 2010 to 2015. It has the following attributes:

  • saleId - the sale unique identifier
  • saleDateTime - date and time
  • accountName - name of the user, anonymized using one-way collision-free hashing
  • coins - the number of game coins given
  • currency - the currency of the user
  • priceInCurrency - the price in the user's currency
  • priceInEUR - the price converted to EUR (which is all the pricing information you need for this assignment)
  • methodId - unique ID of the payment method used for this sale
  • ip - the IP (again anonymized)
  • ipCountry - the ISO2 country code of the country corresponding to that IP

Furthermore, there are tables methods and countries which map the methodId and ipCountry fields in the sales table to (anonymized) payment method names and country names. Each table is available in .sql format and in .csv format. The files can be found here:

sales .csv .sql
methods .csv .sql
countries .csv .sql For some additional instructions, see below.

The files can also be found in the shared UNIX folder /vol/share/groups/liacs/scratch/DSPM/.

A simple way to load data from file.sql into MySQL is via the command line, for example as follows:
mysql -h mysql.liacs.leidenuniv.nl -u username -p username < file.sql
Username is your ULCN username, password is your MySQL password. The second username is to select the database name, which is equal to your username.


Dashboard

The web-based dashboard consists of various (at least four) widgets that should each visualize (in a different way) the following aspects of the data:

  1. [10p] Sales Volume: Visualize the sales over time (for example, per month or per quarter) in terms of at least three measures. For example: real money (price attribute), virtual money (amount attribute) and transaction count (row count).
  2. [15p] Attribute Segmentation and Filtering (OLAP-style): Present sales volume (as above) segmented per attribute: at least the payment method (methodId) attribute should be included, but you can also think of the day of the week, time of the day or the country of the customer.
  3. [15p] Customer Behavior: Can you come up with some KPI's (metrics and scores) that help management understand the customer? Visualize them in a nice way. For example, the average number of transactions per customer, the average lifespan of a customer, etc. Bonus points if you can show a plot of a distribution of the (perhaps binned) values of the metrics in addition to an average.
  4. [10p] Geographical Development: visualize the sales volume (as above) in a geographical setting, for example the number of distinct customers per country over time. The goal is again to give management as much geographic insight as possible (see Strategic Questions later on).
  5. [10p] Explain your toolstack in the report; what components did you write yourself, what did you reuse? How did you "glue" it all together? No need to list actual code in the report.
  • [10p] You are very much encouraged to make the dashboard visually appealing, and to use non-traditional visualization techniques to give management an astonishing insight in the sales data of their game. In particular, the interaction and degrees of freedom in visualizing and exploring the sales data should be as large as possible.

You can use any programming language, scripting language, markup language, or framework (as long as it is open-source), but here are some hints.

  • Data storage: You can choose to store the data in a MySQL database, but you are also free to use and experiment with NoSQL databases such as ElasticSearch (try for example Kibana), Apache superset or any other framework of your choice.
  • Markup language: when needed, on the markup/client side you will want to use at least HTML, with proper CSS style classes assigned to elements. Try to stick to a modern presentation framework such as Twitter Bootstrap to prevent you from having to write custom CSS.
  • Scripting language: On the server side, you can use PHP to serve the data, and then use Javascript and in particular jQuery's interface for JSON on the client side. Depending on your type of data storage, you could also use Python on the server side. Or work with Node.js or Angular.js.
  • Visualization: For visually appealing charts you could for example use D3.js, Chart.js or Google Charts.

Not sure what to use? Go with what you learned in your first year the "easy mode":
- Data in the MySQL database (perhaps create some relevant indexes),
- One HTML page with the dashboard, styled using Twitter Bootstrap.
- One PHP script that serves relevant data in JSON-format.
- Some Javascript/jQuery code to put the data in relevant Google Charts diagrams.
You could start with the skeletoncode.


Strategic Questions

The following questions should be answered by querying the data or using the dashboard. For each question, always elaborately motivate your answers based on the data, for example by giving queries or instructions to use the dashboard.

  1. [10p] Visualization For each component of your dashboard, explain how you mapped data attributes to visual attributes, and why you made various visualization or interaction design choices. How do these choices help management understand the data, and to what extent have you implemented "Codeless reporting"?
  2. [10p] Decision Making: For next year, company management has budget to start expanding in three countries. Based on the data, what are the emerging countries in which expansion could be worthwile? State how you define "emerging". Try to do more than counting transactions. For example, look for a trend, or also incorporate volume or customer behavior.
  3. [10p] Fraud and Compliance: Gaming companies have a responsibility towards society to prevent addiction and excessive spending patterns. A customer that makes more than 50 orders within one day can be considered addicted, but (assuming that the coins have value and can be resold) could also be committing fraud (for example, with a stolen credit card or phone). How do you identify a customer when a customer can have multiple account names in the game? How do you propose to distinguish between addiction and fraud based on the data? What are the results of applying your metric to the data, i.e., who are potential fraudulent users?

Good luck with the assignment! Ask questions. A lot, if you have to. The deadline is posted on the course website.