liacs.leidenuniv.nl/~takesfw/DSPM/assignment1.html




Note: this is the *old* 2020 assignment.


Introduction

CookieDestroyer

Congratulations! For the next 3 weeks you are employed as all-round Data Science expert of the largest Leiden-based gaming business "Thunderstorm Entertainment". One of their games is CookieDestroyer, a popular game played by people of all ages and genders worldwide, primarily making money from selling in-game "coins" that allow the gamer to progress in the game more quickly. Your job is to provide the company's management with information on the status, growth and trends with respect to the monetization of CookieDestroyer. To do this, you will use data from the company's sales logs, containing all purchases of coins made by the customers. Ultimately, your data analysis, visualization, interpretation and tools should allow company management to make decisions on future steps to expand the game.


Assignment format

This is Assignment 1 of the Data Science and Process Modelling course taught at Leiden University. It is also the first assignment of the Data Science course.

For each part of the assignment, the number of points awarded for a 100% perfect answer is listed between brackets and sums to a total of 100 points. You should answer each question as precisely as possible; not addressing parts of the question means that fewer points are awarded. Your assignment grade (between 1 and 10, bounds included) is computed by dividing your number of points by 10 and rounding it to the nearest half. If you get an insufficient grade for the assignment, you can retake the assignment by meeting the assignment retake deadline. Please do not be late with handing in your work. If you are late with handing in your work, it means that you failed the assignment and that you are automatically using the retake deadline for the assignment. Retake assignment grades have 2 points subtracted from the total. You are allowed to work in teams consisting of exactly two people. For each question, clearly describe how you obtained your answer, and write down any non-trivial assumptions. All practical exercises can be done on the student workstations. Be sure to hand in digitally:

  1. Your final assignment report (in PDF, generated using LaTeX), including on the front page your dashboard URL
  2. All relevant server-side source code in one zip-file or tarball

Questions or remarks? Preferably ask them during one of the weekly lectures or lab sessions. In case of urgent questions outside these hours, contact one of the course assistants via e-mail, or ask the lecturer in person.


Assignment contents

Assignment goal

The goal of the assignment is to:

  1. Create a web-based dashboard which can be used by management to understand and interpret the sales data of the CookieDestroyer game.
  2. Answer a few strategic questions on the game using the data and/or the dashboard in a short assignment report.

The report that you hand in for this assignment should contain a short introduction to the data, the company and the dashboard, as well as the answers to the strategic questions.


Data

The data for this assignment comes in three files: sales, methods and countries. The main data table to be studied is sales, which contains a few hundred thousand sales records spanning a time period from 2010 to 2015. It has the following attributes:

  • saleId - the sale unique identifier
  • saleDateTime - date and time
  • accountName - name of the user, anonymized using one-way collision-free hashing
  • coins - the number of game coins given
  • currency - the currency of the user
  • priceInCurrency - the price in the user's currency
  • priceInEUR - the price converted to EUR (which is all the pricing information you need for this assignment)
  • methodId - unique ID of the payment method used for this sale
  • ip - the IP (again anonymized)
  • ipCountry - the ISO2 country code of the country corresponding to that IP

Furthermore, there are tables methods and countries which map the methodId and ipCountry fields in the sales table to (anonymized) payment method names and country names. Each table is available in .sql format and in .csv format. The files can be found here:

sales .csv .sql
methods .csv .sql
countries .csv .sql For some additional instructions, see below.

The files can also be found in the Leiden University shared Linux folder /vol/share/groups/liacs/scratch/DSPM/.


Dashboard

The web-based dashboard consists of various (at least four) widgets that should each visualize (in a different way) the following aspects of the data:

  1. [10p] Sales Volume: Visualize the sales over time (for example, per month or per quarter) in terms of at least three measures. For example: real money (price attribute), virtual money (amount attribute) and transaction count (row count).
  2. [15p] Attribute Segmentation and Filtering: Present sales volume (as above) segmented per attribute: at least the payment method (methodId) attribute should be included, but you can also think of the day of the week, time of the day or the country of the customer. Think a bit in terms of the OLAP cube.
  3. [15p] Customer Behavior: Can you come up with some KPI's (metrics and scores) that help management understand the customer? Visualize them in a nice way. For example, the average number of transactions per customer, the average lifespan of a customer, etc. Bonus points if you can show a plot of a distribution of the (perhaps binned) values of the metrics in addition to an average.
  4. [10p] Geographical Development: visualize the sales volume (as above) in a geographical setting, for example the number of distinct customers per country over time. The goal is again to give management as much geographic insight as possible (see Strategic Questions later on).
  5. [10p] Explain your toolstack in the report; what components did you write yourself, what did you reuse? How did you "glue" it all together? No need to list actual code in the report.
  • [10p] You are very much encouraged to make the dashboard visually appealing, and to use non-traditional visualization techniques to give management an astonishing insight in the sales data of their game. In particular, the interaction and degrees of freedom in visualizing and exploring the sales data should be as large as possible.

You can use any programming language, scripting language, markup language, or framework (as long as it is open-source), but here are some hints.

  • Data storage: You can choose to store the data in a MySQL database, but you are also free to use and experiment with NoSQL databases such as ElasticSearch (try for example Kibana), Apache superset or any other framework of your choice.
  • Markup language: when needed, on the markup/client side you will want to use at least HTML, with proper CSS style classes assigned to elements. Try to stick to a modern presentation framework such as Twitter Bootstrap to prevent you from having to write custom CSS.
  • Scripting language: On the server side, you can use PHP to serve the data, and then use Javascript and in particular jQuery's interface for JSON on the client side. Depending on your type of data storage, you could also use Python on the server side. Or work with Node.js or Angular.js.
  • Visualization: For visually appealing charts you could for example use D3.js, Chart.js or Google Charts.

Not sure what to use? Go with what you learned in your first year the "easy mode":
- Data in the MySQL database (perhaps create some relevant indexes),
- One HTML page with the dashboard, styled using Twitter Bootstrap.
- One PHP script that serves relevant data in JSON-format.
- Some Javascript/jQuery code to put the data in relevant Google Charts diagrams.
You could start with the skeletoncode.


Strategic Questions

The following questions should be answered by querying the data or using the dashboard. For each question, always elaborately motivate your answers based on the data, for example by giving queries or instructions to use the dashboard.

  1. [10p] Visualization For each component of your dashboard, explain how you mapped data attributes to visual attributes, and why you made various visualization or interaction design choices. How do these choices help management understand the data, and to what extent have you implemented "Codeless reporting"?
  2. [10p] Decision Making: For next year, company management has budget to start expanding in three countries. Based on the data, what are the emerging countries in which expansion could be worthwile? State how you define "emerging". Try to do more than counting transactions. For example, look for a trend, or also incorporate volume or customer behavior.
  3. [10p] Fraud and Compliance: Gaming companies have a responsibility towards society to prevent addiction and excessive spending patterns. A customer that makes more than 50 orders within one day can be considered addicted, but (assuming that the coins have value and can be resold) could also be committing fraud (for example, with a stolen credit card or phone). How do you identify a customer when a customer can have multiple account names in the game? How do you propose to distinguish between addiction and fraud based on the data? What are the results of applying your metric to the data, i.e., who are potential fraudulent users?

Good luck with the assignment! Ask questions. A lot, if you have to. The deadline is posted on the course website.




Hints / getting started

Webdevelopment know-how
A crash-course on webdevelopment is available. In addition, tutorials can be found at W3 Schools (HTML, CSS, javascript, jQuery, Bootstrap). However, fluency in each of these languages is not per se required; some skeleton code is available, which may serve as a good starting point.

Webspace at LIACS
Each user should have SSH-access to the liacs.leidenuniv.nl webserver when connecting from the workstations in computer rooms, or via WiFi 'wlan 3'. Your can create your own website by placing files, e.g., an index.html file, in the ~/public_html folder on this server. Your website is visible at http://liacs.leidenuniv.nl/~s.... Also see generic instructions: LIACS-ISSC webserver

MySQL
A simple way to load data from file.sql into MySQL is via the command line while on the webserver or a LIACS desktop machine, for example as follows:
mysql -h mysql.liacs.leidenuniv.nl -u username -p username < file.sql
Username is your ULCN username, password is your MySQL password. The second username is to select the database name, which is equal to your username.

Mounting your webspace on your local machine
There is a possibility to conveniently mount the webserver's public_html folder on your own machine. This way, your webspace just looks like a regular folder on your computer. From within the university this can be done by issuing the following commands:
mkdir public_html
sshfs s...@liacs.leidenuniv.nl:/home/s.../public_html ./public_html
The first command creates in your current working directory on your machine a folder public_html. This needs only to be done when this folder is not yet present. The second command mounts the remote folder ~/public_html on liacs.leidenuniv.nl in your local folder. Note that this second command should be run each after each logout or reboot.

Mounting your webspace from outside the university
The ssh-server at liacs.leidenuniv.nl is only accessible from the computer workstations or from the ssh gateways sshgw.leidenuniv.nl or ssh.liacs.nl.
Starting from OpenSSH 7.3, released August 2016, the easiest way to configure this is using a SSH config file, which needs to be placed at ~/.ssh/config. Here is an example:

Host sshgw.leidenuniv.nl 
    User s...... 

Host liacs.leidenuniv.nl
    User s...... 
    ProxyJump sshgw.leidenuniv.nl
                    
Given that you have an empty directory ./public_html, you can connect in the same way as you did at the local machines:
sshfs liacs.leidenuniv.nl:/home/s.../public_html ./public_html
Note that this command needs to be issued every time you restart your computer.