Social Network Analysis for Computer Scientists

This is the old course page! See the new page for recent information!

This course deals with computer science (CS) aspects of social network analysis (SNA), and is open to all students in the master computer science programme at Leiden University.

News
- Comments from your evaluation forms of the fall 2017 edition of the course are available.
- The final course project paper can be handed in via Dropbox. The deadline of December 12 is final and strict.
- A short guide on copying assignment files to your machine via SCP is available.
- A new version of Gephi was recently released. It requires a different input format, which may not be compatible with that of the assignment data. Therefore, keep using Gephi 0.9.1.

Course information

Lectures: Fridays from 11:00 to 12:45 (Sep. 8 - Dec. 1) in Snellius room B02
Lab sessions (not every week): Fridays from 9:00 to 10:45 in room 302/304
Lecturer: dr. Frank Takes - f.w.takes@liacs.leidenuniv.nl, room 157b
Teaching assistants: Hanjo Boekhout BSc (h.d.boekhout@umail.leidenuniv.nl) and
Jesper van Engelen BSc (j.e.van.engelen@umail.leidenuniv.nl, Snellius room 120)
Prerequisites: a CS(-related) bachelor; mainly courses on Algorithms, Data Structures and Data Mining
Literature: provided papers and book chapters (free and digitally available)
Examination: based on presentation, paper, programming, peer review and participation (no exam)
Study points: 6 ECTS

Network with 1458 nodes and 1948 edges.

Course schedule

		Dec. 13	Dec. 15
	Date	First slot (9:00-10:45)	Second slot (11:00-12:45)
1.	Sep 8, 2017		Lecture 0: Course organization Lecture 1: Introduction and small world phenomenon
2.	Sep 15, 2017	Lab session: Gephi Work on Assignment 1	Lecture 2: Advanced network concepts & Distance-based centrality
3.	Sep 22, 2017	Lab session: NetworkX Work on Assignment 1	Lecture 3: Network formation & Community detection
3.5.	Sep 29, 2017	Work on Assignment 1	No lecture. Work on Assignment 1
	Oct 2, 2017	Deadline for Assignment 1: Click here to upload your assignment
4.	Oct 6, 2017	Work on Assignment 2	Lecture 4: Structure of the web & Propagation-based centrality Example presentation: Diameter computation
5.	Oct 13, 2017	Work on Assignment 2	Lecture 5: Dynamic network models
6.	Oct 20, 2017	Work on Assignment 2	Student presentation 1: Disease spread ~~Student presentation 2: Neighborhoods~~ (moved to later)
7.	Oct 27, 2017	Work on Assignment 2	Student presentation 3: Signed link prediction Student presentation 4: Network motifs 1
	~~Oct 30~~ Nov 1, 2017	Deadline for Assignment 2: Click here to upload your assignment
8.	Nov 3, 2017	Student presentation 5: Community detection 2 Student presentation 6: Counting triangles	Student presentation 7: Network visualization 2 Student presentation 8: Network de-anonymization
	Nov 10, 2017	Deadline for the first three chapters of the course project paper (for the peer review session)
9.	Nov 10, 2017	Student presentation 9: Network visualization 1 Student presentation 10: Betweenness centrality 1	Peer review session
10.	Nov 17, 2017	Student presentation 11: Diameter computation Student presentation 12: Influence spread 1	Student presentation 13: Betweenness centrality 2 Student presentation 14: Closeness centrality 2
	Nov 24, 2017	Deadline for a substantial amount of code / experimental pipeline of the course project (for the code review session)
11.	Nov 24, 2017	Student presentation 15: Shortest paths 1 Student presentation 162: Neighborhoods	Code review session: Best practices in a collaborative document
	Nov 20 23 27, 2017	Deadline (optional) for preliminary version of course project paper: Click here to upload
12.	Dec 1, 2017	Student presentation 17: Personalized PageRank Student presentation 18: Sampling methods	Student presentation 19: Minimum weight cycles Student presentation 20: Shortest paths 2
13.	Dec 8, 2017	Student presentation 21: Link prediction Student presentation 22: Graph compression	Student presentation 23: Influence spread 2 Student presentation 24: Network dissolution and decline
	Dec 12, 2017	Deadline for the final version of the course project paper: Click here to upload
	Dec 13 and 15, 2017	Course project evaluation sessions with student teams, from 9.00 to 17.00 in slots of 30 minutes.
		9:00 Betweenness centrality 1	10:00 Minimum weight cycles
		9:30 Betweenness centrality 2	10:30 Neighborhoods
		10:00 Closeness centrality	11:00 Network dissolution and decline
		10:30 Community detection	~~16:00~~ 11:30 Visualization 2
		11:00 Counting Triangles	~~12:00~~ 9:30 De-anonymization
		11:30 Network motifs	13:00 Personalized PageRank
		12:00 Disease Spread	13:30 Sampling Methods
		13:30 Graph Compression	14:00 Shortest paths 1
		14:00 Influence spread 1	14:30 Shortest paths 2
		14:30 Influence spread 2	15:00 Signed link prediction
		15:00 Link prediction	15:30 Visualization 1
	Dec. 22, 2017	Course end. Grades will be submitted to the student administration

Lab session Friday September 22 in room 302/304

We will again use the UNIX/Linux workstations in Snellius room 302/304 for this lab session.
There is again no deliverable for this lab session. The main goal of this lab session is to become familiar with NetworkX (a Python package to analyze networks for research purposes).
All relevant information on NetworkX can be found in the NetworkX online documentation.

Python instructions to use NetworkX: NetworkX should be installed on all machines, but you may have to enter the right "environment". See /vol/share/software/datascience/2016-README. If you run into issues, it may be productive to try Anaconda.

Take some time to do the NetworkX tutorial.
Try to load the network from the first lab session (/vol/share/groups/liacs/scratch/SNACS/small-gephiready.in) into NetworkX, and investigate some characteristic properties of the network, such as the degree distribution. Use the read_edgelist function.
Generate a visualization of this network with NetworkX, and output it to pdf. Play around with the visualization parameters.
Get the Epinions network from the SNAP repository. Can you visualize it?
Compute metrics like betweenness centrality and visualize the result by making the size of a node dependent on this measure.
What are the differences between NetworkX and Gephi in terms of visualization and analysis capabilities?
Proceed with Exercise 2 of Assignment 1.

Looking for a challenge? Check out Graph-Tool, a python graph analysis toolkit that leaves the hard computation to parallel (OpenMP) C++ code. Or snap, which is entirely written in C++ (although there is now also a Python plugin).

Lab session Friday September 15 in room 302/304

We will use the UNIX/Linux Ubuntu workstations in Snellius room 302/304 for this lab session. If you are not familiar with these workstations, then some information is provided by the ISSC, see their pages on the workstations and SSH access. Also see this ICT FAQ.
If you really must, you can, at the risk of bugs and problems that course personnel can probably not help you solve, attempt to use Windows or the even more unstable version of Gephi for Macintosh.
There is no deliverable for this lab session. The main goal of this lab session is to become familiar with Gephi (experimental beta-software to visualize networks for research purposes) and its input format. At the end of this session you should be able to visualize raw network data with labeled nodes and labeled and/or weighted edges (directed or undirected), and you should understand how to map edge and node size and color to structural network properties.

Extract and run Gephi locally on a student workstation (running it from your homedirectory may be slow), for example on the local disk in /scratch/gephi. Installation instructions, if needed, can be found at the Gephi website.
You must adjust one line in /etc/gephi.conf so that Gephi can find the right Java version:
jdkhome="/usr/lib/jvm/java-8-openjdk-amd64/" (and remove the # in front of the line)
Take some time to do the three Gephi tutorials at the Gephi website:
- Quick start,
- Visualization and
- Layouts
The Quick Start Guide is for a slightly older version of Gephi. On p. 9-10, contrary to what is indicated in the getting started guide, the ranking module is part of the appearance module (since Gephi 0.9), and does not appear as a separate tab. The “Ranking result” table no longer exists, and you can skip that part of the tutorial. In the tutorial on Visualization, on p. 9 the 3-D visualisation option was bugged and has been removed in Gephi 0.9. You can skip that part of the tutorial.
Try out some of the Gephi datasets.
Visualize the network represented by the edge list /vol/share/groups/liacs/scratch/SNACS/small-gephiready.in. Create a New Project, go to Data Laboratory and Import Spreadsheet. Choose tab as separator and Edges table as format.
Play around with some different visualization algorithms such as ForceAtlas 2 (for example, with scaling set to 100 and with stronger gravity checked) and Fruchterman Reingold.
Compute the betweenness centrality (a measure of node centrality/importance, as we will learn later) of each node via the "Network Diameter" button, and then use the "Ranking" tab on the left side to visualize the node's betweenness values and degree values using node color and node size.
Make a new, very simple, edge list input file of a network with say 10 nodes and 20 edges, and import it into Gephi. Now add to the input file (in some order) link (un)direction, link weights, node labels and link labels. For example, have nodes represent students, and link represent friendships, and choose node and edge labels accordingly.Make sure that you are able understand the difference in file input and visualization output.
Get started with Exercise 2 of Assignment 1.

Reading material

Students with non-CS backgrounds have expressed interest in additional reading material to help freshen up on skills and knowledge required for this course.

To brush up on your Python programming skills, you could do Lecture 0, 1 and 4 of these Scientific Python Lectures.
To freshen up your LaTeX skills, you could do for example this tutorial.
MIT has a nice online course on discrete mathematics. Relevant chapters: 5, 6 and 7.
The (Dutch) Leiden courses Algoritmiek (Algorithms) and Datastructuren (Data structures) might be also interesting to browse through.
Basic knowledge of the concepts in machine learning and data mining are provided, for example, at the Wikipedia machine learning portal.

Course project

Teams work on a course project for 60% of the course grade. For this project they pick one of the topics listed below. Each topic has an associated list of papers which have to be studied. Papers on a particular topic all consider different algorithms or methods for solving roughly the same problem. The goal of the project is to make a comparison of at least two of these papers based on experiments using real social network data. So the project consists of:

Giving a 30-minute presentation (+15 minutes for questions) on one of the papers on their topic. Some topics include overview papers, these should not be presented. A paper presenting at least some experiments, should be chosen. At least a Powerpoint/PDF presentation and some demonstration of an implementation or visualization will be required. Teams are also expected to provide feedback on some of the presentations given by their fellow students during the lectures. Students should send the name of the paper they will present at least two weeks before the presentation to the lecturer.
Gathering and implementing the algorithms and/or techniques from the different papers, and running experiments on at least five large real-world network datasets. Thus, some programming has to be done. Teams will also give feedback on the code produced by other teams in the code review session. Some papers introduce multiple techniques. In that case, choose a logical subset to compare, and motivate your choice.
Datasets can for example be found at SNAP, KONECT, BigDND, Networks Repository, ICON, LASAGNE and ASU. Certain topics need particular datasets (e.g. with timestamps, signed links, etc.), which should of course be taken into account when selecting datasets.
As a rule, the course project paper should always have one novel (new) aspect compared to the papers on which it is based. For example, a new tweak to an existing algorithm, a really large number of datasets to test the algorithms to find a relation between the network characteristics and the performance, a new performance metric to evaluatie the algorithms, a new type of visualization of the algorithm or results, an improvement of a proof related to the algorithm, etc.
Writing one 6 to 10 page paper. In the paper the different techniques are analyzed and compared in detail using extensive experiments. The paper, to be written in LaTeX, has to follow the format of an actual scientific paper. Students will also give feedback on the paper produced by another team in the peer review session.

Topics

Suggestions for topics are listed below, and you are encouraged to pick one of these topics (as a team). You are also allowed to suggest your own networks-related problem (topic) in the field of social network analysis, assuming you are have found at least two scientific CS papers that suggest different techniques or algorithms for solving this problem (so, addressing this same topic).

If you choose a topic, then you are also encouraged to look up additional papers in this problem domain to study and compare to the paper that you presented, as the papers to study are merely "simple" suggestions. For some topics, there exists an overview/review paper that discusses many more papers that study the same (type of) problem.

Note: scientific papers (ACM, Elsevier, etc.) can often only be opened from within the university domain (or from home via SSH/VPN/etc.).
IEEE Explore papers can often be opened by looking them up via computer.org. Alternative links and preprints of papers can often be found through Google Scholar.

Instructions for the course project paper »

Project Topics

This list of projects is shown below. Take a look at the papers, before choosing a topic. E-mail f.w.takes@liacs.leidenuniv.nl with your topic preference and names of the two team members. First come, first serve. Your topic is not confirmed until you receive an e-mail from me. You have until September 29 to e-mail your preferred topic. After that, a final division of topics and formation of teams will be made on October 6 during the lecture. Prefix [Chosen] indicates a topic is no longer available.

[Chosen] Shortest paths 1 :
"Fast shortest path distance estimation in large networks" [Pot09]
"Fast and accurate estimation of shortest paths in large graphs" [Gub10]
"A highway-centric labeling approach for answering distance queries on large sparse graphs" [Jin12]
"Fast exact shortest-path distance queries on large networks by pruned landmark labeling" [Tak13]
[Chosen] Shortest paths 2:
"IS-Label: an independent-set based labeling scheme for point-to-point distance querying" [Wai13]
"Approximate Shortest Distance Computing: A Query-Dependent Local Landmark Scheme" [Mia12]
"Orion: Shortest Path Estimation for Large Social Graphs" [Zha10]
[Chosen] Diameter computation:
"Finding the Diameter in Real-World Graphs" [Cre10]
"Fast and Simple Approximation of the Diameter and Radius of a Graph" [Boy06]
"Fast computation of empirically tight bounds for the diameter of massive graphs" [Mag09]
"Determining the diameter of small world networks" [Tak11]
[Chosen] Sampling methods:
"Sampling from Large Graphs" [Les06]
"Walking in Facebook: A Case Study of Unbiased Sampling of OSNs" [Gjo2010] and
"Crawling Online Social Graphs" [Ye2010]
[Chosen] Graph Compression:
"The webgraph framework I: compression techniques" [Bol2004]
"GPS: a graph processing system" [Sal2013]
"On compressing social networks" [Chi2009]
"GraphChi: Large-Scale Graph Computation on Just a PC" [Kyr2012]
[Chosen] Betweenness centrality 1:
"A measure of betweenness centrality based on random walks" [New05]
"Approximating Betweenness Centrality" [Bad07]
"Better Approximation of Betweenness Centrality" [Gei08]
"A faster algorithm for betweenness centrality" [Bran01]
[Chosen] Betweenness centrality 2:
"Identifying high betweenness centrality nodes in large social networks" [Kou12]
"Topology manipulations for speeding betweenness centrality computation" [Puz15]
"Fast approximation of betweenness centrality through sampling" [Rio15]
Closeness centrality 1:
"Fast Approximation of Centrality" [Epp04]
"Centrality estimation in large networks" [Bra07] and
"Ranking of Closeness Centrality for Large-Scale Social Networks" [Kaz08]
[Chosen] Closeness centrality 2:
"Fast and Simple Computation of Top-k Closeness Centralities" [Bor15]
"Incremental Algorithms for Closeness Centrality" [Sar13]
"Efficient top-k closeness centrality search" [Ols14]
[Chosen] Link prediction:
"Link Prediction using Supervised Learning" [Has06]
"Link Prediction in Relational Data" [Tas03] and
"New perspectives and methods in link prediction" [Lic10]
[Chosen] Influence spread and virality 1:
"Maximizing the spread of influence through a social network" [Kem03]
"A data-based approach to social influence maximization" [Goy11] and
"How to win friends and influence people, truthfully: influence maximization mechanisms for social networks" [Sin12]
[Chosen] Influence spread and virality 2:
"Scalable influence maximization for prevalent viral marketing in large-scale social networks" [Che10]
"Competitive Influence Maximization in Social Networks" [Bha07]
"Efficient influence maximization in social networks" [Che09]
"Social Network Sensors for Early Detection of Contagious Outbreaks [Chr10]
"CELF++: optimizing the greedy algorithm for influence maximization in social networks" [Goy11]
[Chosen] Disease spread:
"Spread of epidemic disease on networks" [New02]
"Information diffusion through blogspace " [Gru04]
"Contagion in financial networks" [Gai10]
[Chosen] Community detection 1:
"Finding and evaluating community structure in networks" [New04]
"Community structure in social and biological networks" [Gir02]
[Chosen] Community detection 2:
"Fast unfolding of communities in large networks" [Blo08]
"Faster unfolding of communities: speeding up the Louvain algorithm" [Tra15]
[Chosen] Visualization algorithms 1:
"Graph drawing by force-directed placement" [Fru91]
"ForceAtlas2, a Continuous Graph Layout Algorithm for Handy Network Visualization Designed for the Gephi Software" [Jac14] and
"Topology Preserving Constrained Graph Layout" [Tol09]
[Chosen] Visualization algorithms 2:
"OpenOrd: an open-source toolbox for large graph layout" [Mar11]
"Efficient, high-quality force-directed graph drawing" [Yu05]
[Chosen] Network motifs 1:
"The gaston tool for frequent subgraph mining" [Nij05]
"Frequent subgraph discovery" [Kur01]
"Efficient mining of frequent subgraphs in the presence of isomorphism" [Hua03]
Network motifs 2:
"g-tries: an efficient data structure for discovering network motifs" [Rib10]
"A faster algorithm for detecting network motifs" [Wer05]
[Chosen] De-Anonymization of networks:
"Prediction Promotes Privacy in Dynamic Social Networks" [Bha10]
"A Practical Attack to De-anonymize Social Network Users" [Won10]
[Chosen] Network dissolution and decline: "DecLiNe -- Models for Decay of Links in Networks" [Pre14]
"Loosing Frieds on Facebook" [Que12]
"More of a Receiver Than a Giver: Why Do People Unfollow in Twitter?" [Kwa12]
[Chosen] Counting triangles: "Doulion: counting triangles in massive graphs with a coin" [Tso09]
"Main-memory triangle computations for very large (sparse (power-law)) graphs" [Lat08]
"Triangle listing in massive networks and its applications" [Chu11]
[Chosen] Neighborhoods:
"ANF: a fast and scalable tool for data mining in massive graphs" [Pal02]
"R-MAT: A Recursive Model for Graph Mining" [Cha02]
"Fast Approximation of Centrality" [Epp04]
[Chosen] Link prediction in signed networks:
"Predicting positive and negative links in online social networks" [Les10]
"Predicting Trust and Distrust in Social Networks " [DuB11] and
" Friend or frenemy?: predicting signed ties in social networks " [Yan12]
[Chosen] Personalized PageRank:
"Quick Detection of Top-k Personalized PageRank Lists" [Avr11]
"Fast algorithms for Top-k personalized PageRank queries" [Gup08]
"Fast incremental and personalized PageRank" [Bah10]
Multiplex networks:
"Diffusion Dynamics on Multiplex Networks" [Gom13]
"Structural measures for multiplex networks" [Bat14]

This list will be updated regularly as teams are formed and topics are chosen.

Course description

See the e-Studyguide for a more general description.

Topics include: SNA from a CS perspective (graph representation, complexity issues, examples), Graph Structure (power law, small world phenomenon, clustering coefficient, hierarchies), Paths and Distances (neighborhoods, radius, diameter), Spidering and Sampling (BFS, forest fire, random walks), Graph Compression (graph grammars, bitwise tricks, encryption, hashing), Centrality (degree centrality, closeness centrality, betweenness centrality, rating and ranking), Centrality and Webgraphs (HITS, PageRank, structure of the web), Community Detection (spectral clustering, modularity), Visualization (force-based algorithms, Gephi, NodeXL), Graph Models (random graphs, preferential attachment), Link Prediction (structure, semantics, prediction algorithms, graph mining), Contagion (diffusion of information, spreading activation, gossipping) and Privacy and Anonymity ((de-)anonymizing graphs, ethical aspects, privacy issues).

The course was also given in 2014, 2015 and 2016.