Social Network Analysis for Computer Scientists

This is the old course page! See the new page for recent information!

News
- The textual comments from the 2016 evaluation forms are available here.
- As announced, all course projects have been graded. Thank you all for your participation! There is an opportunity to come collect feedback on January 10 or 13.

Course information

Lectures: Fridays from 11:15 to 13:00 in the fall semester of 2015 in Snellius room ~~402~~ ~~403~~ 313
Lab sessions (not every week): Fridays from 9:00 to 10:45 in Snellius room 302/304
Lecturer: dr. Frank Takes - f.w.takes@liacs.leidenuniv.nl, room 157b
Teaching assistant: Govert Brinkmann - g.g.brinkmann@umail.leidenuniv.nl
Prerequisites: a CS(-related) bachelor; in particular basic knowledge of Algorithms and Data Mining
Literature: provided papers and book chapters (free and digitally available)
Examination: based on presentation, paper, programming, peer review and participation (no exam)
Study points: 6 ECTS

Network with 1458 nodes and 1948 edges.

Course schedule

Date

First slot (9:00-10:45)

Second slot (11:15-13:00)

Sep 9, 2016

Lecture 1: Introduction and small world phenomenon

Sep 16, 2016

Lab session: Gephi and Assignment 1

Lecture 2: Network topology and centrality

Sep 23, 2016

10.00: Lecture 3: Guest lecture by prof. dr. P. Cresenzi (University of Florence)

11.15: Lab session: NetworkX and Assignment 1

Sep 30, 2016

Lab session: work on Assignment 1

Lecture 4: Network formation and community detection
Presentation 0

Oct 5, 2016

Deadline for Assignment 1

Oct. 7, 2016

Lab session: work on Assignment 2

Lecture 5: Network evolution and models

Oct. 14, 2016

Lab session: work on Assignment 2

No lecture

Oct. 21, 2016

Lab session: work on Assignment 2

Lecture 6: Structure of the web and network visualization

Oct. 28, 2016

Lab session: Course project framework
and work on Assignment 2

Presentation 1: Sampling methods (LR)
Presentation 2: Virality & Outbreak detection 2 (HR)

Nov 2, 2016

Deadline for Assignment 2

Nov. 4, 2016

Presentation: Graph compression/representation (EZ)
Presentation: Shortest paths 1 (BB)

Presentation: Shortest paths 2 (KL)
Presentation: Signed link prediction (FW)

Nov. 11, 2016

We start at 10.00.
Presentation: Betweenness centrality (HP)

Presentation: Link prediction (DG)
Presentation: Community detection 1 (QR)

10.

Nov. 18, 2016

Peer review session

Guest lecture by André Vermeij (Kenedict Innovation Analytics)

11.

Nov. 25, 2016

Presentation: Graph Partitioning (DH)
Presentation: Network dissolution (GJ)

Presentation: Network motifs (DS)
Presentation: Diameter computation (SZ)

Nov. 29, 2016

Deadline for Intermediary Paper Check

12.

Dec. 2, 2016

Presentation: Network de-anonymization (PH)
Presentation: Closeness centrality 2 (RH)

Code review session

13.

Dec. 9, 2016

Presentation: Closeness centrality 1 (SB)
Presentation: Virality & Outbreak detection 1 (BK)

Presentation: Advances in community detection (TnT)
Presentation: Personalized PageRank (GR)

Dec. 14, 2016

Deadline for Course Project Paper

Dec. 16, 2016

Evaluation sessions

09:00 - 09:25	Shortest Paths 1
09:25 - 09:50	Virality and outbreak detection 1
09:50 - 10:15	Closeness Centrality 1
10:15 - 10:40	Network motifs
10:40 - 11:05	Link prediction
11:05 - 11:30	Community detection 2
11:30 - 11:55	Graph compression/representation
11:55 - 12:20	Signed Link Prediction
12:20 - 12:45	Network dissolution
12:50 - 13:15	Personalized PageRank
13:15 - 13:40	Closeness Centrality 2
13:40 - 14:05	Virality and outbreak detection 2
14:05 - 14:30	Advances in Community detection
14:30 - 14:55	De-Anonymization of networks
14:55 - 15:20	Betweenness Centrality
15:20 - 15:45	Shortest Paths 2
15:45 - 16:10	Sampling methods
16:10 - 16:35	Community detection 1
16:35 - 17:00	Diameter computation

Dec. 22, 2016

Course end. Grades will be submitted to the student administration

Assignments

About Assignment 1

Some hints:

Some hints for Question 2:
- By default, NetworkX creates an undirected graph when using nx.read_edgelist(). You can use the create_using=nx.DiGraph() argument of nx.read_edgelist() to specify that it should be loaded as directed. Examples and documentation can be found in the documentation.
- The nx.draw() function doesn't give you much control over the layout algorithm that is used to position the nodes (like Gephi does). You can first compute the positions of the nodes using one of the layout algorithms (nx.spectral_layout, nx.spring_layout, etc.), store these positions and pass them to nx.draw() as an argument. This gives you some more control over the algorithm, in case you want to use NetworkX instead of Gephi for the visualization. Also see the documentation.
A clarification w.r.t. Question 1.5: you may assume that nodes are integers and that the adjacency list is stored as an vector of vectors, of which the size is known. In terms of implementation, think of an array of arrays, where array element A[i] contains the array of neighbors of A[i].
Assignment 1 can also be handed in via Dropbox (for example, when you have large visualization PDFs that cannot be e-mailed).

About Assignment 2

Assignment 2 can also be handed in via Dropbox.

Lab session Friday October 28 in room 302/304

We will work on a framework for the course project.
The idea is that this session will help you set up your experimental pipeline for the project.

Code and input

See if there is an (adjustable) implementation of the code available, for example from the author's personal website or research group website.
If the above is not the case, then determine which parts you will have to implement yourself.
What input data is expected to perform experiments? A network, or also parameters?

Data sources

From your papers, extract which kind of network data (directed/undirected, signed, labeled, expected size, timestamped, etc. etc.) is expected.
Go through the list of datasets, for example at KONECT, and determine which ones are feasible for your project.
Try to ensure sufficient diversity among the datasets. Can you write some code to report on the basic properties of the datasets, relevant to your research?

Output and results

What is the output of your algorithms? Can you automatically store this output as a number, or a list? For later processing.
Ensure that you can measure at least time usage, for example using the Linux time command. How about memory usage?
How do you measure the quality of the result?
Can the output of your algorithms be used to plot the result, or to automatically make result tables for your paper?

As a final step, consider glueing the above together, for example using some simple bash scripts that read all network datasets from a particular data/ folder, then run these datasets on some executables in a bin/ folder, storing the results for each dataset in a results folder or file. Here you can heavily rely on standard input/output redirection.

Recall that the above is meant as a guideline; other set-ups are also possible.

Lab session Friday September 16 in room 302/304

We will use the UNIX/Linux workstations in Snellius room 302/304 for this lab session. If you really must, you can attempt to use Windows, fully at your own risk (you will at least need to edit gephi.conf so that it points to the correct Windows JRE version 6 or 7 (and not 8)). Some information on the workstations is provided by the ISSC, see their pages on the workstations and SSH access. Also see this ICT FAQ.
There is no deliverable for this lab session. The main goal of this lab session is to become familiar with Gephi (experimental beta-software to visualize networks for research purposes) and its input format. At the end of this session you should be able to visualize raw network data with labeled nodes and labeled and/or weighted edges (directed or undirected), and you should understand how to map edge and node size and color to structural network properties.

Extract and run Gephi locally on a student workstation (running it from your homedirectory may be slow or not work at all), for example on the local disk in /scratch/gephi. Installation instructions, if needed, can be found at the Gephi website.
You must adjust one line in /etc/gephi.conf so that Gephi can find the right Java version:
jdkhome="/usr/lib/jvm/java-7-openjdk-amd64/" (and remove the # in front of the line)
Take some time to do the three Gephi tutorials at the Gephi website:
- Quick start,
- Visualization and
- Layouts
Try out some of the Gephi datasets.
Visualize the network represented by the edge list /vol/share/groups/liacs/scratch/SNACS/small-gephiready.in
Play around with some different visualization algorithms such as ForceAtlas 2 (with stronger gravity) and Fruchterman Reingold.
Compute the betweenness centrality (a measure of node centrality/importance, as we will learn later) of each node via the "Network Diameter" button, and then use the "Ranking" tab on the left side to visualize the node's betweenness values and degree values using node color and node size.
Make a new, very simple, edge list input file of a network with say 10 nodes and 20 edges, and import it into Gephi. Now add to the input file (in some order) link direction, link weights, node labels and link labels. Make sure that you are able understand the difference in file input and visualization output.
Get started with Exercise 2 of Assignment 1.

Reading material

Some students with non-CS backgrounds have expressed interest in additional reading material to help freshen up on skills and knowledge required for this course.

To brush up on your Python programming skills, you could do Lecture 0, 1 and 4 of these Scientific Python Lectures.
To freshen up your LaTeX skills, you could do for example this tutorial.
MIT has a nice online course on discrete mathematics. Relevant chapters: 5, 6 and 7.
The (Dutch) Leiden courses Algoritmiek (Algorithms) and Datastructuren (Data structures) might be also interesting to browse through.
Basic knowledge of the concepts in machine learning and data mining are provided, for example, by the Wikipedia machine learning portal.

Course project

Teams work on a course project for 60% of the course grade. For this project they pick one of the topics listed below. Each topic has an associated list of papers which have to be studied. Papers on a particular topic all consider different algorithms or methods for solving roughly the same problem. The goal of the project is to make a comparison of at least two of these papers based on experiments using real social network data. So the project consists of:

Giving a 30-minute presentation (+15 minutes for questions) on one of the papers on their topic. Some topics include overview papers, these should not be presented. A paper presenting at least some experiments, should be chosen. At least a Powerpoint/PDF presentation and some demonstration of an implementation or visualization will be required. Teams are also expected to provide feedback on some of the presentations given by their fellow students during the lectures. Students should send the name of the paper they will present at least two weeks before the presentation to the lecturer.
Gathering and implementing the algorithms and/or techniques from the different papers, and running experiments on at least five large real-world network datasets. Thus, some programming has to be done. Teams will also give feedback on the code produced by other teams in the code review session. Some papers introduce multiple techniques. In that case, choose a logical subset and motivate your choice.
Datasets can for example be found at SNAP, KONECT, BigDND, Networks Repository, ICON, LASAGNE and ASU. Certain topics need particular datasets (e.g. with timestamps, signed links, etc.), which should of course be taken into account when selecting datasets.
Writing one 6 to 10 page paper. In the paper the different techniques are analyzed and compared in detail using extensive experiments. The paper, to be written in LaTeX, has to follow the format of an actual scientific paper. Students will also give feedback on the paper produced by another team in the peer review session.

Topics

Suggestions for topics are listed below, and you are encouraged to pick one of these topics (as a team). You are also allowed to suggest your own problem (topic) in the field of social network analysis, assuming you are have found at least two scientific CS papers that suggest different techniques or algorithms for solving this problem (so, addressing this same topic). If you choose a topic, then you are also encouraged to look up additional papers in this problem domain to study and compare to the paper that you presented, as the papers to study are merely "simple" suggestions. For some topics, there exists an overview/review paper that discusses many more papers that study the same (type of) problem.

Note: scientific papers (ACM, Elsevier, etc.) can often only be opened from within the university domain (or from home via SSH/nxClient/VPN/etc.).
IEEE Explore papers can often be opened by looking them up via computer.org. Alternative links and preprints of papers can often be found through Google Scholar.

Instructions for the course project paper » (2015 version still)

Project Topics

[Chosen] Shortest paths 1 :
"Fast shortest path distance estimation in large networks" [Pot09]
"Fast and accurate estimation of shortest paths in large graphs" [Gub10],
"A highway-centric labeling approach for answering distance queries on large sparse graphs" [Jin12] and perhaps
"Fast exact shortest-path distance queries on large networks by pruned landmark labeling" [Tak13]
[Chosen] Shortest paths 2 :
"IS-Label: an independent-set based labeling scheme for point-to-point distance querying" [Wai13]
"Approximate Shortest Distance Computing: A Query-Dependent Local Landmark Scheme" [Mia12],
"Orion: Shortest Path Estimation for Large Social Graphs" [Zha10]
[Chosen] Diameter computation:
"Finding the Diameter in Real-World Graphs" [Cre10]
"Fast and Simple Approximation of the Diameter and Radius of a Graph" [Boy06],
"Fast computation of empirically tight bounds for the diameter of massive graphs" [Mag09] and perhaps
"Determining the diameter of small world networks" [Tak11]
[Chosen] Sampling methods:
"Sampling from Large Graphs" [Les06]
"Walking in Facebook: A Case Study of Unbiased Sampling of OSNs" [Gjo2010] and
"Crawling Online Social Graphs" [Ye2010]
[Chosen] Graph compression/representation:
"The webgraph framework I: compression techniques" [Bol2004]
"GPS: a graph processing system" [Sal2013],
"On compressing social networks" [Chi2009] and perhaps
"GraphChi: Large-Scale Graph Computation on Just a PC" [Kyr2012]
[Chosen] Betweenness centrality:
"A measure of betweenness centrality based on random walks" [New05]
"Approximating Betweenness Centrality" [Bad07],
"Better Approximation of Betweenness Centrality" [Gei08] and perhaps
"A faster algorithm for betweenness centrality" [Bran01]
[Chosen] Closeness centrality 1:
"Fast Approximation of Centrality" [Epp04]
"Centrality estimation in large networks" [Bra07] and
"Ranking of Closeness Centrality for Large-Scale Social Networks" [Kaz08]
[Chosen] Closeness centrality 2:
"Fast and Simple Computation of Top-k Closeness Centralities" [Bor15]
"Incremental Algorithms for Closeness Centrality" [Sar13]
"Efficient top-k closeness centrality search" [Ols14]
Neighborhoods:
"ANF: a fast and scalable tool for data mining in massive graphs" [Pal02]
"R-MAT: A Recursive Model for Graph Mining" [Cha02] and perhaps
"Fast Approximation of Centrality" [Epp04]
[Chosen] Link prediction in signed networks:
"Predicting positive and negative links in online social networks" [Les10]
"Predicting Trust and Distrust in Social Networks " [DuB11] and
" Friend or frenemy?: predicting signed ties in social networks " [Yan12]
[Chosen] Personalized PageRank:
"Quick Detection of Top-k Personalized PageRank Lists" [Avr11]
"Fast algorithms for Top-k personalized PageRank queries" [Gup08],
"Fast incremental and personalized PageRank" [Bah10]
[Chosen] Link prediction:
"Link Prediction using Supervised Learning" [Has06]
"Link Prediction in Relational Data" [Tas03] and
"New perspectives and methods in link prediction" [Lic10]
[Chosen] Virality and outbreak detection (1):
"Maximizing the spread of influence through a social network" [Kem03]
"A data-based approach to social influence maximization" [Goy11] and
"How to win friends and influence people, truthfully: influence maximization mechanisms for social networks" [Sin12]
[Chosen] Virality and outbreak detection (2):
"Scalable influence maximization for prevalent viral marketing in large-scale social networks" [Che10]
"Competitive Influence Maximization in Social Networks" [Bha07], "Efficient influence maximization in social networks" [Che09]
"Social Network Sensors for Early Detection of Contagious Outbreaks [Chr10] and perhaps
"CELF++: optimizing the greedy algorithm for influence maximization in social networks" [Goy11]
[Chosen] Community detection 1:
"An Algorithm for Partitioning the Nodes of a Graph" [Bar81]
"A clustering algorithm based on graph connectivity" [Har00],
"Finding and evaluating community structure in networks" [New04] and perhaps
"Community structure in social and biological networks" [Gir02]
[Chosen] Graph Partitioning:
"A fast and high quality multilevel scheme for partitioning irregular graphs" [Kar06]
"An Efficient Heuristic Procedure for Partitioning Graphs" [Ker70] and
"Partitioning graphs into balanced components" [Kra09]
Visualization algorithms (1):
"Graph drawing by force-directed placement" [Fru91]
"ForceAtlas2, a Continuous Graph Layout Algorithm for Handy Network Visualization Designed for the Gephi Software" [Jac14] and
"Topology Preserving Constrained Graph Layout" [Tol09],
Visualization algorithms (2):
"OpenOrd: an open-source toolbox for large graph layout" [Mar11]
"Efficient, high-quality force-directed graph drawing" [Yu05]
[Chosen] Finding network motifs:
"g-tries: an efficient data structure for discovering network motifs" [Rib10]
"Frequent subgraph discovery" [Kur01]
"Efficient mining of frequent subgraphs in the presence of isomorphism" [Hua03]
"The gaston tool for frequent subgraph mining" [Nij05]
[Chosen] De-Anonymization of networks:
"Prediction Promotes Privacy in Dynamic Social Networks" [Bha10]
"A Practical Attack to De-anonymize Social Network Users" [Won10]
[Chosen] Network dissolution and decline:
"DecLiNe -- Models for Decay of Links in Networks" [Pre14]
"Loosing Frieds on Facebook" [Que12]
"More of a Receiver Than a Giver: Why Do People Unfollow in Twitter?" [Kwa12]
Counting triangles:
"Doulion: counting triangles in massive graphs with a coin" [Tso09]
"Main-memory triangle computations for very large (sparse (power-law)) graphs" [Lat08]
"Triangle listing in massive networks and its applications" [Chu11]
[Chosen] Advances in community detection:
"Fast unfolding of communities in large networks" [Blo08]
"Faster unfolding of communities: speeding up the Louvain algorithm" [Tra15]

Course description

See the e-Studyguide for a more general description.

Topics include: SNA from a CS perspective (graph representation, complexity issues, examples), Graph Structure (power law, small world phenomenon, clustering coefficient, hierarchies), Paths and Distances (neighborhoods, radius, diameter), Spidering and Sampling (BFS, forest fire, random walks), Graph Compression (graph grammars, bitwise tricks, encryption, hashing), Centrality (degree centrality, closeness centrality, betweenness centrality, rating and ranking), Centrality and Webgraphs (HITS, PageRank, structure of the web), Community Detection (spectral clustering, modularity), Visualization (force-based algorithms, Gephi, NodeXL), Graph Models (random graphs, preferential attachment), Link Prediction (structure, semantics, prediction algorithms, graph mining), Contagion (diffusion of information, spreading activation, gossipping) and Privacy and Anonymity ((de-)anonymizing graphs, ethical aspects, privacy issues).

The course was also given in 2014 and 2015.