Infomap code

Here we provide the source code to the Infomap algorithm for detecting communities in large networks. See also GossipMap, a distributed version for billion-edge directed graphs implemented with GraphLab PowerGraph by Seung-Hee Bae and Bill Howe at University of Washington.

  1. Installation
  2. Running
  3. Input
  4. Output
  5. Features
  6. Benchmark
  7. Algorithm
  8. Troubleshooting

Installation

The latest Infomap code can be downloaded here (Thu Jan 12 2017): Infomap.zip. The source code is also available at Github.

Infomap is written in C++ and can be compiled with gcc using the included Makefile. To extract the zipped archive and compile the source code in a Unix-like environment, open a terminal and type this:

cd [path/to/Infomap]
unzip Infomap.zip
cd Infomap
make

Substitute [path/to/Infomap] to the folder where Infomap was downloaded, e.g. ~/Downloads.

Compilation error due to missing parallelization library. If there is a compilation error that references -lomp or -lpthread, the compiler probably can't find the OpenMP library which is used for parallelization. To compile without the OpenMP library, type make noomp. You can also try to compile with another compiler by calling for example CXX=g++-4.3 make. Please give us feedback if you have any troubles!

Parallelization support in OS X. Since Mac OS X 10.9, the standard compiler tools are based on clang, which in the current version lack support of the OpenMP library for parallelization. The Makefile now automatically skip the -fopenmap compiler flag if the standard compiler is clang. To get support for OpenMP, you can manually install a gcc-based compiler. A simple way is to install Homebrew and type for example brew install gcc43 in the terminal. Then run CXX=g++-4.3 make

To be able to run on Windows, you can install MinGW/MSYS to get a minimalist development environment where the above instructions will work. Please follow the instructions on MinGW - Getting Started or download the complete MinGW-MSYS Bundle.

Running

Usage: ./Infomap [options] network_data dest

The optional arguments can be put anywhere, see the Options section for the available options. The network_data should point to a valid network file (see Input section) and dest to a directory where the Infomap should write its output files.

If no option is provided, Infomap will assume an undirected network and try to partition it hierarchically.

Options

Run Infomap with the options -h or --help to list the options below:

-h[+] --help [+]
Prints this help message. Use -hh to show advanced options.
-V --version
Display program version information.
-i<s> --input-format <s>
Specify input format ('pajek', 'link-list', '3gram', 'multiplex' or 'bipartite') to override format possibly implied by file extension.
--overlapping
Let nodes be part of different and overlapping modules. Applies to ordinary networks by first representing the memoryless dynamics with memory nodes.
-z --zero-based-numbering
Assume node numbers start from zero in the input file instead of one.
-k --include-self-links
Include links with the same source and target node. (Ignored by default.)
--map
Print the top two-level modular network in the .map format.
--clu
Print the top cluster indices for each node.
--tree
Print the hierarchy in .tree format. (default true if no other output with cluster data)
--bftree
Print the tree including horizontal flow links in a streamable binary format.
-2 --two-level
Optimize a two-level partition of the network.
-u --undirected
Assume undirected links. (default except for input format '3gram')
-d --directed
Assume directed links. (default only for input format '3gram')
-t --undirdir
Two-mode dynamics: Assume undirected links for calculating flow, but directed when minimizing codelength.
-s<n> --seed <n>
A seed (integer) to the random number generator.
-N<n> --num-trials <n>
The number of outer-most loops to run before picking the best solution. (Default: 1)
-F[+] --fast-hierarchical-solution [+]
Find top modules fast. Use -FF to keep all fast levels. Use -FFF to skip recursive part.
--inner-parallelization
Parallelize the innermost loop for greater speed. Note that this may give some accuracy tradeoff.
-v[+] --verbose [+]
Verbose output on the console. Add additional 'v' flags to increase verbosity up to -vvv.
--silent
No output on the console.
You can write short options together as long as no argument is required. Thus -d2N3 is equivalent to -d -2 -N 3.

Advanced options:

Run Infomap with the options -hh to list the options below:

-h[+] --help [+]
Prints this help message. Use -hh to show advanced options.
-V --version
Display program version information.
-n --negate-next
Set the next (no-argument) option to false.
-i<s> --input-format <s>
Specify input format ('pajek', 'link-list', '3gram', 'multiplex' or 'bipartite') to override format possibly implied by file extension.
--skip-adjust-bipartite-flow
Skip distributing all flow from the bipartite nodes (first column) to the ordinary nodes (second column).
--overlapping
Let nodes be part of different and overlapping modules. Applies to ordinary networks by first representing the memoryless dynamics with memory nodes.
--hard-partitions
Don't allow overlapping modules in memory networks by keeping the memory nodes constrained into their physical nodes.
--non-backtracking
Use non-backtracking dynamics and let nodes be part of different and overlapping modules. Applies to ordinary networks by first representing the non-backtracking dynamics with memory nodes.
--without-iostream
Parse the input network data without the iostream library. Can be a bit faster, but not as robust.
-z --zero-based-numbering
Assume node numbers start from zero in the input file instead of one.
-k --include-self-links
Include links with the same source and target node. (Ignored by default.)
-O<n> --node-limit <n>
Limit the number of nodes to read from the network. Ignore links connected to ignored nodes. (Default: 0)
--pre-cluster-multiplex
Pre-cluster multiplex networks layer by layer.
-c<p> --cluster-data <p>
Provide an initial two-level (.clu format) or multi-layer (.tree format) solution.
--no-infomap
Don't run Infomap. Useful if initial cluster data should be preserved or non-modular data printed.
--out-name <s>
Use this name for the output files, like [output_directory]/[out-name].tree
-0 --no-file-output
Don't print any output to file.
--map
Print the top two-level modular network in the .map format.
--clu
Print the top cluster indices for each node.
--tree
Print the hierarchy in .tree format. (default true if no other output with cluster data)
--ftree
Print the hierarchy in .tree format and append the hierarchically aggregated network links.
--btree
Print the tree in a streamable binary format.
--bftree
Print the tree including horizontal flow links in a streamable binary format.
--node-ranks
Print the calculated flow for each node to a file.
--flow-network
Print the network with calculated flow values.
--pajek
Print the parsed network in Pajek format.
--expanded
Print the expanded network of memory nodes if possible.
-2 --two-level
Optimize a two-level partition of the network.
-u --undirected
Assume undirected links. (default)
-d --directed
Assume directed links.
-t --undirdir
Two-mode dynamics: Assume undirected links for calculating flow, but directed when minimizing codelength.
--outdirdir
Two-mode dynamics: Count only ingoing links when calculating the flow, but all when minimizing codelength.
-w --rawdir
Two-mode dynamics: Assume directed links and let the raw link weights be the flow.
-e --recorded-teleportation
If teleportation is used to calculate the flow, also record it when minimizing codelength.
-o --to-nodes
Teleport to nodes instead of to links, assuming uniform node weights if no such input data.
-p<f> --teleportation-probability <f>
The probability of teleporting to a random node or link. (Default: 0.15)
-y<f> --self-link-teleportation-probability <f>
Additional probability of teleporting to itself. Effectively increasing the code rate, generating more and smaller modules. (Default: -1)
--markov-time <f>
Scale link flow with this value to change the cost of moving between modules. Higher for less modules. (Default: 1)
--preferred-number-of-modules <n>
Stop merge or split modules if preferred number of modules is reached. (Default: 0)
--multiplex-relax-rate <f>
The probability to relax the constraint to move only in the current layer. If negative, the inter-links have to be provided. (Default: -1)
--multiplex-relax-limit <n>
The number of neighboring layers in each direction to relax to. If negative, relax to any layer. (Default: -1)
--multiplex-js-relax-rate <f>
The probability to relax the constraint to move only in the current layer and instead move to a random layer where the same physical node is present and proportional to the out-link similarity measured by the Jensen-Shannon divergence. If negative, the inter-links have to be provided. (Default: -1)
--multiplex-js-relax-limit <f>
The minimum out-link similarity measured by the Jensen-Shannon divergence to relax to other layer. From 0 to 1. No limit if negative. (Default: -1)
-s<n> --seed <n>
A seed (integer) to the random number generator.
-N<n> --num-trials <n>
The number of outer-most loops to run before picking the best solution. (Default: 1)
-m<f> --min-improvement <f>
Minimum codelength threshold for accepting a new solution. (Default: 1e-10)
-a --random-loop-limit
Randomize the core loop limit from 1 to 'core-loop-limit'
-M<n> --core-loop-limit <n>
Limit the number of loops that tries to move each node into the best possible module (Default: 10)
-L<n> --core-level-limit <n>
Limit the number of times the core loops are reapplied on existing modular network to search bigger structures. (Default: 0)
-T<n> --tune-iteration-limit <n>
Limit the number of main iterations in the two-level partition algorithm. 0 means no limit. (Default: 0)
-U<f> --tune-iteration-threshold <f>
Set a codelength improvement threshold of each new tune iteration to 'f' times the initial two-level codelength. (Default: 1e-05)
--fast-first-iteration
Move nodes to strongest connected module in the first iteration instead of minimizing the map equation.
-C --fast-coarse-tune
Try to find the quickest partition of each module when creating sub-modules for the coarse-tune part.
-A --alternate-coarse-tune-level
Try to find different levels of sub-modules to move in the coarse-tune part.
-S<n> --coarse-tune-level <n>
Set the recursion limit when searching for sub-modules. A level of 1 will find sub-sub-modules. (Default: 1)
-F[+] --fast-hierarchical-solution [+]
Find top modules fast. Use -FF to keep all fast levels. Use -FFF to skip recursive part.
-l[+] --low-memory [+]
Prioritize memory efficient algorithms before fast. Use twice to optimize even more, but this may give approximate results.
--inner-parallelization
Parallelize the innermost loop for greater speed. Note that this may give some accuracy tradeoff.
--show-bipartite-nodes
Include the bipartite nodes in the output.
-v[+] --verbose [+]
Verbose output on the console. Add additional 'v' flags to increase verbosity up to -vvv.
--silent
No output on the console.

Argument types are defined as:

  • <n> - a positive integer
  • <f> - a floating point number
  • <p> - a path to a file or directory
  • <s> - a string matching one of the listed keys

[+] means that the option can be repeated

Examples

All examples assumes that you are in the Infomap base directory, which contains the Makefile and src directory.

As stand-alone

mkdir output
./Infomap ninetriangles.net output/ -N 10 --tree --bftree

Here, we first create a new directory where we want the put the results, and feed that to Infomap together with an input network and an option. Now Infomap will try to parse the file ninetriangles.net as an undirected network, partition it hierarchically and write the best result out of 10 attempts to the output directory, both in the human readable .tree format and the binary .bftree format, which can be loaded into the Network Navigator.

./Infomap ninetriangles.net output/ -N 10 --directed --two-level --clu --map

Here, Infomap will treat the network as directed and try to find the optimal two-level partition with respect to the flow. This will also create an output .clu file with the cluster indices, and a .map file with the modular network that can be visualized using the Map Generator.

Multiplex

Multiplex networks describes a set of networks with corresponding nodes but different link structures, and possibly also an inter-network link structure (see multiplex format). If no inter-link data is provided, or if the relax rate is explicitly set, the inter-link structure will be generated by relaxing the layer constraints of the intra-network links.

There are two ways to provide a multiplex network to Infomap:

./Infomap multiplex-network.net output/ -i multiplex

or

./Infomap network1.net network2.net [... networkN.net] output/

where network1.net to networkN.net are ordinary networks in the Pajek or link list format.

As C++ library

To see a minimal example on how Infomap can be run as a library, check the files example.cpp and Makefile in the examples/cpp/minimal directory, and run by typing:

cd examples/cpp/minimal
make
./example

The included directory example/cpp/Infomap-igraph-interface-library/ contains an example on how to write a small interface library to simplify integration with other network libraries, here igraph. Example code to use that is included in the examples/cpp/igraph directory.

With other languages

To compile the interface wrappers to other languages, the SWIG library must be installed, see instructions. On OS X, a simple way is to install Homebrew and just type brew install swig.
With Python

The run Infomap from the included python example code, type:

cd examples/python
make
python example-networkx.py

The make command first calls make python in the Infomap base directory, which uses swig and setuptools to compile Infomap to a python module. Then that module is copied to the current example directory to be easily imported in the included python code.

The example-networkx.py code depends on the NetworkX module, which can be installed by typing pip install networkx using the pip tool.

With R

Assuming R is installed, you can run Infomap from from the included R example code, type:

cd examples/R
make
R
source("example-minimal.R")

The make command first calls make R in the Infomap base directory, which uses swig to generate wrapper code for R and R CMD SHLIB to compile Infomap as a shared library. The library can then be loaded in an R environment by calling source("load-infomap.R"), which is called by the example files.

The example-igraph.R code gives a more complete example, depending on the igraph R package to generate a network and to plot the communities found by Infomap.

Tuning the result

Infomap decomposes a network into modules by optimally compressing a description of information flows on the network. To model this flow of information, we release a random walker on the network and encodes its steps and module exits. To tune the modular solution, you can therefore tune the dynamics of the random walker, or the encoding scheme.

Tuning the dynamics

Some relevant options here are --directed, --undirdir, --rawdir, --teleportation-probability, --to-nodes and --include-self-links.

Tuning the coding scheme

Relevant options are --recorded-teleportation and --markov-time. With --recorded-teleportation, the solution captures the blurring effect of the random teleportation, typically needed on directed networks to reach a stationary solution. To minimize this artificial effect, we therefore no longer use the option by default.

The --markov-time defines how many steps the random walker takes before its position is encoded (default 1). Shorter Markov times than 1 mean that the average transition rate of a random walker is lower than the encoding rate of its position, such that the same node will be encoded multiple times in a row. As a result, the map equation will favor more and smaller modules. Oppositely, longer Markov times mean that the average transition rate is higher than the encoding rate, such that not every node on the trajectory will be encoded, and the map equation will favor fewer and larger modules. Read more in Efficient community detection of network flows for varying Markov times and bipartite networks.

Input

Currently, Infomap understands three different file formats for the input network data, a minimal link list format, the standard Pajek format, and a 3-gram format for second-order dynamics. Specify the input format to Infomap with the -i or --input-format flag together with pajek or link-list as its argument. If no input format is specified, Infomap will try to parse the file by this assumption:

If the node numbering in the file starts from 0, you have to add the flag -z or --zero-based to Infomap, otherwise it will assume that the node numbering starts from 1.

A link list is a minimal format to describe a network by only specifying a set of links:

# A network in link list format
1 2 1
1 3 1
2 3 2
3 5 0.5
...

Each line corresponds to the triad source target weight which describes a weighted link between the nodes with specified numbers. The weight can be any non-negative value. If omitted, the link weight will default to 1. The first node are assumed to start from 1* and the total number of nodes will be determined by the maximum node number.

Pajek format

The Pajek format specifies both the nodes and the links in two different sections of the file:

# A network in Pajek format
*Vertices 27
1 "1"
2 "2"
3 "3"
4 "4"
...
*Edges 33
1 2 1
1 3 1
1 4 1
2 3 1
...

The node section must start with *Vertices N and the link section with *Edges N or *Arcs N (case insensitive), where N is the number of nodes or links. The characters within each quotation mark defines the corresponding node name. Weights can be given to nodes by adding a third column with positive numbers, and the list of links are treated the same as for a link list.

3-gram format

The 3-gram format contains information about second-order dynamics and is similar to the pajek format but the links are defined by at least three columns, from through to [weight]:

# A network in 3-gram format
*Vertices 27
1 "1"
2 "2"
3 "3"
4 "4"
...
*3grams 33
1 1 2 1
2 1 2 1
3 1 2 1
1 2 1 1
...

The node section format follows exactly the Pajek format, but the link section must begin with *3grams N or *Arcs N (case insensitive), where N is the number of links. Because pathways are inherently directed, the links are assumed to be directed by default.

Incomplete links

You can mix the trigrams with bigrams, i.e. ordinary links, by setting the first column to -1. Infomap will then try to match those links with the existing trigrams to spread out the weight proportinally to the weigh of the matched links.

There are three ways to match an incomplete link -1 2 3, in priority order:

Exact match:
x 2 3 -> x 2 3
Partial match:
x 2 y -> x 2 3
Shifted match:
x y 2 -> y 2 3

If no matches are found, a self-link 2 2 3 is created if self-links are allowed.

Multiplex format

In a multiplex network, each physical node can exist in a number of layers, with different link structure for each layer. A general multiplex format follows the Pajek layout, but with the links defined between nodes for each layer:

# A network in a general multiplex format
*Vertices 4
1 "Node 1"
2 "Node 2"
3 "Node 3"
4 "Node 4"
*Multiplex
# layer node layer node [weight]
1 1 1 2 2
1 1 2 2 1
1 2 1 1 1
1 3 2 2 1
2 2 1 3 1
2 3 2 2 1
2 4 2 1 2
2 4 1 2 1

The weight column is optional. Links without the last column will get weight 1.0 by default.

If no heading (beginning with *) is given, the multiplex format assumes *Multiplex link format:

# A network in a general multiplex format
# layer node layer node [weight]
1 1 1 2 2
1 1 2 2 1
1 2 1 1 1
1 3 2 2 1
2 2 1 3 1
2 3 2 2 1
2 4 2 1 2
2 4 1 2 1

The multiplex format above gives full control of the dynamics, and no other movements are encoded. However, it is often useful to consider a dynamics in which a random walker moves within a layer and with a given relax rate jumps to another layer without recording this movement, such that the constraints from moving in different layers can be gradually relaxed. The format below explicitely divides the links into two groups, links within layers (intra-layer links) and links between layers (inter-layer links). For the first group, the second layer column can be omitted. For the second group, the second node can be omitted, as a shorthand for an unrecorded jump between layers. That is, each inter-layer link layer1 node1 layer2 is expanded to weighted multiplex links layer1 node1 layer2 node2, one for each node2 that node1 is connected to in layer2 with weight proportional to the weight of the link between node1 and node2 in layer2. In this way, the random walker seamlessly switches to a different layer at a rate proportional to the inter-layer link weight to that layer, and the encoded dynamics correspond to relaxed layer constraints (see the interactive storyboard for illustration). To define links like this, use the *Intra and *Inter headings:

# A network in multiplex format
*Intra
# layer node node weight
1 1 2 1
1 2 1 1
1 2 3 1
1 3 2 1
1 3 1 1
1 1 3 1
1 2 4 1
1 4 2 1
2 4 5 1
2 5 4 1
2 5 6 1
2 6 5 1
2 6 4 1
2 4 6 1
2 3 6 1
2 6 3 1
*Inter
# layer node layer weight
1 3 2
2 3 1
1 4 2
2 4 1

If no inter links are provided, the inter links will be generated from the intra link structure by relaxing the layer constraints on those links.

Bipartite format

The bipartite network format uses prefixes f and n for features and nodes, respectively. The bipartite network can be provided both with node names:

# A bipartite network with node names
*Vertices 5
1 "Node 1"
2 "Node 2"
3 "Node 3"
4 "Node 4"
5 "Node 5"
*Edges 6
# feature node [weight]
f1 n1 2
f1 n2 2
f1 n3 1
f2 n3 1
f2 n4 2
f2 n5 2

and in the link list format:

# A bipartite network in link list format
# feature node [weight]
f1 n1 2
f1 n2 2
f1 n3 1
f2 n3 1
f2 n4 2
f2 n5 2

Node names can be provided under f and n for features and nodes, respectively. In the link list format, for example, it takes the form:

State format

The state format describes the exact network used internally by Infomap. It can model both ordinary networks and memory networks (of variable order).

# A network in state format
*Vertices 4
1 "PRE"
2 "SCIENCE"
3 "PRL"
4 "BIO"
*States
1 2 "1 2"
2 3 "2 3"
3 2 "4 2"
4 4 "2 4"
*Links
1 2
3 4

The *Vertices section is optional and follows the Pajek format. It is used to name the physical nodes. The *States section describes all internal nodes with the format uniqueId physicalId [name], where name is optional. The unique index is referenced in the *Links section, and the physical index of the node is optionally referenced in the *Vertices section. The *Links section describes all links as from to [weight], where from and to references the first index in the *States section.

The example state network above is equivalent to the 3-gram network below:

# A 3-gram network equivalent to the state network above
*Vertices 4
1 "PRE"
2 "SCIENCE"
3 "PRL"
4 "BIO"
*3grams
1 2 3
4 2 4

Output

Tree format

The resulting hierarchy will be written to a file with the extension .tree (plain text file) and corresponds to the best hierarchical partition (shortest description length) of the attempts. The output format has the pattern:

# Codelength = 3.48419 bits.
1:1:1 0.0384615 "7" 6
1:1:2 0.0384615 "8" 7
1:1:3 0.0384615 "9" 8
1:2:1 0.0384615 "4" 3
1:2:2 0.0384615 "5" 4
...

Each row (except the first one, which summarizes the result) begins with the multilevel module assignments of a node. The module assignments are colon separated from coarse to fine level, and all modules within each level are sorted by the total flow (PageRank) of the nodes they contain. Further, the integer after the last colon is the rank within the finest-level module, the decimal number is the amount of flow in that node, i.e. the steady state population of random walkers, the content within quotation marks is the node name, and finally, the last integer is the index of the node in the original network file.

Map format

A .map file (plain text file) describes the best two-level partition (shortest description length) of the attempts. The format has the pattern:

# modules: 4
# modulelinks: 4
# nodes: 16
# links: 20
# codelength: 3.32773
*Directed
*Modules 4
1 "Node 13,..." 0.25 0.0395432
2 "Node 5,..." 0.25 0.0395432
3 "Node 9,..." 0.25 0.0395432
4 "Node 1,..." 0.25 0.0395432
*Nodes 16
1:1 "Node 13" 0.0820133
1:2 "Node 14" 0.0790863
1:3 "Node 16" 0.0459137
1:4 "Node 15" 0.0429867
2:1 "Node 5" 0.0820133
2:2 "Node 6" 0.0790863
2:3 "Node 8" 0.0459137
2:4 "Node 7" 0.0429867
3:1 "Node 9" 0.0820133
3:2 "Node 10" 0.0790863
3:3 "Node 12" 0.0459137
3:4 "Node 11" 0.0429867
4:1 "Node 1" 0.0820133
4:2 "Node 2" 0.0790863
4:3 "Node 4" 0.0459137
4:4 "Node 3" 0.0429867
*Links 4
1 4 0.0395432
2 3 0.0395432
3 1 0.0395432
4 2 0.0395432
...

This file contains the necessary information to generate a visual map in the Map Generator. The names under *Modules are derived from the node with the highest flow volume within the module and 0.25 0.0395432 represent, respectively, the aggregated flow volume of all nodes within the module and the per step exit flow from the module. The nodes are listed with their module assignments together with their flow volumes. Finally, all links between the modules are listed in order from high flow to low flow.

Clu format

A .clu file (plain text file) describes the best two-level partition (shortest description length) of the attempts. The format has the pattern:

*Vertices: 8
1
1
2
2
2
2
3
3

This file contains the module indices for all nodes in the same order as in the input network file. It can be added to the apps after reading a network file to generate the modular network map.

Binary tree formats

The resulting hierarchy can be written to a compact binary format using the --btree or --bftree flag. The former includes only the tree, while the latter also includes the horizontal flow links between nodes on each level of the tree. The data is written in a breath-first pattern to be streamable, which the Network Navigator utilizes to be able to show the top modular structure of the network before the deeper parts of the tree are loaded.

Troubleshooting

Don't hesitate to contact us if you have questions or comments about the code.

Features

Infomap optimizes the map equation, which exploits the information-theoretic duality between the problem of compressing data, and the problem of detecting and extracting significant patterns or structures within those data.

Specifically, the map equation is a flow-based method and operates on dynamics on the network.

Infomap handles both unweighted and weighted, undirected and directed links.

Two-level and multi-level solutions

Infomap clusters tightly interconnected nodes into modules (two-level clustering) or the optimal number of nested modules (multi-level clustering).

First- and second-order dynamics

Infomap captures flow patterns modeled with both first-order dynamics (as on a conventional network: where flow moves to on the network only depends on where it currently is) and second-order dynamics (where flow moves to on the network both depends on where it currently is and where it just came from). Infomap captures second-order dynamics by performing first-order dynamics on memory nodes, see Memory in network flows and its effects on spreading dynamics and community detection and this interactive storyboard.

Hard partitions and overlapping modules

Infomap can identify overlapping modules by first representing the dynamics with memory nodes and then clustering the memory nodes in hard partitions.).

Single- and multi-layer networks

Infomap can identify (overlapping) modules in multilayer (multiplex) networks that may not be identified in a single aggregated network or by analyzing the layers separately. See Identifying modular flows on multilayer networks reveals highly overlapping organization in interconnected systems and this interactive storyboard.

Benchmark

The speed and accuracy of Infomap is visualized below, compared to the Louvain method, on generated benchmark networks described by Andrea Lancichinetti et al.

Speed benchmark
Speed benchmark The speed is measured as the time needed to partition the benchmark networks in two levels.
Accuracy benchmark
Accuracy benchmark The accuracy is measured as Normalized Mutual Information (NMI) between the output cluster and the reference cluster. Benchmark networks are generated with 5000 nodes and community sizes between 20 and 200.
Hierarchical benchmark
Triangle network
Hierarchical accuracy benchmark. The figure shows how well the algorithm reveal the hierarchical organization of nodes in triangle networks of different levels (see next figure).
Triangle network of three levels. Networks for the hierarchical accuracy benchmark are generated as the Sierpinski fractal of different levels.

Algorithm

The hierarchical map equation measures the per-step average code length necessary to describe a random walker's movements on a network, given a hierarchical network partition, but the challenge is to find the partition that minimizes the description length. Into how many hierarchical levels should a given network be partitioned? How many modules should each level have? And which nodes should be members of which modules?

Below we describe the Infomap algorithm and start with a description of the original two-level clustering of nodes into modules

Two-level algorithm

The core of the algorithm follows closely the Louvain method: neighboring nodes are joined into modules, which subsequently are joined into supermodules and so on. First, each node is assigned to its own module. Then, in random sequential order, each node is moved to the neighboring module that results in the largest decrease of the map equation. If no move results in a decrease of the map equation, the node stays in its original module. This procedure is repeated, each time in a new random sequential order, until no move generates a decrease of the map equation. Now the network is rebuilt, with the modules of the last level forming the nodes at this level, and, exactly as at the previous level, the nodes are joined into modules. This hierarchical rebuilding of the network is repeated until the map equation cannot be reduced further.

With this algorithm, a fairly good clustering of the network can be found in a very short time. Let us call this the core algorithm and see how it can be improved. The nodes assigned to the same module are forced to move jointly when the network is rebuilt. As a result, what was an optimal move early in the algorithm might have the opposite effect later in the algorithm. Because two or more modules that merge together and form one single module when the network is rebuilt can never be separated again in this algorithm, the accuracy can be improved by breaking the modules of the final state of the core algorithm in either of the two following ways:

  • Submodule movements. First, each cluster is treated as a network on its own and the main algorithm is applied to this network. This procedure generates one or more submodules for each module. Then all submodules are moved back to their respective modules of the previous step. At this stage, with the same partition as in the previous step but with each submodule being freely movable between the modules, the main algorithm is re-applied on the submodules.

  • Single-node movements. First, each node is re-assigned to be the sole member of its own module, in order to allow for single-node movements. Then all nodes are moved back to their respective modules of the previous step. At this stage, with the same partition as in the previous step but with each single node being freely movable between the modules, the main algorithm is re-applied on the single nodes.

In practice, we repeat the two extensions to the core algorithm in sequence and as long as the clustering is improved. Moreover, we apply the submodule movements recursively. That is, to find the submodules to be moved, the algorithm first splits the submodules into subsubmodules, subsubsubmodules, and so on until no further splits are possible. Finally, because the algorithm is stochastic and fast, we can restart the algorithm from scratch every time the clustering cannot be improved further and the algorithm stops. The implementation is straightforward and, by repeating the search more than once, 100 times or more if possible, the final partition is less likely to correspond to a local minimum. For each iteration, we record the clustering if the description length is shorter than the previously shortest description length.

Multilevel algorithm

We have generalized our search algorithm for the two-level map equation to recursively search for multilevel solutions. The recursive search operates on a module at any level; this can be all the nodes in the entire network, or a few nodes at the finest level. For a given module, the algorithm first generates submodules if this gives a shorter description length. If not, the recursive search does not go further down this branch. But if adding submodules gives a shorter description length, the algorithm tests if movements within the module can be further compressed by additional index codebooks. Further compression can be achieved both by adding one or more coarser codebooks to compress movements between submodules or by adding one or more finer index codebooks to compress movements within submodules. To test for all combinations, the algorithm calls itself recursively, both operating on the network formed by the submodules and on the networks formed by the nodes within every submodule. In this way, the algorithm successively increases and decreases the depth of different branches of the multilevel code structure in its search for the optimal hierarchical partitioning. For every split of a module into submodules, we use the two-level search algorithm described above.