Here we provide the source code to the Infomap algorithm for detecting communities in large networks with the map equation framework. See also GossipMap, a distributed version for billion-edge directed graphs implemented with GraphLab PowerGraph by Seung-Hee Bae and Bill Howe at University of Washington. We also provide significance clustering code.
Infomap is a command line software written in C++ that runs in Linux, Mac OS X, and Windows with
Infomap requires a working
In Ubuntu, for example, the metapackage
build-essential installs the compiler as well as related packages. Install it from the terminal with
sudo apt-get install build-essential
Since Mac OS X 10.9, the standard compiler tools are based on clang, which can be installed with
However, the current version lacks OpenMP support for parallelization. While the Makefile automatically skips the
-fopenmapcompiler flag if the standard compiler is clang, to get support for OpenMP you can manually install a gcc-based compiler. A simple way is to install Homebrew and type, for example,
brew install gcc43in the terminal.
We recommend running Infomap in bash on ubuntu on Windows 10. Follow the installation guide for a Linux environment directly on Windows. For example, install
sudo apt-get install git. Then install the metapackage
build-essential for the compiler and related packages with
sudo apt-get install build-essential
To extract the zipped archive and compile the source code in a Unix-like environment, open a terminal and type
cd [path/to/Infomap] unzip Infomap.zip cd Infomap make
[path/to/Infomap] to the folder where Infomap was downloaded, such as ~/Downloads. With
git installed, instead clone the source code with
git clone https://github.com/mapequation/infomap.git. If you installed a custom compiler with, for example,
brew install gcc43, replace the last command with
Compilation error due to missing parallelization library. If there is a compilation error that references
-lpthread, the compiler probably can't find the OpenMP library which is used for parallelization. To compile without the OpenMP library, type
make noomp. You can also try to compile with another compiler by calling for example
CXX=g++-4.3 make. Please give us feedback if you have any troubles!
./Infomap [options] network_data dest
The optional arguments can be put anywhere, see the Options section for the available
network_data should point to a valid network file (see Input section) and
dest to a directory where the Infomap should write its output files.
optionis provided, Infomap will assume an undirected network and try to partition it hierarchically.
Run Infomap with the options
--help to list the options below:
-d2N3is equivalent to
-d -2 -N 3.
Run Infomap with the options
-hh to list the options below:
Argument types are defined as:
[+] means that the option can be repeated
mkdir output ./Infomap ninetriangles.net output/ -N 10 --tree --bftree
Here, we first create a new directory where we want the put the results, and feed that to Infomap together with an input network and an option. Now Infomap will try to parse the file
ninetriangles.net as an undirected network, partition it hierarchically and write the best result out of
10 attempts to the
output directory, both in the human readable .tree format and the binary .bftree format, which can be loaded into the Network Navigator.
./Infomap ninetriangles.net output/ -N 10 --directed --two-level --clu --map
Here, Infomap will treat the network as directed and try to find the optimal two-level partition with respect to the flow. This will also create an output .clu file with the cluster indices, and a .map file with the modular network that can be visualized using the Map Generator.
Multiplex networks describes a set of networks with corresponding nodes but different link structures, and possibly also an inter-network link structure (see multiplex format). If no inter-link data is provided, or if the relax rate is explicitly set, the inter-link structure will be generated by relaxing the layer constraints of the intra-network links.
There are two ways to provide a multiplex network to Infomap:
./Infomap multiplex-network.net output/ -i multiplex
./Infomap network1.net network2.net [... networkN.net] output/
To see a minimal example on how Infomap can be run as a library, check the files
Makefile in the
examples/cpp/minimal directory, and run by typing:
cd examples/cpp/minimal make ./example
The included directory
example/cpp/Infomap-igraph-interface-library/ contains an example on how to write a small interface library to simplify integration with other network libraries, here igraph. Example code to use that is included in the
brew install swig.
To run Infomap from the included python example code, type:
cd examples/python make python example-networkx.py
make command first calls
make python in the Infomap base directory, which uses swig and setuptools to compile Infomap to a python module. Then that module is copied to the current example directory to be easily imported in the included python code.
To run Infomap from the included Jupyter notebook, first compile Infomap to a python module as described above, then install Jupyter by typing
pip install jupyter using the pip tool, and finally start the notebook server and open the example notebook in
examples/python by typing:
jupyter notebook infomap-examples.ipynb
Assuming R is installed, you can run Infomap from from the included R example code, type:
cd examples/R make R source("example-minimal.R")
make command first calls
make R in the Infomap base directory, which uses swig to generate wrapper code for R and
R CMD SHLIB to compile Infomap as a shared library. The library can then be loaded in an
R environment by calling
source("load-infomap.R"), which is called by the example files.
example-igraph.R code gives a more complete example, depending on the igraph R package to generate a network and to plot the communities found by Infomap.
Infomap decomposes a network into modules by optimally compressing a description of information flows on the network. To model this flow of information, we release a random walker on the network and encodes its steps and module exits. To tune the modular solution, you can therefore tune the dynamics of the random walker, or the encoding scheme.
Some relevant options here are
Relevant options are
--recorded-teleportation, the solution captures the blurring effect of the random teleportation, typically needed on directed networks to reach a stationary solution. To minimize this artificial effect, we therefore no longer use the option by default.
--markov-time defines how many steps the random walker takes before its position is encoded (default 1). Shorter Markov times than 1 mean that the average transition rate of a random walker is lower than the encoding rate of its position, such that the same node will be encoded multiple times in a row. As a result, the map equation will favor more and smaller modules. Oppositely, longer Markov times mean that the average transition rate is higher than the encoding rate, such that not every node on the trajectory will be encoded, and the map equation will favor fewer and larger modules. Read more in Efficient community detection of network flows for varying Markov times and bipartite networks.
Currently, Infomap understands three different file formats for the input network data, a minimal link list format, the standard Pajek format, and a 3-gram format for second-order dynamics. Specify the input format to Infomap with the
--input-format flag together with
link-list as its argument. If no input format is specified, Infomap will try to parse the file by this assumption:
0, you have to add the flag
--zero-basedto Infomap, otherwise it will assume that the node numbering starts from
A link list is a minimal format to describe a network by only specifying a set of links:
# A network in link list format 1 2 1 1 3 1 2 3 2 3 5 0.5 ...
Each line corresponds to the triad
source target weight which describes a weighted link between the nodes with specified numbers. The
weight can be any non-negative value. If omitted, the link weight will default to
1. The first node are assumed to start from
1* and the total number of nodes will be determined by the maximum node number.
The Pajek format specifies both the nodes and the links in two different sections of the file:
# A network in Pajek format *Vertices 27 1 "1" 2 "2" 3 "3" 4 "4" ... *Edges 33 1 2 1 1 3 1 1 4 1 2 3 1 ...
The node section must start with
*Vertices N and the link section with
*Edges N or
*Arcs N (case insensitive), where
N is the number of nodes or links. The characters within each quotation mark defines the corresponding node name. Weights can be given to nodes by adding a third column with positive numbers, and the list of links are treated the same as for a link list.
The 3-gram format contains information about second-order dynamics and is similar to the pajek format but the links are defined by at least three columns,
from through to [weight]:
# A network in 3-gram format *Vertices 27 1 "1" 2 "2" 3 "3" 4 "4" ... *3grams 33 1 1 2 1 2 1 2 1 3 1 2 1 1 2 1 1 ...
The node section format follows exactly the Pajek format, but the link section must begin with
*3grams N or
*Arcs N (case insensitive), where
N is the number of links. Because pathways are inherently directed, the links are assumed to be directed by default.
You can mix the trigrams with bigrams, i.e. ordinary links, by setting the first column to
-1. Infomap will then try to match those links with the existing trigrams to spread out the weight proportinally to the weigh of the matched links.
There are three ways to match an incomplete link
-1 2 3, in priority order:
x 2 3->
x 2 3
x 2 y->
x 2 3
x y 2->
y 2 3
If no matches are found, a self-link
2 2 3 is created if self-links are allowed.
In a multiplex network, each physical node can exist in a number of layers, with different link structure for each layer. A general multiplex format follows the Pajek layout, but with the links defined between nodes for each layer:
# A network in a general multiplex format *Vertices 4 1 "Node 1" 2 "Node 2" 3 "Node 3" 4 "Node 4" *Multiplex # layer node layer node [weight] 1 1 1 2 2 1 1 2 2 1 1 2 1 1 1 1 3 2 2 1 2 2 1 3 1 2 3 2 2 1 2 4 2 1 2 2 4 1 2 1
weight column is optional. Links without the last column will get weight
1.0 by default.
If no heading (beginning with
*) is given, the multiplex format assumes
*Multiplex link format:
# A network in a general multiplex format # layer node layer node [weight] 1 1 1 2 2 1 1 2 2 1 1 2 1 1 1 1 3 2 2 1 2 2 1 3 1 2 3 2 2 1 2 4 2 1 2 2 4 1 2 1
The multiplex format above gives full control of the dynamics, and no other movements are encoded. However, it is often useful to consider a dynamics in which a random walker moves within a layer and with a given relax rate jumps to another layer without recording this movement, such that the constraints from moving in different layers can be gradually relaxed. The format below explicitely divides the links into two groups, links within layers (intra-layer links) and links between layers (inter-layer links). For the first group, the second layer column can be omitted. For the second group, the second node can be omitted, as a shorthand for an unrecorded jump between layers. That is, each inter-layer link
layer1 node1 layer2 is expanded to weighted multiplex links
layer1 node1 layer2 node2, one for each
node1 is connected to in
layer2 with weight proportional to the weight of the link between
layer2. In this way, the random walker seamlessly switches to a different layer at a rate proportional to the inter-layer link weight to that layer, and the encoded dynamics correspond to relaxed layer constraints (see the interactive storyboard for illustration). To define links like this, use the
# A network in multiplex format *Intra # layer node node weight 1 1 2 1 1 2 1 1 1 2 3 1 1 3 2 1 1 3 1 1 1 1 3 1 1 2 4 1 1 4 2 1 2 4 5 1 2 5 4 1 2 5 6 1 2 6 5 1 2 6 4 1 2 4 6 1 2 3 6 1 2 6 3 1 *Inter # layer node layer weight 1 3 2 2 3 1 1 4 2 2 4 1
If no inter links are provided, the inter links will be generated from the intra link structure by relaxing the layer constraints on those links.
The bipartite network format uses prefixes
n for features and nodes, respectively. The bipartite network can be provided both with node names:
# A bipartite network with node names *Vertices 5 1 "Node 1" 2 "Node 2" 3 "Node 3" 4 "Node 4" 5 "Node 5" *Edges 6 # feature node [weight] f1 n1 2 f1 n2 2 f1 n3 1 f2 n3 1 f2 n4 2 f2 n5 2
and in the link list format:
# A bipartite network in link list format # feature node [weight] f1 n1 2 f1 n2 2 f1 n3 1 f2 n3 1 f2 n4 2 f2 n5 2
Node names can be provided under
n for features and nodes, respectively. In the link list format, for example, it takes the form:
The state format describes the exact network used internally by Infomap. It can model both ordinary networks and memory networks of variable order as illustrated in this notebook.
# A network in state format *Vertices 4 1 "PRE" 2 "SCIENCE" 3 "PRL" 4 "BIO" *States 1 2 "1 2" 2 3 "2 3" 3 2 "4 2" 4 4 "2 4" *Links 1 2 3 4
*Vertices section is optional and follows the Pajek format. It is used to name the physical nodes. The
*States section describes all internal nodes with the format
uniqueId physicalId [name], where
name is optional. The unique index is referenced in the
*Links section, and the physical index of the node is optionally referenced in the
*Vertices section. The
*Links section describes all links as
from to [weight], where
to references the first index in the
The example state network above is equivalent to the 3-gram network below:
# A 3-gram network equivalent to the state network above *Vertices 4 1 "PRE" 2 "SCIENCE" 3 "PRL" 4 "BIO" *3grams 1 2 3 4 2 4
The resulting hierarchy will be written to a file with the extension
.tree (plain text file) and corresponds to the best hierarchical partition (shortest description length) of the attempts. The output format has the pattern:
# Codelength = 3.48419 bits. 1:1:1 0.0384615 "7" 6 1:1:2 0.0384615 "8" 7 1:1:3 0.0384615 "9" 8 1:2:1 0.0384615 "4" 3 1:2:2 0.0384615 "5" 4 ...
Each row (except the first one, which summarizes the result) begins with the multilevel module assignments of a node. The module assignments are colon separated from coarse to fine level, and all modules within each level are sorted by the total flow (PageRank) of the nodes they contain. Further, the integer after the last colon is the rank within the finest-level module, the decimal number is the amount of flow in that node, i.e. the steady state population of random walkers, the content within quotation marks is the node name, and finally, the last integer is the index of the node in the original network file.
The Tree format with an appended section of links with the flow between nodes within the same parent. The file is saved with the extension
.ftree (plain text file), and begins with the node hierarchy formatted as the Tree format, followed by the links formatted as:
*Links undirected #*Links path exitFlow numEdges numChildren *Links root 0.0 6 5 1 2 0.0128205 1 3 0.0128205 2 4 0.0128205 3 4 0.0128205 3 5 0.0128205 4 5 0.0128205 *Links 1 0.025641 3 3 1 2 0.0128205 1 3 0.0128205 2 3 0.0128205 *Links 1:1 0.0384615 3 3 1 2 0.0128205 1 3 0.0128205 2 3 0.0128205 *Links 1:2 0.0384615 3 3 1 2 0.0128205 1 3 0.0128205 2 3 0.0128205 ...
First is a line stating
directed links. Then each module is identified by the
path field, followed by all links between child nodes of that module. The
path is a colon separated path in the tree from the
root to finest-level modules. The links are sorted by flow within each module. Links exiting the module are not included, but the flow on those links is aggregated to
exitFlow. The number of edges and child nodes within each module is also included in the module header line, as defined in the commented line above. (Lines starting with
# are treated as comments and are ignored when parsed.)
.map file (plain text file) describes the best two-level partition (shortest description length) of the attempts. The format has the pattern:
# modules: 4 # modulelinks: 4 # nodes: 16 # links: 20 # codelength: 3.32773 *Directed *Modules 4 1 "Node 13,..." 0.25 0.0395432 2 "Node 5,..." 0.25 0.0395432 3 "Node 9,..." 0.25 0.0395432 4 "Node 1,..." 0.25 0.0395432 *Nodes 16 1:1 "Node 13" 0.0820133 1:2 "Node 14" 0.0790863 1:3 "Node 16" 0.0459137 1:4 "Node 15" 0.0429867 2:1 "Node 5" 0.0820133 2:2 "Node 6" 0.0790863 2:3 "Node 8" 0.0459137 2:4 "Node 7" 0.0429867 3:1 "Node 9" 0.0820133 3:2 "Node 10" 0.0790863 3:3 "Node 12" 0.0459137 3:4 "Node 11" 0.0429867 4:1 "Node 1" 0.0820133 4:2 "Node 2" 0.0790863 4:3 "Node 4" 0.0459137 4:4 "Node 3" 0.0429867 *Links 4 1 4 0.0395432 2 3 0.0395432 3 1 0.0395432 4 2 0.0395432 ...
This file contains the necessary information to generate a visual map in the Map Generator. The names under
*Modules are derived from the node with the highest flow volume within the module and
0.25 0.0395432 represent, respectively, the aggregated flow volume of all nodes within the module and the per step exit flow from the module. The nodes are listed with their module assignments together with their flow volumes. Finally, all links between the modules are listed in order from high flow to low flow.
.clu file (plain text file) describes the best two-level partition (shortest description length) of the attempts. The format has the pattern:
# node cluster flow: 4 1 0.4 3 1 0.1 1 2 0.2 2 2 0.1 5 2 0.05 8 2 0.05 7 3 0.05 6 3 0.05
.clu file is used as an input clustering to Infomap or the apps, the
flow column will not be used and may be omitted. Even the
node column may be omitted, in which case it assumes an implicit column of naturally ordered integers starting from one.
The resulting hierarchy can be written to a compact binary format using the
--bftree flag. The former includes only the tree, while the latter also includes the horizontal flow links between nodes on each level of the tree. The data is written in a breath-first pattern to be streamable, which the Network Navigator utilizes to be able to show the top modular structure of the network before the deeper parts of the tree are loaded.
Don't hesitate to contact us if you have questions or comments about the code.
Infomap optimizes the map equation, which exploits the information-theoretic duality between the problem of compressing data, and the problem of detecting and extracting significant patterns or structures within those data.
Specifically, the map equation is a flow-based method and operates on dynamics on the network.
Infomap handles both unweighted and weighted, undirected and directed links.
Infomap captures flow patterns modeled with both first-order dynamics (as on a conventional network: where flow moves to on the network only depends on where it currently is) and second-order dynamics (where flow moves to on the network both depends on where it currently is and where it just came from). Infomap captures second-order dynamics by performing first-order dynamics on memory nodes, see Memory in network flows and its effects on spreading dynamics and community detection and this interactive storyboard.
Infomap can identify overlapping modules by first representing the dynamics with memory nodes and then clustering the memory nodes in hard partitions.).
Infomap can identify (overlapping) modules in multilayer (multiplex) networks that may not be identified in a single aggregated network or by analyzing the layers separately. See Identifying modular flows on multilayer networks reveals highly overlapping organization in interconnected systems and this interactive storyboard.
The hierarchical map equation measures the per-step average code length necessary to describe a random walker's movements on a network, given a hierarchical network partition, but the challenge is to find the partition that minimizes the description length. Into how many hierarchical levels should a given network be partitioned? How many modules should each level have? And which nodes should be members of which modules?
Below we describe the Infomap algorithm and start with a description of the original two-level clustering of nodes into modules
The core of the algorithm follows closely the Louvain method: neighboring nodes are joined into modules, which subsequently are joined into supermodules and so on. First, each node is assigned to its own module. Then, in random sequential order, each node is moved to the neighboring module that results in the largest decrease of the map equation. If no move results in a decrease of the map equation, the node stays in its original module. This procedure is repeated, each time in a new random sequential order, until no move generates a decrease of the map equation. Now the network is rebuilt, with the modules of the last level forming the nodes at this level, and, exactly as at the previous level, the nodes are joined into modules. This hierarchical rebuilding of the network is repeated until the map equation cannot be reduced further.
With this algorithm, a fairly good clustering of the network can be found in a very short time. Let us call this the core algorithm and see how it can be improved. The nodes assigned to the same module are forced to move jointly when the network is rebuilt. As a result, what was an optimal move early in the algorithm might have the opposite effect later in the algorithm. Because two or more modules that merge together and form one single module when the network is rebuilt can never be separated again in this algorithm, the accuracy can be improved by breaking the modules of the final state of the core algorithm in either of the two following ways:
In practice, we repeat the two extensions to the core algorithm in sequence and as long as the clustering is improved. Moreover, we apply the submodule movements recursively. That is, to find the submodules to be moved, the algorithm first splits the submodules into subsubmodules, subsubsubmodules, and so on until no further splits are possible. Finally, because the algorithm is stochastic and fast, we can restart the algorithm from scratch every time the clustering cannot be improved further and the algorithm stops. The implementation is straightforward and, by repeating the search more than once, 100 times or more if possible, the final partition is less likely to correspond to a local minimum. For each iteration, we record the clustering if the description length is shorter than the previously shortest description length.
We have generalized our search algorithm for the two-level map equation to recursively search for multilevel solutions. The recursive search operates on a module at any level; this can be all the nodes in the entire network, or a few nodes at the finest level. For a given module, the algorithm first generates submodules if this gives a shorter description length. If not, the recursive search does not go further down this branch. But if adding submodules gives a shorter description length, the algorithm tests if movements within the module can be further compressed by additional index codebooks. Further compression can be achieved both by adding one or more coarser codebooks to compress movements between submodules or by adding one or more finer index codebooks to compress movements within submodules. To test for all combinations, the algorithm calls itself recursively, both operating on the network formed by the submodules and on the networks formed by the nodes within every submodule. In this way, the algorithm successively increases and decreases the depth of different branches of the multilevel code structure in its search for the optimal hierarchical partitioning. For every split of a module into submodules, we use the two-level search algorithm described above.