Blame doc/current-routing.txt

Packit 13e616
Current OpenSM Routing
Packit 13e616
12/13/17
Packit 13e616
Packit 13e616
OpenSM offers ten routing engines:
Packit 13e616
Packit 13e616
1.  Min Hop Algorithm - based on the minimum hops to each node where the
Packit 13e616
path length is optimized.
Packit 13e616
Packit 13e616
2.  UPDN Unicast routing algorithm - also based on the minimum hops to each
Packit 13e616
node, but it is constrained to ranking rules. This algorithm should be chosen
Packit 13e616
if the subnet is not a pure Fat Tree, and deadlock may occur due to a
Packit 13e616
loop in the subnet.
Packit 13e616
Packit 13e616
3. DNUP Unicast routing algorithm - similar to UPDN but allows routing in
Packit 13e616
fabrics which have some CA nodes attached closer to the roots than some switch
Packit 13e616
nodes.
Packit 13e616
Packit 13e616
4.  Fat-tree Unicast routing algorithm - this algorithm optimizes routing
Packit 13e616
of fat-trees for congestion-free "shift" communication pattern.
Packit 13e616
It should be chosen if a subnet is a symmetrical fat-tree.
Packit 13e616
Similar to UPDN routing, Fat-tree routing is credit-loop-free.
Packit 13e616
Packit 13e616
5. LASH unicast routing algorithm - uses InfiniBand virtual layers
Packit 13e616
(SL) to provide deadlock-free shortest-path routing while also
Packit 13e616
distributing the paths between layers. LASH is an alternative
Packit 13e616
deadlock-free topology-agnostic routing algorithm to the non-minimal
Packit 13e616
UPDN algorithm avoiding the use of a potentially congested root node.
Packit 13e616
Packit 13e616
6. DOR Unicast routing algorithm - based on the Min Hop algorithm, but
Packit 13e616
avoids port equalization except for redundant links between the same
Packit 13e616
two switches.  This provides deadlock free routes for hypercubes when
Packit 13e616
the fabric is cabled as a hypercube and for meshes when cabled as a
Packit 13e616
mesh (see details below).
Packit 13e616
Packit 13e616
7. Torus-2QoS unicast routing algorithm - a DOR-based routing algorithm
Packit 13e616
specialized for 2D/3D torus topologies.  Torus-2QoS provides deadlock-free
Packit 13e616
routing while supporting two quality of service (QoS) levels.  In addition
Packit 13e616
it is able to route around multiple failed fabric links or a single failed
Packit 13e616
fabric switch without introducing deadlocks, and without changing path SL
Packit 13e616
values granted before the failure.
Packit 13e616
Packit 13e616
8. DFSSSP unicast routing algorithm - a deadlock-free single-source-
Packit 13e616
shortest-path routing, which uses the SSSP algorithm (see algorithm 9.)
Packit 13e616
as the base to optimize link utilization and uses InfiniBand virtual lanes
Packit 13e616
(SL) to provide deadlock-freedom.
Packit 13e616
Packit 13e616
9. SSSP unicast routing algorithm - a single-source-shortest-path routing
Packit 13e616
algorithm, which globally balances the number of routes per link to
Packit 13e616
optimize link utilization. This routing algorithm has no restrictions
Packit 13e616
in terms of the underlying topology.
Packit 13e616
Packit 13e616
10. Nue unicast routing algorithm - a 100%-applicable and deadlock-free
Packit 13e616
routing which can be used for any arbitrary or faulty network topology
Packit 13e616
and any number of virtual lanes (this includes the absence of VLs as well).
Packit 13e616
Paths are globally balanced w.r.t the number of routes per link, and are
Packit 13e616
kept as short as possible while enforcing deadlock-freedom within the VL
Packit 13e616
constraint.
Packit 13e616
Packit 13e616
OpenSM provides an optional unicast routing cache (enabled by -A or
Packit 13e616
--ucast_cache options). When enabled, unicast routing cache prevents
Packit 13e616
routing recalculation (which is a heavy task in a large cluster) when
Packit 13e616
there was no topology change detected during the heavy sweep, or when
Packit 13e616
the topology change does not require new routing calculation, e.g. when
Packit 13e616
one or more CAs/RTRs/leaf switches going down, or one or more of these
Packit 13e616
nodes coming back after being down.
Packit 13e616
A very common case that is handled by the unicast routing cache is host
Packit 13e616
reboot, which otherwise would cause two full routing recalculations: one
Packit 13e616
when the host goes down, and the other when the host comes back online.
Packit 13e616
Packit 13e616
OpenSM also supports a file method which can load routes from a table. See
Packit 13e616
modular-routing.txt for more information on this.
Packit 13e616
Packit 13e616
The basic routing algorithm is comprised of two stages:
Packit 13e616
1. MinHop matrix calculation
Packit 13e616
   How many hops are required to get from each port to each LID ?
Packit 13e616
   The algorithm to fill these tables is different if you run standard
Packit 13e616
(min hop) or Up/Down.
Packit 13e616
   For standard routing, a "relaxation" algorithm is used to propagate
Packit 13e616
min hop from every destination LID through neighbor switches
Packit 13e616
   For Up/Down routing, a BFS from every target is used. The BFS tracks link
Packit 13e616
direction (up or down) and avoid steps that will perform up after a down
Packit 13e616
step was used.
Packit 13e616
Packit 13e616
2. Once MinHop matrices exist, each switch is visited and for each target LID,
Packit 13e616
a decision is made as to what port should be used to get to that LID.
Packit 13e616
   This step is common to standard and Up/Down routing. Each port has a
Packit 13e616
counter counting the number of target LIDs going through it.
Packit 13e616
   When there are multiple alternative ports with same MinHop to a LID,
Packit 13e616
the one with less previously assigned LIDs is selected.
Packit 13e616
   If LMC > 0, more checks are added: Within each group of LIDs assigned to
Packit 13e616
same target port,
Packit 13e616
   a. use only ports which have same MinHop
Packit 13e616
   b. first prefer the ones that go to different systemImageGuid (then
Packit 13e616
the previous LID of the same LMC group)
Packit 13e616
   c. if none - prefer those which go through another NodeGuid
Packit 13e616
   d. fall back to the number of paths method (if all go to same node).
Packit 13e616
Packit 13e616
Packit 13e616
Effect of Topology Changes
Packit 13e616
Packit 13e616
OpenSM will preserve existing routing in any case where there is no change in
Packit 13e616
the fabric switches unless the -r (--reassign_lids) option is specified.
Packit 13e616
Packit 13e616
-r
Packit 13e616
--reassign_lids
Packit 13e616
          This option causes OpenSM to reassign LIDs to all
Packit 13e616
          end nodes. Specifying -r on a running subnet
Packit 13e616
          may disrupt subnet traffic.
Packit 13e616
          Without -r, OpenSM attempts to preserve existing
Packit 13e616
          LID assignments resolving multiple use of same LID.
Packit 13e616
Packit 13e616
If a link is added or removed, OpenSM does not recalculate
Packit 13e616
the routes that do not have to change. A route has to change
Packit 13e616
if the port is no longer UP or no longer the MinHop. When routing changes
Packit 13e616
are performed, the same algorithm for balancing the routes is invoked.
Packit 13e616
Packit 13e616
In the case of using the file based routing, any topology changes are
Packit 13e616
currently ignored The 'file' routing engine just loads the LFTs from the file
Packit 13e616
specified, with no reaction to real topology. Obviously, this will not be able
Packit 13e616
to recheck LIDs (by GUID) for disconnected nodes, and LFTs for non-existent
Packit 13e616
switches will be skipped. Multicast is not affected by 'file' routing engine
Packit 13e616
(this uses min hop tables).
Packit 13e616
Packit 13e616
Packit 13e616
Min Hop Algorithm
Packit 13e616
-----------------
Packit 13e616
Packit 13e616
The Min Hop algorithm is invoked by default if no routing algorithm is
Packit 13e616
specified.  It can also be invoked by specifying '-R minhop'.
Packit 13e616
Packit 13e616
The Min Hop algorithm is divided into two stages: computation of
Packit 13e616
min-hop tables on every switch and LFT output port assignment. Link
Packit 13e616
subscription is also equalized with the ability to override based on
Packit 13e616
port GUID. The latter is supplied by:
Packit 13e616
Packit 13e616
-i <equalize-ignore-guids-file>
Packit 13e616
--ignore_guids <equalize-ignore-guids-file>
Packit 13e616
          This option provides the means to define a set of ports
Packit 13e616
          (by guids) that will be ignored by the link load
Packit 13e616
          equalization algorithm.
Packit 13e616
Packit 13e616
LMC awareness routes based on (remote) system or switch basis.
Packit 13e616
Packit 13e616
Packit 13e616
UPDN Routing Algorithm
Packit 13e616
----------------------
Packit 13e616
Packit 13e616
Purpose of UPDN Algorithm
Packit 13e616
Packit 13e616
The UPDN algorithm is designed to prevent deadlocks from occurring in loops
Packit 13e616
of the subnet. A loop-deadlock is a situation in which it is no longer
Packit 13e616
possible to send data between any two hosts connected through the loop. As
Packit 13e616
such, the UPDN routing algorithm should be used if the subnet is not a pure
Packit 13e616
Fat Tree, and one of its loops may experience a deadlock (due, for example,
Packit 13e616
to high pressure).
Packit 13e616
Packit 13e616
The UPDN algorithm is based on the following main stages:
Packit 13e616
Packit 13e616
1.  Auto-detect root nodes - based on the CA hop length from any switch in
Packit 13e616
the subnet, a statistical histogram is built for each switch (hop num vs
Packit 13e616
number of occurrences). If the histogram reflects a specific column (higher
Packit 13e616
than others) for a certain node, then it is marked as a root node. Since
Packit 13e616
the algorithm is statistical, it may not find any root nodes. The list of
Packit 13e616
the root nodes found by this auto-detect stage is used by the ranking
Packit 13e616
process stage.
Packit 13e616
Packit 13e616
    Note 1: The user can override the node list manually.
Packit 13e616
    Note 2: If this stage cannot find any root nodes, and the user did not
Packit 13e616
            specify a guid list file, OpenSM defaults back to the Min Hop
Packit 13e616
            routing algorithm.
Packit 13e616
Packit 13e616
2.  Ranking process - All root switch nodes (found in stage 1) are assigned
Packit 13e616
a rank of 0. Using the BFS algorithm, the rest of the switch nodes in the
Packit 13e616
subnet are ranked incrementally. This ranking aids in the process of enforcing
Packit 13e616
rules that ensure loop-free paths.
Packit 13e616
Packit 13e616
3.  Min Hop Table setting - after ranking is done, a BFS algorithm is run from
Packit 13e616
each (CA or switch) node in the subnet. During the BFS process, the FDB table
Packit 13e616
of each switch node traversed by BFS is updated, in reference to the starting
Packit 13e616
node, based on the ranking rules and guid values.
Packit 13e616
Packit 13e616
At the end of the process, the updated FDB tables ensure loop-free paths
Packit 13e616
through the subnet.
Packit 13e616
Packit 13e616
Note: Up/Down routing does not allow LID routing communication between
Packit 13e616
switches that are located inside spine "switch systems".
Packit 13e616
The reason is that there is no way to allow a LID route between them
Packit 13e616
that does not break the Up/Down rule.
Packit 13e616
One ramification of this is that you cannot run SM on switches other
Packit 13e616
than the leaf switches of the fabric.
Packit 13e616
Packit 13e616
Packit 13e616
UPDN Algorithm Usage
Packit 13e616
Packit 13e616
Activation through OpenSM
Packit 13e616
Packit 13e616
Use '-R updn' option (instead of old '-u') to activate the UPDN algorithm.
Packit 13e616
Use `-a <guid_list_file>' for adding an UPDN guid file that contains the
Packit 13e616
root nodes for ranking.
Packit 13e616
If the `-a' option is not used, OpenSM uses its auto-detect root nodes
Packit 13e616
algorithm.
Packit 13e616
Packit 13e616
Notes on the guid list file:
Packit 13e616
1.   A valid guid file specifies one guid in each line. Lines with an invalid
Packit 13e616
format will be discarded.
Packit 13e616
2.   The user should specify the root switch guids. However, it is also
Packit 13e616
possible to specify CA guids; OpenSM will use the guid of the switch (if
Packit 13e616
it exists) that connects the CA to the subnet as a root node.
Packit 13e616
Packit 13e616
Packit 13e616
To learn more about deadlock-free routing, see the article
Packit 13e616
"Deadlock Free Message Routing in Multiprocessor Interconnection Networks"
Packit 13e616
by William J Dally and Charles L Seitz (1985).
Packit 13e616
Packit 13e616
Packit 13e616
DNUP Routing Algorithm
Packit 13e616
----------------------
Packit 13e616
Packit 13e616
Purpose:
Packit 13e616
Packit 13e616
The DNUP algorithm is designed to serve a similar purpose to UPDN. However
Packit 13e616
it is intended to work in network topologies which are unsuited to
Packit 13e616
UPDN due to nodes being connected closer to the roots than some of
Packit 13e616
the switches.  An example would be a fabric which contains nodes and
Packit 13e616
uplinks connected to the same switch. The operation of DNUP is the
Packit 13e616
same as UPDN with the exception of the ranking process.  In DNUP all
Packit 13e616
switch nodes are ranked based solely on their distance from CA Nodes,
Packit 13e616
all switch nodes directly connected to at least one CA are assigned a
Packit 13e616
value of 1 all other switch nodes are assigned a value of one more than
Packit 13e616
the minimum rank of all neighbor switch nodes.
Packit 13e616
Packit 13e616
Packit 13e616
Fat-tree Routing Algorithm
Packit 13e616
--------------------------
Packit 13e616
Packit 13e616
Purpose:
Packit 13e616
Packit 13e616
The fat-tree algorithm optimizes routing for "shift" communication pattern.
Packit 13e616
It should be chosen if a subnet is a symmetrical or almost symmetrical
Packit 13e616
fat-tree of various types.
Packit 13e616
It supports not just K-ary-N-Trees, by handling for non-constant K,
Packit 13e616
cases where not all leafs (CAs) are present, any Constant
Packit 13e616
Bisectional Ratio (CBB) ratio.  As in UPDN, fat-tree also prevents
Packit 13e616
credit-loop-deadlocks.
Packit 13e616
Packit 13e616
If the root guid file is not provided ('-a' or '--root_guid_file' options),
Packit 13e616
the topology has to be pure fat-tree that complies with the following rules:
Packit 13e616
  - Tree rank should be between two and eight (inclusively)
Packit 13e616
  - Switches of the same rank should have the same number
Packit 13e616
    of UP-going port groups*, unless they are root switches,
Packit 13e616
    in which case the shouldn't have UP-going ports at all.
Packit 13e616
  - Switches of the same rank should have the same number
Packit 13e616
    of DOWN-going port groups, unless they are leaf switches.
Packit 13e616
  - Switches of the same rank should have the same number
Packit 13e616
    of ports in each UP-going port group.
Packit 13e616
  - Switches of the same rank should have the same number
Packit 13e616
    of ports in each DOWN-going port group.
Packit 13e616
  - All the CAs have to be at the same tree level (rank).
Packit 13e616
Packit 13e616
If the root guid file is provided, the topology doesn't have to be pure
Packit 13e616
fat-tree, and it should only comply with the following rules:
Packit 13e616
  - Tree rank should be between two and eight (inclusively)
Packit 13e616
  - All the Compute Nodes** have to be at the same tree level (rank).
Packit 13e616
    Note that non-compute node CAs are allowed here to be at different
Packit 13e616
    tree ranks.
Packit 13e616
Packit 13e616
* ports that are connected to the same remote switch are referenced as
Packit 13e616
'port group'.
Packit 13e616
** list of compute nodes (CNs) can be specified by '-u' or '--cn_guid_file'
Packit 13e616
OpenSM options.
Packit 13e616
Packit 13e616
Note that although fat-tree algorithm supports trees with non-integer CBB
Packit 13e616
ratio, the routing will not be as balanced as in case of integer CBB ratio.
Packit 13e616
In addition to this, although the algorithm allows leaf switches to have any
Packit 13e616
number of CAs, the closer the tree is to be fully populated, the more effective
Packit 13e616
the "shift" communication pattern will be.
Packit 13e616
In general, even if the root list is provided, the closer the topology to a
Packit 13e616
pure and symmetrical fat-tree, the more optimal the routing will be.
Packit 13e616
Packit 13e616
The algorithm also dumps compute node ordering file (opensm-ftree-ca-order.dump)
Packit 13e616
in the same directory where the OpenSM log resides. This ordering file provides
Packit 13e616
the CN order that may be used to create efficient communication pattern, that
Packit 13e616
will match the routing tables.
Packit 13e616
Packit 13e616
Routing between non-CN nodes
Packit 13e616
Packit 13e616
Packit 13e616
The use of the cn_guid_file option allows non-CN nodes to be located on different levels in the fat tree.
Packit 13e616
In such case, it is not guaranteed that the Fat Tree algorithm will route between two non-CN nodes.
Packit 13e616
In the scheme below, N1, N2 and N3 are non-CN nodes. Although all the CN have routes to and from them,
Packit 13e616
there will not necessarily be a route between N1,N2 and N3.
Packit 13e616
Such routes would require to use at least one of the Switch the wrong way around
Packit 13e616
(In fact, go out of one of the top Switch through a downgoing port while we are supposed to go up).
Packit 13e616
Packit 13e616
  Spine1   Spine2    Spine 3
Packit 13e616
   / \     /  |  \    /   \
Packit 13e616
  /   \   /   |   \  /     \
Packit 13e616
 N1  Switch   N2  Switch    N3
Packit 13e616
      /|\          /|\
Packit 13e616
     / | \        / | \
Packit 13e616
    Going down to compute nodes
Packit 13e616
Packit 13e616
To solve this problem, a list of non-CN nodes can be specified by \'-G\' or \'--io_guid_file\' option.
Packit 13e616
Theses nodes will be allowed to use switches the wrong way around a specific number of times (specified by \'-H\' or \'--max_reverse_hops\'.
Packit 13e616
With the proper max_reverse_hops and io_guid_file values, you can ensure full connectivity in the Fat Tree.
Packit 13e616
Packit 13e616
In the scheme above, with a max_reverse_hop of 1, routes will be instantiated between N1<->N2 and N2<->N3.
Packit 13e616
With a max_reverse_hops value of 2, N1,N2 and N3 will all have routes between them.
Packit 13e616
Packit 13e616
Please note that using max_reverse_hops creates routes that use the switch in a counter-stream way.
Packit 13e616
This option should never be used to connect nodes with high bandwidth traffic between them ! It should only be used
Packit 13e616
to allow connectivity for HA purposes or similar.
Packit 13e616
Also having routes the other way around can in theory cause credit loops.
Packit 13e616
Packit 13e616
Use these options with extreme care !
Packit 13e616
Packit 13e616
Packit 13e616
Usage:
Packit 13e616
Packit 13e616
Activation through OpenSM
Packit 13e616
Packit 13e616
Use '-R ftree' option to activate the fat-tree algorithm.
Packit 13e616
Packit 13e616
Note: LMC > 0 is not supported by fat-tree routing. If this is
Packit 13e616
specified, the default routing algorithm is invoked instead.
Packit 13e616
Packit 13e616
Packit 13e616
LASH Routing Algorithm
Packit 13e616
----------------------
Packit 13e616
Packit 13e616
LASH is an acronym for LAyered SHortest Path Routing. It is a
Packit 13e616
deterministic shortest path routing algorithm that enables topology
Packit 13e616
agnostic deadlock-free routing within communication networks.
Packit 13e616
Packit 13e616
When computing the routing function, LASH analyzes the network
Packit 13e616
topology for the shortest-path routes between all pairs of sources /
Packit 13e616
destinations and groups these paths into virtual layers in such a way
Packit 13e616
as to avoid deadlock.
Packit 13e616
Packit 13e616
Note LASH analyzes routes and ensures deadlock freedom between switch
Packit 13e616
pairs. The link from HCA between and switch does not need virtual
Packit 13e616
layers as deadlock will not arise between switch and HCA.
Packit 13e616
Packit 13e616
In more detail, the algorithm works as follows:
Packit 13e616
Packit 13e616
1) LASH determines the shortest-path between all pairs of source /
Packit 13e616
destination switches. Note, LASH ensures the same SL is used for all
Packit 13e616
SRC/DST - DST/SRC pairs and there is no guarantee that the return
Packit 13e616
path for a given DST/SRC will be the reverse of the route SRC/DST.
Packit 13e616
Packit 13e616
2) LASH then begins an SL assignment process where a route is assigned
Packit 13e616
to a layer (SL) if the addition of that route does not cause deadlock
Packit 13e616
within that layer. This is achieved by maintaining and analysing a
Packit 13e616
channel dependency graph for each layer. Once the potential addition
Packit 13e616
of a path could lead to deadlock, LASH opens a new layer and continues
Packit 13e616
the process.
Packit 13e616
Packit 13e616
3) Once this stage has been completed, it is highly likely that the
Packit 13e616
first layers processed will contain more paths than the latter ones.
Packit 13e616
To better balance the use of layers, LASH moves paths from one layer
Packit 13e616
to another so that the number of paths in each layer averages out.
Packit 13e616
Packit 13e616
Note, the implementation of LASH in opensm attempts to use as few layers
Packit 13e616
as possible. This number can be less than the number of actual layers
Packit 13e616
available.
Packit 13e616
Packit 13e616
In general LASH is a very flexible algorithm. It can, for example,
Packit 13e616
reduce to Dimension Order Routing in certain topologies, it is topology
Packit 13e616
agnostic and fares well in the face of faults.
Packit 13e616
Packit 13e616
It has been shown that for both regular and irregular topologies, LASH
Packit 13e616
outperforms Up/Down. The reason for this is that LASH distributes the
Packit 13e616
traffic more evenly through a network, avoiding the bottleneck issues
Packit 13e616
related to a root node and always routes shortest-path.
Packit 13e616
Packit 13e616
The algorithm was developed by Simula Research Laboratory.
Packit 13e616
Packit 13e616
To learn more about LASH and the flexibility behind it, the requirement
Packit 13e616
for layers, performance comparisons to other algorithms, see the
Packit 13e616
following articles:
Packit 13e616
Packit 13e616
"Layered Routing in Irregular Networks", Lysne et al, IEEE
Packit 13e616
Transactions on Parallel and Distributed Systems, VOL.16, No12,
Packit 13e616
December 2005.
Packit 13e616
Packit 13e616
"Routing for the ASI Fabric Manager", Solheim et al. IEEE
Packit 13e616
Communications Magazine, Vol.44, No.7, July 2006.
Packit 13e616
Packit 13e616
"Layered Shortest Path (LASH) Routing in Irregular System Area
Packit 13e616
Networks", Skeie et al. IEEE Computer Society Communication
Packit 13e616
Architecture for Clusters 2002.
Packit 13e616
Packit 13e616
Packit 13e616
Use '-R lash -Q ' option to activate the LASH algorithm.
Packit 13e616
Packit 13e616
Note: QoS support has to be turned on in order that SL/VL mappings are
Packit 13e616
used.
Packit 13e616
Packit 13e616
Note: LMC > 0 is not supported by the LASH routing. If this is
Packit 13e616
specified, the default routing algorithm is invoked instead.
Packit 13e616
Packit 13e616
For open regular cartesian meshes the DOR algorithm is the ideal
Packit 13e616
routing algorithm. For toroidal meshes on the other hand there
Packit 13e616
are routing loops that can cause deadlocks. LASH can be used to
Packit 13e616
route these cases. The performance of LASH can be improved by
Packit 13e616
preconditioning the mesh in cases where there are multiple links
Packit 13e616
connecting switches and also in cases where the switches are not
Packit 13e616
cabled consistently. An option exists for LASH to do this. To
Packit 13e616
invoke this use '-R lash -Q --do_mesh_analysis'. This will
Packit 13e616
add an additional phase that analyses the mesh to try to determine
Packit 13e616
the dimension and size of a mesh. If it determines that the mesh
Packit 13e616
looks like an open or closed cartesian mesh it reorders the ports
Packit 13e616
in dimension order before the rest of the LASH algorithm runs.
Packit 13e616
Packit 13e616
DOR Routing Algorithm
Packit 13e616
---------------------
Packit 13e616
Packit 13e616
The Dimension Order Routing algorithm is based on the Min Hop
Packit 13e616
algorithm and so uses shortest paths.  Instead of spreading traffic
Packit 13e616
out across different paths with the same shortest distance, it chooses
Packit 13e616
among the available shortest paths based on an ordering of dimensions.
Packit 13e616
Each port must be consistently cabled to represent a hypercube
Packit 13e616
dimension or a mesh dimension.  Paths are grown from a destination
Packit 13e616
back to a source using the lowest dimension (port) of available paths
Packit 13e616
at each step.  This provides the ordering necessary to avoid deadlock.
Packit 13e616
When there are multiple links between any two switches, they still
Packit 13e616
represent only one dimension and traffic is balanced across them
Packit 13e616
unless port equalization is turned off.  In the case of hypercubes,
Packit 13e616
the same port must be used throughout the fabric to represent the
Packit 13e616
hypercube dimension and match on both ends of the cable.  In the case
Packit 13e616
of meshes, the dimension should consistently use the same pair of
Packit 13e616
ports, one port on one end of the cable, and the other port on the
Packit 13e616
other end, continuing along the mesh dimension.
Packit 13e616
Packit 13e616
Use '-R dor' option to activate the DOR algorithm.
Packit 13e616
Packit 13e616
Torus-2QoS Routing Algorithm
Packit 13e616
----------------------------
Packit 13e616
Packit 13e616
Torus-2QoS is a routing algorithm designed for large-scale 2D/3D torus fabrics.
Packit 13e616
The torus-2QoS routing engine can provide the following functionality on
Packit 13e616
a 2D/3D torus:
Packit 13e616
- routing that is free of credit loops
Packit 13e616
- two levels of QoS, assuming switches support 8 data VLs
Packit 13e616
- ability to route around a single failed switch, and/or multiple failed
Packit 13e616
    links, without
Packit 13e616
    - introducing credit loops
Packit 13e616
    - changing path SL values
Packit 13e616
- very short run times, with good scaling properties as fabric size
Packit 13e616
    increases
Packit 13e616
Packit 13e616
Unicast Routing:
Packit 13e616
Packit 13e616
Torus-2QoS is a DOR-based algorithm that avoids deadlocks that would otherwise
Packit 13e616
occur in a torus using the concept of a dateline for each torus dimension.
Packit 13e616
It encodes into a path SL which datelines the path crosses as follows:
Packit 13e616
Packit 13e616
  sl = 0;
Packit 13e616
  for (d = 0; d < torus_dimensions; d++)
Packit 13e616
    /* path_crosses_dateline(d) returns 0 or 1 */
Packit 13e616
    sl |= path_crosses_dateline(d) << d;
Packit 13e616
Packit 13e616
For a 3D torus, that leaves one SL bit free, which torus-2QoS uses to
Packit 13e616
implement two QoS levels.
Packit 13e616
Packit 13e616
Torus-2QoS also makes use of the output port dependence of switch SL2VL
Packit 13e616
maps to encode into one VL bit the information encoded in three SL bits.
Packit 13e616
It computes in which torus coordinate direction each inter-switch link
Packit 13e616
"points", and writes SL2VL maps for such ports as follows:
Packit 13e616
Packit 13e616
  for (sl = 0; sl < 16; sl ++)
Packit 13e616
    /* cdir(port) reports which torus coordinate direction a switch port
Packit 13e616
     * "points" in, and returns 0, 1, or 2 */
Packit 13e616
    sl2vl(iport,oport,sl) = 0x1 & (sl >> cdir(oport));
Packit 13e616
Packit 13e616
Thus, on a pristine 3D torus, i.e., in the absence of failed fabric switches,
Packit 13e616
torus-2QoS consumes 8 SL values (SL bits 0-2) and 2 VL values (VL bit 0)
Packit 13e616
per QoS level to provide deadlock-free routing on a 3D torus.
Packit 13e616
Packit 13e616
Torus-2QoS routes around link failure by "taking the long way around" any
Packit 13e616
1D ring interrupted by a link failure.  For example, consider the 2D 6x5
Packit 13e616
torus below, where switches are denoted by [+a-zA-Z]:
Packit 13e616
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
   4  --+----+----+----+----+----+--
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
   3  --+----+----+----D----+----+--
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
   2  --+----+----I----r----+----+--
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
   1  --m----S----n----T----o----p--
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
 y=0  --+----+----+----+----+----+--
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
Packit 13e616
      x=0    1    2    3    4    5
Packit 13e616
Packit 13e616
For a pristine fabric the path from S to D would be S-n-T-r-D.  In the
Packit 13e616
event that either link S-n or n-T has failed, torus-2QoS would use the path
Packit 13e616
S-m-p-o-T-r-D.  Note that it can do this without changing the path SL
Packit 13e616
value; once the 1D ring m-S-n-T-o-p-m has been broken by failure, path
Packit 13e616
segments using it cannot contribute to deadlock, and the x-direction
Packit 13e616
dateline (between, say, x=5 and x=0) can be ignored for path segments on
Packit 13e616
that ring.
Packit 13e616
Packit 13e616
One result of this is that torus-2QoS can route around many simultaneous
Packit 13e616
link failures, as long as no 1D ring is broken into disjoint segments.  For
Packit 13e616
example, if links n-T and T-o have both failed, that ring has been broken
Packit 13e616
into two disjoint segments, T and o-p-m-S-n.  Torus-2QoS checks for such
Packit 13e616
issues, reports if they are found, and refuses to route such fabrics.
Packit 13e616
Packit 13e616
Note that in the case where there are multiple parallel links between a pair
Packit 13e616
of switches, torus-2QoS will allocate routes across such links in a round-
Packit 13e616
robin fashion, based on ports at the path destination switch that are active
Packit 13e616
and not used for inter-switch links.  Should a link that is one of several
Packit 13e616
such parallel links fail, routes are redistributed across the remaining
Packit 13e616
links.   When the last of such a set of parallel links fails, traffic is
Packit 13e616
rerouted as described above.
Packit 13e616
Packit 13e616
Handling a failed switch under DOR requires introducing into a path at
Packit 13e616
least one turn that would be otherwise "illegal", i.e. not allowed by DOR
Packit 13e616
rules.  Torus-2QoS will introduce such a turn as close as possible to the
Packit 13e616
failed switch in order to route around it.
Packit 13e616
Packit 13e616
In the above example, suppose switch T has failed, and consider the path
Packit 13e616
from S to D.  Torus-2QoS will produce the path S-n-I-r-D, rather than the
Packit 13e616
S-n-T-r-D path for a pristine torus, by introducing an early turn at n.
Packit 13e616
Normal DOR rules will cause traffic arriving at switch I to be forwarded
Packit 13e616
to switch r; for traffic arriving from I due to the "early" turn at n,
Packit 13e616
this will generate an "illegal" turn at I.
Packit 13e616
Packit 13e616
Torus-2QoS will also use the input port dependence of SL2VL maps to set VL
Packit 13e616
bit 1 (which would be otherwise unused) for y-x, z-x, and z-y turns, i.e.,
Packit 13e616
those turns that are illegal under DOR.  This causes the first hop after
Packit 13e616
any such turn to use a separate set of VL values, and prevents deadlock in
Packit 13e616
the presence of a single failed switch.
Packit 13e616
Packit 13e616
For any given path, only the hops after a turn that is illegal under DOR
Packit 13e616
can contribute to a credit loop that leads to deadlock.  So in the example
Packit 13e616
above with failed switch T, the location of the illegal turn at I in the
Packit 13e616
path from S to D requires that any credit loop caused by that turn must
Packit 13e616
encircle the failed switch at T.  Thus the second and later hops after the
Packit 13e616
illegal turn at I (i.e., hop r-D) cannot contribute to a credit loop
Packit 13e616
because they cannot be used to construct a loop encircling T.  The hop I-r
Packit 13e616
uses a separate VL, so it cannot contribute to a credit loop encircling T.
Packit 13e616
Packit 13e616
Extending this argument shows that in addition to being capable of routing
Packit 13e616
around a single switch failure without introducing deadlock, torus-2QoS can
Packit 13e616
also route around multiple failed switches on the condition they are
Packit 13e616
adjacent in the last dimension routed by DOR.  For example, consider the
Packit 13e616
following case on a 6x6 2D torus:
Packit 13e616
Packit 13e616
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
   5  --+----+----+----+----+----+--
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
   4  --+----+----+----D----+----+--
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
   3  --+----+----I----u----+----+--
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
   2  --+----+----q----R----+----+--
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
   1  --m----S----n----T----o----p--
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
 y=0  --+----+----+----+----+----+--
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
Packit 13e616
      x=0    1    2    3    4    5
Packit 13e616
Packit 13e616
Packit 13e616
Suppose switches T and R have failed, and consider the path from S to D.
Packit 13e616
Torus-2QoS will generate the path S-n-q-I-u-D, with an illegal turn at
Packit 13e616
switch I, and with hop I-u using a VL with bit 1 set.
Packit 13e616
Packit 13e616
As a further example, consider a case that torus-2QoS cannot route without
Packit 13e616
deadlock: two failed switches adjacent in a dimension that is not the last
Packit 13e616
dimension routed by DOR; here the failed switches are O and T:
Packit 13e616
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
   5  --+----+----+----+----+----+--
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
   4  --+----+----+----+----+----+--
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
   3  --+----+----+----+----D----+--
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
   2  --+----+----I----q----r----+--
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
   1  --m----S----n----O----T----p--
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
 y=0  --+----+----+----+----+----+--
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
Packit 13e616
      x=0    1    2    3    4    5
Packit 13e616
Packit 13e616
In a pristine fabric, torus-2QoS would generate the path from S to D as
Packit 13e616
S-n-O-T-r-D.  With failed switches O and T, torus-2QoS will generate the
Packit 13e616
path S-n-I-q-r-D, with illegal turn at switch I, and with hop I-q using a
Packit 13e616
VL with bit 1 set.  In contrast to the earlier examples, the second hop
Packit 13e616
after the illegal turn, q-r, can be used to construct a credit loop
Packit 13e616
encircling the failed switches.
Packit 13e616
Packit 13e616
Multicast Routing:
Packit 13e616
Packit 13e616
Since torus-2QoS uses all four available SL bits, and the three data VL
Packit 13e616
bits that are typically available in current switches, there is no way
Packit 13e616
to use SL/VL values to separate multicast traffic from unicast traffic.
Packit 13e616
Thus, torus-2QoS must generate multicast routing such that credit loops
Packit 13e616
cannot arise from a combination of multicast and unicast path segments.
Packit 13e616
Packit 13e616
It turns out that it is possible to construct spanning trees for multicast
Packit 13e616
routing that have that property.  For the 2D 6x5 torus example above, here
Packit 13e616
is the full-fabric spanning tree that torus-2QoS will construct, where "x"
Packit 13e616
is the root switch and each "+" is a non-root switch:
Packit 13e616
Packit 13e616
   4    +    +    +    +    +    +
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
   3    +    +    +    +    +    +
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
   2    +----+----+----x----+----+
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
   1    +    +    +    +    +    +
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
 y=0    +    +    +    +    +    +
Packit 13e616
Packit 13e616
      x=0    1    2    3    4    5
Packit 13e616
Packit 13e616
For multicast traffic routed from root to tip, every turn in the above
Packit 13e616
spanning tree is a legal DOR turn.
Packit 13e616
Packit 13e616
For traffic routed from tip to root, and some traffic routed through the
Packit 13e616
root, turns are not legal DOR turns.  However, to construct a credit loop,
Packit 13e616
the union of multicast routing on this spanning tree with DOR unicast
Packit 13e616
routing can only provide 3 of the 4 turns needed for the loop.
Packit 13e616
Packit 13e616
In addition, if none of the above spanning tree branches crosses a dateline
Packit 13e616
used for unicast credit loop avoidance on a torus, and if multicast traffic
Packit 13e616
is confined to SL 0 or SL 8 (recall that torus-2QoS uses SL bit 3 to
Packit 13e616
differentiate QoS level), then multicast traffic also cannot contribute to
Packit 13e616
the "ring" credit loops that are otherwise possible in a torus.
Packit 13e616
Packit 13e616
Torus-2QoS uses these ideas to create a master spanning tree.  Every
Packit 13e616
multicast group spanning tree will be constructed as a subset of the master
Packit 13e616
tree, with the same root as the master tree.
Packit 13e616
Packit 13e616
Such multicast group spanning trees will in general not be optimal for
Packit 13e616
groups which are a subset of the full fabric. However, this compromise must
Packit 13e616
be made to enable support for two QoS levels on a torus while preventing
Packit 13e616
credit loops.
Packit 13e616
Packit 13e616
In the presence of link or switch failures that result in a fabric for
Packit 13e616
which torus-2QoS can generate credit-loop-free unicast routes, it is also
Packit 13e616
possible to generate a master spanning tree for multicast that retains the
Packit 13e616
required properties.  For example, consider that same 2D 6x5 torus, with
Packit 13e616
the link from (2,2) to (3,2) failed.  Torus-2QoS will generate the following
Packit 13e616
master spanning tree:
Packit 13e616
Packit 13e616
   4    +    +    +    +    +    +
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
   3    +    +    +    +    +    +
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
   2  --+----+----+    x----+----+--
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
   1    +    +    +    +    +    +
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
 y=0    +    +    +    +    +    +
Packit 13e616
Packit 13e616
      x=0    1    2    3    4    5
Packit 13e616
Packit 13e616
Two things are notable about this master spanning tree.  First, assuming
Packit 13e616
the x dateline was between x=5 and x=0, this spanning tree has a branch
Packit 13e616
that crosses the dateline.  However, just as for unicast, crossing a
Packit 13e616
dateline on a 1D ring (here, the ring for y=2) that is broken by a failure
Packit 13e616
cannot contribute to a torus credit loop.
Packit 13e616
Packit 13e616
Second, this spanning tree is no longer optimal even for multicast groups
Packit 13e616
that encompass the entire fabric.  That, unfortunately, is a compromise that
Packit 13e616
must be made to retain the other desirable properties of torus-2QoS routing.
Packit 13e616
Packit 13e616
In the event that a single switch fails, torus-2QoS will generate a master
Packit 13e616
spanning tree that has no "extra" turns by appropriately selecting a root
Packit 13e616
switch.  In the 2D 6x5 torus example, assume now that the switch at (3,2),
Packit 13e616
i.e. the root for a pristine fabric, fails.  Torus-2QoS will generate the
Packit 13e616
following master spanning tree for that case:
Packit 13e616
Packit 13e616
		       |
Packit 13e616
   4    +    +    +    +    +    +
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
   3    +    +    +    +    +    +
Packit 13e616
	|    |    |         |    |
Packit 13e616
   2    +    +    +         +    +
Packit 13e616
	|    |    |         |    |
Packit 13e616
   1    +----+----x----+----+----+
Packit 13e616
	|    |    |    |    |    |
Packit 13e616
 y=0    +    +    +    +    +    +
Packit 13e616
		       |
Packit 13e616
Packit 13e616
      x=0    1    2    3    4    5
Packit 13e616
Packit 13e616
Assuming the y dateline was between y=4 and y=0, this spanning tree has
Packit 13e616
a branch that crosses a dateline.  However, again this cannot contribute
Packit 13e616
to credit loops as it occurs on a 1D ring (the ring for x=3) that is
Packit 13e616
broken by a failure, as in the above example.
Packit 13e616
Packit 13e616
Torus Topology Discovery:
Packit 13e616
Packit 13e616
The algorithm used by torus-2QoS to construct the torus topology from the
Packit 13e616
undirected graph representing the fabric requires that the radix of each
Packit 13e616
dimension be configured via torus-2QoS.conf. It also requires that the
Packit 13e616
torus topology be "seeded"; for a 3D torus this requires configuring four
Packit 13e616
switches that define the three coordinate directions of the torus.
Packit 13e616
Packit 13e616
Given this starting information, the algorithm is to examine the cube
Packit 13e616
formed by the eight switch locations bounded by the corners (x,y,z) and
Packit 13e616
(x+1,y+1,z+1).  Based on switches already placed into the torus topology at
Packit 13e616
some of these locations, the algorithm examines 4-loops of interswitch
Packit 13e616
links to find the one that is consistent with a face of the cube of switch
Packit 13e616
locations, and adds its switches to the discovered topology in the correct
Packit 13e616
locations.
Packit 13e616
Packit 13e616
Because the algorithm is based on examining the topology of 4-loops of links,
Packit 13e616
a torus with one or more radix-4 dimensions requires extra initial seed
Packit 13e616
configuration.  See torus-2QoS.conf(5) for details. Torus-2QoS will detect
Packit 13e616
and report when it has insufficient configuration for a torus with radix-4
Packit 13e616
dimensions.
Packit 13e616
Packit 13e616
In the event the torus is significantly degraded, i.e., there are many
Packit 13e616
missing switches or links, it may happen that torus-2QoS is unable to place
Packit 13e616
into the torus some switches and/or links that were discovered in the
Packit 13e616
fabric, and will generate a warning in that case.  A similar condition
Packit 13e616
occurs if torus-2QoS is misconfigured, i.e., the radix of a torus dimension
Packit 13e616
as configured does not match the radix of that torus dimension as wired,
Packit 13e616
and many switches/links in the fabric will not be placed into the torus.
Packit 13e616
Packit 13e616
Quality Of Service Configuration:
Packit 13e616
Packit 13e616
OpenSM will not program switches and channel adapters with SL2VL maps or VL
Packit 13e616
arbitration configuration unless it is invoked with -Q.  Since torus-2QoS
Packit 13e616
depends on such functionality for correct operation, always invoke OpenSM
Packit 13e616
with -Q when torus-2QoS is in the list of routing engines.
Packit 13e616
Packit 13e616
Any quality of service configuration method supported by OpenSM will work
Packit 13e616
with torus-2QoS, subject to the following limitations and considerations.
Packit 13e616
Packit 13e616
For all routing engines supported by OpenSM except torus-2QoS, there is a
Packit 13e616
one-to-one correspondence between QoS level and SL. Torus-2QoS can only
Packit 13e616
support two quality of service levels, so only the high-order bit of any SL
Packit 13e616
value used for unicast QoS configuration will be honored by torus-2QoS.
Packit 13e616
Packit 13e616
For multicast QoS configuration, only SL values 0 and 8 should be used with
Packit 13e616
torus-2QoS.
Packit 13e616
Packit 13e616
Since SL to VL map configuration must be under the complete control of
Packit 13e616
torus-2QoS, any configuration via qos_sl2vl, qos_swe_sl2vl, etc., must and
Packit 13e616
will be ignored, and a warning will be generated.
Packit 13e616
Packit 13e616
For inter-switch links, Torus-2QoS uses VL values 0-3 to implement one of
Packit 13e616
its supported QoS levels, and VL values 4-7 to implement the other. For
Packit 13e616
endport links (CA, router, switch management port), Torus-2QoS uses VL
Packit 13e616
value 0 for one of its supported QoS levels and VL value 1 to implement
Packit 13e616
the other.  Hard-to-diagnose application issues may arise if traffic is
Packit 13e616
not delivered fairly across each of these two VL ranges. For
Packit 13e616
inter-switch links, Torus-2QoS will detect and warn if VL arbitration is
Packit 13e616
configured unfairly across VLs in the range 0-3, and also in the range
Packit 13e616
4-7. Note that the default OpenSM VL arbitration configuration does
Packit 13e616
not meet this constraint, so all torus-2QoS users should configure VL
Packit 13e616
arbitration via qos_ca_vlarb_high, qos_swe_vlarb_high, qos_ca_vlarb_low,
Packit 13e616
qos_swe_vlarb_low, etc.
Packit 13e616
Packit 13e616
Note that torus-2QoS maps SL values to VL values differently
Packit 13e616
for inter-switch and endport links.  This is why qos_vlarb_high and
Packit 13e616
qos_vlarb_low should not be used, as using them may result in
Packit 13e616
VL arbitration for a QoS level being different across inter-switch
Packit 13e616
links vs. across endport links.
Packit 13e616
Packit 13e616
Operational Considerations:
Packit 13e616
Packit 13e616
Any routing algorithm for a torus IB fabric must employ path SL values to
Packit 13e616
avoid credit loops. As a result, all applications run over such fabrics
Packit 13e616
must perform a path record query to obtain the correct path SL for
Packit 13e616
connection setup. Applications that use rdma_cm for connection setup will
Packit 13e616
automatically meet this requirement.
Packit 13e616
Packit 13e616
If a change in fabric topology causes changes in path SL values required to
Packit 13e616
route without credit loops, in general all applications would need to
Packit 13e616
repath to avoid message deadlock. Since torus-2QoS has the ability to
Packit 13e616
reroute after a single switch failure without changing path SL values,
Packit 13e616
repathing by running applications is not required when the fabric is routed
Packit 13e616
with torus-2QoS.
Packit 13e616
Packit 13e616
Torus-2QoS can provide unchanging path SL values in the presence of subnet
Packit 13e616
manager failover provided that all OpenSM instances have the same idea of
Packit 13e616
dateline location. See torus-2QoS.conf(5) for details.
Packit 13e616
Packit 13e616
Torus-2QoS will detect configurations of failed switches and links that
Packit 13e616
prevent routing that is free of credit loops, and will log warnings and
Packit 13e616
refuse to route. If "no_fallback" was configured in the list of OpenSM
Packit 13e616
routing engines, then no other routing engine will attempt to route the
Packit 13e616
fabric. In that case all paths that do not transit the failed components
Packit 13e616
will continue to work, and the subset of paths that are still operational
Packit 13e616
will continue to remain free of credit loops. OpenSM will continue to
Packit 13e616
attempt to route the fabric after every sweep interval, and after any
Packit 13e616
change (such as a link up) in the fabric topology. When the fabric
Packit 13e616
components are repaired, full functionality will be restored.
Packit 13e616
Packit 13e616
In the event OpenSM was configured to allow some other engine to route the
Packit 13e616
fabric if torus-2QoS fails, then credit loops and message deadlock are
Packit 13e616
likely if torus-2QoS had previously routed the fabric successfully. Even if
Packit 13e616
the other engine is capable of routing a torus without credit loops,
Packit 13e616
applications that built connections with path SL values granted under
Packit 13e616
torus-2QoS will likely experience message deadlock under routing generated
Packit 13e616
by a different engine, unless they repath.
Packit 13e616
Packit 13e616
To verify that a torus fabric is routed free of credit loops, use ibdmchk
Packit 13e616
to analyze data collected via ibdiagnet -vlr.
Packit 13e616
Packit 13e616
DFSSSP and SSSP Routing Algorithm
Packit 13e616
---------------------------------
Packit 13e616
Packit 13e616
The (Deadlock-Free) Single-Source-Shortest-Path routing algorithm is
Packit 13e616
designed to optimize link utilization thru global balancing of routes,
Packit 13e616
while supporting arbitrary topologies. The DFSSSP routing algorithm
Packit 13e616
uses InfiniBand virtual lanes (SL) to provide deadlock-freedom.
Packit 13e616
Packit 13e616
The DFSSSP algorithm consists of five major steps:
Packit 13e616
1) It discovers the subnet and models the subnet as a directed
Packit 13e616
   multigraph in which each node represents a node of the physical
Packit 13e616
   network and each edge represents one direction of the full-duplex
Packit 13e616
   links used to connect the nodes.
Packit 13e616
2) A loop, which iterates over all CA and switches of the subnet, will
Packit 13e616
   perform three steps to generate the linear forwarding tables for
Packit 13e616
   each switch:
Packit 13e616
2.1) use Dijkstra's algorithm to find the shortest path from all nodes
Packit 13e616
     to the current selected destination;
Packit 13e616
2.2) update the edge weights in the graph, i.e. add the number of
Packit 13e616
     routes, which use a link to reach the destination, to the link/edge;
Packit 13e616
2.3) update the LFT of each switch with the outgoing port which was used
Packit 13e616
     in the current step to route the traffic to the destination node.
Packit 13e616
3) After the number of available virtual lanes or layers in the subnet
Packit 13e616
   is detected and a channel dependency graph is initialized for each layer,
Packit 13e616
   the algorithm will put each possible route of the subnet into the first
Packit 13e616
   layer.
Packit 13e616
4) A loop iterates over all channel dependency graphs (CDG) and performs
Packit 13e616
   the following substeps:
Packit 13e616
4.1) search for a cycle in the current CDG;
Packit 13e616
4.2) when a cycle is found, i.e. a possible deadlock is present,
Packit 13e616
     one edge is selected and all routes, which induced this edge, are moved
Packit 13e616
     to the "next higher" virtual layer (CDG[i+1]);
Packit 13e616
4.3) the cycle search is continued until all cycles are broken and
Packit 13e616
     routes are moved "up".
Packit 13e616
5) When the number of needed layers does not exceeds the number of
Packit 13e616
   available SL/VL to remove all cycles in all CDGs, the routing is
Packit 13e616
   deadlock-free and an relation table is generated, which contains
Packit 13e616
   the assignment of routes from source to destination to a SL
Packit 13e616
Packit 13e616
Note on SSSP:
Packit 13e616
This algorithm does not perform the steps 3)-5) and can not be
Packit 13e616
considered to be deadlock-free for all topologies. But on the one
Packit 13e616
hand, you can choose this algorithm for really large networks
Packit 13e616
(5,000+ CAs and deadlock-free by design) to reduce
Packit 13e616
the runtime of the algorithm. On the other hand, you might use
Packit 13e616
the SSSP routing algorithm as an alternative, when all deadlock-free
Packit 13e616
routing algorithms fail to route the network for whatever reason.
Packit 13e616
In the last case, SSSP was designed to deliver an equal or higher
Packit 13e616
bandwidth due to better congestion avoidance than the Min Hop
Packit 13e616
routing algorithm.
Packit 13e616
Packit 13e616
Notes for usage:
Packit 13e616
  a) running DFSSSP: '-R dfsssp -Q'
Packit 13e616
    a.1) QoS has to be configured to equally spread the load on the
Packit 13e616
         available SL or virtual lanes
Packit 13e616
    a.2) applications must perform a path record query to get path SL for
Packit 13e616
         each route, which the application will use to transmit packages
Packit 13e616
  b) running SSSP:   '-R sssp'
Packit 13e616
  c) both algorithms support LMC > 0
Packit 13e616
Packit 13e616
Hints for separate optimization of compute and I/O traffic:
Packit 13e616
Having more nodes (I/O and compute) connected to a switch than incoming links
Packit 13e616
can result in a 'bad' routing of the I/O traffic as long as (DF)SSSP routing
Packit 13e616
is not aware of the dedicated I/O nodes, i.e., in the following network
Packit 13e616
configuration CN1-CN3 might send all I/O traffic via Link2 to IO1,IO2:
Packit 13e616
Packit 13e616
     CN1         Link1        IO1
Packit 13e616
        \       /----\       /
Packit 13e616
  CN2 -- Switch1      Switch2 -- CN4
Packit 13e616
        /       \----/       \
Packit 13e616
     CN3         Link2        IO2
Packit 13e616
Packit 13e616
To prevent this from happening (DF)SSSP can use both the compute node guid
Packit 13e616
file and the I/O guid file specified by the '-u' or '--cn_guid_file' and
Packit 13e616
'-G' or '--io_guid_file' options (similar to the Fat-Tree routing).
Packit 13e616
This ensures that traffic towards compute nodes and I/O nodes is balanced
Packit 13e616
separately and therefore distributed as much as possible across the available
Packit 13e616
links. Port GUIDs, as listed by ibstat, must be specified (not Node GUIDs).
Packit 13e616
The priority for the optimization is as follows:
Packit 13e616
  compute nodes -> I/O nodes -> other nodes
Packit 13e616
Possible use case scenarios:
Packit 13e616
  a) neither '-u' nor '-G' are specified: all nodes a treated as 'other nodes'
Packit 13e616
     and therefore balanced equally;
Packit 13e616
  b) '-G' is specified: traffic towards I/O nodes will be balanced optimally;
Packit 13e616
  c) the system has three node types, such as login/admin, compute and I/O,
Packit 13e616
     but the balancing focus should be I/O, then one has to use '-u' and '-G'
Packit 13e616
     with I/O guids listed in cn_guid_file and compute node guids listed in
Packit 13e616
     io_guid_file;
Packit 13e616
  d) ...
Packit 13e616
Packit 13e616
For more information about the algorithms, i.e. balancing the routes and
Packit 13e616
moving the routes to different virtual layers, and about comparison with
Packit 13e616
other routing algorithms, please refer to the following articles:
Packit 13e616
1. J. Domke, T. Hoefler and W. Nagel: Deadlock-Free Oblivious Routing
Packit 13e616
   for Arbitrary Topologies, In Proceedings of the 25th IEEE International
Packit 13e616
   Parallel & Distributed Processing Symposium (IPDPS 2011)
Packit 13e616
2. T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for
Packit 13e616
   Large-Scale InfiniBand Networks, In 17th Annual IEEE Symposium on High
Packit 13e616
   Performance Interconnects (HOTI 2009)
Packit 13e616
Packit 13e616
Nue Routing Algorithm
Packit 13e616
---------------------
Packit 13e616
Packit 13e616
The implementation of Nue routing for OpenSM is a 100%-applicable, balanced, and
Packit 13e616
deadlock-free unicast routing engine (which also configures multicast tables,
Packit 13e616
see 'Note on multicast' below). The key points of this algorithm are the
Packit 13e616
following:
Packit 13e616
 - 100% fault-tolerant, oblivious routing strategy
Packit 13e616
 - topology-agnostic, i.e., applicable to every topology (no matter if topology
Packit 13e616
   is regular, irregular after faults, or random)
Packit 13e616
 - 100% deadlock-free routing within the resource limits (i.e., it never exceeds
Packit 13e616
   the given number of available virtual lanes, and it does not necessarily
Packit 13e616
   require virtual lanes) for every topology
Packit 13e616
 - very good path balancing and therefore high throughput
Packit 13e616
 - QoS (via SLs/VLs) + deadlock-freedom can be combined (since both rely on VLs)
Packit 13e616
 - forwarding tables are fast to calculate: O(n^2 * log n), however slightly
Packit 13e616
   slower compared to topology-aware routings (for obvious reasons), and
Packit 13e616
 - the path-to-VL mapping only depends on the destination, which may be useful
Packit 13e616
   for scalable, efficient path resolution and caching mechanisms.
Packit 13e616
From a very high level perspective, Nue routing is similar to DFSSSP (see above)
Packit 13e616
in the sense that both use Dijkstra and edge weight updates for path balancing.
Packit 13e616
However, the fundamental difference is that Nue routing doesn't perform the path
Packit 13e616
calculation on the graph representing the real fabric, and instead routes
Packit 13e616
directly within the channel dependency graph. This approach allows Nue routing
Packit 13e616
to place routing restrictions (to avoid any credit loops) in an on-demand
Packit 13e616
manner, which overcomes the problem of all other good VL-based algorithms.
Packit 13e616
Meaning, the competitors cannot control or limit the use of VLs, and might run
Packit 13e616
out of them and have to give up. On the flip side, Nue may have to use detours
Packit 13e616
for a few routes, and hence cannot really be considered "shortest-path" routing,
Packit 13e616
because of the impossibility lemma 6.1 (see ref. [2] Chapter 6) to accomplish
Packit 13e616
deadlock-free, shortest-path routing with an limited number of available virtual
Packit 13e616
lanes for arbitrary network topologies.
Packit 13e616
Packit 13e616
Conceptually, Nue routing works as follows:
Packit 13e616
 * Assume N is the set of destinations and k the number of available VLs
Packit 13e616
 * Assume that virtual lanes (VLs) are combined into virtual layers
Packit 13e616
 * Partition N into k disjoint subsets N_1 ,..., N_k of destinations
Packit 13e616
 * foreach virtual layer L_i with i in {1 ,..., k} do:
Packit 13e616
 *     Create a convex subgraph H_i for N_i
Packit 13e616
 *     Identify central node n_r in N_i of convex H_i via Brandes' algorithm
Packit 13e616
 *     Create a new complete channel dependency graph D_i for layer L_i
Packit 13e616
 *     Define escape paths D* in D_i for spanning tree rooted at n_r
Packit 13e616
 *     foreach destination node n in N_i do:
Packit 13e616
 *         Identify all deadlock-free paths towards n
Packit 13e616
 *         Store these paths in unicast forwarding tables
Packit 13e616
 *         Update channel weights in D_i for these paths
Packit 13e616
and the service level for applications returned in path request is determined
Packit 13e616
by Nue (for k>1) as follows:
Packit 13e616
 * Assuming an input of a given source/destination node (LID) pair
Packit 13e616
 * If Nue mapped the destination to layer L_x, then SL == x-1 is returned, while
Packit 13e616
   presuming that SL2VL tables are being mapped 1:1 for the number of layers.
Packit 13e616
Packit 13e616
While other VL-based routings usually gradually construct the channel dependency
Packit 13e616
graph after all paths have been calculated, Nue creates a "complete" version of
Packit 13e616
it which holds all possible dependencies allowing it to directly search for
Packit 13e616
cycles after "using" a dependency on the path to a destination. Since it would
Packit 13e616
be infeasible to search for cycles each and every time a channel dependency is
Packit 13e616
added, Nue employs the notion of individually colored subgraphs (one for each
Packit 13e616
destination) and only performs real cycle searches in the complete CDG when
Packit 13e616
adding new edges between nodes of the same subgraph, see Section 6.2.6.1 of
Packit 13e616
reference [2] for details on this optimization.
Packit 13e616
Packit 13e616
Use of METIS library:
Packit 13e616
In step 2 of the previously shown pseudo code, Nue routing separates the LIDs
Packit 13e616
into multiple subsets, one for every virtual layer. Nue has two options to
Packit 13e616
perform this partitioning (not to be confused with IB partitions): the first is
Packit 13e616
a fairly simple semi-random assignment of LIDs to layers/subsets, and the second
Packit 13e616
partitioning uses the METIS library to partition the network graph into k
Packit 13e616
approximately equal sized parts. The latter approach has shown better results
Packit 13e616
in terms of path balancing and avoidance of using the escape paths, and hence
Packit 13e616
it is HIGHLY advised to install/use the METIS library with OpenSM (enforced
Packit 13e616
via `--enable-metis' configure flag when building OpenSM). For the rare case,
Packit 13e616
that METIS isn't packaged with the Linux distro, here is a link to the official
Packit 13e616
website to download and install METIS 5.1.0 manually:
Packit 13e616
   http://glaros.dtc.umn.edu/gkhome/metis/metis/overview
Packit 13e616
OpenSM's configure script also provides options in case METIS header and library
Packit 13e616
aren't found in the default path.
Packit 13e616
Packit 13e616
Runtime options for Nue:
Packit 13e616
The behavior of Nue routing can be directly influenced by two osm.conf
Packit 13e616
parameters (one is also available as command line option):
Packit 13e616
 - nue_max_num_vls: which controls/limits the number of virtual lanes which Nue
Packit 13e616
       is allowed to use (detailed explanation in osm.conf file); this option is
Packit 13e616
       also available via command line
Packit 13e616
 - nue_include_switches: the option (if TRUE) enforces Nue to treat switches
Packit 13e616
       as "normal" endpoints sending/receiving data traffic, which is usually
Packit 13e616
       not the case (also with other routings); hence, paths to switches will be
Packit 13e616
       included when calculating deadlock-free ucast tables (suggestion for IB
Packit 13e616
       subnets: FALSE)
Packit 13e616
Furthermore, Nue supports TRUE and FALSE settings of avoid_throttled_links,
Packit 13e616
use_ucast_cache, and qos (more on this hereafter); and lmc > 0.
Packit 13e616
Packit 13e616
Notes on Quality of Service (QoS):
Packit 13e616
The advantage of Nue is that it works with AND without QoS being enabled, i.e.,
Packit 13e616
the usage of SLs/VLs for deadlock-freedom can be avoided. Here are the three
Packit 13e616
possible usage scenarios:
Packit 13e616
 a) nue_max_num_vls = 1 and qos = disabled => Nue assumes that only 1 virtual
Packit 13e616
       layer (identical to the physical network; or OperVLs equal to VL0) is
Packit 13e616
       usable and all paths are to be calculated within this one layer. Hence,
Packit 13e616
       there is no need for special SL2VL mappings in the network and the use of
Packit 13e616
       specific SLs by applications. So, enabling QoS is not required for
Packit 13e616
       credit-free unicast routing tables.
Packit 13e616
 b) nue_max_num_vls = 1 and qos = enabled => This combination works essentially
Packit 13e616
       like (a), meaning the SL returned for path record requests is not defined
Packit 13e616
       by Nue, since all paths are deadlock-free without using VLs. However, any
Packit 13e616
       separate QoS settings may influence the SL returned to applications.
Packit 13e616
 c) nue_max_num_vls > 1 and qos = enabled => In this configuration, applications
Packit 13e616
       have to query and obey the SL for path records as returned by Nue because
Packit 13e616
       otherwise the deadlock-freedom cannot be guaranteed anymore. Furthermore,
Packit 13e616
       errors in the fabric may require applications to repath to avoid message
Packit 13e616
       deadlocks (this is only a current limitation of the implementation since
Packit 13e616
       Nue uses METIS to assign destination LIDs to VLs, and a network fault
Packit 13e616
       may change the outcome of METIS' partitioning; so, if anyone needs to
Packit 13e616
       avoid repaths, then please contact the developer). Since Nue operates on
Packit 13e616
       virtual layer, admins should configure the SL2VL mapping tables in an
Packit 13e616
       homogeneous 1-to-1 manner across the entire subnet to separate the layers
Packit 13e616
       and avoid paths from changing into the wrong layer (example mapping is
Packit 13e616
       VL_i = SL_i % qos_max_vls for all i 0..15). Depending on the actual
Packit 13e616
       setting of nue_max_num_vls, one can further differentiate:
Packit 13e616
   c.1) All operational VLs are used for deadlock-free routes (either by setting
Packit 13e616
        nue_max_num_vls to qos_max_vls or to 0 for "auto detection"), and hence
Packit 13e616
	real QoS to prioritize traffic in the subnet isn't supported. The VL
Packit 13e616
	arbitration settings for this usage scenarios should be configured
Packit 13e616
	equally for all VLs and for all switches to avoid bandwidth limitations
Packit 13e616
	for subset of destinations, e.g.
Packit 13e616
          qos_max_vls 8
Packit 13e616
          qos_high_limit 4
Packit 13e616
          qos_vlarb_high 0:64,1:64,2:64,3:64,4:64,5:64,6:64,7:64
Packit 13e616
          qos_vlarb_low 0:4,1:4,2:4,3:4,4:4,5:4,6:4,7:4
Packit 13e616
          qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
Packit 13e616
        when assuming 8 operational VLs in the fabric.
Packit 13e616
   c.2) Nue is limited to a subset of operational VLs allowing the mix of
Packit 13e616
        deadlock-freedom based on some SLs/VLs and real QoS with the remaining
Packit 13e616
	set of SLs/VLs; e.g., using VL0-3 to avoid credit-loops (= 1. QoS level)
Packit 13e616
	and VL4-7 used2for the second QoS level. In this example, application
Packit 13e616
	should either comply with the returned SL (in the path query) or select
Packit 13e616
	SL + 4 to use the second QoS level. (Reason: Nue itself is unaware of
Packit 13e616
	these QoS levels and only returns SLs in the range of 0 to #layers-1)
Packit 13e616
 d) nue_max_num_vls > 1 and qos = disabled => SHOULD NOT BE USED TOGETHER, since
Packit 13e616
       the SL2VL mapping for switches must be configured correctly.
Packit 13e616
As an additional note, using more VLs for Nue usually improves the overall
Packit 13e616
network throughput (as shown in [1] and [2]), so there are trade offs admins
Packit 13e616
may have to consider when configuring the subnet manager with Nue routing.
Packit 13e616
Packit 13e616
Note on multicast:
Packit 13e616
The Nue routing engine configures multicast forwarding tables by utilizing a
Packit 13e616
spanning tree calculation routed at a subnet switch suggested by OpenSM's
Packit 13e616
internal osm_mcast_mgr_find_root_switch(...) fn. This spanning tree for a mcast
Packit 13e616
group will try to use the least overloaded links (w.r.t the ucast paths-per-link
Packit 13e616
metric/weight) in the fabric. However, Nue routing currently does not guarantee
Packit 13e616
deadlock-freedom for the set of multicast routes on all topologies, nor for the
Packit 13e616
combination of deadlock-free unicast routes with additional multicast routes.
Packit 13e616
Assuming, for a given topology the calculated mcast routes are dl-free, then
Packit 13e616
an admin may fix the latter problem by separating the VLs, e.g., using VL0-6 for
Packit 13e616
ucast by specifying nue_max_num_vls=7 and utilizing VL7 for mcast.
Packit 13e616
Packit 13e616
For more information about Nue routing and comparisons with other OpenSM routing
Packit 13e616
algorithms, please refer to the following publications:
Packit 13e616
1. J. Domke, T. Hoefler and S. Matsuoka: "Routing on the Dependency Graph: A New
Packit 13e616
   Approach to Deadlock-Free High-Performance Routing", in HPDC'16
Packit 13e616
   (online: http://doi.acm.org/10.1145/2907294.2907313)
Packit 13e616
2. J. Domke "Routing on the Channel Dependency Graph: A New Approach to
Packit 13e616
   Deadlock-Free, Destination-Based, High-Performance Routing for Lossless
Packit 13e616
   Interconnection Networks", 2017, Dissertation, TU Dresden
Packit 13e616
   (online: http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-225902)