|
Packit |
13e616 |
Current OpenSM Routing
|
|
Packit |
13e616 |
12/13/17
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
OpenSM offers ten routing engines:
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
1. Min Hop Algorithm - based on the minimum hops to each node where the
|
|
Packit |
13e616 |
path length is optimized.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
2. UPDN Unicast routing algorithm - also based on the minimum hops to each
|
|
Packit |
13e616 |
node, but it is constrained to ranking rules. This algorithm should be chosen
|
|
Packit |
13e616 |
if the subnet is not a pure Fat Tree, and deadlock may occur due to a
|
|
Packit |
13e616 |
loop in the subnet.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
3. DNUP Unicast routing algorithm - similar to UPDN but allows routing in
|
|
Packit |
13e616 |
fabrics which have some CA nodes attached closer to the roots than some switch
|
|
Packit |
13e616 |
nodes.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
4. Fat-tree Unicast routing algorithm - this algorithm optimizes routing
|
|
Packit |
13e616 |
of fat-trees for congestion-free "shift" communication pattern.
|
|
Packit |
13e616 |
It should be chosen if a subnet is a symmetrical fat-tree.
|
|
Packit |
13e616 |
Similar to UPDN routing, Fat-tree routing is credit-loop-free.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
5. LASH unicast routing algorithm - uses InfiniBand virtual layers
|
|
Packit |
13e616 |
(SL) to provide deadlock-free shortest-path routing while also
|
|
Packit |
13e616 |
distributing the paths between layers. LASH is an alternative
|
|
Packit |
13e616 |
deadlock-free topology-agnostic routing algorithm to the non-minimal
|
|
Packit |
13e616 |
UPDN algorithm avoiding the use of a potentially congested root node.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
6. DOR Unicast routing algorithm - based on the Min Hop algorithm, but
|
|
Packit |
13e616 |
avoids port equalization except for redundant links between the same
|
|
Packit |
13e616 |
two switches. This provides deadlock free routes for hypercubes when
|
|
Packit |
13e616 |
the fabric is cabled as a hypercube and for meshes when cabled as a
|
|
Packit |
13e616 |
mesh (see details below).
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
7. Torus-2QoS unicast routing algorithm - a DOR-based routing algorithm
|
|
Packit |
13e616 |
specialized for 2D/3D torus topologies. Torus-2QoS provides deadlock-free
|
|
Packit |
13e616 |
routing while supporting two quality of service (QoS) levels. In addition
|
|
Packit |
13e616 |
it is able to route around multiple failed fabric links or a single failed
|
|
Packit |
13e616 |
fabric switch without introducing deadlocks, and without changing path SL
|
|
Packit |
13e616 |
values granted before the failure.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
8. DFSSSP unicast routing algorithm - a deadlock-free single-source-
|
|
Packit |
13e616 |
shortest-path routing, which uses the SSSP algorithm (see algorithm 9.)
|
|
Packit |
13e616 |
as the base to optimize link utilization and uses InfiniBand virtual lanes
|
|
Packit |
13e616 |
(SL) to provide deadlock-freedom.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
9. SSSP unicast routing algorithm - a single-source-shortest-path routing
|
|
Packit |
13e616 |
algorithm, which globally balances the number of routes per link to
|
|
Packit |
13e616 |
optimize link utilization. This routing algorithm has no restrictions
|
|
Packit |
13e616 |
in terms of the underlying topology.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
10. Nue unicast routing algorithm - a 100%-applicable and deadlock-free
|
|
Packit |
13e616 |
routing which can be used for any arbitrary or faulty network topology
|
|
Packit |
13e616 |
and any number of virtual lanes (this includes the absence of VLs as well).
|
|
Packit |
13e616 |
Paths are globally balanced w.r.t the number of routes per link, and are
|
|
Packit |
13e616 |
kept as short as possible while enforcing deadlock-freedom within the VL
|
|
Packit |
13e616 |
constraint.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
OpenSM provides an optional unicast routing cache (enabled by -A or
|
|
Packit |
13e616 |
--ucast_cache options). When enabled, unicast routing cache prevents
|
|
Packit |
13e616 |
routing recalculation (which is a heavy task in a large cluster) when
|
|
Packit |
13e616 |
there was no topology change detected during the heavy sweep, or when
|
|
Packit |
13e616 |
the topology change does not require new routing calculation, e.g. when
|
|
Packit |
13e616 |
one or more CAs/RTRs/leaf switches going down, or one or more of these
|
|
Packit |
13e616 |
nodes coming back after being down.
|
|
Packit |
13e616 |
A very common case that is handled by the unicast routing cache is host
|
|
Packit |
13e616 |
reboot, which otherwise would cause two full routing recalculations: one
|
|
Packit |
13e616 |
when the host goes down, and the other when the host comes back online.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
OpenSM also supports a file method which can load routes from a table. See
|
|
Packit |
13e616 |
modular-routing.txt for more information on this.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
The basic routing algorithm is comprised of two stages:
|
|
Packit |
13e616 |
1. MinHop matrix calculation
|
|
Packit |
13e616 |
How many hops are required to get from each port to each LID ?
|
|
Packit |
13e616 |
The algorithm to fill these tables is different if you run standard
|
|
Packit |
13e616 |
(min hop) or Up/Down.
|
|
Packit |
13e616 |
For standard routing, a "relaxation" algorithm is used to propagate
|
|
Packit |
13e616 |
min hop from every destination LID through neighbor switches
|
|
Packit |
13e616 |
For Up/Down routing, a BFS from every target is used. The BFS tracks link
|
|
Packit |
13e616 |
direction (up or down) and avoid steps that will perform up after a down
|
|
Packit |
13e616 |
step was used.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
2. Once MinHop matrices exist, each switch is visited and for each target LID,
|
|
Packit |
13e616 |
a decision is made as to what port should be used to get to that LID.
|
|
Packit |
13e616 |
This step is common to standard and Up/Down routing. Each port has a
|
|
Packit |
13e616 |
counter counting the number of target LIDs going through it.
|
|
Packit |
13e616 |
When there are multiple alternative ports with same MinHop to a LID,
|
|
Packit |
13e616 |
the one with less previously assigned LIDs is selected.
|
|
Packit |
13e616 |
If LMC > 0, more checks are added: Within each group of LIDs assigned to
|
|
Packit |
13e616 |
same target port,
|
|
Packit |
13e616 |
a. use only ports which have same MinHop
|
|
Packit |
13e616 |
b. first prefer the ones that go to different systemImageGuid (then
|
|
Packit |
13e616 |
the previous LID of the same LMC group)
|
|
Packit |
13e616 |
c. if none - prefer those which go through another NodeGuid
|
|
Packit |
13e616 |
d. fall back to the number of paths method (if all go to same node).
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Effect of Topology Changes
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
OpenSM will preserve existing routing in any case where there is no change in
|
|
Packit |
13e616 |
the fabric switches unless the -r (--reassign_lids) option is specified.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
-r
|
|
Packit |
13e616 |
--reassign_lids
|
|
Packit |
13e616 |
This option causes OpenSM to reassign LIDs to all
|
|
Packit |
13e616 |
end nodes. Specifying -r on a running subnet
|
|
Packit |
13e616 |
may disrupt subnet traffic.
|
|
Packit |
13e616 |
Without -r, OpenSM attempts to preserve existing
|
|
Packit |
13e616 |
LID assignments resolving multiple use of same LID.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
If a link is added or removed, OpenSM does not recalculate
|
|
Packit |
13e616 |
the routes that do not have to change. A route has to change
|
|
Packit |
13e616 |
if the port is no longer UP or no longer the MinHop. When routing changes
|
|
Packit |
13e616 |
are performed, the same algorithm for balancing the routes is invoked.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
In the case of using the file based routing, any topology changes are
|
|
Packit |
13e616 |
currently ignored The 'file' routing engine just loads the LFTs from the file
|
|
Packit |
13e616 |
specified, with no reaction to real topology. Obviously, this will not be able
|
|
Packit |
13e616 |
to recheck LIDs (by GUID) for disconnected nodes, and LFTs for non-existent
|
|
Packit |
13e616 |
switches will be skipped. Multicast is not affected by 'file' routing engine
|
|
Packit |
13e616 |
(this uses min hop tables).
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Min Hop Algorithm
|
|
Packit |
13e616 |
-----------------
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
The Min Hop algorithm is invoked by default if no routing algorithm is
|
|
Packit |
13e616 |
specified. It can also be invoked by specifying '-R minhop'.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
The Min Hop algorithm is divided into two stages: computation of
|
|
Packit |
13e616 |
min-hop tables on every switch and LFT output port assignment. Link
|
|
Packit |
13e616 |
subscription is also equalized with the ability to override based on
|
|
Packit |
13e616 |
port GUID. The latter is supplied by:
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
-i <equalize-ignore-guids-file>
|
|
Packit |
13e616 |
--ignore_guids <equalize-ignore-guids-file>
|
|
Packit |
13e616 |
This option provides the means to define a set of ports
|
|
Packit |
13e616 |
(by guids) that will be ignored by the link load
|
|
Packit |
13e616 |
equalization algorithm.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
LMC awareness routes based on (remote) system or switch basis.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
UPDN Routing Algorithm
|
|
Packit |
13e616 |
----------------------
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Purpose of UPDN Algorithm
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
The UPDN algorithm is designed to prevent deadlocks from occurring in loops
|
|
Packit |
13e616 |
of the subnet. A loop-deadlock is a situation in which it is no longer
|
|
Packit |
13e616 |
possible to send data between any two hosts connected through the loop. As
|
|
Packit |
13e616 |
such, the UPDN routing algorithm should be used if the subnet is not a pure
|
|
Packit |
13e616 |
Fat Tree, and one of its loops may experience a deadlock (due, for example,
|
|
Packit |
13e616 |
to high pressure).
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
The UPDN algorithm is based on the following main stages:
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
1. Auto-detect root nodes - based on the CA hop length from any switch in
|
|
Packit |
13e616 |
the subnet, a statistical histogram is built for each switch (hop num vs
|
|
Packit |
13e616 |
number of occurrences). If the histogram reflects a specific column (higher
|
|
Packit |
13e616 |
than others) for a certain node, then it is marked as a root node. Since
|
|
Packit |
13e616 |
the algorithm is statistical, it may not find any root nodes. The list of
|
|
Packit |
13e616 |
the root nodes found by this auto-detect stage is used by the ranking
|
|
Packit |
13e616 |
process stage.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Note 1: The user can override the node list manually.
|
|
Packit |
13e616 |
Note 2: If this stage cannot find any root nodes, and the user did not
|
|
Packit |
13e616 |
specify a guid list file, OpenSM defaults back to the Min Hop
|
|
Packit |
13e616 |
routing algorithm.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
2. Ranking process - All root switch nodes (found in stage 1) are assigned
|
|
Packit |
13e616 |
a rank of 0. Using the BFS algorithm, the rest of the switch nodes in the
|
|
Packit |
13e616 |
subnet are ranked incrementally. This ranking aids in the process of enforcing
|
|
Packit |
13e616 |
rules that ensure loop-free paths.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
3. Min Hop Table setting - after ranking is done, a BFS algorithm is run from
|
|
Packit |
13e616 |
each (CA or switch) node in the subnet. During the BFS process, the FDB table
|
|
Packit |
13e616 |
of each switch node traversed by BFS is updated, in reference to the starting
|
|
Packit |
13e616 |
node, based on the ranking rules and guid values.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
At the end of the process, the updated FDB tables ensure loop-free paths
|
|
Packit |
13e616 |
through the subnet.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Note: Up/Down routing does not allow LID routing communication between
|
|
Packit |
13e616 |
switches that are located inside spine "switch systems".
|
|
Packit |
13e616 |
The reason is that there is no way to allow a LID route between them
|
|
Packit |
13e616 |
that does not break the Up/Down rule.
|
|
Packit |
13e616 |
One ramification of this is that you cannot run SM on switches other
|
|
Packit |
13e616 |
than the leaf switches of the fabric.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
UPDN Algorithm Usage
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Activation through OpenSM
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Use '-R updn' option (instead of old '-u') to activate the UPDN algorithm.
|
|
Packit |
13e616 |
Use `-a <guid_list_file>' for adding an UPDN guid file that contains the
|
|
Packit |
13e616 |
root nodes for ranking.
|
|
Packit |
13e616 |
If the `-a' option is not used, OpenSM uses its auto-detect root nodes
|
|
Packit |
13e616 |
algorithm.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Notes on the guid list file:
|
|
Packit |
13e616 |
1. A valid guid file specifies one guid in each line. Lines with an invalid
|
|
Packit |
13e616 |
format will be discarded.
|
|
Packit |
13e616 |
2. The user should specify the root switch guids. However, it is also
|
|
Packit |
13e616 |
possible to specify CA guids; OpenSM will use the guid of the switch (if
|
|
Packit |
13e616 |
it exists) that connects the CA to the subnet as a root node.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
To learn more about deadlock-free routing, see the article
|
|
Packit |
13e616 |
"Deadlock Free Message Routing in Multiprocessor Interconnection Networks"
|
|
Packit |
13e616 |
by William J Dally and Charles L Seitz (1985).
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
DNUP Routing Algorithm
|
|
Packit |
13e616 |
----------------------
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Purpose:
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
The DNUP algorithm is designed to serve a similar purpose to UPDN. However
|
|
Packit |
13e616 |
it is intended to work in network topologies which are unsuited to
|
|
Packit |
13e616 |
UPDN due to nodes being connected closer to the roots than some of
|
|
Packit |
13e616 |
the switches. An example would be a fabric which contains nodes and
|
|
Packit |
13e616 |
uplinks connected to the same switch. The operation of DNUP is the
|
|
Packit |
13e616 |
same as UPDN with the exception of the ranking process. In DNUP all
|
|
Packit |
13e616 |
switch nodes are ranked based solely on their distance from CA Nodes,
|
|
Packit |
13e616 |
all switch nodes directly connected to at least one CA are assigned a
|
|
Packit |
13e616 |
value of 1 all other switch nodes are assigned a value of one more than
|
|
Packit |
13e616 |
the minimum rank of all neighbor switch nodes.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Fat-tree Routing Algorithm
|
|
Packit |
13e616 |
--------------------------
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Purpose:
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
The fat-tree algorithm optimizes routing for "shift" communication pattern.
|
|
Packit |
13e616 |
It should be chosen if a subnet is a symmetrical or almost symmetrical
|
|
Packit |
13e616 |
fat-tree of various types.
|
|
Packit |
13e616 |
It supports not just K-ary-N-Trees, by handling for non-constant K,
|
|
Packit |
13e616 |
cases where not all leafs (CAs) are present, any Constant
|
|
Packit |
13e616 |
Bisectional Ratio (CBB) ratio. As in UPDN, fat-tree also prevents
|
|
Packit |
13e616 |
credit-loop-deadlocks.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
If the root guid file is not provided ('-a' or '--root_guid_file' options),
|
|
Packit |
13e616 |
the topology has to be pure fat-tree that complies with the following rules:
|
|
Packit |
13e616 |
- Tree rank should be between two and eight (inclusively)
|
|
Packit |
13e616 |
- Switches of the same rank should have the same number
|
|
Packit |
13e616 |
of UP-going port groups*, unless they are root switches,
|
|
Packit |
13e616 |
in which case the shouldn't have UP-going ports at all.
|
|
Packit |
13e616 |
- Switches of the same rank should have the same number
|
|
Packit |
13e616 |
of DOWN-going port groups, unless they are leaf switches.
|
|
Packit |
13e616 |
- Switches of the same rank should have the same number
|
|
Packit |
13e616 |
of ports in each UP-going port group.
|
|
Packit |
13e616 |
- Switches of the same rank should have the same number
|
|
Packit |
13e616 |
of ports in each DOWN-going port group.
|
|
Packit |
13e616 |
- All the CAs have to be at the same tree level (rank).
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
If the root guid file is provided, the topology doesn't have to be pure
|
|
Packit |
13e616 |
fat-tree, and it should only comply with the following rules:
|
|
Packit |
13e616 |
- Tree rank should be between two and eight (inclusively)
|
|
Packit |
13e616 |
- All the Compute Nodes** have to be at the same tree level (rank).
|
|
Packit |
13e616 |
Note that non-compute node CAs are allowed here to be at different
|
|
Packit |
13e616 |
tree ranks.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
* ports that are connected to the same remote switch are referenced as
|
|
Packit |
13e616 |
'port group'.
|
|
Packit |
13e616 |
** list of compute nodes (CNs) can be specified by '-u' or '--cn_guid_file'
|
|
Packit |
13e616 |
OpenSM options.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Note that although fat-tree algorithm supports trees with non-integer CBB
|
|
Packit |
13e616 |
ratio, the routing will not be as balanced as in case of integer CBB ratio.
|
|
Packit |
13e616 |
In addition to this, although the algorithm allows leaf switches to have any
|
|
Packit |
13e616 |
number of CAs, the closer the tree is to be fully populated, the more effective
|
|
Packit |
13e616 |
the "shift" communication pattern will be.
|
|
Packit |
13e616 |
In general, even if the root list is provided, the closer the topology to a
|
|
Packit |
13e616 |
pure and symmetrical fat-tree, the more optimal the routing will be.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
The algorithm also dumps compute node ordering file (opensm-ftree-ca-order.dump)
|
|
Packit |
13e616 |
in the same directory where the OpenSM log resides. This ordering file provides
|
|
Packit |
13e616 |
the CN order that may be used to create efficient communication pattern, that
|
|
Packit |
13e616 |
will match the routing tables.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Routing between non-CN nodes
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
The use of the cn_guid_file option allows non-CN nodes to be located on different levels in the fat tree.
|
|
Packit |
13e616 |
In such case, it is not guaranteed that the Fat Tree algorithm will route between two non-CN nodes.
|
|
Packit |
13e616 |
In the scheme below, N1, N2 and N3 are non-CN nodes. Although all the CN have routes to and from them,
|
|
Packit |
13e616 |
there will not necessarily be a route between N1,N2 and N3.
|
|
Packit |
13e616 |
Such routes would require to use at least one of the Switch the wrong way around
|
|
Packit |
13e616 |
(In fact, go out of one of the top Switch through a downgoing port while we are supposed to go up).
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Spine1 Spine2 Spine 3
|
|
Packit |
13e616 |
/ \ / | \ / \
|
|
Packit |
13e616 |
/ \ / | \ / \
|
|
Packit |
13e616 |
N1 Switch N2 Switch N3
|
|
Packit |
13e616 |
/|\ /|\
|
|
Packit |
13e616 |
/ | \ / | \
|
|
Packit |
13e616 |
Going down to compute nodes
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
To solve this problem, a list of non-CN nodes can be specified by \'-G\' or \'--io_guid_file\' option.
|
|
Packit |
13e616 |
Theses nodes will be allowed to use switches the wrong way around a specific number of times (specified by \'-H\' or \'--max_reverse_hops\'.
|
|
Packit |
13e616 |
With the proper max_reverse_hops and io_guid_file values, you can ensure full connectivity in the Fat Tree.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
In the scheme above, with a max_reverse_hop of 1, routes will be instantiated between N1<->N2 and N2<->N3.
|
|
Packit |
13e616 |
With a max_reverse_hops value of 2, N1,N2 and N3 will all have routes between them.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Please note that using max_reverse_hops creates routes that use the switch in a counter-stream way.
|
|
Packit |
13e616 |
This option should never be used to connect nodes with high bandwidth traffic between them ! It should only be used
|
|
Packit |
13e616 |
to allow connectivity for HA purposes or similar.
|
|
Packit |
13e616 |
Also having routes the other way around can in theory cause credit loops.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Use these options with extreme care !
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Usage:
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Activation through OpenSM
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Use '-R ftree' option to activate the fat-tree algorithm.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Note: LMC > 0 is not supported by fat-tree routing. If this is
|
|
Packit |
13e616 |
specified, the default routing algorithm is invoked instead.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
LASH Routing Algorithm
|
|
Packit |
13e616 |
----------------------
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
LASH is an acronym for LAyered SHortest Path Routing. It is a
|
|
Packit |
13e616 |
deterministic shortest path routing algorithm that enables topology
|
|
Packit |
13e616 |
agnostic deadlock-free routing within communication networks.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
When computing the routing function, LASH analyzes the network
|
|
Packit |
13e616 |
topology for the shortest-path routes between all pairs of sources /
|
|
Packit |
13e616 |
destinations and groups these paths into virtual layers in such a way
|
|
Packit |
13e616 |
as to avoid deadlock.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Note LASH analyzes routes and ensures deadlock freedom between switch
|
|
Packit |
13e616 |
pairs. The link from HCA between and switch does not need virtual
|
|
Packit |
13e616 |
layers as deadlock will not arise between switch and HCA.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
In more detail, the algorithm works as follows:
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
1) LASH determines the shortest-path between all pairs of source /
|
|
Packit |
13e616 |
destination switches. Note, LASH ensures the same SL is used for all
|
|
Packit |
13e616 |
SRC/DST - DST/SRC pairs and there is no guarantee that the return
|
|
Packit |
13e616 |
path for a given DST/SRC will be the reverse of the route SRC/DST.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
2) LASH then begins an SL assignment process where a route is assigned
|
|
Packit |
13e616 |
to a layer (SL) if the addition of that route does not cause deadlock
|
|
Packit |
13e616 |
within that layer. This is achieved by maintaining and analysing a
|
|
Packit |
13e616 |
channel dependency graph for each layer. Once the potential addition
|
|
Packit |
13e616 |
of a path could lead to deadlock, LASH opens a new layer and continues
|
|
Packit |
13e616 |
the process.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
3) Once this stage has been completed, it is highly likely that the
|
|
Packit |
13e616 |
first layers processed will contain more paths than the latter ones.
|
|
Packit |
13e616 |
To better balance the use of layers, LASH moves paths from one layer
|
|
Packit |
13e616 |
to another so that the number of paths in each layer averages out.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Note, the implementation of LASH in opensm attempts to use as few layers
|
|
Packit |
13e616 |
as possible. This number can be less than the number of actual layers
|
|
Packit |
13e616 |
available.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
In general LASH is a very flexible algorithm. It can, for example,
|
|
Packit |
13e616 |
reduce to Dimension Order Routing in certain topologies, it is topology
|
|
Packit |
13e616 |
agnostic and fares well in the face of faults.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
It has been shown that for both regular and irregular topologies, LASH
|
|
Packit |
13e616 |
outperforms Up/Down. The reason for this is that LASH distributes the
|
|
Packit |
13e616 |
traffic more evenly through a network, avoiding the bottleneck issues
|
|
Packit |
13e616 |
related to a root node and always routes shortest-path.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
The algorithm was developed by Simula Research Laboratory.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
To learn more about LASH and the flexibility behind it, the requirement
|
|
Packit |
13e616 |
for layers, performance comparisons to other algorithms, see the
|
|
Packit |
13e616 |
following articles:
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
"Layered Routing in Irregular Networks", Lysne et al, IEEE
|
|
Packit |
13e616 |
Transactions on Parallel and Distributed Systems, VOL.16, No12,
|
|
Packit |
13e616 |
December 2005.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
"Routing for the ASI Fabric Manager", Solheim et al. IEEE
|
|
Packit |
13e616 |
Communications Magazine, Vol.44, No.7, July 2006.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
"Layered Shortest Path (LASH) Routing in Irregular System Area
|
|
Packit |
13e616 |
Networks", Skeie et al. IEEE Computer Society Communication
|
|
Packit |
13e616 |
Architecture for Clusters 2002.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Use '-R lash -Q ' option to activate the LASH algorithm.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Note: QoS support has to be turned on in order that SL/VL mappings are
|
|
Packit |
13e616 |
used.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Note: LMC > 0 is not supported by the LASH routing. If this is
|
|
Packit |
13e616 |
specified, the default routing algorithm is invoked instead.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
For open regular cartesian meshes the DOR algorithm is the ideal
|
|
Packit |
13e616 |
routing algorithm. For toroidal meshes on the other hand there
|
|
Packit |
13e616 |
are routing loops that can cause deadlocks. LASH can be used to
|
|
Packit |
13e616 |
route these cases. The performance of LASH can be improved by
|
|
Packit |
13e616 |
preconditioning the mesh in cases where there are multiple links
|
|
Packit |
13e616 |
connecting switches and also in cases where the switches are not
|
|
Packit |
13e616 |
cabled consistently. An option exists for LASH to do this. To
|
|
Packit |
13e616 |
invoke this use '-R lash -Q --do_mesh_analysis'. This will
|
|
Packit |
13e616 |
add an additional phase that analyses the mesh to try to determine
|
|
Packit |
13e616 |
the dimension and size of a mesh. If it determines that the mesh
|
|
Packit |
13e616 |
looks like an open or closed cartesian mesh it reorders the ports
|
|
Packit |
13e616 |
in dimension order before the rest of the LASH algorithm runs.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
DOR Routing Algorithm
|
|
Packit |
13e616 |
---------------------
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
The Dimension Order Routing algorithm is based on the Min Hop
|
|
Packit |
13e616 |
algorithm and so uses shortest paths. Instead of spreading traffic
|
|
Packit |
13e616 |
out across different paths with the same shortest distance, it chooses
|
|
Packit |
13e616 |
among the available shortest paths based on an ordering of dimensions.
|
|
Packit |
13e616 |
Each port must be consistently cabled to represent a hypercube
|
|
Packit |
13e616 |
dimension or a mesh dimension. Paths are grown from a destination
|
|
Packit |
13e616 |
back to a source using the lowest dimension (port) of available paths
|
|
Packit |
13e616 |
at each step. This provides the ordering necessary to avoid deadlock.
|
|
Packit |
13e616 |
When there are multiple links between any two switches, they still
|
|
Packit |
13e616 |
represent only one dimension and traffic is balanced across them
|
|
Packit |
13e616 |
unless port equalization is turned off. In the case of hypercubes,
|
|
Packit |
13e616 |
the same port must be used throughout the fabric to represent the
|
|
Packit |
13e616 |
hypercube dimension and match on both ends of the cable. In the case
|
|
Packit |
13e616 |
of meshes, the dimension should consistently use the same pair of
|
|
Packit |
13e616 |
ports, one port on one end of the cable, and the other port on the
|
|
Packit |
13e616 |
other end, continuing along the mesh dimension.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Use '-R dor' option to activate the DOR algorithm.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Torus-2QoS Routing Algorithm
|
|
Packit |
13e616 |
----------------------------
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Torus-2QoS is a routing algorithm designed for large-scale 2D/3D torus fabrics.
|
|
Packit |
13e616 |
The torus-2QoS routing engine can provide the following functionality on
|
|
Packit |
13e616 |
a 2D/3D torus:
|
|
Packit |
13e616 |
- routing that is free of credit loops
|
|
Packit |
13e616 |
- two levels of QoS, assuming switches support 8 data VLs
|
|
Packit |
13e616 |
- ability to route around a single failed switch, and/or multiple failed
|
|
Packit |
13e616 |
links, without
|
|
Packit |
13e616 |
- introducing credit loops
|
|
Packit |
13e616 |
- changing path SL values
|
|
Packit |
13e616 |
- very short run times, with good scaling properties as fabric size
|
|
Packit |
13e616 |
increases
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Unicast Routing:
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Torus-2QoS is a DOR-based algorithm that avoids deadlocks that would otherwise
|
|
Packit |
13e616 |
occur in a torus using the concept of a dateline for each torus dimension.
|
|
Packit |
13e616 |
It encodes into a path SL which datelines the path crosses as follows:
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
sl = 0;
|
|
Packit |
13e616 |
for (d = 0; d < torus_dimensions; d++)
|
|
Packit |
13e616 |
/* path_crosses_dateline(d) returns 0 or 1 */
|
|
Packit |
13e616 |
sl |= path_crosses_dateline(d) << d;
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
For a 3D torus, that leaves one SL bit free, which torus-2QoS uses to
|
|
Packit |
13e616 |
implement two QoS levels.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Torus-2QoS also makes use of the output port dependence of switch SL2VL
|
|
Packit |
13e616 |
maps to encode into one VL bit the information encoded in three SL bits.
|
|
Packit |
13e616 |
It computes in which torus coordinate direction each inter-switch link
|
|
Packit |
13e616 |
"points", and writes SL2VL maps for such ports as follows:
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
for (sl = 0; sl < 16; sl ++)
|
|
Packit |
13e616 |
/* cdir(port) reports which torus coordinate direction a switch port
|
|
Packit |
13e616 |
* "points" in, and returns 0, 1, or 2 */
|
|
Packit |
13e616 |
sl2vl(iport,oport,sl) = 0x1 & (sl >> cdir(oport));
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Thus, on a pristine 3D torus, i.e., in the absence of failed fabric switches,
|
|
Packit |
13e616 |
torus-2QoS consumes 8 SL values (SL bits 0-2) and 2 VL values (VL bit 0)
|
|
Packit |
13e616 |
per QoS level to provide deadlock-free routing on a 3D torus.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Torus-2QoS routes around link failure by "taking the long way around" any
|
|
Packit |
13e616 |
1D ring interrupted by a link failure. For example, consider the 2D 6x5
|
|
Packit |
13e616 |
torus below, where switches are denoted by [+a-zA-Z]:
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
4 --+----+----+----+----+----+--
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
3 --+----+----+----D----+----+--
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
2 --+----+----I----r----+----+--
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
1 --m----S----n----T----o----p--
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
y=0 --+----+----+----+----+----+--
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
x=0 1 2 3 4 5
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
For a pristine fabric the path from S to D would be S-n-T-r-D. In the
|
|
Packit |
13e616 |
event that either link S-n or n-T has failed, torus-2QoS would use the path
|
|
Packit |
13e616 |
S-m-p-o-T-r-D. Note that it can do this without changing the path SL
|
|
Packit |
13e616 |
value; once the 1D ring m-S-n-T-o-p-m has been broken by failure, path
|
|
Packit |
13e616 |
segments using it cannot contribute to deadlock, and the x-direction
|
|
Packit |
13e616 |
dateline (between, say, x=5 and x=0) can be ignored for path segments on
|
|
Packit |
13e616 |
that ring.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
One result of this is that torus-2QoS can route around many simultaneous
|
|
Packit |
13e616 |
link failures, as long as no 1D ring is broken into disjoint segments. For
|
|
Packit |
13e616 |
example, if links n-T and T-o have both failed, that ring has been broken
|
|
Packit |
13e616 |
into two disjoint segments, T and o-p-m-S-n. Torus-2QoS checks for such
|
|
Packit |
13e616 |
issues, reports if they are found, and refuses to route such fabrics.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Note that in the case where there are multiple parallel links between a pair
|
|
Packit |
13e616 |
of switches, torus-2QoS will allocate routes across such links in a round-
|
|
Packit |
13e616 |
robin fashion, based on ports at the path destination switch that are active
|
|
Packit |
13e616 |
and not used for inter-switch links. Should a link that is one of several
|
|
Packit |
13e616 |
such parallel links fail, routes are redistributed across the remaining
|
|
Packit |
13e616 |
links. When the last of such a set of parallel links fails, traffic is
|
|
Packit |
13e616 |
rerouted as described above.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Handling a failed switch under DOR requires introducing into a path at
|
|
Packit |
13e616 |
least one turn that would be otherwise "illegal", i.e. not allowed by DOR
|
|
Packit |
13e616 |
rules. Torus-2QoS will introduce such a turn as close as possible to the
|
|
Packit |
13e616 |
failed switch in order to route around it.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
In the above example, suppose switch T has failed, and consider the path
|
|
Packit |
13e616 |
from S to D. Torus-2QoS will produce the path S-n-I-r-D, rather than the
|
|
Packit |
13e616 |
S-n-T-r-D path for a pristine torus, by introducing an early turn at n.
|
|
Packit |
13e616 |
Normal DOR rules will cause traffic arriving at switch I to be forwarded
|
|
Packit |
13e616 |
to switch r; for traffic arriving from I due to the "early" turn at n,
|
|
Packit |
13e616 |
this will generate an "illegal" turn at I.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Torus-2QoS will also use the input port dependence of SL2VL maps to set VL
|
|
Packit |
13e616 |
bit 1 (which would be otherwise unused) for y-x, z-x, and z-y turns, i.e.,
|
|
Packit |
13e616 |
those turns that are illegal under DOR. This causes the first hop after
|
|
Packit |
13e616 |
any such turn to use a separate set of VL values, and prevents deadlock in
|
|
Packit |
13e616 |
the presence of a single failed switch.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
For any given path, only the hops after a turn that is illegal under DOR
|
|
Packit |
13e616 |
can contribute to a credit loop that leads to deadlock. So in the example
|
|
Packit |
13e616 |
above with failed switch T, the location of the illegal turn at I in the
|
|
Packit |
13e616 |
path from S to D requires that any credit loop caused by that turn must
|
|
Packit |
13e616 |
encircle the failed switch at T. Thus the second and later hops after the
|
|
Packit |
13e616 |
illegal turn at I (i.e., hop r-D) cannot contribute to a credit loop
|
|
Packit |
13e616 |
because they cannot be used to construct a loop encircling T. The hop I-r
|
|
Packit |
13e616 |
uses a separate VL, so it cannot contribute to a credit loop encircling T.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Extending this argument shows that in addition to being capable of routing
|
|
Packit |
13e616 |
around a single switch failure without introducing deadlock, torus-2QoS can
|
|
Packit |
13e616 |
also route around multiple failed switches on the condition they are
|
|
Packit |
13e616 |
adjacent in the last dimension routed by DOR. For example, consider the
|
|
Packit |
13e616 |
following case on a 6x6 2D torus:
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
5 --+----+----+----+----+----+--
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
4 --+----+----+----D----+----+--
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
3 --+----+----I----u----+----+--
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
2 --+----+----q----R----+----+--
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
1 --m----S----n----T----o----p--
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
y=0 --+----+----+----+----+----+--
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
x=0 1 2 3 4 5
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Suppose switches T and R have failed, and consider the path from S to D.
|
|
Packit |
13e616 |
Torus-2QoS will generate the path S-n-q-I-u-D, with an illegal turn at
|
|
Packit |
13e616 |
switch I, and with hop I-u using a VL with bit 1 set.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
As a further example, consider a case that torus-2QoS cannot route without
|
|
Packit |
13e616 |
deadlock: two failed switches adjacent in a dimension that is not the last
|
|
Packit |
13e616 |
dimension routed by DOR; here the failed switches are O and T:
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
5 --+----+----+----+----+----+--
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
4 --+----+----+----+----+----+--
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
3 --+----+----+----+----D----+--
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
2 --+----+----I----q----r----+--
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
1 --m----S----n----O----T----p--
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
y=0 --+----+----+----+----+----+--
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
x=0 1 2 3 4 5
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
In a pristine fabric, torus-2QoS would generate the path from S to D as
|
|
Packit |
13e616 |
S-n-O-T-r-D. With failed switches O and T, torus-2QoS will generate the
|
|
Packit |
13e616 |
path S-n-I-q-r-D, with illegal turn at switch I, and with hop I-q using a
|
|
Packit |
13e616 |
VL with bit 1 set. In contrast to the earlier examples, the second hop
|
|
Packit |
13e616 |
after the illegal turn, q-r, can be used to construct a credit loop
|
|
Packit |
13e616 |
encircling the failed switches.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Multicast Routing:
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Since torus-2QoS uses all four available SL bits, and the three data VL
|
|
Packit |
13e616 |
bits that are typically available in current switches, there is no way
|
|
Packit |
13e616 |
to use SL/VL values to separate multicast traffic from unicast traffic.
|
|
Packit |
13e616 |
Thus, torus-2QoS must generate multicast routing such that credit loops
|
|
Packit |
13e616 |
cannot arise from a combination of multicast and unicast path segments.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
It turns out that it is possible to construct spanning trees for multicast
|
|
Packit |
13e616 |
routing that have that property. For the 2D 6x5 torus example above, here
|
|
Packit |
13e616 |
is the full-fabric spanning tree that torus-2QoS will construct, where "x"
|
|
Packit |
13e616 |
is the root switch and each "+" is a non-root switch:
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
4 + + + + + +
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
3 + + + + + +
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
2 +----+----+----x----+----+
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
1 + + + + + +
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
y=0 + + + + + +
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
x=0 1 2 3 4 5
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
For multicast traffic routed from root to tip, every turn in the above
|
|
Packit |
13e616 |
spanning tree is a legal DOR turn.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
For traffic routed from tip to root, and some traffic routed through the
|
|
Packit |
13e616 |
root, turns are not legal DOR turns. However, to construct a credit loop,
|
|
Packit |
13e616 |
the union of multicast routing on this spanning tree with DOR unicast
|
|
Packit |
13e616 |
routing can only provide 3 of the 4 turns needed for the loop.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
In addition, if none of the above spanning tree branches crosses a dateline
|
|
Packit |
13e616 |
used for unicast credit loop avoidance on a torus, and if multicast traffic
|
|
Packit |
13e616 |
is confined to SL 0 or SL 8 (recall that torus-2QoS uses SL bit 3 to
|
|
Packit |
13e616 |
differentiate QoS level), then multicast traffic also cannot contribute to
|
|
Packit |
13e616 |
the "ring" credit loops that are otherwise possible in a torus.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Torus-2QoS uses these ideas to create a master spanning tree. Every
|
|
Packit |
13e616 |
multicast group spanning tree will be constructed as a subset of the master
|
|
Packit |
13e616 |
tree, with the same root as the master tree.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Such multicast group spanning trees will in general not be optimal for
|
|
Packit |
13e616 |
groups which are a subset of the full fabric. However, this compromise must
|
|
Packit |
13e616 |
be made to enable support for two QoS levels on a torus while preventing
|
|
Packit |
13e616 |
credit loops.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
In the presence of link or switch failures that result in a fabric for
|
|
Packit |
13e616 |
which torus-2QoS can generate credit-loop-free unicast routes, it is also
|
|
Packit |
13e616 |
possible to generate a master spanning tree for multicast that retains the
|
|
Packit |
13e616 |
required properties. For example, consider that same 2D 6x5 torus, with
|
|
Packit |
13e616 |
the link from (2,2) to (3,2) failed. Torus-2QoS will generate the following
|
|
Packit |
13e616 |
master spanning tree:
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
4 + + + + + +
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
3 + + + + + +
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
2 --+----+----+ x----+----+--
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
1 + + + + + +
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
y=0 + + + + + +
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
x=0 1 2 3 4 5
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Two things are notable about this master spanning tree. First, assuming
|
|
Packit |
13e616 |
the x dateline was between x=5 and x=0, this spanning tree has a branch
|
|
Packit |
13e616 |
that crosses the dateline. However, just as for unicast, crossing a
|
|
Packit |
13e616 |
dateline on a 1D ring (here, the ring for y=2) that is broken by a failure
|
|
Packit |
13e616 |
cannot contribute to a torus credit loop.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Second, this spanning tree is no longer optimal even for multicast groups
|
|
Packit |
13e616 |
that encompass the entire fabric. That, unfortunately, is a compromise that
|
|
Packit |
13e616 |
must be made to retain the other desirable properties of torus-2QoS routing.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
In the event that a single switch fails, torus-2QoS will generate a master
|
|
Packit |
13e616 |
spanning tree that has no "extra" turns by appropriately selecting a root
|
|
Packit |
13e616 |
switch. In the 2D 6x5 torus example, assume now that the switch at (3,2),
|
|
Packit |
13e616 |
i.e. the root for a pristine fabric, fails. Torus-2QoS will generate the
|
|
Packit |
13e616 |
following master spanning tree for that case:
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
|
|
|
Packit |
13e616 |
4 + + + + + +
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
3 + + + + + +
|
|
Packit |
13e616 |
| | | | |
|
|
Packit |
13e616 |
2 + + + + +
|
|
Packit |
13e616 |
| | | | |
|
|
Packit |
13e616 |
1 +----+----x----+----+----+
|
|
Packit |
13e616 |
| | | | | |
|
|
Packit |
13e616 |
y=0 + + + + + +
|
|
Packit |
13e616 |
|
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
x=0 1 2 3 4 5
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Assuming the y dateline was between y=4 and y=0, this spanning tree has
|
|
Packit |
13e616 |
a branch that crosses a dateline. However, again this cannot contribute
|
|
Packit |
13e616 |
to credit loops as it occurs on a 1D ring (the ring for x=3) that is
|
|
Packit |
13e616 |
broken by a failure, as in the above example.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Torus Topology Discovery:
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
The algorithm used by torus-2QoS to construct the torus topology from the
|
|
Packit |
13e616 |
undirected graph representing the fabric requires that the radix of each
|
|
Packit |
13e616 |
dimension be configured via torus-2QoS.conf. It also requires that the
|
|
Packit |
13e616 |
torus topology be "seeded"; for a 3D torus this requires configuring four
|
|
Packit |
13e616 |
switches that define the three coordinate directions of the torus.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Given this starting information, the algorithm is to examine the cube
|
|
Packit |
13e616 |
formed by the eight switch locations bounded by the corners (x,y,z) and
|
|
Packit |
13e616 |
(x+1,y+1,z+1). Based on switches already placed into the torus topology at
|
|
Packit |
13e616 |
some of these locations, the algorithm examines 4-loops of interswitch
|
|
Packit |
13e616 |
links to find the one that is consistent with a face of the cube of switch
|
|
Packit |
13e616 |
locations, and adds its switches to the discovered topology in the correct
|
|
Packit |
13e616 |
locations.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Because the algorithm is based on examining the topology of 4-loops of links,
|
|
Packit |
13e616 |
a torus with one or more radix-4 dimensions requires extra initial seed
|
|
Packit |
13e616 |
configuration. See torus-2QoS.conf(5) for details. Torus-2QoS will detect
|
|
Packit |
13e616 |
and report when it has insufficient configuration for a torus with radix-4
|
|
Packit |
13e616 |
dimensions.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
In the event the torus is significantly degraded, i.e., there are many
|
|
Packit |
13e616 |
missing switches or links, it may happen that torus-2QoS is unable to place
|
|
Packit |
13e616 |
into the torus some switches and/or links that were discovered in the
|
|
Packit |
13e616 |
fabric, and will generate a warning in that case. A similar condition
|
|
Packit |
13e616 |
occurs if torus-2QoS is misconfigured, i.e., the radix of a torus dimension
|
|
Packit |
13e616 |
as configured does not match the radix of that torus dimension as wired,
|
|
Packit |
13e616 |
and many switches/links in the fabric will not be placed into the torus.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Quality Of Service Configuration:
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
OpenSM will not program switches and channel adapters with SL2VL maps or VL
|
|
Packit |
13e616 |
arbitration configuration unless it is invoked with -Q. Since torus-2QoS
|
|
Packit |
13e616 |
depends on such functionality for correct operation, always invoke OpenSM
|
|
Packit |
13e616 |
with -Q when torus-2QoS is in the list of routing engines.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Any quality of service configuration method supported by OpenSM will work
|
|
Packit |
13e616 |
with torus-2QoS, subject to the following limitations and considerations.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
For all routing engines supported by OpenSM except torus-2QoS, there is a
|
|
Packit |
13e616 |
one-to-one correspondence between QoS level and SL. Torus-2QoS can only
|
|
Packit |
13e616 |
support two quality of service levels, so only the high-order bit of any SL
|
|
Packit |
13e616 |
value used for unicast QoS configuration will be honored by torus-2QoS.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
For multicast QoS configuration, only SL values 0 and 8 should be used with
|
|
Packit |
13e616 |
torus-2QoS.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Since SL to VL map configuration must be under the complete control of
|
|
Packit |
13e616 |
torus-2QoS, any configuration via qos_sl2vl, qos_swe_sl2vl, etc., must and
|
|
Packit |
13e616 |
will be ignored, and a warning will be generated.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
For inter-switch links, Torus-2QoS uses VL values 0-3 to implement one of
|
|
Packit |
13e616 |
its supported QoS levels, and VL values 4-7 to implement the other. For
|
|
Packit |
13e616 |
endport links (CA, router, switch management port), Torus-2QoS uses VL
|
|
Packit |
13e616 |
value 0 for one of its supported QoS levels and VL value 1 to implement
|
|
Packit |
13e616 |
the other. Hard-to-diagnose application issues may arise if traffic is
|
|
Packit |
13e616 |
not delivered fairly across each of these two VL ranges. For
|
|
Packit |
13e616 |
inter-switch links, Torus-2QoS will detect and warn if VL arbitration is
|
|
Packit |
13e616 |
configured unfairly across VLs in the range 0-3, and also in the range
|
|
Packit |
13e616 |
4-7. Note that the default OpenSM VL arbitration configuration does
|
|
Packit |
13e616 |
not meet this constraint, so all torus-2QoS users should configure VL
|
|
Packit |
13e616 |
arbitration via qos_ca_vlarb_high, qos_swe_vlarb_high, qos_ca_vlarb_low,
|
|
Packit |
13e616 |
qos_swe_vlarb_low, etc.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Note that torus-2QoS maps SL values to VL values differently
|
|
Packit |
13e616 |
for inter-switch and endport links. This is why qos_vlarb_high and
|
|
Packit |
13e616 |
qos_vlarb_low should not be used, as using them may result in
|
|
Packit |
13e616 |
VL arbitration for a QoS level being different across inter-switch
|
|
Packit |
13e616 |
links vs. across endport links.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Operational Considerations:
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Any routing algorithm for a torus IB fabric must employ path SL values to
|
|
Packit |
13e616 |
avoid credit loops. As a result, all applications run over such fabrics
|
|
Packit |
13e616 |
must perform a path record query to obtain the correct path SL for
|
|
Packit |
13e616 |
connection setup. Applications that use rdma_cm for connection setup will
|
|
Packit |
13e616 |
automatically meet this requirement.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
If a change in fabric topology causes changes in path SL values required to
|
|
Packit |
13e616 |
route without credit loops, in general all applications would need to
|
|
Packit |
13e616 |
repath to avoid message deadlock. Since torus-2QoS has the ability to
|
|
Packit |
13e616 |
reroute after a single switch failure without changing path SL values,
|
|
Packit |
13e616 |
repathing by running applications is not required when the fabric is routed
|
|
Packit |
13e616 |
with torus-2QoS.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Torus-2QoS can provide unchanging path SL values in the presence of subnet
|
|
Packit |
13e616 |
manager failover provided that all OpenSM instances have the same idea of
|
|
Packit |
13e616 |
dateline location. See torus-2QoS.conf(5) for details.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Torus-2QoS will detect configurations of failed switches and links that
|
|
Packit |
13e616 |
prevent routing that is free of credit loops, and will log warnings and
|
|
Packit |
13e616 |
refuse to route. If "no_fallback" was configured in the list of OpenSM
|
|
Packit |
13e616 |
routing engines, then no other routing engine will attempt to route the
|
|
Packit |
13e616 |
fabric. In that case all paths that do not transit the failed components
|
|
Packit |
13e616 |
will continue to work, and the subset of paths that are still operational
|
|
Packit |
13e616 |
will continue to remain free of credit loops. OpenSM will continue to
|
|
Packit |
13e616 |
attempt to route the fabric after every sweep interval, and after any
|
|
Packit |
13e616 |
change (such as a link up) in the fabric topology. When the fabric
|
|
Packit |
13e616 |
components are repaired, full functionality will be restored.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
In the event OpenSM was configured to allow some other engine to route the
|
|
Packit |
13e616 |
fabric if torus-2QoS fails, then credit loops and message deadlock are
|
|
Packit |
13e616 |
likely if torus-2QoS had previously routed the fabric successfully. Even if
|
|
Packit |
13e616 |
the other engine is capable of routing a torus without credit loops,
|
|
Packit |
13e616 |
applications that built connections with path SL values granted under
|
|
Packit |
13e616 |
torus-2QoS will likely experience message deadlock under routing generated
|
|
Packit |
13e616 |
by a different engine, unless they repath.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
To verify that a torus fabric is routed free of credit loops, use ibdmchk
|
|
Packit |
13e616 |
to analyze data collected via ibdiagnet -vlr.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
DFSSSP and SSSP Routing Algorithm
|
|
Packit |
13e616 |
---------------------------------
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
The (Deadlock-Free) Single-Source-Shortest-Path routing algorithm is
|
|
Packit |
13e616 |
designed to optimize link utilization thru global balancing of routes,
|
|
Packit |
13e616 |
while supporting arbitrary topologies. The DFSSSP routing algorithm
|
|
Packit |
13e616 |
uses InfiniBand virtual lanes (SL) to provide deadlock-freedom.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
The DFSSSP algorithm consists of five major steps:
|
|
Packit |
13e616 |
1) It discovers the subnet and models the subnet as a directed
|
|
Packit |
13e616 |
multigraph in which each node represents a node of the physical
|
|
Packit |
13e616 |
network and each edge represents one direction of the full-duplex
|
|
Packit |
13e616 |
links used to connect the nodes.
|
|
Packit |
13e616 |
2) A loop, which iterates over all CA and switches of the subnet, will
|
|
Packit |
13e616 |
perform three steps to generate the linear forwarding tables for
|
|
Packit |
13e616 |
each switch:
|
|
Packit |
13e616 |
2.1) use Dijkstra's algorithm to find the shortest path from all nodes
|
|
Packit |
13e616 |
to the current selected destination;
|
|
Packit |
13e616 |
2.2) update the edge weights in the graph, i.e. add the number of
|
|
Packit |
13e616 |
routes, which use a link to reach the destination, to the link/edge;
|
|
Packit |
13e616 |
2.3) update the LFT of each switch with the outgoing port which was used
|
|
Packit |
13e616 |
in the current step to route the traffic to the destination node.
|
|
Packit |
13e616 |
3) After the number of available virtual lanes or layers in the subnet
|
|
Packit |
13e616 |
is detected and a channel dependency graph is initialized for each layer,
|
|
Packit |
13e616 |
the algorithm will put each possible route of the subnet into the first
|
|
Packit |
13e616 |
layer.
|
|
Packit |
13e616 |
4) A loop iterates over all channel dependency graphs (CDG) and performs
|
|
Packit |
13e616 |
the following substeps:
|
|
Packit |
13e616 |
4.1) search for a cycle in the current CDG;
|
|
Packit |
13e616 |
4.2) when a cycle is found, i.e. a possible deadlock is present,
|
|
Packit |
13e616 |
one edge is selected and all routes, which induced this edge, are moved
|
|
Packit |
13e616 |
to the "next higher" virtual layer (CDG[i+1]);
|
|
Packit |
13e616 |
4.3) the cycle search is continued until all cycles are broken and
|
|
Packit |
13e616 |
routes are moved "up".
|
|
Packit |
13e616 |
5) When the number of needed layers does not exceeds the number of
|
|
Packit |
13e616 |
available SL/VL to remove all cycles in all CDGs, the routing is
|
|
Packit |
13e616 |
deadlock-free and an relation table is generated, which contains
|
|
Packit |
13e616 |
the assignment of routes from source to destination to a SL
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Note on SSSP:
|
|
Packit |
13e616 |
This algorithm does not perform the steps 3)-5) and can not be
|
|
Packit |
13e616 |
considered to be deadlock-free for all topologies. But on the one
|
|
Packit |
13e616 |
hand, you can choose this algorithm for really large networks
|
|
Packit |
13e616 |
(5,000+ CAs and deadlock-free by design) to reduce
|
|
Packit |
13e616 |
the runtime of the algorithm. On the other hand, you might use
|
|
Packit |
13e616 |
the SSSP routing algorithm as an alternative, when all deadlock-free
|
|
Packit |
13e616 |
routing algorithms fail to route the network for whatever reason.
|
|
Packit |
13e616 |
In the last case, SSSP was designed to deliver an equal or higher
|
|
Packit |
13e616 |
bandwidth due to better congestion avoidance than the Min Hop
|
|
Packit |
13e616 |
routing algorithm.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Notes for usage:
|
|
Packit |
13e616 |
a) running DFSSSP: '-R dfsssp -Q'
|
|
Packit |
13e616 |
a.1) QoS has to be configured to equally spread the load on the
|
|
Packit |
13e616 |
available SL or virtual lanes
|
|
Packit |
13e616 |
a.2) applications must perform a path record query to get path SL for
|
|
Packit |
13e616 |
each route, which the application will use to transmit packages
|
|
Packit |
13e616 |
b) running SSSP: '-R sssp'
|
|
Packit |
13e616 |
c) both algorithms support LMC > 0
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Hints for separate optimization of compute and I/O traffic:
|
|
Packit |
13e616 |
Having more nodes (I/O and compute) connected to a switch than incoming links
|
|
Packit |
13e616 |
can result in a 'bad' routing of the I/O traffic as long as (DF)SSSP routing
|
|
Packit |
13e616 |
is not aware of the dedicated I/O nodes, i.e., in the following network
|
|
Packit |
13e616 |
configuration CN1-CN3 might send all I/O traffic via Link2 to IO1,IO2:
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
CN1 Link1 IO1
|
|
Packit |
13e616 |
\ /----\ /
|
|
Packit |
13e616 |
CN2 -- Switch1 Switch2 -- CN4
|
|
Packit |
13e616 |
/ \----/ \
|
|
Packit |
13e616 |
CN3 Link2 IO2
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
To prevent this from happening (DF)SSSP can use both the compute node guid
|
|
Packit |
13e616 |
file and the I/O guid file specified by the '-u' or '--cn_guid_file' and
|
|
Packit |
13e616 |
'-G' or '--io_guid_file' options (similar to the Fat-Tree routing).
|
|
Packit |
13e616 |
This ensures that traffic towards compute nodes and I/O nodes is balanced
|
|
Packit |
13e616 |
separately and therefore distributed as much as possible across the available
|
|
Packit |
13e616 |
links. Port GUIDs, as listed by ibstat, must be specified (not Node GUIDs).
|
|
Packit |
13e616 |
The priority for the optimization is as follows:
|
|
Packit |
13e616 |
compute nodes -> I/O nodes -> other nodes
|
|
Packit |
13e616 |
Possible use case scenarios:
|
|
Packit |
13e616 |
a) neither '-u' nor '-G' are specified: all nodes a treated as 'other nodes'
|
|
Packit |
13e616 |
and therefore balanced equally;
|
|
Packit |
13e616 |
b) '-G' is specified: traffic towards I/O nodes will be balanced optimally;
|
|
Packit |
13e616 |
c) the system has three node types, such as login/admin, compute and I/O,
|
|
Packit |
13e616 |
but the balancing focus should be I/O, then one has to use '-u' and '-G'
|
|
Packit |
13e616 |
with I/O guids listed in cn_guid_file and compute node guids listed in
|
|
Packit |
13e616 |
io_guid_file;
|
|
Packit |
13e616 |
d) ...
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
For more information about the algorithms, i.e. balancing the routes and
|
|
Packit |
13e616 |
moving the routes to different virtual layers, and about comparison with
|
|
Packit |
13e616 |
other routing algorithms, please refer to the following articles:
|
|
Packit |
13e616 |
1. J. Domke, T. Hoefler and W. Nagel: Deadlock-Free Oblivious Routing
|
|
Packit |
13e616 |
for Arbitrary Topologies, In Proceedings of the 25th IEEE International
|
|
Packit |
13e616 |
Parallel & Distributed Processing Symposium (IPDPS 2011)
|
|
Packit |
13e616 |
2. T. Hoefler, T. Schneider and A. Lumsdaine: Optimized Routing for
|
|
Packit |
13e616 |
Large-Scale InfiniBand Networks, In 17th Annual IEEE Symposium on High
|
|
Packit |
13e616 |
Performance Interconnects (HOTI 2009)
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Nue Routing Algorithm
|
|
Packit |
13e616 |
---------------------
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
The implementation of Nue routing for OpenSM is a 100%-applicable, balanced, and
|
|
Packit |
13e616 |
deadlock-free unicast routing engine (which also configures multicast tables,
|
|
Packit |
13e616 |
see 'Note on multicast' below). The key points of this algorithm are the
|
|
Packit |
13e616 |
following:
|
|
Packit |
13e616 |
- 100% fault-tolerant, oblivious routing strategy
|
|
Packit |
13e616 |
- topology-agnostic, i.e., applicable to every topology (no matter if topology
|
|
Packit |
13e616 |
is regular, irregular after faults, or random)
|
|
Packit |
13e616 |
- 100% deadlock-free routing within the resource limits (i.e., it never exceeds
|
|
Packit |
13e616 |
the given number of available virtual lanes, and it does not necessarily
|
|
Packit |
13e616 |
require virtual lanes) for every topology
|
|
Packit |
13e616 |
- very good path balancing and therefore high throughput
|
|
Packit |
13e616 |
- QoS (via SLs/VLs) + deadlock-freedom can be combined (since both rely on VLs)
|
|
Packit |
13e616 |
- forwarding tables are fast to calculate: O(n^2 * log n), however slightly
|
|
Packit |
13e616 |
slower compared to topology-aware routings (for obvious reasons), and
|
|
Packit |
13e616 |
- the path-to-VL mapping only depends on the destination, which may be useful
|
|
Packit |
13e616 |
for scalable, efficient path resolution and caching mechanisms.
|
|
Packit |
13e616 |
From a very high level perspective, Nue routing is similar to DFSSSP (see above)
|
|
Packit |
13e616 |
in the sense that both use Dijkstra and edge weight updates for path balancing.
|
|
Packit |
13e616 |
However, the fundamental difference is that Nue routing doesn't perform the path
|
|
Packit |
13e616 |
calculation on the graph representing the real fabric, and instead routes
|
|
Packit |
13e616 |
directly within the channel dependency graph. This approach allows Nue routing
|
|
Packit |
13e616 |
to place routing restrictions (to avoid any credit loops) in an on-demand
|
|
Packit |
13e616 |
manner, which overcomes the problem of all other good VL-based algorithms.
|
|
Packit |
13e616 |
Meaning, the competitors cannot control or limit the use of VLs, and might run
|
|
Packit |
13e616 |
out of them and have to give up. On the flip side, Nue may have to use detours
|
|
Packit |
13e616 |
for a few routes, and hence cannot really be considered "shortest-path" routing,
|
|
Packit |
13e616 |
because of the impossibility lemma 6.1 (see ref. [2] Chapter 6) to accomplish
|
|
Packit |
13e616 |
deadlock-free, shortest-path routing with an limited number of available virtual
|
|
Packit |
13e616 |
lanes for arbitrary network topologies.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Conceptually, Nue routing works as follows:
|
|
Packit |
13e616 |
* Assume N is the set of destinations and k the number of available VLs
|
|
Packit |
13e616 |
* Assume that virtual lanes (VLs) are combined into virtual layers
|
|
Packit |
13e616 |
* Partition N into k disjoint subsets N_1 ,..., N_k of destinations
|
|
Packit |
13e616 |
* foreach virtual layer L_i with i in {1 ,..., k} do:
|
|
Packit |
13e616 |
* Create a convex subgraph H_i for N_i
|
|
Packit |
13e616 |
* Identify central node n_r in N_i of convex H_i via Brandes' algorithm
|
|
Packit |
13e616 |
* Create a new complete channel dependency graph D_i for layer L_i
|
|
Packit |
13e616 |
* Define escape paths D* in D_i for spanning tree rooted at n_r
|
|
Packit |
13e616 |
* foreach destination node n in N_i do:
|
|
Packit |
13e616 |
* Identify all deadlock-free paths towards n
|
|
Packit |
13e616 |
* Store these paths in unicast forwarding tables
|
|
Packit |
13e616 |
* Update channel weights in D_i for these paths
|
|
Packit |
13e616 |
and the service level for applications returned in path request is determined
|
|
Packit |
13e616 |
by Nue (for k>1) as follows:
|
|
Packit |
13e616 |
* Assuming an input of a given source/destination node (LID) pair
|
|
Packit |
13e616 |
* If Nue mapped the destination to layer L_x, then SL == x-1 is returned, while
|
|
Packit |
13e616 |
presuming that SL2VL tables are being mapped 1:1 for the number of layers.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
While other VL-based routings usually gradually construct the channel dependency
|
|
Packit |
13e616 |
graph after all paths have been calculated, Nue creates a "complete" version of
|
|
Packit |
13e616 |
it which holds all possible dependencies allowing it to directly search for
|
|
Packit |
13e616 |
cycles after "using" a dependency on the path to a destination. Since it would
|
|
Packit |
13e616 |
be infeasible to search for cycles each and every time a channel dependency is
|
|
Packit |
13e616 |
added, Nue employs the notion of individually colored subgraphs (one for each
|
|
Packit |
13e616 |
destination) and only performs real cycle searches in the complete CDG when
|
|
Packit |
13e616 |
adding new edges between nodes of the same subgraph, see Section 6.2.6.1 of
|
|
Packit |
13e616 |
reference [2] for details on this optimization.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Use of METIS library:
|
|
Packit |
13e616 |
In step 2 of the previously shown pseudo code, Nue routing separates the LIDs
|
|
Packit |
13e616 |
into multiple subsets, one for every virtual layer. Nue has two options to
|
|
Packit |
13e616 |
perform this partitioning (not to be confused with IB partitions): the first is
|
|
Packit |
13e616 |
a fairly simple semi-random assignment of LIDs to layers/subsets, and the second
|
|
Packit |
13e616 |
partitioning uses the METIS library to partition the network graph into k
|
|
Packit |
13e616 |
approximately equal sized parts. The latter approach has shown better results
|
|
Packit |
13e616 |
in terms of path balancing and avoidance of using the escape paths, and hence
|
|
Packit |
13e616 |
it is HIGHLY advised to install/use the METIS library with OpenSM (enforced
|
|
Packit |
13e616 |
via `--enable-metis' configure flag when building OpenSM). For the rare case,
|
|
Packit |
13e616 |
that METIS isn't packaged with the Linux distro, here is a link to the official
|
|
Packit |
13e616 |
website to download and install METIS 5.1.0 manually:
|
|
Packit |
13e616 |
http://glaros.dtc.umn.edu/gkhome/metis/metis/overview
|
|
Packit |
13e616 |
OpenSM's configure script also provides options in case METIS header and library
|
|
Packit |
13e616 |
aren't found in the default path.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Runtime options for Nue:
|
|
Packit |
13e616 |
The behavior of Nue routing can be directly influenced by two osm.conf
|
|
Packit |
13e616 |
parameters (one is also available as command line option):
|
|
Packit |
13e616 |
- nue_max_num_vls: which controls/limits the number of virtual lanes which Nue
|
|
Packit |
13e616 |
is allowed to use (detailed explanation in osm.conf file); this option is
|
|
Packit |
13e616 |
also available via command line
|
|
Packit |
13e616 |
- nue_include_switches: the option (if TRUE) enforces Nue to treat switches
|
|
Packit |
13e616 |
as "normal" endpoints sending/receiving data traffic, which is usually
|
|
Packit |
13e616 |
not the case (also with other routings); hence, paths to switches will be
|
|
Packit |
13e616 |
included when calculating deadlock-free ucast tables (suggestion for IB
|
|
Packit |
13e616 |
subnets: FALSE)
|
|
Packit |
13e616 |
Furthermore, Nue supports TRUE and FALSE settings of avoid_throttled_links,
|
|
Packit |
13e616 |
use_ucast_cache, and qos (more on this hereafter); and lmc > 0.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Notes on Quality of Service (QoS):
|
|
Packit |
13e616 |
The advantage of Nue is that it works with AND without QoS being enabled, i.e.,
|
|
Packit |
13e616 |
the usage of SLs/VLs for deadlock-freedom can be avoided. Here are the three
|
|
Packit |
13e616 |
possible usage scenarios:
|
|
Packit |
13e616 |
a) nue_max_num_vls = 1 and qos = disabled => Nue assumes that only 1 virtual
|
|
Packit |
13e616 |
layer (identical to the physical network; or OperVLs equal to VL0) is
|
|
Packit |
13e616 |
usable and all paths are to be calculated within this one layer. Hence,
|
|
Packit |
13e616 |
there is no need for special SL2VL mappings in the network and the use of
|
|
Packit |
13e616 |
specific SLs by applications. So, enabling QoS is not required for
|
|
Packit |
13e616 |
credit-free unicast routing tables.
|
|
Packit |
13e616 |
b) nue_max_num_vls = 1 and qos = enabled => This combination works essentially
|
|
Packit |
13e616 |
like (a), meaning the SL returned for path record requests is not defined
|
|
Packit |
13e616 |
by Nue, since all paths are deadlock-free without using VLs. However, any
|
|
Packit |
13e616 |
separate QoS settings may influence the SL returned to applications.
|
|
Packit |
13e616 |
c) nue_max_num_vls > 1 and qos = enabled => In this configuration, applications
|
|
Packit |
13e616 |
have to query and obey the SL for path records as returned by Nue because
|
|
Packit |
13e616 |
otherwise the deadlock-freedom cannot be guaranteed anymore. Furthermore,
|
|
Packit |
13e616 |
errors in the fabric may require applications to repath to avoid message
|
|
Packit |
13e616 |
deadlocks (this is only a current limitation of the implementation since
|
|
Packit |
13e616 |
Nue uses METIS to assign destination LIDs to VLs, and a network fault
|
|
Packit |
13e616 |
may change the outcome of METIS' partitioning; so, if anyone needs to
|
|
Packit |
13e616 |
avoid repaths, then please contact the developer). Since Nue operates on
|
|
Packit |
13e616 |
virtual layer, admins should configure the SL2VL mapping tables in an
|
|
Packit |
13e616 |
homogeneous 1-to-1 manner across the entire subnet to separate the layers
|
|
Packit |
13e616 |
and avoid paths from changing into the wrong layer (example mapping is
|
|
Packit |
13e616 |
VL_i = SL_i % qos_max_vls for all i 0..15). Depending on the actual
|
|
Packit |
13e616 |
setting of nue_max_num_vls, one can further differentiate:
|
|
Packit |
13e616 |
c.1) All operational VLs are used for deadlock-free routes (either by setting
|
|
Packit |
13e616 |
nue_max_num_vls to qos_max_vls or to 0 for "auto detection"), and hence
|
|
Packit |
13e616 |
real QoS to prioritize traffic in the subnet isn't supported. The VL
|
|
Packit |
13e616 |
arbitration settings for this usage scenarios should be configured
|
|
Packit |
13e616 |
equally for all VLs and for all switches to avoid bandwidth limitations
|
|
Packit |
13e616 |
for subset of destinations, e.g.
|
|
Packit |
13e616 |
qos_max_vls 8
|
|
Packit |
13e616 |
qos_high_limit 4
|
|
Packit |
13e616 |
qos_vlarb_high 0:64,1:64,2:64,3:64,4:64,5:64,6:64,7:64
|
|
Packit |
13e616 |
qos_vlarb_low 0:4,1:4,2:4,3:4,4:4,5:4,6:4,7:4
|
|
Packit |
13e616 |
qos_sl2vl 0,1,2,3,4,5,6,7,0,1,2,3,4,5,6,7
|
|
Packit |
13e616 |
when assuming 8 operational VLs in the fabric.
|
|
Packit |
13e616 |
c.2) Nue is limited to a subset of operational VLs allowing the mix of
|
|
Packit |
13e616 |
deadlock-freedom based on some SLs/VLs and real QoS with the remaining
|
|
Packit |
13e616 |
set of SLs/VLs; e.g., using VL0-3 to avoid credit-loops (= 1. QoS level)
|
|
Packit |
13e616 |
and VL4-7 used2for the second QoS level. In this example, application
|
|
Packit |
13e616 |
should either comply with the returned SL (in the path query) or select
|
|
Packit |
13e616 |
SL + 4 to use the second QoS level. (Reason: Nue itself is unaware of
|
|
Packit |
13e616 |
these QoS levels and only returns SLs in the range of 0 to #layers-1)
|
|
Packit |
13e616 |
d) nue_max_num_vls > 1 and qos = disabled => SHOULD NOT BE USED TOGETHER, since
|
|
Packit |
13e616 |
the SL2VL mapping for switches must be configured correctly.
|
|
Packit |
13e616 |
As an additional note, using more VLs for Nue usually improves the overall
|
|
Packit |
13e616 |
network throughput (as shown in [1] and [2]), so there are trade offs admins
|
|
Packit |
13e616 |
may have to consider when configuring the subnet manager with Nue routing.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
Note on multicast:
|
|
Packit |
13e616 |
The Nue routing engine configures multicast forwarding tables by utilizing a
|
|
Packit |
13e616 |
spanning tree calculation routed at a subnet switch suggested by OpenSM's
|
|
Packit |
13e616 |
internal osm_mcast_mgr_find_root_switch(...) fn. This spanning tree for a mcast
|
|
Packit |
13e616 |
group will try to use the least overloaded links (w.r.t the ucast paths-per-link
|
|
Packit |
13e616 |
metric/weight) in the fabric. However, Nue routing currently does not guarantee
|
|
Packit |
13e616 |
deadlock-freedom for the set of multicast routes on all topologies, nor for the
|
|
Packit |
13e616 |
combination of deadlock-free unicast routes with additional multicast routes.
|
|
Packit |
13e616 |
Assuming, for a given topology the calculated mcast routes are dl-free, then
|
|
Packit |
13e616 |
an admin may fix the latter problem by separating the VLs, e.g., using VL0-6 for
|
|
Packit |
13e616 |
ucast by specifying nue_max_num_vls=7 and utilizing VL7 for mcast.
|
|
Packit |
13e616 |
|
|
Packit |
13e616 |
For more information about Nue routing and comparisons with other OpenSM routing
|
|
Packit |
13e616 |
algorithms, please refer to the following publications:
|
|
Packit |
13e616 |
1. J. Domke, T. Hoefler and S. Matsuoka: "Routing on the Dependency Graph: A New
|
|
Packit |
13e616 |
Approach to Deadlock-Free High-Performance Routing", in HPDC'16
|
|
Packit |
13e616 |
(online: http://doi.acm.org/10.1145/2907294.2907313)
|
|
Packit |
13e616 |
2. J. Domke "Routing on the Channel Dependency Graph: A New Approach to
|
|
Packit |
13e616 |
Deadlock-Free, Destination-Based, High-Performance Routing for Lossless
|
|
Packit |
13e616 |
Interconnection Networks", 2017, Dissertation, TU Dresden
|
|
Packit |
13e616 |
(online: http://nbn-resolving.de/urn:nbn:de:bsz:14-qucosa-225902)
|