.TH TORUS\-2QOS 8 "November 10, 2010" "OpenIB" "OpenIB Management"
.
.SH NAME
torus\-2QoS \- Routing engine for OpenSM subnet manager
.
.SH DESCRIPTION
.
Torus-2QoS is a routing algorithm designed for large-scale 2D/3D torus fabrics.
The torus-2QoS routing engine can provide the following functionality on
a 2D/3D torus:
.br
\" roff illiteracy leads to following brain-dead list implementation
\"
.na \" otherwise line space adjustment can add spaces between dash and text
.in +2m
\[en]
'in +2m
Routing that is free of credit loops.
.in
\[en]
'in +2m
Two levels of Quality of Service (QoS), assuming switches support eight
data VLs and channel adapters support two data VLs.
.in
\[en]
'in +2m
The ability to route around a single failed switch, and/or multiple failed
links, without
.in
.in +2m
\[en]
'in +2
introducing credit loops, or
.in
\[en]
'in +2m
changing path SL values.
.in -4m
\[en]
'in +2m
Very short run times, with good scaling properties as fabric size increases.
.ad
.
.SH UNICAST ROUTING
.
Unicast routing in torus-2QoS is based on Dimension Order Routing (DOR).
It avoids the deadlocks that would otherwise occur in a DOR-routed
torus using the concept of a dateline for each torus dimension.
It encodes into a path SL which datelines the path crosses, as follows:
\f(CR
.P
.nf
sl = 0;
for (d = 0; d < torus_dimensions; d++) {
    /* path_crosses_dateline(d) returns 0 or 1 */
    sl |= path_crosses_dateline(d) << d;
}
.fi
\fR
.P
On a 3D torus this consumes three SL bits, leaving one SL bit unused.
Torus-2QoS uses this SL bit to implement two QoS levels.
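.P
As a minimal sketch of the encoding above, the following C function builds
a complete path SL from the per-dimension dateline crossings and a QoS level.
The name torus_path_sl() and its arguments are hypothetical and are not
taken from the OpenSM sources; the sketch also assumes that the QoS level
occupies the high-order SL bit (bit 3):
\f(CR
.P
.nf
```c
#include <assert.h>

/* Hypothetical helper: crosses[d] is nonzero if the path crosses the
 * dateline of dimension d; qos_level is 0 or 1.
 */
unsigned torus_path_sl(const int crosses[], int torus_dimensions,
                       int qos_level)
{
    unsigned sl = 0;
    int d;

    assert(torus_dimensions >= 1 && torus_dimensions <= 3);
    assert(qos_level == 0 || qos_level == 1);

    /* One SL bit per dimension records a dateline crossing. */
    for (d = 0; d < torus_dimensions; d++)
        sl |= (unsigned)(crosses[d] != 0) << d;

    /* Assumption: SL bit 3 carries the QoS level. */
    return sl | ((unsigned)qos_level << 3);
}
```
.fi
\fR
.P
Under these assumptions, a path that crosses the datelines of dimensions
0 and 2 at QoS level 1 is assigned SL 13 (binary 1101).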
.P
Torus-2QoS also makes use of the output port
dependence of switch SL2VL maps to encode into one VL bit the
information encoded in three SL bits.
It computes in which torus coordinate direction each inter-switch link
"points", and writes SL2VL maps for such ports as follows:
\f(CR
.P
.nf
for (sl = 0; sl < 16; sl++) {
    /* cdir(port) computes which torus coordinate direction
     * a switch port "points" in; returns 0, 1, or 2
     */
    sl2vl(iport,oport,sl) = 0x1 & (sl >> cdir(oport));
}
.fi
\fR
.P
Thus, on a pristine 3D torus,
\fIi.e.\fR,
in the absence of failed fabric switches,
torus-2QoS consumes eight SL values (SL bits 0-2) and
two VL values (VL bit 0) per QoS level to provide deadlock-free routing.
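.P
The SL2VL rule above can be written as a small standalone C function,
shown here only as an illustration.
The name torus_vl_bit0() is hypothetical, and the sketch covers VL bit 0
only, ignoring the QoS level:
\f(CR
.P
.nf
```c
#include <assert.h>

/* Hypothetical helper: VL bit 0 for an inter-switch output port that
 * "points" in coordinate direction cdir (0, 1, or 2).
 */
unsigned torus_vl_bit0(unsigned sl, int cdir)
{
    assert(sl < 16);
    assert(cdir >= 0 && cdir <= 2);

    /* Select the dateline bit that belongs to this direction. */
    return 0x1 & (sl >> cdir);
}
```
.fi
\fR
.P
For SL 5 (datelines crossed in dimensions 0 and 2), hops on ports pointing
in directions 0 and 2 use a VL with bit 0 set, while hops on ports pointing
in direction 1 use a VL with bit 0 clear.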
.P
Torus-2QoS routes around link failure by "taking the long way around" any
1D ring interrupted by link failure. For example, consider the 2D 6x5
torus below, where switches are denoted by [+a-zA-Z]:
.
.
\# define macros to start and end ascii art, assuming Roman font.
\# the start macro takes an argument which is the width in ems of
\# the ascii art, and is used to center it.
\#
.de ascii_art
.nop \f(CR
.nr indent_in_ems ((((\\n[.ll] - \\n[.i]) / \\w'm') - \\$1)/2)
.in +\\n[indent_in_ems]m
.nf
..
.de end_ascii_art
.fi
.in
.nop \fR
..
\# end of macro definitions
.
.
.ascii_art 36
       |    |    |    |    |    |
  4  --+----+----+----+----+----+--
       |    |    |    |    |    |
  3  --+----+----+----D----+----+--
       |    |    |    |    |    |
  2  --+----+----I----r----+----+--
       |    |    |    |    |    |
  1  --m----S----n----T----o----p--
       |    |    |    |    |    |
y=0  --+----+----+----+----+----+--
       |    |    |    |    |    |

     x=0    1    2    3    4    5
.end_ascii_art
.P
For a pristine fabric the path from S to D would be S-n-T-r-D.
In the event that either link S-n or n-T has failed, torus-2QoS would
use the path S-m-p-o-T-r-D.
Note that it can do this without changing the path SL
value; once the 1D ring m-S-n-T-o-p-m has been broken by failure, path
segments using it cannot contribute to deadlock, and the x-direction
dateline (between, say, x=5 and x=0) can be ignored for path segments on
that ring.
.P
One result of this is that torus-2QoS can route around many simultaneous
link failures, as long as no 1D ring is broken into disjoint segments.
For example, if links n-T and T-o have both failed, that ring has been broken
into two disjoint segments, T and o-p-m-S-n.
Torus-2QoS checks for such
issues, reports if they are found, and refuses to route such fabrics.
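.P
The behavior on a single 1D ring can be sketched in C as follows.
This is an illustration under simplifying assumptions, not the OpenSM
implementation: the function names are hypothetical, each pair of
neighboring switches is joined by a single link, and on a pristine ring
the shorter direction is preferred, with ties going to the +1 direction:
\f(CR
.P
.nf
```c
#include <assert.h>

/* Link i joins ring positions i and (i + 1) % radix; failed[i] != 0
 * marks that link as down.  Returns the number of failed links.
 */
int ring_failed_links(const int failed[], int radix)
{
    int i, n = 0;

    assert(radix >= 3);
    for (i = 0; i < radix; i++)
        n += (failed[i] != 0);
    return n;
}

/* Returns +1 or -1, the direction in which to leave src to reach dst,
 * or 0 if src == dst or the ring is broken into disjoint segments.
 */
int ring_direction(const int failed[], int radix, int src, int dst)
{
    int nfailed = ring_failed_links(failed, radix);
    int fwd, i, fwd_blocked = 0;

    assert(src >= 0 && src < radix && dst >= 0 && dst < radix);
    if (src == dst || nfailed > 1)
        return 0;

    fwd = (dst - src + radix) % radix;  /* hops in the +1 direction */
    for (i = 0; i < fwd; i++)
        if (failed[(src + i) % radix])
            fwd_blocked = 1;

    if (nfailed == 1)                   /* only one arc is usable */
        return fwd_blocked ? -1 : 1;
    return (fwd <= radix - fwd) ? 1 : -1;
}
```
.fi
\fR
.P
Numbering the ring m-S-n-T-o-p as positions 0 through 5, a failed S-n link
is link 1, and ring_direction() then returns -1 for the segment from S to T,
which corresponds to the long way around, S-m-p-o-T.
With links n-T and T-o both failed it returns 0, matching the case that
torus-2QoS refuses to route.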
.P
Note that in the case where there are multiple parallel links between a
pair of switches, torus-2QoS will allocate routes across such links
in a round-robin fashion, based on ports at the path destination switch that
are active and not used for inter-switch links.
Should a link that is one of several such parallel links fail, routes
are redistributed across the remaining links.
When the last of such a set of parallel links fails, traffic is rerouted
as described above.
.P
Handling a failed switch under DOR requires introducing into a path at
least one turn that would be otherwise "illegal",
\fIi.e.\fR,
not allowed by DOR rules.
Torus-2QoS will introduce such a turn as close as possible to the
failed switch in order to route around it.
.P
In the above example, suppose switch T has failed, and consider the path
from S to D.
Torus-2QoS will produce the path S-n-I-r-D, rather than the
S-n-T-r-D path for a pristine torus, by introducing an early turn at n.
Normal DOR rules will cause traffic arriving at switch I to be forwarded
to switch r; for traffic arriving at I from n due to the "early" turn at n,
this will generate an "illegal" turn at I.
.P
Torus-2QoS will also use the input port dependence of SL2VL maps to set VL
bit 1 (which would be otherwise unused) for y-x, z-x, and z-y turns,
\fIi.e.\fR,
those turns that are illegal under DOR.
This causes the first hop after any such turn to use a separate set of
VL values, and prevents deadlock in the presence of a single failed switch.
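.P
A minimal C sketch of this rule follows.
The function names are hypothetical, coordinate directions are numbered
0 = x, 1 = y, 2 = z as an assumption, and only the two low VL bits are
modeled:
\f(CR
.P
.nf
```c
#include <assert.h>

/* A y-x, z-x, or z-y turn goes from a higher-numbered direction back
 * to a lower-numbered one, which DOR does not allow.
 */
int dor_turn_is_illegal(int in_cdir, int out_cdir)
{
    assert(in_cdir >= 0 && in_cdir <= 2);
    assert(out_cdir >= 0 && out_cdir <= 2);
    return in_cdir > out_cdir;
}

/* Hypothetical helper: low two VL bits for a hop that enters a switch
 * on a port pointing in in_cdir and leaves on one pointing in out_cdir.
 */
unsigned torus_vl_low_bits(unsigned sl, int in_cdir, int out_cdir)
{
    unsigned vl;

    assert(sl < 16);
    vl = 0x1 & (sl >> out_cdir);        /* VL bit 0: dateline bit */
    if (dor_turn_is_illegal(in_cdir, out_cdir))
        vl |= 0x2;                      /* VL bit 1: illegal turn */
    return vl;
}
```
.fi
\fR
.P
In the example with failed switch T, traffic reaches I on a y-direction
link and leaves on an x-direction link, so dor_turn_is_illegal(1, 0)
is true and hop I-r uses a VL with bit 1 set.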
.P
For any given path, only the hops after a turn that is illegal under DOR
can contribute to a credit loop that leads to deadlock. So in the example
above with failed switch T, the location of the illegal turn at I in the
path from S to D requires that any credit loop caused by that turn must
encircle the failed switch at T. Thus the second and later hops after the
illegal turn at I (\fIi.e.\fR, hop r-D) cannot contribute to a credit loop
because they cannot be used to construct a loop encircling T. The hop I-r
uses a separate VL, so it cannot contribute to a credit loop encircling T.
.P
Extending this argument shows that in addition to being capable of routing
around a single switch failure without introducing deadlock, torus-2QoS can
also route around multiple failed switches on the condition they are
adjacent in the last dimension routed by DOR. For example, consider the
following case on a 6x6 2D torus:
.
.ascii_art 36
       |    |    |    |    |    |
  5  --+----+----+----+----+----+--
       |    |    |    |    |    |
  4  --+----+----+----D----+----+--
       |    |    |    |    |    |
  3  --+----+----I----u----+----+--
       |    |    |    |    |    |
  2  --+----+----q----R----+----+--
       |    |    |    |    |    |
  1  --m----S----n----T----o----p--
       |    |    |    |    |    |
y=0  --+----+----+----+----+----+--
       |    |    |    |    |    |

     x=0    1    2    3    4    5
.end_ascii_art
.P
Suppose switches T and R have failed, and consider the path from S to D.
Torus-2QoS will generate the path S-n-q-I-u-D, with an illegal turn at
switch I, and with hop I-u using a VL with bit 1 set.
.P
As a further example, consider a case that torus-2QoS cannot route without
deadlock: two failed switches adjacent in a dimension that is not the last
dimension routed by DOR; here the failed switches are O and T:
.
.ascii_art 36
       |    |    |    |    |    |
  5  --+----+----+----+----+----+--
       |    |    |    |    |    |
  4  --+----+----+----+----+----+--
       |    |    |    |    |    |
  3  --+----+----+----+----D----+--
       |    |    |    |    |    |
  2  --+----+----I----q----r----+--
       |    |    |    |    |    |
  1  --m----S----n----O----T----p--
       |    |    |    |    |    |
y=0  --+----+----+----+----+----+--
       |    |    |    |    |    |

     x=0    1    2    3    4    5
.end_ascii_art
.P
In a pristine fabric, torus-2QoS would generate the path from S to D as
S-n-O-T-r-D. With failed switches O and T, torus-2QoS will generate the
path S-n-I-q-r-D, with an illegal turn at switch I, and with hop I-q using a
VL with bit 1 set. In contrast to the earlier examples, the second hop
after the illegal turn, q-r, can be used to construct a credit loop
encircling the failed switches.
.
.SH MULTICAST ROUTING
.
Since torus-2QoS uses all four available SL bits, and the three data VL
bits that are typically available in current switches, there is no way
to use SL/VL values to separate multicast traffic from unicast traffic.
Thus, torus-2QoS must generate multicast routing such that credit loops
cannot arise from a combination of multicast and unicast path segments.
.P
It turns out that it is possible to construct spanning trees for multicast
routing that have that property. For the 2D 6x5 torus example above, here
is the full-fabric spanning tree that torus-2QoS will construct, where "x"
is the root switch and each "+" is a non-root switch:
.
.ascii_art 36
  4    +    +    +    +    +    +
       |    |    |    |    |    |
  3    +    +    +    +    +    +
       |    |    |    |    |    |
  2    +----+----+----x----+----+
       |    |    |    |    |    |
  1    +    +    +    +    +    +
       |    |    |    |    |    |
y=0    +    +    +    +    +    +

     x=0    1    2    3    4    5
.end_ascii_art
.P
For multicast traffic routed from root to tip, every turn in the above
spanning tree is a legal DOR turn.
.P
For traffic routed from tip to root, and some traffic routed through the
root, turns are not legal DOR turns. However, to construct a credit loop,
the union of multicast routing on this spanning tree with DOR unicast
routing can only provide 3 of the 4 turns needed for the loop.
.P
In addition, if none of the above spanning tree branches crosses a dateline
used for unicast credit loop avoidance on a torus, and if multicast traffic
is confined to SL 0 or SL 8 (recall that torus-2QoS uses SL bit 3 to
differentiate QoS level), then multicast traffic also cannot contribute to
the "ring" credit loops that are otherwise possible in a torus.
.P
Torus-2QoS uses these ideas to create a master spanning tree. Every
multicast group spanning tree will be constructed as a subset of the master
tree, with the same root as the master tree.
.P
Such multicast group spanning trees will in general not be optimal for
groups which are a subset of the full fabric. However, this compromise must
be made to enable support for two QoS levels on a torus while preventing
credit loops.
.P
In the presence of link or switch failures that result in a fabric for
which torus-2QoS can generate credit-loop-free unicast routes, it is also
possible to generate a master spanning tree for multicast that retains the
required properties. For example, consider that same 2D 6x5 torus, with
the link from (2,2) to (3,2) failed. Torus-2QoS will generate the following
master spanning tree:
.
.ascii_art 36
  4    +    +    +    +    +    +
       |    |    |    |    |    |
  3    +    +    +    +    +    +
       |    |    |    |    |    |
  2  --+----+----+    x----+----+--
       |    |    |    |    |    |
  1    +    +    +    +    +    +
       |    |    |    |    |    |
y=0    +    +    +    +    +    +

     x=0    1    2    3    4    5
.end_ascii_art
.P
Two things are notable about this master spanning tree. First, assuming
the x dateline was between x=5 and x=0, this spanning tree has a branch
that crosses the dateline. However, just as for unicast, crossing a
dateline on a 1D ring (here, the ring for y=2) that is broken by a failure
cannot contribute to a torus credit loop.
.P
Second, this spanning tree is no longer optimal even for multicast groups
that encompass the entire fabric. That, unfortunately, is a compromise that
must be made to retain the other desirable properties of torus-2QoS routing.
.P
In the event that a single switch fails, torus-2QoS will generate a master
spanning tree that has no "extra" turns by appropriately selecting a root
switch.
In the 2D 6x5 torus example, assume now that the switch at (3,2),
\fIi.e.\fR, the root for a pristine fabric, fails.
Torus-2QoS will generate the
following master spanning tree for that case:
.
.ascii_art 36
                      |
  4    +    +    +    +    +    +
       |    |    |    |    |    |
  3    +    +    +    +    +    +
       |    |    |         |    |
  2    +    +    +         +    +
       |    |    |         |    |
  1    +----+----x----+----+----+
       |    |    |    |    |    |
y=0    +    +    +    +    +    +
                      |

     x=0    1    2    3    4    5
.end_ascii_art
.P
Assuming the y dateline was between y=4 and y=0, this spanning tree has
a branch that crosses a dateline. However, again this cannot contribute
to credit loops as it occurs on a 1D ring (the ring for x=3) that is
broken by a failure, as in the above example.
.
.SH TORUS TOPOLOGY DISCOVERY
.
The algorithm used by torus-2QoS to construct the torus topology from
the undirected graph representing the fabric requires that the radix of
each dimension be configured via torus-2QoS.conf.
It also requires that the torus topology be "seeded"; for a 3D torus this
requires configuring four switches that define the three coordinate
directions of the torus.
.P
Given this starting information, the algorithm examines the
cube formed by the eight switch locations bounded by the corners
(x,y,z) and (x+1,y+1,z+1).
Based on switches already placed into the torus topology at some of these
locations, the algorithm examines 4-loops of inter-switch links to find the
one that is consistent with a face of the cube of switch locations,
and adds its switches to the discovered topology in the correct locations.
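.P
To make the cube of switch locations concrete, the following C sketch
enumerates its eight corners.
The names struct coord and torus_cube_corners() are hypothetical, and
wrapping each coordinate modulo the configured radix is an assumption
about how locations at the edge of the torus are handled:
\f(CR
.P
.nf
```c
#include <assert.h>

struct coord { int x, y, z; };

/* Fill corner[0..7] with the cube of locations between (x,y,z) and
 * (x+1,y+1,z+1), wrapping each coordinate modulo its radix.
 * Bit 0 of the corner index selects +1 in x, bit 1 in y, bit 2 in z.
 */
void torus_cube_corners(struct coord base, struct coord radix,
                        struct coord corner[8])
{
    int i;

    assert(radix.x > 0 && radix.y > 0 && radix.z > 0);
    for (i = 0; i < 8; i++) {
        corner[i].x = (base.x + (i & 1)) % radix.x;
        corner[i].y = (base.y + ((i >> 1) & 1)) % radix.y;
        corner[i].z = (base.z + ((i >> 2) & 1)) % radix.z;
    }
}
```
.fi
\fR
.P
Each face of the cube is then a set of four corners that share one
coordinate, which is what a candidate 4-loop of inter-switch links is
compared against.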
.P
Because the algorithm is based on examining the topology of 4-loops of links,
a torus with one or more radix-4 dimensions requires extra initial
seed configuration.
See torus-2QoS.conf(5) for details.
Torus-2QoS will detect and report when it has insufficient configuration
for a torus with radix-4 dimensions.
.P
In the event the torus is significantly degraded, \fIi.e.\fR, there are
many missing switches or links, it may happen that torus-2QoS is unable
to place into the torus some switches and/or links that were discovered
in the fabric, and will generate a warning in that case.
A similar condition occurs if torus-2QoS is misconfigured, \fIi.e.\fR,
the radix of a torus dimension as configured does not match the radix
of that torus dimension as wired, and many switches/links in the fabric
will not be placed into the torus.
.
.SH QUALITY OF SERVICE CONFIGURATION
.
OpenSM will not program switches and channel adapters with
SL2VL maps or VL arbitration configuration unless it is invoked with -Q.
Since torus-2QoS depends on such functionality for correct operation,
always invoke OpenSM with -Q when torus-2QoS is in the list of routing
engines.
.P
Any quality of service configuration method supported by OpenSM will
work with torus-2QoS, subject to the following limitations and
considerations.
.P
For all routing engines supported by OpenSM except torus-2QoS,
there is a one-to-one correspondence between QoS level and SL.
Torus-2QoS can only support two quality of service levels, so only
the high-order bit of any SL value used for unicast QoS configuration
will be honored by torus-2QoS.
.P
For multicast QoS configuration, only SL values 0 and 8 should be used
with torus-2QoS.
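.P
The two SL rules above can be checked with a few lines of C.
The function names are hypothetical and are shown only to illustrate
which SL bits matter:
\f(CR
.P
.nf
```c
#include <assert.h>

/* Unicast: only the high-order SL bit (bit 3) selects the QoS level. */
int torus_qos_level(unsigned sl)
{
    assert(sl < 16);
    return (sl >> 3) & 0x1;
}

/* Multicast: only SL 0 and SL 8 should be configured, that is,
 * SL bits 0-2 must all be clear.
 */
int torus_mcast_sl_ok(unsigned sl)
{
    assert(sl < 16);
    return (sl & 0x7) == 0;
}
```
.fi
\fR
.P
For example, unicast SL values 0 through 7 all select the same QoS level,
as do SL values 8 through 15, and a multicast group configured with SL 9
would fail the torus_mcast_sl_ok() check.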
.P
Since SL to VL map configuration must be under the complete control of
torus-2QoS, any configuration via qos_sl2vl, qos_swe_sl2vl,
\fIetc.\fR, must and will be ignored, and a warning will be generated.
.P
For inter-switch links, torus-2QoS uses VL values 0-3 to implement one of
its supported QoS levels, and VL values 4-7 to implement the other. For
endport links (CA, router, switch management port), torus-2QoS uses VL
value 0 for one of its supported QoS levels and VL value 1 to implement
the other. Hard-to-diagnose application issues may arise if traffic is
not delivered fairly across each of these two VL ranges. For
inter-switch links, torus-2QoS will detect and warn if VL arbitration is
configured unfairly across VLs in the range 0-3, and also in the range
4-7. Note that the default OpenSM VL arbitration configuration does
not meet this constraint, so all torus-2QoS users should configure VL
arbitration via qos_ca_vlarb_high, qos_swe_vlarb_high, qos_ca_vlarb_low,
qos_swe_vlarb_low, \fIetc.\fR
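.P
As a rough illustration of the fairness constraint on inter-switch links,
the C sketch below treats a configuration as fair when every VL within a
range has the same total arbitration weight.
Both the function name and this definition of "fair" are assumptions;
the check that torus-2QoS actually performs may differ:
\f(CR
.P
.nf
```c
#include <assert.h>

/* weight[vl] is the total arbitration weight given to data VL vl
 * (0..7) on an inter-switch port.  Returns 1 if VLs 0-3 all share one
 * weight and VLs 4-7 all share one weight, else 0.  The two ranges
 * may differ from each other, since they serve different QoS levels.
 */
int torus_vlarb_is_fair(const unsigned weight[8])
{
    int vl;

    assert(weight);
    for (vl = 1; vl < 4; vl++) {
        if (weight[vl] != weight[0])
            return 0;
        if (weight[vl + 4] != weight[4])
            return 0;
    }
    return 1;
}
```
.fi
\fR
.P
Under this definition, weights of 4 on each of VLs 0-3 and 1 on each of
VLs 4-7 are fair, while giving all of the weight in the 0-3 range to VL 0
is not.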
.P
Note that torus-2QoS maps SL values to VL values differently
for inter-switch and endport links. This is why qos_vlarb_high and
qos_vlarb_low should not be used, as using them may result in
VL arbitration for a QoS level being different across inter-switch
links vs. across endport links.
.
.SH OPERATIONAL CONSIDERATIONS
.
Any routing algorithm for a torus IB fabric must employ path
SL values to avoid credit loops.
As a result, all applications run over such fabrics must perform a
path record query to obtain the correct path SL for connection setup.
Applications that use \fBrdma_cm\fR for connection setup will automatically
meet this requirement.
.P
If a change in fabric topology causes changes in path SL values required
to route without credit loops, in general all applications would need
to repath to avoid message deadlock. Since torus-2QoS has the ability
to reroute after a single switch failure without changing path SL values,
repathing by running applications is not required when the fabric
is routed with torus-2QoS.
.P
Torus-2QoS can provide unchanging path SL values in the presence of
subnet manager failover provided that all OpenSM instances agree on
the dateline locations. See torus-2QoS.conf(5) for details.
.P
Torus-2QoS will detect configurations of failed switches and links
that prevent routing that is free of credit loops, and will
log warnings and refuse to route. If "no_fallback" was configured in the
list of OpenSM routing engines, then no other routing engine
will attempt to route the fabric. In that case all paths that
do not transit the failed components will continue to work, and
the subset of paths that are still operational will remain
free of credit loops.
OpenSM will continue to attempt to route the fabric after every sweep
interval, and after any change (such as a link up) in the fabric topology.
When the fabric components are repaired, full functionality will be
restored.
.P
In the event OpenSM was configured to allow some other engine to
route the fabric if torus-2QoS fails, then credit loops and message
deadlock are likely if torus-2QoS had previously routed
the fabric successfully.
Even if the other engine is capable of routing a torus
without credit loops, applications that built connections with
path SL values granted under torus-2QoS will likely experience
message deadlock under routing generated by a different engine,
unless they repath.
.P
To verify that a torus fabric is routed free of credit loops,
use \fBibdmchk\fR to analyze data collected via \fBibdiagnet -vlr\fR.
.
.SH FILES
.TP
.B @OPENSM_CONFIG_DIR@/@OPENSM_CONFIG_FILE@
default OpenSM config file.
.TP
.B @OPENSM_CONFIG_DIR@/@QOS_POLICY_FILE@
default QoS policy config file.
.TP
.B @OPENSM_CONFIG_DIR@/@TORUS2QOS_CONF_FILE@
default torus-2QoS config file.
.
.SH SEE ALSO
.
opensm(8), torus-2QoS.conf(5), ibdiagnet(1), ibdmchk(1), rdma_cm(7).