Performance Manager
2/12/07

This document will describe an architecture and a phased plan
for an OpenFabrics OpenIB performance manager.

Currently, there is no open source performance manager, only
a perfquery diagnostic tool which some have scripted into a
"poor man's" performance manager.

The primary responsibilities of the performance manager are to:
1. Monitor subnet topology
2. Based on subnet topology, monitor performance and error counters.
   Also, possibly monitor counters related to congestion.
3. Perform data reduction (various calculations (rates, histograms, etc.))
   on counters obtained
4. Log performance data and indicate "interesting" related events


Performance Manager Components
1. Determine subnet topology
The performance manager can determine the subnet topology by subscribing
for GID in and out of service events. Upon receipt of a GID in service
event, it uses the GID to query the SA for the corresponding LID, using
SubnAdmGet NodeRecord with the PortGUID specified. The LID and NumPorts
returned are then added to the monitoring list. Note that the monitoring
list can be extended to be distributed, with the manager "balancing" the
assignments of new GIDs to the set of known monitors. For GID out of
service events, the GID is removed from the monitoring list.
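The event handling above can be sketched as follows. This is an illustrative sketch only, not OpenSM code; the SA query is a stub standing in for the SubnAdmGet of NodeRecord keyed by PortGUID.

```python
# Illustrative sketch: track the monitoring list from GID events.
# query_sa is a stub standing in for SubnAdmGet(NodeRecord) by PortGUID.

class MonitoringList:
    def __init__(self, query_sa):
        self.query_sa = query_sa   # gid -> (lid, num_ports)
        self.entries = {}          # gid -> (lid, num_ports)

    def gid_in_service(self, gid):
        # Query the SA for LID and NumPorts, then start monitoring.
        self.entries[gid] = self.query_sa(gid)

    def gid_out_of_service(self, gid):
        # Stop monitoring a departed port.
        self.entries.pop(gid, None)

# Stubbed SA returning a fixed LID/NumPorts for any GID.
ml = MonitoringList(lambda gid: (0x10, 2))
ml.gid_in_service("fe80:0000:0000:0000:0002:c903:0000:0001")
print(len(ml.entries))  # 1
ml.gid_out_of_service("fe80:0000:0000:0000:0002:c903:0000:0001")
print(len(ml.entries))  # 0
```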

2. Monitoring
Counters to be monitored include performance counters (data octets and
packets, both receive and transmit) and error counters. These are all in
the mandatory PortCounters attribute. Future support will include the
optional 64 bit counters, PortExtendedCounters (as this is currently only
known to be supported on one IB device). Also, one congestion counter
(PortXmitWait) will be monitored (on switch ports) initially.
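Grouped the way the monitor treats them, the counters above look like this. Field names are taken from the IBA PortCounters attribute; the error list here is abbreviated, not exhaustive.

```python
# Counter classes as described above. Names follow the IBA PortCounters
# attribute; the error list is a representative subset, not exhaustive.
TRAFFIC_COUNTERS = ["PortXmitData", "PortRcvData",
                    "PortXmitPkts", "PortRcvPkts"]
ERROR_COUNTERS = ["SymbolErrorCounter", "LinkErrorRecoveryCounter",
                  "LinkDownedCounter", "PortRcvErrors",
                  "PortXmitDiscards", "LocalLinkIntegrityErrors",
                  "ExcessiveBufferOverrunErrors", "VL15Dropped"]
CONGESTION_COUNTERS = ["PortXmitWait"]   # switch ports only, initially
```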

Polling rather than sampling will be used as the monitoring technique. The
polling rate is configurable from 1-65535 seconds (default TBD).
Note that with 32 bit counters, on 4x SDR links, byte counts can max out in
16 seconds and on 4x DDR links in 8 seconds. The polling rate needs to
deal with this as accurate byte and packet rates are desired. Since IB
counters are sticky, the counters need to be reset when they get "close"
to maxing out. This will result in some inaccuracy. When counters are
reset, the time of the reset will be tracked in the monitor and will be
queryable. Note that when the 64 bit counters are supported more generally,
the polling rate can be reduced.
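A back-of-envelope check of the wrap times quoted above: PortXmitData and PortRcvData are 32 bit counters that count in units of 4 octets, so a counter wraps after 2^32 * 4 bytes. The link data rates below are approximations, which is why the results land near, not exactly on, the 16 and 8 second figures.

```python
# Wrap time = counter range * octets per tick / link data rate.
COUNTER_MAX = 2 ** 32
OCTETS_PER_TICK = 4          # PortXmitData/PortRcvData count 4-octet units

def wrap_seconds(bytes_per_sec):
    return COUNTER_MAX * OCTETS_PER_TICK / bytes_per_sec

SDR_4X = 1_000_000_000   # 4x SDR: 8 Gbit/s of data, ~1 GB/s
DDR_4X = 2_000_000_000   # 4x DDR: ~2 GB/s

print(round(wrap_seconds(SDR_4X)))  # 17 -- on the order of the 16 s cited
print(round(wrap_seconds(DDR_4X)))  # 9  -- on the order of the 8 s cited
```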

The performance manager will support parallel queries. The level of
parallelism is configurable with a default of 64 queries outstanding
at one time.
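One way to bound the outstanding queries as described is a counting semaphore; a minimal sketch, where query_port is a stub standing in for the actual PortCounters MAD exchange:

```python
# Bound in-flight queries with a semaphore (default 64 outstanding).
import threading

MAX_OUTSTANDING = 64
slots = threading.BoundedSemaphore(MAX_OUTSTANDING)
results = {}
results_lock = threading.Lock()

def poll(lid, query_port):
    with slots:                 # blocks once 64 queries are in flight
        counters = query_port(lid)
    with results_lock:
        results[lid] = counters

stub_query = lambda lid: {"PortXmitData": 0}   # placeholder MAD exchange
threads = [threading.Thread(target=poll, args=(lid, stub_query))
           for lid in range(1, 17)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(results))  # 16
```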

Configuration and dynamic adjustment of any performance manager "knobs"
will be supported.

Also, there will be a console interface to obtain performance data.
It will be able to reset counters and report on specific nodes or
node types of interest (CAs only, switches only, all, ...). The
specifics are TBD.

3. Data Reduction
For errors, a rate rather than the raw value will be calculated. An error
event is only indicated when the rate exceeds a threshold.
For packet and byte counters, small changes will be aggregated
and only significant changes are updated.
Aggregated histograms (per node, all nodes (this is TBD)) for each
counter will be provided. Actual counters will also be written to files.
NodeGUID will be used to identify a node. File formats are TBD. One
format to be supported might be CSV.
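The two reduction rules above can be sketched as follows. Both threshold values here are illustrative placeholders, not defaults from this design.

```python
# Errors become rates checked against a threshold; traffic deltas below
# a cutoff are aggregated rather than reported. Values are examples only.
ERROR_RATE_THRESHOLD = 10.0     # errors per second (illustrative)
SIGNIFICANT_DELTA = 1 << 20     # report traffic changes of >= 1 MiB

def error_event(prev, curr, interval_sec):
    # Rate, not raw value, drives the event decision.
    rate = (curr - prev) / interval_sec
    return rate > ERROR_RATE_THRESHOLD

def significant_traffic(prev, curr):
    # Small changes are aggregated; only large deltas trigger an update.
    return (curr - prev) >= SIGNIFICANT_DELTA

print(error_event(0, 100, 5))        # 20 err/s exceeds threshold -> True
print(significant_traffic(0, 4096))  # 4 KiB delta is aggregated -> False
```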

4. Logging
"Interesting" events determined by the performance manager will be
logged, as well as the performance data itself. Significant events
will be logged to syslog. There are some interesting scalability
issues relative to logging, especially for the distributed model.

Events will be based on rates which are configured as thresholds.
There will be configurable thresholds for the error counters with
reasonable defaults. Correlation of PerfManager and SM events is
interesting but not a mandatory requirement.


Performance Manager Scalability
Clearly, as the polling rate goes up, the number of nodes which can be
monitored from a single performance management node decreases. There is
some evidence that a single dedicated management node may not be able to
monitor the largest clusters at a rapid rate.

There are numerous PerfManager models which can be supported:
1. Integrated as thread(s) with OpenSM (run only when SM is master)
2. Standby SM
3. Standalone PerfManager (not running with master or standby SM)
4. Distributed PerfManager (most scalable approach)

Note that these models are in order of implementation complexity and
hence "schedule".

The simplest model is to run the PerfManager with the master SM. This has
the least scalability but is the easiest to implement. Note that in this
model the topology can be obtained without the GID in and out of service
events, but these events are needed for any of the other models to be
supported.

The next model is to run the PerfManager with a standby SM. Standbys are not
doing much currently (polling the master), so there is much idle CPU.
The downside of this approach is that if the standby takes over as master,
the PerfManager would need to be moved (or it becomes model 1).

A totally separate standalone PerfManager would allow for a deployment
model which eliminates the downside of model 2 (standby SM). It could
still be built in a similar manner to model 2, with unneeded functions
(SM and SA) not included. The advantage of this model is that it could
be more readily usable with a vendor specific SM (switch based or otherwise).
Vendor specific SMs usually come with a built-in performance manager, so
this model assumes that there would be a way to disable that performance
manager. Model 2 can act like model 3 if a disable SM feature is supported
in OpenSM (command line/console); this would take the SM to not active.

The most scalable model is a distributed PerfManager. One approach to
distribution is a hierarchical model, where there is a PerfManager at the
top level with a number of PerfMonitors which are responsible for some
portion of the subnet.

The separation of the PerfManager from OpenSM brings up the following
additional issues:
1. What communication is needed between OpenSM and the PerfManager ?
2. Integration of interesting events with the OpenSM log
(Does the performance manager assume OpenSM ? Does it need to work with
vendor SMs ?)

Hierarchical distribution brings up some additional issues:
1. How is the hierarchy determined ?
2. How do the PerfManager and PerfMonitors find each other ?
3. How is the subnet divided amongst the PerfMonitors ?
4. Communication amongst the PerfManager and the PerfMonitors
(including communication failures)
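For question 3, one hypothetical scheme (not part of this design, just an illustration) is to hash each monitored GID onto the set of known PerfMonitors, which also provides the "balancing" of assignments mentioned earlier. The monitor names below are made up.

```python
# Hypothetical subnet division: hash each GID onto a PerfMonitor.
import hashlib

def assign_monitor(gid, monitors):
    # Deterministic, roughly uniform assignment of GIDs to monitors.
    digest = hashlib.sha1(gid.encode()).hexdigest()
    return monitors[int(digest, 16) % len(monitors)]

monitors = ["perfmon-0", "perfmon-1", "perfmon-2"]   # made-up names
gid = "fe80:0000:0000:0000:0002:c903:0000:0001"
print(assign_monitor(gid, monitors) in monitors)  # True
```

A scheme like this answers the division question but not rebalancing when monitors join or leave; consistent hashing would be one way to limit reassignment in that case.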

In terms of inter manager communication, there seem to be several
choices:
1. Use vendor specific MADs (which can be RMPP'd) and build on top of
this
2. Use RC QP communication and build on top of this
3. Use IPoIB which is much more powerful as sockets can then be utilized

RC QP communication improves on the lower performance of the vendor
specific MAD approach but is not as powerful as the socket based approach.

The only downside of IPoIB is that it requires multicast to be functioning.
It seems reasonable to require IPoIB across the management nodes. This
can either be a separate IPoIB subnet or one shared with other endnodes
on the subnet. (If this communication is built on top of sockets, it
can be any IP subnet amongst the manager nodes.)

The first implementation phase will address models 1-3. Model 3 is optional,
as it is similar to models 1 and 2 and may not be needed.

Model 4 will be addressed in a subsequent implementation phase (and a future
version of this document). Model 4 can be built on the basis of models 1 and
2, where some SM, not necessarily the master, is the PerfManager and the rest
are PerfMonitors.


Performance Manager Partition Membership
Note that as the performance manager needs to talk via GSI to the PMAs
in all the end nodes, and GSI utilizes PKey sharing, partition membership,
if invoked, must account for this.

The most straightforward deployment of the performance manager is
to have it be a member of the full default partition (P_Key 0xFFFF).


Performance Manager Redundancy
TBD (future version of this document)


Congestion Management
TBD (future version of this document)


QoS Management
TBD (future version of this document)