Performance Manager
2/12/07

This document will describe an architecture and a phased plan
for an OpenFabrics OpenIB performance manager.

Currently, there is no open source performance manager, only
a perfquery diagnostic tool which some have scripted into a
"poor man's" performance manager.

The primary responsibilities of the performance manager are to:
1. Monitor subnet topology
2. Based on subnet topology, monitor performance and error counters.
   Also, possibly monitor counters related to congestion.
3. Perform data reduction (various calculations (rates, histograms, etc.))
   on counters obtained
4. Log performance data and indicate "interesting" related events


Performance Manager Components
1. Determine subnet topology
   The performance manager can determine the subnet topology by subscribing
   for GID in and out of service events. Upon receipt of a GID in service
   event, the GID is used to query the SA for the corresponding LID via a
   SubnAdmGet of NodeRecord with the PortGUID specified. The LID and NumPorts
   returned are then added to the monitoring list. Note that the monitoring
   list can be extended to be distributed, with the manager "balancing" the
   assignments of new GIDs across the set of known monitors. For GID out of
   service events, the GID is removed from the monitoring list.
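
   As a rough illustration of this flow, the sketch below maintains a
   monitoring list from GID in and out of service events. The helper
   sa_get_node_record_by_port_guid() is hypothetical; it simply stands in
   for the SubnAdmGet of NodeRecord keyed by PortGUID and does not refer
   to any existing API.

      #include <stdint.h>
      #include <stdlib.h>

      struct monitored_node {
          uint64_t port_guid;          /* low 64 bits of the in-service GID */
          uint16_t lid;                /* LID returned in the NodeRecord */
          uint8_t  num_ports;          /* NumPorts returned in the NodeRecord */
          struct monitored_node *next;
      };

      static struct monitored_node *monitor_list;

      /* Hypothetical SA query helper (SubnAdmGet NodeRecord by PortGUID). */
      extern int sa_get_node_record_by_port_guid(uint64_t port_guid,
                                                 uint16_t *lid,
                                                 uint8_t *num_ports);

      static void handle_gid_in_service(uint64_t port_guid)
      {
          struct monitored_node *node = calloc(1, sizeof(*node));

          if (!node)
              return;
          if (sa_get_node_record_by_port_guid(port_guid, &node->lid,
                                              &node->num_ports)) {
              free(node);
              return;
          }
          node->port_guid = port_guid;
          node->next = monitor_list;
          monitor_list = node;         /* polled on every subsequent sweep */
      }

      static void handle_gid_out_of_service(uint64_t port_guid)
      {
          struct monitored_node **pp = &monitor_list;

          while (*pp) {
              if ((*pp)->port_guid == port_guid) {
                  struct monitored_node *dead = *pp;
                  *pp = dead->next;
                  free(dead);
                  return;
              }
              pp = &(*pp)->next;
          }
      }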

2. Monitoring
   Counters to be monitored include performance counters (data octets and
   packets, both receive and transmit) and error counters. These are all in
   the mandatory PortCounters attribute. Future support will include the
   optional 64 bit counters, PortExtendedCounters (as this is currently only
   known to be supported on one IB device). Also, one congestion counter
   (PortXmitWait) will be monitored (on switch ports) initially.
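
   For illustration, the per-port sample kept by the monitor might look like
   the sketch below. The field names and widths follow the mandatory
   PortCounters attribute; the struct itself is only an assumption about how
   samples could be held in memory, not part of this design.

      #include <stdint.h>
      #include <time.h>

      /* One poll's worth of the PortCounters fields of interest.  Widths
       * follow the PortCounters attribute; PortXmitWait is the single
       * congestion counter monitored initially (on switch ports). */
      struct port_counters_sample {
          time_t   poll_time;                 /* when this poll completed */
          uint16_t symbol_error_counter;
          uint8_t  link_error_recovery_counter;
          uint8_t  link_downed_counter;
          uint16_t port_rcv_errors;
          uint16_t port_xmit_discards;
          uint32_t port_xmit_data;            /* data octets / 4, transmit */
          uint32_t port_rcv_data;             /* data octets / 4, receive */
          uint32_t port_xmit_pkts;
          uint32_t port_rcv_pkts;
          uint32_t port_xmit_wait;            /* congestion indication */
      };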

   Polling rather than sampling will be used as the monitoring technique. The
   polling rate is configurable from 1-65535 seconds (default TBD).
   Note that with 32 bit counters, on 4x SDR links, byte counts can max out in
   16 seconds and on 4x DDR links in 8 seconds. The polling rate needs to
   deal with this, as accurate byte and packet rates are desired. Since IB
   counters are sticky, the counters need to be reset when they get "close"
   to maxing out. This will result in some inaccuracy. When counters are
   reset, the time of the reset will be tracked in the monitor and will be
   queryable. Note that when the 64 bit counters are supported more generally,
   the polling rate can be reduced.
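
   The 16 and 8 second figures follow from the counter width and the link
   data rate; the small sketch below redoes that arithmetic and shows one
   possible "reset when close to maxing out" test. The 75% threshold is an
   arbitrary illustrative choice, not a value taken from this design.

      #include <stdint.h>
      #include <stdio.h>

      /* PortXmitData/PortRcvData count in units of 4 octets, so a 32 bit
       * counter covers 2^32 * 4 bytes (about 17 GB).  4x SDR carries about
       * 1 GB/s of data and 4x DDR about 2 GB/s, giving wrap times on the
       * order of the 16 and 8 seconds quoted above. */
      static double seconds_to_wrap(double link_bytes_per_sec)
      {
          return ((double)UINT32_MAX * 4.0) / link_bytes_per_sec;
      }

      /* Reset policy sketch: clear the counter (and record the reset time)
       * once it passes 75% of its range.  The 75% figure is illustrative. */
      static int needs_reset(uint32_t counter)
      {
          return counter > (uint32_t)(UINT32_MAX / 4) * 3;
      }

      int main(void)
      {
          printf("4x SDR: %.1f s to wrap\n", seconds_to_wrap(1e9));  /* ~17 s */
          printf("4x DDR: %.1f s to wrap\n", seconds_to_wrap(2e9));  /* ~8.6 s */
          return needs_reset(UINT32_MAX) ? 0 : 1;
      }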

   The performance manager will support parallel queries. The level of
   parallelism is configurable with a default of 64 queries outstanding
   at one time.
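
   A sketch of how the outstanding query limit could be enforced on each
   sweep. issue_port_counters_query() and wait_for_one_completion() are
   hypothetical placeholders for the actual MAD send and completion path,
   not existing library calls.

      #include <stddef.h>
      #include <stdint.h>

      #define DEFAULT_MAX_OUTSTANDING 64   /* the configurable parallelism knob */

      /* Hypothetical MAD layer hooks. */
      extern int issue_port_counters_query(uint16_t lid, uint8_t port);
      extern void wait_for_one_completion(void);

      struct poll_target { uint16_t lid; uint8_t port; };

      /* Issue PortCounters queries for every target, keeping at most
       * max_outstanding of them in flight at any one time. */
      static void sweep(const struct poll_target *targets, size_t n,
                        unsigned max_outstanding)
      {
          size_t i;
          unsigned outstanding = 0;

          for (i = 0; i < n; i++) {
              while (outstanding >= max_outstanding) {
                  wait_for_one_completion();
                  outstanding--;
              }
              if (issue_port_counters_query(targets[i].lid,
                                            targets[i].port) == 0)
                  outstanding++;
          }
          while (outstanding > 0) {
              wait_for_one_completion();
              outstanding--;
          }
      }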

   Configuration and dynamic adjustment of any performance manager "knobs"
   will be supported.
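
   The knob names below are purely illustrative of the kind of parameters
   that would be configurable and adjustable at run time; they are not
   actual OpenSM option names.

      #include <stdint.h>

      /* Illustrative PerfManager "knobs"; names and defaults are examples
       * only, not actual configuration options. */
      struct perfmgr_config {
          uint16_t sweep_time_s;         /* polling rate, 1-65535 seconds */
          unsigned max_outstanding;      /* parallel queries, default 64 */
          double   error_rate_threshold; /* rate above which an event is logged */
          char     data_dir[256];        /* where counter files are written */
      };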

   Also, there will be a console interface to obtain performance data.
   It will be able to reset counters and report on specific nodes or
   node types of interest (CAs only, switches only, all, ...). The
   specifics are TBD.

3. Data Reduction
   For errors, the rate rather than the raw value will be calculated. An
   error event is only indicated when the rate exceeds a threshold.
   For packet and byte counters, small changes will be aggregated
   and only significant changes are updated.
   Aggregated histograms (per node, all nodes (this is TBD)) for each
   counter will be provided. Actual counters will also be written to files.
   NodeGUID will be used to identify the node. File formats are TBD. One
   format to be supported might be CSV.
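
   A sketch of the reduction step described above: two successive polls of
   an error counter are turned into a rate, an event is flagged only when a
   threshold is crossed, and one possible CSV row layout is shown. Since
   file formats are TBD, the column order here is only an example.

      #include <stdint.h>
      #include <stdio.h>
      #include <time.h>

      struct counter_poll {
          uint64_t value;              /* counter value, widened to 64 bits */
          time_t   poll_time;
      };

      /* Rate between two polls, accounting for a counter reset performed
       * after the previous poll (reset_time is tracked by the monitor). */
      static double counter_rate(const struct counter_poll *prev,
                                 const struct counter_poll *cur,
                                 time_t reset_time)
      {
          uint64_t base_value = prev->value;
          time_t   base_time  = prev->poll_time;
          double   dt;

          if (reset_time > prev->poll_time) { /* cleared since last poll */
              base_value = 0;
              base_time = reset_time;
          }
          dt = difftime(cur->poll_time, base_time);
          return dt > 0.0 ? (double)(cur->value - base_value) / dt : 0.0;
      }

      /* Only indicate an error event when the rate exceeds the threshold. */
      static int error_event(double rate, double threshold_per_sec)
      {
          return rate > threshold_per_sec;
      }

      /* One possible CSV layout: nodeguid,port,timestamp,counter,value */
      static void write_csv_row(FILE *f, uint64_t node_guid, int port,
                                time_t ts, const char *counter, uint64_t value)
      {
          fprintf(f, "0x%016llx,%d,%lld,%s,%llu\n",
                  (unsigned long long)node_guid, port,
                  (long long)ts, counter, (unsigned long long)value);
      }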

4. Logging
   "Interesting" events determined by the performance manager will be
   logged, as well as the performance data itself. Significant events
   will be logged to syslog. There are some interesting scalability
   issues relative to logging, especially for the distributed model.

   Events will be based on rates which are configured as thresholds.
   There will be configurable thresholds for the error counters with
   reasonable defaults. Correlation of PerfManager and SM events is
   interesting but not a mandatory requirement.
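
   A minimal sketch of how a threshold crossing could be reported through
   syslog; the ident, priority, and message wording are illustrative
   choices only.

      #include <stdint.h>
      #include <syslog.h>

      /* Report a significant ("interesting") event to syslog. */
      static void report_threshold_event(uint64_t node_guid, int port,
                                         const char *counter,
                                         double rate, double threshold)
      {
          openlog("perfmgr", LOG_PID, LOG_DAEMON);
          syslog(LOG_WARNING,
                 "node 0x%016llx port %d: %s rate %.2f/s exceeds threshold %.2f/s",
                 (unsigned long long)node_guid, port, counter, rate, threshold);
          closelog();
      }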


Performance Manager Scalability
Clearly as the polling rate goes up, the number of nodes which can be
monitored from a single performance management node decreases. There is
some evidence that a single dedicated management node may not be able to
monitor the largest clusters at a rapid rate.

There are numerous PerfManager models which can be supported:
1. Integrated as thread(s) with OpenSM (run only when SM is master)
2. Standby SM
3. Standalone PerfManager (not running with master or standby SM)
4. Distributed PerfManager (most scalable approach)

Note that these models are in order of implementation complexity and
hence "schedule".

The simplest model is to run the PerfManager with the master SM. This has
the least scalability but is the simplest to implement. Note that in this
model the topology can be obtained without the GID in and out of service
events, but these events are needed for any of the other models to be
supported.

The next model is to run the PerfManager with a standby SM. Standbys are not
doing much currently (polling the master), so there is much idle CPU.
The downside of this approach is that if the standby takes over as master,
the PerfManager would need to be moved (or it becomes model 1).

A totally separate standalone PerfManager would allow for a deployment
model which eliminates the downside of model 2 (standby SM). It could
still be built in a similar manner to model 2, with unneeded functions
(SM and SA) not included. The advantage of this model is that it could
be more readily usable with a vendor specific SM (switch based or otherwise).
Vendor specific SMs usually come with a built-in performance manager, and
this assumes that there would be a way to disable that performance manager.
Model 2 can act like model 3 if a disable SM feature is supported in OpenSM
(command line/console). This would take the SM to the not active state.

The most scalable model is a distributed PerfManager. One approach to
distribution is a hierarchical model where there is a PerfManager at the
top level with a number of PerfMonitors which are responsible for some
portion of the subnet.

The separation of the PerfManager from OpenSM brings up the following
additional issues:
1. What communication is needed between OpenSM and the PerfManager?
2. Integration of interesting events with the OpenSM log
(Does the performance manager assume OpenSM? Does it need to work with
vendor SMs?)

Hierarchical distribution brings up some additional issues:
1. How is the hierarchy determined?
2. How do the PerfManager and PerfMonitors find each other?
3. How is the subnet divided amongst the PerfMonitors?
4. Communication amongst the PerfManager and the PerfMonitors
(including communication failures)

In terms of inter manager communication, there seem to be several
choices:
1. Use vendor specific MADs (which can be RMPP'd) and build on top of
this
2. Use RC QP communication and build on top of this
3. Use IPoIB which is much more powerful as sockets can then be utilized

RC QP communication improves on the lower performance of the vendor
specific MAD approach but is not as powerful as the socket based approach.

The only downside of IPoIB is that it requires multicast to be functioning.
It seems reasonable to require IPoIB across the management nodes. This
can either be a separate IPoIB subnet or a shared one with other endnodes
on the subnet. (If this communication is built on top of sockets, it
can be any IP subnet amongst the manager nodes.)
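
If the socket based approach is taken, a PerfMonitor might push reduced data
to the PerfManager as fixed size records over a TCP connection (IPoIB or any
IP subnet between the manager nodes). The record layout and helper below are
purely illustrative; byte order handling is omitted for brevity.

    #include <stdint.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    /* Hypothetical record a PerfMonitor pushes to the PerfManager. */
    struct perf_report {
        uint64_t node_guid;
        uint8_t  port;
        uint8_t  reserved[3];
        uint32_t interval_s;      /* polling interval the data covers */
        uint64_t xmit_bytes;      /* already reduced by the monitor */
        uint64_t rcv_bytes;
        uint64_t error_events;
    };

    /* Send one report over an already connected socket; 0 on success. */
    static int send_report(int sock_fd, const struct perf_report *r)
    {
        const char *p = (const char *)r;
        size_t left = sizeof(*r);

        while (left > 0) {
            ssize_t n = send(sock_fd, p, left, 0);
            if (n <= 0)
                return -1;
            p += n;
            left -= (size_t)n;
        }
        return 0;
    }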

The first implementation phase will address models 1-3. Model 3 is optional
as it is similar to models 1 and 2 and may not be needed.

Model 4 will be addressed in a subsequent implementation phase (and a future
version of this document). Model 4 can be built on the basis of models 1 and
2, where some SM, not necessarily the master, is the PerfManager and the rest
are PerfMonitors.


Performance Manager Partition Membership
Note that as the performance manager needs to talk via GSI to the PMAs
in all the end nodes, and GSI utilizes PKey sharing, partition membership,
if invoked, must account for this.

The most straightforward deployment of the performance manager is
to have it be a member of the full default partition (P_Key 0xFFFF).


Performance Manager Redundancy
TBD (future version of this document)


Congestion Management
TBD (future version of this document)


QoS Management
TBD (future version of this document)