OpenSM Release Notes 3.0.13
=============================

Version: OpenFabrics Enterprise Distribution (OFED) 1.2
Repo: git://git.openfabrics.org/~ofed_1_2/management.git (release)
      git://git.openfabrics.org/~halr/management.git (development)
Date: June 2007

1 Overview
----------
This document describes the contents of the OpenSM OFED 1.2 release.
OpenSM is an InfiniBand compliant Subnet Manager and Administration,
and runs on top of OpenIB. The OpenSM version for this release
is openib-3.0.13.

This document includes the following sections:
1 This Overview section (describing new features and software
dependencies)
2 Known Issues And Limitations
3 Unsupported IB Compliance Statements
4 Major Bug Fixes
5 Main Verification Flows
6 Qualified software stacks and devices

1.1 Major New Features

* Routing improvements
Two new routing algorithms, FAT tree and LASH, have been added, along
with performance improvements to the existing routing algorithms. See
the opensm man page for additional details.

* SA Optional Record support now "virtually" complete
Includes SA InformInfo improvements and InformInfoRecord support in
addition to support for the remaining SA optional records
(MulticastForwardingTableRecord, SwitchInfoRecord). Also, SMInfoRecord
support was improved to include all SMs found.

* SA database dump/restore
OpenSM now includes the ability to dump and restore the SA database.
This allows for all SA registrations (multicast, services, and events)
to be saved and restored later.

In verbose mode, OpenSM dumps the SA DB (existing multicast groups,
services and InformInfo) into a dump file named "opensm-sa.dump",
located under the standard OpenSM dump directory (/var/log by default).

If option -S is specified and an SA DB dump file name is provided, OpenSM
will try to restore the SA database from this file. If the restore is
successful, OpenSM won't ask for client reregistration at subnet bring-up.

* Modular routing for multicast
In conjunction with SA database dump/restore, there is the ability to
dump and load switch LID matrices (min hops tables) which are used
for multicast route calculation.

* IB router enablement
OpenSM now supports router ports properly (in terms of PortInfo handling).
There is also some experimental support for IB routers which is enabled
via the ROUTER_EXP compile flag. This support includes SA PathRecord and
MCMemberRecord support for off-subnet GIDs.

* Socket support added to console
OpenSM console now supports remote in addition to local access.
Remote access is currently via telnet.

1.2 Minor New Features

* Change output format of DR path from hex to decimal port numbers

* Log rotation
The OpenSM log can now be rotated while OpenSM is running (without
stopping and restarting OpenSM). This is accomplished via SIGUSR1.

* Support scope for IPoIB multicast groups in partition config

* Dump filename changed from subnet.lst to osm-subnet.lst
Default temp directory for non-Windows platforms was previously changed
from /tmp to /var/log.

* Add option to force SDR link speed
Add option to opensm.opts to force link speed. Currently, only forcing
to SDR link speed is supported. This option is not supported as a
command line option.
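
The log rotation item in the list above can be scripted. The following is
a minimal sketch, assuming a logrotate-style flow in which the current log
is renamed first and OpenSM is then signalled so that it continues in a
fresh file. The helper name rotate_opensm_log, the injectable send
parameter, and the log path in the usage note are illustrative assumptions
and are not part of OpenSM:

```python
import os
import signal
import time


def rotate_opensm_log(log_path, pid, send=os.kill):
    """Rename the current log, then ask the process to rotate.

    Assumption: the process starts a fresh log file when it receives
    SIGUSR1, which is what the release notes state for OpenSM.
    `send` defaults to os.kill and is a parameter only so the sketch
    can be exercised without signalling a real process.
    """
    rotated = "%s.%s" % (log_path, time.strftime("%Y%m%d-%H%M%S"))
    os.rename(log_path, rotated)      # move the old log out of the way
    send(pid, signal.SIGUSR1)         # request log rotation
    return rotated
```

A call could look like rotate_opensm_log("/var/log/osm.log", pid), where
the path is an assumed example and pid is the process ID of the running
opensm (how it is found is left to the caller).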

1.3 Library API Changes

None

1.4 Software Dependencies

OpenSM depends on the installation of either OFED 1.2, OFED 1.1,
OFED 1.0, OpenIB gen2 (e.g. IBG2 distribution), OpenIB gen1 (e.g. IBGD
distribution), or Mellanox VAPI stacks. The qualified driver versions
are provided in Table 2, "Qualified IB Stacks".

1.5 Supported Devices Firmware

The main task of OpenSM is to initialize InfiniBand devices. The
qualified devices and their corresponding firmware versions
are listed in Table 3.

2 Known Issues And Limitations
------------------------------

* No Service / Key associations:
There is no way to manage Service access by Keys.

* No SM to SM SMDB synchronization:
This puts the burden of re-registering services, multicast groups, and
inform-info on the client application (or IB access layer core).

* No "port down" event handling:
Changing the switch port through which OpenSM connects to the IB
fabric may cause incorrect operation. Please restart OpenSM whenever
such a connectivity change is made.

* Changing connections during SM operation:
Under some conditions the SM can get confused by a change in
cabling (moving a cable from one switch port to another) and
momentarily see this as having the same GUID appear connected
to two different IB ports. Under some conditions, when the SM fails to
get the corresponding change event, it might mistakenly report this case
as a "duplicated GUID" case and abort. It is advisable to double-check
the syslog after each such change in connectivity and restart
OpenSM if it has exited. The same error ("duplicated GUID") will
also appear with a loopback plug.
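
The "duplicated GUID" condition described in the last item above amounts
to one GUID being seen at more than one attachment point in the same
sweep. The following is a minimal sketch of such a check, not OpenSM's
discovery code; the function name and the (switch, port) tuples used to
describe an attachment point are hypothetical:

```python
from collections import defaultdict


def find_duplicated_guids(sightings):
    """Return GUIDs that were seen at more than one attachment point.

    sightings -- iterable of (guid, attach_point) pairs collected
                 during one sweep; attach_point is any sortable value,
                 for example a (switch_name, port_number) tuple
    """
    seen = defaultdict(set)
    for guid, attach_point in sightings:
        seen[guid].add(attach_point)
    return {guid: sorted(points)
            for guid, points in seen.items() if len(points) > 1}
```

If a cable is moved and the stale entry for the old port is still present
when the new one is discovered, the moved GUID shows up in the result even
though only one physical device exists.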

3 Unsupported IB Compliance Statements
--------------------------------------
The following section lists all the IB compliance statements which
OpenSM does not support. Please refer to the IB specification for detailed
information regarding each compliance statement.

* C14-22 (Authentication):
M_Key, M_KeyProtectBits and M_KeyLeasePeriod shall be set in one
SubnSet method. As a work-around, an OpenSM option is provided for
defining the protect bits.

* C14-67 (Authentication):
On SubnGet(SMInfo) and SubnSet(SMInfo) - if M_Key is not zero then
the SM shall generate a SubnGetResp if the M_Key matches, or
silently drop the packet if M_Key does not match.

* C15-0.1.23.4 (Authentication):
InformInfoRecords shall always be provided with the QPN set to 0,
except for the case of a trusted request, in which case the actual
subscriber QPN shall be returned.

* o13-17.1.2 (Event-FWD):
If no permission to forward, the subscription should be removed and
no further forwarding should occur.

* C14-24.1.1.5 and C14-62.1.1.22 (Initialization):
GUIDInfo - SM should enable assigning Port GUIDInfo.

* C14-44 (Initialization):
If the SM discovers that it is missing an M_Key to update CA/RT/SW,
it should notify the higher level.

* C14-62.1.1.12 (Initialization):
PortInfo:M_Key - Set the M_Key to a node based random value.

* C14-62.1.1.13 (Initialization):
PortInfo:P_KeyProtectBits - set according to an optional policy.

* C14-62.1.1.24 (Initialization):
SwitchInfo:DefaultPort - should be configured for random FDB.

* C14-62.1.1.32 (Initialization):
RandomForwardingTable should be configured.

* o15-0.1.12 (Multicast):
If the JoinState is SendOnlyNonMember = 1 (only), then the endport
should join as sender only.

* o15-0.1.8 (Multicast):
If a request to create an MCG has fields that cannot be met,
return ERR_REQ_INVALID (OpenSM currently ignores SL and FlowLabelTClass).

* C15-0.1.8.6 (SA-Query):
Respond to SubnAdmGetTraceTable - this is an optional attribute.

* C15-0.1.13 (Services):
Reject ServiceRecord create, modify or delete if the given
ServiceP_Key does not match the one included in the ServiceGID port
and the port that sent the request.

* C15-0.1.14 (Services):
Provide means to associate service name and ServiceKeys.

4 Major Bug Fixes
-----------------

The following is a list of bugs that were fixed. Note that other less critical
or visible bugs were also fixed.

* osm_sminfo_rcv.c: Add SMInfo self query check. OpenSM can query
itself for SMInfo occasionally due to port moving during subnet
discovery process. Don't create remote SM entry in this case to
prevent deadlocks.

* osm_ucast_updn.c: Two similar bugs in up/down routing fixed.
8-bit integers were used as indexes when scanning the subnet, which
in one case caused OpenSM to crash when the ranking "path" is longer
than 256 switches, and in the other case caused OpenSM to go into
an infinite loop when fabric has more than 256 roots.
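
The class of bug fixed in osm_ucast_updn.c can be reproduced in a few
lines. This is a sketch of the failure mode only, not the OpenSM code:
ctypes.c_uint8 stands in for a C 8-bit loop index, and the function name
and the max_steps guard are illustrative assumptions:

```python
import ctypes


def count_with_uint8_index(n_items, max_steps=10000):
    """Walk n_items entries using an 8-bit loop index.

    Returns the number of loop iterations, or None if the loop has not
    finished after max_steps (the C equivalent would never finish).
    """
    i = ctypes.c_uint8(0)
    steps = 0
    while i.value < n_items:
        steps += 1
        if steps >= max_steps:
            return None                    # index wrapped; loop cannot end
        i = ctypes.c_uint8(i.value + 1)    # 255 + 1 wraps around to 0
    return steps
```

With 255 items the loop terminates normally; with 300 items the index
wraps before it can reach the bound and the guard trips. An index type
wide enough for the largest expected count avoids the wrap-around.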

* osm_sm_state_mgr.c: In __osm_sm_state_mgr_send_master_sm_info_req,
handle master GUID port not found properly

* osm_sa_multipath_record.c: In __osm_mpr_rcv_get_path_parms, return
IB_NOT_FOUND rather than IB_ERROR when can't route to LID from switch

* osm_sa_path_record.c: In __osm_pr_rcv_get_path_parms, return IB_NOT_FOUND
rather than IB_ERROR when can't route to LID from switch

* osm_vendor_ibumad.c: In osm_vendor_set_sm, set issmfd to
-1 on open error

* osm_vendor_ibumad: Termination crash fix
When OpenSM is terminated, the umad_receiver thread keeps running even
after the structures are destroyed and freed, which causes random (but
easily reproducible) crashes. The reason is that osm_vendor_delete() does
not take care of thread termination. This patch adds receiver thread
cancellation (using pthread_cancel() and pthread_join()) and takes care
to have all mutexes unlocked upon termination. There is also a minor
termination code consolidation: the osm_vendor_port_close() function.

* osm_port_profile.h: Fix reinsertion issue in osm_port_prof_set_ignored_port

* osm_matrix.h: Fix segfault with up/down and root nodes file

* osm_sa_path_record.c: In osm_pr_rcv_process, fix endian of hop_limit

* osm_vendor_ibumad.c: Close umad port in osm_vendor_delete

* osm_sa_(multipath path)_record.c: Fix MultiPathRecord/PathRecord issues
with using MTU/rate/PktLife explicitly ignoring selectors

Previously, OpenSM just used the resulting path MTU/rate/pkt-life and
failed the query even though the selector might have allowed selecting
an appropriate value.

After this fix, the following results are obtained for the case of
a path allowing a maximal MTU of 2K.

In standard mode:
------------------------------------------------------------
MTU greater than ... 256 (0x01) -> equal to ....... 2K
MTU less than ...... 256 (0x41) -> NO PATHS
MTU equal to ....... 256 (0x81) -> equal to ....... 256
MTU largest possible 256 (0xc1) -> equal to ....... 2K
MTU greater than ... 512 (0x02) -> equal to ....... 2K
MTU less than ...... 512 (0x42) -> equal to ....... 256
MTU equal to ....... 512 (0x82) -> equal to ....... 512
MTU largest possible 512 (0xc2) -> equal to ....... 2K
MTU greater than ... 1K  (0x03) -> equal to ....... 2K
MTU less than ...... 1K  (0x43) -> equal to ....... 512
MTU equal to ....... 1K  (0x83) -> equal to ....... 1K
MTU largest possible 1K  (0xc3) -> equal to ....... 2K
MTU greater than ... 2K  (0x04) -> NO PATHS
MTU less than ...... 2K  (0x44) -> equal to ....... 1K
MTU equal to ....... 2K  (0x84) -> equal to ....... 2K
MTU largest possible 2K  (0xc4) -> equal to ....... 2K
MTU greater than ... 4K  (0x05) -> NO PATHS
MTU less than ...... 4K  (0x45) -> equal to ....... 2K
MTU equal to ....... 4K  (0x85) -> NO PATHS
MTU largest possible 4K  (0xc5) -> equal to ....... 2K
============================================================

With enable_quirks (when one of the ends is a Tavor device):
------------------------------------------------------------
MTU greater than ... 256 (0x01) -> equal to ....... 1K
MTU less than ...... 256 (0x41) -> NO PATHS
MTU equal to ....... 256 (0x81) -> equal to ....... 256
MTU largest possible 256 (0xc1) -> equal to ....... 2K
MTU greater than ... 512 (0x02) -> equal to ....... 1K
MTU less than ...... 512 (0x42) -> equal to ....... 256
MTU equal to ....... 512 (0x82) -> equal to ....... 512
MTU largest possible 512 (0xc2) -> equal to ....... 2K
MTU greater than ... 1K  (0x03) -> NO PATHS
MTU less than ...... 1K  (0x43) -> equal to ....... 512
MTU equal to ....... 1K  (0x83) -> equal to ....... 1K
MTU largest possible 1K  (0xc3) -> equal to ....... 2K
MTU greater than ... 2K  (0x04) -> NO PATHS
MTU less than ...... 2K  (0x44) -> equal to ....... 1K
MTU equal to ....... 2K  (0x84) -> equal to ....... 2K
MTU largest possible 2K  (0xc4) -> equal to ....... 2K
MTU greater than ... 4K  (0x05) -> NO PATHS
MTU less than ...... 4K  (0x45) -> equal to ....... 1K
MTU equal to ....... 4K  (0x85) -> NO PATHS
MTU largest possible 4K  (0xc5) -> equal to ....... 2K
============================================================
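
The standard-mode table above follows from the encoding visible in its hex
values: the top two bits select the comparison and the low six bits carry
the MTU code. The following is a minimal sketch that reproduces the
standard-mode rows, assuming the largest MTU the path supports is already
known. The function and constant names are illustrative, this is not
OpenSM's implementation, and the enable_quirks table is not modeled:

```python
# MTU codes (low 6 bits of the field) and their display names.
MTU_NAMES = {1: "256", 2: "512", 3: "1K", 4: "2K", 5: "4K"}

# Selector values carried in the top 2 bits of the field.
GREATER_THAN, LESS_THAN, EQUAL_TO, LARGEST = 0, 1, 2, 3


def resolve_mtu(field, path_mtu):
    """Return the MTU code to report, or None for "NO PATHS".

    field    -- 8-bit value: selector in bits 7-6, MTU code in bits 5-0
    path_mtu -- largest MTU code the path itself supports
    """
    selector, requested = field >> 6, field & 0x3F
    if selector == GREATER_THAN:
        return path_mtu if path_mtu > requested else None
    if selector == LESS_THAN:
        if requested <= min(MTU_NAMES):
            return None                      # nothing is smaller than 256
        return min(path_mtu, requested - 1)
    if selector == EQUAL_TO:
        return requested if requested <= path_mtu else None
    return path_mtu                          # LARGEST ignores the request


def describe(field, path_mtu):
    """Format the result in the style of the table's right-hand column."""
    mtu = resolve_mtu(field, path_mtu)
    return "NO PATHS" if mtu is None else "equal to " + MTU_NAMES[mtu]
```

For a path limited to 2K (code 4), describe(0x45, 4) yields "equal to 2K"
and describe(0x85, 4) yields "NO PATHS", matching the corresponding rows.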

* osm_pkey_rcv.c: rwlock double release fix
When a port is removed from the subnet but a previously requested pkey
table block is received afterwards, the lock is released twice.
This leads to deadlocks later, when another MAD processor tries to
acquire the same lock.

* osm_sa_informinfo.c: Fix InformInfoRecord searches

* Better SA MCMemberRecord leave locking
Hold the lock during multicast group leave request (MCMemberRecord)
processing. This prevents a race with multicast group join requests, in
which those requests can be reordered during processing.

* osm_sa_informinfo.c: Conformance changes for subscribe component

* osm_sa_path_record.c: Handle LID 0 as error

* Fix comparing InformInfo records
1. The received InformInfo struct was modified before dumping it.
2. The function that compares InformInfo structures was just
comparing the whole memory allocated for it, including reserved
fields. Fixed to compare more selectively.

As for QPN, from the IB spec, table 119 InformInfo:
QPN : Ignored except when subscribe=0 (an unsubscribe
request). Queue pair to which Report()s were sent as
a result of a corresponding subscription. If no
subscription for this Report() with this QPN exists,
the request to unsubscribe performs no action and
produces GetResp() with status indicating an invalid
field value.

* osm_trap_rcv.c: Reduce repeated trap messages so log doesn't fill
so quickly

* osm_helper.c: Fix stack smashing detected problem in osm_dump_service_record

* Fix permission on db files directory
When creating the directory for storing db files (guid2lid), create it
with reasonable permissions (currently 777 decimal = octal 01411) and
don't make it world writable.

* Fix node_desc.description as string usages
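
The db files directory fix in the list above turns on the difference
between decimal 777 and octal 0777 when a number is passed as a permission
mode. The following is a small sketch that makes the difference visible;
the helper names are illustrative, and 0o755 is only an assumed example of
a "reasonable" mode, not necessarily the value OpenSM uses:

```python
import stat


def describe_mode(mode):
    """Render a directory permission mode the way `ls -l` shows it."""
    return stat.filemode(stat.S_IFDIR | mode)


def is_world_writable(mode):
    """True if "other" users have write permission."""
    return bool(mode & stat.S_IWOTH)
```

Here oct(777) is "0o1411", so describe_mode(777) gives "dr----x--t", an
unintended mix of bits, while describe_mode(0o755) gives "drwxr-xr-x" and
is_world_writable(0o755) is False.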

5 Main Verification Flows
-------------------------

OpenSM verification is run using the following activities:
* osmtest - a stand-alone program
* ibmgtsim (IB management simulator) based - a set of flows that
simulate clusters, inject errors and verify OpenSM capability to
respond and bring up the network correctly.
* small cluster regression testing - where the SM is used on back-to-back
or single switch configurations. The regression includes
multiple OpenSM dedicated tests.
* cluster testing - when we run OpenSM to set up a large cluster, perform
hand-off, reboots and reconnects, verify routing correctness and SA
responsiveness at the ULP level (IPoIB and SDP).

5.1 osmtest

osmtest is an automated verification tool used for OpenSM
testing. Its verification flows are described in the list below.

* Inventory File: Obtain and verify all port info, node info, link and path
records parameters.

* Service Record:
- Register new service
- Register another service (with a lease period)
- Register another service (with service p_key set to zero)
- Get all services by name
- Delete the first service
- Delete the third service
- Bad flows: get/delete of a non-valid service
- Add / Get same service with different data
- Add / Get / Delete by different component mask values (services
by Name & Key / Name & Data / Name & Id / Id only)

* Multicast Member Record:
- Query of existing Groups (IPoIB)
- BAD Join with insufficient comp mask (o15.0.1.3)
- Create given MGID=0 (o15.0.1.4)
- Create given MGID=0xFF12A01C,FE800000,00000000,12345678 (o15.0.1.4)
- Create BAD MGID=0xFA. (o15.0.1.6)
- Create BAD MGID=0xFF12A01B w/ link-local not set (o15.0.1.6)
- New MGID with invalid join state (o15.0.1.9)
- Retry of existing MGID - See JoinState update (o15.0.1.11)
- BAD RATE when connecting to existing MGID (o15.0.1.13)
- Partial JoinState delete request - removing FullMember (o15.0.1.14)
- Full Delete of a group (o15.0.1.14)
- Verify Delete by trying to Join deleted group (o15.0.1.14)
- BAD Delete of IPoIB membership (no prev join) (o15.0.1.15)

* GUIDInfo Record:
- All GUIDInfoRecords in subnet are obtained

* MultiPathRecord:
- Perform some compliant and noncompliant MultiPathRecord requests
- Validation is via status in responses and IB analyzer

* PKeyTableRecord:
- Perform some compliant and noncompliant PKeyTableRecord queries
- Validation is via status in responses and IB analyzer

* LinearForwardingTableRecord:
- Perform some compliant and noncompliant LinearForwardingTableRecord queries
- Validation is via status in responses and IB analyzer

* Event Forwarding: Register for trap forwarding using reports
- Send a trap and wait for report
- Unregister non-existing

* Trap 64/65 Flow: Register to Trap 64-65, create traps (by
disconnecting/connecting ports) and wait for report, then unregister.

* Stress Test: send PortInfoRecord queries, both single and RMPP, and
check for the rate of responses as well as their validity.

5.2 IB Management Simulator OpenSM Test Flows

The simulator provides the ability to simulate the SM handling of virtual
topologies that are not limited to actual lab equipment availability.
OpenSM was simulated to bring up clusters of up to 10,000 nodes. Daily
regressions use smaller clusters (16 and 128 nodes).

The following test flows are run on the IB management simulator:

* Stability:
Up to 12 links from the fabric are randomly selected to drop packets
at drop rates up to 90%. The SM is required to succeed in bringing the
fabric up. The resulting routing is verified to be correct as well.

* LID Manager:
Using LMC = 2 the fabric is initialized with LIDs. Faults such as
zero LID, duplicated LID, and non-aligned (to LMC) LIDs are
randomly assigned to various nodes and other errors are randomly
output to the guid2lid cache file. The SM sweep is run 5 times and
after each iteration a complete verification is made to ensure that all
LIDs that could possibly be maintained are kept, as well as that all nodes
were assigned a legal LID range.

* Multicast Routing:
Nodes randomly join the 0xc000 group and eventually the
resulting routing is verified for completeness and adherence to
Up/Down routing rules.

* osmtest:
The complete osmtest flow as described in section 5.1 is run on
the simulated fabrics.

* Stress Test:
This flow merges fabric, LID and stability issues with continuous
PathRecord, ServiceRecord and Multicast Join/Leave activity to
stress the SM/SA during continuous sweeps. InformInfo Set/Delete/Get
were added to the test such that both existing and non-existing nodes
perform them in random order.
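
The fault classes injected by the LID Manager flow above (zero LID,
duplicated LID, LIDs not aligned to LMC) can be expressed as a small
checker. The following is a minimal sketch under stated assumptions: each
port owns 2**LMC consecutive LIDs starting at an LMC-aligned base LID. The
function name and the GUID-to-LID dictionary are illustrative and are not
the simulator's actual data structures:

```python
def check_lid_assignments(base_lids, lmc):
    """Classify each port's base LID as "ok" or name the fault found.

    base_lids -- maps a port GUID to its assigned base LID
    lmc       -- LID Mask Control; each port owns 2**lmc consecutive LIDs
    """
    span = 1 << lmc
    owner = {}       # LID -> GUID of the first port that claimed it
    result = {}
    for guid, base in sorted(base_lids.items()):
        if base == 0:
            result[guid] = "zero LID"
        elif base % span != 0:
            result[guid] = "not aligned to LMC"
        elif any(lid in owner for lid in range(base, base + span)):
            result[guid] = "duplicated LID"
        else:
            result[guid] = "ok"
            for lid in range(base, base + span):
                owner[lid] = guid
    return result
```

With LMC = 2 a port with base LID 4 owns LIDs 4 to 7, so a base LID of 6
is reported as not aligned and a second port claiming base LID 4 is
reported as duplicated.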

5.3 OpenSM Regression

Using a back-to-back or single switch connection, the following set of
tests is run nightly on the stacks described in Table 2. The included
tests are:

* Stress Testing: Flood the SA with queries from multiple channel
adapters to check the robustness of the entire stack up to the SA.

* Dynamic Changes: Dynamic topology changes, through randomly
dropping SMP packets, used to test OpenSM adaptation to an unstable
network & verify DB correctness.

* Trap Injection: This flow injects traps to the SM and verifies that it
handles them gracefully.

* SA Query Test: This test exhaustively checks the SA responses to all
possible single component masks. To do that, the test examines the
entire set of records the SA can provide, classifies them by their
field values and then selects every field (using component mask and a
value) and verifies that the response matches the expected set of records.
A random selection using multiple component mask bits is also performed.
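
The single component mask check described for the SA Query Test can be
sketched as follows. This is an illustration only: records are modeled as
plain dictionaries, bit i of the mask selects the i-th name in a
hypothetical field list, and none of the function names come from OpenSM
or osmtest:

```python
def sa_query(records, component_mask, template, fields):
    """Return the records matching `template` on every masked field.

    records        -- list of dicts, one per record
    component_mask -- bit i set means fields[i] must match the template
    template       -- dict holding the values to compare against
    fields         -- ordered field names; position gives the mask bit
    """
    selected = [name for bit, name in enumerate(fields)
                if component_mask & (1 << bit)]
    return [rec for rec in records
            if all(rec[name] == template[name] for name in selected)]


def verify_single_component_masks(records, fields, query=sa_query):
    """Check every single-bit mask against every value seen for its field."""
    for bit, name in enumerate(fields):
        for value in {rec[name] for rec in records}:
            expected = [rec for rec in records if rec[name] == value]
            if query(records, 1 << bit, {name: value}, fields) != expected:
                return False
    return True
```

The query parameter lets the same loop be pointed at a different lookup
function, which is how a real responder would be substituted for the
reference filter in this sketch.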

5.4 Cluster Testing

Cluster testing is usually run before a distribution release. It
involves real hardware setups of 16 to 32 nodes (or more if a beta site
is available). Each test is validated by running all-to-all ping through the IB
interface. The test procedure includes:

* Cluster bringup

* Hand-off between 2 or 3 SMs while performing:
- Node reboots
- Switch power cycles (disconnecting the SMs)

* Unresponsive port detection and recovery

* osmtest from multiple nodes

* Trap injection and recovery

6 Qualification
---------------

Table 2 - Qualified IB Stacks
=============================

Stack                                    | Version
-----------------------------------------|--------------------------
OFED                                     | 1.2
OFED                                     | 1.1
OFED                                     | 1.0
OpenIB Gen2 (IBG2 distribution)          | 1.0
OpenIB Gen1 (IBGD distribution)          | 1.8.0
VAPI (Mellanox InfiniBand HCA Driver)    | 3.2 and later

Table 3 - Qualified Devices and Corresponding Firmware
======================================================

Mellanox
Device  | FW versions
--------|-----------------------------------------------------------
MT43132 | InfiniScale - fw-43132 5.2.0 (and later)
MT47396 | InfiniScale III - fw-47396 0.5.0 (and later)
MT23108 | InfiniHost - fw-23108 3.3.2 (and later)
MT25204 | InfiniHost III Lx - fw-25204 1.0.1i (and later)
MT25208 | InfiniHost III Ex (InfiniHost Mode) - fw-25208 4.6.2 (and later)
MT25208 | InfiniHost III Ex (MemFree Mode) - fw-25218 5.0.1 (and later)

QLogic/PathScale
Device  | Note
--------|-----------------------------------------------------------
iPath   | QHT6040 (PathScale InfiniPath HT-460)
iPath   | QHT6140 (PathScale InfiniPath HT-465)
iPath   | QLE6140 (PathScale InfiniPath PE-880)

Note: OpenSM does not run on an IBM Galaxy (eHCA) as it does not expose
QP0 and QP1. However, it does support it as a device on the subnet.