OpenSM Release Notes 2.0.5
============================
Version: OpenFabrics Enterprise Distribution (OFED) 1.1
Repo: https://openib.org/svn/gen2/branches/1.1/src/userspace/management/osm
Version: 9535 (openib-2.0.5)
Date: October 2006
1 Overview
----------
This document describes the contents of the OpenSM OFED 1.1 release.
OpenSM is an InfiniBand compliant Subnet Manager and Administration,
and runs on top of OpenIB. The OpenSM version for this release
is openib-2.0.5
This document includes the following sections:
1 This Overview section (describing new features and software
dependencies)
2 Known Issues And Limitations
3 Unsupported IB compliance statements
4 Major Bug Fixes
5 Main Verification Flows
6 Qualified software stacks and devices
1.1 Major New Features
* Partition manager:
The partition manager provides a means to setup multiple partitions
by providing a partition policy file. For details please read the
doc/partition-config.txt or the opensm man page.
* Basic QoS Manager:
Provides a uniform configuration of the entire fabric with values defined
in the OpenSM options file. The options support different settings for
CAs, Switches, and Routers. Note that this is disabled by default and
using -Q enables QoS fabric setup.
* Loading pre-routes from a file:
A new routing module enables loading pre-routes from a file.
To use this option you should use the command line options:
"-R file --U <your routing file>" or
"--routing_engine file --ucast_file <your routing file>"
For more information refer to the file doc/modular-routing.txt
or the opensm man page.
* SA MultiPathRecord support:
The SA can now handle requests for multiple PathRecords in one query.
This includes methods SA GetMulti/GetMultiResp and dual sided RMPP.
* PPC64 is now QAed and supported
* Support LMC > 0 for Switch Enhanced Port 0:
Allows enhanced switch port 0 (ESP0) to have a non zero
LMC. Use the configured subnet wide LMC for this. Modifications were
necessary to the LID assignment and routing to support this.
Also, added an option to the configuration to use LMC configured for
subnet for enhanced switch port 0 or set it to 0 even if a non zero
LMC is configured for the subnet. The default is currently the
latter option. The new configuration option is: lmc_esp0
1.2 Minor New Features:
* IPoIB broadcast group configuration:
It is now possible to control the IPoIB broadcast group parameters
(MTU, rate, SL) through the partitions configuration file.
* Limiting OpenSM log file size:
By providing the command line option: "-L <size in MB>" or
"--log_limit <size in MB>" the user can limit the generated log
file size. When specified, the log file will be truncated upon reaching
this limit.
* Favor 1K MTU for Tavor (MT23108) HCA
In cases where a PathRecord or MultiPathRecord is queried and the
requestor does not specify the MTU or does specify it in a way
that allows for MTU to be 1K and one of the path ends in a Tavor,
limit the MTU to 1K max.
* Man pages:
Added opensm.8 and osmtest.8
* Leaf VL stall count control:
A new parameter (leaf_vl_stall_count) for controlling the number of
sequential packets dropped on a switch port driving a HCA/TCA/Router
that cause the port to enter the VLStalled state was added to the
options file.
* SM Polling/Handover defaults changed
The default SMInfo polling retries was decreased from 18 to 4
which reduces the default handover time from 3 min to 40 seconds.
1.3 Library API Changes
* cl_mem* APIs deprecated in complib:
These functions are now considered as deprecated and should be
replaced by direct calls to malloc, free, memset, etc.
* osm_log_init_v2 API added in libopensm:
Supports providing the new option for log file truncation.
1.4 Software Dependencies
OpenSM depends on the installation of either OFED 1.1, OFED 1.0,
OpenIB gen2 (e.g. IBG2 distribution), OpenIB gen1 (e.g. IBGD
distribution), or Mellanox VAPI stacks. The qualified driver versions
are provided in Table 2, "Qualified IB Stacks".
1.5 Supported Devices Firmware
The main task of OpenSM is to initialize InfiniBand devices. The
qualified devices and their corresponding firmware versions
are listed in Table 3.
2 Known Issues And Limitations
------------------------------
* No Service / Key associations:
There is no way to manage Service access by Keys.
* No SM to SM SMDB synchronization:
Puts the burden of re-registering services, multicast groups, and
inform-info on the client application (or IB access layer core).
* No "port down" event handling:
Changing the switch port through which OpenSM connects to the IB
fabric may cause incorrect operation. Please restart OpenSM whenever
such a connectivity change is made.
* Changing connections during SM operation:
Under some conditions the SM can get confused by a change in
cabling (moving a cable from one switch port to the other) and
momentarily see this as having the same GUID appear connected
to two different IB ports. Under some conditions, when the SM fails to
get the corresponding change event it might mistakenly report this case
as a "duplicated GUID" case and abort. It is advisable to double-check
the syslog after each such change in connectivity and restart
OpenSM if it has exited.
3 Unsupported IB Compliance Statements
--------------------------------------
The following section lists all the IB compliance statements which
OpenSM does not support. Please refer to the IB specification for detailed
information regarding each compliance statement.
* C14-22 (Authentication):
M_Key M_KeyProtectBits and M_KeyLeasePeriod shall be set in one
SubnSet method. As a work-around, an OpenSM option is provided for
defining the protect bits.
* C14-67 (Authentication):
On SubnGet(SMInfo) and SubnSet(SMInfo) - if M_Key is not zero then
the SM shall generate a SubnGetResp if the M_Key matches, or
silently drop the packet if M_Key does not match.
* C15-0.1.23.4 (Authentication):
InformInfoRecords shall always be provided with the QPN set to 0,
except for the case of a trusted request, in which case the actual
subscriber QPN shall be returned.
* o13-17.1.2 (Event-FWD):
If no permission to forward, the subscription should be removed and
no further forwarding should occur.
* C14-24.1.1.5 and C14-62.1.1.22 (Initialization):
GUIDInfo - SM should enable assigning Port GUIDInfo.
* C14-44 (Initialization):
If the SM discovers that it is missing an M_Key to update CA/RT/SW,
it should notify the higher level.
* C14-62.1.1.12 (Initialization):
PortInfo:M_Key - Set the M_Key to a node based random value.
* C14-62.1.1.13 (Initialization):
PortInfo:P_KeyProtectBits - set according to an optional policy.
* C14-62.1.1.24 (Initialization):
SwitchInfo:DefaultPort - should be configured for random FDB.
* C14-62.1.1.32 (Initialization):
RandomForwardingTable should be configured.
* o15-0.1.12 (Multicast):
If the JoinState is SendOnlyNonMember = 1 (only), then the endport
should join as sender only.
* o15-0.1.8 (Multicast):
If a request for creating an MCG with fields that cannot be met,
return ERR_REQ_INVALID (currently ignores SL and FlowLabelTClass).
* C15-0.1.8.6 (SA-Query):
Respond to SubnAdmGetTraceTable - this is an optional attribute.
* C15-0.1.13 Services:
Reject ServiceRecord create, modify or delete if the given
ServiceP_Key does not match the one included in the ServiceGID port
and the port that sent the request.
* C15-0.1.14 (Services):
Provide means to associate service name and ServiceKeys.
4 Major Bug Fixes
-----------------
The following is a list of bugs that were fixed. Note that other less critical
or visible bugs were also fixed.
* "Broken" fabric (duplicated port GUIDs) handling improved
Replace assert with a real check to handle invalid physical port
in osm_node_info_rcv.c which could occur on a broken fabric
* SA client synchronous request failed but status returned was IB_SUCCESS
even if there was no response.
There was a missing setting of the status in the synchronous case.
* Memory leak fixes:
1. In libvendor/osm_vendor_ibumad.c:osm_vendor_get_all_port_attr
2. In libvendor/osm_vendor_ibumad_sa.c:__osmv_sa_mad_rcv_cb
3. On receiving SMInfo SA request from a node that does not share a
partition, the response mad was allocated but never free'd
as it was never sent.
* Set(InformInfo) OpenSM Deadlock:
When receiving a request with unknown LID
* PathRecord to inconsistent multicast destination:
Fix the return error when multicast destination is not consistently
indicated.
* Remove double calculation of reversible path
In osm_sa_path_record.c:__osm_pr_rcv_get_lid_pair_path a PathRecord
query used to double check if the path is reversible
* Some PathRecord log messages use "net order":
Fix GUID net to host conversion in some osm_log messages
* DR/LID routed SMPs direction bit handling:
osm_resp.c:osm_resp_make_resp_smp, set direction bit only if direct
routed class. This bug caused two issues:
1. Get/Set responses always had direction bit set.
2. Trap represses never had direction bit set.
The direction bit needs setting in direct routed responses and it
doesn't exist in LID routed responses.
osm_sm_mad_ctrl.c: did not detect the "direction bit" correctly.
* OpenSM crash due to transaction lookup (interop with Cisco stack)
When a wire TID that maps to internal TID of zero (after applying
mask) was received the lookup of the transaction was successful.
The stale transaction pointed to "free'd" memory.
* Better handling for Path/MultiPath requests for raw traffic
* Wrong ProducerType provided in Notice Reports:
When formating an SM generated report, the ProducerType was using
CL_NTOH32 which can not be used to format a 24bit network order number.
* OpenSM break on PPC64
complib: Fixed memory corruption in cl_pool.c:cl_qcpool_init. This
affected big endian 64-bit architectures only.
* Illegal Set(InformInfo) was wrongly successful in updating the SMDB
osm_sa_informinfo.c: In osm_infr_rcv_process_set_method, if sending
error, don't call osm_infr_rcv_process_set_method
* RMPP queries of InformInfoRecord fail
ib_types.h: Pad ib_inform_info_record_t to be modulo 8 in size so
that attribute offset is calculated properly
* Returning "invalid request" rather than "unsupported method/attribute"
In these cases, a noncompliant response was being provided.
* Noncompliant response for SubnAdmGet(PortInfoRecord) with no match
osm_pir_rcv_process, now returns "SA no records error" for SubnAdmGet
with 0 records found
* Noncompliant non base LID returned by some queries:
The following attributes used to return the request LID rather than
its base LID in responses: PKeyTableRecord, GUIDInfoRecord,
SLtoVLMappingTableRecord, VLArbitrationTableRecord, LinkRecord
* Noncompliant SubnAdmGet and SubnAdmGetTable:
Mixing of error codes in case of no records or multiple records
fixed for the attributes:
LinearForwardingTableRecord, GUIDInfoRecord,
VLArbitrationTableRecord, LinkRecord, PathRecord
* segfault in InformInfo flows
Under stress concurrent Set/Delete/Get flows. Fixed by adding
missing lock.
* SA queries containing LID out if range did not return ERR_REQ_INVALID
5 Main Verification Flows
-------------------------
OpenSM verification is run using the following activities:
* osmtest - a stand-alone program
* ibmgtsim (IB management simulator) based - a set of flows that
simulate clusters, inject errors and verify OpenSM capability to
respond and bring up the network correctly.
* small cluster regression testing - where the SM is used on back to
back or single switch configurations. The regression includes
multiple OpenSM dedicated tests.
* cluster testing - when we run OpenSM to setup a large cluster, perform
hand-off, reboots and reconnects, verify routing correctness and SA
responsiveness at the ULP level (IPoIB and SDP).
5.1 osmtest
osmtest is an automated verification tool used for OpenSM
testing. Its verification flows are described by list below.
* Inventory File: Obtain and verify all port info, node info, link and path
records parameters.
* Service Record:
- Register new service
- Register another service (with a lease period)
- Register another service (with service p_key set to zero)
- Get all services by name
- Delete the first service
- Delete the third service
- Added bad flows of get/delete non valid service
- Add / Get same service with different data
- Add / Get / Delete by different component mask values (services
by Name & Key / Name & Data / Name & Id / Id only )
* Multicast Member Record:
- Query of existing Groups (IPoIB)
- BAD Join with insufficient comp mask (o15.0.1.3)
- Create given MGID=0 (o15.0.1.4)
- Create given MGID=0xFF12A01C,FE800000,00000000,12345678 (o15.0.1.4)
- Create BAD MGID=0xFA. (o15.0.1.6)
- Create BAD MGID=0xFF12A01B w/ link-local not set (o15.0.1.6)
- New MGID with invalid join state (o15.0.1.9)
- Retry of existing MGID - See JoinState update (o15.0.1.11)
- BAD RATE when connecting to existing MGID (o15.0.1.13)
- Partial JoinState delete request - removing FullMember (o15.0.1.14)
- Full Delete of a group (o15.0.1.14)
- Verify Delete by trying to Join deleted group (o15.0.1.14)
- BAD Delete of IPoIB membership (no prev join) (o15.0.1.15)
* GUIDInfo Record:
- All GUIDInfoRecords in subnet are obtained
* MultiPathRecord:
- Perform some compliant and noncompliant MultiPathRecord requests
- Validation is via status in responses and IB analyzer
* PKeyTableRecord:
- Perform some compliant and noncompliant PKeyTableRecord queries
- Validation is via status in responses and IB analyzer
* LinearForwardingTableRecord:
- Perform some compliant and noncompliant LinearForwardingTableRecord queries
- Validation is via status in responses and IB analyzer
* Event Forwarding: Register for trap forwarding using reports
- Send a trap and wait for report
- Unregister non-existing
* Trap 64/65 Flow: Register to Trap 64-65, create traps (by
disconnecting/connecting ports) and wait for report, then unregister.
* Stress Test: send PortInfoRecord queries, both single and RMPP and
check for the rate of responses as well as their validity.
5.2 IB Management Simulator OpenSM Test Flows:
The simulator provides ability to simulate the SM handling of virtual
topologies that are not limited to actual lab equipment availability.
OpenSM was simulated to bring up clusters of up to 10,000 nodes. Daily
regressions use smaller (16 and 128 nodes clusters).
The following test flows are run on the IB management simulator:
* Stability:
Up to 12 links from the fabric are randomly selected to drop packets
at drop rates up to 90%. The SM is required to succeed in bringing the
fabric up. The resulting routing is verified to be correct as well.
* LID Manager:
Using LMC = 2 the fabric is initialized with LIDs. Faults such as
zero LID, Duplicated LID, non-aligned (to LMC) LIDs are
randomly assigned to various nodes and other errors are randomly
output to the guid2lid cache file. The SM sweep is run 5 times and
after each iteration a complete verification is made to ensure that all
LIDs that could possibly be maintained are kept, as well as that all nodes
were assigned a legal LID range.
* Multicast Routing:
Nodes randomly join the 0xc000 group and eventually the
resulting routing is verified for completeness and adherence to
Up/Down routing rules.
* osmtest:
The complete osmtest flow as described in the previous table is run on
the simulated fabrics.
* Stress Test:
This flow merges fabric, LID and stability issues with continuous
PathRecord, ServiceRecord and Multicast Join/Leave activity to
stress the SM/SA during continuous sweeps. InformInfo Set/Delete/Get
were added to the test such both existing and non existing nodes
perform them in random order.
5.3 OpenSM Regression
Using a back-to-back or single switch connection, the following set of
tests is run nightly on the stacks described in table 2. The included
tests are:
* Stress Testing: Flood the SA with queries from multiple channel
adapters to check the robustness of the entire stack up to the SA.
* Dynamic Changes: Dynamic Topology changes, through randomly
dropping SMP packets, used to test OpenSM adaptation to an unstable
network & verify DB correctness.
* Trap Injection: This flow injects traps to the SM and verifies that it
handles them gracefully.
* SA Query Test: This test exhaustively checks the SA responses to all
possible single component mask. To do that the test examines the
entire set of records the SA can provide, classifies them by their
field values and then selects every field (using component mask and a
value) and verifies that the response matches the expected set of records.
A random selection using multiple component mask bits is also performed.
5.4 Cluster testing:
Cluster testing is usually run before a distribution release. It
involves real hardware setups of 16 to 32 nodes (or more if a beta site
is available). Each test is validated by running all-to-all ping through the IB
interface. The test procedure includes:
* Cluster bringup
* Hand-off between 2 or 3 SM's while performing:
- Node reboots
- Switch power cycles (disconnecting the SM's)
* Unresponsive port detection and recovery
* osmtest from multiple nodes
* Trap injection and recovery
6 Qualification
----------------
Table 2 - Qualified IB Stacks
=============================
Stack | Version
-----------------------------------------|--------------------------
OFED | 1.1
OFED | 1.0
OpenIB Gen2 (IBG2 distribution) | 1.0
OpenIB Gen1 (IBGD distribution) | 1.8.0
VAPI (Mellanox InfiniBand HCA Driver) | 3.2 and later
Table 3 - Qualified Devices and Corresponding Firmware
======================================================
Mellanox
Device | FW versions
--------|-----------------------------------------------------------
MT43132 | InfiniScale - fw-43132 5.2.0 (and later)
MT47396 | InfiniScale III - fw-47396 0.5.0 (and later)
MT23108 | InfiniHost - fw-23108 3.3.2 (and later)
MT25204 | InfiniHost III Lx - fw-25204 1.0.1i (and later)
MT25208 | InfiniHost III Ex (InfiniHost Mode) - fw-25208 4.6.2 (and later)
MT25208 | InfiniHost III Ex (MemFree Mode) - fw-25218 5.0.1 (and later)
QLogic/PathScale
Device | Note
--------|-----------------------------------------------------------
iPath | QHT6040 (PathScale InfiniPath HT-460)
iPath | QHT6140 (PathScale InfiniPath HT-465)
iPath | QLE6140 (PathScale InfiniPath PE-880)
Note: OpenSM does not run on an IBM Galaxy (eHCA) as it does not expose
QP0 and QP1. However, it does support it as a device on the subnet.