Blob Blame History Raw
                           OpenSM Release Notes
								  ======================

Release: IBG2
Repo: https://openib.org/svn/trunk/contrib/mellanox/gen2/src/userspace/management/osm
Version: 4956
Date:    Jan 2006

1 Overview
----------
This document describes the contents of the OpenSM IBG2 release.
OpenSM is an InfiniBand compliant Subnet Manager and Administrator,
and runs on top of OpenIB.

This document includes the following sections:
1 This Overview section (describing new features and software
  dependencies)
2 Known Issues And Limitations
3 Unsupported IB compliancy statements
4 Major Bug Fixes
5 Main Verification Flows
6 Qualified software stacks and devices

1.1 New Features

* New libs created during installation: libopensm - contains interface
  to the logging and mads pool machanism. libosmcomp - contains
  interface to the complib utilities. libosmvendor - contains
  interface to sending/receiving MADs through the SMI or GSI over the
  IBG2 driver.

* Change building mechanism to use autotools.

* Change directory stucturing of the OpenSM code according to libs:
  osm/libvendor - for vendor specific files. osm/complib - for complib
  specific files. osm/opensm - for opensm core files. osm/include

* Semi-static LID assignment: OpenSM uses a cache file for storing all
  LID assignments such that, even after a reboot, the LIDs do not
  change. The static LID assignment is built on top of a new
  "persistancy" layer that abstracts that actual database from its
  usage. The implemented database is based on files stored under
  /var/cache/osm (this location can be overriden via the environment
  variable OSM_CACHE_DIR). Other implementations can use LDAP for
  example. Note that a standby SM ignores its previously assigned LIDs
  when it becomes the master, and the previous master LID settings are
  used.

* Irresponsive Port Handling: A port that does not respond to SM
  queries will be queried upon future light or heavy sweeps, and if
  then it responds, it will be setup immediately. Previously such a
  port was queried only upon a heavy sweep.

* Leaf Switch Port HOQ: A different maximal head of queue life time is
  assigned to switch ports connected to HCAs such that a bad chipset
  or defective hardware will not cause back presure on the fabric.

* OSM_TMP_DIR: This is a new environment variable controlling the
  directory where subnet.lst, osm.fdbs and osm.mcfdbs files are
  created. The deafult is still /tmp.

* Configuration Options cache file: OpenSM was enhanced to provide a
  means to modify all its internal configuration options, including
  the ones that oreviously were only available under osmsh. The new
  file is located under the cache directory and is named
  opensm.opts. To automatically create this file OpenSM supports a new
  flag: `-c'. The file is  generated with the current set of options
  used by OpenSM.

* Previously, under extreme load conditions, when OpenSM got
  overloaded with SA queries during which the incoming messages queue
  also grew, delays were incurred in message response-time beyond the
  expected. This new version of OpenSM has been enhanced such that,
  under such a case, incoming new SA queries are returned with a
  RESOURCE_BUSY status (per the InfiniBand Architecture
  Specification).

* Kill -HUP: If the OpenSM process (ps -efww |grep opensm.bin) gets a
  SIGHUP (sent by kill -HUP), it will start a heavy sweep as if a trap
  was received or a change in topology was observed by the SM.

1.2 Software Dependencies

OpenSM depends on the installation of either OpenIB gen2 (e.g. IBG2
distribution), OpenIB gen1 (r.g. IBGD distribution) or Mellanox VAPI
stacks. The qualified driver versions are provided in Table 2,
"Qualified IB Stacks".

1.4 Supported Devices Firmware

The main task of OpenSM is to initialize InfiniBand devices. The
qualified devices and their corresponding firmware versions
listed in Table 3.

2 Known Issues And Limitations
------------------------------

* No Partition/Pkey policy support:
  OpenSM does not provide means to set poartitions.

* IB "trusted" concept is unsupported:
  Queries that should be classified according to the trustworthiness of
  their sources will not be handled correctly.

* No Service / Key associations:
  There is no way to manage Service access by Keys.

* No SM to SM SMDB synchronization:
  Puts the burden of re-registering services, multicast groups, and
  inform-info on the client application (or IB access layer core).

* NPTL problem under Red Hat 9.0, Red Hat AS 3.0:
  There are some bugs (pthread conditional wait missing events)
  with thread handling when using the dynamic Native POSIX Thread
  Library (/lib/tls) of Red Hat 9.0 & Red Hat AS 3.0 OSs. To overcome
  that, OpenSM installation places wrapper scripts named opensm and
  osmtest in the /usr/bin directory, which preload the standard libc
  and libptherad before invoking the executables. If using the osm
  package, a similar workaround is possible by putting the LD_PRELOAD
  setting in .tclshrc file, for example: set env (LD_PRELOAD)
  "/lib/libc.so.6:/lib/libpthread.so.0"

* InformInfo failure over IBMGT:
  OpenSM might not respect a valid InformInfo unsubscribe request when
  running over Mellanox's IBMGT user level MAD interface (not on
  IBGD). This will be fixed in the next release.

* No "port down" event handling:
  Changing the switch port through which OpenSM connects to the IB
  fabric may cause wrong operation. Please restart OpenSM whenever
  such a connectivity change is made.

3 Unsupported IB Compliancy Statements
--------------------------------------
The following section lists all the IB compliancy statements which
OpenSM does not support. Please refer to IB specification for detailed
information on each compliancy statement.

* C14-22 (Authentication):
  M_Key M_KeyProtectBits and M_KeyLeasePeriod shall be set in one
  SubnSet method. As a work-around, an OpenSM option is provided for
  defining the protect bits.

* C14-67 (Authentication):
  On SubnGet(SMInfo) and SubnSet(SMInfo) - if M_Key is not zero then
  the SM shall generate a SubnGetResp if the M_Key is matching or
  silently drop the packet if M_Key is not matching.

* C15-0.1.23.1 (Authentication):
  PortInfoRecords shall always be provided with the M_Key component
  set to 0, except in the case of a trusted request, in which case the
  actual M_Key component contents shall be provided.

* C15-0.1.23.2 (Authentication):
  P_KeyTableRecords and ServiceAssociationRecords shall only be
  provided in responses to trusted requests.

* C15-0.1.23.4 (Authentication):
  InformInfoRecords shall always be provided with the QPN set to
  0, except for the case of a trusted request, in which case the actual
  subscriber QPN shall be returned.

* o13-17.1.2 (Event-FWD):
  If no permission to forward, the subscription should be removed and
  no further forwarding should occur.

* C14-37.1.2 (Handover):
  Priority should be kept in non-volatile memory.

* C14-38.1.1 (Handover):
  Support AttributeModifier values in SubnSet(SMInfo). If the state
  transition requested is invalid - return with status code 7.

* C14-24.1.1.5 and C14-62.1.1.22 (Initialization):
  GUIDInfo - SM should enable assigning Port GUIDInfo.

* C14-44 (Initialization):
  If the SM discovers that it is missing an M_Key to update CA/RT/SW,
  it should notify the higher level.

* C14-62.1.1.11 (Initialization):
  PortInfo:VLHighLimit should match the configured VLArb on the port.

* C14-62.1.1.12 (Initialization):
  PortInfo:M_Key - Set the M_Key to a node based random value.

* C14-62.1.1.13 (Initialization):
  PortInfo:P_KeyProtectBits - set according to an optional policy.

* C14-62.1.1.24 (Initialization):
  SwitchInfo:DefaultPort - should be configured for random FDB.

* C14-62.1.1.32 (Initialization):
  RandomForwardingTable should be configured.

* o15-0.1.12 (Multicast):
  If the JoinState is SendOnlyNonMember = 1 (only), then the endport
  should join as sender only.

* o15-0.1.13 (Multicast):
  If a Join request using unrealistic parameters is received, return
  ERR_REQ_INVALID.

* o15-0.1.8 (Multicast):
  If a request for creating an MCG with fields that cannot be met,
  return ERR_REQ_INVALID (currently ignoring SL and FlowLabelTclass).

* C15-0.1.11 (SA-Query):
  Query response should use only base LIDs (as the feature has not
  been qualified yet).

* C15-0.1.19 (SA-Query):
  Respond to SubnGetMulti(MultiPathRec)

* C15-0.1.8.6 (SA-Query):
  Respond to SubnAdmGetTraceTable - this is an optional attribute.

* C15-0.1.8.7 (SA-Query):
  SubnAdmGetMulti SubnAdmGetMultiResp - Only in case of a MultiPath.

* C15-X.Y.Z.W (SA-Query):
  SubAdmGet/GetTable GUIDInfo - support GUIDInfo setting/retrieval.

* C15-0.1.13 Services:
  Reject ServiceRecord create, modify or delete if the given
  ServiceP_Key does not match the one included in the ServiceGID port
  and the port that sent the request.

* C15-0.1.14 (Services):
  Provide means to associate service name and ServiceKeys.

4 Major Bug Fixes
-----------------

The following list of bugs were fixed. Note that other less critical
or visible bugs were also fixed.

* PortInfo query was not matching on several fields.	These fields
  were added to teh comparison function.

* OpenSM would crash during exit flow if run with "-o" flag	A fix to
  the complib global timer destruction sequence solves this problem.

* OpenSM does not complete the sweep if the driver fails to send a MAD
  Counting the number of outstanding MADs the SM waits for response
  for was enhanced to take this acse into acount

* OpenSM was not compliant to the spec statement: C14.62.1.1 Table 183
  p870 l34: ".., the SM shall ensure that one of the P_KeyTable
  entries in every node contains either the value 0xFFFF (the default
  P_Key, full membership) or the value 0x7FFF (the default P_Key,
  partial membership)."	OpenSM sets the PKey table with an entry of
  0xffff in case there is no such entry or 0x7fff entries on that
  port. Switch ports are ignored.

* If the SA is queried with IB_PIR_COMPMASK_BASELID and base_lid of 0,
  the SA was incorrectly returning all the ports. Fix: do not ignore base
  lid of 0 as a query criteria.

* When provided a PathRecord query with num_paths = 0 the SM should
  assuem num_paths = 1. Fix: in the PathRecord query code.

* PathRecord query returned a deleted multicast groups info. Fix:
  Added a check for multicast group state to avoid such cases.

* LinkRecord query provided wrong results. Fix: in query code.

* PathRecord did not honor PacketLifeTime component. Fix: Added the
  check for packet lifetime matching.

* Multicast and other registration hapenning all the time on the
  cluster. Fix: OpenSM was sending false "client-re-registration"
  messages (in PortInfo).

* On some heavy load cases OpenSM would consume 100% CPU time. Fix: an
  endless loop in timer implementation that would happen under rare
  heavy CPU load cases.

* OpenSM hangs during LID assignment phase. Fix: Some condition that
  cause that was fixed.

* OpenSM core dump in the middle of sweep. Fix: A memory range
  overflow write was found by valgrind and fix.

* OpenSM core dump as result fo PathRecord query with no results. Fix:
  A memory free on non allocated memory was fixed.

* OpenSM sweep algorithm confused by a timing race. Fix: A significant
  race conditionin the SM sweep algorithm was found and fixed.

* OpenSM deadlock due to out of order SMINfo and NodeInfo MAD
  received. Fix: A fix in lock ordering resolves this issue.

* TrapRepress sent even if not a master. Fix: in trap receiver.

5 Main Verification Flows
-------------------------

OpenSM verification is run using the following activities:
* osmtest - a standalone program
* ibmgtsim (IB management simulator) based - a set of flows that
  simulate clusters, inject errors and verify OpenSM capability to
  respond and bring up the network correctly.
* small cluster regression testing - where the SM is used on back to
  back or single switch configuration. The regression includes
  multiple OpenSM dedicated tests
* cluster testing - when we run OpenSM to setup large cluster, perform
  handoff, reboots and reconnects, verify routing correctness and SA
  responsiveness at teh ULP level (IPoIB and SDP)

5.1 osmtest

OsmTest is the main automated verification tool used for OpenSM
testing. Its verification flows are described by list below.

* Inventory File: Obtain and verify all port info, node info, and path
  records parameters.

* Service Record:
   - Register new service
	- Register another service (with a lease period)
	- Register another service (with service p_key set to zero)
	- Get all services by name
	- Delete the first service
	- Delete the third service.
	- Added bad flows of get/delete  non valid service
	- Add / Get same service with different data
	- Add / Get / Delete by different component  mask values (services
	  by Name & Key / Name & Data / Name & Id / Id only )

* Multicast Member Record:
	- Query of existing Groups (IPoIB)
	- BAD Join with insufficient comp mask (o15.0.1.3)
	- Create given MGID=0 (o15.0.1.4)
	- Create given MGID=0xFF12A01C,FE800000,00000000,12345678 (o15.0.1.4)
	- Create BAD MGID=0xFA. (o15.0.1.6)
	- Create BAD MGID=0xFF12A01B w/ link-local not set (o15.0.1.6)
	- New MGID with invalid join state (o15.0.1.9)
	- Retry of existing MGID - See JoinState update (o15.0.1.11)
	- BAD RATE when connecting to existing MGID (o15.0.1.13)
	- Partial JoinState delete request - removing FullMember (o15.0.1.14)
	- Full Delete of a group (o15.0.1.14)
	- Verify Delete by trying to Join deleted group (o15.0.1.14)
	- BAD Delete of IPoIB membership (no prev join) (o15.0.1.15)

* Event Forwarding: Register for trap forwarding using reports
	- Send a trap and wait for report
	- Unregister non-existing

* Trap 64/65 Flow: Register to Trap 64-65, create traps (by
  disconnect/connect ports) and wait for report, then unregister.

* Stress Test: send PortInfoRecord queries both single and RMPP and
  check for the rate of responses as well as their validity.

5.2 IB Management Simulator OpenSM Test Flows:

The simulator provides ability to simulate the SM handling of virtual
topologies that are not limitted to actual lab equipment availability.
OpenSM was simulated to bring up clusters of up to 10,000 nodes. Daily
regressions use smaller (16 and 128 nodes clusters).

The following test flows are running on the IB management simulator:

* Stability:
  Up to 12 links from the fabric are randomly selected to drop packets
  at drop rates up to 90%. The SM is required to succeed bringing the
  fabric up. The reulting routing is verified to be correct too.

* LID Manager:
  Using LMC = 2 the fabric is being initialized with LIDs. Faults like
  zero LID, Duplicated LID, non-aligned (to LMC) LIDs are being
  randomly assigned to various nodes and other errors are randomly
  output to the guid2lid cache file. The SM sweep is run 5 times and
  after each iteration a complete verification is made to ensure all
  LIDs that could possibly be maintained are kept, as well as all nodes
  were assigned a legal LID range.

* Multicast Routing:
  Nodes are randomly joining the 0xc000 group and eventually the
  resulting routing is verified for completness and adherance to
  Up/Down routing rules.

* OsmTest:
  The complete osmtest flow as desribed in previous table is run on
  the simulated fabrics.

5.3 OpenSM Regression

Using a back to back or single switch connection the following set of
tests are run nightly on the stacks described in table 2. The included
tests are:

* Stress Testing: Flood the SA with queries from multiple channel
  adapters to check the robustness of the entire stack up to the SA.

* Dynamic Changes: Dynamic Topology changes, through randomlly
  droping SMP packets used to test OpenSM adaptation to unstable
  network & verify DB correctness.

* Trap Injection: This flow injects traps to the SM and verify it does
  handle them gracefully.

* SA Query Test: This test exhoustivly checks the SA responses to all
  possible single component mask. To do that the test examine the
  entire set of records the SA can provide, classify them by their
  field values and then select every field (using component mask and a
  value) and verify the response matches the expected set of records.
  A random selection using multiple component mask bits is also performed.

5.4 Cluster testing:

Cluster testing is usually run before a distribution release. It
involves real hardware setup of 16 to 32 nodes (or more if beta site
is available). Each test is validated by running all-to-all ping through IB
interface. The test procedure includes:

* Cluster bringup

* Handoff between 2 or 3 SM's while performing
  - Node reboots
  - Switch power cycles (disconneting the SM's)

* Irresponsive port detection and recovery

* osmtest from multiple nodes

* Trap injection and recovery


6 Qualification
----------------

Table 2 - Qualified IB Stacks
=============================

Stack					  							 | Version
----------------------------------------|--------------------------
VAPI (Mellanox Infininband HCA Driver)	 |	3.2 and later
OpenIB Gen1 (IBGD distribution)			 | 1.8.0
OpenIB Gen2 (IBG2 distribution)			 | 1.0

Table 3 - Qualified Devices and Corresponding Firmware
======================================================

Device  |	FW versions
--------|-----------------------------------------------------------
MT43132 |	InfiniScale - fw-43132	5.2.0 (and later)
MT47396 |	InfiniScale III - fw-47396	0.5.0 (and later)
MT23108 |	InfiniHost - fw-23108	3.3.2
MT25204 |	InfiniHost III Lx - fw-25204	1.0.1
MT25208 |	InfiniHost III Ex (InfiniHost Mode) - fw-25208	4.6.2 (and later)
MT25208 |	InfiniHost III Ex (MemFree Mode) - fw-25218	5.0.1 (and later)

Other vendors HCAs not yet verified but eHCA is known to be discovered and configured
correctly.