                        OpenSM Release Notes 3.0.13
                       =============================

Version: OpenFabrics Enterprise Distribution (OFED) 1.2
Repo:    git://git.openfabrics.org/~ofed_1_2/management.git (release)
         git://git.openfabrics.org/~halr/management.git (development)
Date:    June 2007

1 Overview
----------
This document describes the contents of the OpenSM OFED 1.2 release.
OpenSM is an InfiniBand compliant Subnet Manager and Administration,
and runs on top of OpenIB. The OpenSM version for this release
is openib-3.0.13.

This document includes the following sections:
1 This Overview section (describing new features and software
  dependencies)
2 Known Issues And Limitations
3 Unsupported IB Compliance Statements
4 Major Bug Fixes
5 Main Verification Flows
6 Qualification (qualified software stacks and devices)

1.1 Major New Features

* Routing improvements
  Two routing algorithms, FAT tree and LASH, have been added, along
  with performance improvements to the existing routing algorithms.
  See the opensm man page for additional details.

* SA Optional Record support now "virtually" complete
  Includes SA InformInfo improvements and InformInfoRecord support in
  addition to support for the remaining SA optional records
  (MulticastForwardingTableRecord, SwitchInfoRecord). Also, SMInfoRecord
  support was improved to include all SMs found.

* SA database dump/restore
  OpenSM now includes the ability to dump and restore the SA database.
  This allows for all SA registrations (multicast, services, and events)
  to be saved and restored later.

  In verbose mode, OpenSM dumps the SA DB (existing multicast groups,
  services and InformInfo) into a dump file named "opensm-sa.dump",
  located under the standard OpenSM dump directory (/var/log by default).

  If option -S is specified and an SA DB dump file name is provided,
  OpenSM will try to restore the SA database from this file. If the
  restore is successful, OpenSM won't ask for client reregistration at
  subnet bring-up.

* Modular routing for multicast
  In conjunction with SA database dump/restore, there is the ability to
  dump and load switch lid matrices (min hops tables) which are used
  for multicast route calculation.

* IB router enablement
  OpenSM now supports router ports properly (in terms of PortInfo handling).
  There is also some experimental support for IB routers which is enabled
  via the ROUTER_EXP compile flag. This support includes SA PathRecord and
  MCMemberRecord support for off subnet GIDs.

* Socket support added to console
  OpenSM console now supports remote in addition to local access.
  Remote access is currently via telnet.

1.2 Minor New Features:

* Change output format of DR path from hex to decimal port numbers

* Log rotation
  The OpenSM log can now be rotated while OpenSM is running (without
  stopping and restarting OpenSM). This is accomplished via SIGUSR1.

* Support scope for IPoIB multicast groups in partition config

* Dump filename changed from subnet.lst to osm-subnet.lst
  Default temp directory for non Windows platforms was previously changed
  from /tmp to /var/log.

* Add option to force SDR link speed
  Add option to opensm.opts to force link speed. Currently, only forcing
  to SDR link speed is supported. This option is not supported as a
  command line option.
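
For the log rotation entry above, a minimal sketch of what an external
rotation step might look like on a POSIX system. The only fact taken from
these notes is that SIGUSR1 triggers rotation in a running OpenSM. The
function name rotate_log, the ".1" suffix, and the way the log path and
process ID are obtained are illustrative assumptions, not part of OpenSM.

```python
import os
import signal


def rotate_log(log_path, pid, send_signal=os.kill):
    """Rename the current log file, then signal the running daemon.

    log_path and pid are supplied by the caller; how they are found
    (config file, pid file, ps) is outside this sketch.
    send_signal defaults to os.kill and can be replaced for dry runs.
    """
    rotated_path = log_path + ".1"  # hypothetical naming scheme
    os.rename(log_path, rotated_path)
    send_signal(pid, signal.SIGUSR1)  # ask the running process to rotate
    return rotated_path
```

Renaming before signalling keeps the old file intact while the daemon is
asked to switch files; tools such as logrotate follow a similar order.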

1.3 Library API Changes

  None

1.4 Software Dependencies

OpenSM depends on the installation of either OFED 1.2, OFED 1.1,
OFED 1.0, OpenIB gen2 (e.g. IBG2 distribution), OpenIB gen1 (e.g. IBGD
distribution), or Mellanox VAPI stacks. The qualified driver versions
are provided in Table 2, "Qualified IB Stacks".

1.5 Supported Devices Firmware

The main task of OpenSM is to initialize InfiniBand devices. The
qualified devices and their corresponding firmware versions
are listed in Table 3.

2 Known Issues And Limitations
------------------------------

* No Service / Key associations:
  There is no way to manage Service access by Keys.

* No SM to SM SMDB synchronization:
  Puts the burden of re-registering services, multicast groups, and
  inform-info on the client application (or IB access layer core).

* No "port down" event handling:
  Changing the switch port through which OpenSM connects to the IB
  fabric may cause incorrect operation. Please restart OpenSM whenever
  such a connectivity change is made.

* Changing connections during SM operation:
  Under some conditions the SM can get confused by a change in
  cabling (moving a cable from one switch port to the other) and
  momentarily see this as having the same GUID appear connected
  to two different IB ports. Under some conditions, when the SM fails to
  get the corresponding change event it might mistakenly report this case
  as a "duplicated GUID" case and abort. It is advisable to double-check
  the syslog after each such change in connectivity and restart
  OpenSM if it has exited. The same error ("duplicated GUID") will
  also appear with a loopback plug.

3 Unsupported IB Compliance Statements
--------------------------------------
The following section lists all the IB compliance statements which
OpenSM does not support. Please refer to the IB specification for detailed
information regarding each compliance statement.

* C14-22 (Authentication):
  M_Key, M_KeyProtectBits and M_KeyLeasePeriod shall be set in one
  SubnSet method. As a work-around, an OpenSM option is provided for
  defining the protect bits.

* C14-67 (Authentication):
  On SubnGet(SMInfo) and SubnSet(SMInfo) - if M_Key is not zero then
  the SM shall generate a SubnGetResp if the M_Key matches, or
  silently drop the packet if M_Key does not match.

* C15-0.1.23.4 (Authentication):
  InformInfoRecords shall always be provided with the QPN set to 0,
  except for the case of a trusted request, in which case the actual
  subscriber QPN shall be returned.

* o13-17.1.2 (Event-FWD):
  If no permission to forward, the subscription should be removed and
  no further forwarding should occur.

* C14-24.1.1.5 and C14-62.1.1.22 (Initialization):
  GUIDInfo - SM should enable assigning Port GUIDInfo.

* C14-44 (Initialization):
  If the SM discovers that it is missing an M_Key to update CA/RT/SW,
  it should notify the higher level.

* C14-62.1.1.12 (Initialization):
  PortInfo:M_Key - Set the M_Key to a node based random value.

* C14-62.1.1.13 (Initialization):
  PortInfo:P_KeyProtectBits - set according to an optional policy.

* C14-62.1.1.24 (Initialization):
  SwitchInfo:DefaultPort - should be configured for random FDB.

* C14-62.1.1.32 (Initialization):
  RandomForwardingTable should be configured.

* o15-0.1.12 (Multicast):
  If the JoinState is SendOnlyNonMember = 1 (only), then the endport
  should join as sender only.

* o15-0.1.8 (Multicast):
  If a request to create an MCG has fields that cannot be met,
  return ERR_REQ_INVALID (currently ignores SL and FlowLabelTClass).

* C15-0.1.8.6 (SA-Query):
  Respond to SubnAdmGetTraceTable - this is an optional attribute.

* C15-0.1.13 (Services):
  Reject ServiceRecord create, modify or delete if the given
  ServiceP_Key does not match the one included in the ServiceGID port
  and the port that sent the request.

* C15-0.1.14 (Services):
  Provide means to associate service name and ServiceKeys.

4 Major Bug Fixes
-----------------

The following is a list of bugs that were fixed. Note that other less critical
or visible bugs were also fixed.

* osm_sminfo_rcv.c: Add SMInfo self query check. OpenSM can query
  itself for SMInfo occasionally due to port moving during subnet
  discovery process. Don't create remote SM entry in this case to
  prevent deadlocks.

* osm_ucast_updn.c: Two similar bugs in up/down routing fixed.
  8-bit integers were used as indexes when scanning the subnet, which
  in one case caused OpenSM to crash when a ranking "path" is longer
  than 256 switches, and in the other case caused OpenSM to go into
  an infinite loop when the fabric has more than 256 roots.
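
The failure mode behind the up/down fix can be illustrated with a small
sketch. This is not the OpenSM code; it only emulates, in Python, a C loop
of the form "for (uint8_t i = 0; i < limit; i++)" by masking the counter to
8 bits. The function name and the max_steps guard are assumptions made for
the illustration.

```python
def count_with_uint8_index(limit, max_steps=1000):
    """Count the steps of a loop whose index is an 8-bit unsigned integer.

    Returns the number of steps taken, or None if the loop had not
    finished after max_steps (in C it would spin forever).
    """
    i = 0
    steps = 0
    while i < limit:
        if steps >= max_steps:
            return None
        i = (i + 1) & 0xFF  # emulate 8-bit unsigned wraparound
        steps += 1
    return steps
```

With a limit of 200 the loop ends after 200 steps; with a limit of 300 the
index wraps from 255 back to 0 and never reaches the limit, which matches
the infinite loop described above. Widening the index type removes the wrap.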

* osm_sm_state_mgr.c: In __osm_sm_state_mgr_send_master_sm_info_req,
  handle master GUID port not found properly

* osm_sa_multipath_record.c: In __osm_mpr_rcv_get_path_parms, return
  IB_NOT_FOUND rather than IB_ERROR when unable to route to LID from switch

* osm_sa_path_record.c: In __osm_pr_rcv_get_path_parms, return IB_NOT_FOUND
  rather than IB_ERROR when unable to route to LID from switch

* osm_vendor_ibumad.c: In osm_vendor_set_sm, set issmfd to
  -1 on open error

* osm_vendor_ibumad: Termination crash fix
  When OpenSM is terminated, the umad_receiver thread keeps running even
  after the structures are destroyed and freed, which causes random (but
  easily reproducible) crashes. The reason is that osm_vendor_delete() does
  not take care of thread termination. This patch adds receiver thread
  cancellation (using pthread_cancel() and pthread_join()) and takes care
  to leave all mutexes unlocked upon termination. There is also a minor
  termination code consolidation: the osm_vendor_port_close() function.

* osm_port_profile.h: Fix reinsertion issue in osm_port_prof_set_ignored_port

* osm_matrix.h: Fix segfault with up/down and root nodes file

* osm_sa_path_record.c: In osm_pr_rcv_process, fix endian of hop_limit

* osm_vendor_ibumad.c: Close umad port in osm_vendor_delete

* osm_sa_(multipath path)_record.c: Fix MultiPathRecord/PathRecord issues
  with using MTU/rate/PktLife explicitly ignoring selectors

  Previously, OpenSM just used the resulting path MTU/rate/pkt-life and
  failed the query even though the selector might have allowed an
  appropriate value to be selected.

  After this fix, the following results are obtained for the case of a
  path allowing a maximum MTU of 2K.

In standard mode:
------------------------------------------------------------
MTU greater than ... 256     (0x01) ->  equal to ....... 2K
MTU less than ...... 256     (0x41) ->  NO PATHS
MTU equal to ....... 256     (0x81) ->  equal to ....... 256
MTU largest possible 256     (0xc1) ->  equal to ....... 2K
MTU greater than ... 512     (0x02) ->  equal to ....... 2K
MTU less than ...... 512     (0x42) ->  equal to ....... 256
MTU equal to ....... 512     (0x82) ->  equal to ....... 512
MTU largest possible 512     (0xc2) ->  equal to ....... 2K
MTU greater than ... 1K      (0x03) ->  equal to ....... 2K
MTU less than ...... 1K      (0x43) ->  equal to ....... 512
MTU equal to ....... 1K      (0x83) ->  equal to ....... 1K
MTU largest possible 1K      (0xc3) ->  equal to ....... 2K
MTU greater than ... 2K      (0x04) ->  NO PATHS
MTU less than ...... 2K      (0x44) ->  equal to ....... 1K
MTU equal to ....... 2K      (0x84) ->  equal to ....... 2K
MTU largest possible 2K      (0xc4) ->  equal to ....... 2K
MTU greater than ... 4K      (0x05) ->  NO PATHS
MTU less than ...... 4K      (0x45) ->  equal to ....... 2K
MTU equal to ....... 4K      (0x85) ->  NO PATHS
MTU largest possible 4K      (0xc5) ->  equal to ....... 2K
============================================================

With enable_quirks (when one of the ends is a Tavor device):
------------------------------------------------------------
MTU greater than ... 256     (0x01) ->  equal to ....... 1K
MTU less than ...... 256     (0x41) ->  NO PATHS
MTU equal to ....... 256     (0x81) ->  equal to ....... 256
MTU largest possible 256     (0xc1) ->  equal to ....... 2K
MTU greater than ... 512     (0x02) ->  equal to ....... 1K
MTU less than ...... 512     (0x42) ->  equal to ....... 256
MTU equal to ....... 512     (0x82) ->  equal to ....... 512
MTU largest possible 512     (0xc2) ->  equal to ....... 2K
MTU greater than ... 1K      (0x03) ->  NO PATHS
MTU less than ...... 1K      (0x43) ->  equal to ....... 512
MTU equal to ....... 1K      (0x83) ->  equal to ....... 1K
MTU largest possible 1K      (0xc3) ->  equal to ....... 2K
MTU greater than ... 2K      (0x04) ->  NO PATHS
MTU less than ...... 2K      (0x44) ->  equal to ....... 1K
MTU equal to ....... 2K      (0x84) ->  equal to ....... 2K
MTU largest possible 2K      (0xc4) ->  equal to ....... 2K
MTU greater than ... 4K      (0x05) ->  NO PATHS
MTU less than ...... 4K      (0x45) ->  equal to ....... 1K
MTU equal to ....... 4K      (0x85) ->  NO PATHS
MTU largest possible 4K      (0xc5) ->  equal to ....... 2K
============================================================
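
The hex values in the tables above pack a 2-bit selector (top bits) and a
6-bit MTU code (low bits, 1 = 256 up to 5 = 4K). The following sketch
decodes such a byte and reproduces the standard-mode table for a path whose
maximum MTU is 2K. It is written from the table, not from the OpenSM
sources; the function and variable names are assumptions, and the
enable_quirks behavior is deliberately not modelled.

```python
MTU_CODES = {1: "256", 2: "512", 3: "1K", 4: "2K", 5: "4K"}
SELECTORS = {0: "greater than", 1: "less than",
             2: "equal to", 3: "largest possible"}


def decode(mtu_byte):
    """Split the byte into (selector, mtu_code): top 2 bits, low 6 bits."""
    return mtu_byte >> 6, mtu_byte & 0x3F


def resolve_mtu(mtu_byte, path_max_code):
    """Return the MTU code chosen for the path, or None for NO PATHS."""
    selector, requested = decode(mtu_byte)
    if selector == 0:  # greater than: use the path maximum if it qualifies
        return path_max_code if path_max_code > requested else None
    if selector == 1:  # less than: largest code below the requested one
        best = min(path_max_code, requested - 1)
        return best if best >= 1 else None
    if selector == 2:  # equal to: only if the path can carry it
        return requested if requested <= path_max_code else None
    return path_max_code  # largest possible


def describe(mtu_byte, path_max_code):
    """Format one table row, e.g. for printing the whole table."""
    selector, requested = decode(mtu_byte)
    result = resolve_mtu(mtu_byte, path_max_code)
    outcome = "NO PATHS" if result is None else "equal to " + MTU_CODES[result]
    return "MTU %s %s (0x%02x) -> %s" % (
        SELECTORS[selector], MTU_CODES[requested], mtu_byte, outcome)
```

For example, describe(0x45, 4) reports that "less than 4K" on a 2K path
resolves to 2K, as in the standard-mode table.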

* osm_pkey_rcv.c: rwlock double release fix
  When a port is removed from the subnet and a previously requested pkey
  table block is received afterwards, the lock is released twice.
  This leads to deadlocks later when another MAD processor tries to
  acquire the same lock.
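
A common way to avoid this class of bug is to give the release a single
owner, so that an early exit cannot release the lock a second time. The
sketch below shows the pattern with Python's threading.Lock standing in for
the rwlock; the function name, the ports dictionary and the pkey_blocks
counter are hypothetical and do not mirror the OpenSM data structures.

```python
import threading


def process_pkey_block(lock, ports, port_guid):
    """Handle a received block; the lock is released exactly once."""
    lock.acquire()
    try:
        port = ports.get(port_guid)
        if port is None:
            # The port left the subnet before the response arrived:
            # return early and let `finally` do the single release.
            return False
        port["pkey_blocks"] += 1
        return True
    finally:
        lock.release()
```

Python raises RuntimeError when an unlocked Lock is released, so the mistake
is visible at once; a C lock gives no such guarantee.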

* osm_sa_informinfo.c: Fix InformInfoRecord searches

* Better SA MCMemberRecord leave locking
  Process multicast group leave requests (MCMemberRecord) with the lock
  held. This prevents a race with multicast group join requests, where
  the requests could be reordered during processing.

* osm_sa_informinfo.c: Conformance changes for subscribe component

* osm_sa_path_record.c: Handle LID 0 as error

* Fix comparing InformInfo records
  1. The received InformInfo struct was modified before dumping it.
  2. The function that compares InformInfo structures was just
     comparing the whole memory allocated for it, including reserved
     fields. Fixed to compare more selectively.

  As for QPN, from the IB spec, table 119 InformInfo:
  QPN : Ignored except when subscribe=0 (an unsubscribe
  request). Queue pair to which Report()s were sent as
  a result of a corresponding subscription. If no
  subscription for this Report() with this QPN exists,
  the request to unsubscribe performs no action and
  produces GetResp() with status indicating an invalid
  field value.
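
The difference between comparing whole memory and comparing selected fields
can be sketched as follows. The structure is a hypothetical, heavily reduced
layout chosen for the illustration, not the real InformInfo attribute, and
the field names and helper functions are assumptions.

```python
import ctypes


class InformInfoSketch(ctypes.Structure):
    """Hypothetical reduced layout; not the real InformInfo."""
    _fields_ = [
        ("lid_range_begin", ctypes.c_uint16),
        ("lid_range_end", ctypes.c_uint16),
        ("reserved", ctypes.c_uint16),
        ("trap_number", ctypes.c_uint16),
    ]


# Fields that carry meaning for matching; reserved bits are left out.
COMPARED_FIELDS = ("lid_range_begin", "lid_range_end", "trap_number")


def same_bytes(a, b):
    """Whole-memory comparison: reserved bits take part in the result."""
    return bytes(a) == bytes(b)


def same_subscription(a, b):
    """Selective comparison: only the meaningful fields are checked."""
    return all(getattr(a, name) == getattr(b, name)
               for name in COMPARED_FIELDS)
```

Two records that differ only in the reserved field compare unequal with
same_bytes() but equal with same_subscription().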

* osm_trap_rcv.c: Reduce repeated trap messages so log doesn't fill
  so quickly

* osm_helper.c: Fix stack smashing detected problem in osm_dump_service_record

* Fix permission on db files directory
  When creating the directory for storing db files (guid2lid), create it
  with reasonable permissions (the mode in use was 777 decimal = octal
  01411) and don't make it world writable.
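
The decimal versus octal mix-up is easy to check: oct(777) in Python gives
'0o1411'. A minimal sketch of creating such a directory with an explicit
octal mode follows. The function name and the 0o755 default are assumptions
for the illustration; these notes do not state which mode OpenSM uses after
the fix.

```python
import os
import stat


def make_db_dir(path, mode=0o755):
    """Create path with an explicit octal mode; return its mode bits."""
    os.makedirs(path, mode=mode, exist_ok=True)
    os.chmod(path, mode)  # apply the mode regardless of the process umask
    return stat.S_IMODE(os.stat(path).st_mode)
```

Checking the returned value against stat.S_IWOTH confirms that the directory
is not world writable.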

* Fix node_desc.description as string usages

5 Main Verification Flows
-------------------------

OpenSM verification is run using the following activities:
* osmtest - a stand-alone program
* ibmgtsim (IB management simulator) based - a set of flows that
  simulate clusters, inject errors and verify OpenSM capability to
  respond and bring up the network correctly.
* small cluster regression testing - where the SM is used on back to
  back or single switch configurations. The regression includes
  multiple OpenSM dedicated tests.
* cluster testing - when we run OpenSM to set up a large cluster, perform
  hand-off, reboots and reconnects, verify routing correctness and SA
  responsiveness at the ULP level (IPoIB and SDP).

5.1 osmtest

osmtest is an automated verification tool used for OpenSM
testing. Its verification flows are listed below.

* Inventory File: Obtain and verify all port info, node info, link and path
  records parameters.

* Service Record:
   - Register new service
   - Register another service (with a lease period)
   - Register another service (with service p_key set to zero)
   - Get all services by name
   - Delete the first service
   - Delete the third service
   - Bad flows of get/delete of an invalid service
   - Add / Get same service with different data
   - Add / Get / Delete by different component mask values (services
     by Name & Key / Name & Data / Name & Id / Id only)

* Multicast Member Record:
   - Query of existing Groups (IPoIB)
   - BAD Join with insufficient comp mask (o15.0.1.3)
   - Create given MGID=0 (o15.0.1.4)
   - Create given MGID=0xFF12A01C,FE800000,00000000,12345678 (o15.0.1.4)
   - Create BAD MGID=0xFA. (o15.0.1.6)
   - Create BAD MGID=0xFF12A01B w/ link-local not set (o15.0.1.6)
   - New MGID with invalid join state (o15.0.1.9)
   - Retry of existing MGID - See JoinState update (o15.0.1.11)
   - BAD RATE when connecting to existing MGID (o15.0.1.13)
   - Partial JoinState delete request - removing FullMember (o15.0.1.14)
   - Full Delete of a group (o15.0.1.14)
   - Verify Delete by trying to Join deleted group (o15.0.1.14)
   - BAD Delete of IPoIB membership (no prev join) (o15.0.1.15)

* GUIDInfo Record:
   - All GUIDInfoRecords in subnet are obtained

* MultiPathRecord:
   - Perform some compliant and noncompliant MultiPathRecord requests
   - Validation is via status in responses and IB analyzer

* PKeyTableRecord:
   - Perform some compliant and noncompliant PKeyTableRecord queries
   - Validation is via status in responses and IB analyzer

* LinearForwardingTableRecord:
   - Perform some compliant and noncompliant LinearForwardingTableRecord queries
   - Validation is via status in responses and IB analyzer

* Event Forwarding: Register for trap forwarding using reports
   - Send a trap and wait for report
   - Unregister non-existing

* Trap 64/65 Flow: Register to Trap 64-65, create traps (by
  disconnecting/connecting ports) and wait for report, then unregister.

* Stress Test: send PortInfoRecord queries, both single and RMPP and
  check for the rate of responses as well as their validity.


5.2 IB Management Simulator OpenSM Test Flows:

The simulator provides the ability to simulate the SM handling of virtual
topologies that are not limited to actual lab equipment availability.
OpenSM was simulated to bring up clusters of up to 10,000 nodes. Daily
regressions use smaller clusters (16 and 128 nodes).

The following test flows are run on the IB management simulator:

* Stability:
  Up to 12 links from the fabric are randomly selected to drop packets
  at drop rates up to 90%. The SM is required to succeed in bringing the
  fabric up. The resulting routing is verified to be correct as well.

* LID Manager:
  Using LMC = 2 the fabric is initialized with LIDs. Faults such as
  zero LID, Duplicated LID, non-aligned (to LMC) LIDs are
  randomly assigned to various nodes and other errors are randomly
  output to the guid2lid cache file. The SM sweep is run 5 times and
  after each iteration a complete verification is made to ensure that all
  LIDs that could possibly be maintained are kept, as well as that all nodes
  were assigned a legal LID range.

* Multicast Routing:
  Nodes randomly join the 0xc000 group and eventually the
  resulting routing is verified for completeness and adherence to
  Up/Down routing rules.

* osmtest:
  The complete osmtest flow as described in section 5.1 is run on
  the simulated fabrics.

* Stress Test:
  This flow merges fabric, LID and stability issues with continuous
  PathRecord, ServiceRecord and Multicast Join/Leave activity to
  stress the SM/SA during continuous sweeps. InformInfo Set/Delete/Get
  were added to the test such that both existing and non existing nodes
  perform them in random order.
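
The three fault types injected by the LID Manager flow above (zero LID,
duplicated LID, LID not aligned to LMC) can be expressed as simple checks.
With LMC = 2 each port owns 2^2 = 4 consecutive LIDs and its base LID must
have the low 2 bits clear. This is a sketch of the checks only; the function
names are assumptions and it is not the simulator's verification code.

```python
def lids_per_port(lmc):
    """Number of consecutive LIDs a port answers to for a given LMC."""
    return 1 << lmc


def is_aligned(base_lid, lmc):
    """A legal base LID is nonzero and has its low `lmc` bits clear."""
    return base_lid != 0 and base_lid & (lids_per_port(lmc) - 1) == 0


def lid_range(base_lid, lmc):
    """All LIDs owned by a port with this base LID."""
    return range(base_lid, base_lid + lids_per_port(lmc))


def find_duplicates(base_lids, lmc):
    """Return the set of LIDs claimed by more than one port."""
    seen, duplicates = set(), set()
    for base in base_lids:
        for lid in lid_range(base, lmc):
            if lid in seen:
                duplicates.add(lid)
            else:
                seen.add(lid)
    return duplicates
```

For LMC = 2, base LID 8 covers LIDs 8 to 11, while base LID 6 fails the
alignment check.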

5.3 OpenSM Regression

Using a back-to-back or single switch connection, the following set of
tests is run nightly on the stacks described in Table 2. The included
tests are:

* Stress Testing: Flood the SA with queries from multiple channel
  adapters to check the robustness of the entire stack up to the SA.

* Dynamic Changes: Dynamic Topology changes, through randomly
  dropping SMP packets, used to test OpenSM adaptation to an unstable
  network & verify DB correctness.

* Trap Injection: This flow injects traps to the SM and verifies that it
  handles them gracefully.

* SA Query Test: This test exhaustively checks the SA responses to all
  possible single component masks. To do that the test examines the
  entire set of records the SA can provide, classifies them by their
  field values and then selects every field (using component mask and a
  value) and verifies that the response matches the expected set of records.
  A random selection using multiple component mask bits is also performed.
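
The SA Query Test procedure can be sketched against an in-memory stand-in
for the SA. Records are plain dictionaries and bit i of the component mask
selects the i-th field; the field names in the usage note and both function
names are hypothetical, and no real SA query API is used.

```python
def sa_query(records, fields, component_mask, template):
    """Return the records matching template on every masked-in field.

    Bit i of component_mask selects fields[i].
    """
    selected = [name for i, name in enumerate(fields)
                if (component_mask >> i) & 1]
    return [rec for rec in records
            if all(rec[name] == template[name] for name in selected)]


def check_single_component_masks(records, fields, query=sa_query):
    """Query once per (field, value) pair and compare with the expected set."""
    for i, name in enumerate(fields):
        for value in {rec[name] for rec in records}:
            expected = [rec for rec in records if rec[name] == value]
            got = query(records, fields, 1 << i, {name: value})
            if got != expected:
                return False
    return True
```

With fields ("lid", "port"), a mask of 0b01 selects by lid only and 0b11
selects by both, which corresponds to the multiple-bit case mentioned above.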

5.4 Cluster testing:

Cluster testing is usually run before a distribution release. It
involves real hardware setups of 16 to 32 nodes (or more if a beta site
is available). Each test is validated by running all-to-all ping through the IB
interface. The test procedure includes:

* Cluster bringup

* Hand-off between 2 or 3 SM's while performing:
  - Node reboots
  - Switch power cycles (disconnecting the SM's)

* Unresponsive port detection and recovery

* osmtest from multiple nodes

* Trap injection and recovery


6 Qualification
---------------

Table 2 - Qualified IB Stacks
=============================

Stack                                    | Version
-----------------------------------------|--------------------------
OFED                                     |   1.2
OFED                                     |   1.1
OFED                                     |   1.0
OpenIB Gen2 (IBG2 distribution)          |   1.0
OpenIB Gen1 (IBGD distribution)          |   1.8.0
VAPI (Mellanox InfiniBand HCA Driver)    |   3.2 and later

Table 3 - Qualified Devices and Corresponding Firmware
======================================================

Mellanox
Device  |   FW versions
--------|-----------------------------------------------------------
MT43132 |   InfiniScale - fw-43132  5.2.0 (and later)
MT47396 |   InfiniScale III - fw-47396 0.5.0 (and later)
MT23108 |   InfiniHost - fw-23108   3.3.2 (and later)
MT25204 |   InfiniHost III Lx - fw-25204  1.0.1i (and later)
MT25208 |   InfiniHost III Ex (InfiniHost Mode) - fw-25208  4.6.2 (and later)
MT25208 |   InfiniHost III Ex (MemFree Mode) - fw-25218  5.0.1 (and later)

QLogic/PathScale
Device  |   Note
--------|-----------------------------------------------------------
iPath   | QHT6040 (PathScale InfiniPath HT-460)
iPath   | QHT6140 (PathScale InfiniPath HT-465)
iPath   | QLE6140 (PathScale InfiniPath PE-880)

Note: OpenSM does not run on an IBM Galaxy (eHCA) as it does not expose
QP0 and QP1. However, it does support it as a device on the subnet.