dhodovsk / source-git / pacemaker

Forked from source-git/pacemaker 3 years ago
Clone
Blob Blame History Raw
:compat-mode: legacy
= Resource Agents =

== Resource Agent Actions ==

If one resource depends on another resource via constraints, the cluster will
interpret an expected result as sufficient to continue with dependent actions.
This may cause timing issues if the resource agent start returns before the
service is not only launched but fully ready to perform its function, or if the
resource agent stop returns before the service has fully released all its
claims on system resources. At a minimum, the start or stop should not return
before a status command would return the expected (started or stopped) result.

== OCF Resource Agents ==

=== Location of Custom Scripts ===

indexterm:[OCF Resource Agents]
OCF Resource Agents are found in +/usr/lib/ocf/resource.d/pass:[<replaceable>provider</replaceable>]+

When creating your own agents, you are encouraged to create a new
directory under +/usr/lib/ocf/resource.d/+ so that they are not
confused with (or overwritten by) the agents shipped by existing providers.

So, for example, if you choose the provider name of bigCorp and want
a new resource named bigApp, you would create a resource agent called
+/usr/lib/ocf/resource.d/bigCorp/bigApp+ and define a resource:
 
[source,XML]
----
<primitive id="custom-app" class="ocf" provider="bigCorp" type="bigApp"/>
----

=== Actions ===

All OCF resource agents are required to implement the following actions.

.Required Actions for OCF Agents
[width="95%",cols="3m,3,7",options="header",align="center"]
|=========================================================
|Action
|Description
|Instructions

|start
|Start the resource
|Return 0 on success and an appropriate error code otherwise. Must not
 report success until the resource is fully active.
 indexterm:[start,OCF Action]
 indexterm:[OCF,Action,start]

|stop
|Stop the resource
|Return 0 on success and an appropriate error code otherwise. Must not
 report success until the resource is fully stopped.
 indexterm:[stop,OCF Action]
 indexterm:[OCF,Action,stop]

|monitor
|Check the resource's state 

|Exit 0 if the resource is running, 7 if it is stopped, and anything
 else if it is failed. 
 indexterm:[monitor,OCF Action]
 indexterm:[OCF,Action,monitor]

NOTE: The monitor script should test the state of the resource on the local machine only.

|meta-data
|Describe the resource
|Provide information about this resource as an XML snippet. Exit with 0.
 indexterm:[meta-data,OCF Action]
 indexterm:[OCF,Action,meta-data]

NOTE: This is _not_ performed as root.

|validate-all
|Verify the supplied parameters
|Return 0 if parameters are valid, 2 if not valid, and 6 if resource is not configured.
 indexterm:[validate-all,OCF Action]
 indexterm:[OCF,Action,validate-all]

|=========================================================

Additional requirements (not part of the OCF specification) are placed on
agents that will be used for advanced concepts such as clone resources.

.Optional Actions for OCF Resource Agents
[width="95%",cols="2m,6,3",options="header",align="center"]
|=========================================================

|Action
|Description
|Instructions

|promote
|Promote the local instance of a promotable clone resource to the master (primary) state.
|Return 0 on success
 indexterm:[promote,OCF Action]
 indexterm:[OCF,Action,promote]

|demote
|Demote the local instance of a promotable clone resource to the slave (secondary) state.
|Return 0 on success
 indexterm:[demote,OCF Action]
 indexterm:[OCF,Action,demote]

|notify
|Used by the cluster to send the agent pre- and post-notification
 events telling the resource what has happened and will happen.
|Must not fail. Must exit with 0
 indexterm:[notify,OCF Action]
 indexterm:[OCF,Action,notify]

|=========================================================

One action specified in the OCF specs, +recover+, is not currently used by the
cluster. It is intended to be a variant of the +start+ action that tries to
recover a resource locally.

[IMPORTANT]
====
If you create a new OCF resource agent, use indexterm:[ocf-tester]`ocf-tester`
to verify that the agent complies with the OCF standard properly.
====

=== How are OCF Return Codes Interpreted? ===

The first thing the cluster does is to check the return code against
the expected result.  If the result does not match the expected value,
then the operation is considered to have failed, and recovery action is
initiated.

There are three types of failure recovery:

.Types of recovery performed by the cluster
[width="95%",cols="1m,4,4",options="header",align="center"]
|=========================================================

|Type
|Description
|Action Taken by the Cluster

|soft
|A transient error occurred
|Restart the resource or move it to a new location
indexterm:[soft,OCF error]
indexterm:[OCF,error,soft]

|hard
|A non-transient error that may be specific to the current node occurred
|Move the resource elsewhere and prevent it from being retried on the current node
indexterm:[hard,OCF error]
indexterm:[OCF,error,hard]

|fatal
|A non-transient error that will be common to all cluster nodes (e.g. a bad configuration was specified)
|Stop the resource and prevent it from being started on any cluster node
indexterm:[fatal,OCF error]
indexterm:[OCF,error,fatal]

|=========================================================

[[s-ocf-return-codes]]
=== OCF Return Codes ===

The following table outlines the different OCF return codes and the type of
recovery the cluster will initiate when a failure code is received.
Although counterintuitive, even actions that return 0
(aka. +OCF_SUCCESS+) can be considered to have failed, if 0 was not
the expected return value.

.OCF Return Codes and their Recovery Types
[width="95%",cols="1m,<4m,<6,1m",options="header",align="center"]
|=========================================================

|RC
|OCF Alias
|Description
|RT

|0
|OCF_SUCCESS
|Success. The command completed successfully. This is the expected result for all start, stop, promote and demote commands.
indexterm:[Return Code,OCF_SUCCESS]
indexterm:[Return Code,0,OCF_SUCCESS]
|soft

|1
|OCF_ERR_GENERIC
|Generic "there was a problem" error code.
indexterm:[Return Code,OCF_ERR_GENERIC]
indexterm:[Return Code,1,OCF_ERR_GENERIC]
|soft

|2
|OCF_ERR_ARGS
|The resource's configuration is not valid on this machine. E.g. it refers to a location not found on the node. 
indexterm:[Return Code,OCF_ERR_ARGS]
indexterm:[Return Code,2,OCF_ERR_ARGS]
|hard

|3
|OCF_ERR_UNIMPLEMENTED
|The requested action is not implemented.
indexterm:[Return Code,OCF_ERR_UNIMPLEMENTED]
indexterm:[Return Code,3,OCF_ERR_UNIMPLEMENTED]
|hard

|4
|OCF_ERR_PERM
|The resource agent does not have sufficient privileges to complete the task.
indexterm:[Return Code,OCF_ERR_PERM]
indexterm:[Return Code,4,OCF_ERR_PERM]
|hard

|5
|OCF_ERR_INSTALLED
|The tools required by the resource are not installed on this machine.
indexterm:[Return Code,OCF_ERR_INSTALLED]
indexterm:[Return Code,5,OCF_ERR_INSTALLED]
|hard

|6
|OCF_ERR_CONFIGURED
|The resource's configuration is invalid. E.g. required parameters are missing.
indexterm:[Return Code,OCF_ERR_CONFIGURED]
indexterm:[Return Code,6,OCF_ERR_CONFIGURED]
|fatal

|7
|OCF_NOT_RUNNING
|The resource is safely stopped. The cluster will not attempt to stop a resource that returns this for any action.
indexterm:[Return Code,OCF_NOT_RUNNING]
indexterm:[Return Code,7,OCF_NOT_RUNNING]
|N/A

|8
|OCF_RUNNING_MASTER
|The resource is running in master mode.
indexterm:[Return Code,OCF_RUNNING_MASTER]
indexterm:[Return Code,8,OCF_RUNNING_MASTER]
|soft

|9
|OCF_FAILED_MASTER
|The resource is in master mode but has failed. The resource will be demoted,
stopped and then started (and possibly promoted) again.
indexterm:[Return Code,OCF_FAILED_MASTER]
indexterm:[Return Code,9,OCF_FAILED_MASTER]
|soft

|other
|N/A
|Custom error code.
indexterm:[Return Code,other]
|soft

|=========================================================

Exceptions to the recovery handling described above:

* Probes (non-recurring monitor actions) that find a resource active
  (or in master mode) will not result in recovery action unless it is
  also found active elsewhere.
* The recovery action taken when a resource is found active more than
  once is determined by the resource's +multiple-active+ property.
* Recurring actions that return +OCF_ERR_UNIMPLEMENTED+
  do not cause any type of recovery.

== LSB Resource Agents (Init Scripts) ==

=== LSB Compliance ===

The relevant part of the
http://refspecs.linuxfoundation.org/lsb.shtml[LSB specifications]
includes a description of all the return codes listed here.
    
Assuming `some_service` is configured correctly and currently
inactive, the following sequence will help you determine if it is
LSB-compatible:

. Start (stopped):
+
----
# /etc/init.d/some_service start ; echo "result: $?"
----
+
  .. Did the service start?
  .. Did the echo command print *result: 0* (in addition to the init script's usual output)?
+
. Status (running):
+
----
# /etc/init.d/some_service status ; echo "result: $?"
----
+
  .. Did the script accept the command?
  .. Did the script indicate the service was running?
  .. Did the echo command print *result: 0* (in addition to the init script's usual output)?
+
. Start (running):
+
----
# /etc/init.d/some_service start ; echo "result: $?"
----
+
  .. Is the service still running?
  .. Did the echo command print *result: 0* (in addition to the init script's usual output)?
+
. Stop (running):
+
----
# /etc/init.d/some_service stop ; echo "result: $?"
----
+
  .. Was the service stopped?
  .. Did the echo command print *result: 0* (in addition to the init script's usual output)?
+
. Status (stopped):
+
----
# /etc/init.d/some_service status ; echo "result: $?"
----
+
  .. Did the script accept the command?
  .. Did the script indicate the service was not running?
  .. Did the echo command print *result: 3* (in addition to the init script's usual output)?
+
. Stop (stopped):
+
----
# /etc/init.d/some_service stop ; echo "result: $?"
----
+
  .. Is the service still stopped?
  .. Did the echo command print *result: 0* (in addition to the init script's usual output)?
+
. Status (failed):
+
.. This step is not readily testable and relies on manual inspection of the script.
+
The script can use one of the error codes (other than 3) listed in the
LSB spec to indicate that it is active but failed. This tells the
cluster that before moving the resource to another node, it needs to
stop it on the existing one first.

If the answer to any of the above questions is no, then the script is
not LSB-compliant. Your options are then to either fix the script or
write an OCF agent based on the existing script.