Key Concepts
This chapter explains the key concepts related to the hardware and software components of the Sun Cluster system that you need to understand before working with Sun Cluster systems.
You can obtain more details at:
http://docs.oracle.com/cd/E19787-01/820-2553/concepts-1/index.html
Cluster Nodes
A cluster node is a computer that runs both the Solaris software and Sun Cluster software. The Sun Cluster software enables a cluster to have two to eight nodes.
Cluster nodes are generally connected to one or more disks. Nodes that are not attached to disks use the cluster file system to access the multihost disks. In parallel database configurations, nodes can concurrently access some or all of the disks.
Every node in the cluster knows when another node joins or leaves the cluster. Also, every node in the cluster knows the resources that are running locally as well as the resources that are running on the other cluster nodes.
Nodes in the same cluster must have similar processing capabilities, memory, and I/O capacity to enable failover without a significant decrease in performance. Because of the possibility of failover, each node must have enough spare capacity to absorb the workload of a failed node and still meet service level agreements.
Cluster Interconnection
Cluster interconnection is the physical configuration of devices that are used to transfer cluster-dedicated communications and data service communications between cluster nodes.
Redundant interconnections enable operations to continue over the surviving interconnection while system administrators isolate and repair the failed one. The Sun Cluster software detects the failure and automatically re-initiates communication over the repaired interconnection.
Cluster Membership
The Cluster Membership Monitor (CMM) is a set of distributed agents that exchange messages over the cluster interconnection to complete the following tasks:
- Ensuring a consistent membership view on all nodes (quorum)
- Synchronizing configurations in response to membership changes
- Handling cluster partitioning
- Ensuring full connectivity among all cluster members by leaving failed nodes out of the cluster until they are repaired
The main function of the CMM is to establish cluster membership, which requires a cluster-wide agreement on the set of nodes that participate in the cluster at any time.
The CMM detects major cluster status changes on each node, such as loss of communication with one or more nodes.
The CMM relies on the transport kernel module to generate heartbeats across the transport medium to the other nodes in the cluster. When the CMM does not detect a heartbeat from a node within a defined timeout period, it considers the node to have failed and initiates a cluster reconfiguration to renegotiate cluster membership.
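The timeout behavior can be pictured with a minimal sketch, assuming invented node names, a 10-second timeout, and simple print-based reporting (the real CMM is a kernel-level facility, not Python code):

```python
import time

# Minimal sketch of heartbeat timeout detection; node names, the timeout
# value, and the reporting below are illustrative assumptions, not the
# real CMM implementation.
HEARTBEAT_TIMEOUT = 10.0  # seconds (assumed; the real timeout is tunable)

last_heartbeat = {"nodeA": time.time(), "nodeB": time.time()}

def record_heartbeat(node):
    """Called whenever a heartbeat message arrives from a peer node."""
    last_heartbeat[node] = time.time()

def check_membership():
    """Return peers whose heartbeats have not arrived within the timeout."""
    now = time.time()
    failed = [n for n, t in last_heartbeat.items()
              if now - t > HEARTBEAT_TIMEOUT]
    if failed:
        # In Sun Cluster, this is where a cluster reconfiguration would be
        # initiated to renegotiate membership; here we only report it.
        print("Nodes presumed failed:", failed)
    return failed
```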
To determine cluster membership and data integrity, the CMM performs the following tasks:
- Accounting for a change in cluster membership, such as a node joining or leaving the cluster
- Ensuring that a failed node leaves the cluster
- Ensuring that a failed node remains inactive until it is repaired
- Preventing the cluster from being divided into node subsets
Cluster Configuration Repository
The Cluster Configuration Repository (CCR) is a private, cluster-wide, and distributed database for storing information about the configuration and status of the cluster. To avoid corrupting configuration data, each node must be aware of the current state of cluster resources. The CCR ensures that you have a consistent view of the cluster from all nodes. The CCR is updated when an error occurs, recovery is implemented, or the general status of the cluster changes.
The CCR structure contains the following information (a rough illustration follows the list):
- Cluster and node names
- Cluster transport configuration
- The names of Solaris Volume Manager disk sets or Veritas disk groups
- A list of nodes that can master each disk group
- Valid parameter values for data services
- Paths to callback methods for data services
- DID device configuration
- Current cluster status
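As a rough illustration only, the kind of information listed above can be pictured as a single structure; the names and values below are invented examples, and the real CCR is a distributed set of configuration files managed by the cluster framework, not a Python dictionary:

```python
# Invented example of the kinds of data the CCR holds.
ccr_example = {
    "cluster_name": "example-cluster",
    "node_names": ["nodeA", "nodeB"],
    "transport_config": {"adapters": ["hme1", "hme2"], "switches": 2},
    "disk_groups": {
        "oradg": {"manager": "SVM", "potential_masters": ["nodeA", "nodeB"]},
    },
    "data_services": {
        "oracle": {"parameters": {"probe_interval": 60},
                   "callback_methods": "/opt/example/oracle/bin"},
    },
    "did_devices": ["d1", "d2", "d3"],
    "cluster_status": "ok",
}
```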
Fault Monitors
The Sun Cluster system makes all components on the "path" between users and data highly available by monitoring applications, file systems, and network ports.
The Sun Cluster software detects a node failure quickly and creates an equivalent server for the resources on the failed node. The Sun Cluster software ensures that resources unaffected by the failure remain constantly available during the recovery and that the resources of the failed node become available as soon as they are recovered.
Data Services Monitoring
Each Sun Cluster data service has a fault monitor that periodically probes the data service to determine its health. A fault monitor verifies that the application daemons are running and that clients are being served. Based on the information that the probes return, predefined actions such as restarting daemons or initiating a failover can be taken.
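A simplified sketch of such a probe loop follows; probe_service(), restart_daemon(), and request_failover() are hypothetical placeholders, and the probe interval and restart limit are assumed values, not the actual data service API:

```python
import time

# Illustrative fault-monitor loop; the callables and constants are
# placeholders, not the real Sun Cluster fault-monitor interface.
PROBE_INTERVAL = 60   # seconds between probes (assumed)
MAX_RESTARTS = 3      # local restarts before requesting a failover (assumed)

def monitor(probe_service, restart_daemon, request_failover):
    restarts = 0
    while True:
        if probe_service():            # e.g. connect and issue a test request
            restarts = 0               # healthy: reset the restart counter
        elif restarts < MAX_RESTARTS:
            restart_daemon()           # transient failure: restart locally
            restarts += 1
        else:
            request_failover()         # persistent failure: fail over
            return
        time.sleep(PROBE_INTERVAL)
```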
Disk Path Monitoring
Sun Cluster software supports disk-path monitoring (DPM). DPM improves the overall reliability of failover and switchover by reporting the failure of a secondary disk path. You can use one of two methods to monitor disk paths. The first method is provided by the scdpm command, which enables you to monitor, unmonitor, or display the status of disk paths in your cluster.
The second method is provided by the SunPlex Manager GUI, which presents a topological view of the monitored disk paths. The view is updated every 10 minutes to provide information about the number of failed pings.
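For illustration, the state that disk-path monitoring tracks can be modeled as a simple table of per-node path statuses; the device names and states below are invented, and the real interfaces are the scdpm command and SunPlex Manager described above:

```python
# Toy model of monitored disk-path status; node and device names are invented.
paths = {
    ("nodeA", "/dev/did/rdsk/d1"): "Ok",
    ("nodeA", "/dev/did/rdsk/d2"): "Fail",
    ("nodeB", "/dev/did/rdsk/d1"): "Ok",
}

def failed_paths():
    """Return the monitored paths that are currently reported as failed."""
    return [(node, dev) for (node, dev), state in paths.items()
            if state == "Fail"]

print(failed_paths())   # [('nodeA', '/dev/did/rdsk/d2')]
```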
IP Multipath Monitoring
Each cluster node has its own IP network multipathing configuration, which can differ from the configuration on other cluster nodes. IP network multipathing monitors the following network communication failures:
- The transmit and receive path of the network adapter has stopped transmitting packets.
- The attachment of the network adapter to the link is down.
- The port on the switch does not transmit or receive packets.
- The physical interface in a group is not present at system boot.
Quorum Devices
A quorum device is a disk that is shared by two or more nodes and that contributes votes used to establish a quorum. A cluster can operate only when a quorum of votes is available. When a cluster becomes partitioned into separate sets of nodes, the quorum device is used to determine which set of nodes constitutes the new cluster.
Both cluster nodes and quorum devices vote to form quorum. By default, a cluster node acquires a quorum vote count of one when it boots and becomes a cluster member. A node has a vote count of zero while it is being installed, or when an administrator has placed it in the maintenance state.
Quorum devices acquire quorum vote counts that are based on the number of node connections to the device. When you set up a quorum device, it acquires a maximum vote count of N-1, where N is the number of votes connected to the quorum device. For example, a quorum device that is connected to two nodes with nonzero vote counts has a quorum count of one (two minus one).
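The N-1 rule amounts to simple arithmetic, sketched below for the two-node example from the text (an illustration only, not the actual quorum algorithm):

```python
def quorum_device_votes(connected_node_votes):
    """A quorum device contributes N-1 votes, where N is the total vote
    count of the nodes connected to it."""
    n = sum(connected_node_votes)
    return max(n - 1, 0)

# Two nodes, each with one vote, share the quorum device:
print(quorum_device_votes([1, 1]))   # 1  (two minus one)
```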
Data Integrity
The Sun Cluster system attempts to prevent data corruption and ensure data integrity. Because cluster nodes share data and resources, a cluster must never split into separate partitions that are active at the same time. The CMM guarantees that only one cluster is operational at any time.
Two types of problems can arise from cluster partitions: split brain and amnesia. Split brain occurs when the cluster interconnection between nodes is lost and the cluster becomes partitioned into subclusters, each of which believes that it is the only partition. A subcluster that is unaware of the other subclusters can cause conflicts over shared resources, such as duplicate network addresses, and can corrupt data.
Amnesia occurs if all of the nodes leave the cluster in staggered groups. Consider a two-node cluster with nodes A and B. If node A is shut down, the CCR configuration data is updated only on node B, not on node A. If node B is later shut down and node A is rebooted, node A runs with the old CCR configuration data. This state is called amnesia and can lead to running a cluster with stale configuration information.
You can avoid split brain and amnesia by giving each node one vote and mandating a majority of votes for an operational cluster. A partition with a majority of votes gains quorum and is allowed to operate. This majority-vote mechanism works well in a cluster with more than two nodes. In a two-node cluster, however, a majority is two, so if such a cluster becomes partitioned, an external vote is needed for either partition to gain quorum. This external vote is provided by a quorum device, which can be any disk that is shared between the two nodes.
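Continuing the illustration, whether a partition may operate reduces to a majority test over the total configured votes (again a sketch, not the real implementation):

```python
def has_quorum(partition_votes, total_votes):
    """A partition may operate only if it holds a majority of all
    configured votes (node votes plus quorum device votes)."""
    return partition_votes > total_votes // 2

# Two-node cluster, one vote per node, plus a one-vote quorum device
# (total = 3).  The partition that wins the race for the quorum device
# holds 2 of 3 votes and may form the cluster:
print(has_quorum(2, 3))   # True
print(has_quorum(1, 3))   # False
```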
Table 10-1 describes how the Sun Cluster software uses quorum to avoid split brain and amnesia.
| Partition Type | Quorum Solution |
|---|---|
| Split brain | Enables only the partition (subcluster) with a majority of votes to run as the cluster (only one partition can have a majority of votes). After a node loses the race for quorum, that node is taken out of service. |
| Amnesia | Guarantees that when a cluster is booted, it has at least one node that was a member of the most recent cluster membership (and therefore has the latest configuration data). |
Failure Fencing
A major issue for clusters is a failure that causes the cluster to become partitioned (called split brain). When this situation occurs, not all nodes can communicate, so individual nodes or subsets of nodes might try to form individual clusters. Each subset or partition might "believe" that it has sole access to and ownership of the multihost disks. Attempts by multiple nodes to write to the disks can result in data corruption.
Failure fencing limits node access to multihost disks by physically preventing access to the disks. When a node leaves the cluster, either because it fails or becomes partitioned, failure fencing ensures that the node can no longer access the disks. Only current member nodes have access to the disks, which preserves data integrity.
The Sun Cluster system uses SCSI disk reservations to implement failure fencing. Failed nodes are "fenced" away from the multihost disks through SCSI reservations, which prevent them from accessing those disks.
When a cluster member detects that another node is no longer communicating over the cluster interconnection, it initiates a failure-fencing procedure to prevent the failed node from accessing shared disks. When this fencing occurs, the fenced node is taken out of service and a "reservation conflict" message is displayed on its console.
Failfast Mechanism for Failure Fencing
The failfast mechanism halts a failed node, but it does not prevent the node from being rebooted. After it is halted, the node might reboot and attempt to rejoin the cluster.
If a node loses connectivity with the other nodes in the cluster and is not part of a partition that can achieve quorum, it is forcibly removed from the cluster by another node. Any node that is part of a partition that can achieve quorum places reservations on the shared disks. A node that is not part of a partition with quorum is halted by the failfast mechanism.
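A compressed sketch of this decision is shown below; the helper callables are hypothetical, and the real mechanism is enforced inside the cluster framework rather than in application code:

```python
def on_partition(partition_votes, total_votes, reserve_disks, halt_node):
    """Illustrative failfast decision: a partition with quorum fences the
    shared disks; a node without quorum halts itself rather than risk
    corrupting shared data."""
    if partition_votes > total_votes // 2:
        reserve_disks()    # place SCSI reservations on the shared disks
    else:
        halt_node()        # failfast: halt the node
```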
Devices
A global file system makes all files in a cluster equally accessible and visible to all nodes. Similarly, the Sun Cluster software makes all devices in a cluster accessible and visible throughout the cluster. That is, the I/O subsystem can access any device in the cluster from any node, regardless of where the device is physically attached. This access is called global device access.
Global Devices
The Sun Cluster system uses global devices to provide cluster-wide, highly available access to any device in a cluster from any node. Generally, if a node fails while providing access to a global device, the Sun Cluster software automatically switches to another path to the device and redirects access to that path. This redirection is straightforward because the same device name is used regardless of the path: access to a remote device works in the same way as access to a local device with the same name. In addition, the APIs for accessing a global device in a cluster are the same as those for accessing a device locally.
Sun Cluster global devices include disks, CD-ROMs, and tapes. However, disks are the only multiported global devices that are supported.
A cluster assigns a unique ID to a disk, CD-ROM, or tape. This assignment enables consistent access to each device from any node in a cluster.
Device ID
Sun Cluster software manages global devices by using a device ID (DID) driver. This driver automatically assigns a unique ID to each device in a cluster, including multihost disks, tape drives, and CD-ROMs.
The DID driver is an integral part of global device access in a cluster. It probes all nodes of the cluster and builds a list of unique disk devices. The DID driver also assigns each device a major and a minor number that are consistent across all cluster nodes. Access to a global device is performed by using this unique DID rather than the traditional Solaris device IDs.
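As a rough illustration, the driver's job amounts to mapping each physical device, however it is named on each node, to one cluster-wide DID name; the device paths and DID names below are invented examples:

```python
# Invented example of the mapping the DID driver maintains: the same
# physical disk can appear under different Solaris names on each node,
# but it receives a single cluster-wide DID name (for example, d3).
did_map = {
    "d3": {"nodeA": "/dev/rdsk/c1t2d0", "nodeB": "/dev/rdsk/c2t2d0"},
    "d4": {"nodeA": "/dev/rdsk/c1t3d0", "nodeB": "/dev/rdsk/c2t3d0"},
}

def local_path(did, node):
    """Resolve a cluster-wide DID name to the node-local device path."""
    return did_map[did][node]

print(local_path("d3", "nodeB"))   # /dev/rdsk/c2t2d0
```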
Local Devices
Sun Cluster software also manages local devices. These devices are accessible only on a node that is running and that has a physical connection to the device. Local devices can have a performance advantage over global devices because they do not need to replicate state information on multiple nodes simultaneously. If the domain of the device fails, the device cannot be accessed unless it can be shared by multiple nodes.
Device Groups
Disk device groups enable volume manager disk groups to become global, because they provide multipath and multihost support for the underlying disks. Each cluster node that is physically attached to the multihost disks provides a path to the disk device group.
In the Sun Cluster system, multihost disks are placed under the control of the Sun Cluster software by registering them as disk device groups. This registration gives the Sun Cluster system information about which nodes have a path to which volume manager disk groups. The Sun Cluster software creates a raw disk device group for each disk and tape device in the cluster. These cluster device groups remain offline until you access them as global devices, either by mounting a global file system or by accessing a raw database file.
Data Services
A data service is the combination of software and configuration files that enables an application to run without modification in a Sun Cluster configuration. When running in a Sun Cluster configuration, an application runs as a resource under the control of the Resource Group Manager (RGM). Data services enable you to configure applications, such as Oracle databases, to run on a cluster instead of on a single server.
Data service software provides the following Sun Cluster management methods for operating applications:
- Starting an application
- Stopping an application
- Monitoring application faults and recovering from them
The configuration file for data services defines the properties of resources that represent applications in RGM.
The RGM is responsible for failover and scalable data services in the cluster. It starts and stops data services on selected nodes of the cluster in response to cluster membership changes, and it enables data service applications to use the cluster framework.
The RGM manages data services as resources. A cluster administrator creates and manages resources in containers called resource groups. The RGM and administrator actions move resources and resource groups between the online and offline states.
Description of a Resource Type
A resource type is a collection of properties that describe an application to the cluster. The collection contains information about how to start, stop, and monitor the application on a cluster node. A resource type also includes application-specific properties that need to be defined when you run the application in a cluster. Sun Cluster data services come with several predefined resource types. For example, the resource type of Sun Cluster HA for Oracle is SUNW.oracle-server, and the resource type of Sun Cluster HA for Apache is SUNW.apache.
Description of a Resource
A resource is an instance of a resource type that is defined cluster-wide. Based on a resource type, multiple instances of an application can be installed on the cluster. When you initialize a resource, the RGM assigns values to the application-specific properties, and the resource inherits any properties defined at the resource type level.
Data services utilize several types of resources. Applications such as Apache Web Server or Sun Java System Web Server depend on network addresses (logical host names and shared addresses). Application and network resources form a basic unit that is managed by the RGM.
Description of a Resource Group
Resources managed by the RGM are placed in resource groups and managed as a unit. A resource group is a set of related or interdependent resources. For example, a resource derived from SUNW.LogicalHostname might be placed in the same resource group as a resource derived from an Oracle database. If a failover or switchover is initiated on a resource group, the resource group is migrated as a unit.
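A minimal data-structure sketch of these relationships follows; the property names and values are illustrative assumptions, not the actual RGM schema:

```python
from dataclasses import dataclass, field

# Illustrative model of resource types, resources, and resource groups.
@dataclass
class ResourceType:
    name: str                         # e.g. "SUNW.oracle-server"
    properties: dict = field(default_factory=dict)

@dataclass
class Resource:
    name: str
    rtype: ResourceType
    properties: dict = field(default_factory=dict)

    def effective_properties(self):
        # A resource inherits resource-type properties and can override them.
        return {**self.rtype.properties, **self.properties}

@dataclass
class ResourceGroup:
    name: str
    resources: list = field(default_factory=list)   # failed over as a unit

# Example: a logical host name resource and an Oracle server resource
# placed in the same group so that they fail over together.
oracle_rt = ResourceType("SUNW.oracle-server", {"probe_interval": 60})
rg = ResourceGroup("oracle-rg", [
    Resource("ora-listener-ip", ResourceType("SUNW.LogicalHostname")),
    Resource("ora-server", oracle_rt, {"probe_interval": 30}),
])
```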
Data Service Types
Data services enable applications to become highly available and scalable, which helps prevent important cluster applications from being interrupted by a single point of failure.
When configuring a data service, you must configure it as one of the following types:
- Failover data service
- Scalable data service
- Parallel data service