Reliability Architecture
The reliability architecture of eSight consists of three layers: the application layer, the data layer, and the collection layer.
Figure 6-1 describes the overall software architecture.
The reliability measures taken for each of the three layers are as follows:
- Application layer
- Services can be upgraded process by process. Upgrading some of the services does not interrupt the system as a whole.
- The maintenance tool automatically restarts a process when it detects that the process is abnormal.
- When a service module detects that core threads and process resources are abnormal, it automatically stops the process. The maintenance tool then starts the process.
- Core services (security, log, license, and alarm) remain available when the database is faulty. Data inconsistency is tolerated during the fault, and data is synchronized to the database after the database is restored.
- Key configurations are automatically backed up and checked. When invalid or missing configurations are found, an alarm is reported and the configurations are automatically restored.
- The maintenance tool monitors key resources (CPU usage, memory usage, disk usage, database tablespace usage, and certificate validity period) and reports an alarm when a threshold is crossed.
- Data layer
- The maintenance tool automatically restarts the database when it detects that the database is abnormal.
- The maintenance tool monitors database resource usage, including the number of database connections, tablespace size, and the space used by archive logs and rollback logs, and reports an alarm when a threshold is crossed.
- The maintenance tool provides manual and automatic backup of NMS data (configuration files and data in the database) for system restoration.
- An alarm is reported when a database dump fails.
- Data remains intact and services are quickly restored after an abnormal logout or power outage.
- Collection layer
- Region-based clusters are supported. Requests are routed to the corresponding instances based on routing information. These clusters are deployed in N+1 mode.
- The connection between the active and standby nodes is checked using heartbeat detection. If a disconnection is detected, reconnection is triggered.
- When an NMS is restarted or a device is disconnected from the NMS, device data is recollected and synchronized to ensure data integrity.
- Traffic control is applied when large-scale faults cause an alarm storm.
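
The threshold-based alarms listed for the application and data layers can be illustrated with a minimal Python sketch. This is not the eSight implementation: the names Threshold and check_usage, the resource keys, and the percentage limits are hypothetical and chosen only to show the idea of comparing sampled usage against configured limits.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Threshold:
    """A hypothetical alarm threshold for one monitored resource."""
    resource: str         # e.g. "cpu", "memory", "disk" (illustrative keys)
    limit_percent: float  # usage level at which an alarm is raised


def check_usage(usage: Dict[str, float], thresholds: List[Threshold]) -> List[str]:
    """Return one alarm message per resource whose usage reaches its limit.

    Resources with no sample in `usage` are skipped rather than alarmed.
    """
    alarms = []
    for t in thresholds:
        value = usage.get(t.resource)
        if value is not None and value >= t.limit_percent:
            alarms.append(
                f"{t.resource} usage {value:.1f}% crossed threshold "
                f"{t.limit_percent:.1f}%"
            )
    return alarms


if __name__ == "__main__":
    # Assumed limits; real values would come from product configuration.
    limits = [Threshold("cpu", 80.0), Threshold("disk", 90.0)]
    sample = {"cpu": 85.0, "disk": 40.0}
    for message in check_usage(sample, limits):
        print(message)  # prints: cpu usage 85.0% crossed threshold 80.0%
```

In this sketch the check is a pure function over one usage sample, so a monitoring loop could call it periodically and forward the returned messages to whatever alarm channel is in use.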