Restore from eBackup fails for a production server

Publication Date: 2017-10-05
Issue Description

A restore from eBackup for a production server returns a failure result.

Alarm Information

The restore from eBackup for a production server ends with a failure result.

Handling Process

The issue was reported on eBackup, so the eBackup portal was checked first, followed by the FusionCompute portal.

The failed backup/restore tasks involved different CNAs and different datastores.

A restore of the VM to another FusionCompute cluster was attempted first, followed by a restore of the VM to the original FusionCompute cluster.

Root Cause

The eBackup tasks failed because writing data blocks over the socket failed.

The parameter DPS_CHECK_DELAY_TIME is a factor that controls how much FusionCompute subtracts from the timeout count in each detection cycle (1 second). For example, if DPS_CHECK_DELAY_TIME is changed to 10, FusionCompute subtracts 10 from the current timeout count every second. This means that if eBackup does not update the timeout count in time, the task is closed after 6 minutes (3600 / 10 = 360 seconds). The change has no other impact on the system.
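The arithmetic behind these figures can be checked with a short sketch. It is only an illustration of the description in this article, not FusionCompute code: the function name seconds_until_close is hypothetical, and the sketch assumes the count starts at the 3600 maximum and is never refreshed by eBackup. The values 10 and 20 are the ones mentioned in this article; 1 is included only because it yields the 60-minute figure.

```python
import math


def seconds_until_close(delay_time, max_count=3600):
    """Seconds until the timeout count reaches zero when `delay_time`
    is subtracted once per second and nobody refreshes the count.

    Hypothetical helper for illustration; not a FusionCompute API.
    """
    return math.ceil(max_count / delay_time)


for delay in (1, 10, 20):
    secs = seconds_until_close(delay)
    print(f"DPS_CHECK_DELAY_TIME={delay}: closed after {secs} s ({secs / 60:.0f} min)")
```

With a value of 10 the window is 360 seconds, twice the 3-minute interval at which eBackup checks the task, while a value of 20 makes the window equal to that interval.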

Solution

eBackup maintains a timeout count for each backup/restore task, with a maximum of 3600 seconds. Every 3 minutes, eBackup checks the task progress and the remaining timeout count. If the backup/restore task cannot finish within the remaining count, eBackup resets the timeout count to 3600 so that the task can continue to run.

Meanwhile, FusionCompute also checks the timeout count and decrements it every second. When FusionCompute finds that the timeout count has reached zero, it closes the backup/restore task and the socket between eBackup and FusionCompute.

The problem is that the current FusionCompute cluster has DPS_CHECK_DELAY_TIME set to 20. FusionCompute therefore subtracts 20 from the timeout count of the backup/restore task every second, so the task times out in FusionCompute after 3 minutes (3600 / 20 = 180 seconds) instead of 60 minutes.

FusionCompute closes the backup/restore task before eBackup updates the timeout count. eBackup then finds the socket closed before the backup/restore task is complete and interrupts the task.
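The interaction described above can be reproduced with a small simulation. This is a simplified model under stated assumptions, not the actual eBackup or FusionCompute logic: the function simulate and its parameters are hypothetical, FusionCompute is assumed to act before eBackup within the same second, and eBackup is assumed to reset the count at every 3-minute check while the task is unfinished.

```python
def simulate(delay_time, task_duration_s, max_count=3600, refresh_interval_s=180):
    """Return (completed, seconds_elapsed) for a simplified model of the
    timeout handshake. All names here are illustrative assumptions.
    """
    count = max_count
    for t in range(1, task_duration_s + 1):
        # FusionCompute side: one detection cycle per second.
        count -= delay_time
        if count <= 0:
            # FusionCompute closes the task and the socket.
            return False, t
        # eBackup side: periodic check; reset the count while unfinished.
        if t % refresh_interval_s == 0 and t < task_duration_s:
            count = max_count
    return True, task_duration_s


print(simulate(20, 1000))  # closed by FusionCompute at the first eBackup check
print(simulate(10, 1000))  # eBackup refreshes in time, task completes
```

In this model a value of 20 drains the count at exactly the moment eBackup would refresh it, so any task longer than 3 minutes is closed, whereas a value of 10 leaves half of the count remaining at each check.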

Suggestions

This issue is a coordination problem between FusionCompute and eBackup.

During VM backup and restore, eBackup only calls FusionCompute interfaces to complete the task, so FusionCompute also needs to “manage” the task. For example, FusionCompute needs to close the task if eBackup crashes or the network is interrupted, so that the backup/restore task does not run out of control.
