Resolving Frequent Reboots of One Oracle RAC Node


Symptoms:

An incident from 2011: Oracle 11gR2 RAC on Red Hat Linux. One node of a 2-node RAC cluster kept rebooting.
 
Root cause analysis:

Host logs

The VIP failed over to the other node, then moved back after the reboot.
 
node1
 
Nov 23 18:22:27 dtydb2 avahi-daemon[13096]: Withdrawing address record for 10.4.124.242 on bond2.
 
Nov 23 18:22:31 dtydb2 avahi-daemon[13096]: Withdrawing address record for 169.254.188.250 on bond1.
 
Nov 23 18:23:10 dtydb2 avahi-daemon[13096]: Registering new address record for 169.254.188.250 on bond1.
 
Nov 23 18:23:35 dtydb2 avahi-daemon[13096]: Registering new address record for 10.4.124.242 on bond2.
 
Nov 23 18:23:35 dtydb2 avahi-daemon[13096]: Withdrawing address record for 10.4.124.242 on bond2.
 
Nov 23 18:23:35 dtydb2 avahi-daemon[13096]: Registering new address record for 10.4.124.242 on bond2.
 
Nov 23 18:23:35 dtydb2 avahi-daemon[13096]: Withdrawing address record for 10.4.124.242 on bond2.
 
Nov 23 18:23:35 dtydb2 avahi-daemon[13096]: Registering new address record for 10.4.124.242 on bond2.
 
node2
 
Nov 23 18:22:31 dtydb1 avahi-daemon[13132]: Registering new address record for 10.4.124.242 on bond2.
 
Nov 23 18:22:31 dtydb1 avahi-daemon[13132]: Withdrawing address record for 10.4.124.242 on bond2.
 
Nov 23 18:22:31 dtydb1 avahi-daemon[13132]: Registering new address record for 10.4.124.242 on bond2.
 
Nov 23 18:22:31 dtydb1 avahi-daemon[13132]: Withdrawing address record for 10.4.124.242 on bond2.
 
Nov 23 18:22:31 dtydb1 avahi-daemon[13132]: Registering new address record for 10.4.124.242 on bond2.
 
Nov 23 18:23:34 dtydb1 avahi-daemon[13132]: Withdrawing address record for 10.4.124.242 on bond2.
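
The VIP movement in the two excerpts above is easier to follow when the lines from both hosts are merged into one timeline. The following is a minimal sketch, not part of the original troubleshooting: the function names `parse_avahi_events` and `current_holder` are hypothetical, and the year is passed in as an assumption because syslog timestamps do not carry one.

```python
import re
from datetime import datetime

# Matches avahi-daemon syslog lines of the form quoted above.
AVAHI_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"avahi-daemon\[\d+\]:\s+(?P<action>Registering new|Withdrawing) "
    r"address record for (?P<ip>\S+) on (?P<iface>[^.\s]+)"
)


def parse_avahi_events(lines, year=2011):
    """Return (time, host, action, ip, iface) tuples sorted by time.

    `year` is an assumption supplied by the caller; syslog omits it.
    """
    events = []
    for line in lines:
        m = AVAHI_RE.match(line.strip())
        if m is None:
            continue  # not an avahi address-record line
        when = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S")
        action = "register" if m.group("action") == "Registering new" else "withdraw"
        events.append((when, m.group("host"), action, m.group("ip"), m.group("iface")))
    # Stable sort: events within the same second keep their input order.
    events.sort(key=lambda e: e[0])
    return events


def current_holder(events, ip):
    """Host whose latest event for `ip` is a register, or None."""
    holder = None
    for _, host, action, ev_ip, _ in events:
        if ev_ip != ip:
            continue
        if action == "register":
            holder = host
        elif holder == host:
            holder = None
    return holder
```

Feeding it the concatenated lines from both hosts and calling `current_holder(events, "10.4.124.242")` reports which host last registered the address. Syslog only has one-second resolution, so the ordering of events from different hosts within the same second is not reliable.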
 
Database logs

The database instance could not connect to ASM, so it restarted.

ORA-15064: communication failure with ASM instance

ORA-03113: end-of-file on communication channel
 
 
 
ASM logs

The ASM instance alert log shows:
 
Wed Nov 23 18:22:29 2011
 
NOTE: client exited [13858]
 
Wed Nov 23 18:22:29 2011
 
NOTE: ASMB process exiting, either shutdown is in progress
 
NOTE: or foreground connected to ASMB was killed.
 
Wed Nov 23 18:22:29 2011
 
PMON (ospid: 13797): terminating the instance due to error 481
 
Wed Nov 23 18:22:29 2011
 
ORA-1092 : opitsk aborting process
 
Wed Nov 23 18:22:30 2011
 
ORA-1092 : opitsk aborting process
 
Wed Nov 23 18:22:30 2011
 
ORA-1092 : opitsk aborting process
 
Wed Nov 23 18:22:30 2011
 
ORA-1092 : opitsk aborting process
 
Wed Nov 23 18:22:30 2011
 
License high water mark = 16
 
Instance terminated by PMON, pid = 13797
 
USER (ospid: 9488): terminating the instance
 
Instance terminated by USER, pid = 948
 
 
 
ocssd.log: the remote node still has a disk heartbeat (HB) but no network heartbeat
 
2011-11-23 18:22:20.512: [    CSSD][1111939392]clssnmPollingThread: node dtydb1 (1) is impending reconfig, flag 394254, misstime 15910
 
2011-11-23 18:22:20.512: [    CSSD][1111939392]clssnmPollingThread: local diskTimeout set to 27000 ms, remote disk timeout set to 27000, impending reconfig status(1)
 
2011-11-23 18:22:20.512: [    CSSD][1106696512]clssnmvDHBValidateNCopy: node 1, dtydb1, has a disk HB, but no network HB, DHB has rcfg 216519746, wrtcnt, 1004978, LATS 1030715744, lastSeqNo 946497, uniqueness 1321449141, timestamp 1322043740/933687024
 
2011-11-23 18:22:21.515: [    CSSD][1106696512]clssnmvDHBValidateNCopy: node 1, dtydb1, has a disk HB, but no network HB, DHB has rcfg 216519746, wrtcnt, 1004980, LATS 1030716744, lastSeqNo 1004978, uniqueness 1321449141, timestamp 1322043741/933688024
 
2011-11-23 18:22:22.518: [    CSSD][1106696512]clssnmvDHBValidateNCopy: node 1, dtydb1, has a disk HB, but no network HB, DHB has rcfg 216519746, wrtcnt, 1004982, LATS 1030717754, lastSeqNo 1004980, uniqueness 1321449141, timestamp 1322043742/933689044
 
2011-11-23 18:22:23.520: [    CSSD][1106696512]clssnmvDHBValidateNCopy: node 1, dtydb1, has a disk HB, but no network HB, DHB has rcfg 216519746, wrtcnt, 1004984, LATS 1030718754, lastSeqNo 1004982, uniqueness 1321449141, timestamp 1322043743/933690044
 
2011-11-23 18:22:24.140: [    CSSD][1113516352]clssnmSendingThread: sending status msg to all nodes
 
2011-11-23 18:22:24.141: [    CSSD][1113516352]clssnmSendingThread: sent 4 status msgs to all nodes
 
2011-11-23 18:22:24.523: [    CSSD][1106696512]clssnmvDHBValidateNCopy: node 1, dtydb1, has a disk HB, but no network HB, DHB has rcfg 216519746, wrtcnt, 1004986, LATS 1030719754, lastSeqNo 1004984, uniqueness 1321449141, timestamp 1322043744/933691044
 
2011-11-23 18:22:25.525: [    CSSD][1106696512]clssnmvDHBValidateNCopy: node 1, dtydb1, has a disk HB, but no network HB, DHB has rcfg 216519746, wrtcnt, 1004988, LATS 1030720754, lastSeqNo 1004986, uniqueness 1321449141, timestamp 1322043745/933692044
 
2011-11-23 18:22:26.527: [    CSSD][1106696512]clssnmvDHBValidateNCopy: node 1, dtydb1, has a disk HB, but no network HB, DHB has rcfg 216519746, wrtcnt, 1004990, LATS 1030721764, lastSeqNo 1004988, uniqueness 1321449141, timestamp 1322043746/933693044
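
To see how long the network heartbeat was missing, the repeated `clssnmvDHBValidateNCopy` messages can be summarized per node. This is a rough sketch that assumes the line layout shown in the excerpt above; the function name `missing_network_hb` is hypothetical and not part of any Oracle tool.

```python
import re

# Matches the "disk HB, but no network HB" lines as they appear in the excerpt.
HB_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}):.*"
    r"clssnmvDHBValidateNCopy: node (?P<num>\d+), (?P<name>[^,]+), "
    r"has a disk HB, but no network HB"
)


def missing_network_hb(lines):
    """Map node name -> (first timestamp, last timestamp, message count)."""
    summary = {}
    for line in lines:
        m = HB_RE.match(line.strip())
        if m is None:
            continue  # some other CSSD message
        name, ts = m.group("name"), m.group("ts")
        first, _, count = summary.get(name, (ts, ts, 0))
        summary[name] = (first, ts, count + 1)
    return summary
```

Calling `missing_network_hb(open("ocssd.log"))` returns, for each affected node, the first and last time the message was logged and how often it appeared, which can then be compared against the ping log timestamps.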
 
After deploying a monitoring script, the ping log showed:

Packet loss started at 18:21:56 (packets 117-150 were lost)
 
64 bytes from 192.168.100.1: icmp_seq=114 ttl=64 time=0.342 ms
 
64 bytes from 192.168.100.1: icmp_seq=115 ttl=64 time=0.444 ms
 
64 bytes from 192.168.100.1: icmp_seq=116 ttl=64 time=0.153 ms
 
--- 192.168.100.1 ping statistics ---
 
150 packets transmitted, 116 received, 22% packet loss, time 149054ms
 
rtt min/avg/max/mdev = 0.084/0.246/0.485/0.099 ms
 
Wed Nov 23 18:22:31 CST 2011
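
The original monitoring script is not shown. As an illustration only, the sketch below extracts the lost sequence numbers from one block of `ping` output. It assumes Linux `ping` output in the format above, with `icmp_seq` starting at 1 as in the excerpt; the names `lost_sequences` and `to_ranges` are hypothetical.

```python
import re

SEQ_RE = re.compile(r"icmp_seq=(\d+)")
SUMMARY_RE = re.compile(r"(\d+) packets transmitted, (\d+) received")


def lost_sequences(ping_output):
    """Return the icmp_seq numbers that were sent but never answered."""
    received = {int(n) for n in SEQ_RE.findall(ping_output)}
    summary = SUMMARY_RE.search(ping_output)
    if summary is None:
        raise ValueError("no ping statistics line found")
    transmitted = int(summary.group(1))
    # Assumption: sequence numbers run from 1 to the transmitted count.
    return [seq for seq in range(1, transmitted + 1) if seq not in received]


def to_ranges(seqs):
    """Collapse a sorted list of integers into (start, end) ranges."""
    ranges = []
    for seq in seqs:
        if ranges and seq == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], seq)
        else:
            ranges.append((seq, seq))
    return ranges
```

For a run like the one above, where replies 1 through 116 arrive out of 150 packets sent, `to_ranges(lost_sequences(text))` yields a single range covering 117 to 150, matching the loss window noted in the log.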
 
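Further analysis

The analysis above largely confirms the cause: packet loss on the RAC private interconnect. When CSS network heartbeats between nodes go missing for too long, Clusterware evicts a node to avoid split-brain, which is why one host rebooted. But why were packets being lost? The hosts' network configuration checked out, so the only option was to ask the network engineers for help.

Using packet captures, the network specialist found the following.

Several observations, quoted from the reply email:

1. At 4:02:09, 192.168.100.1 sent a ping request from the NIC e4cc, but 192.168.100.2 did not deliver the reply packet to e4cc;

2. The ping request packets sent by 192.168.100.2 were not delivered to 192.168.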