A PowerHA Case: Troubleshooting a Lost Default Gateway

2011-03-16 blue_stone

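Today I configured a PowerHA 6.1 cluster on two AIX 5.3 servers. Each machine uses one network adapter as the service adapter, and the service address is configured as an IP alias. After configuration, both nodes started normally, but after a resource group move the host could no longer reach machines on other subnets; in other words, the default route stopped working. This matches the description in post [1] exactly. The routing table before the resource group move was: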

# netstat -rn
Routing tables
Destination        Gateway           Flags   Refs     Use  If   Exp  Groups
Route Tree for Protocol Family 2 (Internet):
default            10.209.3.62       UG        0         5 en2      -      -   
10.209.3.0         10.209.3.45       UHSb      0         0 en2      -      -   =>
10.209.3/26        10.209.3.45       U         1         0 en2      -      -   
10.209.3.45        127.0.0.1         UGHS      0         0 lo0      -      -   
10.209.3.63        10.209.3.45       UHSb      0         0 en2      -      -   
127/8              127.0.0.1         U         8     24709 lo0      -      -   
192.168.2.0        192.168.2.45      UHSb      0         0 en2      -      -   =>
192.168.2/26       192.168.2.45      U         2      3240 en2      -      -   
192.168.2.45       127.0.0.1         UGHS      0      5739 lo0      -      -   
192.168.2.63       192.168.2.45      UHSb      0      1395 en2      -      -   
192.168.3.0        192.168.3.45      UHSb      0         0 en0      -      -   =>
192.168.3/26       192.168.3.45      U         2      3869 en0      -      -   
192.168.3.45       127.0.0.1         UGHS      0     13164 lo0      -      -   
192.168.3.63       192.168.3.45      UHSb      0       768 en0      -      -   
192.168.100.0      192.168.100.36    UHSb      0         0 en3      -      -   =>
192.168.100/26     192.168.100.36    U         1      7078 en3      -      -   
192.168.100.36     127.0.0.1         UGHS      0      6343 lo0      -      -   
192.168.100.63     192.168.100.36    UHSb      0         4 en3      -      -   
Route Tree for Protocol Family 24 (Internet v6):
::1                ::1               UH        0       591 lo0      -      -   

After the resource group move, the routing table became:

# netstat -rn
Routing tables
Destination        Gateway           Flags   Refs     Use  If   Exp  Groups
Route Tree for Protocol Family 2 (Internet):
default            10.209.3.62       U         0         0 en0      -      -   
10.209.3.0         10.209.3.45       UHSb      0         0 en0      -      -   =>
10.209.3/26        10.209.3.45       U         0         1 en0      -      -   
10.209.3.45        127.0.0.1         UGHS      0         1 lo0      -      -   
10.209.3.63        10.209.3.45       UHSb      0         0 en0      -      -   
127/8              127.0.0.1         U         4     25581 lo0      -      -   
192.168.2.0        192.168.2.45      UHSb      0         0 en2      -      -   =>
192.168.2/26       192.168.2.45      U         0      3476 en2      -      -   
192.168.2.45       127.0.0.1         UGHS      0      5850 lo0      -      -   

The routing information above is taken from reference [1].

The main change is in the default route, which went from:

default            10.209.3.62       UG        0         5 en2      -      -   

to:

default            10.209.3.62       U         0         0 en0      -      -   

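In the Flags column of the AIX netstat -rn output, U means the route is up and G means the route goes through a gateway. After the move, the default route no longer carries the G flag and is bound to en0 instead of en2, so it is no longer treated as a gateway route. That is consistent with hosts on other subnets becoming unreachable.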
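
The flag check described above can be scripted. The following is a minimal sketch, assuming the `netstat -rn` column layout shown in the listings (destination in column 1, flags in column 3); the function names `default_route_flags` and `default_route_is_gateway` are made up for this illustration and are not part of AIX or PowerHA.

```shell
#!/bin/sh
# Hypothetical helpers (names are illustrative, not AIX or PowerHA commands).

# Print the Flags column of the first "default" route found in
# `netstat -rn` output supplied on stdin. Prints nothing if there is none.
default_route_flags() {
    awk '$1 == "default" { print $3; exit }'
}

# Succeed (exit status 0) only if the default route on stdin has the G flag.
default_route_is_gateway() {
    case "$(default_route_flags)" in
        *G*) return 0 ;;
        *)   return 1 ;;
    esac
}

# Possible use on a live system after a resource group move:
#   netstat -rn | default_route_is_gateway || echo "default route has no G flag"
```

Reading from stdin keeps the helpers usable both on a live `netstat -rn` pipe and on a saved copy of the routing table taken before the move.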

The HACMP log contains the following:

+filetrans_rg:clifconfig[207] ifconfig en10 delete 192.168.0.13
+filetrans_rg:cl_swap_IP_address[+1280] [[ -n  ]]
+filetrans_rg:cl_swap_IP_address[+1303] /usr/es/sbin/cluster/.restore_routes
+filetrans_rg:.restore_routes[+9] date
+filetrans_rg:.restore_routes[+9] : Starting /usr/es/sbin/cluster/.restore_routes at Wed Mar 16 17:04:44 BEIST 2011
+filetrans_rg:.restore_routes[+11] cl_route_change default 127.0.0.1 192.168.0.254 inet
+filetrans_rg:cl_swap_IP_address[+1304] : Completed /usr/es/sbin/cluster/.restore_routes with return code 0.
+filetrans_rg:cl_swap_IP_address[+1304] [[ __AIX__ = __AIX__ ]]
+filetrans_rg:cl_swap_IP_address[+1305] enable_pmtu_gated
Setting tcp_pmtu_discover to 1
Setting udp_pmtu_discover to 1
+filetrans_rg:cl_swap_IP_address[+1308] cl_hats_adapter en10 -d 192.168.0.13 alias
+filetrans_rg:cl_hats_adapter[+50] [[ high = high ]]
+filetrans_rg:cl_hats_adapter[+50] version=1.40
+filetrans_rg:cl_hats_adapter[+51] +filetrans_rg:cl_hats_adapter[+51] cl_get_path
HA_DIR=es
+filetrans_rg:cl_hats_adapter[+52] +filetrans_rg:cl_hats_adapter[+52] cl_get_path -S
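
In a long event trace it can help to pull out only the route-restore step. The sketch below assumes trace lines in the format shown above (a `+group:script[+line]` prefix followed by the command); the function name `route_change_calls` is hypothetical, and the log path in the usage comment is an assumption that depends on how the cluster is configured.

```shell
#!/bin/sh
# Hypothetical helper (illustrative name, not a PowerHA utility).

# Read an event trace on stdin and print every cl_route_change invocation
# with its arguments, stripped of the "+group:script[+line] " trace prefix.
route_change_calls() {
    sed -n 's/^+[^ ]*] \(cl_route_change .*\)$/\1/p'
}

# Possible use (log location is an assumption; adjust to your cluster):
#   route_change_calls < /var/hacmp/log/hacmp.out
```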


As the log shows, the script that restores routes in HACMP is /usr/es/sbin/cluster/.restore_routes. Its contents are:

#cat /usr/es/sbin/cluster/.restore_routes
#!/bin/ksh
#
# Script created by cl_swap_IP_address on Wed Mar 16 17:04:44 BEIST 2011
#
PATH=/usr/es/sbin/cluster:/usr/es/sbin/cluster/utilities:/usr/es/sbin/cluster/events:/usr/es/sbin/cluster/events/utils:/usr/es/sbin/cluster/events/cmd:/usr/es/sbin/cluster/diag:/usr/es/sbin/cluster/etc:/usr/es/sbin/cluster/sbin:/usr/es/sbin/cluster/cspoc:/usr/es/sbin/cluster/conversion:/usr/es/sbin/cluster/events/emulate:/usr/es/sbin/cluster/events/emulate/driver:/usr/es/sbin/cluster/events/emulate/utils:/usr/es/sbin/cluster/tguides/bin:/usr/es/sbin/cluster/tguides/classes:/usr/es/sbin/cluster/tguides/images:/usr/es/sbin/cluster/tguides/scripts:/usr/es/sbin/cluster/glvm/utils:/usr/es/sbin/cluster/wpar:/usr/bin:/etc:/usr/sbin:/usr/ucb:/usr/bin/X11:/sbin
PS4='${GROUPNAME:++$GROUPNAME}:${PROGNAME:-${0##*/}}${PS4_TIMER:+($SECONDS)}${PS4_LOOP:+:$PS4_LOOP}[${ERRNO:+${PS4_FUNC:-}+}$LINENO] '
export VERBOSE_LOGGING=${VERBOSE_LOGGING:-"high"}
[[ "$VERBOSE_LOGGING" = "high" ]] && set -x
: Starting $0 at $(date)
#
cl_route_change default 127.0.0.1 192.168.0.254 inet

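The command that actually changes the route is cl_route_change, which is a binary. Searching IBM's site and Google for cl_route_change turns up results [2][3][4][5], and these articles confirm that this is a HACMP bug. It can be fixed by installing efix IZ63775 or by upgrading to PowerHA 6.1 SP01.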

Since I had never used an efix before, I tried it out today. Here is the record:

[/tmp/hacmp]#emgr -e IZ63775.epkg.Z 
+-----------------------------------------------------------------------------+
Efix Manager Initialization
+-----------------------------------------------------------------------------+
Initializing log /var/adm/ras/emgr.log ...
Efix package file is: /tmp/hacmp/IZ63775.epkg.Z
MD5 generating command is /usr/bin/csum
MD5 checksum is 8ba66435963cf3318502e7953bfebf8a
Accessing efix metadata ...
Processing efix label "IZ63775" ...
Verifying efix control file ...

+-----------------------------------------------------------------------------+
Installp Prerequisite Verification
+-----------------------------------------------------------------------------+
Verifying prerequisite file ...
Checking prerequisites ...

Prerequisite Number: 1
   Fileset: cluster.es.server.events
   Minimal Level: 6.1.0.0
   Maximum Level: 6.1.0.0
   Actual Level: 6.1.0.0
   Type: PREREQ
   Requisite Met: yes

All prerequisites have been met.

+-----------------------------------------------------------------------------+
Processing APAR reference file
+-----------------------------------------------------------------------------+
APAR reference set to NONE.  Interim fix is not enabled for automatic removal.

+-----------------------------------------------------------------------------+
Efix Attributes
+-----------------------------------------------------------------------------+
LABEL:            IZ63775
PACKAGING DATE:   Fri Oct 23 12:22:46 CDT 2009
ABSTRACT:         Deflt route prblm in base HA 610
PACKAGER VERSION: 7
VUID:             00CCCC5B4C00102312104609
REBOOT REQUIRED:  no
BUILD BOOT IMAGE: no
PRE-REQUISITES:   yes
SUPERSEDE:        no
PACKAGE LOCKS:    no
E2E PREREQS:      no
FIX TESTED:       no
ALTERNATE PATH:   None
EFIX FILES:       1

Install Scripts:
   PRE_INSTALL:   no
   POST_INSTALL:  no
   PRE_REMOVE:    no
   POST_REMOVE:   no

File Number:      1
   LOCATION:      /usr/es/sbin/cluster/events/utils/cl_route_change
   FILE TYPE:     Standard (file or executable)
   INSTALLER:     installp
   SIZE:          76
   ACL:           DEFAULT
   CKSUM:         44210
   PACKAGE:       cluster.es.server.events
   MOUNT INST:    no

+-----------------------------------------------------------------------------+
Efix Description
+-----------------------------------------------------------------------------+
This is a fix to cl_route_change for a problem introduced
in base PowerHA 610.

+-----------------------------------------------------------------------------+
Efix Lock Management
+-----------------------------------------------------------------------------+
Checking locks for file /usr/es/sbin/cluster/events/utils/cl_route_change ...

All files have passed lock checks.

+-----------------------------------------------------------------------------+
Space Requirements
+-----------------------------------------------------------------------------+
Checking space requirements ...

Space statistics (in 512 byte-blocks):
File system: /usr, Free: 16042168, Required: 1288, Deficit: 0.
File system: /tmp, Free: 7191192, Required: 2570, Deficit: 0.

+-----------------------------------------------------------------------------+
Efix Installation Setup
+-----------------------------------------------------------------------------+
Unpacking efix package file ...
Initializing efix installation ...

+-----------------------------------------------------------------------------+
Efix State
+-----------------------------------------------------------------------------+
Setting efix state to: INSTALLING

+-----------------------------------------------------------------------------+
File Archiving
+-----------------------------------------------------------------------------+
Saving all files that will be replaced ...
Save directory is: /usr/emgrdata/efixdata/IZ63775/save
File 1: Saving /usr/es/sbin/cluster/events/utils/cl_route_change as EFSAVE1 ...

+-----------------------------------------------------------------------------+
Efix File Installation
+-----------------------------------------------------------------------------+
Installing all efix files:
Installing efix file #1 (File: /usr/es/sbin/cluster/events/utils/cl_route_change) ...
/usr/sbin/emgr[160]: query:  not found.

Total number of efix files installed is 1.
All efix files installed successfully.

+-----------------------------------------------------------------------------+
Package Locking
+-----------------------------------------------------------------------------+
Processing package locking for all files.
File 1: locking installp fileset cluster.es.server.events.

All package locks processed successfully.

+-----------------------------------------------------------------------------+
Reboot Processing
+-----------------------------------------------------------------------------+
Reboot is not required by this efix package.

+-----------------------------------------------------------------------------+
Efix State
+-----------------------------------------------------------------------------+
Setting efix state to: STABLE

+-----------------------------------------------------------------------------+
Operation Summary
+-----------------------------------------------------------------------------+
Log file is /var/adm/ras/emgr.log

EPKG NUMBER       LABEL               OPERATION              RESULT            
===========       ==============      =================      ==============    
1                 IZ63775             INSTALL                SUCCESS           

Return Status = SUCCESS


[/tmp]#emgr -l

ID  STATE LABEL      INSTALL TIME       ABSTRACT
=== ===== ========== ================== ======================================
1    S    IZ63775    03/16/11 16:08:28  Deflt route prblm in base HA 610    

STATE codes:
 S = STABLE
 M = MOUNTED
 U = UNMOUNTED
 Q = REBOOT REQUIRED
 B = BROKEN
 I = INSTALLING
 R = REMOVING
 T = TESTED
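
The listing above is easy to check from a script, for example before starting cluster services. This is a minimal sketch, assuming the `emgr -l` layout shown here (numeric ID in column 1, state code in column 2, label in column 3); the function name `efix_state` is hypothetical and not part of emgr.

```shell
#!/bin/sh
# Hypothetical helper (illustrative name, not an emgr option).

# Read `emgr -l` output on stdin and print the STATE code of the efix whose
# LABEL matches $1. Prints nothing if the label is not listed.
efix_state() {
    awk -v label="$1" '$1 ~ /^[0-9]+$/ && $3 == label { print $2; exit }'
}

# Possible use on a live system:
#   [ "$(emgr -l 2>/dev/null | efix_state IZ63775)" = "S" ] || echo "IZ63775 is not STABLE"
```

Matching on a numeric first column skips the header and separator lines, as well as the one-line message printed when no efix is installed.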


When no efix is installed, for example after the removal shown below, the same command reports:

[/tmp]#emgr -l
There is no efix data on this system.


[/tmp]#emgr -r -L  IZ63775 
+-----------------------------------------------------------------------------+
Efix Manager Initialization
+-----------------------------------------------------------------------------+
Initializing log /var/adm/ras/emgr.log ...
Accessing efix metadata ...
Processing efix label "IZ63775" ...

+-----------------------------------------------------------------------------+
Efix Attributes
+-----------------------------------------------------------------------------+
LABEL:            IZ63775
INSTALL DATE:     03/16/11 16:08:28
STATE:            STABLE
ABSTRACT:         Deflt route prblm in base HA 610
PACKAGER VERSION: 7
VUID:             00CCCC5B4C00102312104609
REBOOT REQUIRED:  no
BUILD BOOT IMAGE: no
PRE-REQUISITES:   yes
SUPERSEDE:        no
PACKAGE LOCKS:    no
E2E PREREQS:      no
FIX TESTED:       no
ALTERNATE PATH:   None
EFIX FILES:       1

Install Scripts:
   PRE_INSTALL:   no
   POST_INSTALL:  no
   PRE_REMOVE:    no
   POST_REMOVE:   no

File Number:      1
   LOCATION:      /usr/es/sbin/cluster/events/utils/cl_route_change
   FILE TYPE:     Standard (file or executable)
   INSTALLER:     installp
   SIZE:          76
   ACL:           DEFAULT
   CKSUM:         44210
   PACKAGE:       cluster.es.server.events
   MOUNT INST:    no

+-----------------------------------------------------------------------------+
Efix Description
+-----------------------------------------------------------------------------+
This is a fix to cl_route_change for a problem introduced
in base PowerHA 610.

+-----------------------------------------------------------------------------+
Space Requirements
+-----------------------------------------------------------------------------+
Checking space requirements ...

Space statistics (in 512 byte-blocks):
File system: /usr, Free: 16041936, Required: 1247, Deficit: 0.

+-----------------------------------------------------------------------------+
Efix State
+-----------------------------------------------------------------------------+
Setting efix state to: REMOVING

+-----------------------------------------------------------------------------+
Package Locking
+-----------------------------------------------------------------------------+
Processing package unlocking for all files.
File 1: unlocking installp fileset cluster.es.server.events.

All package locks processed successfully.

+-----------------------------------------------------------------------------+
Efix File Removal
+-----------------------------------------------------------------------------+
Setting up for removal of efix files ...
Removing all efix files (in reverse order of installation):
Removing efix file #1 (File: /usr/es/sbin/cluster/events/utils/cl_route_change) ...

Total number of efix files removed is 1.

+-----------------------------------------------------------------------------+
Reboot Processing
+-----------------------------------------------------------------------------+
Reboot is not required by this efix package.

+-----------------------------------------------------------------------------+
Operation Summary
+-----------------------------------------------------------------------------+
Log file is /var/adm/ras/emgr.log

EFIX NUMBER       LABEL               OPERATION              RESULT            
===========       ==============      =================      ==============    
1                 IZ63775             REMOVE                 SUCCESS           

Return Status = SUCCESS


The three magic weapons of a systems engineer are reboot, reinstall, and patch. There is a lot of truth in that.

[1] 
[2] 
[3] 
[4] 
[5] 
