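libvirtd service becomes unresponsive under CloudStack
With CloudStack 4.5.2, the libvirtd service occasionally stops responding. virsh commands become unusable, and at the same time the CloudStack master loses its connection to the affected slave host. I first suspected libvirtd itself or its version, but after analysis and troubleshooting the problem was traced to cloudstack-agent. I found no similar bug report on the official site, so the problem may still exist in later versions, and a root-cause analysis will take more time. The handling process is recorded below for anyone who follows or uses CloudStack.
As everyone knows, the CloudStack community is far less active than OpenStack's, so why choose CloudStack anyway? I will discuss that some other time. Back to the topic.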
Environment
Host operating system: CentOS 6.5 x64 (2.6.32-431.el6.x86_64)
CloudStack version: 4.5.2
libvirt version: libvirt-0.10.2-54.el6_7.2.x86_64
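Problem description
The alert raised through the CloudStack API listHosts call showed:
node5.cloud.rtmap:192.168.14.20 state is Down at 2016-05-13T07:19:04+0800
# How to use the CloudStack API is covered in other articles and is not explained here.
Log in to the affected host and check:
[root@node5 log]# virsh list --all
No response; exit with Ctrl+C.
At this point the VMs keep working normally, but they are out of management control.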
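Because a hung libvirtd makes virsh block until Ctrl+C, it can help to run the probe under a time limit. The following is a minimal sketch, assuming the coreutils timeout command is available on the host; probe_with_timeout is a hypothetical helper name, not part of libvirt or CloudStack.

```shell
#!/bin/bash
# probe_with_timeout SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND under a time limit and prints
# "ok", "failed" or "hung" so a blocked daemon cannot block the shell.
probe_with_timeout() {
    local limit="$1"; shift
    # timeout(1) terminates the command after $limit seconds and
    # returns exit status 124 when the limit was hit.
    timeout "$limit" "$@" >/dev/null 2>&1
    local rc=$?
    if [ "$rc" -eq 124 ]; then
        echo "hung"
    elif [ "$rc" -eq 0 ]; then
        echo "ok"
    else
        echo "failed"
    fi
}

# Usage on the host (assumes virsh is installed):
#   probe_with_timeout 10 virsh list --all
```

A monitoring script could call this periodically and raise an alert on "hung" before the management server marks the host Down.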
Try to restart the libvirtd service:
[root@node5 log]# service libvirtd stop
Stopping libvirtd daemon:                                  [FAILED]    # libvirtd could not be stopped
Try to restart the cloudstack-agent service:
[root@node5 libvirt]# service cloudstack-agent restart
Stopping Cloud Agent:
Starting Cloud Agent:
The libvirtd fault persisted.
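Basic maintenance
[root@node5 ping]# libvirtd -d -l --config /etc/libvirt/libvirtd.conf
libvirtd: error: Unable to initialize network sockets. Check /var/log/messages or run without --daemon for more info.
[root@node5 log]# libvirtd -d
This succeeds, and virsh list --all can now show and operate on the VMs:
[root@node5 log]# virsh list --all
 Id    Name                           State
----------------------------------------------------
 2     i-4-185-VM                     running
The VMs are running normally and can now be managed from the command line. From the CloudStack platform's point of view, however, the host is still Down and its VMs are still out of control.
The temporary workaround is to reboot the server during some other major upgrade or maintenance window; a real fix requires analyzing the specific problem.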
Analysis and troubleshooting
Check the process
[root@node5 log]# ps ax | grep libvirtd
  6485 ?        R    863:37 libvirtd --daemon -l    # the process stays in the R (running) state
[root@node5 log]# top -p 6485
top - 09:19:41 up 12 days, 22:27,  1 user,  load average: 3.05, 5.07, 6.64
Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
Cpu(s):  4.8%us,  1.4%sy,  0.0%ni, 93.1%id,  0.6%wa,  0.0%hi,  0.1%si,  0.0%st
Mem:  264420148k total, 182040780k used, 82379368k free,   834232k buffers
Swap:  8388600k total,       92k used,  8388508k free, 100453708k cached
  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6485 root      20   0  984m  12m 4440 R 100.2  0.0 844:22.68 libvirtd    # CPU usage is stuck at 100% and never released, which affects system stability
Kill the process
[root@node5 log]# kill -9 6485
[root@node5 log]# kill -9 6485
[root@master log]# ps ax | grep libvirtd    # the process is still there
  6485 ?        R    863:37 libvirtd --daemon -l
[root@node5 ~]# libvirtd -d -l --config /etc/libvirt/libvirtd.conf
libvirtd: error: Unable to initialize network sockets. Check /var/log/messages or run without --daemon for more info.
[root@node5 ~]# netstat -antp | grep 16509
tcp        0      0 0.0.0.0:16509               0.0.0.0:*                   LISTEN      3658/libvirtd
tcp        1      0 192.168.14.25:16509         192.168.14.22:8717          CLOSE_WAIT  -
tcp        1      0 192.168.14.25:16509         192.168.14.20:5152          CLOSE_WAIT  -
tcp        1      0 192.168.14.25:16509         192.168.14.10:39359         CLOSE_WAIT  -
tcp        0      0 :::16509                    :::*                        LISTEN      3658/libvirtd
tcp       39      0 ::1:16509                   ::1:19715                   CLOSE_WAIT  -
From the steps above, the preliminary conclusion is that libvirtd is hung.
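In the netstat output above, the CLOSE_WAIT entries are connections that the remote side has already closed but that the local process has not yet closed, which fits a daemon that no longer services its sockets. To track this over time, the states can be tallied per port. This is a rough sketch; count_tcp_states is a hypothetical helper, and it assumes the column layout of netstat -ant shown above (local address in column 4, state in column 6).

```shell
#!/bin/bash
# count_tcp_states PORT
# Hypothetical helper: reads `netstat -ant`-style lines on stdin and prints
# "STATE COUNT" for every TCP socket whose local address ends in :PORT.
count_tcp_states() {
    local port="$1"
    awk -v port="$port" '
        # $4 is the local address, $6 is the connection state
        $1 ~ /^tcp/ && $4 ~ (":" port "$") { n[$6]++ }
        END { for (s in n) print s, n[s] }
    ' | sort
}

# Usage on the host:
#   netstat -ant | count_tcp_states 16509
```

A CLOSE_WAIT count that keeps growing on the libvirtd TCP port (16509 here) would be an early hint that the daemon has stopped handling its connections.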
Trace the process
[root@node5 log]# strace -f libvirtd
[pid 107570] close(23058)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23059)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23060)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23061)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23062)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23063)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23064)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23065)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23066)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23067)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23068)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23069)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23070)               = -1 EBADF (Bad file descriptor)
[pid 107570] close(23071)               = -1 EBADF (Bad file descriptor)
^C[pid 107570] close(23072 <unfinished ...>
Process 107559 detached
Process 107560 detached
Process 107561 detached
Process 107562 detached
Process 107563 detached
Process 107564 detached
Process 107565 detached
Process 107566 detached
Process 107567 detached
Process 107568 detached
Process 107569 detached
Process 107570 detached
The parent process 6485 keeps spawning and closing child processes, which return errors. What causes the Bad file descriptor errors (how are they triggered, and by whom)? Why can the loop not exit? How can the problem be reproduced?
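Two notes on the trace. First, strace -f libvirtd launches a new libvirtd under trace; strace can also attach to a process that is already running with -p, for example strace -f -p 6485 -o /tmp/libvirtd.trace (the output path is only an example). Second, a long trace is easier to read once failed calls are grouped. The sketch below does that with awk; summarize_strace_errors is a hypothetical helper that assumes the line format shown above.

```shell
#!/bin/bash
# summarize_strace_errors
# Hypothetical helper: reads strace output on stdin and prints
# "SYSCALL ERRNO COUNT" for every call that returned -1.
summarize_strace_errors() {
    awk '/= -1 E[A-Z0-9]+ / {
        line = $0
        sub(/^\[pid +[0-9]+\] +/, "", line)      # drop the "[pid N]" prefix, if any
        name = line; sub(/\(.*/, "", name)       # syscall name is the text before "("
        err = line; sub(/.*= -1 /, "", err)      # keep what follows "= -1 "
        sub(/ .*/, "", err)                      # and cut after the errno symbol
        c[name " " err]++
    } END { for (k in c) print k, c[k] }' | sort
}

# Usage with a saved trace:
#   summarize_strace_errors < /tmp/libvirtd.trace
```

Unfinished calls such as the final close(23072 line are ignored, because they carry no return value.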
Getting more clues
Official documentation (it records libvirtd faults and their fixes in great detail)
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Virtualization_Deployment_and_Administration_Guide/sect-Troubleshooting-Common_libvirt_errors_and_troubleshooting.html#sect-libvirtd_failed_to_start
Enable system logging
Change libvirt's logging in /etc/libvirt/libvirtd.conf by enabling the line below. To enable the setting the line, open the /etc/libvirt/libvirtd.conf file in a text editor, remove the hash (or #) symbol from the beginning of the following line, and save the change:
log_outputs="3:syslog:libvirtd"
Apply this setting, reboot the server, and wait for the next occurrence to inspect the log:
......
Jun  1 12:42:26 node5 abrtd: New client connected
Jun  1 12:42:26 node5 abrtd: Directory 'pyhook-2016-06-01-12:42:26-70065' creation detected
Jun  1 12:42:26 node5 abrt-server[70066]: Saved Python crash dump of pid 70065 to /var/spool/abrt/pyhook-2016-06-01-12:42:26-70065
Jun  1 12:42:26 node5 abrtd: Package 'cloudstack-common' isn't signed with proper key
Jun  1 12:42:26 node5 abrtd: 'post-create' on '/var/spool/abrt/pyhook-2016-06-01-12:42:26-70065' exited with 1
Jun  1 12:42:26 node5 abrtd: Deleting problem directory '/var/spool/abrt/pyhook-2016-06-01-12:42:26-70065'
Jun  1 12:43:26 node5 abrt: detected unhandled Python exception in '/usr/share/cloudstack-common/scripts/vm/network/security_group.py'
......
Jun  6 10:36:21 node5 libvirtd: 102840: warning : qemuDomainObjBeginJobInternal:878 : Cannot start job (modify, none) for domain i-4-30-VM; current job is (modify, none) owned by (102925, 0)
Jun  6 10:36:21 node5 libvirtd: 102840: error : qemuDomainObjBeginJobInternal:883 : Timed out during operation: cannot acquire state change lock
Jun  6 10:39:59 node5 libvirtd: 114071: info : libvirt version: 0.10.2, package: 54.el6_7.2 (CentOS BuildSystem <http://bugs.centos.org>, 2015-11-10-10:25:08, c6b9.bsys.dev.centos.org)
Jun  6 10:39:59 node5 libvirtd: 114071: error : virNetSocketNewListenTCP:312 : Unable to bind to port: Address already in use
Jun  6 10:40:46 node5 libvirtd: 114147: info : libvirt version: 0.10.2, package: 54.el6_7.2 (CentOS BuildSystem <http://bugs.centos.org>, 2015-11-10-10:25:08, c6b9.bsys.dev.centos.org)
Jun  6 10:40:46 node5 libvirtd: 114147: error : virNetSocketNewListenTCP:312 : Unable to bind to port: Address already in use
Jun  6 10:42:15 node5 libvirtd: 114204: info : libvirt version: 0.10.2, package: 54.el6_7.2 (CentOS BuildSystem <http://bugs.centos.org>, 2015-11-10-10:25:08, c6b9.bsys.dev.centos.org)
Jun  6 10:42:15 node5 libvirtd: 114204: error : virNetSocketNewListenTCP:312 : Unable to bind to port: Address already in use
Jun  6 10:47:05 node5 libvirtd: 114375: info : libvirt version: 0.10.2, package: 54.el6_7.2 (CentOS BuildSystem <http://bugs.centos.org>, 2015-11-10-10:25:08, c6b9.bsys.dev.centos.org)
Jun  6 10:47:05 node5 libvirtd: 114375: error : virNetSocketNewListenTCP:312 : Unable to bind to port: Address already in use
Jun  6 10:47:23 node5 libvirtd: 114412: info : libvirt version: 0.10.2, package: 54.el6_7.2 (CentOS BuildSystem <http://bugs.centos.org>, 2015-11-10-10:25:08, c6b9.bsys.dev.centos.org)
Jun  6 10:47:23 node5 libvirtd: 114412: error : virNetSocketNewListenTCP:312 : Unable to bind to port: Address already in use
......
Jun 12 03:08:02 node5 rsyslogd: [origin software="rsyslogd" swVersion="5.8.10" x-pid="3111" x-info="http://www.rsyslog.com"] rsyslogd was HUPed
Jun 12 09:20:40 node5 libvirtd: 72575: info : libvirt version: 0.10.2, package: 54.el6_7.2 (CentOS BuildSystem <http://bugs.centos.org>, 2015-11-10-10:25:08, c6b9.bsys.dev.centos.org)
Jun 12 09:20:40 node5 libvirtd: 72575: error : virPidFileAcquirePath:410 : Failed to acquire pid file '/var/run/libvirtd.pid': Resource temporarily unavailable
No fatal error or further clues turned up. (Still, this logging option is well worth enabling; many problems can be located through it.)
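Once libvirtd logs to syslog, the interesting lines are scattered among everything else in /var/log/messages. A small filter makes repeated errors stand out. This is only a sketch: summarize_libvirtd_log is a hypothetical helper, and it assumes the syslog line layout seen in the excerpt above (level in field 7, function:line in field 9).

```shell
#!/bin/bash
# summarize_libvirtd_log
# Hypothetical helper: reads syslog lines on stdin and prints
# "LEVEL FUNCTION COUNT" for libvirtd warnings and errors.
summarize_libvirtd_log() {
    awk '$5 == "libvirtd:" && ($7 == "warning" || $7 == "error") {
        fn = $9
        sub(/:[0-9]+$/, "", fn)     # strip the source line number
        c[$7 " " fn]++
    } END { for (k in c) print k, c[k] }' | sort
}

# Usage on the host:
#   summarize_libvirtd_log < /var/log/messages
```

Run against the excerpt above, this would group the repeated "Unable to bind to port" errors, each of which corresponds to one attempt to start a second libvirtd while the hung one still held the port.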
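Resolution process
Approach
Try to find a way to terminate the process and restart the service
File a bug and wait for a patched release
Analyze the source code, reproduce the problem, and fix it (requires development effort and time)
Since the problem cannot be reproduced, it is better to start simple and work toward the complex. Who is the culprit triggering these child processes? cloudstack-agent is still the prime suspect, but restarting that service earlier did not fix anything, so what is going on with the agent service?
A look at the init script gives a basic picture: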
[root@node5 libvirt]# cat /etc/rc.d/init.d/cloudstack-agent
#!/bin/bash

# chkconfig: 35 99 10
# description: Cloud Agent

# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#   http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations
# under the License.

# WARNING: if this script is changed, then all other initscripts MUST BE changed to match it as well

. /etc/rc.d/init.d/functions

# set environment variables

SHORTNAME=$(basename $0 | sed -e 's/^[SK][0-9][0-9]//')
PIDFILE=/var/run/"$SHORTNAME".pid
LOCKFILE=/var/lock/subsys/"$SHORTNAME"
LOGDIR=/var/log/cloudstack/agent
LOGFILE=${LOGDIR}/agent.log
PROGNAME="Cloud Agent"
CLASS="com.cloud.agent.AgentShell"
JSVC=`which jsvc 2>/dev/null`;

# exit if we don't find jsvc
if [ -z "$JSVC" ]; then
    echo no jsvc found in path;
    exit 1;
fi

unset OPTIONS
[ -r /etc/sysconfig/"$SHORTNAME" ] && source /etc/sysconfig/"$SHORTNAME"

# The first existing directory is used for JAVA_HOME (if JAVA_HOME is not defined in $DEFAULT)
JDK_DIRS="/usr/lib/jvm/jre /usr/lib/jvm/java-7-openjdk /usr/lib/jvm/java-7-openjdk-i386 /usr/lib/jvm/java-7-openjdk-amd64 /usr/lib/jvm/java-6-openjdk /usr/lib/jvm/java-6-openjdk-i386 /usr/lib/jvm/java-6-openjdk-amd64 /usr/lib/jvm/java-6-sun"

for jdir in $JDK_DIRS; do
    if [ -r "$jdir/bin/java" -a -z "${JAVA_HOME}" ]; then
        JAVA_HOME="$jdir"
    fi
done
export JAVA_HOME

ACP=`ls /usr/share/cloudstack-agent/lib/*.jar | tr '\n' ':' | sed s'/.$//'`
PCP=`ls /usr/share/cloudstack-agent/plugins/*.jar 2>/dev/null | tr '\n' ':' | sed s'/.$//'`

# We need to append the JSVC daemon JAR to the classpath
# AgentShell implements the JSVC daemon methods
export CLASSPATH="/usr/share/java/commons-daemon.jar:$ACP:$PCP:/etc/cloudstack/agent:/usr/share/cloudstack-common/scripts"

start() {
    echo -n $"Starting $PROGNAME: "
    if hostname --fqdn >/dev/null 2>&1 ; then
        $JSVC -Xms256m -Xmx2048m -cp "$CLASSPATH" -pidfile "$PIDFILE" \
            -errfile $LOGDIR/cloudstack-agent.err -outfile $LOGDIR/cloudstack-agent.out $CLASS
        RETVAL=$?
        echo
    else
        failure
        echo
        echo The host name does not resolve properly to an IP address.  Cannot start "$PROGNAME". > /dev/stderr
        RETVAL=9
    fi
    [ $RETVAL = 0 ] && touch ${LOCKFILE}
    return $RETVAL
}

stop() {
    echo -n $"Stopping $PROGNAME: "
    $JSVC -pidfile "$PIDFILE" -stop $CLASS
    RETVAL=$?
    echo
    [ $RETVAL = 0 ] && rm -f ${LOCKFILE} ${PIDFILE}
}

case "$1" in
    start)
        start
        ;;
    stop)
        stop
        ;;
    status)
        status -p ${PIDFILE} $SHORTNAME
        RETVAL=$?
        ;;
    restart)
        stop
        sleep 3
        start
        ;;
    condrestart)
        if status -p ${PIDFILE} $SHORTNAME >&/dev/null; then
            stop
            sleep 3
            start
        fi
        ;;
    *)
        echo $"Usage: $SHORTNAME {start|stop|restart|condrestart|status|help}"
        RETVAL=3
esac

exit $RETVAL
[root@node5 libvirt]# ps ax | grep jsvc.exec
 6655 ?        Ss     0:00 jsvc.exec -Xms256m -Xmx2048m -cp /usr/share/java/commons-daemon.jar:/usr/share/cloudstack-agent/lib/activation-1.1.jar:/usr/share/cloudstack-agent/lib/antisamy-1.4.3.jar:/usr/share/cloudstack-agent/lib/aopalliance-1.0.jar:/usr/share/cloudstack-agent/lib/apache-log4j-extras-1.1.jar:/usr/share/cloudstack-agent/lib/aspectjweaver-1.7.0.jar:/usr/share/cloudstack-agent/lib/aws-java-sdk-1.3.22.jar:/usr/share/cloudstack-agent/lib/batik-css-1.7.jar:/usr/share/cloudstack-agent/lib/batik-ext-1.7.jar:/usr/share/cloudstack-agent/lib/batik-util-1.7.jar:/usr/share/cloudstack-agent/lib/bcprov-jdk15-1.46.jar:/usr/share/cloudstack-agent/lib/bcprov-jdk16-1.46.jar:/usr/share/cloudstack-agent/lib/bsh-core-2.0b4.jar:/usr/share/cloudstack-agent/lib/cglib-nodep-2.2.2.jar:/usr/share/cloudstack-agent/lib/cloud-agent-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-api-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-core-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-engine-api-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-engine-components-api-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-engine-schema-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-cluster-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-config-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-db-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-events-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-ipc-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-jobs-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-managed-context-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-rest-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-framework-security-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-plugin-hypervisor-kvm-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-plugin-network-ovs-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-server-4.5.2.jar:/usr/share/cloudstack-agent/lib/cloud-utils-4.5.2.jar:/usr/share/cloudstack-agent/lib/commons-beanutils-core-1.7.0.jar:/usr/share/cloudstack-agent/lib/commons-codec-1.6.jar:/usr/share/cloudstack-agent/lib/commons-collections-3.2.1.jar:/usr/share/cloudstack-agent/lib/commons-configuration-1.8.jar:/usr/share/cloudstack-agent/lib/commons-daemon-1.0.10.jar:/usr/share/cloudstack-agent/lib/commons-dbcp-1.4.jar:/usr/share/cloudstack-agent/lib/commons-fileupload-1.2.jar:/usr/share/cloudstack-agent/lib/commons-httpclient-3.1.jar:/usr/share/cloudstack-agent/lib/commons-io-1.4.jar:/usr/share/cloudstack-agent/lib/commons-lang-2.6.jar:/usr/share/cloudstack-agent/lib/commons-logging-1.1.3.jar:/usr/share/cloudstack-agent/lib/commons-net-3.3.jar:/usr/share/cloudstack-agent/lib/commons-pool-1.6.jar:/usr/share/cloudstack-agent/lib/cxf-bundle-jaxrs-2.7.0.jar:/usr/share/cloudstack-agent/lib/dom4j-1.6.1.jar:/usr/share/cloudstack-agent/lib/ehcache-core-2.6.6.jar:/usr/share/cloudstack-agent/lib/ejb-api-3.0.jar:/usr/share/cloudstack-agent/lib/esapi-2.0.1.jar:/usr/share/cloudstack-agent/lib/geronimo-javamail_1.4_spec-1.7.1.jar:/usr/share/cloudstack-agent/lib/geronimo-servlet_3.0_spec-1.0.jar:/usr/share/cloudstack-agent/lib/gson-1.7.2.jar:/usr/share/cloudstack-agent/lib/guava-14.0-rc1.jar:/usr/share/cloudstack-agent/lib/httpclient-4.3.6.jar:/usr/share/cloudstack-agent/lib/httpcore-4.3.3.jar:/usr/share/cloudstack-agent/lib/jackson-annotations-2.1.1.jar:/usr/share/cloudstack-agent/lib/jackson-core-2.1.1.jar:/usr/share/cloudstack-agent/lib/jackson-core-asl-1.8.9.jar:/usr/share/cloudstack-agent/lib/jackson-databind-2.1.1.jar:/usr/share/cloudstack-agent/lib/jackson-jaxrs-json-provider-2.1.1.jar:/usr/share/cloudstack-agent/lib/jackson-mapper-asl-1.8.9.jar:/usr/share/cloudstack-agent/lib/jackson-module-jaxb-annotations-2.1.1.jar:/usr/share/cloudstack-agent/lib/jasypt-1.9.0.jar:/usr/share/cloudstack-agent/lib/java-ipv6-0.10.jar:/usr/share/cloudstack-agent/lib/javassist-3.12.1.GA.jar:/usr/share/cloudstack-agent/lib/javassist-3.18.1-GA.jar:/usr/share/cloudstack-agent/lib/javax.inject-1.jar:/usr/share/cloudstack-agent/lib/javax.persistence-2.0.0.jar:/usr/share/cloudstack-agent/lib/javax.ws.rs-api-2.0-m10.jar
 6657 ?        Sl     0:05 jsvc.exec -Xms256m -Xmx2048m -cp [classpath identical to PID 6655 above]
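Restart the service
[root@node5 bin]# service cloudstack-agent status
cloudstack-agent (pid 6657) is running...
[root@node5 bin]# service cloudstack-agent stop
Stopping Cloud Agent:
[root@node5 bin]# service cloudstack-agent status
cloudstack-agent (pid 6657) is running...
ps ax | grep jsvc.exec also confirmed that the processes were still there.
That was an eye-opener, and it also exposed the problem with the earlier restart: the failed stop had been masked. Annoying? There was no time to reflect, though, because what came next was far from simple.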
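The init script's restart branch simply runs stop, sleeps 3 seconds, and runs start, without checking that the old process actually exited. A safer manual sequence is to stop, verify, and only then start. Below is a minimal sketch of the verification step; wait_for_exit is a hypothetical helper, not part of the CloudStack init script.

```shell
#!/bin/bash
# wait_for_exit PID SECONDS
# Hypothetical helper: returns 0 as soon as PID no longer exists,
# or 1 if it is still alive after roughly SECONDS seconds.
wait_for_exit() {
    local pid="$1" limit="$2" i=0
    while [ "$i" -lt "$limit" ]; do
        # kill -0 delivers no signal; it only tests whether the PID exists.
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        i=$((i + 1))
    done
    kill -0 "$pid" 2>/dev/null || return 0
    return 1
}

# Usage on the host (run as root; paths taken from the init script above):
#   pid=$(cat /var/run/cloudstack-agent.pid)
#   service cloudstack-agent stop
#   wait_for_exit "$pid" 10 || echo "agent (pid $pid) is still running"
```

With a check like this, the failed stop would have been visible on the first attempt instead of being hidden behind a successful-looking restart.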
[root@node5 bin]# kill -9 6655 6657
[root@node5 bin]# kill -9 6655 6657
-bash: kill: (6655) - No such process
-bash: kill: (6657) - No such process
[root@node5 bin]# service cloudstack-agent status
cloudstack-agent dead but pid file exists
[root@node5 bin]# rm /var/run/cloudstack-agent.pid
rm: remove regular file `/var/run/cloudstack-agent.pid'? y
[root@node5 bin]# service cloudstack-agent status
cloudstack-agent dead but subsys locked
[root@node5 bin]# service cloudstack-agent start
[root@node5 bin]# service cloudstack-agent status
cloudstack-agent (pid 109382) is running...
[root@node5 bin]# netstat -antp | grep 8250
tcp        0      0 192.168.14.20:22220         192.168.14.10:8250          ESTABLISHED 109382/jsvc.exec
After this the agent's state returned to normal, but libvirtd still could not be killed. Soon the connection shown by netstat -antp | grep 8250 disappeared again, and the host's status on the CloudStack master monitoring page changed from Up to Disconnected. Still, that is not Down, so it was progress.
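The two status messages in the transcript come from leftovers of the killed agent: the pid file and the subsys lock file, which the init script defines as /var/run/cloudstack-agent.pid and /var/lock/subsys/cloudstack-agent. Before removing them by hand it is worth confirming that the recorded PID is really gone. A rough sketch of that check follows; pidfile_is_stale is a hypothetical helper.

```shell
#!/bin/bash
# pidfile_is_stale FILE
# Hypothetical helper: returns 0 if FILE exists but the PID recorded in it
# is not running (or the file is empty); returns 1 otherwise.
pidfile_is_stale() {
    local file="$1" pid
    [ -f "$file" ] || return 1           # no pid file, nothing to clean up
    pid=$(cat "$file")
    [ -n "$pid" ] || return 0            # empty file counts as stale
    kill -0 "$pid" 2>/dev/null && return 1
    return 0
}

# Usage on the host (run as root):
#   if pidfile_is_stale /var/run/cloudstack-agent.pid; then
#       rm -f /var/run/cloudstack-agent.pid /var/lock/subsys/cloudstack-agent
#   fi
```

Removing both files together avoids the intermediate "dead but subsys locked" state seen above.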
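Start another libvirtd -d and take a look:
[root@node5 bin]# libvirtd -d
[root@node5 bin]# ps ax | grep libvirtd
  6485 ?        R    863:37 libvirtd --daemon -l
130057 ?        Sl     0:38 libvirtd -d
 28904 pts/0    S+     0:00 grep libvirtd
Then I manually clicked force reconnect for this host on the CloudStack master, and it succeeded. The host's monitored status changed from Disconnected to Up. Another attempt to kill 6485 still failed, so I tried a pause VM operation from the CloudStack master management console, and the VM paused successfully. Back on the server, the originally hung libvirtd process had disappeared.
[root@node5 bin]# libvirtd -d
[root@node5 bin]# ps ax | grep libvirtd
130057 ?        Sl     0:38 libvirtd -d
 28904 pts/0    S+     0:00 grep libvirtd
At this point the platform had regained control of the host, and the abnormal libvirtd process was gone. The problem is tentatively attributed to a minor flaw in how cloudstack-agent handles the signals it sends to libvirtd. I will analyze the jsvc process separately later, to reproduce the problem and fix it at the root.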
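Lessons learned
When dealing with a misbehaving service, do not use restart on the command line; debug with stop and kill instead. A lesson learned the hard way!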