Describe the bug
Cluster member state is never updated to 'DOWN'.
Expected behavior
I reduced the replica count of a Nacos cluster deployed on Kubernetes by one (originally 3 pods). The cluster should detect the health-state change of the removed node, update its state to 'DOWN', stop forwarding requests to it, and handle all client requests properly.
Actual behavior
The node state is updated to 'SUSPICIOUS' but never becomes 'DOWN'. The other members still forward requests to the removed node, so roughly 1/3 of requests fail.
How to Reproduce
1. The Nacos cluster (3 pods) deployed on Kubernetes is working properly.
2. Reduce its replica count by one.
3. Log in to the console and check the cluster member state: the removed node stays 'SUSPICIOUS' indefinitely.
Desktop (please complete the following information):
Version nacos-server 1.3.2
Module core
Additional context
cluster log:
2020-11-23 09:42:32,030 ERROR failed to report new info to target node : xxxx-sts-nacos-2.xxxx-svc-nacos.yyyy-xxxx.svc.cluster.local:8848, error : caused: xxxx-sts-nacos-2.xxxx-svc-nacos.yyyy-xxxx.svc.cluster.local;
2020-11-23 09:42:36,032 ERROR failed to report new info to target node : xxxx-sts-nacos-2.xxxx-svc-nacos.yyyy-xxxx.svc.cluster.local:8848, error : caused: xxxx-sts-nacos-2.xxxx-svc-nacos.yyyy-xxxx.svc.cluster.local;
2020-11-23 09:42:40,034 ERROR failed to report new info to target node : xxxx-sts-nacos-2.xxxx-svc-nacos.yyyy-xxxx.svc.cluster.local:8848, error : caused: xxxx-sts-nacos-2.xxxx-svc-nacos.yyyy-xxxx.svc.cluster.local;
Based on my reading, the code below causes the bug:
/**
 * Failure processing of the operation on the node.
 *
 * @param member {@link Member}
 * @param ex     {@link Throwable}
 */
public static void onFail(Member member, Throwable ex) {
    Member cloneMember = new Member();
    copy(member, cloneMember);
    manager.getMemberAddressInfos().remove(member.getAddress());
    cloneMember.setState(NodeState.SUSPICIOUS);
    cloneMember.setFailAccessCnt(member.getFailAccessCnt() + 1);
    int maxFailAccessCnt = EnvUtil.getProperty("nacos.core.member.fail-access-cnt", Integer.class, 3);
    // If the number of consecutive failures to access the target node reaches
    // a maximum, or the link request is rejected, the state is directly down
    if (cloneMember.getFailAccessCnt() > maxFailAccessCnt || StringUtils
            .containsIgnoreCase(ex.getMessage(), TARGET_MEMBER_CONNECT_REFUSE_ERRMSG)) {
        cloneMember.setState(NodeState.DOWN);
    }
    manager.update(cloneMember);
}
Here the failAccessCnt of cloneMember is indeed incremented by one, but when I look into manager.update(cloneMember), I find that the failAccessCnt of cloneMember is never copied back to the original member object:
/**
 * member information update.
 *
 * @param newMember {@link Member}
 * @return update is success
 */
public boolean update(Member newMember) {
    Loggers.CLUSTER.debug("member information update : {}", newMember);
    String address = newMember.getAddress();
    if (!serverList.containsKey(address)) {
        return false;
    }
    serverList.computeIfPresent(address, (s, member) -> {
        if (NodeState.DOWN.equals(newMember.getState())) {
            memberAddressInfos.remove(newMember.getAddress());
        }
        if (!MemberUtils.fullEquals(newMember, member)) {
            newMember.setExtendVal(MemberMetaDataConstants.LAST_REFRESH_TIME, System.currentTimeMillis());
            MemberUtils.copy(newMember, member);
            // member data changes and all listeners need to be notified
            NotifyCenter.publishEvent(MembersChangeEvent.builder().members(allMembers()).build());
        }
        return member;
    });
    return true;
}
In the method 'copy', we don't copy the field 'failAccessCnt'.
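To make the loop concrete, here is a minimal, self-contained simulation (simplified stand-in classes with plain fields, not the actual Nacos code) of repeated onFail/update rounds. Because copy drops failAccessCnt, the stored member's counter never leaves zero, every clone's counter is 1, and the SUSPICIOUS-to-DOWN transition is never reached:

import java.util.HashMap;
import java.util.Map;

public class FailCntSimulation {

    static class Member {
        String address;
        String state = "UP";
        int failAccessCnt = 0;
    }

    // Mirrors the described copy behavior: state is copied, failAccessCnt is not.
    static void copy(Member src, Member dst) {
        dst.address = src.address;
        dst.state = src.state;
        // dst.failAccessCnt = src.failAccessCnt;  // <-- the missing copy
    }

    public static void main(String[] args) {
        Member stored = new Member();
        stored.address = "nacos-2:8848";
        Map<String, Member> serverList = new HashMap<>();
        serverList.put(stored.address, stored);

        // Simulate five consecutive failed health reports to the node.
        for (int i = 1; i <= 5; i++) {
            Member clone = new Member();
            copy(stored, clone);                       // onFail: clone the stored member
            clone.state = "SUSPICIOUS";
            clone.failAccessCnt = stored.failAccessCnt + 1;  // always 0 + 1 = 1
            if (clone.failAccessCnt > 3) {
                clone.state = "DOWN";                  // never reached
            }
            copy(clone, stored);                       // update(): counter dropped here
            System.out.printf("attempt %d: state=%s, storedFailCnt=%d%n",
                    i, stored.state, stored.failAccessCnt);
        }
    }
}

Running it prints state=SUSPICIOUS, storedFailCnt=0 on every attempt.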
I'm on Nacos 1.3.2 with a three-node deployment and see the same issue:
2020-11-30 19:38:38,234 ERROR failed to report new info to target node : 127.0.0.1:8850, error : caused: Connection refused: no further information;
2020-11-30 19:38:42,233 ERROR failed to report new info to target node : 127.0.0.1:8849, error : caused: Connection refused: no further information;
2020-11-30 19:38:44,248 ERROR failed to report new info to target node : 127.0.0.1:8850, error : caused: Connection refused: no further information;
Has this been resolved?
…the node becomes down (#4371)
* fix: fix issue #4364
* refactor: fixed node changes that could not trigger event publishing
* fix: fixed a problem with frequent releases of events
Maybe altering the method 'copy' as follows would help?
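(The suggested snippet appears to have been lost in the paste. As a hedged sketch only: the change presumably amounts to propagating the counter inside MemberUtils.copy. The parameter names below are mine, and the elided field copies are assumptions about the 1.3.2 method body; setState/getState and setFailAccessCnt/getFailAccessCnt are the accessors already visible in onFail above.)

public static void copy(Member source, Member target) {
    // ... existing field copies (ip, port, extend info, ...) as in 1.3.2 ...
    target.setState(source.getState());
    // Proposed addition: propagate the failure counter so that
    // update() -> MemberUtils.copy() writes it back to the stored member,
    // letting consecutive failures accumulate past maxFailAccessCnt.
    target.setFailAccessCnt(source.getFailAccessCnt());
}

With the counter surviving the round trip through update() and MemberUtils.copy(), cloneMember.getFailAccessCnt() in onFail can finally exceed maxFailAccessCnt and the node is marked DOWN.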