Cluster member state isn't updated to 'DOWN' after the pod becomes down. #4364

Closed
Xezeloh opened this issue Nov 30, 2020 · 3 comments
Labels
area/Nacos Core, contribution welcome, kind/bug (Category issues or prs related to bug)
Milestone
1.4.1


Xezeloh commented Nov 30, 2020

Describe the bug
A cluster member's health state is never correctly updated to 'DOWN'.

Expected behavior
I scale down a Nacos cluster deployed on a Kubernetes cluster by one pod replica (originally 3). The remaining members should detect the change, update the removed node's state to 'DOWN', stop forwarding requests to it, and continue handling client requests properly.

Actual behavior
The node's state is updated to 'SUSPICIOUS' but never becomes 'DOWN'. The other members keep forwarding requests to the removed node, so roughly 1/3 of requests fail.

How to Reproduce

  1. A Nacos cluster deployed on Kubernetes is running properly (3 pods).
  2. Reduce its replica total by one.
  3. Log in to the console and check the cluster member states: the removed node stays 'SUSPICIOUS' indefinitely.

Desktop (please complete the following information):

  • Version nacos-server 1.3.2
  • Module core

Additional context
cluster log:

2020-11-23 09:42:32,030 ERROR failed to report new info to target node : xxxx-sts-nacos-2.xxxx-svc-nacos.yyyy-xxxx.svc.cluster.local:8848, error : caused: xxxx-sts-nacos-2.xxxx-svc-nacos.yyyy-xxxx.svc.cluster.local;

2020-11-23 09:42:36,032 ERROR failed to report new info to target node : xxxx-sts-nacos-2.xxxx-svc-nacos.yyyy-xxxx.svc.cluster.local:8848, error : caused: xxxx-sts-nacos-2.xxxx-svc-nacos.yyyy-xxxx.svc.cluster.local;

2020-11-23 09:42:40,034 ERROR failed to report new info to target node : xxxx-sts-nacos-2.xxxx-svc-nacos.yyyy-xxxx.svc.cluster.local:8848, error : caused: xxxx-sts-nacos-2.xxxx-svc-nacos.yyyy-xxxx.svc.cluster.local;

I suspect the code below leads to the bug:

    /**
     * Failure processing of the operation on the node.
     *
     * @param member {@link Member}
     * @param ex     {@link Throwable}
     */
    public static void onFail(Member member, Throwable ex) {
        Member cloneMember = new Member();
        copy(member, cloneMember);
        manager.getMemberAddressInfos().remove(member.getAddress());
        cloneMember.setState(NodeState.SUSPICIOUS);
        cloneMember.setFailAccessCnt(member.getFailAccessCnt() + 1);
        int maxFailAccessCnt = EnvUtil.getProperty("nacos.core.member.fail-access-cnt", Integer.class, 3);
        
        // If the number of consecutive failures to access the target node reaches
        // a maximum, or the link request is rejected, the state is directly down
        if (cloneMember.getFailAccessCnt() > maxFailAccessCnt || StringUtils
                .containsIgnoreCase(ex.getMessage(), TARGET_MEMBER_CONNECT_REFUSE_ERRMSG)) {
            cloneMember.setState(NodeState.DOWN);
        }
        manager.update(cloneMember);
    }

Here we do increment the failAccessCnt of cloneMember by one, but when I look into the method

        manager.update(cloneMember);

I find that the failAccessCnt of cloneMember is never copied back to the original member object:

    /**
     * member information update.
     *
     * @param newMember {@link Member}
     * @return update is success
     */
    public boolean update(Member newMember) {
        Loggers.CLUSTER.debug("member information update : {}", newMember);
        
        String address = newMember.getAddress();
        if (!serverList.containsKey(address)) {
            return false;
        }
        
        serverList.computeIfPresent(address, (s, member) -> {
            if (NodeState.DOWN.equals(newMember.getState())) {
                memberAddressInfos.remove(newMember.getAddress());
            }
            if (!MemberUtils.fullEquals(newMember, member)) {
                newMember.setExtendVal(MemberMetaDataConstants.LAST_REFRESH_TIME, System.currentTimeMillis());
                MemberUtils.copy(newMember, member);
                // member data changes and all listeners need to be notified
                NotifyCenter.publishEvent(MembersChangeEvent.builder().members(allMembers()).build());
            }
            return member;
        });
        
        return true;
    }

In the method 'copy', the field 'failAccessCnt' isn't copied:

    /**
     * Information copy.
     *
     * @param newMember {@link Member}
     * @param oldMember {@link Member}
     */
    public static void copy(Member newMember, Member oldMember) {
        oldMember.setIp(newMember.getIp());
        oldMember.setPort(newMember.getPort());
        oldMember.setState(newMember.getState());
        oldMember.setExtendInfo(newMember.getExtendInfo());
        oldMember.setAddress(newMember.getAddress());
    }
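To make the effect concrete, here is a minimal, self-contained sketch of the onFail() -> update() -> copy() cycle. FailCounterSketch, FakeMember, buggyCopy and the simplified onFail are made-up stand-ins, not the real Nacos classes; it only assumes the default threshold of 3.

    // Toy sketch only: FailCounterSketch and FakeMember are made-up stand-ins,
    // not the real Nacos types. It replays the onFail() -> update() -> copy()
    // cycle to show why the stored failure counter never grows.
    public class FailCounterSketch {
        
        static class FakeMember {
            int failAccessCnt;          // starts at 0, like a freshly registered member
            String state = "UP";
        }
        
        // Mirrors the 1.3.2 copy(): the state is copied back, failAccessCnt is not.
        static void buggyCopy(FakeMember newMember, FakeMember oldMember) {
            oldMember.state = newMember.state;
            // oldMember.failAccessCnt = newMember.failAccessCnt;   // <- the missing line
        }
        
        // Mirrors onFail() followed by manager.update(): clone the stored member,
        // increment the clone's counter, then copy the clone back onto the stored member.
        static void onFail(FakeMember stored, int maxFailAccessCnt) {
            FakeMember clone = new FakeMember();
            clone.failAccessCnt = stored.failAccessCnt + 1;
            clone.state = clone.failAccessCnt > maxFailAccessCnt ? "DOWN" : "SUSPICIOUS";
            buggyCopy(clone, stored);
        }
        
        public static void main(String[] args) {
            FakeMember stored = new FakeMember();
            for (int i = 0; i < 10; i++) {
                onFail(stored, 3);      // default nacos.core.member.fail-access-cnt is 3
            }
            // Prints "SUSPICIOUS 0": the stored counter is reset on every cycle,
            // so the DOWN threshold is never crossed.
            System.out.println(stored.state + " " + stored.failAccessCnt);
        }
    }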

Maybe altering the method 'copy' as follows would help?

    /**
     * Information copy.
     *
     * @param newMember {@link Member}
     * @param oldMember {@link Member}
     */
    public static void copy(Member newMember, Member oldMember) {
        oldMember.setIp(newMember.getIp());
        oldMember.setPort(newMember.getPort());
        oldMember.setState(newMember.getState());
        oldMember.setExtendInfo(newMember.getExtendInfo());
        oldMember.setAddress(newMember.getAddress());
        oldMember.setFailAccessCnt(newMember.getFailAccessCnt());
    }
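For what it's worth, with failAccessCnt copied back the toy sketch above would print "DOWN 10" instead of "SUSPICIOUS 0": the stored counter accumulates across consecutive failures and crosses the default threshold of 3 on the fourth attempt.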
@KomachiSion KomachiSion added the kind/bug Category issues or prs related to bug. label Nov 30, 2020
@KomachiSion KomachiSion added this to the 1.4.1 milestone Nov 30, 2020

Xezeloh commented Nov 30, 2020

I will solve it.

@chuntaojun chuntaojun assigned chuntaojun and unassigned chuntaojun Nov 30, 2020
@pipipapi

I will solve it.

My Nacos is 1.3.2 with a three-node deployment, and the same thing happens:
2020-11-30 19:38:38,234 ERROR failed to report new info to target node : 127.0.0.1:8850, error : caused: Connection refused: no further information;
2020-11-30 19:38:42,233 ERROR failed to report new info to target node : 127.0.0.1:8849, error : caused: Connection refused: no further information;
2020-11-30 19:38:44,248 ERROR failed to report new info to target node : 127.0.0.1:8850, error : caused: Connection refused: no further information;
Has this been solved yet?

@chuntaojun
Collaborator

I will solve it.

My Nacos is 1.3.2 with a three-node deployment, and the same thing happens:
2020-11-30 19:38:38,234 ERROR failed to report new info to target node : 127.0.0.1:8850, error : caused: Connection refused: no further information;
2020-11-30 19:38:42,233 ERROR failed to report new info to target node : 127.0.0.1:8849, error : caused: Connection refused: no further information;
2020-11-30 19:38:44,248 ERROR failed to report new info to target node : 127.0.0.1:8850, error : caused: Connection refused: no further information;
Has this been solved yet?

That is a network connectivity problem between your nodes. Check your cluster.conf settings and the IP and port each node listens on.

chuntaojun added a commit to chuntaojun/nacos that referenced this issue Nov 30, 2020
KomachiSion pushed a commit that referenced this issue Dec 16, 2020
…the node becomes down (#4371)

* fix: fix issue #4364

* refactor: fixed node changes that could not trigger event publishing

* fix: fixed a problem with frequent releases of events