
Elasticsearch source code analysis - 05: shard allocation

Shard allocation

Now that the cluster state and index-level state have been recovered (covered in the previous parts), the next step is to allocate the index shards, recover shard-level metadata, and build the routingTable.
The AllocationService is created in the ClusterModule constructor:

public ClusterModule(Settings settings, ClusterService clusterService, List<ClusterPlugin> clusterPlugins,
                     ClusterInfoService clusterInfoService) {
    this.clusterPlugins = clusterPlugins;
    //allocation deciders
    this.deciderList = createAllocationDeciders(settings, clusterService.getClusterSettings(), clusterPlugins);
    this.allocationDeciders = new AllocationDeciders(deciderList);
    //shards allocator
    this.shardsAllocator = createShardsAllocator(settings, clusterService.getClusterSettings(), clusterPlugins);
    this.clusterService = clusterService;
    this.indexNameExpressionResolver = new IndexNameExpressionResolver();
    //allocation service
    this.allocationService = new AllocationService(allocationDeciders, shardsAllocator, clusterInfoService);
}

The allocation deciders are initialized here:

Map<Class, AllocationDecider> deciders = new LinkedHashMap<>();
addAllocationDecider(deciders, new MaxRetryAllocationDecider());
addAllocationDecider(deciders, new ResizeAllocationDecider());
addAllocationDecider(deciders, new ReplicaAfterPrimaryActiveAllocationDecider());
addAllocationDecider(deciders, new RebalanceOnlyWhenActiveAllocationDecider());
addAllocationDecider(deciders, new ClusterRebalanceAllocationDecider(settings, clusterSettings));
addAllocationDecider(deciders, new ConcurrentRebalanceAllocationDecider(settings, clusterSettings));
addAllocationDecider(deciders, new EnableAllocationDecider(settings, clusterSettings));
addAllocationDecider(deciders, new NodeVersionAllocationDecider());
addAllocationDecider(deciders, new SnapshotInProgressAllocationDecider());
addAllocationDecider(deciders, new RestoreInProgressAllocationDecider());
addAllocationDecider(deciders, new FilterAllocationDecider(settings, clusterSettings));
addAllocationDecider(deciders, new SameShardAllocationDecider(settings, clusterSettings));
addAllocationDecider(deciders, new DiskThresholdDecider(settings, clusterSettings));
addAllocationDecider(deciders, new ThrottlingAllocationDecider(settings, clusterSettings));
addAllocationDecider(deciders, new ShardsLimitAllocationDecider(settings, clusterSettings));
addAllocationDecider(deciders, new AwarenessAllocationDecider(settings, clusterSettings));

Shard allocation in ES involves two kinds of components: allocators and deciders. Allocators try to find the best node on which to place a shard, while deciders determine whether a given placement is allowed.
All deciders extend the AllocationDecider class, an abstract class with several decision methods:

  • canRebalance: whether the shard can be rebalanced in the given allocation
  • canAllocate: whether the shard can be allocated to the given node
  • canRemain: whether the shard can remain on the given node
  • shouldAutoExpandToNode: whether the shard can auto-expand to the given node
  • canForceAllocatePrimary: whether the primary shard can be force-allocated to the given node

The possible decision results are ALWAYS, YES, NO and THROTTLE; the base class implementations all default to ALWAYS.
These deciders can be grouped into the following categories (a minimal custom decider sketch follows the list):
  • Load balancing
//copies of the same shard must not be allocated to the same node
SameShardAllocationDecider
//limits how many shards of a single index a node may hold
ShardsLimitAllocationDecider
//rack awareness: spread shard copies across different racks/zones
AwarenessAllocationDecider
  • Concurrency control
//limits the number of concurrent rebalance operations
ConcurrentRebalanceAllocationDecider
//decides based on disk usage thresholds
DiskThresholdDecider
//throttles the number of recoveries per node
ThrottlingAllocationDecider
  • Conditional restrictions
//rebalance only when all shard copies are active
RebalanceOnlyWhenActiveAllocationDecider
//filter nodes by configured node ip or name
FilterAllocationDecider
//allocate replicas only after the primary is active
ReplicaAfterPrimaryActiveAllocationDecider
//decide whether rebalancing may run based on the active shards
ClusterRebalanceAllocationDecider
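As a rough illustration of these hook points, here is a minimal, hypothetical decider (not part of Elasticsearch; the class name and the "cold-" rule are invented for the example). Plugins can contribute deciders like this through ClusterPlugin's createAllocationDeciders; any method that is not overridden falls back to the base class default of Decision.ALWAYS.

import org.elasticsearch.cluster.routing.RoutingNode;
import org.elasticsearch.cluster.routing.ShardRouting;
import org.elasticsearch.cluster.routing.allocation.RoutingAllocation;
import org.elasticsearch.cluster.routing.allocation.decider.AllocationDecider;
import org.elasticsearch.cluster.routing.allocation.decider.Decision;

//hypothetical example decider, only to show the extension points
public class ExampleAllocationDecider extends AllocationDecider {

    @Override
    public Decision canAllocate(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
        //veto nodes whose name starts with "cold-" (arbitrary example rule)
        if (node.node().getName().startsWith("cold-")) {
            return allocation.decision(Decision.NO, "example_decider", "node belongs to the cold tier");
        }
        return allocation.decision(Decision.YES, "example_decider", "node is acceptable");
    }

    @Override
    public Decision canRemain(ShardRouting shardRouting, RoutingNode node, RoutingAllocation allocation) {
        //reuse the same rule for shards already located on the node
        return canAllocate(shardRouting, node, allocation);
    }
}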

Back in the ClusterModule constructor, the shardsAllocator is created; it is a BalancedShardsAllocator:

//shards allocator
this.shardsAllocator = createShardsAllocator(settings, clusterService.getClusterSettings(), clusterPlugins);
//the balanced shards allocator is registered under BALANCED_ALLOCATOR
allocators.put(BALANCED_ALLOCATOR, () -> new BalancedShardsAllocator(settings, clusterSettings));

Back to the reroute method:

allocationService.reroute(newState, "state recovered");

public ClusterState reroute(ClusterState clusterState, String reason) {
    ClusterState fixedClusterState = adaptAutoExpandReplicas(clusterState);

    RoutingNodes routingNodes = getMutableRoutingNodes(fixedClusterState);
    // shuffle the unassigned nodes, just so we won't have things like poison failed shards
    //randomize the order of the unassigned shards
    routingNodes.unassigned().shuffle();
    RoutingAllocation allocation = new RoutingAllocation(allocationDeciders, routingNodes, fixedClusterState,
                                                         clusterInfoService.getClusterInfo(), currentNanoTime());
    //perform shard allocation
    reroute(allocation);
    if (fixedClusterState == clusterState && allocation.routingNodesChanged() == false) {
        return clusterState;
    }
    return buildResultAndLogHealthChange(clusterState, allocation, reason);
}

The unassigned shards are first shuffled into a random order, and then the allocation itself is performed:

private void reroute(RoutingAllocation allocation) {
    //remove delay markers from shards whose delay has expired
    removeDelayMarkers(allocation);
    allocateExistingUnassignedShards(allocation);  // try to allocate existing shard copies first
    //balanced allocation of the remaining shards
    shardsAllocator.allocate(allocation);
    assert RoutingNodes.assertShardStats(allocation.routingNodes());
}

Expired delay markers are removed, and then existing shard copies are allocated first:

private void allocateExistingUnassignedShards(RoutingAllocation allocation) {
        allocation.routingNodes().unassigned().sort(PriorityComparator.getAllocationComparator(allocation)); // sort for priority ordering

        for (final ExistingShardsAllocator existingShardsAllocator : existingShardsAllocators.values()) {
            existingShardsAllocator.beforeAllocation(allocation);
        }
        //allocate primary shards that previously had an assigned copy first
        final RoutingNodes.UnassignedShards.UnassignedIterator primaryIterator = allocation.routingNodes().unassigned().iterator();
        while (primaryIterator.hasNext()) {
            final ShardRouting shardRouting = primaryIterator.next();
            if (shardRouting.primary()) {
                //allocate the primary shard
                getAllocatorForShard(shardRouting, allocation)
                    .allocateUnassigned(shardRouting, allocation, primaryIterator);
            }
        }

        for (final ExistingShardsAllocator existingShardsAllocator : existingShardsAllocators.values()) {
            existingShardsAllocator.afterPrimariesBeforeReplicas(allocation);
        }
        //then allocate the unassigned replica shards
        final RoutingNodes.UnassignedShards.UnassignedIterator replicaIterator = allocation.routingNodes().unassigned().iterator();
        while (replicaIterator.hasNext()) {
            final ShardRouting shardRouting = replicaIterator.next();
            if (shardRouting.primary() == false) {
                getAllocatorForShard(shardRouting, allocation)
                    .allocateUnassigned(shardRouting, allocation, replicaIterator);
            }
        }
    }

The GatewayAllocator performs the allocation of existing shard copies:

//allocation of existing shard copies
public void allocateUnassigned(ShardRouting shardRouting, final RoutingAllocation allocation,UnassignedAllocationHandler unassignedAllocationHandler) {
    assert primaryShardAllocator != null;
    assert replicaShardAllocator != null;
    innerAllocatedUnassigned(allocation, primaryShardAllocator, replicaShardAllocator, shardRouting, unassignedAllocationHandler);
}

protected static void innerAllocatedUnassigned(RoutingAllocation allocation,
                                               PrimaryShardAllocator primaryShardAllocator,
                                               ReplicaShardAllocator replicaShardAllocator,
                                               ShardRouting shardRouting,
                                               ExistingShardsAllocator.UnassignedAllocationHandler unassignedAllocationHandler) {
    assert shardRouting.unassigned();
    if (shardRouting.primary()) { //allocate the primary shard
        primaryShardAllocator.allocateUnassigned(shardRouting, allocation, unassignedAllocationHandler);
    } else { //allocate the replica shard
        replicaShardAllocator.allocateUnassigned(shardRouting, allocation, unassignedAllocationHandler);
    }
}

Both the primary allocator (primaryShardAllocator) and replicaShardAllocator extend BaseGatewayShardAllocator. When allocateUnassigned runs, primary and replica allocation go through the same method; only the decision logic differs.

public void allocateUnassigned(ShardRouting shardRouting, RoutingAllocation allocation,
                                   ExistingShardsAllocator.UnassignedAllocationHandler unassignedAllocationHandler) {
        //fetch shard information from all nodes and decide on which node, if any, the shard can be allocated
        final AllocateUnassignedDecision allocateUnassignedDecision = makeAllocationDecision(shardRouting, allocation, logger);

        if (allocateUnassignedDecision.isDecisionTaken() == false) {
            // no decision was taken by this allocator
            return;
        }
        //a target node was found (overall decision is YES)
        if (allocateUnassignedDecision.getAllocationDecision() == AllocationDecision.YES) {
            unassignedAllocationHandler.initialize(allocateUnassignedDecision.getTargetNode().getId(),
                allocateUnassignedDecision.getAllocationId(),
                shardRouting.primary() ? ShardRouting.UNAVAILABLE_EXPECTED_SHARD_SIZE :
                                         allocation.clusterInfo().getShardSize(shardRouting, ShardRouting.UNAVAILABLE_EXPECTED_SHARD_SIZE),
                allocation.changes());
        } else {
            unassignedAllocationHandler.removeAndIgnore(allocateUnassignedDecision.getAllocationStatus(), allocation.changes());
        }
    }

The primary shard allocation decision:

@Override
    public AllocateUnassignedDecision makeAllocationDecision(final ShardRouting unassignedShard,
                                                             final RoutingAllocation allocation,
                                                             final Logger logger) {
        if (isResponsibleFor(unassignedShard) == false) {
            // this allocator is not responsible for allocating this shard
            return AllocateUnassignedDecision.NOT_TAKEN;
        }
        //whether to record an explanation of the decision process
        final boolean explain = allocation.debugDecision();
        //fetch the shard state from the data nodes
        final FetchResult<NodeGatewayStartedShards> shardState = fetchData(unassignedShard, allocation);
        //no data yet means the async fetch is still in progress
        if (shardState.hasData() == false) {
            allocation.setHasPendingAsyncFetch();
            List<NodeAllocationResult> nodeDecisions = null;
            if (explain) {
                nodeDecisions = buildDecisionsForAllNodes(unassignedShard, allocation);
            }
            return AllocateUnassignedDecision.no(AllocationStatus.FETCHING_SHARD_DATA, nodeDecisions);
        }

        // don't create a new IndexSetting object for every shard as this could cause a lot of garbage
        // on cluster restart if we allocate a boat load of shards
        final IndexMetadata indexMetadata = allocation.metadata().getIndexSafe(unassignedShard.index());
        //get the in-sync allocation ids for this shard id
        final Set<String> inSyncAllocationIds = indexMetadata.inSyncAllocationIds(unassignedShard.id());
        //whether the shard is being restored from a snapshot
        final boolean snapshotRestore = unassignedShard.recoverySource().getType() == RecoverySource.Type.SNAPSHOT;

        assert inSyncAllocationIds.isEmpty() == false;
        // use in-sync allocation ids to select nodes
        final NodeShardsResult nodeShardsResult = buildNodeShardsResult(unassignedShard, snapshotRestore,
            allocation.getIgnoreNodes(unassignedShard.shardId()), inSyncAllocationIds, shardState, logger);
        //whether any allocation candidates were found
        final boolean enoughAllocationsFound = nodeShardsResult.orderedAllocationCandidates.size() > 0;
        logger.debug("[{}][{}]: found {} allocation candidates of {} based on allocation ids: [{}]", unassignedShard.index(),
            unassignedShard.id(), nodeShardsResult.orderedAllocationCandidates.size(), unassignedShard, inSyncAllocationIds);

        if (enoughAllocationsFound == false) {
            if (snapshotRestore) {
                // let BalancedShardsAllocator take care of allocating this shard
                logger.debug("[{}][{}]: missing local data, will restore from [{}]",
                             unassignedShard.index(), unassignedShard.id(), unassignedShard.recoverySource());
                return AllocateUnassignedDecision.NOT_TAKEN;
            } else {
                // We have a shard that was previously allocated, but we could not find a valid shard copy to allocate the primary.
                // We could just be waiting for the node that holds the primary to start back up, in which case the allocation for
                // this shard will be picked up when the node joins and we do another allocation reroute
                logger.debug("[{}][{}]: not allocating, number_of_allocated_shards_found [{}]",
                             unassignedShard.index(), unassignedShard.id(), nodeShardsResult.allocationsFound);
                return AllocateUnassignedDecision.no(AllocationStatus.NO_VALID_SHARD_COPY,
                    explain ? buildNodeDecisions(null, shardState, inSyncAllocationIds) : null);
            }
        }
        //build the lists of candidate nodes for this shard
        NodesToAllocate nodesToAllocate = buildNodesToAllocate(
            allocation, nodeShardsResult.orderedAllocationCandidates, unassignedShard, false
        );
        DiscoveryNode node = null;
        String allocationId = null;
        boolean throttled = false;
        //at least one node returned a YES decision
        if (nodesToAllocate.yesNodeShards.isEmpty() == false) {
            //take the first (highest-ranked) candidate
            DecidedNode decidedNode = nodesToAllocate.yesNodeShards.get(0);
            logger.debug("[{}][{}]: allocating [{}] to [{}] on primary allocation",
                         unassignedShard.index(), unassignedShard.id(), unassignedShard, decidedNode.nodeShardState.getNode());
            //target node
            node = decidedNode.nodeShardState.getNode();
            //allocation id of the shard copy on that node
            allocationId = decidedNode.nodeShardState.allocationId();
        } else if (nodesToAllocate.throttleNodeShards.isEmpty() && !nodesToAllocate.noNodeShards.isEmpty()) {
            // The deciders returned a NO decision for all nodes with shard copies, so we check if primary shard
            // can be force-allocated to one of the nodes.
            //check whether the primary can be force-allocated to one of the nodes
            nodesToAllocate = buildNodesToAllocate(allocation, nodeShardsResult.orderedAllocationCandidates, unassignedShard, true);
            if (nodesToAllocate.yesNodeShards.isEmpty() == false) {
                final DecidedNode decidedNode = nodesToAllocate.yesNodeShards.get(0);
                final NodeGatewayStartedShards nodeShardState = decidedNode.nodeShardState;
                logger.debug("[{}][{}]: allocating [{}] to [{}] on forced primary allocation",
                             unassignedShard.index(), unassignedShard.id(), unassignedShard, nodeShardState.getNode());
                node = nodeShardState.getNode();
                allocationId = nodeShardState.allocationId();
            } else if (nodesToAllocate.throttleNodeShards.isEmpty() == false) {
                logger.debug("[{}][{}]: throttling allocation [{}] to [{}] on forced primary allocation",
                             unassignedShard.index(), unassignedShard.id(), unassignedShard, nodesToAllocate.throttleNodeShards);
                throttled = true;
            } else {
                logger.debug("[{}][{}]: forced primary allocation denied [{}]",
                             unassignedShard.index(), unassignedShard.id(), unassignedShard);
            }
        } else {
            // we are throttling this, since we are allowed to allocate to this node but there are enough allocations
            // taking place on the node currently, ignore it for now
            logger.debug("[{}][{}]: throttling allocation [{}] to [{}] on primary allocation",
                         unassignedShard.index(), unassignedShard.id(), unassignedShard, nodesToAllocate.throttleNodeShards);
            throttled = true;
        }

        List<NodeAllocationResult> nodeResults = null;
        if (explain) {
            //build per-node decisions for the explain output
            nodeResults = buildNodeDecisions(nodesToAllocate, shardState, inSyncAllocationIds);
        }
        //the async fetch has not finished yet
        if (allocation.hasPendingAsyncFetch()) {
            return AllocateUnassignedDecision.no(AllocationStatus.FETCHING_SHARD_DATA, nodeResults);
        } else if (node != null) {
            return AllocateUnassignedDecision.yes(node, allocationId, nodeResults, false);
        } else if (throttled) {
            return AllocateUnassignedDecision.throttle(nodeResults);
        } else {
            return AllocateUnassignedDecision.no(AllocationStatus.DECIDERS_NO, nodeResults, true);
        }
    }

If there is no shard state data yet, requests must be sent to all other nodes to fetch the shard information, because it is not known which nodes hold copies of the shard. At startup this means iterating over every shard and issuing a fetchData request, so in a large cluster with many shards a large number of requests has to be sent.

//fetch shard state from all data nodes
final FetchResult<NodeGatewayStartedShards> shardState = fetchData(unassignedShard, allocation);

@Override
protected AsyncShardFetch.FetchResult<TransportNodesListGatewayStartedShards.NodeGatewayStartedShards>
                                                                        fetchData(ShardRouting shard, RoutingAllocation allocation) {
            AsyncShardFetch<TransportNodesListGatewayStartedShards.NodeGatewayStartedShards> fetch =
                asyncFetchStarted.computeIfAbsent(shard.shardId(),
                    shardId -> new InternalAsyncFetch<>(logger, "shard_started", shardId,
                        //custom data path configured for the index
                        IndexMetadata.INDEX_DATA_PATH_SETTING.get(allocation.metadata().index(shard.index()).getSettings()),
                        startedAction));
            //shard-level metadata returned by the nodes
            AsyncShardFetch.FetchResult<TransportNodesListGatewayStartedShards.NodeGatewayStartedShards> shardState =
                    fetch.fetchData(allocation.nodes(), allocation.getIgnoreNodes(shard.shardId()));
            //process the returned data
            if (shardState.hasData()) {
                shardState.processAllocation(allocation);
            }
            return shardState;
        }

The fetch is sent asynchronously:

void asyncFetch(final DiscoveryNode[] nodes, long fetchingRound) {
    logger.trace("{} fetching [{}] from {}", shardId, type, nodes);
    //request the shard information from the listed nodes
    action.list(shardId, customDataPath, nodes, new ActionListener<BaseNodesResponse<T>>() {
        //all nodes have responded, start processing
        @Override
        public void onResponse(BaseNodesResponse<T> response) {
            //process the fetched results
            processAsyncFetch(response.getNodes(), response.failures(), fetchingRound);
        }

        @Override
        public void onFailure(Exception e) {
            List<FailedNodeException> failures = new ArrayList<>(nodes.length);
            for (final DiscoveryNode node: nodes) {
                failures.add(new FailedNodeException(node.getId(), "total failure in fetching", e));
            }
            processAsyncFetch(null, failures, fetchingRound);
        }
    });
}
private final TransportNodesListGatewayStartedShards startedAction;

asyncFetchStarted.computeIfAbsent(shard.shardId(),
    shardId -> new InternalAsyncFetch<>(logger, "shard_started", shardId,
        //custom data path configured for the index
        IndexMetadata.INDEX_DATA_PATH_SETTING.get(allocation.metadata().index(shard.index()).getSettings()), startedAction));

The action is TransportNodesListGatewayStartedShards; its list method sends the request, whose transport action name is:

public static final String ACTION_NAME = "internal:gateway/local/started_shards";

This invokes TransportNodesAction's doExecute method:

@Override
protected void doExecute(Task task, NodesRequest request, ActionListener<NodesResponse> listener) {
    //execute the per-node requests
    new AsyncAction(task, request, listener).start();
}

On the receiving node, the request is handled in TransportNodesListGatewayStartedShards.nodeOperation:

@Override
protected NodeGatewayStartedShards nodeOperation(NodeRequest request) {
    try {
        //shard id
        final ShardId shardId = request.getShardId();
        logger.trace("{} loading local shard state info", shardId);
        //load the on-disk shard state metadata
        ShardStateMetadata shardStateMetadata = ShardStateMetadata.FORMAT.loadLatestState(logger, namedXContentRegistry,
                                                                                          nodeEnv.availableShardPaths(request.shardId));
        if (shardStateMetadata != null) {
            if (indicesService.getShardOrNull(shardId) == null) {
                final String customDataPath;
                if (request.getCustomDataPath() != null) {
                    customDataPath = request.getCustomDataPath();
                } else {
                    // TODO: Fallback for BWC with older ES versions. Remove once request.getCustomDataPath() always returns non-null
                    final IndexMetadata metadata = clusterService.state().metadata().index(shardId.getIndex());
                    if (metadata != null) {
                        customDataPath = new IndexSettings(metadata, settings).customDataPath();
                    } else {
                        logger.trace("{} node doesn't have meta data for the requests index", shardId);
                        throw new ElasticsearchException("node doesn't have meta data for index " + shardId.getIndex());
                    }
                }
                // we don't have an open shard on the store, validate the files on disk are openable
                ShardPath shardPath = null;
                try {
                    shardPath = ShardPath.loadShardPath(logger, nodeEnv, shardId, customDataPath);
                    if (shardPath == null) {
                        throw new IllegalStateException(shardId + " no shard path found");
                    }
                    Store.tryOpenIndex(shardPath.resolveIndex(), shardId, nodeEnv::shardLock, logger);
                } catch (Exception exception) {
                    final ShardPath finalShardPath = shardPath;
                    logger.trace(() -> new ParameterizedMessage(
                        "{} can't open index for shard [{}] in path [{}]",
                        shardId,
                        shardStateMetadata,
                        (finalShardPath != null) ? finalShardPath.resolveIndex() : ""),
                                 exception);
                    String allocationId = shardStateMetadata.allocationId != null ?
                        shardStateMetadata.allocationId.getId() : null;
                    return new NodeGatewayStartedShards(clusterService.localNode(), allocationId, shardStateMetadata.primary,
                                                        exception);
                }
            }

            logger.debug("{} shard state info found: [{}]", shardId, shardStateMetadata);
            String allocationId = shardStateMetadata.allocationId != null ?
                shardStateMetadata.allocationId.getId() : null;
            return new NodeGatewayStartedShards(clusterService.localNode(), allocationId, shardStateMetadata.primary);
        }
        //no shard metadata found on this node
        logger.trace("{} no local shard info found", shardId);
        return new NodeGatewayStartedShards(clusterService.localNode(), null, false);
    } catch (Exception e) {
        throw new ElasticsearchException("failed to load started shards", e);
    }
}

Here the node loads its local shard-level metadata and returns it. When the responses arrive, the data is put into a cache so that later reroutes do not have to fetch it again, and then reroute(shardId, "post_response") is called:

@Override
protected void reroute(ShardId shardId, String reason) {
    logger.trace("{} scheduling reroute for {}", shardId, reason);
    assert rerouteService != null;
    rerouteService.reroute("async_shard_fetch", Priority.HIGH, ActionListener.wrap(
        r -> logger.trace("{} scheduled reroute completed for {}", shardId, reason),
        e -> logger.debug(new ParameterizedMessage("{} scheduled reroute failed for {}", shardId, reason), e)));
}

This submits a cluster state update task via clusterService.submitStateUpdateTask:

//cluster_reroute
            clusterService.submitStateUpdateTask(CLUSTER_UPDATE_TASK_SOURCE + "(" + reason + ")",
                new ClusterStateUpdateTask(priority) {

                    @Override
                    public ClusterState execute(ClusterState currentState) {
                        final boolean currentListenersArePending;
                        synchronized (mutex) {
                            assert currentListeners.isEmpty() == (pendingRerouteListeners != currentListeners)
                                : "currentListeners=" + currentListeners + ", pendingRerouteListeners=" + pendingRerouteListeners;
                            currentListenersArePending = pendingRerouteListeners == currentListeners;
                            if (currentListenersArePending) {
                                pendingRerouteListeners = null;
                            }
                        }
                        if (currentListenersArePending) {
                            logger.trace("performing batched reroute [{}]", reason);
                            return reroute.apply(currentState, reason);
                        } else {
                            logger.trace("batched reroute [{}] was promoted", reason);
                            return currentState;
                        }
                    }

                  ......
                });

The task's execute method calls reroute.apply(currentState, reason). The reroute function is set in the Node constructor and is AllocationService's reroute:

//batched cluster reroute service
final RerouteService rerouteService
                = new BatchedRerouteService(clusterService, clusterModule.getAllocationService()::reroute);

This calls AllocationService.reroute again. The cache now holds the shard metadata, so the primary shard allocation flow can proceed.
Back in PrimaryShardAllocator.makeAllocationDecision:

final IndexMetadata indexMetadata = allocation.metadata().getIndexSafe(unassignedShard.index());
//get the in-sync allocation ids for this shard id
final Set<String> inSyncAllocationIds = indexMetadata.inSyncAllocationIds(unassignedShard.id());

This retrieves the set of in-sync shard copies.

//use in-sync allocation ids to select nodes
final NodeShardsResult nodeShardsResult = buildNodeShardsResult(unassignedShard, snapshotRestore,
    allocation.getIgnoreNodes(unassignedShard.shardId()), inSyncAllocationIds, shardState, logger);

protected static NodeShardsResult buildNodeShardsResult(ShardRouting shard, boolean matchAnyShard,
                                                            Set<String> ignoreNodes, Set<String> inSyncAllocationIds,
                                                            FetchResult<NodeGatewayStartedShards> shardState,
                                                            Logger logger) {
        List<NodeGatewayStartedShards> nodeShardStates = new ArrayList<>();
        int numberOfAllocationsFound = 0;
        for (NodeGatewayStartedShards nodeShardState : shardState.getData().values()) {
            DiscoveryNode node = nodeShardState.getNode();
            String allocationId = nodeShardState.allocationId();

            if (ignoreNodes.contains(node.getId())) {
                continue;
            }
            //no store exception occurred
            if (nodeShardState.storeException() == null) {
                if (allocationId == null) {
                    logger.trace("[{}] on node [{}] has no shard state information", shard, nodeShardState.getNode());
                } else {
                    logger.trace("[{}] on node [{}] has allocation id [{}]", shard, nodeShardState.getNode(), allocationId);
                }
            } else {
                final String finalAllocationId = allocationId;
                if (nodeShardState.storeException() instanceof ShardLockObtainFailedException) {
                    logger.trace(() -> new ParameterizedMessage("[{}] on node [{}] has allocation id [{}] but the store can not be " +
                        "opened as it's locked, treating as valid shard", shard, nodeShardState.getNode(), finalAllocationId),
                        nodeShardState.storeException());
                } else {
                    logger.trace(() -> new ParameterizedMessage("[{}] on node [{}] has allocation id [{}] but the store can not be " +
                        "opened, treating as no allocation id", shard, nodeShardState.getNode(), finalAllocationId),
                        nodeShardState.storeException());
                    allocationId = null;
                }
            }

            if (allocationId != null) {
                assert nodeShardState.storeException() == null ||
                    nodeShardState.storeException() instanceof ShardLockObtainFailedException :
                    "only allow store that can be opened or that throws a ShardLockObtainFailedException while being opened but got a " +
                        "store throwing " + nodeShardState.storeException();
                //count how many copies of the shard were found
                numberOfAllocationsFound++;
                //only keep copies that are in-sync (or any copy when matchAnyShard is true)
                if (matchAnyShard || inSyncAllocationIds.contains(nodeShardState.allocationId())) {
                    nodeShardStates.add(nodeShardState);
                }
            }
        }

        final Comparator<NodeGatewayStartedShards> comparator; // allocation preference
        if (matchAnyShard) {
            // prefer shards with matching allocation ids
            //copies matching an in-sync allocation id sort to the front
            Comparator<NodeGatewayStartedShards> matchingAllocationsFirst = Comparator.comparing(
                (NodeGatewayStartedShards state) -> inSyncAllocationIds.contains(state.allocationId())).reversed();
            comparator = matchingAllocationsFirst.thenComparing(NO_STORE_EXCEPTION_FIRST_COMPARATOR)
                .thenComparing(PRIMARY_FIRST_COMPARATOR);
        } else {
            //copies without a store exception sort to the front
            comparator = NO_STORE_EXCEPTION_FIRST_COMPARATOR.thenComparing(PRIMARY_FIRST_COMPARATOR);
        }

        nodeShardStates.sort(comparator);

        if (logger.isTraceEnabled()) {
            logger.trace("{} candidates for allocation: {}", shard, nodeShardStates.stream().map(s -> s.getNode().getName())
                .collect(Collectors.joining(", ")));
        }
        return new NodeShardsResult(nodeShardStates, numberOfAllocationsFound);
    }

This builds the list of candidate nodes. If matchAnyShard is false, only copies whose allocation id is in the in-sync set are added to the list; otherwise any node holding a copy is added, but in-sync copies are sorted to the front.

//build the lists of candidate nodes for this shard
NodesToAllocate nodesToAllocate = buildNodesToAllocate(
    allocation, nodeShardsResult.orderedAllocationCandidates, unassignedShard, false
);

private static NodesToAllocate buildNodesToAllocate(RoutingAllocation allocation,
                                                        List<NodeGatewayStartedShards> nodeShardStates,
                                                        ShardRouting shardRouting,
                                                        boolean forceAllocate) {
        List<DecidedNode> yesNodeShards = new ArrayList<>();
        List<DecidedNode> throttledNodeShards = new ArrayList<>();
        List<DecidedNode> noNodeShards = new ArrayList<>();
        for (NodeGatewayStartedShards nodeShardState : nodeShardStates) {
            RoutingNode node = allocation.routingNodes().node(nodeShardState.getNode().getId());
            if (node == null) {
                continue;
            }
            //whether forced allocation of the primary is being attempted
            Decision decision = forceAllocate ? allocation.deciders().canForceAllocatePrimary(shardRouting, node, allocation) :
                                                allocation.deciders().canAllocate(shardRouting, node, allocation);
            //record the allocation decision for this node
            DecidedNode decidedNode = new DecidedNode(nodeShardState, decision);
            if (decision.type() == Type.THROTTLE) {
                throttledNodeShards.add(decidedNode);
            } else if (decision.type() == Type.NO) {
                noNodeShards.add(decidedNode);
            } else {
                yesNodeShards.add(decidedNode);
            }
        }
        return new NodesToAllocate(Collections.unmodifiableList(yesNodeShards), Collections.unmodifiableList(throttledNodeShards),
                                      Collections.unmodifiableList(noNodeShards));
    }

Each candidate node is run through all deciders and the results are grouped into three lists, YES, THROTTLE and NO, which are then returned.

        //at least one node returned a YES decision
        if (nodesToAllocate.yesNodeShards.isEmpty() == false) {
            //take the first (highest-ranked) candidate
            DecidedNode decidedNode = nodesToAllocate.yesNodeShards.get(0);
            logger.debug("[{}][{}]: allocating [{}] to [{}] on primary allocation",
                         unassignedShard.index(), unassignedShard.id(), unassignedShard, decidedNode.nodeShardState.getNode());
            //target node
            node = decidedNode.nodeShardState.getNode();
            //allocation id of the shard copy on that node
            allocationId = decidedNode.nodeShardState.allocationId();
        } else if (nodesToAllocate.throttleNodeShards.isEmpty() && !nodesToAllocate.noNodeShards.isEmpty()) {
            // The deciders returned a NO decision for all nodes with shard copies, so we check if primary shard
            // can be force-allocated to one of the nodes.
            //check whether the primary can be force-allocated
            nodesToAllocate = buildNodesToAllocate(allocation, nodeShardsResult.orderedAllocationCandidates, unassignedShard, true);
            if (nodesToAllocate.yesNodeShards.isEmpty() == false) {
                final DecidedNode decidedNode = nodesToAllocate.yesNodeShards.get(0);
                final NodeGatewayStartedShards nodeShardState = decidedNode.nodeShardState;
                logger.debug("[{}][{}]: allocating [{}] to [{}] on forced primary allocation",
                             unassignedShard.index(), unassignedShard.id(), unassignedShard, nodeShardState.getNode());
                node = nodeShardState.getNode();
                allocationId = nodeShardState.allocationId();
            } else if (nodesToAllocate.throttleNodeShards.isEmpty() == false) {
                logger.debug("[{}][{}]: throttling allocation [{}] to [{}] on forced primary allocation",
                             unassignedShard.index(), unassignedShard.id(), unassignedShard, nodesToAllocate.throttleNodeShards);
                throttled = true;
            } else {
                logger.debug("[{}][{}]: forced primary allocation denied [{}]",
                             unassignedShard.index(), unassignedShard.id(), unassignedShard);
            }
        } else {
            // we are throttling this, since we are allowed to allocate to this node but there are enough allocations
            // taking place on the node currently, ignore it for now
            logger.debug("[{}][{}]: throttling allocation [{}] to [{}] on primary allocation",
                         unassignedShard.index(), unassignedShard.id(), unassignedShard, nodesToAllocate.throttleNodeShards);
            throttled = true;
        }

        List<NodeAllocationResult> nodeResults = null;
        if (explain) {
            //build per-node decisions for the explain output
            nodeResults = buildNodeDecisions(nodesToAllocate, shardState, inSyncAllocationIds);
        }
        //the async fetch has not finished yet
        if (allocation.hasPendingAsyncFetch()) {
            return AllocateUnassignedDecision.no(AllocationStatus.FETCHING_SHARD_DATA, nodeResults);
        } else if (node != null) {
            return AllocateUnassignedDecision.yes(node, allocationId, nodeResults, false);
        } else if (throttled) {
            return AllocateUnassignedDecision.throttle(nodeResults);
        } else {
            return AllocateUnassignedDecision.no(AllocationStatus.DECIDERS_NO, nodeResults, true);
        }

If a node can take the primary shard, the primary is allocated to that node. If no node was chosen, the allocator checks whether the primary can be force-allocated to one of them. If no node can take the primary and some nodes returned a THROTTLE decision, a throttled result is returned.
Back in BaseGatewayShardAllocator's allocateUnassigned method:

public void allocateUnassigned(ShardRouting shardRouting, RoutingAllocation allocation,
                                   ExistingShardsAllocator.UnassignedAllocationHandler unassignedAllocationHandler) {
        //fetch shard information from all nodes and decide on which node, if any, the shard can be allocated
        final AllocateUnassignedDecision allocateUnassignedDecision = makeAllocationDecision(shardRouting, allocation, logger);

        if (allocateUnassignedDecision.isDecisionTaken() == false) {
            // no decision was taken by this allocator
            return;
        }
        //a target node was found (overall decision is YES)
        if (allocateUnassignedDecision.getAllocationDecision() == AllocationDecision.YES) {
            unassignedAllocationHandler.initialize(allocateUnassignedDecision.getTargetNode().getId(),
                allocateUnassignedDecision.getAllocationId(),
                shardRouting.primary() ? ShardRouting.UNAVAILABLE_EXPECTED_SHARD_SIZE :
                                         allocation.clusterInfo().getShardSize(shardRouting, ShardRouting.UNAVAILABLE_EXPECTED_SHARD_SIZE),
                allocation.changes());
        } else {
            unassignedAllocationHandler.removeAndIgnore(allocateUnassignedDecision.getAllocationStatus(), allocation.changes());
        }
    }

initialize is called to initialize the primary shard:

public ShardRouting initializeShard(ShardRouting unassignedShard, String nodeId, @Nullable String existingAllocationId,
                                        long expectedSize, RoutingChangesObserver routingChangesObserver) {
        ensureMutable();
        assert unassignedShard.unassigned() : "expected an unassigned shard " + unassignedShard;
        //create the initializing ShardRouting
        ShardRouting initializedShard = unassignedShard.initialize(nodeId, existingAllocationId, expectedSize);
        //add the shard to the target node (node-to-shard mapping)
        node(nodeId).add(initializedShard);
        //count of inactive (not yet started) shards
        inactiveShardCount++;
        if (initializedShard.primary()) {
            inactivePrimaryCount++;
        }
        //register the recovery for this shard
        addRecovery(initializedShard);
        assignedShardsAdd(initializedShard);
        routingChangesObserver.shardInitialized(unassignedShard, initializedShard);
        return initializedShard;
    }

Replica shard allocation follows the same logic as primary allocation: replicaShardAllocator.allocateUnassigned is called, which delegates to ReplicaShardAllocator's makeAllocationDecision to decide the replica placement. It first checks whether the replica can be allocated to at least one node at all; if not, the remaining steps are skipped. Replica allocation also needs the shard metadata, but since primary allocation already fetched it, the cached result from the previous fetch is reused. If no node can take the replica, the allocator checks whether allocation should be delayed (the relevant index setting is sketched after the code below); otherwise the initialize flow runs (via initializeShard, shown above), creating the ShardRouting:


public ShardRouting initialize(String nodeId, @Nullable String existingAllocationId, long expectedShardSize) {
        assert state == ShardRoutingState.UNASSIGNED : this;
        assert relocatingNodeId == null : this;
        final AllocationId allocationId;
        //create a new AllocationId if no existing allocation id was provided
        if (existingAllocationId == null) {
            allocationId = AllocationId.newInitializing();
        } else {
            allocationId = AllocationId.newInitializing(existingAllocationId);
        }
        return new ShardRouting(shardId, nodeId, null, primary, ShardRoutingState.INITIALIZING, recoverySource,
            unassignedInfo, allocationId, expectedShardSize);
    }
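The delayed-allocation check mentioned above is driven by a per-index setting. As a reminder of the knob involved, here is an illustrative snippet (not from the walkthrough above): the setting key index.unassigned.node_left.delayed_timeout is the standard index setting (default 1m), while the builder code and the "5m" value are just an example.

import org.elasticsearch.common.settings.Settings;

//illustrative only: the delay window that replica allocation honours after a node leaves
Settings indexSettings = Settings.builder()
        .put("index.unassigned.node_left.delayed_timeout", "5m")
        .build();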

The shard enters the INITIALIZING state.
Once existing shard copies have been allocated, shards without an existing copy are handled by the balanced allocator:

//balanced allocation
shardsAllocator.allocate(allocation);

public void allocate(RoutingAllocation allocation) {
        if (allocation.routingNodes().size() == 0) {
            failAllocationOfNewPrimaries(allocation);
            return;
        }
        final Balancer balancer = new Balancer(logger, allocation, weightFunction, threshold);
        //allocate the remaining unassigned shards in a balanced way
        balancer.allocateUnassigned();
        //move shards that can no longer remain on their current node
        balancer.moveShards();
        //rebalance shards across nodes
        balancer.balance();
    }
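For context on what balancer.balance() optimises: BalancedShardsAllocator scores every node with a weight function built from two settings, cluster.routing.allocation.balance.shard (default 0.45) and cluster.routing.allocation.balance.index (default 0.55). The following is a simplified sketch of that idea, not the actual ES class; the names and signatures are illustrative.

//simplified sketch of the balancer's weight function (illustrative, not ES source)
final class SimpleWeightFunction {
    private final float theta0; //contribution of the overall shard balance
    private final float theta1; //contribution of the per-index shard balance

    SimpleWeightFunction(float indexBalance, float shardBalance) {
        float sum = indexBalance + shardBalance;
        this.theta0 = shardBalance / sum;
        this.theta1 = indexBalance / sum;
    }

    //lower weight = more attractive target; the balancer moves shards from high-weight
    //to low-weight nodes until the difference falls below the configured threshold
    float weight(int shardsOnNode, float avgShardsPerNode,
                 int indexShardsOnNode, float avgIndexShardsPerNode) {
        return theta0 * (shardsOnNode - avgShardsPerNode)
             + theta1 * (indexShardsOnNode - avgIndexShardsPerNode);
    }
}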

At the end of reroute, the resulting ShardRouting changes are applied to the cluster state and a new cluster state is returned; that state is then published to the cluster, and the recovery process begins.


Original article: https://blog.csdn.net/GeekerJava/article/details/139702707
