From 41a3cf2fb81971657bef2861a77d92e23e57f0e2 Mon Sep 17 00:00:00 2001
From: Wilson
Date: Tue, 30 Sep 2025 14:47:39 -0700
Subject: [PATCH 01/12] Add NAP troubleshooting doc

Add NAP troubleshooting + FAQ doc
---
 .../troubleshoot-node-auto-provision.md | 263 ++++++++++++++++++
 1 file changed, 263 insertions(+)
 create mode 100644 support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md

diff --git a/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md b/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
new file mode 100644
index 00000000000..dd740ffbc97
--- /dev/null
+++ b/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
@@ -0,0 +1,263 @@
---
title: Troubleshoot the Node Auto-provisioning managed add-on
description: Learn how to troubleshoot Node Auto-provisioning in Azure Kubernetes Service (AKS).
ms.service: azure-kubernetes-service
ms.date: 09/05/2025
editor: wdarko1
ms.reviewer:
#Customer intent: As an Azure Kubernetes Service user, I want to troubleshoot problems that involve the Node Auto-provisioning managed add-on so that I can successfully provision, scale, and manage my nodes and workloads on Azure Kubernetes Service (AKS).
ms.custom: sap:Extensions, Policies and Add-Ons
---

# Troubleshoot the node auto provisioning (NAP) in Azure Kubernetes Service (AKS)

This article discusses how to troubleshoot node auto provisioning (NAP), a managed add-on based on the open-source [Karpenter](https://karpenter.sh) project. NAP automatically provisions and manages nodes in response to pending pod pressure and handles scaling events at the virtual machine (node) level.
 When you enable Node Auto-provisioning, you might experience problems that are associated with the configuration of the infrastructure autoscaler. This article will help you troubleshoot errors and resolve common problems that affect NAP but aren't covered in the official Karpenter [FAQ][karpenter-faq] and [troubleshooting guide][karpenter-troubleshooting].

## Prerequisites

Ensure the following tools are installed and configured. They're used in the following sections.

- [Azure CLI](/cli/azure/install-azure-cli).
- The Kubernetes [kubectl](https://kubernetes.io/docs/reference/kubectl/overview/) command-line client. To install kubectl by using the Azure CLI, run the [az aks install-cli](/cli/azure/aks#az-aks-install-cli) command.
- Confirm you have Node Auto-provisioning enabled on your cluster.

## Common Issues

### Nodes Not Being Removed

**Symptoms**: Underutilized nodes remain in the cluster longer than expected.

**Debugging Steps**:

1. **Check node utilization**:
```azurecli-interactive
kubectl top nodes
kubectl describe node <node-name>
```

2. **Look for blocking pods**:
```azurecli-interactive
kubectl get pods --all-namespaces --field-selector spec.nodeName=<node-name>
```
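
If a node that looks empty isn't being consolidated, the blocker is usually a pod that can't be evicted. As a quick check, you can list the pods that explicitly opt out of disruption. This is a minimal sketch: it uses the upstream Karpenter `karpenter.sh/do-not-disrupt` annotation and assumes `jq` is available; `<node-name>` is a placeholder.

```azurecli-interactive
# Pods annotated karpenter.sh/do-not-disrupt=true block NAP from disrupting their node
kubectl get pods --all-namespaces -o json | jq -r '.items[] | select(.metadata.annotations["karpenter.sh/do-not-disrupt"] == "true") | .metadata.namespace + "/" + .metadata.name'

# The same annotation on the node itself also blocks disruption
kubectl get node <node-name> -o jsonpath='{.metadata.annotations.karpenter\.sh/do-not-disrupt}'
```

If nothing is annotated, look for pods without tolerations for the node's taints and for restrictive pod disruption budgets; the next step surfaces these through events.
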
3. **Check for disruption blocks**:
```azurecli-interactive
kubectl get events | grep -i "disruption\|consolidation"
```

**Common Causes**:
- Pods without proper tolerations
- DaemonSets preventing drain
- Pod disruption budgets (PDBs) not properly set
- Nodes marked with `do-not-disrupt` annotation

**Solutions**:
- Add proper tolerations to pods
- Review DaemonSet configurations
- Adjust pod disruption budgets to allow disruption
- Remove `do-not-disrupt` annotations if appropriate


## Networking Issues

### Pod Connectivity Problems

**Symptoms**: Pods can't communicate with other pods or external services.

**Debugging Steps**:

1. **Test basic connectivity**:
```azurecli-interactive
# From within a pod
kubectl exec -it <pod-name> -- ping <target-pod-ip>
kubectl exec -it <pod-name> -- nslookup kubernetes.default
```

2. **Check network plugin status**:
```azurecli-interactive
kubectl get pods -n kube-system | grep -E "azure-cni|kube-proxy"
```
3. **If using azure cni with overlay or cilium**
Validate your nodes have these labels

```
 kubernetes.azure.com/azure-cni-overlay: "true"
 kubernetes.azure.com/network-name: aks-vnet-<vnet-id>
 kubernetes.azure.com/network-resourcegroup: <node-resource-group>
 kubernetes.azure.com/network-subscription: <subscription-id>
```

4. **Validate the CNI configuration files**

The CNI conflist files define network plugin configurations. Check which files are present:

```azurecli-interactive
# List CNI configuration files
ls -la /etc/cni/net.d/

# Example output:
# 10-azure.conflist 15-azure-swift-overlay.conflist
```

**Understanding conflist files**:
- `10-azure.conflist`: Standard Azure CNI configuration for traditional networking with node subnet
- `15-azure-swift-overlay.conflist`: Azure CNI with overlay networking (used with Cilium or overlay mode)

**Inspect the configuration content**:
```azurecli-interactive
# Check the actual CNI configuration
cat /etc/cni/net.d/*.conflist

# Look for key fields:
# - "type": should be "azure-vnet" for Azure CNI
# - "mode": "bridge" for standard, "transparent" for overlay
# - "ipam": IP address management configuration
```

**Common conflist issues**:
- Missing or corrupted configuration files
- Incorrect network mode for your cluster setup
- Mismatched IPAM configuration
- Wrong plugin order in the configuration chain

5. **Check CNI to CNS communication**:
```azurecli-interactive
# Check CNS logs for IP allocation requests from CNI
kubectl logs -n kube-system -l k8s-app=azure-cns --tail=100
```

**CNI to CNS Troubleshooting**:
- **If CNS logs show "no IPs available"**: This indicates a CNS or aks's watch on the NNCs.
- **If CNI calls don't appear in CNS logs**: You likely have the wrong CNI installed. Verify the correct CNI plugin is deployed.

**Common Causes**:
- Network security group rules
- Incorrect subnet configuration
- CNI plugin issues
- DNS resolution problems

**Solutions**:
- Review NSG rules for required traffic
- Verify subnet configuration in AKSNodeClass
- Restart CNI plugin pods
- Check CoreDNS configuration

### DNS Service IP Issues

**Note**: The `--dns-service-ip` parameter is only supported for NAP (Node Auto Provisioning) clusters and is not available for self-hosted Karpenter installations.

**Symptoms**: Pods can't resolve DNS names or kubelet fails to register with API server due to DNS resolution failures.

**Debugging Steps**:
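
Several of the steps below assume shell access to the affected node. If SSH isn't configured on NAP-provisioned nodes, one option is to open a debug session with `kubectl debug`. This is a sketch: `<node-name>` is a placeholder, and any small image that includes a shell works.

```azurecli-interactive
# Start a privileged debug pod on the node, then switch into the host filesystem
kubectl debug node/<node-name> -it --image=mcr.microsoft.com/cbl-mariner/busybox:2.0
chroot /host    # run inside the debug pod so node paths such as /var/lib/kubelet resolve
```
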
1. **Check kubelet DNS configuration**:
```azurecli-interactive
# SSH to the Karpenter node and check kubelet config
sudo cat /var/lib/kubelet/config.yaml | grep -A 5 clusterDNS

# Expected output should show the correct DNS service IP
# clusterDNS:
# - "10.0.0.10" # This should match your cluster's DNS service IP
```

2. **Verify DNS service IP matches cluster configuration**:
```azurecli-interactive
# Get the actual DNS service IP from your cluster
kubectl get service -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}'

# Compare with what AKS reports
az aks show --resource-group <resource-group-name> --name <cluster-name> --query "networkProfile.dnsServiceIp" -o tsv
```

3. **Test DNS resolution from the node**:
```azurecli-interactive
# SSH to the Karpenter node and test DNS resolution
# Test using the DNS service IP directly
dig @10.0.0.10 kubernetes.default.svc.cluster.local

# Test using system resolver
nslookup kubernetes.default.svc.cluster.local

# Test external DNS resolution
dig google.com
```

4. **Check DNS pods status**:
```azurecli-interactive
# Verify CoreDNS pods are running
kubectl get pods -n kube-system -l k8s-app=kube-dns

# Check CoreDNS logs for errors
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
```

5. **Validate network connectivity to DNS service**:
```azurecli-interactive
# From the Karpenter node, test connectivity to DNS service
telnet 10.0.0.10 53 # Replace with your actual DNS service IP
# Or using nc if telnet is not available
nc -zv 10.0.0.10 53
```

**Common Causes**:
- Incorrect `--dns-service-ip` parameter in AKSNodeClass
- DNS service IP not in the service CIDR range
- Network connectivity issues between node and DNS service
- CoreDNS pods not running or misconfigured
- Firewall rules blocking DNS traffic

**Solutions**:
- Verify `--dns-service-ip` matches the actual DNS service: `kubectl get svc -n kube-system kube-dns -o jsonpath='{.spec.clusterIP}'`
- Ensure DNS service IP is within the service CIDR range specified during cluster creation
- Check that Karpenter nodes can reach the service subnet
- Restart CoreDNS pods if they're in error state: `kubectl rollout restart deployment/coredns -n kube-system`
- Verify NSG rules allow traffic on port 53 (TCP/UDP)

## Azure-Specific Issues

### Spot VM Issues

**Symptoms**: Unexpected node terminations when using spot instances.

**Debugging Steps**:

1. **Check node events**:

```azurecli-interactive
kubectl get events | grep -i "spot\|evict"
```

2. **Monitor spot VM pricing**:

```azurecli-interactive
az vm list-sizes --location <location> --query "[?contains(name, 'Standard_D2s_v3')]"
```

**Solutions**:
- Use diverse instance types for better availability
- Implement proper pod disruption budgets
- Consider mixed spot/on-demand strategies
- Use workloads tolerant of node preemption
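
To make spot reclamation less disruptive, you can also let a NodePool fall back to on-demand capacity. The following is a minimal sketch rather than a drop-in configuration: the NodePool name is illustrative, the API version can differ on older clusters, and the `default` AKSNodeClass is assumed to exist. Check the CRDs installed in your cluster before applying.

```azurecli-interactive
# Sketch: a NodePool allowed to use both spot and on-demand capacity
cat <<'EOF' | kubectl apply -f -
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: mixed-capacity          # illustrative name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.azure.com
        kind: AKSNodeClass
        name: default            # assumes the default AKSNodeClass exists
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: karpenter.azure.com/sku-family
          operator: In
          values: ["D"]
EOF
```

The point of allowing both capacity types is that reclaimed spot capacity can be replaced from either pool instead of leaving pods pending.
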
### Quota Exceeded

**Symptoms**: VM creation fails with quota exceeded errors.

**Debugging Steps**:

1. **Check current quota usage**:
```azurecli-interactive
az vm list-usage --location <location> --query "[?currentValue >= limit]"
```

**Solutions**:
- Request quota increases through Azure portal
- Use different VM sizes with available quota

[!INCLUDE [Azure Help Support](../../../includes/azure-help-support.md)]


[aks-firewall-requirements]: /azure/aks/limit-egress-traffic#azure-global-required-network-rules
[karpenter-troubleshooting]: h[ttps://keda.sh/docs/latest/troubleshooting/](https://karpenter.sh/docs/troubleshooting/)
[karpenter-faq]: https://karpenter.sh/docs/faq/

From 6723c877e3cad83a92abb5d3499e8a857fae3e4a Mon Sep 17 00:00:00 2001
From: Wilson
Date: Thu, 2 Oct 2025 23:33:11 -0700
Subject: [PATCH 02/12] Update troubleshoot-node-auto-provision.md

---
 .../troubleshoot-node-auto-provision.md | 17 +++++++++--------
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md b/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
index dd740ffbc97..e339e1e8c68 100644
--- a/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
+++ b/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
@@ -3,13 +3,13 @@ title: Troubleshoot the Node Auto-provisioning managed add-on
description: Learn how to troubleshoot Node Auto-provisioning in Azure Kubernetes Service (AKS).
ms.service: azure-kubernetes-service
ms.date: 09/05/2025
-editor: wdarko1
+editor: bsoghigian
ms.reviewer:
#Customer intent: As an Azure Kubernetes Service user, I want to troubleshoot problems that involve the Node Auto-provisioning managed add-on so that I can successfully provision, scale, and manage my nodes and workloads on Azure Kubernetes Service (AKS).
ms.custom: sap:Extensions, Policies and Add-Ons
---

-# Troubleshoot the node auto provisioning (NAP) in Azure Kubernetes Service (AKS)
+# Troubleshoot node auto provisioning (NAP) in Azure Kubernetes Service (AKS)

This article discusses how to troubleshoot node auto provisioning (NAP), a managed add-on based on the open-source [Karpenter](https://karpenter.sh) project. NAP automatically provisions and manages nodes in response to pending pod pressure and handles scaling events at the virtual machine (node) level.
 When you enable Node Auto-provisioning, you might experience problems that are associated with the configuration of the infrastructure autoscaler. This article will help you troubleshoot errors and resolve common problems that affect NAP but aren't covered in the official Karpenter [FAQ][karpenter-faq] and [troubleshooting guide][karpenter-troubleshooting].
@@ -26,7 +26,7 @@ Ensure the following tools are installed and configured. They're used in the fol

### Nodes Not Being Removed

-**Symptoms**: Underutilized nodes remain in the cluster longer than expected.
+**Symptoms**: Underutilized or empty nodes remain in the cluster longer than expected.

**Debugging Steps**:
@@ -49,8 +49,8 @@ kubectl get events | grep -i "disruption\|consolidation"
**Common Causes**:
- Pods without proper tolerations
- DaemonSets preventing drain
-- Pod disruption budgets (PDBs) not properly set
-- Nodes marked with `do-not-disrupt` annotation
+- Pod disruption budgets (PDBs) are not properly set
+- Nodes are marked with `do-not-disrupt` annotation

**Solutions**:
- Add proper tolerations to pods
- Review DaemonSet configurations
- Adjust pod disruption budgets to allow disruption
- Remove `do-not-disrupt` annotations if appropriate
@@ -81,7 +81,7 @@ kubectl get pods -n kube-system | grep -E "azure-cni|kube-proxy"
```
3. 
**If using azure cni with overlay or cilium** Validate your nodes have these labels -``` +```azurecli-interactive kubernetes.azure.com/azure-cni-overlay: "true" kubernetes.azure.com/network-name: aks-vnet- kubernetes.azure.com/network-resourcegroup: @@ -145,7 +145,8 @@ kubectl logs -n kube-system -l k8s-app=azure-cns --tail=100 ### DNS Service IP Issues -**Note**: The `--dns-service-ip` parameter is only supported for NAP (Node Auto Provisioning) clusters and is not available for self-hosted Karpenter installations. +>[!NOTE] +>The `--dns-service-ip` parameter is only supported for NAP (Node Auto Provisioning) clusters and is not available for self-hosted Karpenter installations. **Symptoms**: Pods can't resolve DNS names or kubelet fails to register with API server due to DNS resolution failures. @@ -259,5 +260,5 @@ az vm list-usage --location --query "[?currentValue >= limit]" [aks-firewall-requirements]: /azure/aks/limit-egress-traffic#azure-global-required-network-rules -[karpenter-troubleshooting]: h[ttps://keda.sh/docs/latest/troubleshooting/](https://karpenter.sh/docs/troubleshooting/) +[karpenter-troubleshooting]: https://karpenter.sh/docs/troubleshooting/ [karpenter-faq]: https://karpenter.sh/docs/faq/ From 71a044d4fa6f858b0b46276e4c8e4b85f86eb6b3 Mon Sep 17 00:00:00 2001 From: Wilson Date: Thu, 2 Oct 2025 23:40:00 -0700 Subject: [PATCH 03/12] Add troubleshooting section for node auto provisioning Add NAP troubleshoot doc to TOC --- support/azure/azure-kubernetes/toc.yml | 2 ++ 1 file changed, 2 insertions(+) diff --git a/support/azure/azure-kubernetes/toc.yml b/support/azure/azure-kubernetes/toc.yml index 97d2b16899f..913863e64f9 100644 --- a/support/azure/azure-kubernetes/toc.yml +++ b/support/azure/azure-kubernetes/toc.yml @@ -253,6 +253,8 @@ items: href: extensions/troubleshoot-managed-namespaces.md - name: Troubleshoot network isolated clusters href: extensions/troubleshoot-network-isolated-cluster.md + - name: Troubleshoot node auto provisioning + href: extensions/troubleshoot-node-auto-provision.md - name: KEDA add-on items: - name: Breaking changes in KEDA add-on 2.15 and 2.14 From 5ab640cec68b91e3c5d0e07d102ab266bef40d83 Mon Sep 17 00:00:00 2001 From: Wilson Date: Fri, 3 Oct 2025 09:31:15 -0700 Subject: [PATCH 04/12] Enhance troubleshooting guide with AKS Node Viewer Added information about AKS Node Viewer for visualizing node usage. --- .../extensions/troubleshoot-node-auto-provision.md | 1 + 1 file changed, 1 insertion(+) diff --git a/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md b/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md index e339e1e8c68..cf29e15db48 100644 --- a/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md +++ b/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md @@ -35,6 +35,7 @@ Ensure the following tools are installed and configured. They're used in the fol kubectl top nodes kubectl describe node ``` +You can also use the open-source [AKS Node Viewer](https://github.com/Azure/aks-node-viewer) tool to visualize node usage. 2. **Look for blocking pods**: ```azurecli-interactive From ab2113d3e165b2292da4bbeb1ee5aafb0bee2f66 Mon Sep 17 00:00:00 2001 From: Wilson Date: Mon, 13 Oct 2025 09:21:56 -0700 Subject: [PATCH 05/12] Enhance troubleshooting steps for node auto-provisioning Updated troubleshooting documentation for Azure Kubernetes node auto-provisioning. Added information on locks and network security group rules. 
---
 .../extensions/troubleshoot-node-auto-provision.md | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md b/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
index cf29e15db48..fd8f3fa7882 100644
--- a/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
+++ b/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
@@ -52,12 +52,14 @@ kubectl get events | grep -i "disruption\|consolidation"
- DaemonSets preventing drain
- Pod disruption budgets (PDBs) are not properly set
- Nodes are marked with `do-not-disrupt` annotation
+- Azure resource locks blocking changes to node resources

**Solutions**:
- Add proper tolerations to pods
- Review DaemonSet configurations
- Adjust pod disruption budgets to allow disruption
- Remove `do-not-disrupt` annotations if appropriate
+- Review Azure resource lock configurations

## Networking Issues
@@ -75,6 +77,8 @@ kubectl exec -it <pod-name> -- ping <target-pod-ip>
kubectl exec -it <pod-name> -- nslookup kubernetes.default
```

+Another option for testing node-to-node or pod-to-pod connectivity is the open-source [goldpinger](https://github.com/bloomberg/goldpinger) tool.
+
2. **Check network plugin status**:
```azurecli-interactive
kubectl get pods -n kube-system | grep -E "azure-cni|kube-proxy"
```
@@ -102,7 +106,7 @@ ls -la /etc/cni/net.d/
```

**Understanding conflist files**:
-- `10-azure.conflist`: Standard Azure CNI configuration for traditional networking with node subnet
+- `10-azure.conflist`: Standard Azure CNI configuration for traditional networking (any CNI configuration that doesn't use overlay)
- `15-azure-swift-overlay.conflist`: Azure CNI with overlay networking (used with Cilium or overlay mode)
@@ -133,13 +137,13 @@ kubectl logs -n kube-system -l k8s-app=azure-cns --tail=100
- **If CNS logs show "no IPs available"**: This indicates a CNS or aks's watch on the NNCs.
- **If CNI calls don't appear in CNS logs**: You likely have the wrong CNI installed. Verify the correct CNI plugin is deployed.
**Common Causes**:
-- Network security group rules
+- Network security group (NSG) rules
- Incorrect subnet configuration
- CNI plugin issues
- DNS resolution problems

**Solutions**:
-- Review NSG rules for required traffic
+- Review [Network Security Group][networ-security-group-docs] rules for required traffic
- Verify subnet configuration in AKSNodeClass
- Restart CNI plugin pods
- Check CoreDNS configuration
@@ -263,3 +267,4 @@ az vm list-usage --location <location> --query "[?currentValue >= limit]"
[aks-firewall-requirements]: /azure/aks/limit-egress-traffic#azure-global-required-network-rules
[karpenter-troubleshooting]: https://karpenter.sh/docs/troubleshooting/
[karpenter-faq]: https://karpenter.sh/docs/faq/
+[networ-security-group-docs]: /azure/virtual-network/network-security-groups-overview

From 53dc8cb911da3b937ef54aef0a10a06e5139b351 Mon Sep 17 00:00:00 2001
From: Wilson
Date: Mon, 13 Oct 2025 09:33:47 -0700
Subject: [PATCH 06/12] Update troubleshoot-node-auto-provision.md

---
 .../extensions/troubleshoot-node-auto-provision.md | 11 +++++++----
 1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md b/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
index fd8f3fa7882..e8b8f5727df 100644
--- a/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
+++ b/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
@@ -143,10 +143,10 @@ kubectl logs -n kube-system -l k8s-app=azure-cns --tail=100
- DNS resolution problems

**Solutions**:
-- Review [Network Security Group][networ-security-group-docs] rules for required traffic
-- Verify subnet configuration in AKSNodeClass
+- Review [Network Security Group][network-security-group-docs] rules for required traffic
+- Verify subnet configuration in AKSNodeClass. See the [AKSNodeClass documentation][aksnodeclass-subnet-config] for subnet configuration details
- Restart CNI plugin pods
-- Check CoreDNS configuration
+- Check CoreDNS configuration. See the [CoreDNS documentation][coredns-troubleshoot]

### DNS Service IP Issues
@@ -267,4 +267,7 @@ az vm list-usage --location <location> --query "[?currentValue >= limit]"
[aks-firewall-requirements]: /azure/aks/limit-egress-traffic#azure-global-required-network-rules
[karpenter-troubleshooting]: https://karpenter.sh/docs/troubleshooting/
[karpenter-faq]: https://karpenter.sh/docs/faq/
-[networ-security-group-docs]: /azure/virtual-network/network-security-groups-overview
+[network-security-group-docs]: /azure/virtual-network/network-security-groups-overview
+[aksnodeclass-subnet-config]: /azure/aks/node-autoprovision-aksnodeclass#virtual-network-subnet-configuration
+[coredns-troubleshoot]: /azure/aks/coredns-custom#troubleshooting
+

From d430ebf24c152c727c5452817717521c4b86f948 Mon Sep 17 00:00:00 2001
From: Wilson
Date: Mon, 13 Oct 2025 09:46:32 -0700
Subject: [PATCH 07/12] Update NodePool CRD for VM size flexibility

Expanded the NodePool CRD to allow more VM sizes to prevent quota errors during VM creation.
---
 .../extensions/troubleshoot-node-auto-provision.md | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md b/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
index e8b8f5727df..c54f62226a6 100644
--- a/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
+++ b/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
@@ -259,7 +259,7 @@ az vm list-usage --location <location> --query "[?currentValue >= limit]"

**Solutions**:
- Request quota increases through Azure portal
-- Use different VM sizes with available quota
+- Expand the NodePool CRD to allow more VM sizes. See the [NodePool configuration documentation][nap-nodepool-docs] for details. For example, a NodePool specification that allows any D-family virtual machine is less likely to hit quota errors that stop VM creation than one pinned to a single exact VM size.

[!INCLUDE [Azure Help Support](../../../includes/azure-help-support.md)]
@@ -269,5 +269,6 @@ az vm list-usage --location <location> --query "[?currentValue >= limit]"
[karpenter-faq]: https://karpenter.sh/docs/faq/
[network-security-group-docs]: /azure/virtual-network/network-security-groups-overview
[aksnodeclass-subnet-config]: /azure/aks/node-autoprovision-aksnodeclass#virtual-network-subnet-configuration
+[nap-nodepool-docs]: /azure/aks/node-autoprovision-node-pools
[coredns-troubleshoot]: /azure/aks/coredns-custom#troubleshooting

From 97673cd189395823d1629959515e3ee2a1a55cab Mon Sep 17 00:00:00 2001
From: Wilson
Date: Wed, 15 Oct 2025 13:55:30 -0700
Subject: [PATCH 08/12] Adding link to main NAP doc

---
 .../extensions/troubleshoot-node-auto-provision.md | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md b/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
index c54f62226a6..b6f0d751d7f 100644
--- a/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
+++ b/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
@@ -1,18 +1,18 @@
---
-title: Troubleshoot the Node Auto-provisioning managed add-on
-description: Learn how to troubleshoot Node Auto-provisioning in Azure Kubernetes Service (AKS).
+title: Troubleshoot the Node Auto Provisioning managed add-on
+description: Learn how to troubleshoot Node Auto Provisioning in Azure Kubernetes Service (AKS).
ms.service: azure-kubernetes-service
ms.date: 09/05/2025
editor: bsoghigian
ms.reviewer:
-#Customer intent: As an Azure Kubernetes Service user, I want to troubleshoot problems that involve the Node Auto-provisioning managed add-on so that I can successfully provision, scale, and manage my nodes and workloads on Azure Kubernetes Service (AKS).
+#Customer intent: As an Azure Kubernetes Service user, I want to troubleshoot problems that involve the Node Auto Provisioning managed add-on so that I can successfully provision, scale, and manage my nodes and workloads on Azure Kubernetes Service (AKS).
ms.custom: sap:Extensions, Policies and Add-Ons
---

# Troubleshoot node auto provisioning (NAP) in Azure Kubernetes Service (AKS)

This article discusses how to troubleshoot node auto provisioning (NAP), a managed add-on based on the open-source [Karpenter](https://karpenter.sh) project. NAP automatically provisions and manages nodes in response to pending pod pressure and handles scaling events at the virtual machine (node) level.
- When you enable Node Auto-provisioning, you might experience problems that are associated with the configuration of the infrastructure autoscaler. This article will help you troubleshoot errors and resolve common problems that affect NAP but aren't covered in the official Karpenter [FAQ][karpenter-faq] and [troubleshooting guide][karpenter-troubleshooting].
+When you enable Node Auto Provisioning, you might experience problems that are associated with the configuration of the infrastructure autoscaler. This article will help you troubleshoot errors and resolve common problems that affect NAP but aren't covered in the official Karpenter [FAQ][karpenter-faq] and [troubleshooting guide][karpenter-troubleshooting].
@@ -20,7 +20,7 @@ Ensure the following tools are installed and configured. They're used in the fol
- [Azure CLI](/cli/azure/install-azure-cli).
- The Kubernetes [kubectl](https://kubernetes.io/docs/reference/kubectl/overview/) command-line client. To install kubectl by using the Azure CLI, run the [az aks install-cli](/cli/azure/aks#az-aks-install-cli) command.
-- Confirm you have Node Auto-provisioning enabled on your cluster.
+- Confirm you have Node Auto Provisioning enabled on your cluster. For steps on enabling node auto provisioning in your cluster, see the [node auto provisioning documentation][nap-main-docs].

## Common Issues
@@ -270,5 +270,6 @@ az vm list-usage --location <location> --query "[?currentValue >= limit]"
[nap-nodepool-docs]: /azure/aks/node-autoprovision-node-pools
+[nap-main-docs]: /azure/aks/node-autoprovision
[coredns-troubleshoot]: /azure/aks/coredns-custom#troubleshooting

From 12b02c973c8f4867c58f3b123397a268e8400670 Mon Sep 17 00:00:00 2001
From: Wilson
Date: Wed, 15 Oct 2025 14:05:26 -0700
Subject: [PATCH 09/12] Fix typos and update DNS test command

---
 .../extensions/troubleshoot-node-auto-provision.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md b/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
index b6f0d751d7f..ae5813c24cc 100644
--- a/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
+++ b/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
@@ -133,7 +133,7 @@ kubectl logs -n kube-system -l k8s-app=azure-cns --tail=100
```

**CNI to CNS Troubleshooting**:
-- **If CNS logs show "no IPs available"**: This indicates a CNS or aks's watch on the NNCs.
+- **If CNS logs show "no IPs available"**: This indicates a problem with CNS or with AKS's watch on the NodeNetworkConfigs (NNCs).
- **If CNI calls don't appear in CNS logs**: You likely have the wrong CNI installed. Verify the correct CNI plugin is deployed.

**Common Causes**:
@@ -186,7 +186,7 @@ dig @10.0.0.10 kubernetes.default.svc.cluster.local
nslookup kubernetes.default.svc.cluster.local

# Test external DNS resolution
-dig google.com
+dig azure.com
```

4. **Check DNS pods status**:

From eb2c00983177e6166931395ecb4103a1fa4fa685 Mon Sep 17 00:00:00 2001
From: Wilson
Date: Fri, 7 Nov 2025 08:45:09 -0800
Subject: [PATCH 10/12] Update troubleshoot-node-auto-provision.md

---
 .../extensions/troubleshoot-node-auto-provision.md | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md b/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
index ae5813c24cc..2a585231576 100644
--- a/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
+++ b/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
@@ -64,6 +64,10 @@ kubectl get events | grep -i "disruption\|consolidation"

## Networking Issues

+For most networking-related issues, two levels of network observability are available:
+- [Container Network Metrics][aks-container-metrics] (default): Provides node-level metrics.
+- [Advanced Container Network Metrics][advanced-container-network-metrics]: In addition to node-level metrics, lets you observe pod-level metrics, including FQDN metrics, for troubleshooting.
+
### Pod Connectivity Problems

**Symptoms**: Pods can't communicate with other pods or external services.
@@ -219,6 +223,7 @@ nc -zv 10.0.0.10 53
- Check that Karpenter nodes can reach the service subnet
- Restart CoreDNS pods if they're in error state: `kubectl rollout restart deployment/coredns -n kube-system`
- Verify NSG rules allow traffic on port 53 (TCP/UDP)
+- Run a connectivyt analysis with the [Azure Virtual Network Verifier][connectivity-tool] tool to validate outbound connectivity

## Azure-Specific Issues
@@ -272,4 +277,7 @@ az vm list-usage --location <location> --query "[?currentValue >= limit]"
[nap-main-docs]: /azure/aks/node-autoprovision
[coredns-troubleshoot]: /azure/aks/coredns-custom#troubleshooting
+[aks-container-metrics]: /azure/aks/container-network-observability-metrics
+[advanced-container-network-metrics]: /azure/aks/advanced-container-networking-services-overview
+[connectivity-tool]: /azure/azure-kubernetes/connectivity/basic-troubleshooting-outbound-connections#check-if-azure-network-resources-are-blocking-traffic-to-the-endpoint

From ef2505bf7583701f537f64090f45f3c1eeb9569c Mon Sep 17 00:00:00 2001
From: Wilson
Date: Fri, 7 Nov 2025 08:51:25 -0800
Subject: [PATCH 11/12] Update troubleshooting steps for Azure CNI

---
 .../extensions/troubleshoot-node-auto-provision.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md b/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
index 2a585231576..d68d360644d 100644
--- a/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
+++ b/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
@@ -87,7 +87,7 @@ Another option for testing node-to-node or pod-to-pod connectivity is the open-
```azurecli-interactive
kubectl get pods -n kube-system | grep -E "azure-cni|kube-proxy"
```
-3. **If using azure cni with overlay or cilium**
+3. **If using Azure CNI with overlay**
Validate your nodes have these labels

```azurecli-interactive
 kubernetes.azure.com/azure-cni-overlay: "true"

From 35512eaf9ffb273bc4bbc3eba2d804732f8de690 Mon Sep 17 00:00:00 2001
From: Wilson
Date: Fri, 7 Nov 2025 09:03:17 -0800
Subject: [PATCH 12/12] Update support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md

Co-authored-by: Chase Wilson <31453523+chasewilson@users.noreply.github.com>
---
 .../extensions/troubleshoot-node-auto-provision.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md b/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
index d68d360644d..a5b3c5c9165 100644
--- a/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
+++ b/support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
@@ -223,7 +223,7 @@ nc -zv 10.0.0.10 53
- Check that Karpenter nodes can reach the service subnet
- Restart CoreDNS pods if they're in error state: `kubectl rollout restart deployment/coredns -n kube-system`
- Verify NSG rules allow traffic on port 53 (TCP/UDP)
-- Run a connectivyt analysis with the [Azure Virtual Network Verifier][connectivity-tool] tool to validate outbound connectivity
+- Run a connectivity analysis with the [Azure Virtual Network Verifier][connectivity-tool] tool to validate outbound connectivity

## Azure-Specific Issues