@wdarko1 (Contributor) commented Oct 3, 2025

Pull request guidance

Thank you for submitting your contribution to our support content! Our team works closely with subject matter experts in CSS and PMs in the product group to review all content requests to ensure technical accuracy and the best customer experience. This process can sometimes take one or more days, so we greatly appreciate your patience.

We also need your help in order to process your request as soon as possible:

  • We won't act on your pull request (PR) until you type "#sign-off" in a new comment in your pull request (PR) to indicate that your changes are complete.

  • After you sign off in your PR, the article will be tech reviewed by the PM or SME if it has more than minor changes. Once the article is approved, it will undergo a final editing pass before being merged.

@prmerger-automator

@wdarko1 : Thanks for your contribution! The author(s) and reviewer(s) have been notified to review your proposed change.

@learn-build-service-prod

Learn Build status updates of commit 71a044d:

❌ Validation status: errors

Please follow instructions here which may help to resolve issue.

For more details, please refer to the build report.

Note: Your PR may contain errors or warnings or suggestions unrelated to the files you changed. This happens when external dependencies like GitHub alias, Microsoft alias, cross repo links are updated. Please use these instructions to resolve them.

@learn-build-service-prod

Learn Build status updates of commit 34df0e0:

✅ Validation status: passed

File Status Preview URL Details
support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md ✅Succeeded
support/azure/azure-kubernetes/toc.yml ✅Succeeded

For more details, please refer to the build report.

@prmerger-automator

PRMerger Results

| Issue | Description |
| --- | --- |
| Added File(s) | This PR contains added files. New files require human review. |
| Yaml File(s) | This PR includes changes to .yml file(s) owned by another author. |
| File Change Percent | This PR contains file(s) with more than 30% file change. |

Comment on lines +33 to +36
1. **Check node utilization**:
```azurecli-interactive
kubectl top nodes
kubectl describe node <node-name>

should we mention aks-node-viewer as well? https://github.com/Azure/aks-node-viewer

Contributor Author

For sure, just added a reference on line 38 for aks node viewer

Added information about AKS Node Viewer for visualizing node usage.
@learn-build-service-prod

Learn Build status updates of commit 5ab640c:

✅ Validation status: passed

File Status Preview URL Details
support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md ✅Succeeded
support/azure/azure-kubernetes/toc.yml ✅Succeeded

For more details, please refer to the build report.

# Troubleshoot node auto provisioning (NAP) in Azure Kubernetes Service (AKS)

This article discusses how to troubleshoot Node auto provisioning(NAP), a managed add-on based on the open source [Karpenter](https://karpenter.sh) project. NAP automatically provisions and manages nodes in response to pending pod pressure, and manages scaling events at the virtual machine, or node level.
When you enable Node Auto-provisioning, you might experience problems that are associated with the configuration of the infrastructure autoscaler. This article will help you troubleshoot errors and resolve common problems that affect NAP but aren't covered in the official Karpenter [FAQ][karpenter-faq] and [troubleshooting guide][karpenter-troubleshooting].

Suggested change

Original:
When you enable Node Auto-provisioning, you might experience problems that are associated with the configuration of the infrastructure autoscaler. This article will help you troubleshoot errors and resolve common problems that affect NAP but aren't covered in the official Karpenter [FAQ][karpenter-faq] and [troubleshooting guide][karpenter-troubleshooting].

Suggested:
When you enable Node Auto-provisioning, you might experience problems associated with the configuration of the infrastructure autoscaler. This article helps you troubleshoot errors and resolve common problems that affect NAP but aren't covered in the official Karpenter [FAQ][karpenter-faq] and [troubleshooting guide][karpenter-troubleshooting].

Comment on lines 17 to 23
## Prerequisites

Ensure the following tools are installed and configured. They're used in the following sections.

- [Azure CLI](/cli/azure/install-azure-cli). To install kubectl by using the [Azure CLI](/cli/azure/install-azure-cli), run the [az aks install-cli](/cli/azure/aks#az-aks-install-cli) command.
- The Kubernetes [kubectl](https://kubernetes.io/docs/reference/kubectl/overview/) tool, a Kubernetes command-line client. This is available with the Azure CLI.
- Confirm you have Node Auto-provisioning enabled on your cluster.

Suggested change

Original:
## Prerequisites
Ensure the following tools are installed and configured. They're used in the following sections.
- [Azure CLI](/cli/azure/install-azure-cli). To install kubectl by using the [Azure CLI](/cli/azure/install-azure-cli), run the [az aks install-cli](/cli/azure/aks#az-aks-install-cli) command.
- The Kubernetes [kubectl](https://kubernetes.io/docs/reference/kubectl/overview/) tool, a Kubernetes command-line client. This is available with the Azure CLI.
- Confirm you have Node Auto-provisioning enabled on your cluster.

Suggested:
## Prerequisites
You must have the following installed and configured before you begin:
- **Azure CLI**: [Install](/cli/azure/install-azure-cli) and configure the Azure CLI. Make sure it’s up to date using [`az upgrade`](/cli/azure/reference-index#az-upgrade).
- **kubectl**: Install the [Kubernetes command-line tool](https://kubernetes.io/docs/reference/kubectl/overview/). You can install it with [`az aks install-cli`](/cli/azure/aks#az-aks-install-cli).
- **Node Auto-Provisioning (NAP)**: Verify that NAP is enabled on your cluster.

This article discusses how to troubleshoot Node auto provisioning(NAP), a managed add-on based on the open source [Karpenter](https://karpenter.sh) project. NAP automatically provisions and manages nodes in response to pending pod pressure, and manages scaling events at the virtual machine, or node level.
When you enable Node Auto-provisioning, you might experience problems that are associated with the configuration of the infrastructure autoscaler. This article will help you troubleshoot errors and resolve common problems that affect NAP but aren't covered in the official Karpenter [FAQ][karpenter-faq] and [troubleshooting guide][karpenter-troubleshooting].

## Prerequisites

Might not be standard for troubleshooting docs, so disregard if it's implicit. The setup doesn't give instructions on connecting to the cluster or anything else, e.g. `az aks get-credentials`.

Contributor Author

Added a link to the main NAP doc in the set-up section here

```

**Understanding conflist files**:
- `10-azure.conflist`: Standard Azure CNI configuration for traditional networking with node subnet

Might want to clarify that it's not just node subnet but all non-overlay CNIs.

- Pods without proper tolerations
- DaemonSets preventing drain
- Pod disruption budgets(PDBs) are not properly set
- Nodes are marked with `do-not-disrupt` annotation

Locks as well, maybe? Although not part of the common lifecycle, it's something customers set explicitly.

Contributor Author

Updated to reference locks also

Comment on lines +71 to +75
1. **Test basic connectivity**:
```azurecli-interactive
# From within a pod
kubectl exec -it <pod-name> -- ping <target-ip>
kubectl exec -it <pod-name> -- nslookup kubernetes.default

https://github.com/bloomberg/goldpinger is what I used to test node-to-node and pod-to-pod connectivity.

Contributor Author

@wdarko1 wdarko1 Oct 13, 2025

Added line 88 referencing goldpinger tool as an option

Comment on lines 135 to 139
**Common Causes**:
- Network security group rules
- Incorrect subnet configuration
- CNI plugin issues
- DNS resolution problems

Do we have docs for each of these? NSG Rules in particular seems to be a common way people shoot themselves in the foot

Contributor Author

Docs added in the solution section for CoreDNS, NSG Rules (not NAP specific), and AKSNodeClass docs

- DNS resolution problems

**Solutions**:
- Review NSG rules for required traffic

This is a broad statement, what NSG rules? What port ranges are required for different networking modes?


**Solutions**:
- Review NSG rules for required traffic
- Verify subnet configuration in AKSNodeClass

maybe link to the docs here for AKSNodeClass subnet configuration.

Comment on lines 245 to 258
### Quota Exceeded

**Symptoms**: VM creation fails with quota exceeded errors.

**Debugging Steps**:

1. **Check current quota usage**:
```azurecli-interactive
az vm list-usage --location <region> --query "[?currentValue >= limit]"
```

**Solutions**:
- Request quota increases through Azure portal
- Use different VM sizes with available quota
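
To make the quota check above easier to act on, the following is a minimal sketch that flags quota entries at or above a chosen percentage of their limit, not only those already exhausted. It assumes `az vm list-usage` exposes `localName`, `currentValue`, and `limit` fields; the helper name `near_quota` and the 80 percent threshold are hypothetical choices for illustration.

```shell
# near_quota: hypothetical helper (not part of the Azure CLI).
# Reads tab-separated "name<TAB>current<TAB>limit" lines on stdin and prints
# the entries whose usage is at or above the given percentage (default 80).
near_quota() {
  awk -F'\t' -v pct="${1:-80}" '
    $3 > 0 && ($2 / $3) * 100 >= pct { printf "%s\t%s/%s\n", $1, $2, $3 }
  '
}

# Assumed usage against a real subscription (requires a logged-in Azure CLI):
#   az vm list-usage --location <region> \
#     --query "[].[localName, currentValue, limit]" -o tsv | near_quota 80
```

Doing the comparison in `awk` keeps the threshold adjustable and avoids relying on how the query language compares the two values.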

@Bryce-Soghigian Bryce-Soghigian Oct 6, 2025

What error do they hit on the NAP side? How is this surfaced to the customer? Let's include the event it returns in the kube-system namespace, and maybe link to a mitigation guide in the error message.

This should be a rare case for NAP unless they are purposefully constraining their NodePools?

  • Use different VM sizes with available quota

Rather than just leaving a one-line statement, maybe provide the actual changes they need to make to their NodePool, and explain how NAP is supposed to solve this if configured with many sizes.

Contributor Author

Updated this line to reference the NodePool doc and add a brief explanation.

By the way, a customer possibly hit this issue recently, where their NodePool scope was too limited.
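
As an illustration of the NodePool change discussed in this thread, the sketch below writes a NodePool manifest that allows several VM SKU families, so NAP has alternatives when one family is out of quota. The file name, the NodePool name, and the chosen families are hypothetical; the `karpenter.sh/v1` API version and an AKSNodeClass named `default` are assumptions to check against what your cluster actually serves.

```shell
# Write a hypothetical NodePool manifest that permits several SKU families.
# Assumptions: the cluster serves karpenter.sh/v1 and has an AKSNodeClass
# named "default". Adjust both if your cluster differs.
cat > nodepool-broad-skus.yaml <<'EOF'
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: broad-skus
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.azure.com
        kind: AKSNodeClass
        name: default
      requirements:
        - key: kubernetes.io/os
          operator: In
          values: ["linux"]
        - key: karpenter.azure.com/sku-family
          operator: In
          values: ["D", "E", "F"]
EOF

# Review the file, then apply it manually, for example:
#   kubectl apply -f nodepool-broad-skus.yaml
```

Listing more than one family widens the set of sizes NAP can pick from, which is the likely remedy when a NodePool scoped to a single family runs into a per-family quota limit.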

Updated troubleshooting documentation for Azure Kubernetes node auto-provisioning. Added information on locks and network security group rules.
@learn-build-service-prod

Learn Build status updates of commit ab2113d:

✅ Validation status: passed

File Status Preview URL Details
support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md ✅Succeeded
support/azure/azure-kubernetes/toc.yml ✅Succeeded

For more details, please refer to the build report.

@learn-build-service-prod

Learn Build status updates of commit 53dc8cb:

✅ Validation status: passed

File Status Preview URL Details
support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md ✅Succeeded
support/azure/azure-kubernetes/toc.yml ✅Succeeded

For more details, please refer to the build report.

Expanded NodePool CRD to allow for more VM sizes to prevent quota errors during VM creation.
@learn-build-service-prod

Learn Build status updates of commit d430ebf:

✅ Validation status: passed

File Status Preview URL Details
support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md ✅Succeeded
support/azure/azure-kubernetes/toc.yml ✅Succeeded

For more details, please refer to the build report.

@learn-build-service-prod

Learn Build status updates of commit 97673cd:

✅ Validation status: passed

File Status Preview URL Details
support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md ✅Succeeded
support/azure/azure-kubernetes/toc.yml ✅Succeeded

For more details, please refer to the build report.

@learn-build-service-prod

Learn Build status updates of commit 12b02c9:

✅ Validation status: passed

File Status Preview URL Details
support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md ✅Succeeded
support/azure/azure-kubernetes/toc.yml ✅Succeeded

For more details, please refer to the build report.

@wdarko1 wdarko1 requested a review from chasewilson October 31, 2025 03:42
```azurecli-interactive
kubectl get pods -n kube-system | grep -E "azure-cni|kube-proxy"
```
3. **If using azure cni with overlay or cilium**

I don't think the guidance below is cilium specific.

- Review lock configurations


## Networking Issues

We have default node-level network metrics if the customer doesn't have ACNS enabled. If ACNS is enabled, they can look at pod-level metrics and even drill down to things like FQDN metrics for troubleshooting.

Contributor Author

Added a note referencing the two options with links

- Check that Karpenter nodes can reach the service subnet
- Restart CoreDNS pods if they're in error state: `kubectl rollout restart deployment/coredns -n kube-system`
- Verify NSG rules allow traffic on port 53 (TCP/UDP)

We could also recommend VNV tooling to validate outbound connectivity.

Contributor Author

Updated with a note recommending the VNV tool with links

@learn-build-service-prod

Learn Build status updates of commit eb2c009:

✅ Validation status: passed

File Status Preview URL Details
support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md ✅Succeeded
support/azure/azure-kubernetes/toc.yml ✅Succeeded

For more details, please refer to the build report.

@learn-build-service-prod

Learn Build status updates of commit ef2505b:

✅ Validation status: passed

File Status Preview URL Details
support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md ✅Succeeded
support/azure/azure-kubernetes/toc.yml ✅Succeeded

For more details, please refer to the build report.

…to-provision.md

Co-authored-by: Chase Wilson <31453523+chasewilson@users.noreply.github.com>
@wdarko1 (Contributor Author) commented Nov 7, 2025

#sign-off

@prmerger-automator

Invalid command: '#sign-off'. Only the assigned author of one or more file in this PR can sign off. @JarrettRenshaw

@learn-build-service-prod

Learn Build status updates of commit 35512ea:

✅ Validation status: passed

File Status Preview URL Details
support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md ✅Succeeded
support/azure/azure-kubernetes/toc.yml ✅Succeeded

For more details, please refer to the build report.

4 participants