NAP TSG Add #1947
base: main
Add NAP troubleshooting + FAQ doc
Add NAP troubleshoot doc to TOC
@wdarko1 : Thanks for your contribution! The author(s) and reviewer(s) have been notified to review your proposed change.
Learn Build status updates of commit 71a044d: ❌ Validation status: errors. Please follow instructions here which may help to resolve issue. For more details, please refer to the build report. Note: Your PR may contain errors or warnings or suggestions unrelated to the files you changed. This happens when external dependencies like GitHub alias, Microsoft alias, cross repo links are updated. Please use these instructions to resolve them.
Learn Build status updates of commit 34df0e0: ✅ Validation status: passed
For more details, please refer to the build report.
| 1. **Check node utilization**:
| ```azurecli-interactive
| kubectl top nodes
| kubectl describe node <node-name>
Should we mention aks-node-viewer as well? https://github.com/Azure/aks-node-viewer
For sure, just added a reference on line 38 for aks node viewer
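As an illustration of the node utilization check in the hunk above, here is a minimal Python sketch that parses the text printed by `kubectl top nodes` and lists nodes at or above a utilization threshold. The function names, the 80% threshold, and the sample node names and values are hypothetical, and the sketch assumes the usual five-column layout (name, CPU, CPU percent, memory, memory percent); it is not part of the PR.

```python
def parse_top_nodes(text):
    """Parse `kubectl top nodes` text output into dicts with integer percentages."""
    rows = []
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    for line in lines[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 5:
            continue
        name, _cpu, cpu_pct, _mem, mem_pct = fields[:5]
        # Rows without percentages (metrics not yet available) are skipped.
        if not (cpu_pct.endswith("%") and mem_pct.endswith("%")):
            continue
        rows.append({
            "name": name,
            "cpu_pct": int(cpu_pct[:-1]),
            "mem_pct": int(mem_pct[:-1]),
        })
    return rows


def busy_nodes(text, threshold=80):
    """Return names of nodes whose CPU or memory percentage meets the threshold."""
    return [
        row["name"]
        for row in parse_top_nodes(text)
        if max(row["cpu_pct"], row["mem_pct"]) >= threshold
    ]


# Made-up sample output, for illustration only.
SAMPLE = """NAME                   CPU(cores)   CPU%   MEMORY(bytes)   MEMORY%
aks-default-00000001   250m         13%    2100Mi          45%
aks-default-00000002   1700m        89%    5200Mi          91%
"""

if __name__ == "__main__":
    print(busy_nodes(SAMPLE))
```

In practice the text would come from running `kubectl top nodes` and capturing its standard output; the nodes this flags are the ones worth inspecting with `kubectl describe node <node-name>`.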
Added information about AKS Node Viewer for visualizing node usage.
Learn Build status updates of commit 5ab640c: ✅ Validation status: passed
For more details, please refer to the build report.
| # Troubleshoot node auto provisioning (NAP) in Azure Kubernetes Service (AKS)
|
| This article discusses how to troubleshoot Node auto provisioning(NAP), a managed add-on based on the open source [Karpenter](https://karpenter.sh) project. NAP automatically provisions and manages nodes in response to pending pod pressure, and manages scaling events at the virtual machine, or node level.
| When you enable Node Auto-provisioning, you might experience problems that are associated with the configuration of the infrastructure autoscaler. This article will help you troubleshoot errors and resolve common problems that affect NAP but aren't covered in the official Karpenter [FAQ][karpenter-faq] and [troubleshooting guide][karpenter-troubleshooting].
- When you enable Node Auto-provisioning, you might experience problems that are associated with the configuration of the infrastructure autoscaler. This article will help you troubleshoot errors and resolve common problems that affect NAP but aren't covered in the official Karpenter [FAQ][karpenter-faq] and [troubleshooting guide][karpenter-troubleshooting].
+ When you enable Node Auto-provisioning, you might experience problems associated with the configuration of the infrastructure autoscaler. This article helps you troubleshoot errors and resolve common problems that affect NAP but aren't covered in the official Karpenter [FAQ][karpenter-faq] and [troubleshooting guide][karpenter-troubleshooting].
| ## Prerequisites
|
| Ensure the following tools are installed and configured. They're used in the following sections.
|
| - [Azure CLI](/cli/azure/install-azure-cli). To install kubectl by using the [Azure CLI](/cli/azure/install-azure-cli), run the [az aks install-cli](/cli/azure/aks#az-aks-install-cli) command.
| - The Kubernetes [kubectl](https://kubernetes.io/docs/reference/kubectl/overview/) tool, a Kubernetes command-line client. This is available with the Azure CLI.
| - Confirm you have Node Auto-provisioning enabled on your cluster.
- ## Prerequisites
- Ensure the following tools are installed and configured. They're used in the following sections.
- - [Azure CLI](/cli/azure/install-azure-cli). To install kubectl by using the [Azure CLI](/cli/azure/install-azure-cli), run the [az aks install-cli](/cli/azure/aks#az-aks-install-cli) command.
- - The Kubernetes [kubectl](https://kubernetes.io/docs/reference/kubectl/overview/) tool, a Kubernetes command-line client. This is available with the Azure CLI.
- - Confirm you have Node Auto-provisioning enabled on your cluster.
+ ## Prerequisites
+ You must have the following installed and configured before you begin:
+ - **Azure CLI**: [Install](/cli/azure/install-azure-cli) and configure the Azure CLI. Make sure it’s up to date using [`az upgrade`](/cli/azure/reference-index#az-upgrade).
+ - **kubectl**: Install the [Kubernetes command-line tool](https://kubernetes.io/docs/reference/kubectl/overview/). You can install it with [`az aks install-cli`](/cli/azure/aks#az-aks-install-cli).
+ - **Node Auto-Provisioning (NAP)**: Verify that NAP is enabled on your cluster.
| This article discusses how to troubleshoot Node auto provisioning(NAP), a managed add-on based on the open source [Karpenter](https://karpenter.sh) project. NAP automatically provisions and manages nodes in response to pending pod pressure, and manages scaling events at the virtual machine, or node level.
| When you enable Node Auto-provisioning, you might experience problems that are associated with the configuration of the infrastructure autoscaler. This article will help you troubleshoot errors and resolve common problems that affect NAP but aren't covered in the official Karpenter [FAQ][karpenter-faq] and [troubleshooting guide][karpenter-troubleshooting].
|
| ## Prerequisites
Might not be standard for troubleshooting docs, so disregard if it's implicit. The setup doesn't give instructions on connecting to the cluster or anything else, e.g. az aks get-credentials.
Added a link to the main NAP doc in the set-up section here
| ```
|
| **Understanding conflist files**:
| - `10-azure.conflist`: Standard Azure CNI configuration for traditional networking with node subnet
Might want to clarify that it's not just node subnet but all non-overlay CNIs.
| - Pods without proper tolerations
| - DaemonSets preventing drain
| - Pod disruption budgets(PDBs) are not properly set
| - Nodes are marked with `do-not-disrupt` annotation
Locks as well, maybe? Although not part of the common lifecycle, it's something customers set explicitly.
Updated to reference locks also
| 1. **Test basic connectivity**:
| ```azurecli-interactive
| # From within a pod
| kubectl exec -it <pod-name> -- ping <target-ip>
| kubectl exec -it <pod-name> -- nslookup kubernetes.default
https://github.com/bloomberg/goldpinger is what I used to test node-to-node and pod-to-pod connectivity.
Added line 88 referencing goldpinger tool as an option
| **Common Causes**:
| - Network security group rules
| - Incorrect subnet configuration
| - CNI plugin issues
| - DNS resolution problems
Do we have docs for each of these? NSG Rules in particular seems to be a common way people shoot themselves in the foot
Docs added in the solution section for CoreDNS, NSG Rules (not NAP specific), and AKSNodeClass docs
| - DNS resolution problems
|
| **Solutions**:
| - Review NSG rules for required traffic
This is a broad statement. Which NSG rules? What port ranges are required for different networking modes?
|
| **Solutions**:
| - Review NSG rules for required traffic
| - Verify subnet configuration in AKSNodeClass
Maybe link to the docs here for AKSNodeClass subnet configuration.
| ### Quota Exceeded
|
| **Symptoms**: VM creation fails with quota exceeded errors.
|
| **Debugging Steps**:
|
| 1. **Check current quota usage**:
| ```azurecli-interactive
| az vm list-usage --location <region> --query "[?currentValue >= limit]"
| ```
|
| **Solutions**:
| - Request quota increases through Azure portal
| - Use different VM sizes with available quota
What is the error they hit on the NAP side? How is this surfaced to the customer? Let's include the event it returns in the kube-system namespace, and maybe link to a mitigation guide in the error message.
This should be a rare case for NAP unless they are purposefully constraining their nodepools?
- Use different VM sizes with available quota
Rather than just leaving a one-line statement, maybe provide the actual changes they need to make to their nodepool, and explain how NAP is supposed to solve this if configured with many sizes.
Updated this line to reference the NodePool doc and a brief explanation.
btw A customer possibly hit this issue recently, where their nodepool scope was too limited.
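To make the quota check from the hunk above concrete, the following Python sketch applies the same `currentValue >= limit` comparison to JSON shaped like the output of `az vm list-usage`. Only the `currentValue` and `limit` field names come from the query in the PR; the `localName` field, the function name, and the sample entries are assumptions for illustration. The sketch converts both values to numbers before comparing, as a defensive step in case they are serialized as strings.

```python
import json


def exhausted_quotas(usage_json):
    """Return names of usage entries whose current value has reached the limit."""
    hits = []
    for entry in json.loads(usage_json):
        # Coerce defensively: values may be numbers or numeric strings (assumption).
        current = float(entry["currentValue"])
        limit = float(entry["limit"])
        if current >= limit:
            # "localName" is an assumed field name; fall back to a placeholder.
            hits.append(entry.get("localName", "<unnamed>"))
    return hits


# Made-up sample data, for illustration only.
SAMPLE = json.dumps([
    {"currentValue": "10", "limit": "10", "localName": "Standard DSv3 Family vCPUs"},
    {"currentValue": 4, "limit": 100, "localName": "Total Regional vCPUs"},
])

if __name__ == "__main__":
    print(exhausted_quotas(SAMPLE))
```

Any family this reports is one where a NodePool restricted to those VM sizes could fail to create nodes, which is the situation the review comment above asks the article to explain.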
Updated troubleshooting documentation for Azure Kubernetes node auto-provisioning. Added information on locks and network security group rules.
Learn Build status updates of commit ab2113d: ✅ Validation status: passed
For more details, please refer to the build report.
Learn Build status updates of commit 53dc8cb: ✅ Validation status: passed
For more details, please refer to the build report.
Expanded NodePool CRD to allow for more VM sizes to prevent quota errors during VM creation.
Learn Build status updates of commit d430ebf: ✅ Validation status: passed
For more details, please refer to the build report.
Learn Build status updates of commit 97673cd: ✅ Validation status: passed
For more details, please refer to the build report.
Learn Build status updates of commit 12b02c9: ✅ Validation status: passed
For more details, please refer to the build report.
| ```azurecli-interactive
| kubectl get pods -n kube-system | grep -E "azure-cni|kube-proxy"
| ```
| 3. **If using azure cni with overlay or cilium**
I don't think the guidance below is cilium specific.
| - Review lock configurations
|
| ## Networking Issues
We have default node-level network metrics if the customer doesn't have ACNS enabled. If ACNS is enabled, they can look at pod-level metrics and even get down to things like FQDN metrics for troubleshooting.
Added a note referencing the two options with links
| - Check that Karpenter nodes can reach the service subnet
| - Restart CoreDNS pods if they're in error state: `kubectl rollout restart deployment/coredns -n kube-system`
| - Verify NSG rules allow traffic on port 53 (TCP/UDP)
We could also recommend VNV tooling to validate outbound connectivity.
Updated with a note recommending the VNV tool with links
@wdarko1 : Thanks for your contribution! The author(s) and reviewer(s) have been notified to review your proposed change.
Learn Build status updates of commit eb2c009: ✅ Validation status: passed
For more details, please refer to the build report.
support/azure/azure-kubernetes/extensions/troubleshoot-node-auto-provision.md
Learn Build status updates of commit ef2505b: ✅ Validation status: passed
For more details, please refer to the build report.
…to-provision.md Co-authored-by: Chase Wilson <31453523+chasewilson@users.noreply.github.com>
#sign-off
Invalid command: '#sign-off'. Only the assigned author of one or more file in this PR can sign off. @JarrettRenshaw
Learn Build status updates of commit 35512ea: ✅ Validation status: passed
For more details, please refer to the build report.
Pull request guidance
Thank you for submitting your contribution to our support content! Our team works closely with subject matter experts in CSS and PMs in the product group to review all content requests to ensure technical accuracy and the best customer experience. This process can sometimes take one or more days, so we greatly appreciate your patience.
We also need your help in order to process your request as soon as possible:
We won't act on your pull request (PR) until you type "#sign-off" in a new comment in your pull request (PR) to indicate that your changes are complete.
After you sign off in your PR, the article will be tech reviewed by the PM or SME if it has more than minor changes. Once the article is approved, it will undergo a final editing pass before being merged.